Stress Test Server Issues


  • DymStudios - CEO

    Hi everyone,

    as you likely know already, we've been experiencing serious server issues since we launched our stress test. The issues are not on our side, but within the SpatialOS runtime.

    To give you some context, the current game world is simulated by several different instances of Unity (called "workers"), each controlling a portion of the map. The runtime is the process that stores data and allows for communication between workers, both servers and clients. In short, it's the process that stitches everything together.

    To put it in very non-technical terms, after the deployment has been running for some time, the runtime just goes nuts. A bottleneck forms within it that causes a long queue for one type of message (called "component updates"). You cut a tree, the logs stay piled for 30 seconds, then fall? That's because there was a 30-second queue for position updates for the logs. If there are more players connected it happens faster, but even if we cap the number of concurrent users to a low value (like last night - 250), it eventually happens anyway.
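A minimal sketch of why such a queue produces a growing delay, assuming a plain FIFO queue with constant rates. The function name and all numbers are hypothetical and chosen only for illustration; this is not how the SpatialOS runtime is known to work internally.

```python
def queue_delay_seconds(arrival_rate, service_rate, elapsed_seconds):
    """Approximate wait for a message queued after elapsed_seconds.

    Rates are in messages per second. Assumes a FIFO queue with constant
    arrival and service rates (an illustrative model, not the real runtime).
    """
    # Messages pile up only when they arrive faster than they are processed.
    backlog = max(0.0, (arrival_rate - service_rate) * elapsed_seconds)
    # A new message waits for the whole backlog ahead of it to drain.
    return backlog / service_rate


# Hypothetical numbers: 10,500 updates/s arrive, 10,000 updates/s are processed.
delay = queue_delay_seconds(10_500, 10_000, 600)
print(f"delay after 10 minutes: {delay:.0f} s")  # delay after 10 minutes: 30 s
```

Under these assumptions, a 5% shortfall in processing rate is enough to build a 30-second delay within ten minutes, and a higher arrival rate (more players) reaches the same delay sooner.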

    So, what is causing it? We have no idea (the runtime is out of our control) and neither do the Improbable engineers so far. It might be a bug in the runtime or, more likely, we're doing something wrong with it that we're not aware of. We've already tried two different fixes, which didn't make a real difference. Today we're hard at work on another, larger server patch; we'll let you know when servers are back online.

    We're so sorry for the mess. Yes, this is what a stress test is about, but we did expect it to be a lot better. The only good metric we've had so far is that server workers are never overloaded, even if there are a lot of players fighting monsters and such. This is important, but with the runtime failing, irrelevant.

    A word of warning: we'll likely have to wipe the game world before dropping the patch.

    Thank you for your patience.

    Best,
    Jacopo


  • TF#4 - EMISSARY

    Cheers for the update. All part of the development process 🙂


  • TF#1 - WHISPERER

    Well I never had such a problem with spatial os. You are probably doing something wrong. Debugging such a problem is a pain in the butt. I wish you good luck and patience 🙂


  • DymStudios - CEO

    @yvz5 said in Stress Test Server Issues:

    Well I never had such a problem with spatial os. You are probably doing something wrong. Debugging such a problem is a pain in the butt. I wish you good luck and patience 🙂

    Did you ever have 300+ players connected in a 60km2 world with both high entity count and high entity density (500,000+ resource nodes and 25,000+ creatures in total, which means 8000+ resources and 400+ creatures per km2), or something comparable to that? If so, PM me! 🙂
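The per-km2 densities quoted above follow directly from the totals. A quick check of the arithmetic, using the lower bounds stated in the post (the variable names are illustrative):

```python
WORLD_AREA_KM2 = 60
RESOURCE_NODES = 500_000  # "500,000+" in the post, lower bound used here
CREATURES = 25_000        # "25,000+" in the post, lower bound used here

resources_per_km2 = RESOURCE_NODES / WORLD_AREA_KM2
creatures_per_km2 = CREATURES / WORLD_AREA_KM2

print(round(resources_per_km2), round(creatures_per_km2))  # 8333 417
```

Both results are consistent with the "8000+ resources and 400+ creatures per km2" figures.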



  • @Prometheus

    What do the worker metrics say? How is your bridge-to-engine latency?


  • DymStudios - CEO

    @Aristeaus said in Stress Test Server Issues:

    @Prometheus

    What do the worker metrics say? How is your bridge-to-engine latency?

    Oh it's nice to see other devs using spatial in the Fractured community! 🙂

    Runtime to server worker latency is always low and stable (30ms), before and after the issue takes place. There is nothing anomalous in the metrics: not in latency, nor in the number of commands, component updates, CPU and RAM usage, etc. No workers are being killed and recreated. Worker metrics are great: all server workers are stable at their intended framerate (20 FPS), with a large margin to take more pressure.

    The only anomalous metric is the runtime's memory usage, which jumps up and eats all the RAM of the deployment about 20-30 minutes after the issue starts manifesting. That happens when the deployment is already FUBAR though, likely because of the crazy amount of messages queued up.


  • TF#1 - WHISPERER

    Could be a GC problem. The message queue shouldn't be a bottleneck if the message length is not enormous. That could be the place to look to solve the issue.


  • TF#1 - WHISPERER

    Don't sweat it too much. It's Alpha, and that's what Alphas are for. I would however like to add two things... Keep in mind that people logging in are doing so on their own time. A reward for taking their time to identify issues like this wouldn't be a bad idea... It would also be beneficial to open up a free Alpha weekend again once everything is sorted out.

    GL on the bug identification and I look forward to playing in the future!

    Cheers,

    Blitz



  • @Prometheus said in Stress Test Server Issues:

    @yvz5 said in Stress Test Server Issues:

    Well I never had such a problem with spatial os. You are probably doing something wrong. Debugging such a problem is a pain in the butt. I wish you good luck and patience 🙂

    Did you ever have 300+ players connected in a 60km2 world with both high entity count and high entity density (500,000+ resource nodes and 25,000+ creatures in total, which means 8000+ resources and 400+ creatures per km2), or something comparable to that? If so, PM me! 🙂

    Darkfall Online had 7-10k players in a 3d open world without instances, with actual projectile/other physics etc, 10 years ago.

    Of course it used a completely custom engine instead of the severely-limited Unity and whatever spatialOS' advertising claims to do.

    If a devteam actually wants to make a proper unique game, they would make a custom engine like the devs of Darkfall or Starbase have done.

    Also I saw a post somewhere saying that all of Fractured's spells are just D&D spells copypasted over instead of making your own, so are you going to replace them with original spells later or do anything original at all?


  • TF#10 - CONSUL

    Thanks for the update! Keep up the good work.


  • TF#10 - CONSUL

    Thanks for the update and explanation!
    Good luck on tracking the issue down, that can be such a pain in the rear.


  • TF#3 - ENVOY

    @Prometheus Thanks for the updates. You and your team are doing great; loving the game so far. It's alpha, so issues are expected along the way, and it will only get better from here. I'd definitely recommend another stress test soon after this weekend is done, just to make sure you've sorted out the current problem and the servers can handle increased volume. Have a great day and happy testing, all.


  • TF#1 - WHISPERER

    @yvz5 @Prometheus That is the first thing I thought of: how big are the messages or packets being transferred? There shouldn't be any bottleneck issues if the packets are small enough and are pushed right through without any kind of cache buildup. Does the game push packets for the entire chunk at once, or are they pushed through per client action? That is, are the actions of players in a certain part of the map all compiled into one packet and then pushed through, or is each action of each player pushed through on its own? In many cases, 100k tiny packets will process 10x faster than 10k larger packets.
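One way to reason about the per-action versus per-chunk question is to count bytes on the wire. This is a rough sketch only: the function, the 16-byte payload, and the 40-byte per-packet header are hypothetical values, and nothing here reflects how Fractured or SpatialOS actually packages component updates.

```python
import math


def wire_bytes(num_updates, payload_bytes, header_bytes, updates_per_packet):
    """Total bytes sent when updates are grouped into fixed-size batches."""
    packets = math.ceil(num_updates / updates_per_packet)
    # Every packet pays the header once; every update pays its own payload.
    return packets * header_bytes + num_updates * payload_bytes


# Hypothetical workload: 100,000 updates of 16 bytes, 40-byte packet header.
unbatched = wire_bytes(100_000, 16, 40, updates_per_packet=1)
batched = wire_bytes(100_000, 16, 40, updates_per_packet=10)
print(unbatched, batched)  # 5600000 2000000
```

The sketch only counts bytes. Batching lowers header overhead, but each update may wait for its batch to fill, so which approach processes faster depends on where the real bottleneck is.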



  • Extend the open alpha stress test? 🙂


  • TF#2 - MESSENGER

    @BluntedJ said in Stress Test Server Issues:

    Extend the open alpha stress test? 🙂

    Agreed, that would only be fair considering many people, including myself, can't access the game for the 2nd day.



  • @Prometheus

    Thanks for the update and all the hard work, the team seems engaged and cares!


  • TF#10 - CONSUL

    Seems like a big inconvenience. Hope you can troubleshoot it and get it fixed during the weekend. Best of luck.



  • @Prometheus If it's the RAM that's getting all chewed up, then I'd agree with yvz5 that it sounds like stuff isn't being released from memory somewhere like it should. Basically a memory leak of something being created a lot but never properly disposed, which would lead to it piling up in the GC and still sitting in RAM as it's trying to sort through it all to see what's actually still in use or being referenced during runtime. With so many resource nodes and resource items being created, creatures spawning and dying, player-built stuff being created and deleted, it could very easily be a memory leak.


  • TF#10 - CONSUL

    @Prometheus
    I wish I could help more, but I've never worked with SpatialOS. Since I've built many different server-based architectures, though, I would agree it sounds like GC. I suggest looking at the timing of the queued-up messages, checking which processes should have run, and seeing whether they finished running without leaving memory in use.


  • TF#3 - ENVOY

    @BluntedJ

    Since this seems like it could take time to fix, I'd say try to fix it first, then open a new stress test to see if everything works as intended. (and possibly find a new problem to fix 😉 )


 

Copyright © 2020 Dynamight Studios Srl | Fractured