Hi everyone,
As you likely know already, we've been experiencing serious server issues since we launched our stress test. The issues are showing up not in our own server code, but within the SpatialOS runtime.
To give you some context, the current game world is simulated by several different instances of Unity (called "workers"), each controlling a portion of the map. The runtime is the process that stores data and allows for communication between workers, both servers and clients. In short, it's the process that stitches everything together.
To put it in very non-technical terms, after the deployment has been running for some time, the runtime just goes nuts. A bottleneck forms inside it that causes a long queue for one type of message (called "component updates"). You cut a tree, the logs stay piled for 30 seconds, then fall? That's because there was a 30-second queue for the logs' position updates. The more players are connected, the faster it happens, but even if we cap the number of concurrent users at a low value (250, as we did last night), it eventually happens anyway.
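For the technically curious, here is a toy model of that kind of backlog. It is only a sketch and has nothing to do with the actual runtime code: the function name and every number in it are made up for illustration. It assumes updates arrive at a fixed rate, are processed at a fixed (slightly lower) rate, and are handled in order, so the wait for a new update is the backlog divided by the processing rate.

```python
def seconds_until_delay(arrivals_per_sec, processed_per_sec,
                        target_delay, max_seconds=86_400):
    """Return how many seconds pass before a newly queued update
    would wait at least `target_delay` seconds, or None if the
    queue never gets that long within `max_seconds`.

    Hypothetical model: fixed rates, first-in first-out queue.
    """
    backlog = 0.0  # updates waiting in the queue
    for t in range(1, max_seconds + 1):
        # Each second, new updates arrive and some are processed;
        # the backlog can never drop below zero.
        backlog = max(0.0, backlog + arrivals_per_sec - processed_per_sec)
        # A new update waits behind the whole backlog.
        if backlog / processed_per_sec >= target_delay:
            return t
    return None


# Illustrative numbers only: the queue drains 1000 updates per second.
slow_growth = seconds_until_delay(1050, 1000, target_delay=30)
fast_growth = seconds_until_delay(1100, 1000, target_delay=30)
no_growth = seconds_until_delay(900, 1000, target_delay=30)
print(slow_growth, fast_growth, no_growth)
```

In this model, a queue that receives only slightly more than it can process still reaches a 30-second delay eventually, and it gets there sooner when the incoming rate is higher. That matches what we see: fewer players postpone the problem without removing it.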
So, what is causing it? We have no idea (the runtime is out of our control), and so far neither do the Improbable engineers. It might be a bug in the runtime or, more likely, something we're doing wrong with it without realizing. We've already tried two different fixes, neither of which made a real difference. Today we're hard at work on another, larger server patch; we'll let you know when the servers are back online.
We're so sorry for the mess. Yes, this is what a stress test is for, but we did expect it to go a lot better. The only good metric we've seen so far is that server workers are never overloaded, even when a lot of players are fighting monsters and such. That is important, but irrelevant while the runtime is failing.
A word of warning: we'll likely have to wipe the game world before dropping the patch.
Thank you for your patience.
Best,
Jacopo