A team was formed to address the recent issues in EUW. The team included platform engineers, a live producer, a network engineer and me. Sigh.....
All of a sudden, the majority of the frontend systems that players connect to timed out and dropped players from the platform, causing a login queue. There were other side effects as well, such as game starts slowing down significantly.
Why does this problem occur?
One or more nodes in the memory caching cluster encounter an “out of memory” error on their Java heap. For reasons we are still pinning down, this freezes up the whole cluster for a couple of minutes, causing a cascading failure.
The “nitty-gritty” details:
The frontend servers, backend services and databases communicate through an intermediary memory caching layer. This is a fairly standard design. What makes us unique is that we are probably the only shop that runs a larger-than-normal number of nodes in that cluster design. For some reason, the node that runs out of memory was holding on to a ton of data instead of sending it over the network.
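For anyone curious what that intermediary layer looks like, here is a minimal cache-aside sketch in Java. This is not our platform code or the vendor's API; the class, method and key names are made up purely for illustration.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Minimal cache-aside sketch: frontends ask the memory cache first and only
    // fall back to the database on a miss. CacheAsideSketch, loadFromDatabase and
    // the String key/value types are placeholders, not the real platform code.
    public class CacheAsideSketch {
        // Stand-in for one node of the distributed memory cache.
        private final Map<String, String> cacheNode = new ConcurrentHashMap<>();

        public String getPlayerRecord(String playerId) {
            // 1) Ask the memory caching layer first.
            String cached = cacheNode.get(playerId);
            if (cached != null) {
                return cached;
            }
            // 2) On a miss, fall back to the database and repopulate the cache.
            String fromDb = loadFromDatabase(playerId);
            cacheNode.put(playerId, fromDb);
            return fromDb;
        }

        // Placeholder for the real database call.
        private String loadFromDatabase(String playerId) {
            return "record-for-" + playerId;
        }
    }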
We have an automated process that restarts nodes that run out of memory. We have been doing this for a while without issue; however, in the past few days the restart of a node has failed to happen cleanly, and the node ends up joining the cluster in a bad state that causes it to request a ton of data, creating further issues.
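The kind of check such a watchdog performs looks roughly like the sketch below. The real process and its thresholds are not something I can share here, so the 90% threshold and the 10-second polling interval are assumed values; the JMX calls are just one standard way to read heap usage from a running JVM.

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import java.lang.management.MemoryUsage;

    // Rough sketch of a heap watchdog. Threshold and polling interval are assumed,
    // and the real restart logic is only hinted at in the comment below.
    public class HeapWatchdogSketch {
        public static void main(String[] args) throws InterruptedException {
            MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
            while (true) {
                MemoryUsage heap = memory.getHeapMemoryUsage();
                double usedFraction = (double) heap.getUsed() / heap.getMax();
                if (usedFraction > 0.90) {
                    // In the real setup, this is where a node would be flagged
                    // for an automated restart.
                    System.out.println("Heap above threshold: " + usedFraction);
                }
                Thread.sleep(10_000); // poll every 10 seconds (assumed interval)
            }
        }
    }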
Nodes leaving and joining the cluster is a fairly normal operation, so why was it causing issues?
We reached out to the vendor of the memory caching software. It turns out we are running into a bug that might be fixed in a newer version. We have a few leads on what is causing this bug and we are verifying them.
In the meantime, we are doing the following:
1) Disabling the auto-restart of nodes that run out of memory, so they can no longer rejoin the cluster in a bad state.
2) Load testing the newer version of the software.
3) Moving End of Game Stats onto its own queue to enable players to get back into game faster (a rough sketch of what this means follows below).
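Here is what point 3 means in practice: End of Game Stats messages no longer share a queue with game-start traffic, so a backlog of stats cannot delay players getting into their next game. The class and message names below are illustrative, not the actual platform code.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Sketch of splitting End of Game Stats onto a dedicated queue so that a
    // slow stats backlog cannot block game-start messages.
    public class QueueSeparationSketch {
        private final BlockingQueue<String> gameStartQueue = new LinkedBlockingQueue<>();
        private final BlockingQueue<String> endOfGameStatsQueue = new LinkedBlockingQueue<>();

        public void enqueue(String message) throws InterruptedException {
            if (message.startsWith("EOG_STATS")) {
                endOfGameStatsQueue.put(message); // slow path, can back up safely
            } else {
                gameStartQueue.put(message);      // fast path, stays responsive
            }
        }
    }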
1) “Add more servers Rito don't be stingy!”
Adding more servers can, and probably will, exacerbate the problem. We added servers on March 6th in anticipation of EUW's growth while Amsterdam gets ready. There is a theory that the new systems might actually be contributing to the problem, and we may consider shutting them down, since they were added to handle future growth for the month of March.
2) Why is this taking so long?
It does take a while to analyze heap dumps and stack traces and tie them to the appropriate root cause, especially when the software is provided by a third party and is not something written in house.
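As an example of the tooling involved, a heap dump can be captured from a running JVM with a few lines like the ones below. We may equally be using jmap or vendor tooling for this; the output path here is just an example.

    import com.sun.management.HotSpotDiagnosticMXBean;
    import java.lang.management.ManagementFactory;

    // Sketch of capturing a heap dump from inside a running JVM for offline
    // analysis. The dump file path is an example, not a real location.
    public class HeapDumpSketch {
        public static void main(String[] args) throws Exception {
            HotSpotDiagnosticMXBean diagnostics =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            // 'true' dumps only live (reachable) objects, which keeps the file smaller.
            diagnostics.dumpHeap("/tmp/cache-node-heap.hprof", true);
        }
    }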
3) If you are the shop running the largest number of nodes of this caching software, is that good or bad?
It's neither good nor bad. It just works; what it probably means is that we need to revisit some of the configuration values for the system. We ran load tests with these configurations before going live. This problem in particular seems to hit a hard failure somewhere in the 48-72 hour range, with an occasional blip in between.
4) If the node is not able to send data over the network, then the network is bad!
We looked into this. The network engineers and the expert from the company that provides the memory caching layer ruled it out: since the systems in the cluster are on the same network, there are no routing problems for the nodes in this cluster.
5) But why does ONLY EUW suffer?
We have seen this problem show up in Vietnam, NA and Latin America; however, the frequency is higher in Vietnam and EUW.
Edit: We are working on the issue that is happening right now.