At King, we routinely execute automated tests into one or two devices at the same time. The process is pretty simple: take the devices you want to test against, create a build with the game and the test, press play, and grab a coffee.
This has worked well, but we felt that in order to ensure our games work in the maximum number of devices, we needed a system to execute our tests across as many devices as we want. Less buggy games, more happy users.
Here’s how we developed a Cloud Device solution for King.
Under the Device Cloud, we were essentially pretending to run test jobs in multiple distributed devices. This is similar to running map/reduce jobs in multiple distributed machines. In this sense, we needed to distribute the execution and then aggregate the result. So, we end up taking a look at the architecture of a well-known probe map/reduce framework, the YARN Hadoop architecture:
From our point of view it seems reasonable to start by using the same module structures and a similar assignment of responsibilities to them. Mainly because, as we stated before, the problem at hand has some important analogies.
So, we end up with three main modules:
This module is responsible for:
- Handle Resource Discovery and Assignment, in our case the resources refer to the devices connected to the cluster
- Schedule runs
The main difference with the YARN architecture is than we can create as many instances we need for the cloud.
Each test running in the system is managed by a dedicated Application Manager. So, the main responsibility of this module is to manage an execution of a test (Please note that a test might run on several devices, so an execution of a test is actually spawning several individual tests for each selected device, which are actually executed in different Node Managers.)
- Negotiate Resources with Resource Managers
- Coordinate test execution with Node Managers
Each device needs to be connected to a physical machine. Each of these machines contains a node manager which is responsible for:
- Executing a test in any of the connected devices
- Coordinate the device test execution with the Application Manager
- Gossip the actual available devices to the Resource Manager
Building distributed systems
Before we get to how we built a distributed system, here’s a little background about distributed systems, so you can understand our initial concerns and our thought process.
Our system is basically a bunch of processes talking to each other in order to coordinate themselves. Just like any other cluster of processes. These processes may be in the same machine or may not. In fact, most of the time they are in different machines. So, we needed interprocess communication.
Everyone wants a big, powerful, nice-looking cluster. But, as long as it grows, the more often the cluster begins to fail. In fact, the probability of failure increases with the number of processes. A process may fail because a programming error or it may fail because the machine running the process had a problem, because of network issues, or for whatever reason.
Here’s an example: Imagine we have a machine with a failure rate of three months running one process. That means, it is likely to have a problem within the next three months.
Now, take a look at this table:
|1 machine running one process||1 fail in 3 months|
|100 machines running one process||1 fail every 21,6 hours|
|1000 machines running one process||1 fail every 2,16 hours|
|10000 machines running one process||1 fail every 12,96 minutes|
So, we really need a failure detector in order to be aware of the failing process and take action and also a recovery system in order to restart processes.
Building it: choosing the right tool
So, how did we build our distributed system?
Due to the technical background of the team and time constraints, we wanted to create the cluster with the Java language and use a framework that would speed up our development.
Taking all this into account, there were several framework that we considered:
All of them are capable of being part of a cluster and can be used with Java.
We compared the four candidates under four criteria: interprocess communication, failure detector, recovery system, and programming model, awarding each candidate a score of one to five for each criterion.
Quasar and Akka send messages between actors in a completely transparent way. Akka also ensures FIFO message ordering between two actors, as well as sending messages to a concrete actor in the hierarchy. It also supports eventbus-like communication.
Vertx uses a central EventBus with publish/subscribe semantics.
Norbert needs a client/server approach, so no point-to-point message, not very suitable for our purposes.
We need point-to-point communication as well as eventbus-like communication. That lead us to score:
- 5 points for Akka
- 3 points for Quasar
- 2 points for Vertx
- 1 point for Norbert
Norbert uses Zookeeper.
Akka uses a Gossip-based Membership Protocol.
In all models, we can subscribe to Cluster State and react in consequence. But Akka does not need a third-party software and also provides cluster-aware routers and metrics.
So we scored 5 points for Akka and 4 points for the others.
Quasar uses Hazelcast to keep internal state of actors and provides ‘Migrating Actors’ in case of failure.
Akka can use a number of persistence systems (via plugins) to store state information with Persistent Actors.
Neither Vertx or Norbert offers a recovery system.
Akka has the most flexible approach and easy of use recovery system. This leads us to 5 points for Akka, 4 points for Quasar, and 0 points for the others.
Quasar and Akka are based on a hierarchical actor model.
Vertx uses a actor-like model without hierarchies and Norbert uses Java classes with an embedded Cluster Client.
The actor system paradigm simplifies concurrence issues, so it will speed up our development process. It also fits very well into our architecture, since we can easily map the main application modules: Resource Manager, Application Manager, and Node Manager, to an actor Hierarchy.
So we scored 5 points for Quasar and Akka and 2 point for Vertex and 1 point for Norbert.
Quasar are Akka are very close in our evaluation. However, since Akka is more mature, it broke the deadlock, and we decided to choose Akka.
Building it: first architectural design
Once we chose Akka, we laid out our first architectural layout. In the following Architecture diagram, you will see a mix of a runtime component views with a logical deploy view.
If you recall the modules from the ‘Main Software Modules’ section, you will see in the above diagram that we have a Resource Manager node coordinating with a ApplicationManager who is actually running a Test Execution. The Application Manager itself is coordinating the execution with the Node Manager, and finally the Node Manager is executing the test and sending the device info to the ResourceManager. This is the basic interaction within in the system.
After six months, we ended up with a working proof of concept. That is, we built the core systems and a simple test (without third-party platform integration) that runs with almost no failure in Android and only some failures in iOS. It should be noted that iOS is a more restrictive platform than Android. So more variables might fail when running a test on that platform. The most significant one is the rights restriction and management, based on certificates that iOS imposes in what we can do in order to ensure the successful execution of a test.
Tests for Candy Crush, Bubble Witch 2, and Pyramid have been already launched and successfully executed.
Next steps are to make the system more reliable and robust, as well as add some essential features for game studios. So, we plan to add:
- Facebook, Apple Store, and Google Store integration: So the test can include some parts which interact with these platforms.
- Configuration management: So we can choose which framework to use for executing the test. Right now we only support TestNG test.
- Monitoring System: In order to track the cluster status, both visually and programmatically.
- Audit System: In order to track Test executions and be able to react to failures.
- Eventual Messaging Consistent System: Akka guarantees at most one message delivery semantics. That means, an Actor might receive a message or not. We need to make sure than at some point the Actor receives the message. So we need an Eventual Consistency message model.
- Improve execution engines in the NodeManager, especially for iOS devices.
A note on licences
“…the license allows the user of the software the freedom to use the software for any purpose, to distribute it, to modify it, and to distribute modified versions of the software, under the terms of the license, without concern for royalties.”