
Real-Time Monitoring in the Device Cloud

At King, the Device Cloud serves multiple teams, saving them time when creating and maintaining automated tests on real devices. It provides an infrastructure that supports multiple test engines and offers more than 100 devices to the entire company, with out-of-the-box features such as Automatic App Crash Detection, Screenshots, and Monkey Test.

Because it is an IaaS system, it is critical to detect and act on issues as quickly as possible, both to avoid blocking the delivery pipelines of the various development teams and to facilitate post-mortem analysis when things go wrong. This requires both development and operations effort; for this purpose we have set up a complete monitoring infrastructure for the cloud.


We mainly use three systems to monitor and track the availability of our services:

  • Zabbix to monitor infrastructure (services running, network conditions, server configuration). This is the only one of the three that doesn’t require any code inside the application.
  • Audit to track internal execution events so we can reconstruct a sequence of steps (e.g. how long did an execution take to run on multiple devices? when did it happen? how long did it take to get hold of a device?). This is an excellent companion for log gathering as well.
  • JMX and Prometheus Live Monitoring to gather real-time performance metrics specific to the system behaviour (e.g. CPU, memory, devices online, tests currently executing, success rates). Like Audit, this one is integrated into the code.

We have other monitoring systems in place, such as Graylog for semi-real-time log monitoring and Jolokia for remotely exposing the application’s JMX beans, but for the purposes of this article we’ll focus on the three above, with special emphasis on the Prometheus stack.

Zabbix

Zabbix is enterprise open-source monitoring software for networks and applications, provided and maintained by the Engineering Systems Support team at King. We use it to trigger alarms in the event of infrastructure issues, such as bad system configuration, network problems, and issues with the devices themselves.


For example, a prolonged failure in WiFi connectivity from a device can cause an automated test to fail, so we need to be sure that devices are always connected and responding to network requests in a timely fashion. Due to the number of devices and servers, it’s helpful for us to create device maps that track each device together with its corresponding server. When an issue is detected we can run remote operations on the device to try to fix it, and in extreme situations, physically check the device in its rack. The maps are generated with the help of automated scripts that build the inventory.


Zabbix is the only system of the three discussed that is exclusively an operations tool, without any development effort.

Audit

Now entering the development area: Audit is a system we built to track events and reconstruct the sequences of actions that happened, mainly for debugging or reporting KPIs. This infrastructure sits on top of our Akka-based solution by having actors extend an AuditableActor interface. You can read more about Akka here. If you are new to the Actor Model concept, you can get an overview here.

It works by intercepting messages sent to actors in the Device Cloud cluster that extend the AuditableActor interface; if the event is auditable, it is published to a queue in a message broker. The queue is later read by an audit system which registers the event in a MongoDB database.
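To make the idea concrete, here is a minimal sketch, not King’s actual AuditableActor implementation: it uses the Akka classic Java API, invents a simple AuditEvent type, and publishes to the Akka event stream instead of a message broker queue, just to keep the example self-contained.

import akka.actor.AbstractActor;

public class AuditedWorkerActor extends AbstractActor {

    // Hypothetical audit record; the real events carry richer execution metadata.
    public static final class AuditEvent {
        public final String actorName;
        public final String messageType;
        public final long timestampMillis;

        public AuditEvent(String actorName, String messageType, long timestampMillis) {
            this.actorName = actorName;
            this.messageType = messageType;
            this.timestampMillis = timestampMillis;
        }
    }

    @Override
    public Receive createReceive() {
        return receiveBuilder()
                .matchAny(msg -> {
                    // Publish an audit record before handling the message itself.
                    // (The real system sends these to a broker queue; the Akka event
                    //  stream is used here only to keep the sketch small.)
                    getContext().getSystem().eventStream().publish(
                            new AuditEvent(getSelf().path().name(),
                                    msg.getClass().getSimpleName(),
                                    System.currentTimeMillis()));
                    handle(msg);
                })
                .build();
    }

    private void handle(Object msg) {
        // The actor's actual device/test handling logic would live here.
    }
}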


The database can then be queried to reconstruct the sequence of events for a test execution, monitor exceptional events, or extract device usage time. We can even graph it to show the event sequences.
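For illustration, reconstructing the ordered event sequence for one execution could look roughly like this with the MongoDB Java driver; the database, collection, and field names used here (audit, events, executionId, timestamp) are assumptions rather than our real schema:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Sorts;
import org.bson.Document;

public class AuditQueryExample {

    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://audit-db:27017")) {
            MongoCollection<Document> events = client
                    .getDatabase("audit")
                    .getCollection("events");

            // All events for one test execution, oldest first.
            for (Document event : events
                    .find(Filters.eq("executionId", "exec-1234"))
                    .sort(Sorts.ascending("timestamp"))) {
                System.out.println(event.toJson());
            }
        }
    }
}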


JMX and Prometheus Live Monitoring

Finally, what connects everything is our live monitoring solution based on JMX and Prometheus.

The general concept is straightforward: we publish metrics to JMX, run Prometheus’ JMX Exporter as a Java agent on all the nodes to expose the beans in a format Prometheus understands, and configure at least one Prometheus server to pull metrics from these nodes. But the devil is in the details: here is what we did to accomplish it and make it easy to expand in the future.


Publish metrics to JMX (with Akka)

This is very well documented: all we had to do was add MBeans to JMX according to our needs. These MBeans can then be exported or monitored with different JMX monitoring tools, or via Java agents such as Jolokia. Depending on your needs you may have to implement custom MBeans, but if you want simplicity and better interoperability with external systems, you can use MXBeans. Since we just needed counters and gauges (based on Prometheus’ metric types), we were able to implement simple MXBeans with a few properties.

Basically, we have a BeanWithProperties interface with methods to retrieve a bean’s name and a Map of <property-name, property-value> tuples.

public interface BeanWithProperties {
    Map<String, String> getPropertyKeys();
    String getName();
}

We extend it with the types of beans we need. For example, to monitor device availability in the system:

public class PhysicalDeviceAvailabilityMetricBean implements BeanWithProperties {

    private final Map<String, String> properties;

    public PhysicalDeviceAvailabilityMetricBean(PhysicalDevice device) {
        ImmutableMap.Builder<String, String> builder = ImmutableMap.builder();
        builder.put("host", device.getDeviceAddress().getHost());
        builder.put("manufacturer", device.getManufacturer());
        builder.put("model", device.getModel());
        builder.put("os", device.getOs());
        builder.put("osVersion", device.getVersion());
        builder.put("serial", device.getSerial());
        this.properties = builder.build();
    }

    @Override
    public Map<String, String> getPropertyKeys() {
        return properties;
    }

    @Override
    public String getName() {
        return this.getClass().getSimpleName();
    }
}


Each of these properties will then be available as dimensions in Prometheus and can be used for queries or reporting (e.g. get all the Android devices with osVersion 5.0.1 currently connected to the system).

Given the concurrent nature of the cluster, and the isolation that the Actor Model requires as a precondition, we created a MetricActor whose sole responsibility is to publish metrics to JMX. It listens for MXBeanMetricEvent messages, each of which contains a BeanWithProperties, a metric name and (optionally) a value.

This is then wrapped in an MXBeanMetric bean, which adds a Map<String, BigInteger> to the BeanWithProperties so it can hold multiple metrics, and converts the MBean name to follow JMX Exporter’s convention:
domain<beanpropertyName1=beanPropertyValue1, beanpropertyName2=beanPropertyValue2, …><key1, key2, …>attrName: value
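For example, a bean registered for a single device would end up with a name roughly like the following (the domain and all the values are made up for illustration), with its counters and gauges exposed through the bean’s Values attribute:

devicecloud:type=PhysicalDeviceAvailabilityMetricBean,host="10.0.1.15",manufacturer="Samsung",model="SM-G930F",os="android",osVersion="7.0",serial="R58M123ABC"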

Since code is more expressive than words, here’s the DefaultMXBeanMetric implementation:

public class DefaultMXBeanMetric implements MXBeanMetric {

    private final Map<String, BigInteger> attrs;
    private final String name;

    private static final BigInteger INIT_VALUE = BigInteger.ZERO;

    public DefaultMXBeanMetric(BeanWithProperties mbean) {
        attrs = new HashMap<>();
        name = getNameWithKeys(mbean);
    }

    @Override
    public String getName() {
        return name;
    }

    private String getNameWithKeys(BeanWithProperties mbean) {
        String type = "type=" + mbean.getName();
        return Stream.concat(Stream.of(type),
                mbean.getPropertyKeys().entrySet().stream()
                        .map(e -> e.getKey() + "=" + quote(e.getValue())))
                .collect(Collectors.joining(","));
    }

    @Override
    public synchronized void setAttr(String attr, BigInteger value) {
        attrs.put(attr, value);
    }

    @Override
    public BigInteger getAttr(String attr) {
        return attrs.computeIfAbsent(attr, k -> INIT_VALUE);
    }

    @Override
    public void incAttr(String attr) {
        attrs.compute(attr, (k, v) -> {
            final BigInteger val = Objects.isNull(v) ? INIT_VALUE : v;
            return val.add(BigInteger.ONE);
        });
    }

    @Override
    public Map<String, BigInteger> getValues() {
        return ImmutableMap.copyOf(attrs);
    }
}


The MXBeanMetricEvent is defined in Scala as a case class:

case class MXBeanMetricEvent(mbean: BeanWithProperties, attr: String, value: Option[BigInteger] = None)

If a value is provided, the metric will be a gauge; if not, it will be a counter that increases the attr attribute of the mbean bean by one.

And this is the receive method for the MetricActor (Scala):

override def receive: Receive = {
  case MXBeanMetricEvent(counter, attr, value) => {
    try {
      val mbean = new DefaultMXBeanMetric(counter)
      if (!existsMBean(getBeanName(mbean))) {
        registerMBean(getBeanName(mbean), mbean)
      }
      val proxy = getMBean(getBeanName(mbean), classOf[MXBeanMetric])
      value match {
        case None => proxy.incAttr(attr)
        case Some(number) => proxy.setAttr(attr, number)
      }
    } catch {
      case e: Exception => log error(e, s"Unexpected error while processing metric ${counter}")
    }
  }
}

This is the core of the metrics publishing infrastructure. Once it is set up, any new metric that we want to report to JMX (and have read by JMX Exporter) just requires:

  1. Extending BeanWithProperties with a CustomBean (easy peasy).
  2. Publishing this line from an Actor context:

context.system.eventStream.publish(new MXBeanMetricEvent(new CustomBean(params), MetricName, [optional value]))

Configure JMX Exporter

After publishing metrics to JMX in the format JMX Exporter expects, as we did above, you just need to let JMX Exporter connect to your JVM. There are a couple of ways to do it, and they are very well explained in its documentation. Once it is set up, you are good to go. Nevertheless, the metrics reported to Prometheus might not follow the Metrics Naming and Labelling best practices.
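In our case this means starting each node’s JVM with the exporter attached as a Java agent, along these lines; the jar version, port, and application jar name below are only placeholders:

java -javaagent:./jmx_prometheus_javaagent-0.17.0.jar=9404:jmx_exporter.yml -jar your-application.jar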

To change this, all that’s needed is a jmx_exporter.yml configuration file. For example:

lowercaseOutputName: true
rules:
  # Special pattern for DeviceAvailability (PhysicalDeviceAvailabilityMetricBean.java)
  - pattern: '<jmx-domain><type=PhysicalDeviceAvailabilityMetricBean, host=(.+), logicalId=(.+), manufacturer=(.+), model=(.+), os=(.+), osVersion=(.+), serial=(.+), key=(.+)><>Values'
    name: <jmx-domain>_device_status_$8
    help: 'Metric to keep track of online devices. See: PhysicalDeviceAvailabilityMetricBean.'
    labels:
      host: $1
      logicalId: $2
      manufacturer: $3
      model: $4
      os: $5
      osVersion: $6
      serial: $7
  # Exporting all metrics using the default formatter
  - pattern: '.*'

With this mapping we get the metrics with nicer names in Prometheus. You can configure this for all the metrics that you export.
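For instance, assuming online is one of the attribute keys published for the bean, a single device’s availability metric would then show up in Prometheus roughly like this (domain prefix, label values, and the final value are all illustrative):

devicecloud_device_status_online{host="10.0.1.15", logicalId="node01-1", manufacturer="Samsung", model="SM-G930F", os="android", osVersion="7.0", serial="R58M123ABC"} 1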

Setting up Prometheus (plus Grafana and Alert Manager)

This is the last step: you just have to set up a server with Prometheus to pull the metrics from your nodes. This is well explained in the Prometheus documentation. The same applies to Grafana and Alert Manager.
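As a rough sketch, the Prometheus side then only needs a scrape configuration pointing at the exporter port on each node; the job name, hostnames, and port below are made up:

scrape_configs:
  - job_name: 'device-cloud-nodes'
    scrape_interval: 15s
    static_configs:
      - targets: ['node01.devicecloud:9404', 'node02.devicecloud:9404']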

We did this by configuring the deployment through Ansible and using Docker machines to make it easily recoverable and replicable, but more on that in another post.

At this point we are all set up.


Having the monitoring configured this way with Prometheus and Grafana allows us to track specific health indicators from our cloud including:

  • Number of devices online.
  • Which devices and nodes are currently executing tests.
  • Success rate for each device and type of test.
  • Type of errors reported by each device in case of test failures.
  • Performance metrics from each of the nodes in the system.

Additionally, Grafana provides integration with Zabbix, so we can track all the alerts in a single dashboard, and we could also add more fine-grained monitoring on top of Audit if we need it.
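To give an idea of how Alert Manager ties into this, an alerting rule on the devices-online indicator might look something like the following; the metric name and the threshold are invented for the example:

groups:
  - name: device-cloud
    rules:
      - alert: DevicesOnlineLow
        expr: sum(devicecloud_device_status_online) < 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'Fewer Device Cloud devices online than expected'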


Managing a Device Cloud that supports multiple teams at King is not a trivial task. Any team at King (not only games) can use our infrastructure according to their needs. Some teams execute one-off tests triggered manually; others have them configured to run nightly via Jenkins or on every commit. Some teams run startup tests (making sure the latest app build can actually run on different device configurations), while others use bots that play custom builds, or monkey tests that try to make the app crash. Some don’t require an internet connection at all, simply using the devices to compile C/C++ libraries on different processors/hardware; others, like Ads, rely heavily on WiFi reliability. Some teams even run tests for hours and execute performance tests.

By using our current monitoring solution, we have been able to identify issues like:

  • WiFi reliability – we track devices constantly and have been able to review different rack options and device distribution inside our server room for improved WiFi connectivity.
  • Device trends – we have been able to identify battery issues with devices or increased error rates on devices or execution engines.
  • Redundancies – we keep track of how many physical devices we have online in the system and are able to detect immediately when devices drop off.
  • Hardware issues – we can keep track of device USB connection events and detect when a device is having issues due to problems with the cable, USB hub, or port connections.
  • Bugs – we have even discovered bugs in the system before our users do, like high memory usage on our servers linked to failures in test executions, or resource leaks preventing devices from finishing test executions.
  • Recreation of events – we can easily recreate and time all the events that happened while a test was running. From Grafana we can take a snapshot of what the system looked like at a certain point, and with Audit we can see the sequence of events that led to that state.

Since our team manages both software and hardware, a lot of things can go wrong; we need to stay ahead of the issues, identify them as quickly as possible, and fix them (hopefully) before they impact our users. Our current monitoring solutions allow us to be Fast and Fluid in identifying problems and reacting to them, so the Kingdom can focus on delivering the high-quality, memorable games our users enjoy.

Christian Vielma

About Christian Vielma

Christian is a software developer working in the Device Cloud team at King. Originally from Venezuela, he is a skilled developer who has worked at different levels of the tech stack, with a main focus on the backend and Java. He believes excellence is achieved through everyday acts and that to have a better world you need to think globally and act locally. In his free time he enjoys traveling, outdoor activities, martial arts, playing the guitar, watching movies, and volunteering (including contributing to Open Source projects).
