“11:15 Restate my assumptions:
1. Mathematics is the language of nature.
2. Everything around us can be represented and understood through numbers.
3. If you graph these numbers, patterns emerge. Therefore: There are patterns everywhere in nature.”
– Max Cohen (played by Sean Gullette, in Pi, a film by Darren Aronofsky)
There is a general belief among mathematicians that every problem can be introduced as a graph of a certain type. A graph representation should make a problem visually appealing and much easier to understand and hopefully solve. However, as everything in life it comes with the cost. Graph algorithms are iterative in their nature and require a lot of technical resources and can very costly. To use existing frameworks, one is faced with the challenge of partitioning the graph across cluster nodes. Finding efficient graph cuts that minimise communication between nodes, and are also balanced, is a hard problem. More generally distributed systems and their users must deal with managing a cluster, fault tolerance, and often unpredictable performance.
From the perspective of programmers, debugging and optimising distributed algorithms is hard. The scalability achieved with the use of distributed systems brings as well a lot of other problems, such as extensive engineering support, a high cost of setting up and support. Nevertheless, while the number of vertices in graphs are very high and might reach several billion, the files sizes are still relatively small: just several gigabytes. So the question is: Can we do it from a desktop or even laptop? Apparently the answer is yes – and in order to achieve this, one might use the framework developed by several guys from Carnegie Mellon and Washington University.
At the heart of the framework called GraphChi, there is a Parallel Sliding Window (PWS) algorithm. Apparently, by using a well-known method to break large graphs into small parts, and the PSW algorithm, GraphChi is able to execute several advanced data mining and machine learning algorithms on very large graphs, using just a single consumer-level computer. That seems advantageous and so we decided to run some tests in order to verify these claims. However, to start with, let us first explain the PWS algorithm. As you might know, a graph contains vertices and edges, so two dots – vertices connected with each other a line – edge would be a simple graph containing two vertices and one edge. Under PSW, the vertices of a graph are split into a number of disjoint intervals. For each interval, the algorithm creates a shard, which stores all edges that have a destination (vertices) in the interval. Edges are stored in order of their source, for instance, based on some ID number assigned to each vertex. A number of intervals are chosen to balance the number of edges in each shard; the number of intervals is chosen so that any shard can be loaded completely into memory.
The framework works with two different graph formats: edge and adjacency lists format, where adjacency list contains the full path from one vertex to the last one in one line and edge list has got the only pair of connected vertices in one line. Actual computing is performed by the custom implemented function update. The function can access vertices and its incident edges and can be executed for each of the vertices iteratively. Every vertex and edge might have a value stored on them the update function will update those value and pass further to the next vertex or edge and like this till the whole graph is traversed in defined number of iterations. This way the loop would have a number of iterations by the number of intervals on each iteration spawning a number of parallel processes evaluating shards belong to this particular interval.
The problem we wanted to solve was calculating number of triangles using Facebook graphs underlying our user base. The triangle-free graph would the one where no three vertices for a triangle of adjacent edges. The graph without triangle called complete-bipartile graph and it is a graph where every vertex of the first set connected to every vertex of the second set. Intuitively, we would believe that our graph contains a lot of triangles as our users would be connected to each other through some proxy user between them and that would help channel viral messages in the right direction, having reached every last user of the underlying graph. The user having the largest number of triangles associated to her would be the most viral user.
Certainly, there are several ways to solve this problem and many different graph algorithms would be applicable; however, to start with we have chosen the triangle algorithm. After some weeks of coding and customising the framework for our needs, we were ready to start the tests. We had an HP laptop with 16 GB of RAM, CPU email@example.com x4, SSD 160 GB, graph size 7,294,535 vertices. Decent graph size which equates to a 1.5 GB file. It took about 61 seconds to generate shards and then about 178 seconds to run the analysis – if you don’t think it is impressive you should calculate how much does it cost to buy and support a distributed cluster of machines, as opposed to running it on your laptop. The creators of the framework have reported similar results for their tests, thus, for instance, it took 37 minutes to analyse the whole of Yahoo-web, containing 1.4 billion of vertices and 6.6 billion edges on Apple Mac Mini. Impressive indeed!
The PWS algorithm enables processing of large graphs with very few non-sequential disc accesses. For researchers, GraphChi is a solid tool for systems evaluation and can be comparable with big distributed systems. Generating appropriate data structure sometimes might be a good alternative to scaling up.
Reference: Aapo Kyrola, G. Blelloch, C. Guestrin. GraphChi: Large-scale Graph Computation on Just a PC. OSDI 2012.