Scaling up for big data is hard; scaling up a large graph-based clustering system is even harder, especially if you are not Google or Facebook and don’t have an unlimited budget.
We learned that lesson the hard way. At Inome, one of our biggest challenges is clustering 20+ billion records into about 250 million people profiles. We put the 20 billion records into a similarity graph and partition it into clusters, with edge weights produced by carefully tuned and engineered machine-learning-based similarity models. The approach does not take much to describe (we have a couple of papers on it), but building a system that implements it is no simple task.
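For a concrete picture of the idea, here is a minimal sketch in Python: score candidate record pairs with a similarity model, keep the edges that clear a threshold, and treat the connected components of the resulting graph as clusters. The field-matching similarity function, the threshold, and the all-pairs comparison are illustrative stand-ins, not our actual models or our distributed implementation.

```python
from itertools import combinations

def similarity(rec_a, rec_b):
    """Hypothetical stand-in for a learned similarity model; returns a score in [0, 1]."""
    fields = ("name", "address", "phone")
    hits = sum(
        1 for f in fields
        if rec_a.get(f) is not None and rec_a.get(f) == rec_b.get(f)
    )
    return hits / len(fields)

def cluster(records, threshold=0.5):
    """Build a similarity graph over the records and return its connected
    components as clusters, using a small union-find structure."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # All-pairs comparison: fine for a sketch, hopeless at 20 billion records,
    # which is why the real system needs blocking and heavy parallelism.
    for i, j in combinations(range(len(records)), 2):
        if similarity(records[i], records[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i, record in enumerate(records):
        clusters.setdefault(find(i), []).append(record)
    return list(clusters.values())

people = [
    {"name": "Ann Lee", "phone": "555-0100"},
    {"name": "Ann Lee", "address": "12 Oak St", "phone": "555-0100"},
    {"name": "Bo Chan", "phone": "555-0199"},
]
print(cluster(people))  # the two "Ann Lee" records land in one cluster
```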
We started by making our system work for a couple of billion records, then scaled up a few hundred million to a couple of billion at a time, all the way to 20 billion. It has been an uphill struggle to optimize and parallelize each step of the pipeline, especially the steps that iterate over a large part of the graph or synchronize across the entire graph. Every step of the scaling-up process was pure pain; sometimes we had to completely rewrite key algorithms and develop new ways to implement and improve them. That is not all. The more you parallelize the algorithm, the more stages your pipeline contains, and keeping the entire pipeline running robustly and smoothly becomes all the more important. To do that, more stages of validation, normalization, and quantitative analysis of the data quality at each stage’s input and output had to be added before we finally had a robust and solid system.
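The shape of those per-stage checks is simple even if the production version is not. Below is a hypothetical sketch of a stage wrapper that validates its input and flags suspicious record loss between input and output; the stage names, validator, and drop-rate threshold are made up for illustration.

```python
# Illustrative only: stage names, validator, and thresholds are invented.
def run_stage(name, stage_fn, records, validate, max_drop_rate=0.05):
    """Run one pipeline stage with basic input validation and a sanity check
    on how many records disappear between input and output."""
    bad_input = [r for r in records if not validate(r)]
    if bad_input:
        raise ValueError(f"{name}: {len(bad_input)} invalid input records")

    output = stage_fn(records)

    dropped = 1.0 - len(output) / max(len(records), 1)
    if dropped > max_drop_rate:
        raise ValueError(f"{name}: dropped {dropped:.1%} of records")
    print(f"{name}: {len(records)} in, {len(output)} out")
    return output

# Stages chain together, with a quality report after each one, e.g.:
#   records = run_stage("normalize", normalize_fn, raw_records, has_required_fields)
```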
It is only natural that we feel a lot of pride in pulling all of this through: the kind of joy NASA scientists and engineers experience when they watch rockets rolling off a new production line, rockets powerful enough to send payloads to Mars. So you can imagine the shock we feel when we have to scale our system down to meet some new customer needs. We never realized how much we had built until we had to link only a couple of thousand records to our corpus. You never fully appreciate how sophisticated your rocket production pipeline is until you try to use it to make a firecracker for your daughter on Independence Day.
But unlike making a firecracker, we still have 250 million profiles on the other side of the bipartite graph. To make the new scaled-down pipeline robust and well-behaved for small data (which, to us, means anywhere from thousands to tens of millions of records), there is still a lot of optimization to do. We had imagined, during our days of struggling to scale up, that one day, after we finally made it, we might have to scale it all the way back and re-adapt ourselves to the world of “small data”.
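As a hedged sketch of what that small-data path might look like, the snippet below links a batch of incoming records against existing profiles one at a time instead of re-clustering the whole graph. The candidate_profiles lookup (a blocking-style candidate fetch), the similarity function, and the threshold are assumptions for illustration, not our production interfaces.

```python
# Hypothetical sketch of the "small data" path: link a few thousand incoming
# records to existing profiles instead of re-partitioning the whole graph.
def link_records(new_records, candidate_profiles, similarity, threshold=0.8):
    """Attach each new record to its best-matching existing profile, or start
    a new profile when nothing clears the threshold."""
    new_profiles = []
    for record in new_records:
        best_profile, best_score = None, 0.0
        for profile in candidate_profiles(record):  # e.g. fetched via a blocking index
            score = similarity(record, profile)
            if score > best_score:
                best_profile, best_score = profile, score
        if best_profile is not None and best_score >= threshold:
            best_profile["records"].append(record)
        else:
            new_profiles.append({"records": [record]})
    return new_profiles
```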