Address Parser and the Sexiest Profession of the 21st Century

Address parsing is an issue of crucial importance to record linkage of people, and thus central to our business. Last Tuesday, during my boss’s presentation at the Seattle Marketing and Analytics Group meetup, a couple of data science practitioners from the retail side asked us whether we build our own address parser. Yes, we do. We use our own address parsing software, and we have no choice but to do it ourselves.

On the one hand, we simply have too many different data sources that we want to link and aggregate. Each has its own schema and its own propensity for making errors in its addresses. We have to build something specific for each data source instead of just using standard government-issued or open-source GIS address parsers to clean them all up. On the other hand, to process the 20-plus billion records we have, pinging Google Maps or Bing Maps to parse the addresses is simply out of the question. Even if we could, they still would not be good enough. When an ill-parsed address leads a Google Maps user to the opposite side of the city, all he or she can do is go on a tirade to family and friends about how bad Google Maps is. It does not really matter to Google, because nobody can really quit them. It is not the same for us: when we get an address wrong, the angry customer will actually be on the other end of the phone, taking it out on our poor customer representatives and demanding a full refund, if not compensation.

But there is a reason why Google Maps and Bing Maps have seemed so “deaf” to their users’ cries for better address accuracy. It is not that they are too arrogant; address parsing is just such a hard problem. Over the years, many brilliant papers have been published in the area, promising great performance improvements. But actually making these great algorithms work at industrial scale and at an accuracy customers will accept is not a simple task. You need to cover so many edge cases, collect so much training data, and do so much tweaking that you need researchers and scientists who are not only experts in the algorithms, understanding them thoroughly, knowing how to adapt them to the problem at hand and improve them to meet their needs, but who also have the patience and dedication to sit through long hours of sifting through the data, cleaning it up, and patching all the edge cases until everything works.
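
To see why, here is a deliberately naive sketch of an address parser, written in Python purely for illustration (the regular expression and field names are my own toy assumptions, not anyone’s production code). It handles the tidy case and falls over on anything slightly unusual, which is exactly where the real work begins.

```python
import re

# A deliberately naive address parser: one regular expression for the
# "number street, city, STATE zip" pattern and nothing else.
NAIVE_ADDRESS = re.compile(
    r"^(?P<number>\d+)\s+"      # house number
    r"(?P<street>[^,]+),\s*"    # street, up to the first comma
    r"(?P<city>[^,]+),\s*"      # city
    r"(?P<state>[A-Z]{2})\s+"   # two-letter state code
    r"(?P<zip>\d{5})$"          # five-digit ZIP
)

def parse(address):
    match = NAIVE_ADDRESS.match(address.strip())
    return match.groupdict() if match else None

print(parse("123 Main St, Seattle, WA 98101"))         # parses cleanly
print(parse("PO Box 42, Pullman, WA 99163"))            # no house number -> None
print(parse("500 1st Ave S Apt 3B Seattle WA 98104"))   # missing commas  -> None
```

PO boxes, unit numbers, rural routes, missing commas, misspelled city names: each of them breaks a parser like this, and a learned model only pushes the failures into rarer and stranger corners. Covering those corners is the long, unglamorous part of the job.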

Unfortunately, these days, some data science practitioners understand the basics of these algorithms and use them off the shelf as if they were black boxes, but never have the patience to make them really work. If one algorithm does not work, they try another one, and if none of the off-the-shelf algorithms works, they blame it on the academics and move on. They are keener on building systems than on actually making one simple system work elegantly and smoothly. Some even complain that the hard work of making a machine learning based system work is too menial, beneath them; they will not waste their lives on it, leaving the real stakeholders (the business side) shaking their collective heads and saying, ‘So this is all machine learning/data mining/data science can do? It is no better than the heuristic rule-based system we are already using.’

When I was a fresh graduate student, I asked my Ph.D. thesis adviser, Dr. Thomas G. Dietterich, the founding president of the International Machine Learning Society, the secret of getting the great algorithms we were learning and developing to work on real-world problems. I still remember that he paused for a moment and told me, ‘The most important thing is to become a real domain expert, spending hours and hours designing and engineering your features, cleaning up your data, and tweaking the parameters. If it still does not work, write your own algorithms and beat them as hard as possible till they finally fit the shape of your problem.’

This is how we were able to build a successful record linkage system with both high precision and high recall. When I told some of my colleagues that our models usually reach a precision of 99.6%, with recall typically in the range of 80% to 92%, most of them could not believe me. I did not lie. Nothing we used was truly magical. The only miraculous thing we did was hours and hours of feature engineering, training data cleaning, and parameter tuning.
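
For readers who want to pin down what those numbers mean in our setting, here is a minimal sketch of how precision and recall are computed for a linkage model over a labeled set of record pairs. The predict() function and the labeled pairs are hypothetical placeholders, not our production model or data.

```python
def evaluate(predict, labeled_pairs):
    """labeled_pairs: iterable of ((record_a, record_b), is_same_person)."""
    tp = fp = fn = 0
    for (a, b), is_match in labeled_pairs:
        predicted = predict(a, b)          # True if the model links the pair
        if predicted and is_match:
            tp += 1                        # correct link
        elif predicted and not is_match:
            fp += 1                        # wrong link (hurts precision)
        elif is_match:
            fn += 1                        # missed link (hurts recall)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

A precision of 99.6% means that of every thousand links the system asserts, only about four are wrong; the 80% to 92% recall says how many of the true links we actually manage to find.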

They say that Data Science is the sexiest profession of the 21st century, but as we all know, being sexy is not that easy. It takes a lot of work: hours and hours of grueling work in the gym, all the cardio, weight lifting, squats, contouring… You have to pay meticulous attention to nutrition, even to the point of being tyrannical with yourself. For us girls, there are also hours and hours spent on skin, hair, make-up, clothes… It is the same for the sexiest profession of the 21st century: beneath the glorious exterior are hours and hours of grueling, meticulous work.

Scaling for Big Graphs

Scaling up for big data is hard; scaling up a big graph-based clustering system is even harder, especially if you are not Google or Facebook and don’t have an unlimited budget.

We learned that lesson the hard way. At Inome, one of our biggest challenges is clustering 20+ billion records into about 250 million people profiles. We need to put the 20 billion records into a similarity graph and partition it into clusters using the scores from carefully tuned and engineered machine learning based similarity models. It does not take much to describe the approach (we have a couple of papers on it), but building a system that implements it is no simple task.
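
Stripped of everything that makes it hard, the core idea can be sketched on a single machine: score candidate pairs with a similarity model, keep the edges above a threshold, and take the connected components of the resulting graph as clusters. The similarity() function, the candidate pairs, and the 0.9 threshold below are illustrative assumptions; the production system is distributed and adds a great deal of blocking, scoring, and cluster refinement on top.

```python
class UnionFind:
    """Minimal disjoint-set structure for finding connected components."""
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def cluster(records, candidate_pairs, similarity, threshold=0.9):
    """Group records whose pairwise similarity clears the threshold."""
    uf = UnionFind(len(records))
    for i, j in candidate_pairs:              # pairs proposed by a blocking step
        if similarity(records[i], records[j]) >= threshold:
            uf.union(i, j)
    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(uf.find(i), []).append(i)
    return list(clusters.values())
```

At our scale, of course, none of this fits on one machine, and each of these steps becomes its own distributed problem.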

We started by making our system work for a couple of billion records, then scaled up a few hundred million or a couple of billion at a time, all the way to 20 billion. It has been an uphill struggle to optimize and parallelize each step of the pipeline, especially the steps that iterate over a large part of the graph or synchronize over the entire graph. Every step of the scaling-up process was pure pain; sometimes we had to completely rewrite key algorithms and develop new ways to implement and improve them. That is not all. The more you parallelize the algorithm, the more stages your pipeline contains, and keeping the entire pipeline running robustly and smoothly becomes all the more important. To do that, we had to add more stages of validation, normalization, and quantitative analysis of the quality of the data going into and coming out of each stage, before we finally had a robust and solid system.
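
The “check everything between stages” discipline is simple to sketch, even if the checks themselves are where the real effort goes. The stage and check names below are illustrative placeholders, not our actual pipeline.

```python
def run_pipeline(data, stages):
    """stages: list of (name, transform, validate) triples."""
    for name, transform, validate in stages:
        data = transform(data)
        stats = validate(data)   # e.g. record counts, null rates, key coverage
        if not stats.get("ok", False):
            raise RuntimeError(f"stage '{name}' produced suspect output: {stats}")
        print(f"stage '{name}' passed checks: {stats}")
    return data
```

Most of the pain is in writing validate() functions that actually know what “suspect output” looks like for each particular stage.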

It is only natural that we feel a lot of pride in pulling all of this through, the kind of joy NASA scientists and engineers must feel when they watch rockets roll off a new production line, rockets powerful enough to send payloads to Mars. So you can imagine our shock when we had to scale our system down to meet some new customer needs. We never realized how much we had built until we had to link only a couple of thousand records to our corpus. I guess you never fully appreciate how sophisticated your rocket production pipeline is until you try to use it to make a firecracker for your daughter on Independence Day. :-)

But unlike making a firecracker, we still have 250 million profiles on the other side of the bipartite graph. To make the new scaled-down pipeline robust and working well for small data (to us, that means anywhere from thousands to tens of millions of records), there is still a lot of optimization to do. Never during our days of struggling to scale up did we imagine that one day, after we finally made it, we might have to scale it all the way back down and re-adapt ourselves to the world of “small data”.
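
A sketch of what the scaled-down path looks like in spirit: instead of rebuilding the whole similarity graph, a small batch of incoming records is linked against the existing profiles through a blocking index, so a few thousand inputs only ever touch a tiny slice of the corpus. The block_key() choice and the in-memory dictionary are illustrative assumptions; the real 250 million profiles live in a distributed store, and real systems use several blocking keys, not one.

```python
from collections import defaultdict

def block_key(record):
    # A hypothetical blocking key: last name plus ZIP code.
    return (record.get("last_name", "").lower(), record.get("zip", ""))

def build_index(profiles):
    """Group existing profiles by blocking key for fast candidate lookup."""
    index = defaultdict(list)
    for profile in profiles:
        index[block_key(profile)].append(profile)
    return index

def link_small_batch(new_records, index, similarity, threshold=0.9):
    """Link a small batch of records against the indexed corpus."""
    links = []
    for record in new_records:
        candidates = index.get(block_key(record), [])   # tiny slice of the corpus
        scored = [(similarity(record, p), p) for p in candidates]
        if scored:
            score, best = max(scored, key=lambda sp: sp[0])
            if score >= threshold:
                links.append((record, best))
    return links
```

The design choice is the same one that made scaling up possible, just pointed the other way: good blocking keeps the work proportional to the size of the batch, not the size of the corpus.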