By Raquel Rodriguez & Ryan Munro – Machine Learning Engineers
Since inception, CircleUp has taken on a meaty technical challenge that we call entity resolution – sounds boring, right? It’s not. Entity resolution is the process of distinguishing between unique brands or “entities” and accurately assigning data from dispersed sources to each brand. But the scope of the challenge is bigger – we don’t always know that the representations we are ingesting match to the relevant entity. This process is foundational to everything we do. Helio is the platform that identifies, classifies, and evaluates early-stage companies to generate investment insights. Without entity resolution, there are no insights. The Helio knowledge graph exists at three levels:
I. Brand to Brand Resolution – e.g. Native Shoes is distinct from Native Deodorant which is distinct from Native Eyewear, even though they are all called “Native” by the end consumers. This brand to brand resolution occurs across source types (social, distribution, reviews, etc.).
II. Product to Brand Resolution – e.g. a deodorant product review for Native Unearthed (which, by the way, is distinct from Native Deodorant) should not be attributed to Native Eyewear, a mistake made by Amazon below (you get a sense of how massive this technical problem is when even Amazon gets it wrong sometimes).
III. Product to Product Resolution – e.g. Native’s Sixty-Six Polarized glasses are sold in different places, with different naming conventions, different descriptions, different ratings, and at different price points. Despite these variances, the data will be processed to recognize it as a single product.
Traditionally, this problem of entity resolution has been solved using a deterministic, rule-based approach that generates links between identities from different information silos. But what happens when there are not enough reliable identifiers across all sources?
We have created a pipeline that cleans product and brand level data, creates unique identifiers, and resolves the entities in a scalable way. We are talking about over a million brands (compared to just tens of thousands tracked by the best retail-level sales providers), tens of millions of products, each with hundreds or thousands of features, many of which are in time series. To put this into perspective, we have used over 100 years of computing power (1 computer for 100 years) in just the last 6 months. Chew on that.
The overall success of our pipeline is measured using an F1 score against a human-tagged dataset. The F1 score is the harmonic mean of precision and recall and falls between 0 and 1. We measure precision and recall at each stage of the pipeline as we continuously iterate and improve our system.
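To make the metric concrete, here is a minimal sketch of scoring predicted links against a human-tagged set. The function name, the record ids, and the brand ids are all hypothetical and chosen for illustration; this is not the production evaluation code.

```python
def precision_recall_f1(predicted_links, tagged_links):
    """Score predicted (record, entity) links against human-tagged links."""
    predicted = set(predicted_links)
    tagged = set(tagged_links)
    true_positives = len(predicted & tagged)
    # Precision: share of predicted links that the human taggers agree with.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of human-tagged links that the pipeline recovered.
    recall = true_positives / len(tagged) if tagged else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical links: (record id, resolved brand id).
predicted = [
    ("review_1", "native_shoes"),
    ("review_2", "native_shoes"),
    ("review_3", "native_deodorant"),
    ("review_4", "native_eyewear"),
]
tagged = [
    ("review_1", "native_shoes"),
    ("review_2", "native_deodorant"),
    ("review_3", "native_deodorant"),
    ("review_4", "native_eyewear"),
    ("review_5", "native_shoes"),
]
precision, recall, f1 = precision_recall_f1(predicted, tagged)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here three of the four predicted links are correct (precision 0.75) and three of the five tagged links are found (recall 0.60), so the F1 score lands between the two at roughly 0.67.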
Of the five “Natives” listed above, one recently sold for $100M after 2 years in business – a winner you wouldn’t want to miss. Our system is designed to find that brand, classify it, and predict success at the earliest signals. Identifying which Native is the breakout brand requires coverage across hundreds of dispersed sources (otherwise we’d miss it), and entity resolution (otherwise we don’t know the difference between this Native and the next). Take this real product review: “super comfortable but broke on the first day I wore them between the toe, bummer.” – this is (obviously) for Native Shoes, but if mis-linked, it could muddy the data for Native Deodorant. Even seemingly trivial connections make a difference.
This is the task at hand, and it’s hard. Below we will offer up both a technical and non-technical take on what is happening at each of these three levels and why it is important – pick the column that best suits you or go for both.
I. BRAND TO BRAND RESOLUTION
Take five brands with the identical name – e.g. Hum – which span skincare, electronics, restaurant supply, spirits, and beauty supplements. How do we, as humans, distinguish between them and determine whether they are distinct entities or duplicates?
More interestingly, how would we recreate this process to scale across millions of companies? The challenge is massive – there is no standardized or unique identifier for private companies or the brands that roll up into those companies. The public markets benefit from unique tickers for trading. At CircleUp we have taken on the task of assigning unique identifiers in the private markets. The difference is, there are 6,000,000 private companies in the US alone – quite a few more than the 4,000 public companies. Additionally, we are dealing with messy data as compared to the standardized metrics on public companies.
The implications of this task are tremendous. Only with a complete picture of the private company landscape can we build out a true competitive landscape and identify specific points of product differentiation. Most data sets are not representative of the broader landscape because they exclude emerging brands and skew towards larger, more established names and the corresponding big retailers. We are changing that.
CircleUp’s Growth Partners saw reasons to lead an investment in HUM Nutrition. This wouldn’t have happened without Helio.
Without the existence of standardized identifiers (e.g. tickers in the public markets, social security numbers for individuals) we are left with brand identity ambiguity, which we tackle head on.
We use unstructured text to link brands and split brands (e.g. if one Hum is associated with a higher frequency of “wine”, and another with “face,” we split), then assign each a unique identifier. Perfection is aspirational in these ambiguous areas. Today we’re using TF-IDF and doing a dot product between two TF-IDF matrices to generate “similarity” scores. We are moving towards a deep learning (DL) recurrent neural network (RNN) architecture where we use an attention block to read text and classify it. This has the ability to learn sequential patterns in text, such as word order. TF-IDF alone will not capture the subtleties in “chocolate color paint” being “paint” and not “chocolate.”
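The TF-IDF dot product can be sketched in a few lines of plain Python. Everything below is an assumption made for illustration: the whitespace tokenizer, the smoothed IDF formula, the L2 normalization (which turns the dot product into a cosine similarity), and the three made-up text snippets. The production system works on full TF-IDF matrices rather than per-document dictionaries.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one L2-normalized {term: weight} vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vector = {}
        for term, count in counts.items():
            tf = count / len(tokens)
            # Smoothed IDF (an assumed variant): rarer terms weigh more.
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
            vector[term] = tf * idf
        norm = math.sqrt(sum(w * w for w in vector.values()))
        if norm > 0:
            vector = {t: w / norm for t, w in vector.items()}
        vectors.append(vector)
    return vectors


def similarity(vec_a, vec_b):
    """Dot product of two sparse vectors (cosine, since both are normalized)."""
    return sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())


# Made-up text associated with three records that all carry the name "Hum".
docs = [
    "hum red wine blend wine tasting notes",
    "hum face oil for dry face skin",
    "hum wine club cabernet tasting",
]
vectors = tfidf_vectors(docs)
print("record 0 vs 1:", round(similarity(vectors[0], vectors[1]), 3))
print("record 0 vs 2:", round(similarity(vectors[0], vectors[2]), 3))
```

Records 0 and 2 share "wine" and "tasting", so they score higher with each other than either does with the "face" record, which shares only the brand name. That gap is what drives a link in one case and a split in the other.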
We normalize all of our data, which lets us ingest data from a new source without having to manually resolve it against hundreds of other sources – we literally “plug in” new sources to our system then re-calibrate the weightings of each data set.
The system makes heavy use of distributed computation via Apache Spark, and the model trains in 5–10 minutes from scratch. By utilizing a custom version of distributed randomized hyperparameter search, we’re able to save hours (and sometimes days) with 40 machines. If you have any idea what this means, you’ll be impressed with the speed of iteration it allows on the algorithm for engineering & data science teams.
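For readers who have not met randomized hyperparameter search, here is a minimal sketch of the idea. It is not the custom Spark version described above: a local thread pool stands in for the cluster, and the search space, the parameter names, and the toy scoring function are all assumptions made for illustration.

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Hypothetical search space: (low, high) bounds per hyperparameter.
SEARCH_SPACE = {
    "learning_rate": (0.001, 0.3),
    "max_depth": (2, 10),
}


def sample_params(rng):
    """Draw one random configuration from the search space."""
    return {
        "learning_rate": rng.uniform(*SEARCH_SPACE["learning_rate"]),
        "max_depth": rng.randint(*SEARCH_SPACE["max_depth"]),
    }


def evaluate(params):
    """Toy stand-in for 'train the model and return a validation score'."""
    return (
        1.0
        - abs(params["learning_rate"] - 0.1)
        - 0.02 * abs(params["max_depth"] - 6)
    )


def random_search(n_trials, n_workers, seed=0):
    """Score n_trials random configurations in parallel and keep the best."""
    rng = random.Random(seed)
    candidates = [sample_params(rng) for _ in range(n_trials)]
    # Trials are independent, so they can be farmed out to workers.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        scores = list(pool.map(evaluate, candidates))
    best_score, best_params = max(
        zip(scores, candidates), key=lambda pair: pair[0]
    )
    return best_params, best_score


best_params, best_score = random_search(n_trials=40, n_workers=4)
print(best_params, round(best_score, 3))
```

Because every trial is independent, adding workers shortens the wall-clock time of a search almost linearly, which is where the saved hours come from.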
II. PRODUCT TO BRAND RESOLUTION
Once we have identified specific brands, unique products must be mapped to the correct brand, which includes all the information that comes with that product – distribution, reviews, pricing, sizing, packaging, ingredients, claims, and other attributes. It’s one thing to differentiate between an electronics company and a supplement company, but an entirely different process to make sure that a craft brandy with the same name hasn’t been incorrectly attributed to either brand. And don’t be fooled by the image above – the majority of product mapping occurs without a clean UPC tag.
The prerequisite to this mapping is categorization. When it comes time for predictive modeling, having an accurate category ensures that products are being compared to like products and brands are being compared to like brands. We wouldn’t want to compare the social growth of a D2C color cosmetics company with that of a 30-year-old water company. So we implement a model to classify at the brand level as well as the product level.
An ongoing business challenge is the evolution of product categories and the implied taxonomy changes. We are in the process of developing dynamic categories to ensure we can spot a new category before it becomes mainstream. Syndicated data providers introduce new categories years after they hit the market – it’s awfully hard to build a competitive set in the Kombucha space if that category hasn’t been manually tagged yet. We are changing that.
Most of the raw data that goes into the system is unstructured, so the first step in the pipeline is normalization to put all the existing sources into a common schema. The aim is to extract different characteristics (product size, flavor, color, attributes, etc.) to make them available for the later stages of predictive modeling.
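As a rough illustration of what putting sources into a common schema means, the sketch below maps two hypothetical feeds onto the same three fields and pulls a size out of the product name when the source does not supply one. The source names, field names, regular expression, and sample records are invented for this example and do not describe the actual Helio schema.

```python
import re

# Hypothetical per-source mapping from raw field names to common fields.
FIELD_MAP = {
    "retailer_feed": {"title": "name", "brand_name": "brand", "pack_size": "size"},
    "review_site": {"product": "name", "maker": "brand", "size": "size"},
}

# Matches sizes such as "12 oz" or "2.5 ml" inside free text.
SIZE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(oz|ml|g)\b")


def clean(value):
    """Lowercase and collapse runs of whitespace."""
    return " ".join(str(value).lower().split())


def normalize_record(source, raw):
    """Map one raw record from a known source onto the common schema."""
    record = {"source": source, "name": None, "brand": None, "size": None}
    for raw_field, common_field in FIELD_MAP[source].items():
        value = raw.get(raw_field)
        if value is not None:
            record[common_field] = clean(value)
    # If the source has no size field, try to extract one from the name.
    if record["size"] is None and record["name"]:
        match = SIZE_PATTERN.search(record["name"])
        if match:
            record["size"] = f"{match.group(1)} {match.group(2)}"
    return record


from_retailer = normalize_record(
    "retailer_feed",
    {"title": "Example  Brand Sparkling Water 12 oz", "brand_name": "Example Brand"},
)
from_reviews = normalize_record(
    "review_site",
    {"product": "Sparkling Water", "maker": "EXAMPLE BRAND", "size": "12 OZ"},
)
print(from_retailer)
print(from_reviews)
```

After normalization both records carry the same brand and size values even though they arrived with different field names, casing, and spacing. Adding a new source then comes down to adding one more entry to the mapping.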
Two steps follow this normalization, a text extraction phase and a unique identifier search phase. The text extraction phase ensures that the minimum requirements to search for a unique identifier are met (e.g. brand + product type). If a product is missing a clear brand name, we scan for the brand name inside the product and extract that name. We use a gradient boosting decision tree algorithm and natural language processing techniques to know the position of the brand name in the text string.
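To show what "knowing the position of the brand name in the text string" looks like in practice, here is a deliberately simpler stand-in: a dictionary lookup against a list of known brand names. This is not the gradient boosting model described above, which learns the position rather than looking it up. The function name, the brand list, and the product string are hypothetical.

```python
def find_brand(product_text, known_brands):
    """Return (brand, token position) for the first known brand in the text.

    Longer brand names are tried first so that a multi-word brand is not
    shadowed by a shorter brand that shares its first word.
    """
    tokens = product_text.lower().split()
    by_length = sorted(known_brands, key=lambda b: len(b.split()), reverse=True)
    for brand in by_length:
        brand_tokens = brand.lower().split()
        width = len(brand_tokens)
        for start in range(len(tokens) - width + 1):
            if tokens[start:start + width] == brand_tokens:
                return brand, start
    return None, -1


# Hypothetical product string with no separate brand field.
known_brands = ["Native", "Native Unearthed"]
brand, position = find_brand("deodorant stick by Native Unearthed", known_brands)
print(brand, position)
```

A lookup like this only works for brands already on the list and for exact spellings, which is one reason a learned model is the better fit for messy, open-ended product text.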
The last phase of data preprocessing is to search for the unique identifier. In this task we have a more complex machine learning algorithm that ties each entity to the universal unique identifier that we generate, described in section I.
III. PRODUCT TO PRODUCT RESOLUTION
With almost 100 million products in Helio, we first must dedupe records to generate a list of unique items. Imagine we were working with just the 26 Hum products shown above from three sources – we’d simply compare each one with the next and determine which are the same. This systematic computation results in a 676-cell table (26²). Extrapolate that method to 100 million products and we have a ten quadrillion cell table ((100 million)² has 16 zeros). A different approach is therefore needed.
If the system fails to identify which products are unique and which are the same, then the predictive models won’t have an accurate picture of each product’s current aggregate distribution or the full mosaic of consumer product-level sentiment. Five products existing in 20 stores is very different from a single product in a total of 100 stores and would be interpreted differently by an algorithm.
One model, for example, predicts future distribution growth at the product level, which is based largely on historical distribution. This aggregation process is what builds the knowledge graph and fuels this sort of prediction. Nobody else does this.
From an algorithmic complexity perspective we have a quadratic calculation, O(N²), if we attempt to compare each record against every other record. To reduce the computational complexity of this problem, we use a technique called “blocking,” which minimizes the number of comparisons. For example, product records that map to the same brand will be linked in the same block. Later the blocks will be distributed across multiple systems. This results in just billions of comparisons, rather than quadrillions.
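Blocking is easy to see on a toy example. The sketch below groups records by a blocking key (here a hypothetical brand_id field) and only generates pairs within each group. The record layout and ids are invented for illustration, and the distribution of blocks across machines is left out.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, block_key):
    """Yield record pairs that share a block, instead of all possible pairs."""
    blocks = defaultdict(list)
    for record in records:
        blocks[block_key(record)].append(record)
    # Comparisons only happen inside a block, never across blocks.
    for block in blocks.values():
        yield from combinations(block, 2)


# Six hypothetical product records already mapped to three brands.
records = [
    {"id": 1, "brand_id": "brand_a"},
    {"id": 2, "brand_id": "brand_a"},
    {"id": 3, "brand_id": "brand_a"},
    {"id": 4, "brand_id": "brand_b"},
    {"id": 5, "brand_id": "brand_b"},
    {"id": 6, "brand_id": "brand_c"},
]

blocked = list(candidate_pairs(records, block_key=lambda r: r["brand_id"]))
unblocked = list(combinations(records, 2))
print(len(blocked), "pairs with blocking,", len(unblocked), "without")
```

Even on six records the blocked version needs 4 comparisons instead of 15 distinct pairs. The saving grows with the number of blocks, because the cost depends on the sum of squared block sizes rather than the square of the total.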
With billions of product comparisons set up, we apply a similarity measure that produces a score (0-100) to measure how likely it is that two products are the same. The similarity measures we use are based on variations of the Levenshtein distance, a common distance metric used when comparing pairs of words.
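Here is a minimal sketch of a Levenshtein-based score on a 0-100 scale. The plain edit distance is the textbook algorithm; the scaling by the longer string's length and the lowercasing are assumptions, since the exact variations used in the pipeline are not spelled out here.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits that turn a into b."""
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,  # delete char_a
                    current[j - 1] + 1,  # insert char_b
                    previous[j - 1] + substitution_cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def similarity_score(a, b):
    """Scale edit distance to 0-100, where 100 means identical strings."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


print(round(similarity_score("Sixty-Six Polarized", "sixty six polarized"), 1))
print(round(similarity_score("Sixty-Six Polarized", "coconut deodorant"), 1))
```

The two spellings of the same sunglasses differ by a single character and score close to 100, while an unrelated product name scores far lower.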
The last phase is to verify whether our similarity measures are accurate, which relies on a rigorous phase of human tagging on a sample of products with multiple “judgements.” This leads to confidence measures to decide which products are truly the same.
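One simple way to turn multiple human judgements into a confidence measure is a majority vote with an agreement ratio, sketched below. This is an assumed aggregation scheme for illustration only, and the function name and sample votes are hypothetical.

```python
def aggregate_judgements(judgements):
    """Combine several True/False 'same product?' judgements for one pair.

    Returns (label, confidence): label is the majority answer (ties count
    as 'not the same'), and confidence is the share of judges who agree
    with that label.
    """
    if not judgements:
        raise ValueError("at least one judgement is required")
    same_votes = sum(1 for judgement in judgements if judgement)
    total = len(judgements)
    label = same_votes * 2 > total
    agreeing = same_votes if label else total - same_votes
    return label, agreeing / total


# Three hypothetical taggers reviewed this product pair.
label, confidence = aggregate_judgements([True, True, False])
print(label, round(confidence, 2))
```

Pairs with high agreement can serve as ground truth for checking the similarity scores, while low-agreement pairs are candidates for another round of tagging.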
Historically, we have mapped products to brands because most emerging companies operate a single brand. As we work with larger companies with multiple brands in the market, we are now also mapping to the company. But this model isn’t as simple as you might think. An example: In 2012 Kraft Foods spun out into two groups, Mondelez International (specializing in snack foods) and Kraft Foods Group (specializing in grocery items). In 2015 Kraft Foods Group merged with Heinz, now Kraft Heinz. Does that make the Mondelez Oreo a step-sibling to Heinz ketchup? Beyond entity resolution, there is the complication of relationships, and the evolution over time.
It should now be clear that the problem of entity resolution is tremendous. And in the context of investing, it is essential to glean a representative picture of the whole market and identify breakout brands. We are testing the boundaries of what is possible and there is much more to solve. We feel up for the challenge.