Industry Classification – The Technical Challenge

Why is category classification important?

Take a look at the products below and see if you can sort them into specific sub-industries. For a human, this task is relatively simple. Now imagine having to perform it frequently and near-instantly across 1.4M companies. Manual review no longer scales.

Enter CircleUp’s Industry Classification model.

CircleUp’s machine learning platform, Helio, tracks 1.4M companies and assigns each company a parent category (e.g. Food) and a subcategory (e.g. Cheese Alternatives). The model is a hierarchical taxonomy system built from two conditionally dependent multi-class text classifiers. Companies are assigned to 13 parent categories and more than 100 subcategories. These category assignments are hugely important for sourcing and evaluation: they allow us to run evaluative models at the category level and to define competitive sets of similar companies at similar sizes. Imagine comparing the brand resonance of an infant food to that of a shampoo or granola bar. Not helpful. Instead, we make evaluative comparisons within a single category.
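
The two-stage structure described above can be sketched in a few lines of Python. This is a minimal illustration, not Helio’s actual implementation: it assumes a tiny hand-written naive Bayes classifier, a made-up training set, and hypothetical labels (`Personal Care`, `Granola Bars`, `Shampoo`, `Body Wash`) and helper names (`TinyNaiveBayes`, `train_hierarchy`, `classify`). One classifier picks the parent category, and a separate subcategory classifier, trained only on that parent’s companies, picks the subcategory.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial naive Bayes over whitespace tokens with add-one smoothing."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for word in text.lower().split():
                score += math.log((counts[word] + 1) / denom)  # smoothed log likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training rows: (description, parent category, subcategory).
TRAIN = [
    ("creamy cashew cheese alternative dairy free", "Food", "Cheese Alternatives"),
    ("vegan almond cheese slices", "Food", "Cheese Alternatives"),
    ("crunchy oat granola bar with honey", "Food", "Granola Bars"),
    ("chewy granola bar with nuts", "Food", "Granola Bars"),
    ("moisturizing shampoo for dry hair", "Personal Care", "Shampoo"),
    ("sulfate free shampoo for curly hair", "Personal Care", "Shampoo"),
    ("gentle body wash with aloe", "Personal Care", "Body Wash"),
    ("foaming body wash for sensitive skin", "Personal Care", "Body Wash"),
]


def train_hierarchy(rows):
    """Train one parent classifier plus one subcategory classifier per parent."""
    parent_model = TinyNaiveBayes().fit([r[0] for r in rows], [r[1] for r in rows])
    sub_models = {}
    for parent in {r[1] for r in rows}:
        subset = [r for r in rows if r[1] == parent]
        sub_models[parent] = TinyNaiveBayes().fit(
            [r[0] for r in subset], [r[2] for r in subset]
        )
    return parent_model, sub_models


def classify(text, parent_model, sub_models):
    """Predict the parent first, then the subcategory conditioned on that parent."""
    parent = parent_model.predict(text)
    return parent, sub_models[parent].predict(text)


parent_model, sub_models = train_hierarchy(TRAIN)
print(classify("dairy free cashew cheese", parent_model, sub_models))
```

Conditioning the second stage on the first keeps each subcategory classifier small, since it only has to separate subcategories within one parent. The trade-off is that a wrong parent prediction always produces a wrong subcategory.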

The best ways to optimize these models are to collect a wide range of features from diverse sources and to refresh the training sample regularly to stay on top of drifting signals. Black-box architectures such as support vector machines, gradient-boosted decision trees, and deep learning models work well for this use case. Take the descriptive text shown as an example: we use NLP techniques to assign the parent category (or multiple categories), in this case Food.

The classification is probabilistic, and therefore not perfect, but we are able to quantify the error rate and improve it over time to minimize misclassification.

How do we know we’re any good?

As with all of our models, we constantly test performance. We use precision and recall to assess the effectiveness of the model. Precision is the fraction of all brands assigned to a category that truly belong to it; high precision means few false positives. Recall is the fraction of all brands that truly belong to a category that the model assigns to it; high recall means few false negatives.
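
To make the two definitions concrete, here is a minimal sketch that computes per-category precision and recall from true and predicted labels. The function name `precision_recall` and the sample labels are illustrative assumptions, not data from our models.

```python
def precision_recall(y_true, y_pred, category):
    """Return (precision, recall) for one category from paired label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == category and p == category for t, p in pairs)  # correctly tagged
    fp = sum(t != category and p == category for t, p in pairs)  # wrongly tagged
    fn = sum(t == category and p != category for t, p in pairs)  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical evaluation sample: true labels vs. model predictions.
y_true = ["Food", "Food", "Food", "Food", "Beverage", "Beverage", "Pets"]
y_pred = ["Food", "Food", "Food", "Beverage", "Food", "Food", "Pets"]

precision, recall = precision_recall(y_true, y_pred, "Food")
print(f"Food precision={precision:.2f} recall={recall:.2f}")
```

In this toy sample, five brands are tagged Food and three of those are correct, so precision is 3/5. Four brands truly are Food and three of them are found, so recall is 3/4.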

In the parent category ‘Food’, precision is 91% and recall is 96%. The model was evaluated on a random sample of brands representative of the whole universe of brands we track. As we add more training data for each category, these performance metrics continue to improve.

The system is evolving and will eventually support flexible, ad hoc classifications that accommodate the ever-changing landscape without sacrificing accuracy. Stay tuned.
