In my time at CircleUp I’ve seen tremendous transformation. I’ve seen the construction of our data pipelines from the ground up and I’ve watched us evolve into a data-obsessed organization – from ingesting megabytes to gigabytes to terabytes of data, from a single schema-on-write Postgres database to a schema-on-read data lake in S3.
CircleUp’s key asset and differentiator is our data. The quality of our data is directly tied to the success of Helio and the success of the investments and loans we make. The “garbage in, garbage out” maxim has never been more true.
Data quality begins as soon as data enters the system, and data ingestion is Helio’s foundation – a key technical capability. By “capability” I mean a defined, tested, and repeatable process that has stand-alone value for CircleUp and others. Defined implies it doesn’t change significantly over time, tested means it has been shown to work, and repeatable allows the capability to be applied again and again. We have built the capability to efficiently evaluate, test, ingest, transform, validate, and publish data at scale. If you don’t have a similar capability it may not be obvious why this is so difficult. It is. This is a serious technical process given our unique use case.
I’ll dive into three of the biggest data challenges we face at CircleUp and the technical approaches we have taken to architect a data pipeline to overcome these challenges:
1. Unstructured Data: CircleUp is an investment platform powered by technology. We focus on early-stage companies and we collect information about these companies before we’ve even spoken to them. The data we ingest is often highly unstructured, obtained publicly or through partnerships, and includes everything from distribution zip codes to reviews to price points to ingredient lists to images of products. We’ve found that 80% of the predictive power in CircleUp’s model comes from the attributes that are the most difficult to get: data from dispersed sources that are hardest to collect, resolve, and decompose.
Ingredient and nutrition data comes in a multitude of formats or, more often, with no structure at all.
Many companies self-generate the data they process. Uber, for example, creates a tremendous amount of data around ride distances, ride times, ratings, and pricing. But this is all data that is generated internally, which means they control the structure of the data at the source. It is clean and consistent. We don’t have the luxury of defining our data before we collect it.
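To make that contrast concrete, here is a minimal sketch of the kind of normalization a free-form ingredient declaration demands before it becomes usable. The function and sample string are hypothetical illustrations, not CircleUp's actual parser:

```python
import re

def parse_ingredients(raw):
    """Split a free-form ingredient declaration into normalized tokens.

    Flattens parenthetical sub-ingredients ("Cane Sugar (Sugar)") into
    the main list rather than discarding them.
    """
    # Replace parentheses with commas so nested sub-ingredients
    # become ordinary list items.
    flattened = re.sub(r"[()]", ",", raw)
    tokens = [t.strip().lower() for t in flattened.split(",")]
    return [t for t in tokens if t]

parse_ingredients("Water, Organic Cane Sugar (Sugar), Natural Flavors")
# -> ['water', 'organic cane sugar', 'sugar', 'natural flavors']
```

Real-world declarations are far messier (percentages, "contains less than 2% of" preambles, multiple languages), which is exactly why unstructured ingestion is hard.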
2. Dispersed Data: To create a complete knowledge graph of the early-stage CPG landscape we are gathering information from hundreds of different sources. This implies that we’d need hundreds of unique schemas to process the data, but we’ve found a way around that by integrating numerous sources into a cross-source data schema. That said, there is a lack of standardization across sources (illustrative source 1 may have product dimensions while illustrative source 2 has ingredients), and we also see a lack of standardization within sources.
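A cross-source schema can be sketched as a per-source column mapping into one canonical set of columns. The source names, mappings, and columns below are illustrative assumptions, not our actual schema:

```python
import pandas as pd

# Hypothetical per-source column mappings into one cross-source schema.
SOURCE_MAPPINGS = {
    "source_1": {"prod_name": "product_name", "dims": "dimensions"},
    "source_2": {"item": "product_name", "ingredient_list": "ingredients"},
}
CANONICAL_COLUMNS = ["product_name", "dimensions", "ingredients"]

def to_canonical(df, source):
    """Rename a source's columns to the shared schema and add any
    canonical columns the source doesn't provide (filled with NA)."""
    renamed = df.rename(columns=SOURCE_MAPPINGS[source])
    return renamed.reindex(columns=CANONICAL_COLUMNS)

# Source 1 has dimensions but no ingredients; source 2 is the reverse.
df1 = pd.DataFrame({"prod_name": ["Cold Brew"], "dims": ["12oz can"]})
df2 = pd.DataFrame({"item": ["Cold Brew"], "ingredient_list": ["coffee, water"]})
combined = pd.concat([to_canonical(df1, "source_1"),
                      to_canonical(df2, "source_2")], ignore_index=True)
```

The point of the design is that adding a new source means adding one mapping entry, not a new schema.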
We have developed an entity resolution technology to map each feature to the right product and each product to the right brand. Entity resolution relies on machine learning models built in Python using building blocks from Pandas and scikit-learn. Since the last time we wrote about entity resolution, we have begun to test a deep learning recurrent neural network that will further help improve the precision of our entity resolution system.
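The post doesn't spell out the model internals, but the general shape of a pairwise entity-resolution classifier built from scikit-learn building blocks looks roughly like this; the features, records, and labels are all hypothetical:

```python
from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression

def pair_features(a, b):
    """Similarity features for a candidate (record, record) pair."""
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    same_brand = float(a["brand"].lower() == b["brand"].lower())
    return [name_sim, same_brand]

# Tiny labeled sample: 1 = same product, 0 = different product.
pairs = [
    ({"name": "Cold Brew Coffee", "brand": "Acme"},
     {"name": "Cold-Brew Coffee", "brand": "Acme"}, 1),
    ({"name": "Sparkling Water", "brand": "Bubbly"},
     {"name": "Sparkling  Water", "brand": "Bubbly"}, 1),
    ({"name": "Cold Brew Coffee", "brand": "Acme"},
     {"name": "Kombucha", "brand": "Other"}, 0),
    ({"name": "Protein Bar", "brand": "Acme"},
     {"name": "Sparkling Water", "brand": "Bubbly"}, 0),
]
X = [pair_features(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]
model = LogisticRegression().fit(X, y)
```

In production the feature set, candidate generation, and training data are far richer; the sketch only shows the pairwise-classification framing.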
We’ve found that no single data source is particularly valuable in isolation, so we rely on our aggregation of these dispersed sources to create a valuable data mosaic. Building a robust infrastructure has been key for us to scalably aggregate and resolve all of these data sources.
3. Continuously Changing Data: Early-stage companies are always changing and growing, which is what makes for an exciting investing landscape. To capture this growth in real time, we regularly ingest new data about each brand. Additionally, our data sources regularly change how information is organized. Our challenge is to develop a system that is flexible enough to endure such changes without needing to react to each and every change thrown our way.
Our data pipelines are built on top of Amazon’s AWS ecosystem, and we ingest our data directly into an S3-backed data lake. To efficiently process all of our datasets, we leverage the distributed computation provided by PySpark, the Python API to Apache Spark. Spark is an open-source, unified analytics engine that is optimized for large scale data processing, and we take advantage of its in-memory computation, parallelization, and fault tolerance capabilities.
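One common way to keep an S3-backed lake navigable at scale is a partitioned key layout, so that Spark jobs can prune by source and date. The layout below is a hypothetical illustration, not our actual bucket structure:

```python
from datetime import date

def lake_key(source, dataset, ingest_date=None, fmt="parquet"):
    """Build a partitioned S3 key for a raw dataset landing in the lake.

    Hypothetical layout: raw/source=<source>/dataset=<dataset>/ingest_date=<date>/
    """
    ingest_date = ingest_date or date.today()
    return (f"raw/source={source}/dataset={dataset}/"
            f"ingest_date={ingest_date:%Y-%m-%d}/part-000.{fmt}")

lake_key("retail_feed", "products", date(2019, 1, 15))
# -> 'raw/source=retail_feed/dataset=products/ingest_date=2019-01-15/part-000.parquet'
```

Keys in this Hive-style `key=value` form let query engines skip whole partitions instead of scanning the lake.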
Airflow has been a critical part of our visibility into the state of our data pipelines and is particularly well-suited to handle challenges of complex dependencies, resilience, and monitoring that arise with the extraction and transformation of large amounts of raw data from disparate sources. Throughout the system, our workloads are designed to be fault tolerant and handle resource availability and inter-task dependencies.
Airflow gives us the ability to schedule and monitor our data pipelines, which are often a series of interdependent Spark jobs run on EC2 clusters.
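Conceptually, what Airflow adds on top of plain scheduling is dependency resolution across tasks. As a tiny stand-in (hypothetical task names, not an actual Airflow DAG), Python's `graphlib` shows how a chain of interdependent jobs resolves into a valid execution order:

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the tasks it depends on.
deps = {
    "transform": {"extract"},
    "validate": {"transform"},
    "publish": {"validate"},
}
order = list(TopologicalSorter(deps).static_order())
# -> ['extract', 'transform', 'validate', 'publish']
```

Airflow does this at much larger scale, plus retries, monitoring, and resource management, which is why we lean on it rather than hand-rolled ordering.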
The Data Lake
We have built an infrastructure to scalably onboard and maintain unstructured data from hundreds of constantly evolving data sources. The final piece of the puzzle lies in how we store and maintain our data. This requires standardized schemas that are also dynamic, which seems like an oxymoron. Not easy, but that is the point.
We have evolved from a schema-on-write data warehouse structure (left) to a schema-on-read data lake structure (right). Schema-on-write means you first define your schema, then write the data, then read the data based on your established schema. Data is organized based on predetermined rules, so all the green data points settle in the left-most column. If the data doesn’t fit those rules, it is treated as irrelevant, as represented by the black dots left without a designated column. Schema-on-read means you put all of the data into an unstructured “lake” and only organize it later, upon reading.
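A toy illustration of the schema-on-read idea: raw records land in the lake untouched, and a consumer projects them onto whatever fields it needs at read time. The records and field names here are made up:

```python
import json

# Raw records land in the lake untouched, whatever their shape.
raw_lake = [
    '{"name": "Cold Brew", "dims": "12oz"}',
    '{"name": "Kombucha", "ingredients": ["tea", "scoby"], "extra": 1}',
]

def read_with_schema(lake, fields):
    """Schema-on-read: project each raw record onto the fields a
    consumer asks for, at read time. Fields a record lacks come back
    as None; fields we didn't ask for stay untouched in the lake."""
    rows = []
    for line in lake:
        record = json.loads(line)
        rows.append({f: record.get(f) for f in fields})
    return rows

read_with_schema(raw_lake, ["name", "ingredients"])
```

Nothing is thrown away at write time, so a field we ignore today (like `"extra"`) is still there when we decide we care about it.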
We see several benefits to standardized, dynamic schema-on-read for our use case:
- In contrast to an “everything for everyone” solution, we are able to create unique outputs (sometimes called “lakeshore marts”) for different users. At CircleUp, both business teams and data scientists are data consumers, and the two groups value very different things in the data. Schema-on-read lets us account for this with distinct outputs, which aren’t necessarily mutually exclusive.
- As yet-to-be-defined data sources enter our system, represented by the white dots on the right, they are able to flow into the data lake in whatever raw format they happen to arrive in at that moment, and we can extract the data into a more specific structure at a later point in time. We don’t have to discard unrecognizable data, which could be invaluable in the future. Given that our data sources change attributes and the attributes we care most about often evolve, the flexibility to work with unrecognizable data is crucial for us.
- Finally, the standardized schema allows us to very easily join data across sources and over time, which is critical for us to be able to efficiently make use of disparate data sources at scale.
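For instance, once two sources share a standardized key, combining them is a single join. The key and columns below are illustrative, not our actual schema:

```python
import pandas as pd

# Two sources already mapped to a shared key, "product_id".
dims = pd.DataFrame({"product_id": [1, 2], "dimensions": ["12oz", "16oz"]})
ingr = pd.DataFrame({"product_id": [1, 2], "ingredients": ["coffee, water", "tea"]})

# The standardized key makes the cross-source join a one-liner;
# an outer join keeps products that appear in only one source.
mosaic = dims.merge(ingr, on="product_id", how="outer")
```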
At CircleUp, we have developed data ingestion capabilities that allow us to efficiently collect and process highly unstructured, disparate, and continuously evolving data sources. We have built a robust and scalable data infrastructure upon which we aggregate and piece together those hundreds of sources. Our data pipelines are key enablers that allow our engineering team to continually feed high-quality data into Helio.