The Grace Hopper Celebration is an annual three-day event celebrating women in technology. It is a really big deal: the world’s largest gathering of women technologists, and it sells out in a matter of minutes. It presents topics including FinTech, ML, VR, IoT, HCI, and hardware, and explores the ways in which women fit into these industries.
CircleUp’s Mal Sridhar will be presenting this year (September 2018) and we couldn’t be prouder. But rather than letting the cat out of the bag on Mal’s presentation, we decided to collect proposed topics from a handful of other women at CircleUp. We have compiled the overviews below.
The intention of this piece is to showcase and celebrate CircleUp’s thought leadership from women technologists and to illustrate our obsession with all things data. These overviews just scratch the surface on topics that we are collectively very passionate about.
If CircleUp had a Women in Technology conference, the speaker series might look something like this:
Raquel, Kelsey, Mal, and Ashlee (left to right) are just a few of the women tackling key technological challenges across CircleUp
1. WHAT IS MY MODEL REALLY TELLING ME?
Sana Ghazi, Data Scientist
The data science domain is no longer restricted to expert mathematicians and computer scientists who have spent years learning the intricacies of non-linear transformations and convex optimization. Instead, software packages like H2O enable users with minimal statistical backgrounds to build neural networks with the click of a button.
Despite the undeniable benefits of software facilitating innovation in data analysis, it is important to understand the consequences. There are pitfalls that result from over-simplified technology in three distinct stages: data collection, methodology and interpretation.
Pitfall #1: Data
- Unreliable Data – Without randomly collected data it is impossible to obtain a true representation of the population we are trying to understand. As a result, patterns we think we are seeing may be completely baseless. Polls reported during the 2016 presidential election are a great example of this. Supporters of both parties inadvertently conducted surveys that predominantly sampled individuals who supported their candidate.
- Dissimilar and Dependent Data – Most machine learning models, Naïve Bayes being a classic example, make the strict assumption that the data is independent and identically distributed (IID). This can be a dangerous assumption when observations are correlated with one another or drawn from different populations.
Pitfall #2: Unreliable Methodology
- Manipulation of Missing Data – Incorrectly deleting missing data may result in dropping important signals crucial to the model-building process. For example, assume an organization sends out 500 emails asking users if they like its service and 100 respond, all of them positively. Disregarding the 400 data points with a missing response variable may lead the organization to conclude that it has a 100% success rate, whereas the reality could be as low as 20% (100 of 500) if none of the non-respondents like the service.
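To make the arithmetic in the email example concrete, here is a minimal sketch. The function names are ours and purely illustrative, and the bounds assume nothing at all about how the non-respondents feel.

```python
def naive_rate(positive, negative):
    """Success rate computed only from people who responded."""
    return positive / (positive + negative)


def rate_bounds(sent, positive, negative):
    """Worst-case and best-case success rate once non-respondents are counted."""
    missing = sent - positive - negative
    lower = positive / sent                # every non-respondent dislikes the service
    upper = (positive + missing) / sent    # every non-respondent likes the service
    return lower, upper


# Numbers from the example: 500 emails sent, 100 positive replies, 0 negative replies.
naive = naive_rate(positive=100, negative=0)
lower, upper = rate_bounds(sent=500, positive=100, negative=0)

print(f"naive estimate: {naive:.0%}")
print(f"possible range: {lower:.0%} to {upper:.0%}")
```

Reporting the range alongside the naive estimate makes it obvious how much of the conclusion rests on the 400 missing responses.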
Pitfall #3: Unreliable Conclusions
- Imbalanced Data, Recall & Precision – Recall and precision are widely accepted evaluation metrics for classification models. They help assess how much ‘cover’ a particular model has over the problem space and how rigorous its predictions are. However, relying on them alone is almost never enough, and they can be particularly misleading with imbalanced data. Assume we build a model that predicts whether an individual has cancer, in a population where the cancer-free rate is 98%. Imagine the model makes predictions for 100 individuals, of whom 2 actually have cancer, and reports a recall of 100% and a precision of 98% with ‘cancer-free’ treated as the positive class. At first glance this seems incredible. However, a closer look may reveal that the model predicted all 100 individuals to be cancer-free: its recall for the cancer class is 0%, so it is completely ineffective at detecting cancer patients.
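The cancer example can be reproduced in a few lines. This is a sketch under one stated assumption: the 100% recall and 98% precision quoted above are measured with ‘cancer-free’ as the positive class. The helper function and labels below are ours, written in plain Python rather than taken from any particular library.

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall for the label treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Precision is undefined (None) when the model never predicts the positive label.
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall


# 100 individuals, 2 of whom actually have cancer.
y_true = ["cancer"] * 2 + ["cancer-free"] * 98
# A useless model that predicts everyone to be cancer-free.
y_pred = ["cancer-free"] * 100

free_precision, free_recall = precision_recall(y_true, y_pred, positive="cancer-free")
cancer_precision, cancer_recall = precision_recall(y_true, y_pred, positive="cancer")

print("cancer-free as positive:", free_precision, free_recall)
print("cancer as positive:", cancer_precision, cancer_recall)
```

Computing the metrics once per class is a cheap habit that exposes this failure immediately: the numbers for the majority class look excellent while the minority class is never detected.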
At CircleUp we are dedicated to reliable data, methodology, and robust evaluation metrics to ensure that our models output insights that are correct and actionable.
2. MAINTAINING HIGH DATA QUALITY ACROSS A DATA-DRIVEN ORGANIZATION
Kelsey Tripp, Engineer
What does data quality actually mean, and how can we obtain it? Like oil, data in its raw form is not very useful. Through proper drilling and refinement, oil is transformed into kerosene, gasoline, and other downstream consumables. Similarly, data must be refined, cleansed, standardized, validated, and monitored before useful byproducts can be extracted. We fill our cars with high quality gasoline so that we can keep our engines running cleanly for years to come. Data must be treated in the same way; the presence of data in itself isn’t always positive, and, in fact, poor quality data can actually lead to future breakdowns. It is through a clearly defined framework of processes and ongoing maintenance that we can obtain high data quality.
While organizations will see high-impact benefits from a commitment to maintaining high data quality, the cost of poor-quality data may be even greater. Beyond costing businesses an average of $9.7 million per year, poor data quality can lead to reputational damage, missed opportunities, and undermined confidence in decision making within an organization (Pitney Bowes 2017).
Good-quality data has several components: accuracy, timeliness, completeness, standardization, and authoritativeness. To understand each of these pieces, we will apply them to an example database used by an organization to track active and potential customer addresses.
- Accuracy refers to the data being correct. In our example, accurate data means that the addresses stored in the database guarantee mail delivery.
- Timeliness refers to the data being refreshed at a cadence that ensures the data remains accurate. For example, a timely data refresh will ensure that a customer’s change of address is reflected in time for the next mailing to be delivered correctly.
- Completeness refers to the presence or absence of data, or how well the data that you wish to capture is populated. If an organization is missing mailing addresses for half of its potential customers, it may be missing out on a whole segment of business or draw inaccurate conclusions in identifying its highest-value customers.
- Standardization refers to the ability to compare diverse data sets. For example, customers in the United States may have addresses structured differently from those in another country, such as Japan. The ability to standardize the input addresses and understand their differences without any loss of data is important, and also helps to identify duplicate records. One real-world implication of duplicate mailing addresses is an increase in postage costs.
- Authoritativeness refers to a source’s credibility in providing accurate and complete data. If our example organization obtains a list of customer addresses from ten years ago, the data will likely no longer be accurate and will almost certainly not be complete, as new neighborhoods have been constructed since the original data was generated.
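Some of these components lend themselves to simple automated checks. The sketch below runs three of them (completeness, standardization for duplicate detection, and timeliness) against a tiny, made-up address table. The records, field names, normalization rules, and the one-year freshness threshold are all illustrative assumptions, not a description of any real pipeline. Accuracy and authoritativeness generally need an external reference, so they are not covered here.

```python
from datetime import date

# Hypothetical customer address records, invented for illustration.
records = [
    {"customer": "A", "address": "12 Main St.", "updated": date(2018, 6, 1)},
    {"customer": "B", "address": "12  main st", "updated": date(2018, 5, 15)},
    {"customer": "C", "address": None, "updated": date(2018, 7, 1)},
    {"customer": "D", "address": "9 Oak Ave", "updated": date(2008, 1, 1)},
]


def normalize(address):
    """Crude standardization: lowercase, drop periods and commas, collapse spaces."""
    cleaned = address.lower().replace(".", "").replace(",", "")
    return " ".join(cleaned.split())


def completeness(rows):
    """Share of records that have an address at all."""
    return sum(1 for r in rows if r["address"]) / len(rows)


def duplicate_groups(rows):
    """Customers that share the same address once it has been standardized."""
    groups = {}
    for r in rows:
        if r["address"]:
            groups.setdefault(normalize(r["address"]), []).append(r["customer"])
    return {addr: custs for addr, custs in groups.items() if len(custs) > 1}


def stale(rows, as_of, max_age_days=365):
    """Customers whose record has not been refreshed within max_age_days."""
    return [r["customer"] for r in rows if (as_of - r["updated"]).days > max_age_days]


print("completeness:", completeness(records))
print("duplicates:", duplicate_groups(records))
print("stale:", stale(records, as_of=date(2018, 9, 1)))
```

Real address standardization is far harder than this (abbreviations, international formats, unit numbers), but even a crude normalizer shows why duplicates stay hidden until the data is brought into a common form.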
Data is our bread and butter at CircleUp and we are constantly iterating on ways to ensure our data is the best it can be. We believe it is critical for an organization to constantly evaluate the quality of its data, and identify and address issues when they arise.
3. DATA VALUATION: WHY & HOW
Anjali Samani, Data Scientist
Data has become a key input for driving growth, enabling businesses to differentiate themselves and maintain a competitive edge. Increasingly, academics and business leaders are grappling with the why and how of data valuation and debating whether it is even possible to value such an amorphous asset. Many alternatives are available, but none are generally accepted or completely satisfactory. There are three basic reasons organizations want a good way to understand the value of their data:
Direct Data Monetization
Many organizations are keen to monetize data directly by selling it to third parties or marketing data products. An inability to understand data’s value can result in mispriced products, loss of competitive advantage, and devaluation of the data itself. Understanding the impact of monetization on the value of a company’s data can help guide the decision on whether to pursue explicit monetization.
Internal Investment
Understanding the value of both current and potential data can help prioritize and direct your investments in data and systems. Surveys report that only about 30% to 50% of data warehousing projects are successful at delivering value. Understanding how data drives business value can help you understand where you should be minimizing costs, and where you should be investing to realize potential ROI.
Mergers & Acquisitions
Inaccurate valuation of data assets can be costly to shareholders during mergers and acquisitions (M&A). Steve Todd, an EMC fellow, argues that data valuations can be used both to negotiate better terms for IPOs, M&As, and bankruptcy, and to improve transparency and communication with shareholders. The assumption that data’s value is captured only by sales and revenue figures may understate the overall value of a transaction, to the benefit of the buyer and to the detriment of the seller.
Current generally accepted accounting principles (GAAP) do not permit data to be capitalized on the balance sheet. This leads to considerable disparity between the book value and market value of data-rich companies, and a possible mispricing of valuation premiums.
At CircleUp, we are building a data moat in the realm of early-stage Consumer & Retail businesses. Without the data we obtain (public, partnership, and practitioner) there is very little we could accomplish as a Data Science team. Our core asset is in the data.
4. BUILDING MACHINE LEARNING PIPELINES TO SOLVE ENTITY RESOLUTION
Raquel Rodriguez, Machine Learning Engineer
We love the topic of Entity Resolution so much that it has already been written about in this blog post. Dig in.
A bit about Admiral Grace Hopper: Grace was a Rear Admiral in the US Navy and an accomplished computer scientist who paved the way for modern programming languages by developing the first compiler in 1952. She is also believed to have popularized the term “bug” after her team found a large moth inside Harvard’s Mark II computer, performing the first known case of “debugging” when the insect was removed from the equipment. Grace is a role model to many female technologists.
CircleUp will have a presence at the Grace Hopper Celebration in Houston this September. Find us there if you’d like to continue the conversations – we’d be excited to dig deeper into all of these topics.