Careful: Your big data analytics may be polluted by data scientist bias

Expectations surrounding the future of big data range from the merely huge to the absolutely enormous, a reflection perhaps of both its real potential and all the hype. There is no dispute, though, that companies can reap big benefits from exploring patterns in the data they already generate and collect. Further, depending on the algorithms used, machine learning can even serve as a real-world crystal ball. Two striking examples: Target’s ability to predict pregnancies by analyzing customer purchasing patterns, and statistician Nate Silver’s correct call of the winner in all 50 states in last November’s presidential election.

But the fact remains that big data can only ever be as good as the machine learning that is used to provide insight, and even the most sophisticated machine learning techniques aren’t omniscient – the old adage “garbage in, garbage out” sums up this dilemma perfectly. Businesses planning to invest in big data science, with the hopes of reaping the potential wealth of insights available, must at all costs avoid introducing bias into the process – or risk jeopardizing everything.

Data bias syndrome

Data bias comes in many forms. It can come from poorly defined business objectives, or from opting to gather data that are easy to collect rather than data that are most informative. Data scientists can also receive data that have already been skewed by the domain experts’ incorrect assumptions. (As a footnote, the recent austerity economics Excel scandal shows how a minute data error can have cascading and devastating effects.)

Likewise, data scientists themselves are not immune to bias. Some run afoul of their own preconceived notions about the business domain: too much knowledge can cause one to filter out data that may actually be helpful. Scientists with deep experience in a particular data set may rely too heavily on pre-existing algorithms without re-examining their validity for a particular use case.

Finally, data quantity is a common problem. Intelligent learning requires abundant data, and often the data available are not sufficient to draw accurate conclusions – a problem known as data sparsity. This may sound unbelievable considering that data volume is doubling every two years according to an EMC study, but there’s a difference between a dense data set populated by similar data points, and the far more diverse sets of user data points we find in the real world. In these cases, the gaps in the data are filled by machine learning algorithms that may inherently be biased, based on assumptions made by the data scientist when designing the algorithm. The trick is to find the right balance between unbiased data exploration and data exploitation.
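To make the sparsity point concrete, here is a minimal Python sketch. Everything in it is hypothetical: the ratings, the user and item names, and the choice to fill gaps with the global average are assumptions made for illustration, not a description of any particular system. It measures how densely a small user-by-item table is populated, then shows how one gap-filling assumption shifts the estimate for a thinly observed item.

```python
from statistics import mean

# Hypothetical observed ratings: (user, item) -> rating on a 1-5 scale.
ratings = {
    ("u1", "a"): 5, ("u1", "b"): 4,
    ("u2", "a"): 4,
    ("u3", "c"): 1,
}
users = sorted({user for user, _ in ratings})
items = sorted({item for _, item in ratings})


def density(ratings, users, items):
    """Fraction of the user-by-item grid that actually holds an observation."""
    return len(ratings) / (len(users) * len(items))


def item_mean(ratings, item, users, fill=None):
    """Average rating for an item; missing users are skipped unless `fill` is given."""
    values = []
    for user in users:
        if (user, item) in ratings:
            values.append(ratings[(user, item)])
        elif fill is not None:
            values.append(fill)  # the assumption that fills the gap
    return mean(values)


global_mean = mean(ratings.values())
observed_c = item_mean(ratings, "c", users)
filled_c = item_mean(ratings, "c", users, fill=global_mean)

print(f"density: {density(ratings, users, items):.2f}")
print(f"item c, observed only: {observed_c:.2f}")
print(f"item c, gaps filled with global mean: {filled_c:.2f}")
```

The single real rating for item "c" is low, but once the gaps are filled with the global average the item looks much closer to typical. Neither number is the truth; the difference between them is the size of the assumption the data scientist made.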

Removing bias

As companies bring data science in-house or purchase tools that act as a data abstraction layer, the need to address data bias becomes more immediate. The smart move is to build bias-quelling tactics into the data science process itself. Here’s how:

Employ domain experts. Rely on them to help select relevant data and explore which features, inputs and outputs produce the best results. If heuristics are used to gain insights into smaller data sets, the data scientist should work with the domain expert to test the heuristics and ensure they actually produce better results. Like a pitcher and catcher in a baseball game, they are on the same team, with the same goal, but each brings a different skill set to a complementary role.
Look for white spaces. Data scientists who work with one data set for long periods risk complacency, which makes it easier to introduce bias that reinforces preconceived notions. Don’t settle for what you have; instead, look for the “white spaces” in your data sets and search for alternate sources to supplement sparse data.
Open a feedback loop. This will help data scientists react to changing business requirements with modified models that can be accurately applied to the new business conditions. Applying Lean Startup-like continuous delivery methodologies to your big data approach will help you keep your model fresh.
Encourage your data scientists to explore. If you can afford your own team of data scientists, be sure they have the space and autonomy to explore freely. Some equate big data to the solar system, so get out there and explore this uncharted universe!

Whatever you do, don’t ignore the issue: The last thing you want to do is implement a system that develops and propagates data, only to learn it’s hopelessly biased. If you don’t solve this problem sooner rather than later, your organization will miss out on what many analysts are calling the next frontier for innovation.

Haowen Chan is currently a principal scientist at Baynote, a provider of personalization solutions for online retailers. Robin D. Morris is a senior data scientist at Baynote; he is also associate adjunct professor in the Department of Applied Math and Statistics at the University of California, Santa Cruz.

