The Story of Big Data and Data Quality

A first in a series of several conversations about Big Data. Bigger is Better. There is a big buzz regarding big data and big data analytics.  Wikipedia defines big data as a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools.  Most discussions are focused on the power and the glamour of big data, with an emphasis on the tools and technologies to process, aggregate, or store these data. As a result, there is almost no discussion about the quality of the data or the process of separating relevant data from irrelevant data.  It is true that we generate data at an alarming rate (an exponential growth curve)[1], but it is also true that not all data is of equal value.  There is a lot of noise in these big datasets.  Therefore, it is important to understand the extent to which this noise exists and to be able to assess and report on it when presenting results, whether the data are small or big.

I have worked with big data since before “big data” was the norm. Very early on I realized that if big data are to be useful, effort has to be dedicated to cleaning the data continuously.  As director of research and evaluation for the then Department of Mental Health, Mental Retardation, and Substance Abuse Services (now the Virginia Department of Behavioral Health and Developmental Services), we reviewed, analyzed, and reported on patient-level data that providers sent to the department monthly.  We set a threshold for missing data at 10%.  It was amazing how quickly the data were cleaned once providers realized that the data were not just going into a black hole but were actually being reviewed and used to inform decisions.  After a year of focusing on data quality, providers began receiving monthly performance report cards for a set of agreed-upon measures.  This started a dialogue between the state and the providers that was based on data, not on impressions of data.  This was, and is, a step in the right direction.  We successfully replicated this process at the Connecticut Department of Mental Health and Addiction Services.  For big data to be used meaningfully, the same simple but rigorous process is needed.  Explaining the context and quality of the data is an essential backstory when presenting results.
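The 10% threshold described above can be expressed as a simple automated check. Here is a minimal sketch in Python; the provider names, record fields, and the `flag_providers` helper are hypothetical illustrations, not the department's actual system:

```python
# Sketch: flag data feeds whose missing-data rate exceeds a threshold.
# All names and records below are hypothetical, for illustration only.

MISSING_THRESHOLD = 0.10  # the 10% cutoff described above

def missing_rate(records, field):
    """Fraction of records where `field` is absent or empty."""
    if not records:
        return 0.0
    missing = sum(1 for r in records if r.get(field) in (None, ""))
    return missing / len(records)

def flag_providers(submissions, required_fields):
    """Return, per provider, any required field over the threshold."""
    flagged = {}
    for provider, records in submissions.items():
        rates = {f: missing_rate(records, f) for f in required_fields}
        over = {f: r for f, r in rates.items() if r > MISSING_THRESHOLD}
        if over:
            flagged[provider] = over
    return flagged

submissions = {
    "provider_a": [{"dob": "1980-01-01", "diagnosis": "F32"},
                   {"dob": None, "diagnosis": "F41"}],
    "provider_b": [{"dob": "1975-06-30", "diagnosis": "F20"}],
}
print(flag_providers(submissions, ["dob", "diagnosis"]))
# → {'provider_a': {'dob': 0.5}}  (half of provider_a's dob values missing)
```

The point of the sketch is the feedback loop, not the arithmetic: a report like this, returned to each provider monthly, is what made the data stop disappearing into a black hole.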

Usually the people who manage, create, and analyze big data are not close to the data collection; as a result, they sometimes fail to appreciate its significance and do not discuss the noise in the data that comes from missing data or invalid data (values that do not conform to the value set). For example, say we are looking at a measure of improved employment. The headline is that, as a result of an intervention, 90% of the people who participated in the program were employed at its completion, up from a baseline employment rate of 50%.  This is great news.  But as you dig a little deeper, you realize that these results are based on only half of the participants, because only 50% reported meaningful data at both time points (a baseline and a second point are needed to measure change).  What happened to the other 50%, and what should you do with these results?  This dilemma faces many analysts, yet very rarely do we see footnotes that report the amount of missing data or discuss the results in light of it.
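The two-point problem above can be made concrete with a toy computation. This is a minimal sketch in Python; the participant records are invented for illustration and do not reproduce the example's exact figures:

```python
# Hypothetical participant records: employment status at baseline and
# at program completion; None marks a missing value.
participants = [
    {"id": 1, "baseline": False, "completion": True},
    {"id": 2, "baseline": True,  "completion": True},
    {"id": 3, "baseline": False, "completion": None},   # lost at completion
    {"id": 4, "baseline": None,  "completion": True},   # no baseline recorded
]

# Change can only be measured for "complete cases": both points present.
complete = [p for p in participants
            if p["baseline"] is not None and p["completion"] is not None]

coverage = len(complete) / len(participants)
employed_at_completion = sum(p["completion"] for p in complete) / len(complete)

# The headline rate is misleading without the coverage beside it.
print(f"two-point coverage: {coverage:.0%}")                  # 50%
print(f"employed at completion (complete cases): {employed_at_completion:.0%}")  # 100%
```

Reporting the coverage figure next to the headline rate is exactly the kind of footnote the paragraph above argues is missing from most analyses.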

Another challenge of big data is the noise generated when merging datasets that were collected for different purposes and that use different value sets and terminologies. Take, for example, data coming from a patient receiving home health care.  Data will come from the care delivery team, from the patient, and from the medical devices in use. Let’s focus on a small sliver of this interdependent data: readings from a medical device that dispenses medicine based on another device that records blood sugar levels.  Now imagine that the clocks on these devices are not synced, and no one noticed that small but significant detail.  We now have data with mismatched time stamps that will not make sense and will be difficult to explain later, when the data are analyzed away from the source of collection and without knowledge of the timing problem.
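The clock-sync problem is easy to demonstrate. Below is a hedged sketch in Python; the device streams, the seven-minute offset, and the `nearest` helper are all hypothetical, standing in for whatever pairing logic a real merge would use:

```python
from datetime import datetime, timedelta

# Two hypothetical device streams: glucose readings and medicine
# dispense events, to be paired by nearest timestamp when merged.
def nearest(event_time, candidates):
    """Return the candidate timestamp closest to event_time."""
    return min(candidates, key=lambda t: abs(t - event_time))

glucose_times = [datetime(2024, 1, 1, 8, 0), datetime(2024, 1, 1, 12, 0)]
skew = timedelta(minutes=7)  # the unnoticed clock offset on the dispenser
dispense_times = [t + skew for t in glucose_times]

# A basic sanity check before merging: measure the offset between
# paired events and flag anything beyond a plausible tolerance.
tolerance = timedelta(minutes=1)
offsets = [nearest(g, dispense_times) - g for g in glucose_times]
skew_detected = any(abs(o) > tolerance for o in offsets)
print("clock skew detected:", skew_detected)  # → True with the 7-minute offset
```

A check like this only works at the source, where both streams are still visible side by side; once the merged table leaves the point of collection, the seven minutes become unexplainable noise.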

My point today is one of caution. When we claim that big data will solve our healthcare problems, we need to focus on the quality of the data much more than we do today. Big data are only as good as the humans who collect them and the devices that record them. Above all, we will have to bring the same disciplined focus to data quality that we bring to assimilating data and building tools that can handle it.