Open Space 1: Data Quality in High Stakes Decisions

Participants: Vojtech Sedlak, Steve Oldridge, Theo Rosenfeld, Mark Twohig, Milan Veverka

How Can We Ensure We Are Not Compromising the Value of Data in High Stakes Decisions?

This post is a summary of the Open Space discussion at the November 17th Data Leadership Summit.

Our conversation revolved around a bit of wisdom that Milan Veverka offered (paraphrasing Rita Mae Brown, I think): good choices are born of good judgment, and good judgment is born of bad choices. By sharing failures, we found common threads we could turn into general advice. As our nascent field matures, we encourage today’s data leaders to collect this sort of advice and pass it on as best practices so that tomorrow’s leaders can enjoy even bigger and better failures.

A few words about data quality: there is not nearly enough discourse on this topic yet. We (and our clients) need to understand that data always represents the world imperfectly. We hope to learn about the world through patterns in our data, but to do so we first need to understand its imperfections. This is a Sisyphean task of trying to learn what we don’t know about what we don’t know. It’s a good thing this field attracts people who like challenging puzzles.

Here is a distillation of the advice we shared:

  1. Data quality is all about prevention, and prevention often feels like a waste of time until it’s too late. The level of effort should match the value of the project – quick and dirty is sometimes appropriate, but should never be the default mode.
  2. Each piece of the pipeline should generate an auditable record of what it does to the data. Where this is not possible, manual checks should be instituted. (A minimal sketch of an auditable step appears after the examples below.)
  3. At every point where a human intervenes there should be a checklist. It can be small and simple, but it should exist. Even the best of us have bad days, and one day the task will be done by someone who didn’t live through yesterday’s catastrophes.
  4. Always use multiple metrics. Always. Many of our stories involved a major issue escaping attention because it didn’t look bad on one metric, or an unnecessary freak-out triggered by the incidental behaviour of a single metric. (A sketch of a simple multi-metric check follows the examples below.)
    For example:

    - Some servers went down, but the flow of records was still within tolerance levels, so a specific customer segment lost service without the company noticing.
    - The proportion of records with warning messages was small compared to the total dataset, so they were discarded before analysis. However, they were all discarded due to the same problematic edge case, and they all were part of the same (important) subset. Days were lost to skewed analysis.
    - A one-time cost levied on users caused a spike in revenue followed by a dramatic drop in week-over-week revenue. Some panic occurred before it was noted that every other metric was positive.
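
To make the multiple-metrics advice concrete, here is a minimal sketch of a health check that evaluates several metrics against tolerance bands at once. The metric names and thresholds are hypothetical placeholders, not anything a particular tool requires.

```python
# A minimal sketch of a multi-metric health check.
# Metric names and tolerance bands below are hypothetical examples.

def check_health(metrics: dict[str, float],
                 tolerances: dict[str, tuple[float, float]]) -> list[str]:
    """Return a warning for every metric that is missing or outside its band.

    Checking several metrics together catches failures that any single
    metric would hide (e.g. total record volume stays normal while one
    customer segment drops to zero).
    """
    warnings = []
    for name, (low, high) in tolerances.items():
        value = metrics.get(name)
        if value is None:
            warnings.append(f"{name}: metric missing")
        elif not (low <= value <= high):
            warnings.append(f"{name}: {value} outside [{low}, {high}]")
    return warnings


if __name__ == "__main__":
    # Hypothetical snapshot: total volume looks fine, but one segment is dark.
    snapshot = {"records_total": 98_000,
                "records_segment_enterprise": 0,
                "warning_rate": 0.004}
    bands = {"records_total": (90_000, 110_000),
             "records_segment_enterprise": (4_000, 6_000),
             "warning_rate": (0.0, 0.01)}
    for w in check_health(snapshot, bands):
        print("ALERT:", w)
```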

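And a similarly minimal sketch of an auditable pipeline step (point 2 above): a filter that records how many rows it received, kept, and dropped. The step name, record fields, and logging destination are assumptions; in practice the audit record would be persisted wherever your pipeline keeps its run metadata.

```python
# A minimal sketch of a pipeline step that emits an auditable record of
# what it did to the data. Field names and the filter condition are
# hypothetical examples.
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline.audit")


def drop_flagged_records(records: list[dict]) -> list[dict]:
    """Drop records carrying a warning flag and log an audit record."""
    kept = [r for r in records if not r.get("warning")]
    audit = {
        "step": "drop_flagged_records",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "rows_in": len(records),
        "rows_out": len(kept),
        "rows_dropped": len(records) - len(kept),
    }
    # Persist this alongside the output so the run can be audited later.
    log.info(json.dumps(audit))
    return kept
```
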
On a personal note, I’m thrilled to see that others at the Data Leadership Summit are passionate about data quality and the integrity of analysis. Sometimes I feel that the hype around data fosters an assumption that data is magically irrefutable, as if truth were established simply by presenting a graph. I look forward to seeing a more nuanced understanding of the relationship between data and truth emerge from our field.