Welcome back to the Data Connections blog! In my last blog entry, we looked at the hype around data and we learned that while data may share many characteristics with precious natural resources, it is in some ways better than those resources. In this entry, we’ll take on another tough data issue as we try to determine whether data is good or bad.
As a data nerd, I often hear from people who say data is good, in that it can be used “neutrally” for fact-based decision making that will make society better or fairer. At the same time, there are those who tell me that reliance on data is bad, in that it is not an unbiased decision maker and can too easily be used simply to reinforce unfair historical inequities. Who is right, who is wrong, and how can we know?
I know that I want data to be good. Like many people, I hold out hope that with enough data or with the right type of data we can help people to avoid disease and suffering; end discrimination and bias; and generally make our lives better.
When I was a young boy living in Portsmouth, Ohio, my doctor found a lump at the top of my spinal cord which he thought might be cancer. He could not tell by looking at it whether it was a serious issue or simply a benign lump of cartilage. He wanted me to go to a hospital in another, bigger city to have it removed and tested. I remember the ashen faces of my parents as they tried to explain all this to me in gentle, kid terms. It was a scary experience, and though it turned out okay, I think about this experience every time I see the scar on the back of my head.
What if doctors had a database of hundreds or thousands of images of cancerous and non-cancerous lumps (data)? What if they could use artificial intelligence to search that database to help determine whether a strange lump on a child’s head is cancer (or not) based on those images? And what if we could use this approach to avoid sending children to a scary hospital in the big city for surgery?* Doesn’t that potential for a better outcome mean data is, or at least can be, good?
Of course, data is not by itself inherently good or bad. It is true that we can project our biases onto the collection or use of data in such a way as to make it unsuitable for a particular use (and thus “bad” data). In addition, because there is no gold standard for the collection and use of data, there is currently no way to fully understand a dataset unless you collected and curated the data yourself. Fully understanding a dataset would mean knowing, for example: what method was used to collect the data; how current the data is; who decided what to include in the dataset and what to leave out, and how that was decided; and what use the person(s) collecting the data intended for it.
There is a lot in the above paragraph, so let’s break that down.
To start, if we don’t know or understand the criteria used to select data for a project, we can’t know if that data is right for the intended usage. We also cannot know whether we agree with how it was collected and selected for use. Even carefully collected and well-curated datasets might simply be the wrong, and therefore “bad,” data for a particular use.
Some of this may result from the bias of the data collector. Bias here refers to the choices made about the collection, curation, and use of the data. Those choices may be expressly or inherently influenced by bad or illegal intent, or they may simply be influenced by innocent things like the collectors’ life experiences. Either way, we cannot know the potential impact of these biases on the dataset, or whether it is good or bad for our usage, unless we understand more about the data and the process of collecting it.
For the same reason we need to see ingredients listed on the packages of food we purchase to ensure good results (e.g., high quality ingredients can mean better taste) and help avoid bad results (e.g., allergic reactions), we need to understand the “ingredients” of the data we use to help us to know if it is likely to give us good or bad results.
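One way to picture such a data “ingredient label” is a small machine-readable datasheet attached to the dataset. The sketch below is purely illustrative: the field names, values, and the idea of a required-fields check are my own assumptions, not any established standard.

```python
# A hypothetical "ingredient label" for a dataset. Every field name and
# value here is invented for illustration; no standard defines this format.
datasheet = {
    "collection_method": "door-to-door survey by trained volunteers",
    "collected": "2024-03",                      # how current is the data?
    "inclusion_criteria": "adults and children using shelter services",
    "known_exclusions": "people who declined to be surveyed",
    "intended_use": "planning public services for unhoused residents",
    "curators": ["Example Community Coalition"], # who decided what to include?
}

REQUIRED = ("collection_method", "collected", "inclusion_criteria", "intended_use")

def missing_fields(sheet, required=REQUIRED):
    """List the 'ingredients' a would-be user still needs to ask about."""
    return [field for field in required if not sheet.get(field)]

print(missing_fields(datasheet))  # → []
print(missing_fields({}))         # → all four required fields
```

A label like this would not make a dataset good, but it would let a prospective user decide whether the dataset is good *for their use*, which is the real question.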
As an example, if you wanted to use data to determine which public services are needed by the residents of a shelter for the unhoused in Columbus, Ohio you would want datasets that represent that unhoused population, that are recent/current, and that are collected with a rigor that truly represents a cross-section of the impacted children and adults.
Let’s say you decided to pull all population data you could find with geotags showing it was collected within a 5-mile radius of the center of Columbus. Even if you found the “perfect” dataset that met best practices for quality, completeness, and representation (within the group covered by the dataset) and that met your geographic proximity requirements, that dataset still might not be a good fit. To learn about the unhoused population of downtown Columbus, a dataset focused on the population of the upscale suburb of Bexley, Ohio (no more than a few miles from downtown Columbus in distance, but tens of thousands of dollars away in average income) would not be the right dataset. That is true notwithstanding geographic proximity or other similar characteristics. Thus, some labeling on the dataset to identify its “ingredients” is critical to knowing if you will get a good or a bad result from its use.
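To make the point concrete, here is a quick sketch of what that geotag filter actually does. Using the standard haversine formula and approximate coordinates (which I have rounded for illustration), Bexley lands comfortably inside a 5-mile radius of downtown Columbus, so a purely geographic filter would sweep its data in despite the demographic mismatch.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in miles."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 3958.8 * asin(sqrt(a))  # mean Earth radius ≈ 3958.8 miles

# Approximate coordinates, for illustration only
downtown_columbus = (39.9612, -82.9988)
bexley = (39.9690, -82.9376)

d = haversine_miles(*downtown_columbus, *bexley)
print(f"{d:.1f} miles")  # well inside a 5-mile geotag filter
```

The filter is doing exactly what it was asked to do; the problem is that “within 5 miles” was never the question we actually cared about.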
At present, there is no ingredient label on a dataset with “best used as” guidance. To compound this, most programmers and developers are working at a lightning pace to meet project deadlines with a “grab what you can and go fast” mentality in the ultra-competitive AI marketplace. A data developer might say this is not a problem: “I know I have a good dataset if all the columns and rows are completed and the data is formatted in a consistent manner” (e.g., numbers all are shown with three digits to the right of the decimal point). In fact, that type of quantitative information can probably be confirmed by visual scan. However, qualitative factors like inclusion of members of all impacted groups intended to be represented (e.g., the unhoused population of Columbus, Ohio) cannot be so easily verified, especially in very large datasets.
Is the quest for good (and not bad) data hopeless? Well, for many industries the International Organization for Standardization (ISO) has set standards for quantitative and even qualitative metrics that can be followed by members and relied on by the public to understand the quality of what they are getting. As one example, the pharmaceutical industry relies on ISO 9001 to ensure the use of high-quality raw materials, consistent and controlled manufacturing processes, quality control procedures, and complete documentation and record-keeping. The same is true for standards in other industries.
While the European Union’s Artificial Intelligence Act proposed the use of such standards to ensure the quality of data (and, eventually, the quality of the AI that uses this data), this has not yet been realized. Industry groups including the Data & Trust Alliance have started the process of defining data quality characteristics, but this work is far from complete and more needs to be done before we can truly know that we have good data.
I am not sure if we have (or could have) answered all the questions around the complex topic of how we can know if data is good or bad. But I do think we have learned how we can better use data for good, and that there is more work to be done to get there. In the next Data Connections blog, we’ll look at where data comes from and why, in the acquisition of data, we cannot solve data quality issues by contract terms alone. We’ll additionally explore the view that we can solve data quality issues through having more and more “messy” data – i.e., that massive quantities of raw data will override any anomalies in the data.
I hope you’ll be back for that as we continue to explore data connections.
If you have questions about your data and your legal compliance programs for data, Mortinger & Mortinger LLC can help! Contact me directly at: steve@mortingerlaw.com
Mortinger & Mortinger LLC: When Experience is Important and Cost Matters
Data and Intellectual Property Law Services
*One such database is described here: Machine Learning and AI in Cancer Prognosis, Prediction and Treatment Selection: A Critical Approach (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10312208/)