Welcome back to the Data Connections Blog! We left you last time with a data cliffhanger: data silos within organizations causing a lack of control over unruly data and leading to conflicts with corporate goals, along with potentially thorny legal and reputational issues. A perfect setup for a data hero to emerge.
In this edition, we are on to the exciting topic of data governance as the tool to help any data hero tame the data and avoid the thorns. Yes, I know that when you start by saying something is exciting, it just calls into question whether it really is. I’ll try to convince you over the next thousand-plus words.
Last time, we discussed the premise that a centralized data intake process is the linchpin of data governance. What do I mean by that, what would it incorporate, and how can it be accomplished?
I’ll start by drawing a parallel to software development. Once upon a time, developers mostly generated original, proprietary code. It was a slow and costly process. Then, with the advent of open source code[1], developers could easily obtain reusable code from public repositories – some well-regulated, and some not. To help control this code intake, organizations implemented a clearance process through which outside (or third party) code first undergoes pedigree and security checks, and then goes into a central repository. Any developer in the company can then reuse that cleared code. This open source governance process ensures that business and legal risk are appropriately managed, all while encouraging the rapid software development that today’s competitive business environment demands.
So what about data? Can we implement a similar process to “clear” it specific to the needs of an organization?
To start, it is important to create data guardrails, or data rules, covering these types of issues: How do you define “quality” data? How can you understand the reason data was collected and what it can appropriately be used for? How current is the data? What types of data should never be ingested (e.g., pornography, hate speech, advocacy of violence)? In the abstract, I cannot answer these questions for every organization out there, but that is the work my law firm and many others do in helping to create a data governance structure for our clients.
Also, it is important to understand that for every rule, there may be an exception. For example, if you ban pornography from your organization’s data intake but you support oncology research, it does not make sense to broadly ban images of breasts, since certain types of these images could be the key to predicting and treating breast cancer. Still, other images of breasts are not appropriate in almost all organizations. Both the enforcement and the interpretation of your data rules will be critical for an organization.
Second, what is the data we are even talking about? Data comes at organizations from a variety of sources. Some of it is generated by the organization itself and its interaction with clients/customers (… fill out this simple 83-question form divulging all your personal information to qualify for our free branded bookmark!). This works best if you are a massive organization providing a service that thousands of people want enough to give you that information (e.g., Amazon, Google, Meta). Even those massive data-creating organizations still supplement what they generate with third-party data, as we discussed in Blog #5. This data can come from any number of free or fee-based sources with widely varying quality and usability.
Third, which group is bringing in this data? Procurement? Finance? Product Developers? Researchers? All of the above? Without a data governance effort, data comes into your organization at multiple points from multiple sources and it is something like the proverbial herding of cats.
To revisit our question – how do we clear data and manage risk? The answer brings us back to the need to create a centralized data intake process. This will not be easy; I know from personal experience. However, the benefits are tremendous.
If all data enters an organization at a single point and by a single process, you can enforce your data policy and rules. You can additionally add searchable metadata, an index of information about the dataset, to help you find and manage that data. You can include such things as when the dataset was created or entered your organization, to help users know if it is current. You can include metadata to identify the provenance of the data, not only the most recent source but also any known prior history. And you can include a summary of any preferred use case scenarios, such as why the data was collected and how it is best used.
From a pure data management standpoint, you can include metadata to help you search for and find the dataset, including for discovery or regulatory purposes. If regular payments or renegotiations for continued use of the data are required, you can use metadata to automatically trigger that. Importantly, if you take on data (e.g., client data) for use only in a specified engagement and you agree to disgorge it at the end of the engagement, identifying metadata can enable far better compliance than you would have without it.
All of these features can be implemented by having each data user complete a form when requesting to ingest a dataset into your organization. This form would have drop-down questions, standardized across the organization[2], about any dataset that they want to bring inside your firewall. The answers can help you determine whether the data that a user wants to ingest meets your organization’s data rules and standards. The answers then become searchable information for later finding and using that dataset, or for understanding why it was rejected for ingestion into your organization.
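The value of standardized drop-down answers (see footnote 2) is that they can be validated and searched mechanically. A minimal sketch of such validation, using entirely hypothetical question names and controlled vocabularies:

```python
# Controlled vocabularies keep answers searchable (example values only).
ALLOWED_ANSWERS = {
    "data_category": {"customer", "financial", "research", "operational"},
    "source_type": {"internal", "licensed", "public", "client-provided"},
    "contains_pii": {"yes", "no", "unknown"},
}

def validate_intake_form(answers: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the form passes."""
    problems = []
    for question, allowed in ALLOWED_ANSWERS.items():
        value = answers.get(question)
        if value is None:
            problems.append(f"missing answer: {question}")
        elif value not in allowed:
            problems.append(f"{question}: '{value}' is not a standard option")
    return problems
```

Because every submission must pick from the same fixed options, a later search for, say, all `client-provided` datasets is a simple exact-match query rather than a guessing game over free text.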
The answers from this intake form also enable you to employ a Data Ethics Board[3] to intentionally approve exceptions (or not) to your data rules, when applicable. This process helps your organization interpret the rules in a consistent way and one that meets the evolving needs of the data community, your organization and your users.
Once you have a corpus of information from these data intake forms, not only can you create and tag datasets with metadata to enable the functionality mentioned above, but you have also created a virtual card catalog of the data in your organization. Searching this card catalog then becomes the first step anytime someone needs data.
Imagine if your data users could browse a virtual index of all the data already in your organization before going outside to find data. Not only could this speed up their ability to find and use data, but you could also follow up with current data users to have them rate and evaluate the quality and usability of the data that they are using. Subsequent data users could then get the same types of evaluations of datasets already inside your firewall that we are accustomed to seeing on online shopping sites like Amazon or Target.
Likewise, if your organization develops concerns over certain types of datasets[4] (based on industry updates, cybersecurity events, government warnings or otherwise), a searchable index showing not only new datasets of a particular type but also all other such datasets already in your organization holds tremendous value.
Lest you think I am living in some fantasy land where the data and business teams all readily jump on board with this significant effort… I am very aware that there will be resistance to attempts to centralize data intake and review. This will be especially true in larger organizations. I have heard such things as: “You’re taking away our autonomy,” “You’re just creating a bureaucracy,” “This is just another bottleneck that will slow down product development,” and more (some of which is not fit for printing here!).
The counter to all of these is the efficiency, control and eventual cost reduction that will come with this centralized data intake process. Even attempts to create data intake reviews at a divisional level within an organization cannot achieve the efficiencies, cost savings and control benefits of a centralized intake process.
The next, and quite logical, question is:
“…do I need to do this centralized review for all the data already inside my organization?”
The subtext is that this retrospective review may take years to accomplish. The simple answer is to keep this achievable and practical. We know that only a small percentage of the data in most organizations is currently in use, and that data will typically be fairly current, since data intake and usage have ramped up dramatically in the past five years.
Therefore, it makes sense to start the review process with all the data entering into your organization today. If you discover historic datasets that are critical to your operations, by all means review them and tag them with metadata. However, the key to successfully implementing this approach is to focus on the achievable and to take it one step at a time.
Based on my experience with clients, I can predict that within months you will start to see real benefits, and within years you could have a world-class data program.
I hope you’ll be back next time as we continue to explore these issues and learn more about data connections!
If you have questions about your data and your legal compliance programs for data, Mortinger & Mortinger LLC can help! Contact me directly at: steve@mortingerlaw.com

Footnotes
- 1. See this link for history of open source software: https://www.computer.org/csdl/magazine/co/2021/02/09353517/1r8kwgBjU9W ↑
- 2. Standardized questions and responses add huge value in making the responses searchable. If an item is “blue,” it is more easily searchable than if write-in options like “navy” (rather than just “blue”) are allowed. ↑
- 3. A Data Ethics Board is a cross-disciplinary group, often including participants from outside the organization, intended to bring a broad perspective to potentially tricky decisions regarding what data an organization will use and how that data may be used. It will look at both internal and client needs, and it will try to anticipate any public backlash from taking certain actions related to data. ↑
- 4. Organizations are increasingly concerned about datasets coming from countries like China and Russia due to issues related to data quality (intentional or otherwise). ↑