Toxic Data: A New Challenge for Data Governance and Security

Recently, I had a great conversation with an early adopter in the financial services realm. We discussed a new and challenging problem that many companies now face, one that will become harder to manage as the landscape of data grows and as we use and publish far more data about our companies.

That problem: Toxic data.

My early adopter friend said that one of the unintended consequences of the publication of many different forms of data, both from individual companies and across companies, is the ability this public data provides to create unintentional integrations that reveal too much. Namely, multiple sources of public information can be pieced together in a way that ends up creating security risks.

My friend called this toxic data, and it is a great name because it highlights that even seemingly mundane data, if used improperly, can create problems and security threats. As much as we need open data in many realms, ad-hoc releases of data or improper management of data can lead to problems for everyone involved.

A prime, well-publicized example of this occurred a few years ago when the Social Security Administration allowed people to mine social security numbers online. A journalist noticed this and the administration quickly shut it down, but problems like this are about to become more common and more complicated, and it is not at all clear how to solve them. I’d like to take a closer look at the nature of the problem.

What is toxic data?

Imagine that a senior executive’s schedule is available online, either explicitly or implicitly. Now add that the executive is known to use a certain private jet service, and that the trips on that service can be determined publicly. Finally, suppose the jet service also makes it possible to see which flight crew will be on each private jet. A nefarious hacker could then get to that senior executive in a number of ways. For instance, the hacker could target the electronics of a flight crew member, who might have a far less secure profile than the senior executive or the staff surrounding him or her. This could open the door to a new type of cyberattack that wouldn’t have been possible in the past.
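To make the linking step concrete, here is a minimal sketch of how three separately harmless sources could be joined. Everything in it is hypothetical: the record types, the field names, and the sample values are my own assumptions for illustration, not data from any real jet service.

```python
from dataclasses import dataclass


# Hypothetical record shapes for three separately published sources.
@dataclass(frozen=True)
class ScheduleEntry:
    executive: str
    date: str
    city: str


@dataclass(frozen=True)
class Flight:
    flight_id: str
    date: str
    destination: str


@dataclass(frozen=True)
class CrewAssignment:
    flight_id: str
    crew_member: str


def link_sources(schedule, flights, crew):
    """Join the three sources on shared fields.

    Returns (executive, flight_id, crew_member) triples that no single
    source reveals on its own.
    """
    # Index crew members by flight so the final join is a simple lookup.
    crew_by_flight = {}
    for assignment in crew:
        crew_by_flight.setdefault(assignment.flight_id, []).append(
            assignment.crew_member
        )

    linked = []
    for entry in schedule:
        for flight in flights:
            # A schedule entry and a flight match on date and city.
            if (flight.date, flight.destination) == (entry.date, entry.city):
                for member in crew_by_flight.get(flight.flight_id, []):
                    linked.append((entry.executive, flight.flight_id, member))
    return linked


# Made-up sample data.
schedule = [ScheduleEntry("Exec A", "2024-03-01", "Chicago")]
flights = [
    Flight("F100", "2024-03-01", "Chicago"),
    Flight("F200", "2024-03-02", "Denver"),
]
crew = [
    CrewAssignment("F100", "Crew X"),
    CrewAssignment("F100", "Crew Y"),
    CrewAssignment("F200", "Crew Z"),
]

exposed = link_sources(schedule, flights, crew)
print(exposed)
```

None of the three lists is sensitive by itself. The risk only appears in the joined result, which ties named crew members to a specific executive’s trip.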

Now, that’s just one example. I’m collecting as many as I can, so please reach out if you have some ideas.

So what do we do about this?

Given how vulnerable all companies could be to these types of threats, it’s logical that businesses are now looking for a solution to toxic data. The question is, is there a way for us to govern and handle security concerns as the landscape of data continues to expand?

As I pointed out throughout my cybersecurity series (“Creating a Balanced Cybersecurity Portfolio”), there is no easy solution and no way to provide 100% security. Threats can and will succeed. Toxic data challenges will be no different. But what makes toxic data even more complicated is that, because it involves linking various sources of public or semi-public data, it will be hard, and in some cases impossible, for enterprises to even know when a breach has occurred.

Take my first example, in which the jet company published the flight schedule to optimize flights and to help coordinate flight attendants. This public information also allows the senior executive’s team to plan flights around his or her schedule. But it’s hard to see how anyone would notice that these things could be put together to do something sinister. It’s a Catch-22 situation: you need to think with criminal intent to spot the places where criminal intent could strike.

And this is truly where data governance becomes thorny. You can’t meet these challenges unless you have resources devoted to brainstorming about the possible toxic use of data. But this brainstorming could lead to the conclusion that any release of data is a risk. The goal is to find a balance between what’s possible and what’s useful. And frankly, there are not a lot of good principles to guide us at this point. It’s an emerging threat, and the solutions are very nascent.

Some recent attempts at combating toxic data

With so many data breaches in the news, toxic data, in all its forms, is becoming more and more recognized as a problem. How are those in the tech field meeting these problems currently? As I said, there aren’t great solutions as of yet.

One thing that I’ve long advised, throughout my cybersecurity series, and that is gaining traction in other places as well, is that companies must have a data protection strategy. This means having policies regarding data retention, sharing, and usage.

There are other ideas about how to solve this. Some advocate creating a comprehensive data protection platform that includes a heavy detection component. Others, like Bruce Schneier, believe data itself is a toxic asset, and that companies should do whatever they can to hold onto less of it. If companies and the government store less data, and there are regulations requiring them to do so, there will simply be less of a chance that data can turn toxic.

You have also seen the rise of the idea that companies can outsource the handling and securing of their data to third-party vendors, which only solves the problem of toxic data if you can be sure that the third-party vendor is completely secure (which, obviously, is a big risk). Companies may also begin to explore bounty programs in which they hire vendors or well-intentioned third-party hackers to look into their data and see if they can find potential toxic data problems in advance.

It will be fascinating to see how the response to toxic data evolves over time. And as that occurs, I think we have a number of key questions that must be addressed:

What would be the warning signs of the potential for toxic data to be released?
Is it possible to score data sets by their risk of becoming toxic data?
Is it possible to proactively discover toxic data?
What is the role of crowdsourcing, bounty-based toxic data prevention, or other reward-based detection products?
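
On the scoring question, one very rough starting point might be to count how many other published data sets each data set could be joined with. The sketch below is only an illustration of that idea under my own assumptions: the data set names and field names are invented, and sharing a field name is a crude stand-in for real linkability.

```python
from itertools import combinations


def linkage_scores(datasets):
    """Score each data set by how many other data sets share a field with it.

    datasets maps a data set name to the set of field names it publishes.
    A higher score means more ways to join the data set with others.
    """
    scores = {name: 0 for name in datasets}
    for a, b in combinations(datasets, 2):
        # Any shared field name is treated as a possible join key.
        if datasets[a] & datasets[b]:
            scores[a] += 1
            scores[b] += 1
    return scores


# Hypothetical published data sets and their fields.
published = {
    "executive_schedule": {"executive", "date", "city"},
    "flight_schedule": {"flight_id", "date", "city"},
    "crew_roster": {"flight_id", "crew_member"},
    "cafeteria_menu": {"weekday", "dish"},
}

scores = linkage_scores(published)
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score}")
```

In this made-up example the flight schedule scores highest because it connects to both of the other travel-related data sets, which matches the intuition from the jet example. A real score would need to weigh how sensitive the linked result is, not just how many joins are possible.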

Anyway, this article is just a start down a road that I hope turns out to be interesting.