First, let's acknowledge that Big Data is going to be the most revolutionary scientific and technological advance of the early 21st century. Giant, mine-able data sets are already changing how we do biology, sociology, epidemiology, economics, and just about any other -ology or -nomics that you care to name. Big Data makes problems of intractable complexity suddenly tractable.
And this goes for counterterrorism intelligence analysis as well. By definition, looking for terrorists is about looking for needles in haystacks. Big data lets you find needles by looking at how they've changed the shape of the haystack in which they've been placed. Big Data drastically amplifies the effectiveness of traffic analysis.
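The "shape of the haystack" idea can be sketched in a few lines. This is a toy illustration with invented data, not anything the NSA actually runs: given only communication metadata (who contacted whom, no content), a node whose contact pattern deviates sharply from the baseline stands out.

```python
# Toy traffic-analysis sketch: find "needles" not by reading content,
# but by spotting nodes whose metadata footprint distorts the baseline.
# All names and records here are invented for illustration.
from collections import Counter
from statistics import mean, stdev

# (caller, callee) metadata records -- no content, just who contacted whom
call_log = [
    ("alice", "bob"), ("bob", "carol"), ("carol", "alice"),
    ("dave", "erin"), ("erin", "dave"), ("alice", "dave"),
] + [("mallory", f"contact{i}") for i in range(40)]  # one node fans out widely

def flag_outliers(log, z_threshold=2.0):
    """Flag nodes whose outbound contact count sits far above the mean."""
    counts = Counter(caller for caller, _ in log)
    mu, sigma = mean(counts.values()), stdev(counts.values())
    return [node for node, c in counts.items()
            if sigma > 0 and (c - mu) / sigma > z_threshold]

print(flag_outliers(call_log))  # only "mallory" stands out, purely from traffic shape
```

Nothing about the flagged account's content was examined; the shape of its traffic alone gave it away, which is exactly why bulk metadata is so powerful.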
So, to the extent that you really, really want to stop every last whack job from setting pressure cookers filled with black powder in public places, then you should be really, really happy that the NSA has mountains of data under its control and the computing power to sift through them. But of course the fly in the ointment is that "mountains of data under its control" bit. If they can sift through them looking for terrorists, they can sift through them looking for drug kingpins. Or child pornographers. Or embezzlers. Or Republicans.
Seems to me that this boils down to two basic questions:
- Who owns the data set?
- What techniques can the owner perform on the data?
We're not exactly happy that companies like Google and Amazon own all this stuff, but we're used to it. Why? Because we trust corporations to mind their business, even when their business is making money off of our semi-public behavior. Which brings us to the second question: what limits do we place on the owners of data when they process it?
To answer this question, I like to imagine doing something that I'd be ashamed of (of course this never happens in real life...) and then imagine various people looking at the electronic breadcrumbs.
Do I care if companies provide access to my data for advertising? If Amazon is trying to sell me a romance novel (hey, my wife and I share an account!), not so much. But if BDSM-Mart tried to sell me whips and chains, I'd be perturbed. Is there a difference between these two cases? Maybe.
First, Amazon tries to sell me things that Amazon sells. If I didn't do business with Amazon, they wouldn't try to sell me romance novels. And if Amazon started trying to sell me whips and chains, I'd probably close my account (no doubt causing Amazon's revenues to decline by at least 25%).
On the other hand, BDSM-Mart would have to acquire my data from somebody else. Maybe they could do this through Google, but Google understands where its bread is buttered and would never give BDSM-Mart direct access to my metadata. Instead, they allow BDSM-Mart to place advertising with Google, and Google inserts that advertising into pages served to people whose search and email patterns indicate that they might like bondage paraphernalia. At no point does Google say, "Hey, BDSM-Mart! The Radical Moderate looks like he might be a good customer for you!" If they did, I'd have a big problem with Google. At the very least, I'd be doing a lot more private browsing, which would hurt Google's revenue. At the most, I might be complaining to my congressman or filing lawsuits, which, writ large, might put Google out of business.
Next hypothetical: Imagine two guys at Google's water cooler, swapping stories about those hilarious searches that The Radical Moderate made last week. Would I be creeped out by this? The idea doesn't make me happy, but my guess is that if this happens (and maybe Google has a policy about their employees making directed queries and maybe they don't--I'd guess that developers have to make directed queries in the course of debugging stuff all the time), they don't talk about me by name; they talk about "some guy". Do I care now? Ennh.
Now, let's imagine that somebody in Google takes my data and posts my searches on a public website, with the express purpose of shaming or humiliating me. That's lawsuit territory, and I'm suing Google, not their employee. Google has a vested interest in making sure that this doesn't happen, the same way that they're not going to let BDSM-Mart see my data.
Bottom line: I trust private companies with this sort of data because they value it properly, or they don't remain in business.
But now let's suppose that Google's "customer" is the government. What kinds of questions can they ask?
Well, some of them might be close to innocuous. They might want to know the correlation between the number of accounts that search for BDSM stuff and certain types of pornography, for purposes of formulating law enforcement policy. If all of those correlations are anonymous, no harm done. But the logical next step isn't so wonderful. If the feds manage to convince some judge that a certain pattern of BDSM searches indicates a high likelihood that the account will also contain searches for kiddie porn, that's not so harmless. I can't imagine a judge issuing a warrant under those circumstances, but you never know--and the way the FISA courts are set up, you never will know.
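The "innocuous" version of that query looks something like the following sketch. The data and category names are invented; the point is that the statistic (here a phi coefficient over a 2x2 contingency table) is computed purely in aggregate, and no account identifier ever appears in the result.

```python
# Sketch of an anonymous aggregate query: the correlation between two
# boolean search-behavior categories, with identities simply dropped.
# Records and category names are invented for illustration.
from math import sqrt

# one row per account; the identity column never enters the analysis
records = [
    {"bdsm": True,  "flagged": True},
    {"bdsm": True,  "flagged": True},
    {"bdsm": True,  "flagged": False},
    {"bdsm": False, "flagged": False},
    {"bdsm": False, "flagged": False},
    {"bdsm": False, "flagged": True},
]

def phi_coefficient(rows, a, b):
    """Phi correlation between two boolean features (2x2 contingency table)."""
    n11 = sum(r[a] and r[b] for r in rows)
    n10 = sum(r[a] and not r[b] for r in rows)
    n01 = sum(not r[a] and r[b] for r in rows)
    n00 = sum(not r[a] and not r[b] for r in rows)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0

print(phi_coefficient(records, "bdsm", "flagged"))  # a single number leaves the room
```

The danger described above is precisely the step after this one: taking the correlation back to a judge and using it to justify queries against specific accounts.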
But notice what had to happen for me to be at risk from the government: Google had to turn over account-specific information. At that point, not only would Google no longer control the data, but they'd have handed it to somebody who would process it in ways that Google would never dream of. Here's where the ownership of the data becomes crucial. Without the data sitting in some server farm controlled by the feds, they can't control the post-processing and queries, and they can't overstep any bounds.
So maybe that's the first step toward a sensible data-mining policy: the government shouldn't be able to control data. Or, at the very least, it can't control data with identifiable account info. The NSA seems to have paid lip service to anonymizing account info, but it's also pretty clear that the names and addresses were just sitting in a protected table, with lots of people able to gain access. If that information remains safely back with the company that owns it, retrievable only with a warrant, things are somewhat more private. But we haven't solved the problem of how to keep the feds from mining the anonymous data for whatever they want.
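One standard way to build the firewall just described is keyed pseudonymization: the company hashes account identifiers with a secret key it alone holds, hands over only the tokens, and keeps the token-to-identity mapping on its own side, resolvable only when a warrant compels it. A minimal sketch, with invented names and an invented key:

```python
# Sketch of company-side pseudonymization: analysts get stable tokens
# plus behavioral data, but re-identification requires the company's
# secret key. Names and the key are invented for illustration.
import hashlib
import hmac

COMPANY_KEY = b"held-by-the-company-never-shared"

def pseudonymize(account_id: str) -> str:
    """Keyed hash (HMAC-SHA256): stable token, irreversible without the key."""
    return hmac.new(COMPANY_KEY, account_id.encode(), hashlib.sha256).hexdigest()[:16]

# what leaves the company: tokens plus activity, no identities
raw = [("alice@example.com", "searched X"), ("bob@example.com", "searched Y")]
shared = [(pseudonymize(acct), activity) for acct, activity in raw]

# the token->identity mapping stays behind -- resolving it is the warrant step
lookup = {pseudonymize(acct): acct for acct, _ in raw}
print(lookup[shared[0][0]])  # only the company can perform this step
```

Unlike the "protected table" the NSA apparently used, here the linking key never leaves the data's owner, so the analysts' access and the re-identification step are controlled by different parties.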
Another possibility is to require warrants for the actual post-processing and search queries. This is something that would obviously require a pretty sophisticated court (I don't see many judges being able to conduct a code review), but it has the advantage that all data now stays with a private owner. The feds can donate a firewalled, sterilized server farm to the company for doing the processing, and they can get the anonymous results back. But the government can't be allowed to own the data, or it will find some way to abuse it.
Does this impede the NSA in its quest to identify real security threats? It certainly makes the process more ponderous. If you've got a real-time threat, your engineers aren't going to be able to poke at the data willy-nilly until some interesting answer pops out. But that's an unlikely way for these guys to work; they're much more likely to come up with specific data-mining programs that they'd like to run on well-established data sets, over a long period of time. Adding one more hoop to jump through before they conduct this sort of surveillance doesn't seem like a big price to pay.
The much more troubling form of data collection is the taps that the NSA is alleged to have placed to collect raw internet traffic. There's no way to firewall that sort of collection with a private company. On the other hand, the only "account" associated with such data is an IP address. Keeping that information firewalled from the feds without a warrant should allow them to do bulk analysis on the traffic without compromising privacy. And if they find something alarming, they can get a warrant.
So we can reduce the problem, through fairly simple means, to a question of how much we trust the courts. Per reports on how FISA courts currently run, the answer to that question is, "not very much". Fixing the operation of these courts is difficult, but paramount. At the very least, applying for warrants needs to be an adversarial process. Currently, outside groups can file friend-of-the-court briefs, but it's pretty hard to quash a warrant when you don't know what it's for. Appointing and providing security clearances for a pool of outside parties with aggressive anti-surveillance agendas would go a long way toward leveling the playing field. I'd feel a lot better if an EFF lawyer were opposing every warrant application. I'd gladly contribute to a fund for the billable hours. (I'd rather not have the feds footing the bill...)
Ultimately, though, we're going to have to live with the fact that applying Big Data techniques to the detritus of daily life potentially gives the government huge power over us. The only way to prevent that power from being abused will be constant vigilance and aggressive proscription of the activities of the security agencies. I'm not sure I approve of how Snowden outed the whole system, but it's hard to deny that his actions have opened up a very useful conversation. I just hope that we put some real reforms in place before we all go back to sleep.