Well, this is hardly surprising; the FBI was in the habit of pretending to be on a terrorism case every time they wanted telecoms traffic data. Their greed for call-detail records is truly impressive. Slurp! Unsurprisingly, the lust for CDRs and the telcos’ eagerness to shovel them in rapidly got the better of their communications analysis unit’s capacity to crunch them.
Meanwhile, Leah Farrell wonders about the problems of investigating “edge-of-network” connections. Obviously, these are going to be the interesting ones. Let’s have a toy model; if you dump the CDRs for a group of suspects, 10 men in Bradford, and pour them into a visualisation tool, the bulk of the connections on the social network graph will be between the terrorists themselves, which is only of interest for what it tells you about the group dynamics. There will be somebody who gets a lot of calls from the others, and they will probably be important; but as I say, most of the connections will be between members of the group because that’s what the word “group” means. If the likelihood of any given link in the network being internal to it isn’t very high, then you’re not dealing with anything that could be meaningfully described as a group.
By definition, though, if you’re trying to find other terrorists, they will be at the edge of this network; if they weren’t, they’d either be in it already, or else they would be multiple hops away, not yet visible. So, any hope of using this data to map the concealed network further must begin at the edge of the sub-network we know about. And the principle that the ability to improve a design occurs primarily at the interfaces (which are also the prime location for screwing it up) points the same way.
But there’s a really huge problem here. The modelling assumptions are that a group is defined by being significantly more likely to communicate among itself than with any other subset of the phone book, that the group is small relative to the world around it, and that it is boring; everyone has roughly similar phoning behaviour, and therefore who they call is the question that matters. I think these are reasonable.
The problem is that it’s exactly at the edge of the network that the number of possible connections starts to curve upwards, and that the density of suspects in the population falls. Some more assumptions: an average node talks to x others, with calls being distributed among them on a well-behaved curve. Therefore, the set of possibilities is multiplied by x for each link you follow outwards; even if you pick only the top 10% of the calling distribution, you’re going to fall off the edge as the false positives pile up. With x=8 and three hops, we’re looking at 8 × 8 × 8 = 512 contacts from the top 10% of the calling distribution alone.
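To make the arithmetic concrete, here is a minimal Python sketch of the fan-out under the toy model's assumptions. The function names are made up for illustration, and the model is deliberately crude: every node passes you on to exactly x new contacts and nobody's contact lists overlap, so it gives an upper bound rather than a measurement. The count of genuine suspects fed into the hit rate is purely hypothetical.

```python
def contacts_at_hop(fanout: int, hops: int) -> int:
    """Candidate numbers reached exactly `hops` links out.

    Assumes every node contributes `fanout` new contacts and that no two
    contact lists overlap (a tree), so this is an upper bound.
    """
    return fanout ** hops


def cumulative_contacts(fanout: int, hops: int) -> int:
    """All candidates collected on the way out to `hops` links."""
    return sum(fanout ** k for k in range(1, hops + 1))


def hit_rate(genuine: int, fanout: int, hops: int) -> float:
    """Share of candidates at a given hop that are genuine suspects.

    `genuine` is a hypothetical input; the toy model does not supply it.
    """
    return min(1.0, genuine / contacts_at_hop(fanout, hops))


if __name__ == "__main__":
    x = 8  # contacts followed per node, as in the text
    for hop in range(1, 4):
        print(hop, contacts_at_hop(x, hop), cumulative_contacts(x, hop))
    # Hypothetical: suppose 5 genuine associates sit three hops out.
    print(hit_rate(5, x, 3))
```

With the numbers from the text, the third hop alone yields 512 candidates and the three hops together 584. If, hypothetically, five of those third-hop contacts were genuine, fewer than 1% of the names on the list would be worth a visit.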
In fact, it’s probably foolish to assume that suspects would be in the top 10% of the distribution; most people have mothers, jobs, and the like, and you also have to imagine that the other side would deliberately try to minimise their phoning or, more subtly, to flatten the distribution by splitting their communications over a lot of different phone numbers. Actually, one flag of suspicion might be people who were closely associated by other evidence who never called each other, but the false positive rate for that would be so high that it’s only realistically going to be hindsight.
Conclusions? The whole project of big-scale database-driven social network analysis is based on the wrong assumptions, which are drawn either from military signals intelligence or from classical policing. Military traffic analysis works because it assumes that the available signals are a subset of a much bigger total, and that this total is large compared to the world. This makes sense because that’s what the battlefield of electronic warfare is meant to look like – cleared of civilian activity, dominated by one side or the other’s military traffic. Working from the subset of enemy traffic that gets captured, it’s possible to infer quite a lot about the system it belongs to.
Police investigation works because it limits the search space and proceeds along multiple lines of enquiry; rather than pulling CDRs and assuming the three commonest numbers must be suspects, it looks for suspects based on the witness and forensic evidence of the case, and then uses other sources of data to corroborate or refute suspicion.
To summarise, traffic analysis works on the assumption that there is an army out there. We can only see part of it, but we can make inferences about the rest because we know there is an army. Police investigation works on the observation that there has been a crime, and the assumption that probably, only a small number of people are possible suspects.
So, I’m a bit underwhelmed by projects like this. One thing that social network datamining does, undoubtedly, achieve is to create handsome data visualisations. But this is dangerous; it’s an opportunity to mistake beauty for truth. (And they will look great on a PowerPoint slide!)
Another, more insidious, more sinister one is to reinforce the assumptions we went into the exercise with. Traffic-analysis methodology will produce patterns; our brains love patterns. But the surge of false positives means that once you get past the first couple of hops, essentially everything you see will be a false positive result. If you’ve already primed your mind with the idea that there is a sinister network of subversives everywhere, techniques like this will convince you even further.
Unconsciously, this may even be the purpose of the exercise – the latent content of Evan Kohlmann. At the levels of numbers found in telco billing systems, everyone will eventually be a suspect if you just traverse enough links.
Which reminded me of Evelyn Waugh, specifically the Sword of Honour trilogy. Here’s his comic counterintelligence officer, Colonel Grace-Groundling-Marchpole:
Colonel Marchpole’s department was so secret that it communicated only with the War Cabinet and the Chiefs of Staff. Colonel Marchpole kept his information until it was asked for. To date that had not occurred and he rejoiced under neglect. Premature examination of his files might ruin his private, undefined Plan. Somewhere, in the ultimate curlicues of his mind, there was a Plan.
Given time, given enough confidential material, he would succeed in knitting the entire quarrelsome world into a single net of conspiracy in which there were no antagonists, only millions of men working, unknown to one another, for the same end; and there would be no more war.
Want a positive idea? One reading of this and this would be that the failure of intelligence isn’t a failure to collect or analyse information about the world, or rather it is, but it is caused by a failure to collect and analyse information about ourselves.
Although Robert Conquest’s ‘The Great Terror’ (1968) has been criticised on reasonable grounds, I seem to recall that he notes that the Stalinist terror of the thirties had to end on grounds of geometrical progression. If it had not finished, the entire population of the USSR would have been denounced and imprisoned.