Developments over the last two decades have made data a defining feature of modern life. Nearly all information created today is generated in digital form. Smaller and cheaper sensors have enabled an explosion in the number of internet-connected devices, which along with transactional systems and social media platforms produce constant streams of human- and machine-generated data. Improvements in the cost and efficacy of digital storage enable this information to be captured and analysed. Intelligence agencies have taken advantage of this new data landscape, but there are questions about the effectiveness of their new interest in data. Does it unlock new intelligence or is it a distraction from the real work of analysis?
For signals intelligence agencies like GCHQ this explosion of data is both a challenge and an opportunity. An expansion in the amount of information they have at their fingertips theoretically allows them to deliver more intelligence to decision-makers than they previously could. On the other hand, as the pool of information at their disposal grows, pulling out relevant, timely and accurate intelligence becomes harder.
Edward Snowden’s 2013 revelations brought the scope of GCHQ’s data collection to light. At the time, the agency was tapping into 200 of the fibre optic cables that transmit internet and telecommunications data across the world. Units of data are called “bytes” and one byte is roughly equivalent to the data needed to encode one character of text on a computer. The latest iPhone model has a capacity of 256 gigabytes—that’s 256bn bytes. GCHQ’s capabilities give it access to 21 petabytes of data per day (a petabyte is equal to one million billion bytes). GCHQ also planned to tap an additional 200 cables, which if successful would have doubled its data uptake. An operation of this scale means that significant elements of the collection, processing and analysis of this data must be automated—this is where big data analytics come in.
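To put those figures in perspective, a rough back-of-the-envelope calculation, sketched below in Python and using only the numbers quoted above, shows the daily intake amounting to tens of thousands of fully loaded handsets.

```python
# Back-of-the-envelope comparison using the figures quoted above; illustrative only.
BYTES_PER_GB = 10**9    # one gigabyte is one billion bytes
BYTES_PER_PB = 10**15   # one petabyte is one million billion bytes

iphone_capacity = 256 * BYTES_PER_GB   # a 256-gigabyte handset
daily_intake = 21 * BYTES_PER_PB       # the reported 21 petabytes per day

print(f"Daily intake is roughly {daily_intake / iphone_capacity:,.0f} full handsets")
# prints: Daily intake is roughly 82,031 full handsets
```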
The promise of big data analytics is an optimistic one—that with enough data, the patterns, outliers and trends that would be difficult for humans to discern can be identified in real time. Algorithms can be created that will scour vast databases and deliver useful insights directly to human analysts, freed from the laborious task of sorting and analysing information themselves.
Delivering these results is more complicated in practice, especially in secretive government agencies with their inevitable tendency towards bureaucratic stagnation.
Data is only valuable if useful insights can be gained from it. Intelligence agencies contend with a large volume of unstructured data, including social media posts, email content and audio, image and video data. All of this can be very useful, especially if it can be organised into databases and cross-referenced with structured data, such as bank transactions, travel records, and phone-call and email metadata. But before this is possible, unstructured data must be processed. For textual data, this might involve expanding shorthand, correcting typos, removing gibberish and discarding duplicate information. The data would then be mined for particular features, such as locations, individual identities and keywords.
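A minimal sketch of that kind of text-processing pipeline is shown below. It is illustrative only: the shorthand dictionary, the keyword watch-list and the sample messages are invented for the example, not drawn from any real system.

```python
import re

# Hypothetical shorthand expansions and keywords, purely for illustration.
SHORTHAND = {"msg": "message", "acct": "account", "tmrw": "tomorrow"}
KEYWORDS = {"transfer", "meeting", "flight"}

def normalise(text: str) -> str:
    """Clean a raw snippet: lower-case, expand shorthand, strip stray symbols."""
    words = [SHORTHAND.get(w, w) for w in re.findall(r"[a-z0-9']+", text.lower())]
    return " ".join(words)

def extract_features(text: str) -> dict:
    """Pull out simple features: keyword hits and anything that looks like a time."""
    clean = normalise(text)
    return {
        "keywords": sorted(KEYWORDS & set(clean.split())),
        "times": re.findall(r"\b\d{1,2}:\d{2}\b", text),
        "clean_text": clean,
    }

def deduplicate(snippets: list[str]) -> list[str]:
    """Discard snippets whose normalised form has already been seen."""
    seen, unique = set(), []
    for s in snippets:
        key = normalise(s)
        if key not in seen:
            seen.add(key)
            unique.append(s)
    return unique

if __name__ == "__main__":
    raw = ["Msg me re the acct transfer tmrw at 14:30!!",
           "msg me re the acct transfer tmrw at 14:30"]
    for snippet in deduplicate(raw):
        print(extract_features(snippet))
```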
This process is not infallible. Converting unstructured data into an intelligible form needs to happen in real time, and it is algorithms that do the job. Machine intelligence, while better than humans at image recognition, is less skilled at identifying sentiment, and may fail to recognise a joke or draw the more subtle inferences that human intelligence intuitively grasps.
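The sentiment problem is easy to demonstrate with a deliberately naive word-counting scorer, the kind of shortcut a simple automated pipeline might lean on. The word lists and the sample sentence below are invented for the illustration.

```python
# A deliberately naive lexicon-based sentiment scorer, to show why word
# counting alone misses sarcasm and other subtle cues.
POSITIVE = {"great", "brilliant", "love", "wonderful"}
NEGATIVE = {"terrible", "hate", "awful", "delay"}

def naive_sentiment(text: str) -> int:
    """Score = positive word count minus negative word count."""
    words = text.lower().replace(",", "").replace(".", "").split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

sarcastic = "Brilliant, another six-hour delay. I just love this airline."
print(naive_sentiment(sarcastic))
# prints 1: scored as mildly positive despite the obvious sarcasm
```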
Further, the effectiveness of an algorithm is dependent on the assumptions of the person who built it—an algorithm is after all nothing more than a set of encoded instructions. If the underlying assumptions contained in those instructions are incorrect, the most useful information may be overlooked. A failure on this level might not be obvious until made clear by events in the real world.
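A trivial sketch shows how an encoded assumption becomes a blind spot: a filter built around one exact spelling of a name never surfaces messages that use a variant, and nothing in its output reveals the gap. The watch-list and messages below are fictional.

```python
# Illustration of encoded assumptions becoming blind spots; all names are fictional.
WATCH_LIST = {"acme logistics"}   # assumption: the name always appears in this exact form

messages = [
    "Invoice approved for Acme Logistics shipment",
    "Payment routed via A.C.M.E. Logistics Ltd",   # a variant the rule never anticipated
]

def flag(message: str) -> bool:
    """Flag a message only if a watch-list entry appears verbatim."""
    return any(entry in message.lower() for entry in WATCH_LIST)

for m in messages:
    print(flag(m), "-", m)
# True  - the first message matches the assumed spelling
# False - the second is missed, and nothing in the output hints at the gap
```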
The challenges of unstructured data are compounded by the inherent limits of data analytics, especially when applied to huge bodies of data. While algorithms are good at identifying correlations, they will turn up many meaningless coincidences. Nor can they provide insight into causality, into why a piece of information appears the way it does. Human analysts are charged with establishing causal relationships and adjudicating the value of correlations. Here an enormous pool of data may actually be a hindrance: the more data you have, the more likely you are to find evidence that apparently supports your hypothesis.
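The point about meaningless coincidences can be demonstrated with purely random data: generate enough unrelated series and some pair will correlate strongly by chance alone. The sketch below uses arbitrary numbers of series and days.

```python
import numpy as np

# Purely random, unrelated daily series: any strong correlation found is coincidence.
rng = np.random.default_rng(0)
n_series, n_days = 200, 50
data = rng.normal(size=(n_series, n_days))

# Pairwise correlation matrix; zero the diagonal (each series with itself).
corr = np.corrcoef(data)
np.fill_diagonal(corr, 0.0)

i, j = np.unravel_index(np.abs(corr).argmax(), corr.shape)
print(f"Strongest 'relationship': series {i} vs {j}, r = {corr[i, j]:.2f}")
# With 200 random series, the best pair typically correlates above |r| = 0.5,
# which would look striking to an analyst who did not know the data was noise.
```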
The concerns outlined above are real, but they are not necessarily new for intelligence agencies. The risk of mischaracterising information, the difficulty of proving causality, the existence of ambiguous correlations, the influence of human biases and the chance of unanticipated blind spots have long been challenges in the intelligence world.
But, despite these challenges, the collection and analysis of big data has a value for intelligence agencies that far outweighs the shortcomings. It provides a way to perform real-time analysis on a scale that was previously impossible, an ability to track long-term trends and the capacity to generate hypotheses that go beyond the confines of human imagination. These benefits, augmented by human analysis, lead to enhanced predictive and forecasting capabilities—one of the key functions carried out by intelligence agencies.
To realise the opportunities of big data, agencies will need to ensure that their analytic abilities keep pace with their collection. This means budgets must include allocations for updating systems, human analysts must be trained to understand the automated machine learning tools that provide them with information, and intelligence agencies must compete with the private sector to hire skilled data scientists to develop and refine capture, storage and analysis algorithms.