Data Science Techniques... by Simon Flower

Simon Flower is interested in the application of data science techniques to the range of data in BGS. He is also a senior specialist in IT in geophysics, chair of the INTERMAGNET Operations Committee and a Chartered Engineer. He has managed the growth of information technology across the Geomagnetism programme for the past 25 years.

We recently held a hands-on data science workshop for BGS scientists, an opportunity to spend time learning a little about new ways to work with our data. Many of us are becoming aware of the revolution that is going on in data science and of the power of what household names like Google are doing. Sometimes seemingly ‘magic’ things surprise us. Have you received a flight booking by email only to have Google remind you a few hours before you’re due to depart for the airport? Or stopped at a café to find that Google knows you’ve been there and is asking you for a review? The uncanny accuracy with which Facebook seems able to recommend people that you know is the result of some serious data mining activity in the background.

BGS is rich in data, past and present, and is on the verge of a vast increase in what’s available to us through projects like UK Geoenergy Observatories and the European Plate Observing System. At times we’ve possibly thought of this as a problem, having all this data to look after. But it represents a huge opportunity if we can start to master data science techniques. What would it be like if we could do a semantic search(1) of the entire BGS data holdings, past and present, and track down things that are relevant to our work from across the whole range of BGS data? Or continuously trawl the web for things that people are saying that are relevant to us, whether about earthquakes, landslides or how the public regard us? Are we really thought of as an independent body, as we hope we are? Data science techniques can answer this question. As we develop the ability to use these techniques, we will realise that there are new scientific questions that we can answer.

Looking round BGS I can see that there are a number of areas where we are doing some of this data processing manually and already feeling swamped (a case of data being a ‘problem’). Through necessity I suspect that these areas will develop expertise in data science, turning problems into opportunities. At the workshop we had examples of this, with some of us using Apache Spark(2) to investigate processing large volumes of groundwater data. It was good to see scientists in BGS exploring how these techniques can benefit their work. I hope that we can build on this and create some exemplar systems that not only solve a particular problem, but also provide the catalyst for others in BGS to take the first step. Watch this space as there’s definitely more to come!

Note 1: A semantic search allows you to search for the meaning of words rather than simply matching the way the letters in a word are ordered. For example, semantic searching would understand that, in the context in which I work, aurora refers to the Northern or Southern lights, not to a Norwegian singer or a type of trainer.
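
To make the difference between matching letters and matching meaning concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the concept table is hand-written, the document titles are invented, and the function names are hypothetical. Real semantic search systems typically learn meaning from large amounts of text and can use context to tell one sense of a word from another, which this toy does not attempt.

```python
import re

# Hypothetical, hand-written concept table (phrase -> concept label).
# A real system would learn these relationships rather than list them.
CONCEPTS = {
    "aurora": "polar_lights",
    "northern lights": "polar_lights",
    "southern lights": "polar_lights",
    "earthquake": "seismic_event",
    "tremor": "seismic_event",
}

# Invented sample titles, for illustration only.
DOCUMENTS = [
    "Aurora seen during last night's geomagnetic storm",
    "Small tremor recorded by the seismometer network",
    "Groundwater levels in a monitoring borehole",
]


def words(text):
    """Return the set of lower-case words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def concepts(text):
    """Return the concept labels whose phrases appear in the text."""
    lowered = text.lower()
    return {
        label
        for phrase, label in CONCEPTS.items()
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered)
    }


def keyword_search(query, documents):
    """Match on shared words only (the way the letters are ordered)."""
    query_words = words(query)
    return [doc for doc in documents if query_words & words(doc)]


def semantic_search(query, documents):
    """Match on shared concepts (a crude stand-in for meaning)."""
    query_concepts = concepts(query)
    return [doc for doc in documents if query_concepts & concepts(doc)]


if __name__ == "__main__":
    print(keyword_search("northern lights", DOCUMENTS))
    print(semantic_search("northern lights", DOCUMENTS))
```

Searching for "northern lights" by keyword finds nothing here, because none of the titles contains either word, whereas the concept-based search returns the aurora title.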

Note 2: Apache Spark is a software system that allows data to be manipulated by clusters of computers, enabling large data sets or complex operations to be completed in a short time. A feature of Apache Spark is that users do not need to concern themselves with how the work is split across the multiple computers comprising the cluster.
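
The idea that the user describes the work and leaves the splitting to the system can be sketched in plain Python. This is not Apache Spark: it uses a thread pool on a single computer from Python's standard library as a stand-in, and the site names, readings and function names are all invented for illustration. The point it shares with Spark is only that the caller says what to compute for each item, not which worker handles which item.

```python
from concurrent.futures import ThreadPoolExecutor

# Invented groundwater-level readings per site, for illustration only.
READINGS = {
    "site_a": [1.0, 2.0, 3.0],
    "site_b": [4.0, 6.0],
}


def mean_level(values):
    """Work to be done for one site: average its readings."""
    return sum(values) / len(values)


def process_all(readings, workers=4):
    """Apply mean_level to every site, letting the pool share out the work.

    The caller never decides which worker handles which site.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as its input.
        means = list(pool.map(mean_level, readings.values()))
    return dict(zip(readings.keys(), means))


if __name__ == "__main__":
    print(process_all(READINGS))
```

Changing the number of workers changes how the work is shared out but not the code that describes the calculation, which is the property Note 2 highlights.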