Simon Flower is
interested in the application of data science techniques to the range of data
in BGS. He is also a senior specialist in IT in geophysics, chair of the
INTERMAGNET Operations Committee and a Chartered Engineer. He has managed the
growth of information technology across the Geomagnetism programme for the past
25 years.
We recently held a hands-on data science workshop for BGS scientists,
an opportunity to spend time learning a little about new ways to work with our
data. Many of us are becoming aware of the revolution that is going on in data
science and of the power of what household names like Google are doing.
Sometimes seemingly ‘magic’ things surprise us. Have you received a flight
booking by email only to have Google remind you a few hours before you’re due
to depart for the airport? Or stopped at a café to find that Google knows
you’ve been there and is asking you for a review? The uncanny accuracy with
which Facebook seems able to recommend people that you know is the result of
some serious data mining activity in the background.
BGS is rich in data, past and present, and is on the verge
of a vast increase in what’s available to us through projects like UK Geoenergy
Observatories and the European Plate Observing System. At times we’ve possibly
thought of this as a problem, having all this data to look after. But it
represents a huge opportunity if we can start to master data science techniques.
What would it be like if we could do a semantic search(1) of the
entire BGS data holdings, past and present, and track down things that
are relevant to our work from across the whole range of BGS data? Or
continuously trawl the web for things that people are saying that are relevant
to us, whether about earthquakes, landslides or how the public regards
us? Are we really thought of as an independent body, as we hope we are?
Data science techniques can answer this question. As we develop the ability to
use these techniques, we will realise that there are new scientific questions
that we can answer.
Looking round BGS I can see that there are a number of areas
where we are doing some of this data processing manually and already feeling
swamped (a case of data being a ‘problem’). I suspect that, through necessity, these
areas will develop expertise in data science, turning problems into opportunities.
At the workshop we had examples of this, with some of us using Apache Spark(2)
to investigate processing large volumes of groundwater data. It was good to see
scientists in BGS exploring how these techniques could benefit
their work. I hope that we can build on this and create
some exemplar systems that not only solve a particular problem, but also
provide the catalyst for others in BGS to take the first step. Watch this space
as there’s definitely more to come!
Note 1: A semantic
search allows you to search for the meaning of words rather than simply
matching the way the letters in a word are ordered. For example, semantic
searching would understand that, in the context in which I work, aurora refers
to the Northern or Southern lights, not to a Norwegian singer or a type of
trainer.
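
As a rough illustration of the idea in Note 1, the sketch below guesses which meaning of "aurora" a query intends by counting how many of the query's words overlap with a small set of context words for each meaning. The word lists and the function name guess_sense are invented for this example; real semantic search systems typically learn such associations from large amounts of text rather than relying on hand-written lists.

```python
# Hypothetical, hand-written context words for each meaning of "aurora".
# A real system would learn these associations from data.
SENSES = {
    "northern or southern lights": {"geomagnetic", "storm", "solar", "sky", "magnetic", "lights"},
    "Norwegian singer": {"album", "song", "concert", "singer", "tour"},
    "trainer (shoe)": {"shoe", "trainer", "running", "size", "laces"},
}


def guess_sense(query: str) -> str:
    """Return the meaning whose context words overlap most with the query."""
    words = set(query.lower().split())
    # Ties (including no overlap at all) fall back to the first meaning listed.
    return max(SENSES, key=lambda sense: len(words & SENSES[sense]))


print(guess_sense("aurora sightings after the geomagnetic storm"))
print(guess_sense("aurora new album and tour"))
```

A plain keyword search would treat both queries identically because both contain the word "aurora"; the sketch separates them using the surrounding words.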
Note 2: Apache
Spark is a software system that allows data to be manipulated by clusters of
computers, enabling large data sets or complex operations to be completed in a
short time. A feature of Apache Spark is that users do not need to concern
themselves with how the work is split across the multiple computers comprising
the cluster.
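
The last point in Note 2 can be illustrated with a single-machine analogy. The sketch below is not Apache Spark: it uses a thread pool from Python's standard library, and the borehole names and groundwater levels are invented for the example. What it shares with Spark is the style of working, where we describe what to compute for each piece of data and leave it to the system to decide which worker does which piece.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

# Hypothetical groundwater levels (metres) per borehole, invented for illustration.
READINGS = {
    "borehole_a": [12.0, 12.5, 13.0],
    "borehole_b": [7.0, 8.0],
    "borehole_c": [20.0, 20.0, 20.0, 20.0],
}


def mean_level(item):
    """Compute the mean level for one (borehole name, readings) pair."""
    name, levels = item
    return name, mean(levels)


def summarise(readings, workers=3):
    """Compute the mean level of every borehole using a pool of workers."""
    # The pool decides which worker handles which borehole;
    # we only state what should be computed for each one.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(mean_level, readings.items()))


print(summarise(READINGS))
```

On a real cluster the same pattern applies at a much larger scale, with the work spread across many computers rather than a few threads in one program.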