Setting ‘long tail’ geological data free
By Mike Stephenson, Qiuming Cheng, Junxuan Fan, Chengshan Wang and Roland Oberhänsli
‘Long tail’ data is the difficult-to-get-at data that sits in libraries, institutes and on the computers of individual scientists. Informatics specialists like to contrast it with the smaller number of large, more accessible datasets. The name ‘long tail’ derives from graphs of the size of datasets against their number: there are relatively few large datasets and a lot of smaller ones. Geological science has more long tail data than sciences like physics or meteorology, probably because historically it has been less associated with big science infrastructure and sensors.
Recovering data islands
The fact that ‘long tail’ data is difficult to get at probably holds back progress in geological science. The low discoverability of long tail geoscience data, and its heterogeneity, make it difficult to bring the data together and so to gain the benefits of machine learning and artificial intelligence. Informaticians describe ‘data islands’ in geological libraries, in the records of geological surveys and on desktop computers.
Two new projects aim to improve the discoverability of these data islands and the ease of compilation (interoperability) of geoscience data. One looks at the fundamentals of making historical paper data in islands available to cyberspace; the other seeks to bind already-digital data together to answer some of the biggest remaining geoscience questions.
The first project is a unique collaboration between a geological survey and the academics and computer scientists of the GeoBioDiversity database (GBDB). Archipelagos of data islands exist within the geological surveys of the world. An example is the British Geological Survey (BGS) biostratigraphical data associated with about 3 million fossils and thousands of localities and stratigraphic sections, gathered over 150 years from all over the country to exacting and consistent standards. The data has great potential for science, but much of it is contained within paper documents or simple document scans and so is inaccessible to big data tools. It needs lifting from the page and into cyberspace. The BGS recently began working with GBDB, the official database of the International Commission on Stratigraphy (ICS), which is almost unique among large databases in holding sequences of fossils tied to sections, rather than just spot collections. To date GBDB and BGS scientists have placed live manipulable data from more than 6000 UK stratigraphic sections on a public access website. The project is also using machine-learning methods to extract biostratigraphical information directly from text.
Deep Time Digital Earth
Another, even more ambitious project seeks to make already-digital geoscience data work together better. The Deep Time Digital Earth (DDE) project has just been approved as the first of the International Union of Geological Sciences (IUGS) ‘Big Science’ programs, and its aim is to harmonize deep time geological data. Again the focus is on data islands, but this time the islands are already digital and are either not linked or cannot easily be used together. Through DDE, data will be available in easily used ‘hubs’ providing insights into the distribution and value of earth resources and materials, as well as earth hazards. Data brought together in new ways may provide novel glimpses into the Earth’s geological past and its future.
An example of how DDE will work concerns the evolutionary history of the biosphere. Previous analyses of long-term paleobiodiversity change were mostly at a resolution of ~10 million years, which is too coarse to reveal fine details of past biodiversity changes. Linked databases in DDE could provide high-resolution (10-100 kyr) diversity patterns. In the realm of minerals, DDE could, for example, integrate database systems for mapping clusters of porphyry copper deposits (PCDs) by linking georeferenced plate motion data with data on the geometric properties of subducted slabs. Linked databases in DDE, including African groundwater and aquifer data, recharge data, meteorological data, sediment flux, subcrop geology, basin subsidence, sequence stratigraphy, compaction, geomechanics and tectonics data, could lead to more accurate models for African groundwater storage, underpinning sustainable development in poor countries vulnerable to climate change.
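The effect of temporal resolution on a diversity curve can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the taxon names and ranges are invented, the function `binned_diversity` is hypothetical and is not part of DDE or GBDB, and the fine bins are 2 million years wide (far coarser than the 10-100 kyr mentioned above) simply to keep the example readable. It counts the taxa whose stratigraphic range overlaps each time bin.

```python
from typing import Dict, List, Tuple

# Invented taxon ranges for illustration only:
# (first appearance, last appearance) in Ma (million years ago).
RANGES: Dict[str, Tuple[float, float]] = {
    "taxon_a": (449.5, 446.5),
    "taxon_b": (449.0, 446.2),
    "taxon_c": (443.8, 440.5),
    "taxon_d": (443.5, 440.1),
    "taxon_e": (449.8, 440.3),  # a long-ranging taxon
}


def binned_diversity(
    ranges: Dict[str, Tuple[float, float]],
    start_ma: float,
    end_ma: float,
    bin_myr: float,
) -> List[Tuple[float, float, int]]:
    """Count taxa whose range overlaps each time bin.

    Bins run from start_ma (older) to end_ma (younger) in steps of bin_myr.
    Returns a list of (bin_old_edge, bin_young_edge, taxon_count).
    """
    n_bins = round((start_ma - end_ma) / bin_myr)
    counts = []
    for i in range(n_bins):
        old_edge = start_ma - i * bin_myr
        young_edge = old_edge - bin_myr
        # A taxon is present if it appeared before the bin ended
        # and disappeared after the bin began.
        n = sum(
            1
            for first, last in ranges.values()
            if first > young_edge and last < old_edge
        )
        counts.append((old_edge, young_edge, n))
    return counts


coarse = binned_diversity(RANGES, 450, 440, 10)  # one 10 Myr bin
fine = binned_diversity(RANGES, 450, 440, 2)     # five 2 Myr bins

print([n for _, _, n in coarse])
print([n for _, _, n in fine])
```

With these invented ranges, the single 10-million-year bin lumps all five taxa together, while the finer bins show a short interval in which only one taxon is present. That is the kind of detail a coarse analysis cannot reveal.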
The DDE is consistent with the vision of the IUGS, which is to promote development of the Earth sciences through the support of broad-based scientific studies relevant to the entire Earth system. It brings together an almost unique range of partners including the ICS, the International Association of Palaeontology (IAP), the International Association of Sedimentologists (IAS), the Society for Sedimentary Geology (SEPM) and the International Association for Mathematical Geosciences (IAMG). Major geological surveys and institutes including the China Geological Survey (CGS), the BGS and the All Russian Geological Institute (VSEGEI) are also involved.
These institutions are coming together at a time when informatics and computing are evolving fast, but when a wide range of geoscience data has not, until now, been available to those tools. In this way DDE may help to solve some of the biggest remaining geoscience questions.
DDE is being linked with UNESCO, the International Geosphere-Biosphere Programme (IGBP), the Global Sedimentary Geology Program (GSGP), the International Geoscience and Geopark Program (IGGP), the Commission of the Geologic Map of the World (CGMW), the Global Geochemical Baseline (GGB), the International Lithosphere Program (ILP), and OneGeology. DDE will also apply the full FAIR data concept (Findable, Accessible, Interoperable and Reusable) and link to desktop systems for geoscientists all over the world, as well as to students and teachers in classrooms and on the internet.
Building bridges between data islands
Geology could be said to have lagged behind other physical sciences in capitalizing on its big data, but DDE will enable bridges to be built between data islands, and data to be interrogated using modern tools, to tackle some of the most important and pressing questions of our time. The project has an ambitious time frame but aims to report its first progress at the 36th International Geological Congress in New Delhi in March 2020.