Tuesday, 1 November 2016

Hacking semantic paths to make BGS resources more accessible... by Rachel Heaven

Rachel Heaven works in geoinformatics, developing spatial databases, applications and standards to share and visualise geoscience information.

BGS ran an internal hackathon a couple of months ago and Team Semantic Search (starring Agelos Deligiannis, Rachel Heaven, Tim McCormick, Gemma Nash, Ike Nkisi-Orji, Marcus Sen) took on the challenge to implement a semantically and spatially intelligent search service.

The aim was to improve access to BGS's resources with an enhanced search tool that can cut through the tangle of complex geoscience terminology, and to improve navigation between information resources.

Why should we do this?

BGS is custodian of a wealth of textual information, including the important observations and interpretations that accompanied BGS’s traditional core product, the hardcopy 2D geological map. Our digital-era information products make it easy to view interpretations of the spatial extent of geological properties but they are not easy to discover, are difficult for non-experts to understand and they are divorced from the documented evidence they are based on. These provenance links are increasingly important when information is used in decision making.

So how do we make a search "semantically intelligent"?

Well, if you do a regular web search or directory search for content, all you are matching is a sequence of letters. The search engine doesn't know that other terms mean the same thing, or are related to your search intent in some way, or that the term you are using has different meanings in different eras or contexts – all of which happens a lot in the long history of geoscience terminology. All of these problems mean that the user has to filter out "false positives" in the search results and often run repeated searches using whatever alternative terms they know about, potentially missing valid results, especially if they don't know the experts' terms.

But what if the experts who know all about the terminology, and all the relationships between the terms and the things they refer to (we call them concepts), had already captured all of that knowledge (we call that an ontology)? And even better, what if that was available in a machine-readable form and the search engine could use it with some algorithms that made it behave as if it knew what your search term meant? That is what we mean here by semantically intelligent.
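To make the idea concrete, here is a toy sketch (not any BGS system – the concepts and relations are invented for illustration) of how a machine-readable ontology lets a search expand a query: a couple of SKOS-style relations between concepts, and a naive expansion that adds synonyms and narrower terms.

```python
# Toy illustration only: a few SKOS-style relations between concepts,
# and a naive query expansion over them.
ONTOLOGY = {
    "mudstone": {"altLabel": ["mudrock"], "broader": ["sedimentary rock"]},
    "shale":    {"altLabel": [], "broader": ["mudstone"]},
}

def expand_query(term):
    """Return the term plus its synonyms and narrower concepts."""
    terms = {term}
    concept = ONTOLOGY.get(term, {})
    terms.update(concept.get("altLabel", []))
    # include narrower concepts: anything whose broader list names the term
    for other, rels in ONTOLOGY.items():
        if term in rels.get("broader", []):
            terms.add(other)
    return sorted(terms)

print(expand_query("mudstone"))  # ['mudrock', 'mudstone', 'shale']
```

A search for "mudstone" can then match documents that only mention "mudrock" or "shale" – behaviour a plain text match can never give you.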

The "spatially intelligent" bit means we also want the search engine to understand the various ways that locations are mentioned in unstructured text so that a user can find all information relevant to a chosen location no matter how it is represented.

The good news is that this has been a goal of the web community for some time (e.g. Tim Berners-Lee's Semantic Web paper in 2001) and there are standards we can use to specify the terminology, and tools to use them (e.g. the W3C Data Activity that BGS are contributing to).

Furthermore, codifying geoscience terminology has been an activity of BGS for a long time - before any sniff of the semantic web - so we have some really mature vocabulary resources, some of which we published in the appropriate web standard form a few years ago, and we also work on internationally standardised vocabularies with the IUGS-CGI Geoscience Terminology Working Group.

To push things along further BGS have been supporting PhD research (with Robert Gordon University) into these topics, and our student Ike was able to join us at the hackathon and contribute algorithms he has written to use the ontologies.

So what happened during the hackathon?

Planning
With handy post-it notes on a flip chart, we identified six separate components to work on in parallel:
  A. document indexing for the search engine
  B. application to run the search service as a simple search
  C. extending the search service to use the ontologies
  D. web interface for search
  E. coordinate to placename converter
  F. links to the search from Groundhog Web virtual cross section and borehole viewer (as an example, just because that's an application that I develop anyway so I know the codebase)
Indexing
On task A, Rachel, Marcus and Ike worked together to install Elasticsearch (open-source search engine software) and used it to create an index of plain text terms in a set of BGS publications – no ontologies involved yet. After a few false starts this was then left to cook overnight.
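As a rough illustration of what this indexing step produces, here is a minimal inverted index in Python – the core data structure behind a search engine such as Elasticsearch, mapping each term to the documents that contain it. The documents and identifiers below are invented, and the real index also stores positions, term frequencies and much more.

```python
# Minimal sketch of an inverted index: term -> set of document ids.
from collections import defaultdict

documents = {
    "report-001": "mudstone and siltstone of the Mercia Mudstone Group",
    "report-002": "glacial till overlying Sherwood Sandstone",
}

index = defaultdict(set)
for doc_id, text in documents.items():
    for token in text.lower().split():
        index[token].add(doc_id)

def search(term):
    """Plain text match: only documents containing this exact token."""
    return sorted(index.get(term.lower(), set()))

print(search("mudstone"))  # ['report-001']
print(search("till"))      # ['report-002']
```

At this stage a search is still just token matching – which is exactly the limitation the ontology work in task C sets out to fix.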

On task B, Marcus and Ike implemented Ike's existing search application as an Elasticsearch client to run a simple text search.
Indexing complete
Search service running
Search interface working

On task D, Gemma created a web front end to launch the search, display the results, highlight relevant terms in the PDF documents and capture user evaluation of the search results – useful for Ike's PhD. A "high five at 11.25" moment came on the second day when the search was working properly and the web interface submitted searches, showed results and highlighted the found terms in the PDF documents.

Tim McCormick concentrating
On task E, Tim started adapting (and then found it was quicker to write from scratch – hacking isn't always the best option for a quick win then!) a few database functions in PL/SQL to create a gazetteer translation and expansion tool. This meant querying a corporate spatial database of OS administrative placenames, map sheet names and some geological feature names, to convert a coordinate location to a list of placenames, and to expand a single placename to a list of all co-located placenames. This performs a similar function to Ike's semantic comparison algorithm but in geographic space. Marcus created a small web service that would work as an interface to Tim's function.
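To sketch the idea in a self-contained way (the real tool was PL/SQL against the corporate OS placename tables; the placenames, coordinates and radius below are just illustrative stand-ins):

```python
# Toy gazetteer: placename -> (easting, northing), values invented.
import math

GAZETTEER = {
    "York":         (460500, 451800),
    "Vale of York": (459000, 450000),
    "Selby":        (461500, 432300),
}

def placenames_near(easting, northing, radius_m=5000):
    """Coordinate -> list of placenames within radius_m (translation)."""
    return sorted(
        name for name, (e, n) in GAZETTEER.items()
        if math.hypot(e - easting, n - northing) <= radius_m
    )

def expand_placename(name, radius_m=5000):
    """Placename -> all co-located placenames, including itself (expansion)."""
    e, n = GAZETTEER[name]
    return placenames_near(e, n, radius_m)

print(expand_placename("York"))  # ['Vale of York', 'York']
```

A search for "York" can then also retrieve documents that only mention "Vale of York" – the geographic analogue of the ontology-based expansion in task C.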

Search links added to virtual borehole viewer
On task F, Agelos and I tried (and failed) to get the BGS Groundhog Web code to compile locally (the new PC was missing some vital configuration that we couldn't pin down), so in the end I captured an example page from its output and hardcoded new links to Gemma's search form and to Marcus's interface to Tim's gazetteer tool. Going great at 12.08 – even with a bit of cheating!

Existing BGS Lexicon entry for Vale of York Formation
On task C, Ike continued indexing the document collection using concepts from the geological timescale ontology (BGS Chronostratigraphy), and adapting the search engine to use that ontology in his query expansion, semantic comparison and relevance ranking algorithms. Agelos also developed some web scraping scripts to pick up BGS Lexicon terms from structured web pages, so that search links could be applied that way if we wanted.
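Ike's actual algorithms are part of his PhD research, but the general idea of semantic comparison over a concept hierarchy can be sketched like this: a small fragment of the geological timescale as a broader-than tree, and a simple path-based similarity in which sibling stages score higher than concepts from different periods. The scoring formula here is a naive illustration, not Ike's method.

```python
# Fragment of the geological timescale as child -> parent links.
PARENT = {
    "Sinemurian": "Early Jurassic",
    "Hettangian": "Early Jurassic",
    "Early Jurassic": "Jurassic",
    "Late Jurassic": "Jurassic",
    "Jurassic": "Mesozoic",
    "Triassic": "Mesozoic",
}

def ancestors(concept):
    """The concept followed by its chain of broader concepts."""
    chain = [concept]
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain

def similarity(a, b):
    """1 / (1 + number of edges between a and b via their common ancestor)."""
    path_a, path_b = ancestors(a), ancestors(b)
    for i, node in enumerate(path_a):
        if node in path_b:
            return 1.0 / (1 + i + path_b.index(node))
    return 0.0

# Sibling stages of the Early Jurassic score higher than a concept
# from a different period:
print(similarity("Sinemurian", "Hettangian"))  # 0.3333...
print(similarity("Sinemurian", "Triassic"))    # 0.2
```

A relevance ranker can use such scores to place documents about closely related time intervals above those that merely share an ancestor somewhere up the tree.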

Going like a dream at 13.17! Tasks A, B, D, E and F complete and ready to demonstrate at the final presentation at 2pm. Task C was always going to be the tricky bit so we weren't too worried that it wasn't finished yet.

At the close of the hackathon we were able to demonstrate the web front end to the search and show it running to retrieve text-matched results from the indexed document collection. We also demonstrated a Groundhog virtual cross section from the Vale of York 3D model with new context sensitive "Search publications" links from the legend of model layers that open the search form pre-populated with the placenames and geological time or formation name term relevant to that part of the cross section. Just after the hackathon, and just a little too late to demonstrate, Ike managed to plug in the full semantic comparison algorithm using the Chronostratigraphy ontology, completing task C.

The hackathon judging panel were impressed at how much we achieved, and were excited about the possibilities and the way it could help users discover and navigate through our wealth of resources. We all enjoyed working in a new environment and with a different team of people than usual – despite the extreme heat on those days! On a personal level this team effort brought together various strands of work that I have been involved in – sometimes on the sidelines – for a number of years, so it was really satisfying to finally have something to show. Huge thanks to all the great team members.

What happens next?

We would like to build on the work we did to:
  • implement a more robust version on our intranet for staff to assess
  • add further ontologies to the search tool
  • use third party online data sources or APIs for some of the gazetteer translation and expansion service, rather than having to maintain our own copies of OS data
  • index documents by location, by geoparsing for recognisable coordinates or for proxies for locations such as borehole registration numbers
  • provide a similar application on the BGS external website to search publications, showing snippets of documents and links to the BGS shop if the publication is not open access
  • eventually implement a single point of entry to search and navigate through all BGS website resources
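For the geoparsing item, a very rough sketch of the idea: a simplified regular expression for spotting Ordnance Survey grid references in free text. Real grid references need proper validation of the letter pair and digit counts, and location proxies such as borehole registration numbers would need their own patterns; the example text is invented.

```python
# Simplified pattern: two OS grid letters (first letter H, J, N, O, S
# or T for Great Britain) followed by two groups of digits, with
# optional spaces. Illustrative only - it will over-match.
import re

GRID_REF = re.compile(r"\b[HJNOST][A-Z]\s?\d{2,5}\s?\d{2,5}\b")

text = "Boreholes near SE 605 523 proved glacial deposits; see also NZ2364."
print(GRID_REF.findall(text))  # ['SE 605 523', 'NZ2364']
```

Each match could then be converted to a coordinate and fed through the gazetteer expansion from task E, so that documents become discoverable by location even when they never name a place.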
