Wednesday, April 29, 2015

Creating Linked Open Data of a Taxonomic Description


Linked Open Data is the entry point to Big Data. As a taxonomist with a strong belief that we have to bring our data to the public, rather than wait for the public to look for it, I am determined to make this happen.

The process

This post describes and quantifies the workflow, developed and implemented at Plazi, that leads from the discovery of a name to making the referenced taxonomic name available as LOD.

1. Starting point. While preparing my lectures on chemical communication in ants, I stumbled upon this note in sci-news describing a novel form of social parasitism based on the discovery of Cephalotes specularis, described in 2014 by Brandão et al. in Zootaxa, for which a DOI is provided (DOI: zootaxa.3796.3.9). Unfortunately, this is a closed access article. A search led via antcat to antwiki, where a copy is available. To start processing the document into LOD, I added the article to the Biodiversity Literature Repository as a closed access article. The record shows the original Zootaxa-minted DOI, as well as an alternative identifier from Zoobank.
2. Adding the name to the Hymenoptera Name Server. Our reference system for ant names (HNS) showed that the bibliographic reference of the article had already been added (reference), but a search for the name returned no entry. So I added the name to HNS through the online form. Here it is in HNS, and here in HOL, to which we will link from the treatment that we are going to produce now. We need a name server that is under our control, since Zootaxa no longer adds new names to Zoobank.

3. To convert the above article into a semantically enhanced document I use our Imagine software, which is already installed on my machine. A manual is available. I added the metadata of the publication; parsed all the taxonomic names, the bibliographic records (which are included in Refbank upon saving), the treatment and its structure, and the materials citations; and linked the name to HNS. The result is here as HTML or RDF, which has been the goal of the exercise.
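To give an idea of what "the treatment as RDF" means in practice, here is a minimal sketch that writes the three links created in the steps above (treatment to name, treatment to article, treatment to name-server record) as N-Triples. Everything under example.org, including the property names and the identifiers, is a placeholder I made up for illustration; the RDF that Plazi actually serves uses its own URIs and vocabularies, which are not reproduced here.

```python
# Hypothetical namespace: Plazi's real URIs and vocabularies differ.
EX = "http://example.org/"


def uri(local):
    """Wrap a local name in the placeholder namespace as an N-Triples IRI."""
    return f"<{EX}{local}>"


def literal(text):
    """Quote a plain string literal (enough for simple one-line strings)."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def treatment_triples(treatment, name, article, name_record):
    """Link a treatment to its taxon name, source article and name-server record."""
    return [
        (uri(treatment), uri("definesTaxonName"), literal(name)),
        (uri(treatment), uri("publishedIn"), uri(article)),
        (uri(treatment), uri("nameRecord"), uri(name_record)),
    ]


def to_ntriples(triples):
    """Serialize (subject, predicate, object) tuples, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


triples = treatment_triples(
    "treatment/1", "Cephalotes specularis", "article/1", "hns/1"
)
print(to_ntriples(triples))
```

The point of the sketch is only that each manual step in the workflow ends up as one machine-readable statement that others can harvest.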


Time from reading a name to finding the referenced article: 5 min
1. Time to upload the article to BLR: 5 min
2. Time to add the name to HNS: 2 min
3. Time to convert the PDF into a semantically enhanced document and upload it to Plazi: 21 min
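Adding up the timings listed above gives the overall effort per name. A small sketch of the arithmetic (the step labels are my own shorthand; the minutes are the ones reported above):

```python
# Minutes per step, as reported in the list above.
step_minutes = {
    "find the referenced article": 5,
    "upload the article to BLR": 5,
    "add the name to HNS": 2,
    "convert the PDF and upload to Plazi": 21,
}

total = sum(step_minutes.values())
conversion_share = step_minutes["convert the PDF and upload to Plazi"] / total

print(f"Total: {total} min")  # Total: 33 min
print(f"Share spent on conversion: {conversion_share:.0%}")
```

So the whole path from name to LOD took about half an hour, most of it spent on the conversion step.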


This time frame is only achievable with training and an understanding of the semantic structure of a taxonomic work, its model, and the tagging tools.

It also depends on having access to all the resources, including Plazi (access can be obtained).


Why make all this effort?

The advantages are obvious:
1. 90% of name usages look like this: Anochetus grandidieri Forel, 1891, or Fisher & Smith, 2008 cite Anochetus grandidieri Forel. Neither the referenced publications nor the treatments are linked. The most obvious case is the Catalogue of Life, where no linking is provided.
2. Much better already is providing (via a link, or directly) the proper bibliographic reference for Forel, 1891 or Fisher & Smith, 2008.
3. Even better is a link to a catalogue, such as Anochetus grandidieri Forel 1891 in the Hymenoptera Name Server, or more explicitly a persistent URI, which allows this name to be cited properly. Ultimately, we need a universal system.
4. The next improvement is a direct link from the name to the respective article. Ideally the article is in an archive that provides a DOI, e.g. 10.5281/zenodo.9896, as the Biodiversity Literature Repository can provide for legacy literature.
5. The next is a direct link to a digital version of the cited page: Anochetus grandidieri Forel, 1891: 108, allowing readers to see directly what the respective author had in mind when he created the concept (the principle of reproducibility in science).
6. The next is a direct link to the treatment. Ideally, the treatment has a persistent identifier, such as this HTTP URI.
7. The next is to link cited treatments to the respective treatment, such as Fisher & Smith's (2008) usage of Anochetus grandidieri.
8. The final step is to create 5-star data by providing all this content in an open, machine-readable, semantically enhanced version: Anochetus grandidieri Forel 1891 or Anochetus grandidieri Forel sensu Fisher & Smith, 2008.
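The eight steps above can be read as a ladder, and one way to make that concrete is to record which links a given name usage carries and report how far up the ladder it gets. The sketch below is only an illustration: the `NameUsage` class, its field names, and the rule that a level counts only if all earlier ones are present are my assumptions, and the example.org URI is a placeholder. The only outside fact used is that a DOI resolves through https://doi.org/.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class NameUsage:
    """One citation of a taxonomic name; field order follows steps 1 to 8."""
    name: str                                  # step 1: bare name string
    reference: Optional[str] = None            # step 2: bibliographic reference
    catalogue_uri: Optional[str] = None        # step 3: catalogue / name server
    article_doi: Optional[str] = None          # step 4: DOI of the article
    page_uri: Optional[str] = None             # step 5: digital cited page
    treatment_uri: Optional[str] = None        # step 6: the treatment itself
    cited_treatment_uri: Optional[str] = None  # step 7: cited treatment
    rdf_uri: Optional[str] = None              # step 8: machine-readable version


def linking_level(usage: NameUsage) -> int:
    """Return the highest step reached without skipping an earlier one."""
    level = 0
    for field in fields(usage):
        if getattr(usage, field.name) is None:
            break
        level += 1
    return level


def doi_url(doi: str) -> str:
    """Turn a DOI into a resolvable HTTP URI via the doi.org resolver."""
    return f"https://doi.org/{doi}"


bare = NameUsage("Anochetus grandidieri Forel, 1891")
in_catalogue = NameUsage(
    "Anochetus grandidieri Forel, 1891",
    reference="Forel, 1891",
    catalogue_uri="http://example.org/hns/anochetus-grandidieri",  # placeholder
)

print(linking_level(bare))          # 1
print(linking_level(in_catalogue))  # 3
print(doi_url("10.5281/zenodo.9896"))
```

Most usages on the Web today would score 1 on this scale; the workflow described in this post is meant to bring a name to 8.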

Why not rely on existing resources, such as antwiki or antcat?

How long does it take for this new treatment to propagate across the Web, especially through those who harvest data from Plazi?

Starting time is 20150429:11:57