Thursday, May 07, 2015

The Global Registry of Biorepositories (GRBio)

Wouldn't it be great if we had a registry of all the institutions holding biological collections, including baseline information, contact addresses and, even better, a list of all their specimens, the DNA samples extracted from them, images, and publications, not only those by their staff but also those citing specimens in the collection? Whilst much of this may lie in the future, one simple element, a unique persistent identifier for each collection, would be extremely helpful, so that we could link our taxonomic publications with those institutions once and for all.

Whilst reviewing a manuscript this morning, I suggested that the authors add links to the respective institutions in the Global Registry of Biorepositories. Then I suddenly felt I had to look up some of the collections myself, and it was well worth it.


CAS: California Academy of Sciences, San Francisco, U.S.A. Not in the registry.
MCB: D. Mezger Collection, Balingen, Germany. Not in the registry.
NHMW: Natural History Museum Vienna, Austria. Not in the registry.
NMM: National Museum of the Philippines, Manila, the Philippines. Registered as MPMP, which is not the name provided in the manuscript.
UKL: University of Koblenz-Landau (Campus Landau), Germany. Not in the registry.
WCD: P.S. Ward Collection, Davis, California, U.S.A. Not in the registry.
SCV: D.M. Sorger Collection, Vienna, Austria. Not in the registry.
ZCV: H. Zettel Collection, Vienna, Austria. Not in the registry.

So, not exactly helpful, and obviously one cannot recommend using the registry right now.

But what I found much more astonishing is that all the discussions in our domain regarding Linked Open Data and persistent identifiers (e.g. Guralnick et al., 2015; Bouchout Declaration) seem to be unheard of. How come a new project is launched without giving this top priority?

Friday, May 01, 2015

The STM publishers are not our friends.

Data Mining and Extraction

The Goal

I am interested in extracting and mining data from scientific publications. I want to be able to run such analyses over large corpora of publications, first to discover which journals include the relevant data, and in a following step to run the extraction or mining processes over them.

The focus here is not on the problem of finding and building up such a corpus, but on accessing and extracting content from a PDF.

The Extracted Text

From this article (doi: 10.1111/1365-2656.12367), I would like to extract the text so I can run it through my tools (Imagine) to find the target elements.

This is what I get: not really the same nicely structured text as in the original.

And this is not an artifact of my simple, even simplistic, way of getting the text out of a PDF; the problem is repeated throughout the whole PDF. In fact, most of the journals we look at come with problems of their own, ranging from slightly to very different.
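To make "not the same text" a bit more tangible, here is a minimal sketch of the kind of clean-up and sanity check one might run on extracted text before handing it to any mining tool. It assumes the raw text has already been pulled out of the PDF by some extractor; the function names and the single-letter heuristic are my own illustration and not part of Imagine.

```python
import re
import unicodedata


def clean_extracted_text(raw: str) -> str:
    """Repair a few common extraction artifacts (illustrative only)."""
    # NFKC folds typographic ligatures such as the single character "fi" into plain letters.
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin words that were hyphenated across a line break.
    # Caveat: a genuine hyphenated compound split at a line end loses its hyphen.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs left over from the page layout.
    return re.sub(r"[ \t]+", " ", text)


def single_letter_ratio(text: str) -> float:
    """Share of tokens that are lone letters, a rough hint of letter-spaced output."""
    tokens = text.split()
    if not tokens:
        return 0.0
    singles = [
        t for t in tokens
        if len(t) == 1 and t.isalpha() and t.lower() not in {"a", "i"}
    ]
    return len(singles) / len(tokens)


sample = "The \ufb01rst speci-\nmen was  col-\nlected in  Madagascar."
cleaned = clean_extracted_text(sample)
print(cleaned)

letter_spaced = "T h e f i r s t s p e c i m e n"
print(single_letter_ratio(letter_spaced))
```

A high ratio on a page is only a hint that the text layer is damaged; it does not repair anything, and each journal tends to need its own set of such checks.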

Whose Problem is it?

So, is this a publisher's problem? To some extent yes: if publishers were interested in providing access, they would make sure that extraction is possible, and they would not keep changing the way they encode PDFs. PDFs are now all created from manuscripts, mainly written in MS Word, in which those artifacts are not present.

At the same time, it also seems to be a strategy by Adobe, one of the companies providing tools to create PDFs. An indication might be that when I save a web page, which is pure HTML, I get a PDF that is a mere image with no text available anymore, thus losing all the encoding as well as all the structure. That means that if I want to extract text from it, I need OCR, which is itself cumbersome and introduces new errors, not to speak of the time this extra step takes.

Now you can look at me as a naive user of modern electronic resources. I would argue that I am rather an advanced user. Our experience comes from mining corpora of literature assembled by taxonomic specialists, which is where corpora are being built in our domain, such as ca. 4,500 PDFs covering ant systematics, or ca. 16,000 on drosophilid taxonomy. These corpora are not available otherwise, and the modern publishing industry is not interested in helping to make this happen. Thus the "naive" user is where research is going on, and therefore plays an important role in advancing science.

If STM publishers and Adobe are interested in contributing to the future of science, then they should give up their efforts to hamper access.

Wednesday, April 29, 2015

Creating Linked Open Data of a Taxonomic Description

Challenge

Linked Open Data is the entry point to Big Data. As a taxonomist with a strong belief that we have to bring our data to the public, rather than wait for the public to look for it, I am determined to make this happen.

The process

This post describes and quantifies the workflow we developed and implemented at Plazi, from the discovery of a name to making the referenced taxonomic name available as LOD.

1. Starting point. While preparing my lectures on chemical communication in ants, I stumbled upon this note in sci-news describing a novel form of social parasitism based on the discovery of Cephalotes specularis, described in 2014 by Brandão et al. in Zootaxa, for which a DOI is provided (DOI: zootaxa.3796.3.9). Unfortunately, this is a closed access article. A search provided a link to it via antcat to antwiki, where it is available. To start processing the document into LOD, I added the article to the Biodiversity Literature Repository as a closed access article. The record shows the original Zootaxa-minted DOI as well as an alternative identifier from Zoobank.
 
2. Adding the name to the Hymenoptera Name Server (HNS), our reference system for ant names. A check showed that the bibliographic reference of the article had already been added (reference), but a search for the name did not return an entry. So I added the name to HNS through the online form. Here it is in HNS, and here in HOL, to which we will link from the treatment that we are going to produce now. We need this name server, which is under our control, since Zootaxa no longer adds new names to Zoobank.

3. To convert the above article into a semantically enhanced document I use our Imagine software, which is already installed on my machine. A manual is available. I added the metadata of the publication, parsed all the taxonomic names and bibliographic records (which are included in Refbank upon saving), marked up the treatment, its structure and the materials citations, and linked the name to HNS. The result is here as HTML or RDF, which has been the goal of the exercise.

Time

Time from reading a name to finding the referenced article: 5 min
1. Time to upload the article to BLR: 5 min
2. Time to add the name to HNS: 2 min
3. Time to convert the PDF into a semantically enhanced document, uploaded to Plazi: 21 min
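Added up, the figures above come to about half an hour per name. A small sketch of the arithmetic, using only the minutes listed above (the step labels are shortened by me):

```python
# Minutes per step, as listed above.
steps_minutes = {
    "find the referenced article": 5,
    "upload the article to BLR": 5,
    "add the name to HNS": 2,
    "convert the PDF and upload to Plazi": 21,
}

total = sum(steps_minutes.values())
print(f"Total: {total} min")

# Share of the total spent on each step, rounded to whole percent.
for step, minutes in steps_minutes.items():
    print(f"{step}: {round(100 * minutes / total)}%")
```

The conversion step accounts for roughly two thirds of the time, so that is where training and tooling pay off most.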

Caveat

This only works within this time frame with training, and with an understanding of the semantic structure of a taxonomic work, its model and the tagging tools.

It also depends on having access to all the resources, including Plazi (access can be obtained).

Questions

Why make all this effort?

The advantages are obvious:
1. 90% of the name usages look like this: Anochetus grandidieri Forel, 1891, or Fisher & Smith, 2008 cite Anochetus grandidieri Forel. Neither the referenced publications nor the treatments are linked. The most obvious case is the Catalogue of Life, where no linking is provided.
2. Much better already is to provide (via a link, or directly) the proper bibliographic reference for Forel, 1891 or Fisher & Smith, 2008.
3. Even better is a link to a catalogue, such as Anochetus grandidieri Forel 1891 in the Hymenoptera Name Server, or more explicitly a persistent URI, http://bioguid.osu.edu/xbiod_concepts/187786, that allows this name to be cited properly. Ultimately, we need a universal system.
4. The next improvement is to get from the name a direct link to the respective article. Ideally the article is in an archive that provides a DOI, e.g. 10.5281/zenodo.9896, as the Biodiversity Literature Repository can provide for legacy literature.
5. The next is to get a direct link to a digital version of the cited page: Anochetus grandidieri Forel, 1891: 108, making it possible to see directly what the author had in mind when he created the concept (the principle of reproducibility in science).
6. The next is to get a direct link to the treatment. Ideally, the treatment has a persistent identifier, such as this HTTP URI: http://treatment.plazi.org/id/1C4EDC17-8AD7-9DD7-F1A5-AB856E8C5BCA.
7. The next is to link cited treatments to the respective treatment, such as the usage of Anochetus grandidieri by Fisher & Smith, 2008.
8. The final step is to create 5* data by providing all this content in an open, machine readable, semantically enhanced version: Anochetus grandidieri Forel 1891 or Anochetus grandidieri Forel sensu Fisher & Smith, 2008.
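What such linked statements could look like can be sketched in a few lines. The following is only an illustration, not the RDF that Plazi actually produces: the predicate vocabulary under example.org is made up, and treating the three identifiers quoted above as describing the same name, article and treatment is an assumption made for the sake of the example.

```python
# Hypothetical vocabulary; a real export would use published ontologies instead.
VOCAB = "http://example.org/vocab#"

# Identifiers quoted in the list above. That they all describe the same
# name, article and treatment is an assumption made for this illustration.
NAME_URI = "http://bioguid.osu.edu/xbiod_concepts/187786"
TREATMENT_URI = "http://treatment.plazi.org/id/1C4EDC17-8AD7-9DD7-F1A5-AB856E8C5BCA"
ARTICLE_URI = "https://doi.org/10.5281/zenodo.9896"


def iri(value: str) -> str:
    """Wrap an identifier in angle brackets, as N-Triples expects."""
    return f"<{value}>"


def literal(value: str) -> str:
    """Quote a plain string literal, escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def triple(subject: str, predicate: str, obj: str) -> str:
    """Serialize one statement as an N-Triples line."""
    return f"{subject} {predicate} {obj} ."


triples = [
    triple(iri(TREATMENT_URI), iri(VOCAB + "treatsName"), iri(NAME_URI)),
    triple(iri(TREATMENT_URI), iri(VOCAB + "publishedIn"), iri(ARTICLE_URI)),
    triple(iri(NAME_URI), iri(VOCAB + "nameString"),
           literal("Anochetus grandidieri Forel 1891")),
]

print("\n".join(triples))
```

The point is that every element of a statement except the name string is a resolvable identifier, so a machine can follow the links from name to treatment to article without guessing.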

Why not rely on existing resources, such as antwiki or antcat?

How long does it take for this new treatment to be propagated on the Web, especially by those that harvest data from Plazi?

HOL
Antweb
GBIF
EOL 
Starting time is 20150429:11:57
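To keep track of the propagation, the starting time above can be parsed and compared with the moment the treatment first shows up at each harvester. A minimal sketch follows; the observation time in it is invented purely to show the calculation, and no time zone is assumed since none is given above.

```python
from datetime import datetime, timedelta

# Format of the starting time as written above: YYYYMMDD:HH:MM.
START_FORMAT = "%Y%m%d:%H:%M"
start = datetime.strptime("20150429:11:57", START_FORMAT)


def propagation_delay(first_seen: datetime) -> timedelta:
    """Time elapsed between the upload to Plazi and the first sighting elsewhere."""
    return first_seen - start


# Hypothetical observation, NOT a measured result.
example_seen = datetime(2015, 4, 30, 9, 0)
delay = propagation_delay(example_seen)
print(delay)
```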



Thursday, October 18, 2012

Conservation Commons

Is the Conservation Commons dead? A review document regarding its principles, submitted by the current secretariat of the Conservation Commons at WCMC, is currently being discussed at the 11th Conference of the Parties of the Convention on Biological Diversity in Hyderabad, in document cop-11-wg-02-crp-01. The relevant paragraph is 14 and has the following language:
14.    Notes the recommendations made by the Conservation Commons in document UNEP/CBD/COP/11/INF/8 and calls upon Parties and other stakeholders to consider how they can most effectively address barriers to data access that are under their direct control with a view to contributing to the achievement of the Aichi Biodiversity Targets, and Targets 1 and 19 in particular, and requests SBSTTA to develop further guidance thereon;
This is extremely lame: "address barriers to data" seems to be anything but what has been at the core of the CBD, that is, Access and Benefit Sharing. But those at the meeting selling the CC, and most likely the recipients of such a service, obviously did not and do not want to understand the relevance of the notion of a Commons. If they did, the sentence would at least have been written "REMOVE barriers to data".

In my view this just reflects the partnership of the conservation community with private industry, which is anything but in favour of open access to data, and the ignorance of the respective policy makers at this policy-making forum.

But it also means that we have to be more alert to these not easily understandable processes and participate more actively. It is always easier to complain afterwards.

To be afraid of the future or not to be

John Wilbanks's talk at TED is a highlight and stimulating at the same time. It puts the ongoing debate about privacy rights, right now in the EU, in a very different light, if it does not make it very questionable. Are we really all so afraid of the future?

This reminds me of the ongoing discussion on Open Access in the scientific world. In the debate on privacy rights, a central point is to deny Google the right to federate all its different databases and, even more, to break down restrictions on the re-use of data (according to EU privacy law, data can only be used for the purpose for which it has been collected). Similarly, scientists debate access to single PDFs. Neubauer, as an example, misses in his column the point that there is new, emerging power in a federation of all the content that we scientists produce, and that this is the really new character of the Internet.

Why are we so defensive in a world that we all enjoy, where we have no experience with these new tools that have already completely transformed our lives? We all look back to the totalitarian regimes that were spying on us, and at the same time just accept that we are spied on by our own democratic states in the name of our protection against terrorism. We have no control over that, so why are we worried about data that in many cases we produce ourselves, deliberately because we use Facebook or Twitter, or not so deliberately because we do not understand our gadgets well enough to stop them recording our actions?

As John states, we should rather adhere, as an example, to the principle of consent to research, and be proactive by putting our data out with the foresight that a federation of many data sets, together with the ingenuity of people doing things with it, like analyzing medical data in John's case, will save our lives, not the privacy policy restricting the use of our medical records.

The consequences of non-access, which we have been living with so far and seem to increasingly promote, should also be a warning to us:
Back in 2002, governments around the world agreed that they would achieve a significant reduction in biodiversity loss by 2010. But the deadline came and went and the rate of loss increased. [BBC News, Oct 12, 2012]
We scientists are among the culprits because, and this goes back not just to 2002 but in fact to 1992, when the Convention on Biological Diversity was born, we just hide our information and at best release it one PDF at a time.


Tuesday, February 28, 2012

Conservation Commons

It looks as if the Conservation Commons is dead. The possible culprit: WCMC. The reason: a conflict of interest between sharing data and, more vitally, having its own data to sell to commercial entities like the big mining and oil companies.

Or does it have to do with the fact that WCMC, the current host of CC, is too much of a corporation and has no funds to spend on this?

It could also be that this is completely wrong.

In any case, CC has to be revitalized, and an analysis done to understand how not just to make it work, but to spread its ideas.

Tuesday, March 15, 2011

Citizen Scientists: A Positive View

Volunteers play a vital role in ensuring that a range of valuable long-term datasets continue to survive, a team of scientists will say. [BBC online, March 15]


I am critical of the trend to increasingly involve citizen scientists in projects and tasks that scientists traditionally did themselves but can no longer do, or simply don't have the resources for. A very typical example is the effort by GBIF or EOL to enlist volunteers to help create content (data) for their services.

It is not that there are not many very skillful and dedicated people out there. Rather, two elements concern me: knowing whether something is relevant or not, which requires understanding the topic in a wider sense, and the reliability of the data generation in terms of commitment. A commitment driven purely by interest makes these volunteers very dedicated, but at the same time certain tasks do not get done, because they are less attractive or fall at inconvenient times.
This is especially of concern for long-term monitoring studies, such as those birders do. Collaborations of this kind need their own skills: managing a crowd to which you cannot promise a financial reward, only motivation.

The story reported by the BBC, which mainly reflects the work done by Earthwatch, makes just the opposite point: many of the long-term observation studies only work because of the volunteers.

This is a very positive note, similar to the fact that very many taxonomists are amateurs and produce a huge wealth of knowledge.

Monday, March 14, 2011

The African Wolf: no data accessible


The discovery that there is a wolf and not a jackal in Africa has recently been widely reported in the news. This is based on a publication in PLoS ONE by Rueness et al. There has been some older morphological evidence for this long-kept secret: despite the saying that there is no wolf in Africa, Egyptian zoologists have always held that there is a wolf population in Fayoum that is distinct from the jackals in other places. Furthermore, the jackals also seemed to be small relative to the rest of the population, especially those from the Qattara Depression.

When I read through the article by Rueness et al., it was striking that it is impossible to find out which specimens they used and where they originate from. The citation leads to the master's thesis by Nassef (Nassef M (2003) The Ecology and Evolution of the golden jackal (Canis aureus) Investigating a cryptid species. Master thesis. The university of Leeds.), where I cannot get any further. A first search on Google doesn't reveal the whereabouts of Nassef, so I will contact the authors.

I think it is not a good policy for either PLoS ONE or the authors to hold back all the observation data in a case that is clearly very important and far-reaching. It is a very small data set, and I wonder how well the samples kept in Egypt have been used to figure out that there might actually be a jackal AND a wolf species living close to each other.

ICZN and Open Access, or ICZN's self-inflicted no-role

This is probably the most unbelievable discussion on a thematic list server I have come across in a long time:

I suspect you are both missing the point here. Please read the abstract once again:
http://iczn.org/content/cyclodina-aenea-girard-1857-currently-oligosoma-aeneum-reptilia-squamata-scincidae-proposed-

unfortunately, I don't have access to the whole article, only the abstract (can anyone send me the PDF, please?), but, as I understand it:

Cyclodina aenea Girard, 1857 has been found to be a junior subjective synonym of Tiliqua ornata Gray, 1843, but *both names* are in current usage (as Oligosoma aenea and O. ornata, respectively). What the authors want is to retain the usage of *both names*. Suppression of Tiliqua ornata Gray, 1843 is not going to help! The *only way* is to designate a neotype for Tiliqua ornata Gray, 1843 ...

Stephen [ICZN-list listserve, Mon 3/14/2011 11:41PM]


Here is the ICZN, which faces an uphill battle not to lose nomenclatorial control over the scientific names of animals, which is unable to provide lists of available names so that it could offer a service to judge whether a name has already been used, and which wants to sort out errors in the naming of species. And this is the way discussions are led on its list server, because its cases and opinions are not open access.

One of the main reasons why ICZN cannot produce a list of available names is copyright. If there were no copyright on taxonomic material, we could at least provide instant access today to all the nomenclatorial acts that are currently published. There are many barriers in a world where such acts are published in over 1,200 journals and books annually, but these are decreasing with the change from print-only to print/electronic publishing, and especially with publications from non-western countries becoming open access. This process could go even faster if the Biodiversity Heritage Library didn't have to run up against the copyright wall.

But here is the ICZN, which uses copyright and bars even those who are seriously interested, but not attached to an institution with a subscription, from reading its material. This policy is a sign, far beyond nomenclature, that all the discussions we lead about open access, not to speak of developing technology that lets machines harvest content automatically (so that, for example, Zoobank could become more efficient, if it ever sees the light of day), are null and void, because we believe in copyright.

Thorpe's comment is the best example of how detrimental such a policy is, not only to the discussion of internal affairs but also to the enforcement of open access. But luckily, there is an increasing move by our funding bodies to provide and request open access, so that ICZN with its publishing policies is once again maneuvered offside.

I wonder why ICZN can afford to add to its losing battle for control of names in the electronic world another policy that leaves it with no role in the important nomenclatorial control of scientific names. But maybe the better question is why it takes so long to realize that open access is the only way out of the dark.

I am sure the reason for this business policy is that the revenues from the sale of the Bulletin are needed. If this barrier of non-access is detrimental to ICZN, I wonder whether this is the right business model to adhere to. What is clear, though, is that a no-access policy is not very attractive to any funding agency.

P.S. If you are interested in reading the Bulletin, you can purchase it here