Thursday, May 07, 2015

The Global Registry of Biorepositories (GRBio)

Wouldn't it be great if we had a registry of all the institutions holding biological collections, including the baseline information, contact addresses, and, even better, a list of all their specimens, the DNA samples extracted from them, images, and publications, not only those by their staff but also those that include specimens from the collection? Whilst much of this may lie in the future, one simple element, a unique persistent identifier for each collection, would be extremely helpful, so that we could link our taxonomic publications with those institutions once and for all.

While reviewing a manuscript this morning, I suggested in the review that the authors ought to add a link to the respective institutions in the Global Registry of Biorepositories. Somehow I suddenly felt I had to look up some of the collections, and it seems it was well worth it.

CAS: California Academy of Sciences, San Francisco, U.S.A. Not in the registry.
MCB: D. Mezger Collection, Balingen, Germany. Not in the registry.
NHMW: Natural History Museum Vienna, Austria. Not in the registry.
NMM: National Museum of the Philippines, Manila, the Philippines. Listed as MPMP, which is not the name provided in the manuscript.
UKL: University of Koblenz-Landau (Campus Landau), Germany. Not in the registry.
WCD: P.S. Ward Collection, Davis, California, U.S.A. Not in the registry.
SCV: D.M. Sorger Collection, Vienna, Austria. Not in the registry.
ZCV: H. Zettel Collection, Vienna, Austria. Not in the registry.

So, not exactly helpful, and obviously one cannot recommend that authors go ahead with this right now.
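The lookup I did by hand can be sketched in a few lines of Python. This is only an illustration: the `registry` dictionary is a hypothetical local snapshot, not the registry's actual interface, and it assumes that the one institution I did find is registered under the code MPMP. The function name `check_codes` is likewise made up for this example.

```python
# Collection codes and names as cited in the manuscript (from the list above).
manuscript = {
    "CAS": "California Academy of Sciences",
    "MCB": "D. Mezger Collection",
    "NHMW": "Natural History Museum Vienna",
    "NMM": "National Museum of the Philippines",
    "UKL": "University of Koblenz-Landau",
    "WCD": "P.S. Ward Collection",
    "SCV": "D.M. Sorger Collection",
    "ZCV": "H. Zettel Collection",
}

# Hypothetical snapshot of what the registry returned: code -> institution name.
# Assumption: the single hit was registered under the code MPMP.
registry = {"MPMP": "National Museum of the Philippines"}


def check_codes(cited, registered):
    """Sort cited codes into: found, found under another code, and missing."""
    code_by_name = {name: code for code, name in registered.items()}
    found, renamed, missing = [], {}, []
    for code, name in cited.items():
        if code in registered:
            found.append(code)
        elif name in code_by_name:
            # Same institution, but the registry uses a different code.
            renamed[code] = code_by_name[name]
        else:
            missing.append(code)
    return found, renamed, missing


found, renamed, missing = check_codes(manuscript, registry)
print("found under the cited code:", found)
print("registered under another code:", renamed)
print("not in the registry:", missing)
```

With a persistent identifier per collection, the name matching in the middle branch would not be needed at all; the identifier itself would be the key.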

But what I found much more astonishing is that all the discussions in our domain regarding Linked Open Data and persistent identifiers (e.g. Guralnick et al., 2015; Bouchout Declaration) seem to have gone unheard. How come a new project is launched without giving this top priority?

Friday, May 01, 2015

The STM publishers are not our friends.

Data Mining and Extraction

The Goal

I am interested in extracting and mining data from scientific publications. I want to be able to run such analyses over large corpora of publications, first to discover which journals include the relevant data, and then to run the extraction or mining processes over them.

The focus here is not on the issues of finding and building up this corpus, but on accessing and extracting content from a PDF.

The Extracted Text

From this article (doi: 10.1111/1365-2656.12367), I would like to extract the text so I can run it through my tools (Imagine) to find the target elements.

This is what I get. Not really the same text that is nicely structured in the original.

And this is not an artifact of my simple, even simplistic, way of getting the text out of a PDF; the problem is repeated throughout the whole PDF. In fact, most of the journals we look at have their own problems, ranging from slightly to very different.

Whose Problem is it?

So, is this a publisher's problem? To some extent yes, because if they were interested in providing access, they would ensure that extraction is possible, and would not keep changing the way they encode PDFs over time. PDFs are now all created from manuscripts, mainly written in MS Word, where those artifacts are not present.

At the same time, it also seems to be a strategy by Adobe, one of the companies providing tools to create PDFs. An indication might be that when I save a webpage, pure HTML, I get a PDF that is a mere image with no text available anymore, thus losing all the encoding as well as all the structure. That means that if I want to extract text from it, I need OCR, which is itself cumbersome and also introduces new errors, not to speak of the time this extra step takes.
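When running over a large corpus, it helps to sort out the image-only PDFs before any mining starts. A minimal sketch, assuming the per-page text has already been pulled out by whatever extractor is in use: the function name `needs_ocr` and the threshold of 50 characters per page are my own assumptions, not an established rule.

```python
def needs_ocr(page_texts, min_chars_per_page=50):
    """Return True if the extracted pages carry too little text to mine directly.

    page_texts: list of strings, one per page, as returned by a text extractor.
    min_chars_per_page: assumed threshold; tune it for the corpus at hand.
    """
    if not page_texts:
        # No pages or no text layer at all: only OCR can help.
        return True
    total_chars = sum(len(text.strip()) for text in page_texts)
    return total_chars / len(page_texts) < min_chars_per_page


# An image-only PDF typically yields empty strings for every page.
print(needs_ocr(["", "   ", ""]))
# A page with a real text layer passes the check.
print(needs_ocr(["Ants of the genus Example were collected in 2014. " * 10]))
```

This only tells me whether there is text at all. It says nothing about whether the text that comes out is in the right order or free of the artifacts described above.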

Now you can look at me as a naive user of modern electronic resources. I would argue that I am rather an advanced user. Our experience comes from mining corpora of literature assembled by taxonomic specialists, which is where the corpora in our domain are being built, such as ca. 4,500 PDFs covering ant systematics, or ca. 16,000 on drosophilid taxonomy. These corpora are not available otherwise, and the modern publishing industry is not interested in helping to make this happen. Thus the "naive" user is where research is going on, and so plays an important role in advancing science.

If STM publishers and Adobe are interested in contributing to the future of science, then they should give up their efforts to hamper access.