Friday, May 01, 2015

The STM publishers are not our friends.

Data Mining and Extraction

The Goal

I am interested in extracting and mining data from scientific publications. I want to be able to run such analyses over large corpora of publications: first to discover which journals include the respective data, and in a following step to run the extraction or mining processes over them.

The focus here is not on the issues of finding and building up this corpus, but on accessing and extracting content from a PDF.
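The first step, finding out which journals contain the data at all, could look roughly like the sketch below. It assumes the text has already been extracted and that the target elements can be recognised by a regular expression; the function name `journals_with_data` and the coordinate-like pattern are hypothetical and only illustrate the idea.

```python
import re
from collections import defaultdict


def journals_with_data(docs, pattern):
    """Count, per journal, how many documents contain at least one match.

    docs    -- iterable of (journal_name, extracted_text) pairs
    pattern -- regular expression standing in for the "target elements"
    Returns {journal_name: (documents_with_match, documents_total)}.
    """
    regex = re.compile(pattern)
    counts = defaultdict(lambda: [0, 0])
    for journal, text in docs:
        counts[journal][1] += 1          # every document counts towards the total
        if regex.search(text):
            counts[journal][0] += 1      # at least one target element found
    return {journal: (hits, total) for journal, (hits, total) in counts.items()}


# Hypothetical target: decimal coordinates such as "46.95, 7.45".
COORDINATES = r"-?\d{1,3}\.\d+,\s*-?\d{1,3}\.\d+"

docs = [
    ("Journal A", "Specimens collected at 46.95, 7.45 in 2014."),
    ("Journal A", "A purely descriptive note."),
    ("Journal B", "No locality data given."),
]
summary = journals_with_data(docs, COORDINATES)
```

Journals with a high share of matching documents would then be the candidates for the second, more expensive extraction step.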

The Extracted Text

From this article (doi: 10.1111/1365-2656.12367), I would like to extract the text so I can run it through my tools (Imagine) to find the target elements.

This is what I get: not really the same nicely structured text as in the original.

And this is not an artifact of my simple, even simplistic, way of getting the text out of a PDF. The problem is repeated throughout the entire PDF. In fact, most of the journals we look at come with problems of their own, ranging from slightly to very different.
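To give an idea of the clean-up such text needs before it can be mined, here is a minimal sketch. It assumes two artifacts that are common in PDF text extraction in general, typographic ligatures and words hyphenated across line breaks; it does not claim that these are the specific problems of the article above, and the function name `clean_extracted_text` is hypothetical.

```python
import re
import unicodedata


def clean_extracted_text(text):
    """Undo a few common (assumed) PDF extraction artifacts."""
    # NFKC folds compatibility characters, e.g. the single-glyph "fi"
    # ligature (U+FB01), into plain letters.
    text = unicodedata.normalize("NFKC", text)
    # Rejoin words that were hyphenated across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn single line breaks into spaces, but keep blank lines
    # (paragraph breaks) untouched.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    return text


raw = "classi-\n\ufb01cation of\nants\n\nNew paragraph"
cleaned = clean_extracted_text(raw)
```

Both rules are lossy: the hyphen rule also merges genuine compounds that happen to break at a line end, and NFKC rewrites other compatibility characters as well (superscript digits, for instance). Since each journal has its own problems, rules like these would have to be tuned per source.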

Whose Problem is it?

So, is this a publisher's problem? To some extent, yes: if publishers were interested in providing access, they would make sure that extraction is possible, and they would not keep changing the way they encode PDFs over time. The PDFs are now all created from manuscripts, mainly written in MS Word, in which those artifacts are not present.

At the same time, it also seems to be a strategy of Adobe, one of the companies providing tools to create PDFs. An indication might be that when I save a webpage, pure HTML, I get a PDF that is a mere image with no text available anymore, thus losing all the encoding as well as all the structuring. That means that if I want to extract text from it, I need an OCR engine, which is cumbersome in itself and also introduces new errors, not to speak of the time this extra step takes.

Now you can look at me as a naive user of modern electronic resources. I would argue that I am rather an advanced user. Our experience comes from mining corpora of literature assembled by taxonomic specialists, which is exactly where corpora are being assembled in our domain: for example, ca. 4,500 PDFs covering ant systematics, or ca. 16,000 covering drosophilid taxonomy. These corpora are not available otherwise, and the modern publishing industry is not interested in helping to make this happen. Thus the "naive" user is where the research is going on, and so plays an important role in advancing science.
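Over a large corpus it helps to find out up front which pages are image-only, so that the slow and error-prone OCR step runs only where it is needed. The sketch below assumes that some extractor has already produced one string per page; the function name `pages_needing_ocr` and the threshold of 20 characters are arbitrary assumptions, not tested values.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return the 0-based indices of pages with (almost) no extractable text.

    page_texts -- list with one extracted string per page
    min_chars  -- assumed threshold: pages with fewer non-whitespace
                  characters are treated as image-only
    """
    flagged = []
    for index, text in enumerate(page_texts):
        # Count characters after dropping all whitespace.
        visible = len("".join(text.split()))
        if visible < min_chars:
            flagged.append(index)
    return flagged


pages = [
    "Abstract. We describe a new species of ant from Madagascar.",
    "",          # image-only page: nothing extracted
    "  \n ",     # only whitespace extracted
]
to_ocr = pages_needing_ocr(pages)
```

Pages that are legitimately almost empty, such as a plate carrying only a figure number, would be flagged too, so the result is a candidate list rather than a verdict.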

If STM publishers and Adobe are interested in contributing to the future of science, then they should give up their efforts to hamper access.
