Wednesday, February 27, 2008

The Launch of Plazi.org

PLAZI.ORG - THE DIGITAL REPOSITORY FOR SPECIES DESCRIPTIONS.

Knowledge of the actual number of species on planet Earth is one of the last frontiers in science. It is not known exactly how many species have been identified and described, much less the number of as yet undescribed species.

However, the species we do know are documented in well over a hundred million pages of printed scientific books and journals. This knowledge is hidden in libraries, and no single library holds all of it.

Species descriptions are very rich in data: each is essentially a quality-controlled summary of what is known about a particular species at a specific time. In the best cases, this information includes a detailed morphological description, drawings and images, a summary of behavior and ecology, and a detailed list of all the specimens studied. More recent publications may also provide links to DNA sequences or video documentation, among other forms of data. E-publications have recently become available, but many of these are copyrighted and thus not generally open to the public for perusal or use. Nor are they easily machine-searchable for discovery and re-use of their contents.

Recently, the Biodiversity Heritage Library (BHL) was launched as a large-scale operation to digitize this biodiversity literature. Currently, it includes major US and UK natural history libraries, with the ultimate goal of including the entire global literature. All publications will be openly accessible to the public unless they are copyrighted, so most recent publications remain out of reach. The BHL thus falls short of realizing the full potential uses of these publications.

Tagging the “boundaries” of a species description and identifying the species it deals with supports discovery and retrieval of data that is not possible through Google. Marked-up species descriptions permit queries such as "which are the red ants in London?", a very common form of query.

Under some national copyright legislation, such as the Swiss, descriptions cannot be copyrighted. Through historical constraints (there are tens of millions of descriptions) and peer review, they are standardized and list factual data, in most cases morphological, describing species. They can thus all be made readily accessible.

Plazi.org is a new Web-based service that offers access to descriptions of species and an archive to store the publications as marked-up documents. GoldenGate, a dedicated editor, has been developed to mark up the publications and support the extraction of descriptions, based on TaxonX, an XML schema modeling the logical content of these publications. The Plazi Search and Retrieval Server builds on this systematic mark-up of texts to offer powerful search functions that find species descriptions, or even simple mentions of species, permitting users to answer questions like “Which species occur together?”
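To make the idea of querying marked-up descriptions more concrete, here is a minimal sketch in Python. It is not the Plazi Search and Retrieval Server, and it does not use the real TaxonX schema: the element names (`treatment`, `description`, `locality`), the `taxon` attribute and the sample texts are all invented for illustration. It only shows why tagged boundaries make questions like "red ants in London" or "which species occur together" easy to answer.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from itertools import combinations

# Invented sample data in a simplified, hypothetical markup (NOT real TaxonX).
SAMPLE = """
<documents>
  <document id="doc1">
    <treatment taxon="Formica rufa">
      <description>Head and gaster dark, mesosoma red.</description>
      <locality>London</locality>
    </treatment>
    <treatment taxon="Lasius niger">
      <description>Body uniformly dark brown to black.</description>
      <locality>London</locality>
    </treatment>
  </document>
  <document id="doc2">
    <treatment taxon="Myrmica rubra">
      <description>Body reddish brown.</description>
      <locality>Karlsruhe</locality>
    </treatment>
  </document>
</documents>
"""


def load_treatments(xml_text):
    """Extract one record per tagged treatment, whatever document it sits in."""
    root = ET.fromstring(xml_text)
    treatments = []
    for node in root.iter("treatment"):
        treatments.append({
            "taxon": node.get("taxon"),
            "description": node.findtext("description", default=""),
            "locality": node.findtext("locality", default=""),
        })
    return treatments


def search(treatments, word, locality):
    """Taxa whose description contains `word` (plain substring match)
    and whose tagged locality equals `locality`."""
    word = word.lower()
    return sorted(
        t["taxon"] for t in treatments
        if word in t["description"].lower() and t["locality"] == locality
    )


def co_occurring(treatments):
    """Pairs of taxa that share at least one tagged locality."""
    by_locality = defaultdict(set)
    for t in treatments:
        by_locality[t["locality"]].add(t["taxon"])
    pairs = set()
    for taxa in by_locality.values():
        pairs.update(combinations(sorted(taxa), 2))
    return pairs


treatments = load_treatments(SAMPLE)
print(search(treatments, "red", "London"))
print(co_occurring(treatments))
```

A full-text search over the same pages could not tell that "red" belongs to one description and "London" to its locality; the tags are what make the two conditions combinable.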

Plazi.org already includes more than 3,700 descriptions of 3,000 taxa, with the goal of archiving all forthcoming new descriptions and, contingent upon additional funding, all the descriptions of the 12,278 known ant species listed in the Hymenoptera Name Server/antbase.org, enhanced with globally unique species numbers (LSIDs: Life Science Identifiers). While ants provide the original test case, the service is not restricted to ants but is potentially open to all groups, from bacteria to plants, and will support most major languages. All descriptions are machine-readable and can thus be picked up for mash-ups or individual Websites.
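For readers who want to pick up such identifiers in their own scripts, the following is a small sketch of how an LSID can be split into its parts. LSIDs follow the URN pattern `urn:lsid:authority:namespace:object`, with an optional trailing revision. The function name `parse_lsid` and the identifier used in the example are made up for illustration; they are not actual Plazi or Hymenoptera Name Server values.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text):
    """Split 'urn:lsid:authority:namespace:object[:revision]' into its parts."""
    parts = text.strip().split(":")
    # The 'urn:lsid' prefix is treated as case-insensitive.
    if (len(parts) not in (5, 6)
            or parts[0].lower() != "urn"
            or parts[1].lower() != "lsid"):
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"LSID has an empty component: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(authority, namespace, object_id, revision)


# Hypothetical identifier, for illustration only.
print(parse_lsid("urn:lsid:example.org:taxname:12345"))
```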

Plazi.org is run and developed by Donat Agosti, Terry Catapano, Christiana Klingenberg and Guido Sautter. Its development is supported by grants from the US National Science Foundation (to the American Museum of Natural History: Christie Stephenson and Tom Moritz), the German Deutsche Forschungsgemeinschaft (to University of Karlsruhe: Klemens Böhm) and the Global Biodiversity Information Facility (GBIF; to Plazi.org and Zootaxa). Plazi.org collaborates with the Hymenoptera Name Server at Ohio State University (Norm Johnson), Zoobank (Richard Pyle), University of Massachusetts (Robert Morris), antweb.org (Brian Fisher) and Zootaxa (Zhi-Qiang Zhang).

Plazi.org was released to the public at the EDIT "IPR and the web: challenges for taxonomy" meeting in London on Feb. 20, 2008.

Related Links:
EDIT- IPR and the Web Workshop, Kew Gardens, UK, February 20, 2008

The organizers have to be thanked for taking on this important topic, since IPR and copyright have an impact on all of us. Their (mis-)application is essentially one of the main culprits for biodiversity being neither on the science agenda nor a relevant issue in today's global environmental politics. "You can only protect what you know..." and with literally no access to the more than 10 million descriptions, and no clear strategy yet for how to provide it within the currently misunderstood and misapplied copyright framework, we are not faring very well.

I was astonished that nobody from the printing industry or the law makers, and very few from the legal profession, were present. It was therefore very refreshing that the representative of the legal profession, Willi Egloff, gave a talk that was in a way rather the opposite of what a large part of our community thinks copyright is. That complicated, somber picture was represented by EDIT's legal advisor (who is not a lawyer), Naomi Korn, who brought in a new term: "Risk Assessment".

Practical aspects of the publishing side came up only at the end of my lecture and in the discussions. These stressed the point that we need at least self-archiving (the Green Road) or, better, ways toward an open access policy, which must ensure that we do not sign contracts that give away exclusive rights to publishers.

There was the usual discussion, based on the idea that we all continually publish in Nature and Science... which showed how little the same people know about contracts, such as those for publishing in Nature (which allows preprints in repositories).

Naomi Korn, a consultant to EDIT, was speaking much more for the NHL administration than for the interests of the scientists, and warned that there are dangers all over. The examples of the danger of being sued, though, came from Google, or from Access and Benefit Sharing issues at Kew.

Egloff gave a legal perspective (I think he was one of the very few legal professionals in the room). In his interpretation of the law, descriptions cannot be copyrighted, although entire publications probably can, because descriptions do not represent original, unique entities that qualify as works. It is also possible to make temporary copies of publications, for example in the process of marking them up and extracting the descriptions.

A lot of all this depends on individual cases.

Naomi stressed the point that we live in a very dangerous world, which requires a sound Risk Assessment of all we do. This point was flawed, because the argument would have to build on cases dealt with in the courts, and there are none so far.

A further point stressed by Naomi was that we need to protect the interests of the scientists, so that they could eventually make money with their work. This, in fact, mixes commercial with scientific use and thus rules out many of the exceptions that copyright law grants for scientific use. My feeling was that Naomi, supposedly representing the interests of the scientists, leans much more towards the administration's point of view. Her point, "I have to sell an IPR policy to the administration", might not be what is needed; what is needed is a convincing argument that gets the administration to support the scientists' point.

Egloff would take the route of feeling much less constrained and making use of modern technology, so that the advantage of open access becomes clear to the law makers. So far there is little to be seen, considering that there are tens of millions of descriptions out there. Certainly the approach BHL is choosing does little to make the case: because of their copyright policy they hardly touch modern publications, and thus will not allow comprehensive bodies of descriptions to be built that can be mined or used.

One problem is that our large research institutions (the natural history museums and collections) position themselves as corporations interested in selling products rather than focusing on the interests of their scientists.

Vishwas Chavan's talk about citation was one step in the direction of linking all the various pieces of data and information together. This is what is ultimately needed to represent taxonomic knowledge as an entity, as opposed to single publications and thousands of little databases that nobody but the individual scientist would use.

My own presentation went along the same route. I made the case that we need to make content accessible and showed how we could do it (see plazi.org): by providing the tools to convert legacy publications into semantically enhanced publications, with an emphasis on access to descriptions, together with the infrastructure to provide access to them. There will be no single way to get there, so we need to consider both the Green and the Gold Road to open access. We also need to consider whether we want to keep publishing descriptions the same old way, or whether we had better make sure we can use web publications and strive for comprehensive databases and ontologies that provide more or less uniform access to the publications.

This latter point was also stressed by Vince Smith's and Dave Roberts's points about Scratchpads.

One point that needs to be explored is the question of why we began to talk about publications in the first place. Clearly, if we talk only about simple access to PDF copies, there is little chance of success beyond implementing the Green Road, which does not allow machine-readable access. What we need is a vital science curriculum that makes use of the published material as much as it produces new material, so that third parties can make new discoveries, for example by compiling a list of all the host-plant relationships of a given taxon. We want to introduce new metrics that show what is being used and how often, to measure a scientist's contribution. But foremost, we have to have a strong research curriculum, and based on this ask what sort of publications we need. We all want things, but we, the systematists' community, must get together more actively to show that we develop innovative tools TOGETHER to chart the world's species. Providing open access is just one of the outcomes, which also allows the public to follow what we as scientists do.

Thus, a lot of the issues would be resolved if the emphasis were on coming up with good science in the first place, science built upon making use of the vast body of literature and the growing databases, such as those delivered through GBIF, GenBank and CBOL. That would mean following a few simple standards and transfer protocols.

In summary, it was a timely meeting. A lot of the participants left confused, but still thinking that copyright is needed to protect their own rights. I hope that Egloff's position will be noticed. Why not run the entire biodiversity literature operation from places like Switzerland that have the best possible legal framework? Already now, scanning takes place in the US and not the UK because of copyright law.

It is time that the scientists go back to square one. Science is about citations and the free flow of information, and NOT copyright, which is a commercial issue. Nothing prohibits us from taking new routes, from coming up with OUR strategy, and then holding talks that include all the stakeholders.