Rogue Scholar

Data Quality, Parsing, Plazi, Specimen, Text Mining, Computer and Information Sciences

Problems with Plazi parsing: how reliable are automated methods for extracting specimens from the literature?

Published October 25, 2021

The Plazi project has become one of the major contributors to GBIF, with some 36,000 datasets yielding some 500,000 occurrences (see Plazi's GBIF page for details). These occurrences are extracted from taxonomic publications using automated methods.

Guest Post, Macroscope, Computer and Information Sciences

Reflections on "The Macroscope" - a tool for the 21st Century?

https://doi.org/10.59350/d3dc0-7an69

Published October 7, 2021

Author Roderic Page

This is a guest post by Tony Rees. It would be difficult to encounter a scientist, or anyone interested in science, who is not familiar with the microscope, a tool for making objects visible that are otherwise too small to be properly seen by the unaided eye, or to reveal otherwise invisible fine detail in larger objects.

JSON-LD, RDF, Computer and Information Sciences

JSON-LD in the wild: examples of how structured data is represented on the web

https://doi.org/10.59350/jzvs4-r9559

Published August 27, 2021

Author Roderic Page

I've created a GitHub repository so that I can keep track of the examples of JSON-LD that I've seen being actively used, for example embedded in web sites, or accessed using an API. The repository is https://github.com/rdmpage/wild-json-ld. The list is by no means exhaustive, and I hope to add more examples as I come across them. One reason for doing this is to learn what others are doing.
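As a rough illustration of what "JSON-LD embedded in a web site" looks like in practice, here is a minimal Python sketch that pulls `<script type="application/ld+json">` blocks out of an HTML page using only the standard library. The sample HTML, the `JsonLdExtractor` class, and the `extract_json_ld` helper are all hypothetical names made up for this example; they are not taken from the repository above.

```python
import json
from html.parser import HTMLParser

# Hypothetical page fragment, invented for this example.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "ScholarlyArticle", "name": "An example article"}
</script>
<script type="text/javascript">var x = 1;</script>
</head><body></body></html>
"""


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of JSON-LD <script> elements."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        # Only script elements explicitly typed as JSON-LD are of interest.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.documents.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip blocks that are not valid JSON


def extract_json_ld(html):
    """Return a list of JSON-LD documents found in an HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.documents


if __name__ == "__main__":
    for doc in extract_json_ld(SAMPLE_HTML):
        print(doc.get("@type"), "-", doc.get("name"))
```

This only recovers the raw JSON; interpreting the `@context` and expanding terms would need a proper JSON-LD processor.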

Computer and Information Sciences

Species Cite: linking scientific names to publications and taxonomists

https://doi.org/10.59350/jarsz-yfm45

Published July 23, 2021

Author Roderic Page

I've made Species Cite live. This is a web site I've been working on with the GBIF Challenge as a notional deadline so I'll actually get something out the door. "Species Cite" takes as its inspiration the suggestion that citing original taxonomic descriptions (and subsequent revisions) would increase citation metrics for taxonomists, and give them the credit they deserve.

Bibliography Of Life, CSL, ElasticSearch, JSON, JSON-LD, Computer and Information Sciences

Towards a WikiCite search engine

https://doi.org/10.59350/45mzm-67867

Published July 22, 2021

Author Roderic Page

I've released a simple search engine for publications in Wikidata. Wikicite Search takes its name from the WikiCite project, which was an initiative to create a bibliographic database in Wikidata. Since bibliographic data is a core component of taxonomic research (arguably taxonomy is mostly tracing the fate of the "tags" we call taxonomic names) I've spent some time getting taxonomic literature into Wikidata.

Citation, CSL, Machine Learning, Parsing, Computer and Information Sciences

Citation parsing tool released

https://doi.org/10.59350/9416m-mzz03

Published July 22, 2021

Author Roderic Page

Quick note on a tool I've been working on to parse citations, that is, to take a series of strings such as:

Möllendorff O (1894) On a collection of land-shells from the Samui Islands, Gulf of Siam. Proceedings of the Zoological Society of London, 1894: 146–156.

de Morgan J (1885) Mollusques terrestres & fluviatiles du royaume de Pérak et des pays voisins (Presqúile Malaise). Bulletin de la Société Zoologique de France, 10: 353–249.
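To make the idea of turning a citation string into structured data concrete, here is a minimal Python sketch. It is not the tool described in this post: it swaps in a single hand-written regular expression that only handles the "Author (Year) Title. Journal, Volume: pages." layout of the Möllendorff example. The `parse_citation` function and the CSL-JSON-style field choices are assumptions made for illustration.

```python
import re

# Matches only the layout: Author (Year) Title. Journal, Volume: first–last.
# This is an illustrative pattern, not the approach used by the actual tool.
CITATION_PATTERN = re.compile(
    r"^(?P<author>.+?)\s+\((?P<year>\d{4})\)\s+"
    r"(?P<title>.+?)\.\s+"
    r"(?P<journal>[^.]+?),\s+"
    r"(?P<volume>\d+):\s+"
    r"(?P<first>\d+)[–-](?P<last>\d+)\.?$"
)


def parse_citation(text):
    """Return a CSL-JSON-like dict for a citation string, or None if no match."""
    match = CITATION_PATTERN.match(text.strip())
    if match is None:
        return None
    return {
        "type": "article-journal",
        "author": [{"literal": match.group("author")}],
        "issued": {"date-parts": [[int(match.group("year"))]]},
        "title": match.group("title"),
        "container-title": match.group("journal"),
        "volume": match.group("volume"),
        "page": match.group("first") + "-" + match.group("last"),
    }


if __name__ == "__main__":
    example = (
        "Möllendorff O (1894) On a collection of land-shells from the "
        "Samui Islands, Gulf of Siam. Proceedings of the Zoological "
        "Society of London, 1894: 146–156."
    )
    print(parse_citation(example))
```

A fixed pattern like this breaks as soon as the citation style changes, which is the usual motivation for trained sequence-labelling approaches to citation parsing.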

C++, Cloud, Compiling, Heroku, Computer and Information Sciences

Compiling a C++ application to run on Heroku

https://doi.org/10.59350/vy6b8-0eh95

Published June 15, 2021

Author Roderic Page

TL;DR: Use a buildpack and set "LDFLAGS=--static" --disable-shared.

I use Heroku to host most of my websites, and since I mostly use PHP for web development this has worked fine. However, every so often I write an app that calls an external program written in, say, C++. Up until now I've had to host these apps on my own web servers. Today I finally bit the bullet and learned how to add a C++ program to a Heroku-hosted site.

ALA, BHL, BioStor, GBIF, Plazi, Computer and Information Sciences

Thoughts on BHL, ALA, GBIF, and Plazi

https://doi.org/10.59350/17w25-9m342

Published June 4, 2021

Author Roderic Page

If you compare the impact that BHL and Plazi have on GBIF, then it's clear that BHL is almost invisible. Plazi has successfully carved out a niche where they generate tens of thousands of datasets from text mining the taxonomic literature, whereas BHL is a participant in name only. It's not as if BHL lacks geographic data.

Citation, CRF, Identifiers, Machine Learning, Specimens, Computer and Information Sciences

Finding citations of specimens

https://doi.org/10.59350/gg8m4-vb985

Published May 28, 2021

Author Roderic Page

Note to self. The challenge of finding specimen citations in papers keeps coming around. It seems that this is basically the same problem as finding citations to papers, and can be approached in much the same way. If you want to build a database of references from scratch, one way is to scrape citations from papers (e.g., from the "literature cited" section), convert those strings into structured data, and add those to your database.

Catalogue Of Life, Graphviz, Summary Trees, Visualisation, Computer and Information Sciences

Maximum entropy summary trees to display higher classifications

https://doi.org/10.59350/af01t-6sw74

Published May 28, 2021

Author Roderic Page

How to cite: Page, R. (2021). Maximum entropy summary trees to display higher classifications. https://doi.org/10.59350/af01t-6sw74

A challenge in working with large taxonomic classifications is how you display them to the user, especially if the user probably doesn't want all the gory details.

iPhylo
