Rants, raves (and occasionally considered opinions) on phyloinformatics, taxonomy, and biodiversity informatics. For more ranty and less considered opinions, see my Twitter feed. ISSN 2051-8188. Written content on this site is licensed under a Creative Commons Attribution 4.0 International license.

University of Glasgow

Roderic Page

I've been banging on about having citable, persistent identifiers for specimens, so was suitably impressed when Derek Sikes posted a comment on iPhylo that Arctos already does this. For example, here is a DOI for a specimen: http://dx.doi.org/10.7299/X7VQ32SJ. So, we're all done, right? Not quite.

DOIs for specimens are here, but we're not quite there yet

One of the things that didn't make last week's deadline for launching BioNames was the inclusion of phylogenies.

BioNames - Phylogenies? Yes, phylogenies

This post was prompted by Stephen Thorpe's post on TAXACOM about Wikispecies in which he wrote (in a thread discussing Roger Hyam's recent blog post) that  I beg to differ. Wikispecies runs on a database (the Mediawiki software uses a database to store the wiki), and Mediawiki can be thought of as a database of semi-structured text, but it lacks a lot of the functionality database users would expect.

Wikispecies is not a database

Lately I've become more and more interested in moving data off my machine(s) and into the cloud.

Zotero: creating bibliographies in the cloud

As part of the NSF "Assembling, Visualising and Analysing the Tree of Life" Ideas Lab that I took part in earlier this week I had an assessment of my "problem solving style" carried out using a service called FourSight.

I am not a number...I am an "ideator"

I spent last week in Ottawa at an "Ecobiomics" hackathon organised by Joel Sachs. Essentially we spent a week exploring the application of linked data to various topics in biodiversity, with an emphasis on looking at working examples.

Ottawa Ecobiomics hackathon: graph databases and Wikidata

One of the potentially powerful features of TreeBASE II is the availability of an RDF version of a study. This means that, in principle, one could take the RDF for a TreeBASE study, combine it with RDF from other sources, and generate a richer view of a particular study.

TreeBASE II RDF

A quick, and not altogether satisfactory hack, but I've added a simple interactive treemap to BioStor. It's essentially a remake of the Catalogue of Life treemap I created in 2008, but coloured by the number of references I've extracted from BHL.

Displaying taxonomic coverage using a treemap

Over on the EOL blog is a summary of a meeting Visualizing the Evolutionary Tree of Life. This sounds like it was a fun meeting, but part of me is suffering from déjà vu. Our community has tossed this subject around for a while now.

Visualizing the Evolutionary Tree of Life

I've written a short paper entitled "Liberating links between datasets using lightweight data publishing: an example using plant names and the taxonomic literature" (phew) and put a preprint on bioRxiv (https://doi.org/10.1101/343996) while I figure out where to publish it. Here's the abstract:  In some ways the paper is simply a record of me trying to figure out how to publish a project that I've been working on for several years, namely

Liberating links between datasets using lightweight data publishing: an example using IPNI and the taxonomic literature

Day three of TDWG 2017 highlighted some of the key obstacles facing biodiversity informatics. After a fun series of "wild ideas" (nobody will easily forget David Bloom's "Kill your Darwin Core darlings") we had a wonderful keynote by Javier de la Torre (@jatorre) entitled "Everything happens somewhere, multiple times". Javier is CEO and founder of Carto, which provides tools for amazing geographic visualisations.

TDWG 2017: thoughts on day 3

Continuing with my exploration of the Biodiversity Heritage Library, one obstacle to linking BHL content with nomenclature databases is the lack of a consistent way to refer to the same bibliographic item (e.g., book or journal). For example, the Amphibian Species of the World (ASW) page for Gastrotheca aureomaculata gives the first reference for this name as: Gastrotheca aureomaculata Cochran and Goin, 1970, Bull. U.S. Natl.

n-gram fulltext indexing in MySQL

At the end of day two of the GBIF LSID-GUID Task Group I put together this crude diagram to summarise some of the possible links between biodiversity data and the larger linked data cloud, which I, among others, have argued is where biodiversity informatics should be heading.

GBIF and Linked Data

This is a quick sketch of a way to combine existing tools to help clean and annotate data in GBIF, particularly (but not exclusively) occurrence data. GitHub   The data provider puts a Darwin Core Archive (expanded, not zipped) into a GitHub repository. GBIF forks the repository, cleans the data, and uploads that to GBIF to populate the database behind the portal.

Annotating and cleaning GBIF data:  Darwin Core Archive, GitHub, ORCID, and DataCite

Yes, I know this is ultimately a case of the "genius of and", but the more I play with the Semantic Mediawiki extension the more I think this is going to be the most productive way forward. I've had numerous conversations with Vince Smith about this. Vince and colleagues at the NHM have been doing a lot of work on "Scratchpads" -- Drupal-based web sites that tend to be taxon-focussed.

Wikis versus Scratchpads

Some random notes on the first day of TDWG 2017. First off, great organisation with the first usable conference calendar app that I've seen (https://tdwg2017.sched.com). I gave the day's keynote address in the morning (slides below: "Towards a biodiversity knowledge graph" from Roderic Page). It was something of a stream of consciousness brain dump, and tried to cover a lot of (maybe too much) stuff.

TDWG 2017: thoughts on day 1

Reading the GitHub issue Define objective rules for taxon concept identity referred to by Markus Döring in a comment on a previous post, I'm once again struck by the unholy mess generated by any discussion of "taxonomic concepts". The sense of déjà vu is overwhelming.

Taxonomic concepts: a possible way forward

A couple of weeks ago I was grumpy on the Internet (no, really) and complained about museum websites and how their pages often lacked vital metadata tags (such as rel=canonical or Facebook Open Graph tags). This got a response:  Vince's lovely line "diddle with semantic data" is the inspiration for the title of this post, in which I describe a tool to display links across datasets, such as museum specimens and scientific publications.

Diddling with semantic data: linking natural history collections to the scientific literature

Wired's article How Yahoo Blew It contains this wonderful extract:  For some reason, the word really appeals. Not that I've got anything against Yahoo -- far from it. They provide some very cool tools, such as Pipes, which is an interactive feed aggregator and manipulator. David Shorthouse brought this to my attention. As he points out, it's highly relevant to our conversation on iSpecies.

Word for the day - "clusterfuck"

The following is a guest post by Bob Mesibov. Nico Franz and Beckett Sterner created a stir last year with a preprint in bioRxiv about expert validation (or the lack of it) in the "backbone" classifications used by aggregators.

Guest post: The Not problem

Inspired by the forthcoming Hack4Knowledge I've put together a service that enables you to assert that you are the author of a paper using the Mendeley API. If you are impatient, give it a try at:  http://iphylo.org/~rpage/hack4knowledge/iwrotethat/  To use it you need a Mendeley account. When you go to I wrote that you will be asked to connect to your Mendeley account.

I wrote that: asserting authorship using the Mendeley API

I'm giving a short talk at the Workshop On Open Citations And Open Scholarly Metadata 2020, which will be held online on September 9th.

Workshop On Open Citations And Open Scholarly Metadata 2020 talk

Just a placeholder to mark the ongoing impact of the Internet Archive being attacked (see here, here and here for details).  The impact of this on the Biodiversity Heritage Library (BHL) has been huge, and reveals the extent to which BHL depends on the Archive.

Internet Archive as a single point of failure

How to cite:

 Page, R. (2024). Problems with the DataCite Data Citation Corpus https://doi.org/10.59350/t80g1-xys37

DataCite have released the Data Citation Corpus, together with a dashboard that summarises the corpus. This is billed as: The goal is to build a citation database between scholarly articles and data, such as datasets in repositories, sequences in GenBank, protein structures in PDB, etc.

Problems with the DataCite Data Citation Corpus

A System for Converting PDF Documents into Structured XML Format

Unsupervised document structure analysis of digital scientific articles

Header and footer extraction by page association

Ceci n'est pas un hamburger: modelling and representing the scholarly article

Layout-aware text extraction from full-text PDF of scientific articles

VILA: Improving Structured Content Extraction from Scientific PDFs Using Visual Layout Groups

How to cite:

 Page, R. (2023). Document layout analysis. https://doi.org/10.59350/z574z-dcw92

Some notes to self on document layout analysis. I’m revisiting the problem of taking a PDF or a scanned document and determining its structure (for example, where is the title, abstract, bibliography, where are the figures and their captions, etc.). There are lots of papers on this topic, and lots of tools.

Document layout analysis

How to cite:

 Page, R. (2023). The problem with GBIF’s Phylogeny Explorer. https://doi.org/10.59350/v0bt3-zp114

GBIF recently released the Phylogeny Explorer, using legumes as an example dataset. The goal is to enable users to “view occurrence data from the GBIF network aligned to legume phylogeny.” The screenshot below shows the legume phylogeny side-by-side with GBIF data.

The problem with GBIF's Phylogeny Explorer

How to cite:

 Page, R. (2023). Sub-second searching of millions of DNA barcodes using a vector database. https://doi.org/10.59350/qkn8x-mgz20

Recently I’ve been messing about with DNA barcodes.

Sub-second searching of millions of DNA barcodes using a vector database

This blog post has some notes in support of a talk given to the Systematics Association meeting in Reading June 20th, 2024.

 Slides

I will post a link to the slides here once I have given the talk. Page, Roderic (2024). Visualising big trees. figshare. Presentation.

Visualising big trees: a talk at the Systematics Association 2024

How to cite:

 Page, R. (2024). Notes on transforming BHL images https://doi.org/10.59350/2gpbb-98a53

I’ve been down this road before, e.g. BHL, DjVu, and reading the f*cking manual and Demo of full-text indexing of BHL using CouchDB hosted by Cloudant, but I’m revisiting converting BHL page scans to black and white images, partly to clean them up, to make them closer to what a modern reader might expect, and partly to reduce the

Notes on transforming BHL images

How to cite:

 Page, R. (2023). Adventures in machine learning: iNaturalist, DNA barcodes, and Lepidoptera. https://doi.org/10.59350/5q854-j4s23

Recently I’ve been working with a masters student, Maja Nagler, on a project using machine learning to identify images of Lepidoptera. This has been something of an adventure as I am new to machine learning, and have only minimal experience with the Python programming language.

Adventures in machine learning: iNaturalist, DNA barcodes, and Lepidoptera

How to cite:

 Page, R. (2023). A taxonomic search engine. https://doi.org/10.59350/r3g44-d5s15

Tony Rees commented on my recent post Ten years and a million links. I’ve responded to some of his comments, but I think the bigger question deserves more space, hence this blog post. Tony’s comment My response I think there are several ways to approach this.

A taxonomic search engine

One thing about ChatGPT is it has opened my eyes to some concepts I was dimly aware of but am only now beginning to fully appreciate. ChatGPT enables you to ask it questions, but the answers depend on what ChatGPT “knows”. As several people have noted, what would be even better is to be able to run ChatGPT on your own content. Indeed, ChatGPT itself now supports this using plugins.

ChatGPT, semantic search, and knowledge graphs

I haven’t blogged for a while, work and other reasons have meant I’ve not had much time to think, and mostly I blog to help me think. ChatGPT is obviously a big thing at the moment, and once we get past the moral panic (“students can pass exams using AI!”) there are a lot of interesting possibilities to explore.

ChatGPT, of course

The Plazi project has become one of the major contributors to GBIF with some 36,000 datasets yielding some 500,000 occurrences (see Plazi's GBIF page for details). These occurrences are extracted from taxonomic publications using automated methods.

Problems with Plazi parsing: how reliable are automated methods for extracting specimens from the literature?

Just some thoughts as I work through some datasets linking taxonomic names to the literature. In the diagram above I've tried to capture the different situations I encounter. Much of the work I've done on this has focussed on case 1 in the diagram: I want to link a taxonomic name to an identifier for the work in which that name was published. In practice this means linking names to DOIs.

Linking taxonomic names to the literature

I've released a very crude GraphQL endpoint for WikiData. More precisely, the endpoint is for a subset of the entities that are of interest to WikiCite, such as scholarly articles, people, and journals. There is a crude demo at https://wikicite-graphql.herokuapp.com. The endpoint itself is at https://wikicite-graphql.herokuapp.com/gql.php.

GraphQL for WikiData (WikiCite)

uBioRSS: Tracking taxonomic literature using RSS

Aggregating, Tagging and Integrating Biodiversity Research

Over a decade ago RSS (RDF Site Summary or Really Simple Syndication) was attracting a lot of interest as a way to integrate data across various websites. Many science publishers would provide a list of their latest articles in XML in one of three flavours of RSS (RDF, RSS, Atom). This led to tools such as uBioRSS [1] and my own e-Biosphere Challenge: visualising biodiversity digitisation in real time.

Revisiting RSS to monitor the latest taxonomic research

Note to self. The challenge of finding specimen citations in papers keeps coming around. It seems that this is basically the same problem as finding citations to papers, and can be approached in much the same way. If you want to build a database of references from scratch, one way is to scrape citations from papers (e.g., from the "literature cited" section), convert those strings into structured data, and add those to your database.

Finding citations of specimens
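To make the "strings into structured data" step concrete for specimens, here is a minimal Python sketch of pulling candidate specimen codes out of free text. The regular expression, the assumed "ACRONYM number" shape of a specimen code, and the function name find_specimen_codes are all illustrative assumptions, not the method used in the post. A pattern this loose will produce false positives, so treat the output as candidates to be checked against a collection database.

```python
import re

# Assumed shape of a specimen code: an upper-case institution acronym
# (2-6 letters), an optional short collection prefix such as "A-",
# then a catalogue number, e.g. "USNM 123456" or "MCZ A-12345".
SPECIMEN_CODE = re.compile(r"\b([A-Z]{2,6})[ :]?([A-Z]{1,3}-)?(\d{3,})\b")


def find_specimen_codes(text):
    """Return one dict per candidate specimen code found in text."""
    results = []
    for match in SPECIMEN_CODE.finditer(text):
        results.append({
            "institution": match.group(1),
            # Drop the trailing hyphen from the optional prefix.
            "prefix": (match.group(2) or "").rstrip("-"),
            "number": match.group(3),
            "verbatim": match.group(0),
        })
    return results


if __name__ == "__main__":
    sentence = "Holotype USNM 123456 and paratype MCZ A-12345 were examined."
    for code in find_specimen_codes(sentence):
        print(code)
```

Keeping the verbatim string alongside the parsed parts means each candidate can be traced back to the text it came from.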

I've created a GitHub repository so that I can keep track of the examples of JSON-LD that I've seen being actively used, for example embedded in web sites, or accessed using an API. The repository is https://github.com/rdmpage/wild-json-ld. The list is by no means exhaustive, I hope to add more examples as I come across them. One reason for doing this is to learn what others are doing.

JSON-LD in the wild: examples of how structured data is represented on the web

Next few weeks will be busy with term starting, kids visiting, and other commitments, so time to jot down some ideas. The first is to have a Wiki for taxonomic names. Bit like Wikispecies, but actually useful, by which I mean useful for working biologists. This would mean links to digital literature (DOIs, Handles, etc.), use of identifiers for names and taxa (such as NCBI taxids, LSIDs, etc.), and having it pre-populated with data.

Half-baked ideas. I. Wiki for taxonomy

The good news is that the merger of Blackwell's digital content with that of Wiley's has not affected the DOIs, which is exactly as you'd expect, and is a nice demonstration of the power of identifiers that use indirection (although there was a time when Wiley was offline).  For example, the article identified by doi:10.1111/j.1095-8312.2003.00274.x had the URL http://www.blackwell-synergy.com/doi/abs/10.1111/j.1095-8312.2003.00274.x and now has

DOIs, the good news and the bad news

Nico Franz and Beckett W. Sterner recently published a preprint entitled "To Increase Trust, Change the Social Design Behind Aggregated Biodiversity Data" on bioRxiv http://dx.doi.org/10.1101/157214  Below is the abstract:  Below I respond to some specific points that annoyed me about this article, at the end I try and sketch out a more constructive response.

Response to To Increase Trust, Change the Social Design Behind Aggregated Biodiversity Data

D. Ross Robertson has published a paper entitled "Global biogeographical data bases on marine fishes: caveat emptor" (doi:10.1111/j.1472-4642.2008.00519.x - DOI is broken, you can get the article here). The paper concludes:  As I've noted elsewhere on this blog, and as demonstrated by Yesson et al.'s paper on legume records in GBIF (doi:10.1371/journal.pone.0001124) (not cited by Robertson), there are major problems with geographical information

Global biogeographical data bases on marine fishes: caveat emptor

I've made Species Cite live. This is a web site I've been working on with the GBIF Challenge as a notional deadline so I'll actually get something out the door. "Species Cite" takes as its inspiration the suggestion that citing original taxonomic descriptions (and subsequent revisions) would increase citation metrics for taxonomists, and give them the credit they deserve.

Species Cite: linking scientific names to publications and taxonomists

From Nature's blog on web technology and science comes this post on Open Text Mining Interface (OTMI): and further Currently playing in iTunes: By the Time I Get to Phoenix by Glen Campbell

Nascent: Open Text Mining Interface

Recently I’ve been exploring data downloaded from BOLD. Part of this was motivated by work done with David Schindel for a recent book:  In this blog post I record some struggles I’ve had with the supposedly “Frictionless” data provided by BOLD. I list a series of issues, and make some recommendations as to how these can be fixed. Previous versions disappear from site   The web page Data Packages lists datasets that can be downloaded.

Exploring BOLD's DNA barcode data releases: there's a fraction too much friction

Following the 2024 BHL meeting, and the departure of Martin Kalfatovic and the uncertainty the departure of such a pivotal person brings, perhaps it’s time to think about the future of BHL. Below I sketch some thoughts, which are hazy at best. I should say at the outset that I think BHL is an extraordinary project. My goal is to think about ways to enhance its utility and impact.

A future for the Biodiversity Heritage Library

TL;DR  These are some brief notes on the latest version (v. 2) of the Data Citation Corpus, released shortly before the Make Data Count Summit 2024, which also included a discussion on the practical uses of the corpus. I downloaded version 2 from Zenodo doi:10.5281/zenodo.13376773. The data is in JSON format, which I then loaded into CouchDB to play with.

The Data Citation Corpus revisited

How to cite:

 Page, R. (2023). It’s 2023 - why are we still not sharing phylogenies? https://doi.org/10.59350/n681n-syx67

A quick note to support a recent Twitter thread https://twitter.com/rdmpage/status/1729816558866718796?s=61&t=nM4XCRsGtE7RLYW3MyIpMA The article “Diversification of flowering plants in space and time” by Dimitrov et al. describes a genus-level phylogeny for 14,244 flowering plant genera.

It's 2023 - why are we still not sharing phylogenies?

How to cite:

 Page, R. (2023). Where are the plant type specimens? Mapping JSTOR Global Plants to GBIF. https://doi.org/10.59350/m59qn-22v52

This blog post documents my attempts to create links between two major resources for plant taxonomy: JSTOR’s Global Plants and GBIF, specifically between type specimens in JSTOR and the corresponding occurrence in GBIF.

Where are the plant type specimens? Mapping JSTOR Global Plants to GBIF

Pensoft have recently introduced “nanopubs”, small structured publications that can be thought of as containing the minimum possible statement that could be published. Nanopubs are promoted as FAIR, that is findable, accessible, interoperable, and reusable. I like the idea of nanopubs, but the examples I have seen so far are problematic.

Nanopubs, a way to create even more silos

I heard yesterday from Martin Kalfatovic (BHL) that David Remsen has died. Very sad news.

David Remsen

Quick notes to self on fulltext search and CouchDB. (Note that links to CouchDB are local to my machine(s), and won't work unless you are me, or have a copy of the same database running on your machine.) CouchDB and Lucene adds fulltext indexing to CouchDB. After a few false starts I now have this working.

CouchDB and Lucene

As trailed on a Twitter thread last week I’ve been working on a manuscript describing the efforts to map taxonomic names to their original descriptions in the taxonomic literature. The preprint is on bioRxiv doi:10.1101/2023.05.29.542697  Much of the work has been linking taxa to names, which still has huge gaps.

Ten years and a million links

Some quick notes on interface ideas for digital libraries and/or knowledge graphs. Recently there’s been something of an explosion in bibliographic tools to explore the literature.

Library interfaces, knowledge graphs, and Miller columns

My dad died last weekend. Below is a notice in today's New Zealand Herald. I'm in New Zealand for his funeral. Don't really have the words for this right now.

Dugald Stuart Page 1936-2022

More arm-waving notes on taxonomic databases. I've started to add data to ChecklistBank and this has got me thinking about the issue of data quality.

Can we use the citation graph to measure the quality of a taxonomic database?

Taxonomic treatments have come up in various discussions I'm involved in, and I'm curious as to whether they are actually being used, in particular, whether they are actually being cited. Consider the following quote:    "Traditional" academic citation is from article to article.

Does anyone cite taxonomic treatments?

There are several instances where I have a collection of references that I want to deduplicate and merge.

Deduplicating bibliographic data

If you compare the impact that BHL and Plazi have on GBIF, then it's clear that BHL is almost invisible. Plazi has successfully carved out a niche where they generate tens of thousands of datasets from text mining the taxonomic literature, whereas BHL is a participant in name only. It's not as if BHL lacks geographic data.

Thoughts on BHL, ALA, GBIF, and Plazi

How to cite:

 Page, R. (2021). Maximum entropy summary trees to display higher classifications https://doi.org/10.59350/af01t-6sw74

A challenge in working with large taxonomic classifications is how you display them to the user, especially if the user probably doesn't want all the gory details.

Maximum entropy summary trees to display higher classifications

Quick note on Frankenplace, a cool search tool that displays the geographic distribution of documents that match the user's query as a heatmap. Details of how the tool works are given in:  At the heart of the method is a discrete global grid that divides the world up into small areas of the same size.

Frankenplace, geospatial search, and discrete global grid systems

My paper "Ozymandias: A biodiversity knowledge graph" has been published in PeerJ https://doi.org/10.7717/peerj.6739  The paper describes my entry in GBIF's 2018 Ebbe Nielsen Challenge, which you can explore here. I tweeted about its publication yesterday, and got some interesting responses (and lots of retweets, thanks to everyone for those).  Carl Boettiger (@cboettig) asked where the triples were, as did Kingsley Uyi Idehen (@kidehen). Doh!

Ozymandias: A biodiversity knowledge graph published in PeerJ

David Shorthouse (@dpsspiders) makes some very cool things, and his latest project World Taxonomists & Systematists is a great example of using automation to assemble a list of the world's taxonomists and systematists. The project uses ORCID.

World Taxonomists and Systematists via ORCID

[Work in progress]  The "dummy" in this case is me. I'm trying to make sense of how to model taxa, especially in the context of linked data, and projects such as Wikidata where there is uncertainty over just what a taxon in Wikidata actually represents. There is also ongoing work by the TDWG Taxon Names and Concepts Interest Group. This is all very rough and I'm still working on this, but here goes.

Taxonomic concepts for dummies

Came across Microsoft's announcement of "A planetary computer for a sustainable future through the power of AI", complete with a glossy video featuring Lucas Joppa @lucasjoppa (see also @Microsoft_Green and #AIforEarth).  On the one hand it's great to see super smart people with lots of resources tackling important questions, but it's hard to escape the feeling that this is the classic technology company approach of framing difficult

A planetary computer for Earth

Following on from previous posts The Semantic Web made fun: d3sparql and The Biodiversity Heritage Library meets Wikidata via Wikispecies: adding author identifiers to BioStor I've put together an example query that can be used to extract a taxonomic classification from Wikidata.

Displaying taxonomic classifications from Wikidata using d3js and SPARQL
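As a rough illustration of what such a query involves, the Python sketch below builds a SPARQL query that walks Wikidata's "parent taxon" property (P171) down from a root item and also asks for each taxon's name (P225), then folds the resulting child/parent pairs into the nested name/children structure that d3 hierarchy layouts typically consume. The function names, the choice to return raw item identifiers rather than labels, and the row format are assumptions made for this sketch; the query in the post itself may well differ.

```python
from collections import defaultdict


def classification_query(root_qid):
    """Build a SPARQL query listing every taxon below root_qid with its parent."""
    return f"""
SELECT ?taxon ?name ?parent WHERE {{
  ?taxon wdt:P171+ wd:{root_qid} .   # P171 = parent taxon, followed transitively
  ?taxon wdt:P171 ?parent .          # immediate parent, used to build the tree
  ?taxon wdt:P225 ?name .            # P225 = taxon name
}}
"""


def build_tree(rows, root):
    """Turn (child, parent) pairs into a nested {"name", "children"} dict."""
    children = defaultdict(list)
    for child, parent in rows:
        children[parent].append(child)

    def subtree(node, seen):
        # Stop if we meet a node already on this path, so that a cycle
        # in the data cannot cause infinite recursion.
        if node in seen:
            return {"name": node, "children": []}
        seen = seen | {node}
        return {
            "name": node,
            "children": [subtree(c, seen) for c in sorted(children[node])],
        }

    return subtree(root, frozenset())


if __name__ == "__main__":
    # Hypothetical rows, standing in for the results of running the query.
    rows = [("B", "A"), ("C", "A"), ("D", "B")]
    print(build_tree(rows, "A"))
```

The query string would be sent to the Wikidata SPARQL endpoint; that step is left out here so the sketch runs without network access.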

It's Friday, so time for either a folly or a rant. BHL have put another user survey into the field http://www.surveymonkey.com/s/BHLsurvey. I loathe user surveys.

Where next for BHL?

At the start of this week I took part in a biodiversity informatics workshop at the Naturhistoriska riksmuseets, organised by Kevin Holston. It was a fun experience, and Kevin was a great host, going out of his way to make sure myself and other contributors were looked after. I gave my usual pitch along the lines of "if you're not online you don't exist", and talked about iSpecies, identifiers, and wikis.

Towards a wiki of phylogenies

The PLoS Biodiversity Hub has launched today. There's a PLoS blog post explaining the background to the project, as well as a summary on the Hub itself:  Readers of iPhylo may recall my account of one of the meetings involved in setting up this hub, in which I began to despair about the lack of readiness of biodiversity informatics to provide much of the information needed for projects such as hubs.

PLoS Biodiversity Hub launches

In between complaining about the lack of open data in biodiversity (especially taxonomy), and scraping data from various web sites to build stuff I'm interested in, I occasionally end up having interesting conversations with the people whose data I've been scraping, cleaning, cross-linking, and otherwise messing with. Yesterday I had one of those conversations at Kew Gardens.

On asking for access to data

Note to self (basically rewriting last year's Finding citations of specimens). Bibliographic data supports going from identifier to citation string and back again, so we can do a "round trip."

 1. Given a DOI we can get structured data with a simple HTTP fetch, then use a tool such as citation.js to convert that data into a human-readable string in a variety of formats.

Round trip from identifiers to citations and back again
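The first leg of that round trip can be sketched in a few lines of Python. DOI content negotiation lets you ask https://doi.org/ for CSL-JSON by sending an Accept header of application/vnd.citationstyles.csl+json. The formatter below is a deliberately crude stand-in for citation.js (which is a JavaScript library), and the function names and output style are assumptions made for illustration only. It also assumes container-title arrives as a plain string.

```python
import json
import urllib.parse
import urllib.request

CSL_JSON = "application/vnd.citationstyles.csl+json"


def doi_request(doi):
    """Build an HTTP request asking the DOI resolver for CSL-JSON."""
    url = "https://doi.org/" + urllib.parse.quote(doi)
    return urllib.request.Request(url, headers={"Accept": CSL_JSON})


def fetch_csl(doi):
    """Fetch CSL-JSON metadata for a DOI (needs network access)."""
    with urllib.request.urlopen(doi_request(doi)) as response:
        return json.load(response)


def author_name(author):
    """Format one CSL author as 'Family, G.' (or just the family name)."""
    family = author.get("family", "")
    initial = author.get("given", "")[:1]
    return f"{family}, {initial}." if initial else family


def format_citation(csl):
    """Crude citation string: authors (year). title. container."""
    authors = "; ".join(author_name(a) for a in csl.get("author", []))
    date_parts = csl.get("issued", {}).get("date-parts", [[]])
    year = date_parts[0][0] if date_parts and date_parts[0] else None
    head = " ".join(p for p in [authors, f"({year})" if year else ""] if p)
    pieces = [head, csl.get("title", ""), csl.get("container-title", "")]
    return ". ".join(p for p in pieces if p)


if __name__ == "__main__":
    # DOI taken from elsewhere on this page; requires a network connection.
    print(format_citation(fetch_csl("10.7717/peerj.6739")))
```

Going the other way, from a citation string back to a DOI, needs a search service rather than a simple fetch, which is why the two directions are not symmetrical.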

In a recent Twitter conversation including David Shorthouse and myself (and other poor souls who got dragged in) we discussed how to demonstrate that adopting JSON-LD as a simple linked-data friendly format might help bootstrap the long awaited "biodiversity knowledge graph" (see below for some suggestions for keeping JSON-LD simple). David suggests partnering with "Three small, early adopting projects". I disagree.

Bootstrapping the biodiversity knowledge graph with JSON-LD

I'm continuing to play with the new version of iSpecies, seeing just how far one can get by simply grabbing JSON from various sources and mashing them up. Since the Open Tree of Life is pretty unresolved ("OMG it's full of stars") I've started to grab trees from TreeBASE and add those.

iSpecies meets TreeBASE

In my previous post I discussed the potential for the Biodiversity Data Journal (BDJ) to be a venue for nano (or near-nano) publications. In this post I want to draw attention to what I think is a serious stumbling block, which is the lack of machine readable statements in the journal.

The Biodiversity Data Journal is not machine readable

I'm going to the TDWG Identifier Workshop this weekend, so I thought I'd jot down a few notes. The biodiversity informatics community has been at this for a while, and we still haven't got identifiers sorted out. From my perspective as both a data aggregator (e.g., BioNames) and a data provider (e.g., BioStor) there are four things I think we need to tackle in order to make significant progress.

On identifiers (again)

I've made a video walkthrough of Ozymandias, which I described in this post. It's a bit, um, long, so I'll need to come up with a shorter version.  Ozymandias - a biodiversity knowledge graph from Roderic Page on Vimeo.

Ozymandias demo

Mitch Leslie has written an article on EoL (doi:10.1126/science.316.5826.818). It starts: Déjà vu because the defunct All-Species Foundation -- also covered in Science (doi:10.1126/science.294.5543.769) -- had much the same ambitions six years ago. It is easy to be sceptical, but I think it was Rudy Giuliani who said "under promise, over deliver." Wise words.

EoL commentary in Science

I've just come back from II Iberian Congress of Biological Systematics (CISA2013) in Barcelona, where I had a great time.

Biodiversity informatics in charts

Continuing the theme of the failings of the GBIF classification I've been playing further with cluster maps to visualise the problem (see this earlier post for an introduction).  Browsing through bats in GBIF I keep finding the same species appearing more than once, albeit in different genera.

Cluster maps, papaya plots, and the trouble with GBIF taxonomy

Readers of this blog will know that I'm sceptical about the current value of linked data and RDF in biodiversity informatics. But I came across an interesting paper on RDF and biocuration that suggests a good "use case" for RDF in constructing and curating taxonomic databases. The paper is "Catching inconsistencies with the semantic web: a biocuration case study" (PDF here) by Jerven Bolleman and Sebastien Gehant.

A use case for RDF in taxonomy

Thinking about next steps for my BioStor project, one thing I keep coming back to is the problem of how to dramatically scale up the task of finding taxonomic literature online. While I personally find it oddly therapeutic to spend a little time copying and pasting citations into BioStor's OpenURL resolver and trying to find these references in BHL, we need something a little more powerful.

Next steps for BioStor: citation matching

Quick note to self about a possible way to use fuzzy matching when searching for taxonomic names. Now that I'm using Cloudant to host CouchDB databases (e.g., see BioStor in the cloud) I'd like to have a way to support fuzzy matching so that if I type in a name and misspell it, there's a reasonable chance I will still find that name. This is the "did you mean?" feature beloved by Google users.

Fuzzy matching taxonomic names using ngrams
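The idea behind n-gram matching can be shown in a few lines of Python. This sketch breaks each name into overlapping three-character chunks (trigrams) and scores two names by how many trigrams they share, using the Dice coefficient. The padding scheme, the 0.5 threshold, and the function names are assumptions chosen for illustration; this is an in-memory toy, not the CouchDB/Cloudant indexing that the post goes on to discuss.

```python
def trigrams(name):
    """Return the set of 3-character substrings of a lower-cased, padded name."""
    padded = f" {name.lower()} "  # pad so word starts and ends count too
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Dice coefficient on trigram sets: 1.0 = identical sets, 0.0 = disjoint."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return 2 * len(ta & tb) / (len(ta) + len(tb))


def did_you_mean(query, names, threshold=0.5):
    """Return the best-scoring name, or None if nothing reaches the threshold."""
    scored = [(similarity(query, name), name) for name in names]
    best_score, best_name = max(scored, default=(0.0, None))
    return best_name if best_score >= threshold else None


if __name__ == "__main__":
    names = ["Gastrotheca aureomaculata", "Gastrotheca cornuta", "Hyla arborea"]
    # A misspelt query (missing "a") should still find the intended name.
    print(did_you_mean("Gastrotheca aureomaculta", names))
```

In a database setting the trigrams would be stored as index keys, so that a query only needs to be compared with names sharing at least one trigram rather than with every name.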

Following on from the last post, I've now set up a trivial NCBI RDF service at bioguid.info/taxonomy/ (based on the ISSN resolver I released yesterday and announced on the Bibliographic Ontology Specification Group).  If you visit it in a web browser it's nothing special. However, if you choose to display XML you'll see some simple RDF.

NCBI RDF

If we are ever going to link biodiversity data together we need to have some way of ensuring persistent links between digital records. This isn't going to happen unless people take persistent identifiers seriously. I've been trying to link specimen codes in publications to GBIF, with some success, so imagine my horror when it started to fall apart.

Dear GBIF, please stop changing occurrenceIDs!

Here is my presentation from today's Anchoring Biodiversity Information: From Sherborn to the 21st century and beyond meeting.

Sherborn presentation on Open Taxonomy

This is a follow up to my previous post TDWG Challenge - what is RDF good for? where I'm being, frankly, a pain in the arse, and asking why we bother with RDF? In many ways I'm not particularly anti-RDF, but it bothers me that there's a big disconnect between the reasons we are going down this route and how we are actually using RDF.

Reflections on the TDWG RDF "Challenge"

Continuing on from my previous post Viewing scientific articles on the iPad: towards a universal article reader, here are some brief notes on the PLoS iPad app that I've previously been critical of.  There are two key things to note about this app. The first is that it uses the page turning metaphor. The article is displayed as a PDF, a page at a time, and the user swipes the page to turn it over.

Viewing scientific articles on the iPad: the PLoS Reader

To much fanfare (e.g., Nature News, "Linnaeus meets the Internet" doi:10.1038/news.2010.221), on May 5th PLoS ONE published Sandy Knapp's "Four New Vining Species of Solanum (Dulcamaroid Clade) from Montane Habitats in Tropical America" doi:10.1371/journal.pone.0010502.

Linnaeus meets the Internet: PLoS + Botany =  #fail

This week seems to be API week. The Encyclopedia of Life API Beta Test has been out since August 12th. By comparison with the Mendeley API that I've spent rather too much time trying to get to grips with, the EOL API release seems rather understated.

Navigating the Encyclopedia of Life tree on the desktop and the iPhone

Being in an unusually constructive mood, I've spent the last couple of days playing with the TreeBASE II API, in an effort to find out how hard it would be to replace TreeBASE's frankly ghastly interface. After some hair pulling and bad language I've got something to work. It's very crude, but gives a glimpse at what can be done.

Show me the trees! Playing with the TreeBASE API

Every so often I revisit the idea of browsing a collection of documents (or specimens, or phylogenies) geographically. It's one thing to display a map of localities for a single document (as I did most recently for Zootaxa), it's quite another to browse a large collection.

Browsing a digital library using a map

I've been playing a little with TreeBASE II, and the more I do the more I want to pull my hair out.

 Broken URLs

The old TreeBASE had a URL API, which databases such as NCBI made use of. For example, the NCBI page for Amphibolurus nobbi has a link to this taxon in TreeBASE.

TreeBASE II makes me pull my hair out

Yesterday I fired off a stream of tweets, starting with:  Various people commented on this, either on twitter or in emails, e.g.:  So, to clarify, I'm not abandoning wikis. I'm just frustrated with the limitations of Semantic Mediawiki (SMW). Now, SMW is a great piece of software with some cool features.

Wiki frustration

A decade ago (OMG, that can't be right, an actual decade ago) I created "iSpecies", a simple little tool to mashup a variety of data from GBIF, NCBI, Yahoo, Wikipedia, and Google Scholar to create a search engine for species.

iPhylo

Why is the Atlas of Living Australia invisible to Google?

Charting taxonomic knowledge

PageRank for biodiversity

Incomplete citation and ranking