The following is a guest post by Bob Mesibov. The i4Life project has very kindly liberated Catalogue of Life (CoL) data from its database, and you can now download the latest CoL as a set of plain text, tab-separated tables here.
One of the things BioNames will need to do is match taxon names to classifications. For example, if I want to display a taxonomic hierarchy for the user to browse through the names, then I need a map between the taxon names that I've collected and one or more classifications. The approach I'm taking is to match strings, wherever possible using both the name and taxon authority.
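As a rough illustration of matching on both name and authority, here is a minimal Python sketch. It is not the BioNames code: the function names, the `(taxon_id, name, authority)` tuple layout, and the `col:` identifiers are all assumptions made up for the example. It simply builds a normalised key from the name plus authority string and looks that key up in an index built from a classification.

```python
import re

def normalize(name, authority=""):
    """Build a lowercase key from name + authority, dropping punctuation and extra spaces."""
    text = f"{name} {authority}".lower()
    text = re.sub(r"[^\w\s]", " ", text)   # commas, brackets, ampersands become spaces
    return " ".join(text.split())          # collapse runs of whitespace

def build_index(classification):
    """Map each normalised key to the taxon ids that carry it.

    `classification` is assumed to be an iterable of (taxon_id, name, authority).
    """
    index = {}
    for taxon_id, name, authority in classification:
        index.setdefault(normalize(name, authority), []).append(taxon_id)
    return index

def match(name, authority, index):
    """Return the taxon ids whose name and authority match exactly after normalisation."""
    return index.get(normalize(name, authority), [])

# Hypothetical classification rows; the ids are invented for illustration.
rows = [
    ("col:1", "Homo sapiens", "Linnaeus, 1758"),
    ("col:2", "Apis mellifera", "Linnaeus, 1758"),
]
index = build_index(rows)
hits = match("Homo  sapiens", "Linnaeus 1758", index)
```

Note that stripping punctuation also discards the parentheses that zoological authorities use to flag a changed genus, so a real matcher might want to keep that information separately.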
I'm trying to get my head around the data model used by ZooBank to store taxonomic names. To do this, I've built a graph for the species Belonoperca pylei described by Baldwin &
Quick note to self about a possible way to use fuzzy matching when searching for taxonomic names. Now that I'm using Cloudant to host CouchDB databases (e.g., see BioStor in the cloud) I'd like to have a way to support fuzzy matching so that if I type in a name and misspell it, there's a reasonable chance I will still find that name. This is the "did you mean?" feature beloved by Google users.
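To make the "did you mean?" idea concrete, here is a small Python sketch. It does not use Cloudant or CouchDB at all; it swaps in the standard library's `difflib.get_close_matches` as a stand-in for whatever fuzzy index the database would provide. The function name `did_you_mean`, the list of names, and the `cutoff` value are assumptions chosen for the example.

```python
import difflib

def did_you_mean(query, names, n=3, cutoff=0.8):
    """Suggest up to `n` known names that are close to a possibly misspelt query.

    Comparison is case-insensitive; `cutoff` is difflib's similarity threshold (0 to 1).
    """
    lookup = {name.lower(): name for name in names}   # lowercase form -> original spelling
    close = difflib.get_close_matches(query.lower(), list(lookup), n=n, cutoff=cutoff)
    return [lookup[c] for c in close]

# A tiny, made-up list of known names.
known_names = ["Homo sapiens", "Apis mellifera", "Pan troglodytes"]
suggestions = did_you_mean("Homo sapeins", known_names)
```

This compares the query against every name in memory, which is fine for a few thousand names but would need a proper index for a database the size of a full taxonomic catalogue.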
In Arthur C. Clarke's short story The Nine Billion Names of God, Tibetan monks hire two programmers to help them generate all the possible names of God. The monks believe that the purpose of the Universe is to generate those names; once that goal is achieved, the Universe will end.
Google Refine is an elegant tool for data cleaning. One of its most powerful features is the ability to call "Reconciliation Services" to help clean data, for example by matching names to external identifiers. Google Refine comes with the ability to use Freebase reconciliation services, but you can also add external services. Inspired by this I've started to implement services to reconcile taxonomic names.
I've recently updated my database of links between animal taxonomic names and literature identifiers, which now has over 280,000 names linked to some form of identifier (127,000 of these being DOIs). You can see the current version here: http://iphylo.org/~rpage/itaxon/ As an experiment I've added a feature to list the number of names for each journal.
As part of my Quixotic attempt to construct a wiki of taxonomic names, I'm building a database of names and links. My current plan is to seed this with the NCBI taxonomy. What I want to do is flesh out the NCBI taxonomy with authorities and links to the original literature. At the moment the NCBI taxonomy is almost "nude", lacking links to the literature behind the names.