Duplicate records are the bane of any project that aggregates data from multiple sources.
As part of my mantra that it's not about the data, it's all about the links between the data, I've started exploring matching GenBank sequences to GBIF occurrences using the specimen_voucher codes recorded in GenBank sequences. It's quickly becoming apparent that this is not going to be easy.
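To give a flavour of why this is hard, here is a minimal sketch of the kind of normalisation that voucher matching seems to need. It assumes a voucher code is roughly "institution acronym plus catalogue number" with arbitrary punctuation in between. The function names (`voucher_key`, `match_vouchers`) and the example codes are hypothetical, invented for illustration; they are not GenBank or GBIF field names or API calls.

```python
import re


def voucher_key(code):
    """Reduce a voucher string to an (acronym, number) key, or None.

    Assumption: the first run of letters is the institution acronym and
    the last run of digits is the catalogue number.
    """
    tokens = re.findall(r"[A-Za-z]+|\d+", code.upper())
    letters = [t for t in tokens if t.isalpha()]
    digits = [t for t in tokens if t.isdigit()]
    if not letters or not digits:
        return None
    # Strip leading zeros so "012345" and "12345" compare equal.
    return (letters[0], digits[-1].lstrip("0") or "0")


def match_vouchers(genbank_vouchers, occurrence_codes):
    """Map each GenBank voucher string to a list of matching occurrence ids.

    occurrence_codes is a dict of {occurrence id: catalogue code string}.
    """
    index = {}
    for occurrence_id, code in occurrence_codes.items():
        key = voucher_key(code)
        if key is not None:
            index.setdefault(key, []).append(occurrence_id)
    return {v: index.get(voucher_key(v), []) for v in genbank_vouchers}


if __name__ == "__main__":
    # Made-up codes, purely to show the punctuation problem.
    occurrences = {"occ-1": "MVZ:Herp:012345", "occ-2": "USNM 987"}
    print(match_vouchers(["MVZ 12345", "usnm-987", "12345"], occurrences))
```

Note that the key is deliberately lossy: it throws away the collection code ("Herp"), so two specimens with the same number in different collections at one institution would collide. Keeping the collection code avoids that, but then codes that omit it no longer match, which is exactly the kind of trade-off that makes this difficult.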
Given various discussions about identifiers, dark taxa, and DNA barcoding that have been swirling around the last few weeks, there's one notion that is starting to bug me more and more.
I'm in the midst of rebuilding iSpecies (my mash-up of Wikipedia, NCBI, GBIF, Yahoo, and Google search results) with the aim of outputting the results in RDF. The goal is to convert iSpecies from a pretty crude "on-the-fly" mash-up to a triple store where results are cached and can be queried in interesting ways. Why?
At the end of day two of the GBIF LSID-GUID Task Group I put together this crude diagram to summarise some of the possible links between biodiversity data and the larger linked data cloud, which I, among others, have argued is where biodiversity informatics should be heading.
OK, really must stop avoiding what I'm supposed to be doing (writing a paper, already missed the deadline), but continuing the theme of LSIDs and short URLs, it occurs to me that LSIDs can be seen as a disaster (don't work in web browsers, nobody else uses them, hard to implement, etc.) or an opportunity.
The latest post on the EOL blog (Biodiversity in a rapidly changing world) really, really annoys me. It claims that Nope, I suggest it demonstrates just how limited EOL is. If I view the page for the red lionfish I get an out of date map from GBIF that shows a very limited distribution, and doesn't show the introductions in Florida and the Bahamas (I have to wade through text to find reference to the Florida introduction, and the page doesn't
D. Ross Robertson has published a paper entitled "Global biogeographical data bases on marine fishes: caveat emptor" (doi:10.1111/j.1472-4642.2008.00519.x - DOI is broken, you can get the article here). The paper concludes: As I've noted elsewhere on this blog, and as demonstrated by Yesson et al.'s paper on legume records in GBIF (doi:10.1371/journal.pone.0001124) (not cited by Robertson), there are major problems with geographical information
As spotted by dechronization, GBIF has made public that Vince Smith has won the 2008 Ebbe Nielsen Prize.
Resurrecting iSpecies after moving it to a new folder on one of my servers, and browsing popular searches, I keep coming across clearly erroneous distributions. FishBase seems a major culprit. For example, the common pandora Pagellus erythrinus is a marine fish, yet GBIF displays numerous occurrences in mainland Africa (dots with black centre on map below). What gives?