Computer and Information SciencesBlogger

iPhylo

Rants, raves (and occasionally considered opinions) on phyloinformatics, taxonomy, and biodiversity informatics. For more ranty and less considered opinions, see my Twitter feed.ISSN 2051-8188. Written content on this site is licensed under a Creative Commons Attribution 4.0 International license.
Home PageAtom FeedMastodonISSN 2051-8188
language
Published

Recently I’ve been exploring data downloaded from BOLD. Part of this was motivated by work done with David Schindel for a recent book: In this blog post I record some struggles I’ve had with the supposedly “Frictionless” data provided by BOLD. I list a serious of issues, and make some recommendations as to how these can be fixed. Previous versions disappear from site The web page Data Packages lists datasets that can be downloaded.

Published

TL;DR These are some brief notes on the latest version (v. 2) of the Data Citation Corpus, relased shortly before the Make Data Count Summit 2024, which also included a discussion on the practical uses of the corpus. I downloaded version 2 from Zenodo doi:10.5281/zenodo.13376773. The data is in JSON format, which I then loaded into CouchDB to play with.

Published

This post is inspired by the Pharaoh exhibition at the NGV in Melbourne, Australia. This is a beautifully displayed exhibition of objects from the British Museum, London. It has all the trappings of a modern exhibition, beautiful lighting, a custom sound track, and lots of social media coverage. But I found it immensely frustrating to visit.

Published

Following the 2024 BHL meeting, and the departure of Martin Kalfatovic and the uncertainty the departure of such a pivitol person brings, perhaps it’s time to think about the future of BHL. Below I sketch some thoughts, which are hazy at best. I should say at the outset that I think BHL is an extraordinary project. My goal is to think about ways to enhance its utility and impact.

Published

Pensoft have recently introduced “nanopubs”, small structured publications that can be thought of as containing the minimum possible statement that could be published. Nanopubs are promoted as FAIR, that is findable, accessible, interoperabile, and reusable. I like the idea of nanopubs, but the examples I have seen so far are problematic.

Published

How to cite: Page, R. (2024). Notes on transforming BHL images https://doi.org/10.59350/2gpbb-98a53 I’ve been down this road before, e.g. BHL, DjVu, and reading the f*cking manual and Demo of full-text indexing of BHL using CouchDB hosted by Cloudant, but I’m revisiting converting BHL page scans to black and white images, partly to clean them up, to make them closer to what a modern reader might expect, and partly to reduce the

Published

How to cite: Page, R. (2024). Hugging Face Autotrain https://doi.org/10.59350/7p1n4-wdv84 These are notes to myself on using Hugging Face AutoTrain. The first version of this had a very nice interface where you could simply upload a folder of images and train a model. It was limited in the range of tasks and models, but made up for that in ease of use.

Published

How to cite: Page, R. (2024). Problems with the DataCite Data Citation Corpus https://doi.org/10.59350/t80g1-xys37 DataCite have released the Data Citation Corpus, together with a dashboard that summarises the corpus. This is billed as: The goal is to build a citation database between scholarly articles and data, such as datasets in repositories, sequences in GenBank, protein structures in PDB, etc.