Rogue Scholar

Published August 31, 2023

How to cite: Page, R. (2023). Document layout analysis. https://doi.org/10.59350/z574z-dcw92

Some notes to self on document layout analysis. I’m revisiting the problem of taking a PDF or a scanned document and determining its structure (for example, where is the title, abstract, bibliography, where are the figures and their captions, etc.). There are lots of papers on this topic, and lots of tools.

BHL, Cloudant, CouchDB, DjVu, Search, Computer and Information Sciences

Demo of full-text indexing of BHL using CouchDB hosted by Cloudant

https://doi.org/10.59350/4crdc-fm682

Published August 10, 2015

Author Roderic Page

One of the limitations of the Biodiversity Heritage Library (BHL) is that, unlike say Google Books, its search functions are limited to searching metadata (e.g., book and article titles) and taxonomic names. It doesn't support full-text search, by which I mean you can't just type in the name of a locality, specimen code, or a phrase and expect to get back much in the way of results.

BHL, Code, DjVu, HOCR, JATS, Computer and Information Sciences

Towards BioStor articles marked up using Journal Archiving Tag Set

https://doi.org/10.59350/18fc1-gxf54

Published December 4, 2013

Author Roderic Page

A while ago I posted "BHL to PDF workflow", a sketch of a workflow to generate clean, searchable PDFs from Biodiversity Heritage Library (BHL) content. I've made some progress on putting this together, as well as expanding the goal somewhat. In fact, there are several goals: BioStor articles need to be archived somewhere.

BHL, DjVu, PDF, Computer and Information Sciences

BHL to PDF workflow

https://doi.org/10.59350/7zndf-5ds97

Published June 15, 2012

Author Roderic Page

Just some random thoughts on creating searchable PDFs for articles extracted from BHL.

Background, BHL, BioStor, DjVu, RTFM, Computer and Information Sciences

BHL, DjVu, and reading the f*cking manual

https://doi.org/10.59350/nmpja-1sh38

Published April 15, 2011

Author Roderic Page

One of the biggest challenges I've faced with the BioStor project, apart from dealing with messy metadata, has been handling page images. At present I get these from the Biodiversity Heritage Library. They are big (typically 1 MB in size), and have the caramel colour of old paper. Nothing fills up a server quicker than thousands of images.

BHL, DjVu, Google Docs, Javascript, R-tree, Computer and Information Sciences

Towards an interactive DjVu file viewer for the BHL

https://doi.org/10.59350/fyb2j-f1367

Published October 8, 2010

Author Roderic Page

The bulk of the Biodiversity Heritage Library's content is available as DjVu files, which package together scanned page images and OCR text. Websites such as BHL or my own BioStor display page images, but there's no way to interact with the page content itself.

BHL, BioStor, DjVu, IPad, Computer and Information Sciences

BHL and the iPad

https://doi.org/10.59350/wqwze-53438

Published September 13, 2010

Author Roderic Page

@elyw I'd leave bookmarking to 3rd party, e.g. Mendeley. #bhlib specific issues incl.

DjVu, XML, XSLT, Computer and Information Sciences

DjVu XML to HTML

https://doi.org/10.59350/q4ev5-ffd09

Published March 24, 2010

Author Roderic Page

This post is simply a quick note on some experiments with DjVu that I haven't finished. Much of BHL's content is available as DjVu files, which contain both the scanned images and OCR text, complete with co-ordinates of each piece of text. This means that it would, in principle, be trivial to lay out the bounding boxes of each text element on a web page.
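As a rough illustration of the idea in the excerpt above, here is a minimal Python sketch that reads word coordinates from DjVu XML and emits absolutely positioned HTML boxes. It is not the code from the post. The sample XML is made up, the function names `word_boxes` and `boxes_to_html` are hypothetical, and it assumes each WORD element carries a `coords` attribute whose first four values are left, bottom, right, top in pixels measured from the top-left of the page; check this against your own files.

```python
import xml.etree.ElementTree as ET
from html import escape

# Made-up sample in the general shape of DjVu XML (an assumption, not real BHL data).
SAMPLE = """<OBJECT width="1000" height="1500">
  <HIDDENTEXT>
    <PAGECOLUMN><REGION><PARAGRAPH><LINE>
      <WORD coords="100,250,300,200">Document</WORD>
      <WORD coords="320,250,450,200">layout</WORD>
    </LINE></PARAGRAPH></REGION></PAGECOLUMN>
  </HIDDENTEXT>
</OBJECT>"""


def word_boxes(xml_text):
    """Return one dict per WORD with its text and pixel bounding box."""
    root = ET.fromstring(xml_text)
    boxes = []
    for word in root.iter("WORD"):
        coords = word.get("coords")
        if not coords:
            continue  # skip words without coordinates
        parts = coords.split(",")[:4]
        if len(parts) < 4:
            continue  # skip malformed coordinate lists
        # Assumed order: left, bottom, right, top.
        left, bottom, right, top = (int(v) for v in parts)
        boxes.append({
            "text": word.text or "",
            "left": left,
            "top": top,
            "width": right - left,
            "height": bottom - top,
        })
    return boxes


def boxes_to_html(boxes, scale=1.0):
    """Render boxes as absolutely positioned divs inside a relative container."""
    divs = []
    for b in boxes:
        style = (
            f"position:absolute;"
            f"left:{round(b['left'] * scale)}px;"
            f"top:{round(b['top'] * scale)}px;"
            f"width:{round(b['width'] * scale)}px;"
            f"height:{round(b['height'] * scale)}px;"
            f"border:1px solid blue;"
        )
        divs.append(f'<div style="{style}" title="{escape(b["text"], quote=True)}"></div>')
    return '<div style="position:relative;">\n' + "\n".join(divs) + "\n</div>"


if __name__ == "__main__":
    print(boxes_to_html(word_boxes(SAMPLE), scale=0.5))
```

The `scale` parameter is there because scanned pages are usually much larger than the image shown in a browser, so the boxes need to shrink by the same factor as the displayed page image.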

iPhylo
