Optical character recognition (OCR) is often an indispensable tool in data entry and transcription. The Royal Botanic Garden Edinburgh (RBGE) has used it to expand its digital archive, and the digitization drive is being carried out with great enthusiasm.
What a Herbarium is All About
A herbarium preserves plant specimens from bygone eras in their original form: though dead, they remain recognizable. The specimens are dried by pressing them between sheets of paper and are then mounted on card. This is how plant specimens have been preserved and kept through the centuries, and together they provide a reference collection of the various kinds of flora that populate the earth. The RBGE houses one such herbarium, and it is looking to digitize these specimens and store them as virtual records.
Specimens in the RBGE herbarium date back to 1697, while the nearly three million specimens housed here represent more than half of all flora in the world. Moreover, the collection gets richer by around 10,000 to 20,000 specimens every year.
The Need for Online Documentation
However, the difficulty of sharing the specimens had held the herbarium back. Physically sharing specimens always carries the risk of damage or loss, and even setting that risk aside, access is limited to people who visit the herbarium unless specimens are sent out on loan. Online documentation of the specimens increases access to the herbarium’s collection and also contributes to research.
As part of this process, RBGE embarked on imaging the specimens but faced issues in capturing the text on the specimen labels, which appears in various languages and fonts. Many defining features of a plant cannot be seen in the specimen itself; features such as habitat, scent and flower color have to be described in text. This information, found on the specimen labels, was entered manually in a time-consuming procedure, which left many records in the database incomplete. That’s where the need for text recognition was felt.
The Need for OCR Text Recognition
A smart technique was needed to capture the text on specimen labels even when the text is complex and the label is in poor condition. Most importantly, all the information had to be captured with none of it lost. The technique also had to be integrated efficiently into RBGE’s Image Management System.
RBGE’s Text Capture and Image Management
That’s where OCR came in. This technology, an integral element of document conversion services, made it possible to capture the label text and convert it into editable digital information. RBGE used it to convert scanned images into text documents so that the information could be classified, searched and exported to RBGE’s internal system for document storage and management.
The Recognition Server accesses the TIFF images (the format in which RBGE stores images) held in one of the folders of the herbarium’s Image Management System and processes them, creating two output files: an image PDF for backup and a plain text file that is saved in a folder on the RBGE server. RBGE’s workflow picks up the text file and enters it into the MySQL database. From there, researchers worldwide can easily access it through the RBGE website or other reputable online botany resources.
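The pickup step described above, where a workflow collects the plain text files and enters them into a database, might look roughly like the sketch below. It is only an illustration and not RBGE’s actual code: Python’s built-in sqlite3 module stands in for MySQL so the example runs without a database server, and the function name load_label_text, the table label_text, its columns, and the convention that a text file shares its name with its TIFF image are all assumptions made for this example.

```python
import sqlite3
import tempfile
from pathlib import Path


def load_label_text(text_dir: Path, conn: sqlite3.Connection) -> int:
    """Insert each OCR text file in text_dir into the database.

    Returns the number of files loaded. The table and column names are
    hypothetical; sqlite3 is used here as a stand-in for MySQL.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS label_text ("
        "image_name TEXT PRIMARY KEY, ocr_text TEXT NOT NULL)"
    )
    loaded = 0
    for txt_path in sorted(text_dir.glob("*.txt")):
        text = txt_path.read_text(encoding="utf-8").strip()
        if not text:
            continue  # skip files where OCR produced no text
        # Assumed convention: specimen_0001.txt belongs to specimen_0001.tif
        image_name = txt_path.stem + ".tif"
        conn.execute(
            "INSERT OR REPLACE INTO label_text (image_name, ocr_text) "
            "VALUES (?, ?)",
            (image_name, text),
        )
        loaded += 1
    conn.commit()
    return loaded


if __name__ == "__main__":
    # Demo with a temporary folder and an in-memory database.
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp)
        (folder / "specimen_0001.txt").write_text(
            "Habitat: open hillside. Flowers pink.", encoding="utf-8"
        )
        conn = sqlite3.connect(":memory:")
        count = load_label_text(folder, conn)
        print(count, "file(s) loaded")
        for row in conn.execute("SELECT image_name, ocr_text FROM label_text"):
            print(row)
        conn.close()
```

Keying each row on the image name means that re-running the step after a fresh OCR pass replaces the old text instead of duplicating it. A real deployment would swap the sqlite3 connection for a MySQL client and use whatever table layout the existing database defines.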
That’s job done and the mission of the RBGE fulfilled, all thanks to OCR.