Friday, October 19, 2012

Optical Character Recognition: Some Advice

A properly done OCR (Optical Character Recognition) task is not simply about text extraction; it also involves a set of operations meant to optimize the OCR process and increase the efficiency of overall document-management practice.

In other words, operations commonly considered "adjacent" can greatly improve or completely ruin text recognition, making your later life either comfortable or a living hell.

Here are just a few things to keep in mind:

(1) Before scanning

When placing the paper in the scanner, make sure the pages have the correct text orientation. Otherwise you will waste time later, either waiting for the OCR software to determine the orientation automatically or, even worse, correcting it manually by checking file by file.
Choose scan settings that ensure the best quality for OCR (for example, a resolution of 250 or 300 dpi is considered optimal for most documents).
Test the OCR output on a few pages before starting a batch scanning operation to make sure your settings are properly tuned.
Select a lossless file format (such as TIFF) and do not be afraid of large file sizes if the documents are important to you: storage space is rarely an issue these days, and you can later convert the files to any other format for handling or sharing.
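
To get a feel for the file sizes these settings imply, here is a minimal Python sketch that estimates the pixel dimensions and the uncompressed size of one scanned page. The function name scan_size and the A4 page size are assumptions chosen for illustration; real files are usually smaller, since lossless compression reduces the stored size.

```python
# Rough estimate of scan dimensions and uncompressed size (illustrative only).
MM_PER_INCH = 25.4


def scan_size(width_mm, height_mm, dpi, bits_per_pixel):
    """Return (width_px, height_px, uncompressed_bytes) for one scanned page."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    # Round up to whole bytes; per-row padding is ignored in this estimate.
    return width_px, height_px, (total_bits + 7) // 8


if __name__ == "__main__":
    # Hypothetical example: an A4 page (210 x 297 mm) scanned at 300 dpi.
    for label, bpp in [("black and white", 1), ("grayscale", 8), ("colour", 24)]:
        w, h, size = scan_size(210, 297, 300, bpp)
        print(f"A4 at 300 dpi, {label}: {w} x {h} px, "
              f"about {size / 1e6:.1f} MB uncompressed")
```

Running the same estimate for 250 dpi, or for your own page size, shows quickly whether storage is really a concern for a given batch.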

In fact, for important document archives, the best approach may be to store the "original" files in TIFF format, move them to an external storage device or medium (an external hard disk or DVD), and do day-to-day work on a duplicate archive whose files have been converted to whatever format you consider optimal for your needs (JBIG2, PDF...).

To a certain extent, this approach is similar to how professionals in digital photography use the camera RAW format.

(2) After scanning

Use meaningful filenames for the resulting files, and do not mind if they become lengthy: this is not hard to do with automated file-naming tools and, even if it takes a bit more of your time when the files are created, it can be a real lifesaver later. Make sure the filename contains important data, such as the language of the text, to name just one detail that matters for OCR.
Do not hesitate to use image enhancement techniques: you cannot control the quality of the paper documents, nor the particular characteristics of the hardware (such as the scanner) that might influence output quality (just one example among dozens: tiny scratches on the scanner's glass).
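
The file-naming advice in the list above could be automated roughly as follows. This is only a sketch: the function build_scan_filename, the field order and the three-letter language code are hypothetical choices, not a convention required by any particular OCR tool, and the slug step keeps only ASCII letters and digits.

```python
import re
from datetime import date


def build_scan_filename(title, language, scan_date, page, ext="tif"):
    """Build a descriptive name such as 'invoice-acme_eng_2012-10-19_p0003.tif'."""
    # Lowercase the title and collapse every run of other characters into a hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    if not slug:
        raise ValueError("title must contain at least one ASCII letter or digit")
    # Zero-padded page numbers keep the files in order when sorted by name.
    return f"{slug}_{language.lower()}_{scan_date.isoformat()}_p{page:04d}.{ext}"


if __name__ == "__main__":
    print(build_scan_filename("Invoice ACME", "ENG", date(2012, 10, 19), 3))
```

Putting the language in a fixed position makes it easy for a later batch job to pick the right OCR language from the filename alone.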

To overcome such problems, professional document imaging software vendors provide their users with a wide range of image correction features, such as brightness/contrast/gamma adjustment, median filtering and auto-deskew.
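
As an illustration of one of these corrections, here is a minimal sketch of a 3x3 median filter in plain Python. It assumes a grayscale image represented as a list of rows of integers (0 = black, 255 = white); this is not how any particular product implements it, and real imaging software works on far larger images with optimized code.

```python
from statistics import median_low


def median_filter_3x3(image):
    """Return a filtered copy of a grayscale image given as a list of rows of ints."""
    height = len(image)
    width = len(image[0]) if height else 0
    result = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            # Collect the pixel's neighbourhood, clipped at the image borders.
            window = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            # median_low always returns a value that occurs in the window.
            result[y][x] = median_low(window)
    return result


if __name__ == "__main__":
    # Hypothetical example: a white page with one isolated dark speck.
    page = [[255] * 5 for _ in range(5)]
    page[2][2] = 0
    cleaned = median_filter_3x3(page)
    print(cleaned[2][2])
```

Because each pixel is replaced by the median of its neighbourhood, isolated specks disappear while larger shapes such as character strokes are mostly preserved.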

Founded in 2003, ORPALIS is a privately held, fast-growing company that produces Document Imaging toolkits for developers (SDKs) and applications for end users.

ORPALIS develops and maintains the comprehensive document imaging toolkit series released under the brand "GdPicture", which is now known and respected worldwide as a leader in imaging technologies. http://www.gdpicture.com/

In 2011, ORPALIS released PaperScan, marking the beginning of a new line of products meant for end-users.

ORPALIS software is used by hundreds of thousands of customers all over the world. The customer list includes companies like IBM, Dell, Philips, Siemens, Xerox... http://www.orpalis.com/

