OCR
While exploring the world of document scanning, for example for invoices, you have probably come across the term "OCR." You may even know that it stands for Optical Character Recognition. But what exactly is OCR, and what do you need to know to make the best use of OCR software?
The primary purpose of Optical Character Recognition is to quickly and automatically convert scanned images of machine-printed (typed) text into actual text data that you can search and modify. To a computer, a scanned page is nothing more than a collection of pixels, no more meaningful than any other image, such as a landscape photograph. This conversion is sometimes referred to as scan & recognize or scan & capture.
The exact technology behind this process is complicated. In short, an OCR engine examines the pixel data, looks for patterns that resemble letters, numbers and other symbols, and creates a digital record of those symbols.
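To make the idea of "looking for patterns in pixel data" concrete, here is a minimal sketch of template matching, one of the simplest ways to recognize a character. It is a toy illustration only: the 3x5 glyph shapes and the function names are made up for this example, and real OCR engines use far more elaborate methods.

```python
# Toy 3x5 glyph templates (hypothetical shapes; '#' = ink, '.' = paper).
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def similarity(glyph, template):
    """Fraction of pixels on which the glyph and the template agree."""
    total = sum(len(row) for row in template)
    same = sum(
        g == t
        for g_row, t_row in zip(glyph, template)
        for g, t in zip(g_row, t_row)
    )
    return same / total


def recognize(glyph):
    """Return the best-matching character and its similarity score."""
    return max(
        ((char, similarity(glyph, tpl)) for char, tpl in TEMPLATES.items()),
        key=lambda pair: pair[1],
    )


# A scanned "T" with one stray ink pixel in the middle row.
noisy_t = ["###", ".#.", ".##", ".#.", ".#."]
char, score = recognize(noisy_t)
print(char, round(score, 2))  # T 0.93
```

Even with one wrong pixel, the glyph still matches "T" best. The score below 1.0 is a small-scale picture of why poor scans lead to recognition errors.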
Scan and recognize
There are two main types of optical character recognition: full page OCR and zone OCR.
Full page OCR
Converts the entire page to one of the following formats:
- Plain text - Only the standard text information on the page is saved, in sequence.
- Formatted text - Text information is preserved in consecutive paragraphs, retaining font size and style. This can also preserve tabular data, such as spreadsheets.
- Exact copy - All information on the page is retained, including images, and is placed on the page in such a way as to recreate the closest thing to the original document.
- Searchable file - Text information is stored on a hidden layer behind the scanned image so that the file can be searched while maintaining the appearance of the original.
Zone OCR
Recognizes character strings located in specific areas (zones) of the page. This is usually done for indexing and document management purposes. The information can be used to name a file, save it to a particular location, or archive certain data in an organized format, such as a database.
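As a rough sketch of how zone OCR output might be used to name a file, the example below assumes the OCR step has already produced a list of words with pixel positions. The word list, the zone coordinates and the helper names are all hypothetical; an actual OCR product will expose this information in its own way.

```python
import re

# Hypothetical recognized words with their top-left position in pixels.
words = [
    {"text": "Invoice", "x": 1500, "y": 120},
    {"text": "INV-2041", "x": 1720, "y": 120},
    {"text": "Acme", "x": 200, "y": 130},
    {"text": "Total", "x": 1500, "y": 3100},
]


def text_in_zone(words, left, top, right, bottom):
    """Join, top to bottom and left to right, the words inside the zone."""
    inside = [
        w for w in words
        if left <= w["x"] < right and top <= w["y"] < bottom
    ]
    inside.sort(key=lambda w: (w["y"], w["x"]))
    return " ".join(w["text"] for w in inside)


def safe_filename(text, extension=".pdf"):
    """Turn recognized text into a conservative file name."""
    cleaned = re.sub(r"[^A-Za-z0-9_-]+", "_", text).strip("_")
    return cleaned + extension


# Assumed zone: the top-right corner, where this layout prints the number.
zone_text = text_in_zone(words, left=1400, top=0, right=2480, bottom=300)
print(safe_filename(zone_text))  # Invoice_INV-2041.pdf
```

The same zone text could just as well be written to a database field or used to pick a destination folder.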
OCR Software
OCR software comes in many different types, which vary in price based on their features, speed and accuracy. For example, you can get freeware such as SimpleOCR that lets you get started in no time, but it can only convert BMP, JPG and TIF images of English or French text into plain text documents in TXT or DOC format, one page at a time.
On the other hand, you can invest a few hundred dollars in Batch OCR or even Server OCR software that can watch certain folders for incoming documents in different image formats and languages, and then automatically make exact copies of all their pages in a format of your choice.
You can also find Desktop OCR software, which sits between the two in price and includes many of the features of the corporate editions, but still requires some user input during conversion.
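The folder-watching behaviour of batch and server OCR can be sketched roughly as follows. This is only an illustration of the idea, not how any particular product works: the folder layout, the `.txt` output convention and the `recognize` callback (a stand-in for a real OCR call) are assumptions made for the example.

```python
from pathlib import Path

IMAGE_EXTENSIONS = {".bmp", ".jpg", ".jpeg", ".tif", ".tiff", ".png"}


def pending_images(inbox, done):
    """List image files in `inbox` that have no output yet in `done`."""
    inbox, done = Path(inbox), Path(done)
    return sorted(
        path for path in inbox.iterdir()
        if path.is_file()
        and path.suffix.lower() in IMAGE_EXTENSIONS
        and not (done / (path.stem + ".txt")).exists()
    )


def process_inbox(inbox, done, recognize):
    """Run `recognize` (a placeholder for a real OCR call) on each new image."""
    Path(done).mkdir(parents=True, exist_ok=True)
    processed = []
    for image in pending_images(inbox, done):
        text = recognize(image)
        (Path(done) / (image.stem + ".txt")).write_text(text, encoding="utf-8")
        processed.append(image.name)
    return processed
```

A server product would typically call something like `process_inbox` on a schedule or in response to file-system events, so that documents are converted without user input.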
Accuracy of OCR Software
Although some OCR engines are better than others, no software can guarantee 100% accuracy. This is because other factors come into play, including scan quality. Recognition software will not be able to do its job if the scanner does not properly digitize the page.
For best results, scanning at a resolution of 300 dpi is recommended. Black and white (bitonal) mode is preferred over greyscale or color mode. Most modern scanners are configured fairly well out of the box, but you can adjust the brightness and contrast settings to suit your specific documents.
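To give a feel for what these settings mean in practice, the sketch below works out how many pixels an A4 page produces at 300 dpi and shows the simplest possible conversion from greyscale to bitonal: a fixed threshold. The function names and the threshold of 128 are assumptions for illustration; scanner drivers usually apply more refined methods.

```python
def page_pixels(width_mm, height_mm, dpi=300):
    """Pixel dimensions of a page scanned at the given resolution."""
    return round(width_mm / 25.4 * dpi), round(height_mm / 25.4 * dpi)


def to_bitonal(grey_rows, threshold=128):
    """Map 0-255 grey values to 1 (ink) or 0 (paper) with a fixed threshold."""
    return [
        [1 if value < threshold else 0 for value in row]
        for row in grey_rows
    ]


print(page_pixels(210, 297))  # A4 at 300 dpi: (2480, 3508)

# A tiny greyscale patch: light paper (high values) and dark ink (low values).
patch = [
    [250, 240, 30],
    [245, 20, 25],
]
print(to_bitonal(patch))  # [[0, 0, 1], [0, 1, 1]]
```

Raising or lowering the threshold plays a role similar to the brightness setting: too low and faint characters vanish, too high and background smudges turn into ink.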
OCR software is also limited in what it can recognize. Most OCR software is designed to recognize only machine-printed text, as opposed to handwriting. There is ICR (Intelligent Character Recognition) software that can recognize handwritten information, but it is usually part of enterprise-level solutions for forms processing rather than full page recognition.
Similarly, most OCR software can only convert traditional machine fonts, not italic scripts or calligraphy. Many fonts exist, and OCR engines rely on commonly used font shapes with clearly separated characters to recognize the text, so fonts that are unusual or whose characters run together may not be recognized.
Differences in OCR Software
The main features that distinguish OCR software are:
- Character recognition accuracy
- Page layout reconstruction accuracy
- Support for languages
- User interface design
- Output file formats (Word, Excel, PDF, eBook, etc.)
- OCR speed and support for multi-core CPUs
- Batch processing modes
- Advanced PDF encoding or compression
- Special features for niche projects
Because of the countless combinations of document types, OCR engines, project requirements and special features, one engine may perform better than another on your particular documents.
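One way to compare engines on your own documents is to type out a short passage by hand and measure how closely each engine's output matches it. The sketch below uses a common definition of character accuracy, one minus the edit distance divided by the length of the reference text. The sample strings are invented for illustration and are not results from any real engine.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete a character from a
                current[j - 1] + 1,            # insert a character into a
                previous[j - 1] + (ca != cb),  # substitute, or match for free
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference, recognized):
    """Share of the reference text that was recognized correctly (0.0 to 1.0)."""
    if not reference:
        return 1.0 if not recognized else 0.0
    errors = edit_distance(reference, recognized)
    return max(0.0, 1 - errors / len(reference))


reference = "Invoice 2041"   # typed by hand from the original page
engine_a = "Invoice 2O41"    # hypothetical output: letter O instead of zero
engine_b = "lnvoice 2O4l"    # hypothetical output: three confusions

print(round(character_accuracy(reference, engine_a), 3))  # 0.917
print(round(character_accuracy(reference, engine_b), 3))  # 0.75
```

Running the same check on a handful of representative pages gives a more reliable picture than any general accuracy figure.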
E-invoicing: the alternative
Despite its benefits, e-invoicing is still far from commonplace. E-invoicing is not just an internal matter: it depends on the cooperation of suppliers. Moreover, there is no single e-invoice standard, but many different XML-based standards, such as HR-XML, SETU, various UBL standards and Finvoice. In practice, this means converting incoming invoices from suppliers to the standard you want to work with.
There are several solutions for this, which differ in how much work they take off your hands. Incoming paper invoices, for example, can be scanned by you and then translated online into a validated data file (XML). You can do that translation yourself: validate the scanned document (accounts payable data), encode it (general ledger accounts), and then send the validated data file into an authorization workflow for approval by budget holders.
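As a rough illustration of what "translating an invoice into a validated data file (XML)" can look like, the sketch below checks that the required fields are present and then writes them out as XML. The element names, the field list and the sample values are invented for this example and do not follow UBL, Finvoice or any other actual e-invoice standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical set of fields an invoice needs before it can be approved.
REQUIRED_FIELDS = ("supplier", "number", "total", "currency", "ledger_account")


def missing_fields(invoice):
    """Return the required fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not invoice.get(name)]


def invoice_to_xml(invoice):
    """Serialize a validated invoice to a simple, made-up XML layout."""
    missing = missing_fields(invoice)
    if missing:
        raise ValueError("invoice is missing: " + ", ".join(missing))
    root = ET.Element("Invoice")
    ET.SubElement(root, "Supplier").text = invoice["supplier"]
    ET.SubElement(root, "Number").text = invoice["number"]
    total = ET.SubElement(root, "Total", currency=invoice["currency"])
    total.text = invoice["total"]
    ET.SubElement(root, "LedgerAccount").text = invoice["ledger_account"]
    return ET.tostring(root, encoding="unicode")


# Sample data as it might come out of recognition plus manual encoding.
invoice = {
    "supplier": "Acme",
    "number": "INV-2041",
    "total": "121.00",
    "currency": "EUR",
    "ledger_account": "4400",
}
print(invoice_to_xml(invoice))
```

In a real setup the output would follow whichever standard your accounting or workflow system expects, and validation would cover far more than the presence of a few fields.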
Automation of invoice receipt
Download our white paper on automatic invoice receipt: from OCR to e-invoicing.
Solutions
We offer an OCR service that allows you to scan and recognize documents. Our service is based on Kofax's scan and recognize software.
For incoming invoices, we have a turnkey invoice recognition solution, which can optionally be extended to include invoice validation and invoice processing (encoding and matching).