Tesseract, Google's New OCR Engine

ZDNet announced earlier that Google has been working with HP Labs to "dust off" an optical character recognition (OCR) engine called Tesseract. OCR is the process of converting pages from books and other documents into text that can then be sorted and spidered for indexing within Google. This process is still in its infancy, but Google's aim is to have thousands of books online within the next year.

Luc Vincent, a lead Google engineer working on the Tesseract project, states that despite several visual shortcomings, "Tesseract is far more accurate than any other Open Source OCR package out there."

The announcement comes on the heels of news about Google Books, and of new concerns in the publishing industry about what happens once Google begins to offer tens of thousands of books and other publications online.
