Tesseract, Google's New OCR Engine

ZDNet announced earlier that Google has been working with HP Labs to "dust off" an optical character recognition (OCR) engine called Tesseract. OCR is the process of converting pages from books and other documents into text that can then be sorted and spidered for indexing within Google. This process is still in its infancy, but Google's aim is to have thousands of books online within the next year.

Luc Vincent, a lead Google engineer working on the Tesseract project, states that despite several visual shortcomings, "Tesseract is far more accurate than any other Open Source OCR package out there."

This comes on the heels of news about Google Books and new concerns in the publishing industry about what happens once Google begins to offer tens of thousands of books and other publications online.

Comments

James Expert

I think this will produce more pirated materials everywhere. Google shouldn't really dabble in the publishing business. It'll create a whole lot more problems than benefits.
