Free OCR for Google Book Search pages

05 Sunday Jul 2009

Posted by futurilla in JURN tips and tricks, JURN's Google watch

Ever wanted to take the hassle out of re-typing a short quote found on Google Books? Free OCR is a simple online OCR application that might help.

To test it, I gave it a very unpromising bit of text captured from Google Books using a standard screen-capture utility — slightly skewed, slightly fuzzy, in a non-standard typeface I’m willing to bet no-one has on their system, captured as a JPG at a mere 72 dpi, and just 500 pixels wide…

[Image: ocr-test]

A few seconds after uploading, it gave me this…

ADVERTISEMENT.
Tms publication of the Works of Jomv KNOx, it is
supposed, will extend to F’ive Volumes. It was thought
advisable to commence the series with his History of
the Reformation in Scotland, as the work of greatest
importance. The next voliune will thus contain the
Third and Fourth Books, which continue the History to
the year 1564; at which period his historical labeurs
maybeconsideredtoterminate. ButtheFi&hBook,
forming a sequel to the History, and published under
his name in 1644, will also be included. His Letters
and Miscellaneous Writings will be arranged in the
subsequent volumes, as nearly as possible in chronolo-
gical order; each portion being introduced by a separate
notice, respecting the manuscript or printed copies from
which they have been taken.
It may perhaps be expected that a Life of the Author
should have been prefixed to this volume. The Life of
Knox., by Ds. M‘Cms, is however a work so universally
, known, and of so much historical value, as to supersede
l any attempt that might be made for a detailed bio-

Not perfect, but not bad for such a poor-quality capture. Stand-alone OCR software usually demands a much better quality source.
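
One tidy-up step is worth noting. The OCR text comes back hard-wrapped exactly as the page was set, with words such as "chronolo-" and "gical" split across two lines. Below is a minimal Python sketch of joining such output back into a single quotable paragraph. The function name tidy_ocr_quote is purely illustrative and is not part of Free OCR. It also assumes every line-end hyphen is a word break, so it would wrongly close up a genuine hyphenated compound that happens to fall at the end of a line.

```python
import re


def tidy_ocr_quote(raw: str) -> str:
    """Join hard-wrapped OCR lines into one paragraph (illustrative helper)."""
    # Rejoin words split across a line break: "chronolo-\ngical" -> "chronological".
    joined = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Turn the remaining line breaks into spaces and collapse repeated whitespace.
    return re.sub(r"\s+", " ", joined).strip()


# Four lines taken from the Free OCR result above.
sample = (
    "subsequent volumes, as nearly as possible in chronolo-\n"
    "gical order; each portion being introduced by a separate\n"
    "notice, respecting the manuscript or printed copies from\n"
    "which they have been taken."
)

print(tidy_ocr_quote(sample))
```

The misread characters still need correcting by eye against the page image; this only removes the line breaks.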

The popular screenshot software HyperSnap v6 promises to do the same with its TextSnap feature, but for some unknown reason this feature just doesn’t work with Google Books or the captured image above. I suspect it can only handle text that uses system fonts.

So until we get a neat free OCR Firefox add-on (a direction I would urge the makers of Free OCR to take), "screenshot, save image, upload image to Free OCR" is a viable and speedy workflow for OCR-ing fair-use quotes found on Google Book Search or other sites that only offer plain page scans.

Oh, and don’t bother doing this for books that are already in the public domain — since last month Google provides the full-text of these for download, and also serves it up via Google Book Search Mobile.

** Update: If you have Microsoft Office 2007 or higher, then I find that the included Microsoft OneNote works just as well for OCR on low-res images such as the one above. It also works well on most PDFs that don’t allow copy/paste. See the comments to this post for details.

11 thoughts on “Free OCR for Google Book Search pages”

Borrowind said:

5 July 2009 at 10:15 pm

Update. I find that Microsoft Office 2007’s OneNote application will also OCR from low-res images.

Open OneNote, then: “Insert” / “Screen Clipping” / capture your region / then right click on the auto-inserted image, and click on “Copy Text from Picture”. The text goes to your clipboard.

The OCR text I had from OneNote, for the same sample image above, was…

ADVERTISEMENT.
Trns publication of the Works of JOHN KNOX, it is
iipposed, will extend to Five Volumes. It was thought
advisable to commence the series with his History of
the Reformation in Scotland, as the work of greatest
importance. The next volume will thus contain the
Third and Fourth Books, which continue the History to
the year 1564; at which period his historical labours
may be considered to terminate. But the Fifth Book,
forming a sequel to the History, and published under
his name iii 1644, will also be included. His Letters
and Miscellaneous Writings will be arranged in the
ubequent volumes, as nearly as possible in chronolog
ical order; each portion being introduced by a separate
notice, respecting the manuscript or printed copies from
which they have been taken.
It may perhaps be expected that a Life of the Author
should have been prefixed to this volume. The Life of
Knox, by Da. M’CRLE, is however a work so universally
known, aII(l of so much historical value, as to supersede
any attempt that might ho made for a (let.

OneNote has the added advantage that it lets you easily position your screenshot and text side by side, which makes correcting the text easier.

Borrowind said:

7 July 2009 at 11:21 am

Just out of interest, I also ran the Adobe Acrobat 9 Pro OCR on the same image, and this was the mess I got…

ADVERTISEMENT.
THIS publication of the Workfl of JOHN Ksox, it is
IIUpposed. will extend to Fil’o Volumes. It wu tJlOught
advisable to commence the I:I(lrics with hia Hisoory of
ilie Reformation in Scotland, 113 tho work of greatest.
importance. The next volume wiU dllll! contain the
Third and Fourth Books, which continuo the History to
the year 1554; o.t which period his historical labours
may be collilidered to tcnninatc. But tho Fifth Book,
funning a sequel to the Hi!tory, and publi!lhed under
his lIame ill 16-‘4, “‘ill ru90 be included. HiB Letters
and MiscellancoWi Writings will be arranged iii tho
IUbsequcnt \’o!uJl108, lUI IIcarly lUI possiblo in chronological
order : each portion being introduced by ” separate
notice, respecting tlnl maliWICript or printed collies frOIll
which they have boon taJ.:cn.
It may pcrhape be expected tllRt a Life of tho Author
~\l ld Ii:wo been prefixed to this ,·olumc. The Lifo of
Knox, by Da. M’Onll; is however a work 60 univcrstl.lly
known, and of 60 much historical value, I\IJ to supersede
any attempt that might be ronde for It detailed bi v
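
For anyone who wants a rough number to put on these comparisons, one line from each result can be scored against a hand-corrected reading using difflib from Python's standard library. This is only a sketch. The reference line is my own reading of the page, the variable names are illustrative, and a single line is far too small a sample to rank the three tools fairly.

```python
from difflib import SequenceMatcher

# Hand-corrected reading of one line of the page (an assumption, typed by eye).
reference = "importance. The next volume will thus contain the"

# The same line as returned by each tool in the tests above.
results = {
    "Free OCR": "importance. The next voliune will thus contain the",
    "OneNote": "importance. The next volume will thus contain the",
    "Acrobat 9 Pro": "importance. The next volume wiU dllll! contain the",
}

# ratio() returns a similarity between 0 and 1, where 1.0 means identical strings.
scores = {
    name: SequenceMatcher(None, reference, text).ratio()
    for name, text in results.items()
}

# Print the tools from most to least similar to the reference line.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.2f}")
```

Scoring the whole passage the same way would give a fairer picture than this one line.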

ScepticsBane said:

7 July 2009 at 11:44 am

Thanks for the update on the OCR. As regards Google book search, I can easily download a book as a PDF but when clicking on their TEXT link, I just get one page rather than the whole download. Is there a way to get the whole book as text downloaded to one’s computer?

Borrowind said:

7 July 2009 at 11:55 am

Hi ScepticsBane. It depends if you’re talking about commercial or public-domain books. I assume you’re talking about the latter.

For the Google-vended public-domain books, I guess the route to go down would be to just download the PDF, open it, “select all” text, then copy and paste the text into Notepad (or Word, suitably configured to import text without mangling it too much) or a plain HTML editor such as HomeSite 5.

There are also probably various free PDF-2-Text converters. But I’ve not used them so can’t advise on how well they work.

Borrowind said:

7 July 2009 at 12:03 pm

Another idea for ScepticsBane. As the books offering PDF files will be public-domain, there’s a good chance that a Google search will turn up a plain text copy elsewhere on the web.

ScepticsBane said:

18 July 2009 at 9:43 pm

Sorry, that won’t work – the PDF pages of the books are images, not text. Only OCR would convert them to text.

Borrowind said:

25 October 2009 at 8:20 am

ScepticsBane, once you have a PDF, you should be able to take a screen capture of a page in the same way I’ve described above, then run the image through either the Free OCR service or Microsoft OneNote using the method described above.

