Ever wanted to take the hassle out of re-typing a short quote found on Google Books? Free OCR is a simple online OCR application that might help.
To test it, I gave it a very unpromising bit of text captured from Google Books using a standard screen-capture utility — slightly skewed, slightly fuzzy, in a non-standard typeface I’m willing to bet no-one has on their system, captured as a JPG at a mere 72 dpi, and just 500 pixels wide…
A few seconds after uploading, it gave me this…
ADVERTISEMENT.
Tms publication of the Works of Jomv KNOx, it is
supposed, will extend to F’ive Volumes. It was thought
advisable to commence the series with his History of
the Reformation in Scotland, as the work of greatest
importance. The next voliune will thus contain the
Third and Fourth Books, which continue the History to
the year 1564; at which period his historical labeurs
maybeconsideredtoterminate. ButtheFi&hBook,
forming a sequel to the History, and published under
his name in 1644, will also be included. His Letters
and Miscellaneous Writings will be arranged in the
subsequent volumes, as nearly as possible in chronolo-
gical order; each portion being introduced by a separate
notice, respecting the manuscript or printed copies from
which they have been taken.
It may perhaps be expected that a Life of the Author
should have been prefixed to this volume. The Life of
Knox., by Ds. M‘Cms, is however a work so universally
, known, and of so much historical value, as to supersede
l any attempt that might be made for a detailed bio-
Not perfect, but not bad for such a poor-quality capture. Stand-alone OCR software usually demands a much higher-quality source.
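Output like this still carries the hard line breaks of the printed page, including words split by a hyphen at the end of a line ("chronolo-" / "gical"). Here is a minimal Python sketch for tidying that up before pasting a quote. The function name `unwrap_ocr_text` is my own invention, not part of Free OCR, and it assumes that any hyphen at the end of a line marks a split word, which will occasionally be wrong for genuinely hyphenated compounds.

```python
def unwrap_ocr_text(raw: str) -> str:
    """Join hard-wrapped OCR lines into a single paragraph.

    Assumption: a hyphen at the very end of a line means the word was
    split across two lines, so the hyphen is dropped and the halves joined.
    """
    # Drop blank lines and surrounding whitespace from each line.
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    text = ""
    for line in lines:
        if text.endswith("-"):
            # Rejoin a word split across the line break.
            text = text[:-1] + line
        elif text:
            # Ordinary line break: replace it with a single space.
            text += " " + line
        else:
            text = line
    return text


sample = "subsequent volumes, as nearly as possible in chronolo-\ngical order; each portion"
print(unwrap_ocr_text(sample))
```

This only repairs the line wrapping. Misread characters still need correcting by eye against the original page image.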
The popular screenshot software HyperSnap v6 promises to do the same with its TextSnap feature, but for some unknown reason this feature just doesn’t work with Google Books or the captured image above. I suspect it can only handle text that uses system fonts.
So until we get a neat free OCR Firefox add-on (a direction I would urge the makers of Free OCR to go in), there is a viable and speedy workflow for OCR-ing fair-use quotes: take a screenshot, save the image, then upload the image to Free OCR. It works for Google Book Search and for other places that only offer plain page-scans.
Oh, and don’t bother doing this for books that are already in the public domain: since last month Google provides the full text of these for download, and also serves it up via Google Book Search Mobile.
** Update: If you have Microsoft Office 2007 or higher, then I find that the included Microsoft OneNote works just as well for OCR on low-res images such as the one above. It also works well on most PDFs that don’t allow copy/paste. See the comments to this post for details.
Borrowind said:
Update. I find that Microsoft Office 2007’s OneNote application will also OCR from low-res images.
Open OneNote, then: “Insert” / “Screen Clipping” / capture your region / then right click on the auto-inserted image, and click on “Copy Text from Picture”. The text goes to your clipboard.
The OCR text I had from OneNote, for the same sample image above, was…
OneNote has the added advantage that it lets you easily position your screenshot and text side by side, for correcting the text.
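Once the text has been corrected by hand, a few lines of Python can list exactly what the OCR got wrong, which is handy when comparing services. This is only a rough sketch using the standard `difflib` module. The function name `ocr_differences` is hypothetical, and the "corrected" line in the usage example is my own hand-corrected reading of the first line of the sample, not output from any tool.

```python
import difflib


def ocr_differences(ocr_text: str, corrected_text: str) -> list[tuple[str, str]]:
    """Return (ocr_reading, corrected_reading) pairs for each word run that differs."""
    ocr_words = ocr_text.split()
    fixed_words = corrected_text.split()
    # Compare word by word rather than character by character.
    matcher = difflib.SequenceMatcher(a=ocr_words, b=fixed_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((" ".join(ocr_words[i1:i2]), " ".join(fixed_words[j1:j2])))
    return changes


# Hand-corrected text below is an assumption about what the page says.
for wrong, right in ocr_differences(
    "Tms publication of the Works of Jomv KNOx",
    "THE publication of the Works of JOHN KNOX",
):
    print(f"{wrong!r} -> {right!r}")
```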
Borrowind said:
Just out of interest, I also ran the Adobe Acrobat 9 Pro OCR on the same image, and this was the mess I got…
ScepticsBane said:
Thanks for the update on the OCR. As regards Google Book Search, I can easily download a book as a PDF, but when I click on their TEXT link I just get one page rather than the whole download. Is there a way to get the whole book as text downloaded to one’s computer?
Borrowind said:
Hi ScepticsBane. It depends on whether you’re talking about commercial or public-domain books. I assume you’re talking about the latter.
For the Google-vended public-domain books, I guess the route to go down would be to just download the PDF, open it, “select all” text, then copy and paste the text into Notepad (or Word, suitably configured to import text without mangling it too much) or a plain HTML editor such as HomeSite 5.
There are also probably various free PDF-2-Text converters. But I’ve not used them so can’t advise on how well they work.
Borrowind said:
Another idea for ScepticsBane. As the books offering PDF files will be public-domain, there’s a good chance that a Google search will turn up a plain text copy elsewhere on the web.
ScepticsBane said:
Sorry, that won’t work – the PDF pages of the books are images, not text. Only OCR would convert them to text.
Borrowind said:
ScepticsBane, once you have a PDF, you should be able to take a screen capture of a page in the same way I’ve described above, then run the image through either the Free OCR service or Microsoft OneNote using the method described above.