{"id":20902,"date":"2018-05-10T14:46:46","date_gmt":"2018-05-10T13:46:46","guid":{"rendered":"https:\/\/jurnsearch.wordpress.com\/?p=20902"},"modified":"2018-05-10T14:46:46","modified_gmt":"2018-05-10T13:46:46","slug":"free-ocr-for-german-blackletter-text","status":"publish","type":"post","link":"https:\/\/jurn.link\/jurnsearch\/index.php\/2018\/05\/10\/free-ocr-for-german-blackletter-text\/","title":{"rendered":"Free OCR for German blackletter text"},"content":{"rendered":"<p>The free open-source <a href=\"https:\/\/github.com\/UB-Mannheim\/tesseract\/wiki\">Tesseract OCR 4.0 for Windows<\/a> (beta, 64-bit), released 14th April 2018.<\/p>\n<blockquote><p>&#8220;The Mannheim University Library uses Tesseract to perform OCR of historical German newspapers. Normally we run Tesseract on Debian GNU Linux, but there was also the need for a Windows version. That&#8217;s why we have built a Tesseract installer for Windows.&#8221;<\/p><\/blockquote>\n<p>The Tesseract engine was originally developed at Hewlett-Packard, which released it as open source in 2005; Google has sponsored its development since 2006.<\/p>\n<p>Tesseract 4.0 supports OCR in a range of old and ancient letterforms including German blackletter (aka Fraktur, in popular parlance &#8216;Gothic&#8217;), but these need to be selectively enabled at install&#8230;<\/p>\n<p><a href=\"https:\/\/jurn.link\/jurnsearch\/2018\/05\/2018-05-10_134121.jpg\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/jurn.link\/jurnsearch\/2018\/05\/2018-05-10_134121.jpg\" alt=\"\" width=\"511\" height=\"398\" class=\"alignnone size-full wp-image-20903\" \/><\/a><\/p>\n<p>Once it is installed, there are a few Windows GUI front-ends to choose from for operating Tesseract. <a href=\"https:\/\/sourceforge.net\/projects\/gimagereader\/\">gImageReader<\/a> has a current 64-bit Windows build. On its issue tracker I found a <a href=\"https:\/\/github.com\/manisandro\/gImageReader\/issues\/328\">gImageReader beta version that is newly-compiled for Tesseract 4.0 beta<\/a>.  
That needs to be run as Administrator in Windows, and it then also seems to require a Fraktur download in order to handle OCR of German blackletter letterforms&#8230;<\/p>\n<p><a href=\"https:\/\/jurn.link\/jurnsearch\/2018\/05\/2018-05-10_141612.jpg\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/jurn.link\/jurnsearch\/2018\/05\/2018-05-10_141612.jpg\" alt=\"\" width=\"529\" height=\"299\" class=\"alignnone size-large wp-image-20904\" \/><\/a><\/p>\n<p>I&#8217;m assuming that gImageReader &#8216;knows&#8217; where Tesseract 4.0 is and hooks into it automatically, because I didn&#8217;t need to set any file-paths to it in gImageReader.<\/p>\n<p>Once gImageReader is set up and the Fraktur toggle\/icon is switched on, the OCR results were pretty good, even when working from a screenshot&#8230;<\/p>\n<p><a href=\"https:\/\/jurn.link\/jurnsearch\/2018\/05\/2018-05-10_143412.jpg\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/jurn.link\/jurnsearch\/2018\/05\/2018-05-10_143412.jpg\" alt=\"\" width=\"529\" height=\"285\" class=\"alignnone size-large wp-image-20905\" \/><\/a><\/p>\n<p>It can also handle complete PDFs, and seems to go at about 15 pages per minute on a modern desktop PC. Nice to have, and (in combination with Google Translate) useful if your research takes you back to the pre-1938 German literature but you can&#8217;t read German, and certainly not in blackletter.<\/p>\n<p>There are probably online sign-up services that can do the same these days, where you do a sluggish upload and have to deal with time-outs, usage-quotas and so on. But I prefer the ease of having my own Windows desktop software.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The free open-source Tesseract OCR 4.0 for Windows (beta, 64-bit), released 14th April 2018. 
&#8220;The Mannheim University Library uses Tesseract &hellip;<\/p>\n<p><a href=\"https:\/\/jurn.link\/jurnsearch\/index.php\/2018\/05\/10\/free-ocr-for-german-blackletter-text\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[8,16],"tags":[],"class_list":["post-20902","post","type-post","status-publish","format-standard","hentry","category-jurn-tips-and-tricks","category-spotted-in-the-news"],"_links":{"self":[{"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/posts\/20902","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/comments?post=20902"}],"version-history":[{"count":0,"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/posts\/20902\/revisions"}],"wp:attachment":[{"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/media?parent=20902"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/categories?post=20902"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/tags?post=20902"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}