{"id":24471,"date":"2020-10-07T04:54:07","date_gmt":"2020-10-07T03:54:07","guid":{"rendered":"https:\/\/jurnsearch.wordpress.com\/?p=24471"},"modified":"2020-10-07T04:54:07","modified_gmt":"2020-10-07T03:54:07","slug":"some-free-tools-to-extract-data-from-fetched-html","status":"publish","type":"post","link":"https:\/\/jurn.link\/jurnsearch\/index.php\/2020\/10\/07\/some-free-tools-to-extract-data-from-fetched-html\/","title":{"rendered":"Some free tools to extract data from fetched HTML"},"content":{"rendered":"<p>Here are some relatively simple free Windows desktop tools to &#8216;extract an item of data from fetched HTML&#8217;. They were found while considering whether it might be possible to append ISSNs to the JURN directories in a semi-automatic manner.<\/p>\n<p>My target task was: you have a big list of URLs, and the HTML pages for these are to be automatically fetched. Their text is then regex-ed (or the Excel equivalent) to extract a tiny snippet of text data from each page: in this case, any line following the first instance of the word &#8220;ISSN&#8221; on each home-page. Ideally, each extracted text snippet is then automatically appended to its source URL.<\/p>\n<p><strong>1.<\/strong> <a href=\"https:\/\/analystcave.com\/excel-tools\/excel-scrape-html-add\/\">Excel Scrape HTML Add-In<\/a>, free from Analyst Cave. I can&#8217;t do anything with it in Excel 2007, so I assume it needs Excel 2016 or higher (2016 introduced Power Query, built in under the name Get &amp; Transform).<\/p>\n<p><strong>2.<\/strong> <a href=\"https:\/\/www.softpedia.com\/get\/Internet\/Other-Internet-Related\/WebExtractor360.shtml\">Download WebExtractor360 1.0<\/a>. Simple Windows abandonware from 2009, lacking any Help on the key question: how do you format your big list of URLs so they can be automatically processed? It also looks like it cannot be limited to just the first-encountered home-page.
Still, someone might figure out that bit of the WebExtractor360 puzzle, or pick up the <a href=\"https:\/\/sourceforge.net\/projects\/webextract\/\">open source code<\/a> at SourceForge and develop it for easier batch processing and expanded output options.<\/p>\n<p><strong>3.<\/strong> <a href=\"https:\/\/deixto.com\/\">DEiXTo<\/a>. Genuine Windows freeware from Greece, for &#8220;Web data extraction made easy!&#8221; The baffling interface and example-free techie manual strongly suggest otherwise, though, and you&#8217;ll likely need to read the manual very carefully to get it working. There&#8217;s also a <a href=\"https:\/\/intelligence.csd.auth.gr\/wp-content\/uploads\/2019\/03\/BCI2013-kokkoras.pdf\">2013 academic paper on DEiXTo<\/a> from the authors.<\/p>\n<p><strong>4.<\/strong> Update: the open source freeware <a href=\"http:\/\/web-harvest.sourceforge.net\/release.php\">Web-Harvest 2.x<\/a> from 2010, written in Java with a clean Windows GUI and a good manual. Seems like a good alternative to DEiXTo. It still works and has many examples and templates, but no template to run through a list of URLs and grab a fragment of data from each home-page. Despite the name it&#8217;s a data extractor, not a site harvester.<\/p>\n<p><strong>5.<\/strong> Update: I made one for Excel 2007, and it&#8217;s free. <a href=\"https:\/\/jurn.link\/jurnsearch\/2020\/10\/11\/working-excel-spreadsheet-take-a-list-of-home-page-urls-harvest-the-html-extract-a-snippet-of-data-from-each\/\">Take a list of home-page URLs, harvest the HTML, extract a snippet of data from each<\/a>.<\/p>\n<p>For paid Windows desktop software that doesn&#8217;t require a PhD in Spreadsheet Wrangling, and which assumes you&#8217;re not working in Excel, look at Sobolsoft&#8217;s $20 <a href=\"https:\/\/www.sobolsoft.com\/extractdataweb\/\">Extract Data &amp; Text From Multiple Web Sites Software<\/a> and <a href=\"https:\/\/www.botsol.com\/Products\/WebExtractor\">BotSol&#8217;s Web Extractor<\/a>.
The first, from Sobolsoft, requires Internet Explorer, and you must delve into two of Explorer&#8217;s settings to stop IE throwing process-stopping alerts every time it meets a Twitter button etc. Search is not ideal, as it cannot be limited to just the first-encountered &#8216;home&#8217; page. Output is not ideal either, as it cannot offer <em>Source URL = no result<\/em> as a line in the results. The latter software, from BotSol, has the great advantage that it can limit itself to the home-page, and will also try another two nearby pages (&#8220;About&#8221; etc.) if it can&#8217;t find the target data on the home-page. It&#8217;s designed to extract phone numbers, but can be configured to get anything. It&#8217;s free for a version that processes a list of 10 URLs at a time, and $50 for an &#8216;unlimited URLs&#8217; version (that is regrettably time-bombed).<\/p>\n<p>There are browser-based tools like the long-standing OutWitHub, and new free Cloud services such as Octoparse, but they appear focused on ripping competitor ecommerce listings and plugging them into your boss&#8217;s database. Also, Octoparse&#8217;s &#8220;List of URLs&#8221; feature apparently requires all the pages to have exactly the same HTML elements.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Here are some relatively simple free Windows desktop tools to &#8216;extract an item of data from fetched HTML&#8217;. 
They were &hellip;<\/p>\n<p><a href=\"https:\/\/jurn.link\/jurnsearch\/index.php\/2020\/10\/07\/some-free-tools-to-extract-data-from-fetched-html\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[8],"tags":[],"class_list":["post-24471","post","type-post","status-publish","format-standard","hentry","category-jurn-tips-and-tricks"],"_links":{"self":[{"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/posts\/24471","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/comments?post=24471"}],"version-history":[{"count":0,"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/posts\/24471\/revisions"}],"wp:attachment":[{"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/media?parent=24471"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/categories?post=24471"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/jurn.link\/jurnsearch\/index.php\/wp-json\/wp\/v2\/tags?post=24471"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}
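The target task described in the post (fetch each home-page from a big list of URLs, find the first "ISSN" mention, extract the code that follows, and append it to its source URL) can be sketched in a few lines of Python using only the standard library. This is a minimal illustration under my own assumptions, not the method of any tool reviewed above; the function names and the ISSN pattern are hypothetical.

```python
import re
from urllib.request import Request, urlopen

# An ISSN is four digits, a hyphen, three digits, and a final digit or X.
# This hypothetical pattern grabs the first such code after the word "ISSN".
ISSN_RE = re.compile(r"ISSN[\s:]*([0-9]{4}-[0-9]{3}[0-9Xx])")


def extract_issn(html):
    """Return the first ISSN-like code found in the HTML, or None."""
    match = ISSN_RE.search(html)
    return match.group(1) if match else None


def append_issns(urls):
    """Fetch each home-page and pair its source URL with any extracted ISSN.

    A fetch failure, or a page with no ISSN, yields (url, None), so the
    output always has exactly one row per source URL.
    """
    results = []
    for url in urls:
        try:
            req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
            html = urlopen(req, timeout=10).read().decode("utf-8", errors="replace")
            results.append((url, extract_issn(html)))
        except OSError:  # URLError and socket timeouts are OSError subclasses
            results.append((url, None))
    return results
```

Running `append_issns` over the list gives (source URL, snippet) pairs, including (url, None) rows, i.e. the "Source URL = no result" lines that the Sobolsoft tool cannot produce.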