News from JURN

~ search tool for open access content

Monthly Archives: May 2019

CC Search

05 Sunday May 2019

Posted by futurilla in Spotted in the news

This is the apparently-new front-end for CC Search: an image search for speedily finding re-usable Creative Commons images. There are said to be 289 million pictures here, mostly via Common Crawl, from sources including DeviantArt, though Flickr is apparently not yet completely incorporated. Given the size it’s delightfully quick, and as you can see here it’s possible to ‘stack and chain’ the filters by selecting them repeatedly…

The drawback, compared to Google Images, is that in this incarnation of CC Search (the old one is still available) there’s no size filter and the relevancy ranking of results appears to be ‘easily distracted’. My very first search for Mongolian folk song got me a whole lot of Latvian folk dance, scrolling down into a vast amount of Indian folk music. Very nice, but it took a lot of scrolling to eventually get down to some Mongolian content…

CC Search is under continual improvement through 2019 and more features are planned. Looking down the list in their forward plan I see that searching for CC texts is said to be coming to the new interface later in the summer (“incorporate open texts from major providers”), along with another design makeover (a new “distinct visual look and feel for landing page”).

There’s also talk of future delivery of a front-end for displaying “3D designs”, which suggests that 2020 or 2021 could see a very useful feature, a unified search for all CC 3D model files. I’d suggest that what’s vital in such a tool is a ‘can be re-textured’ search filter, as 3D models are not much use for quick re-use if (as is often the case with CC freebies) the material zones are either missing entirely or screwed up, which means they can’t be re-textured without specialist software and arcane skills. Perhaps a public user-feedback button could be used to indicate “I had success with this great model” / “Don’t waste your time on this”.

PanWriter – a free open source Markdown editor and HTML-Markdown converter

04 Saturday May 2019

Posted by futurilla in JURN tips and tricks

Pandoc is a useful universal document converter utility. Yet using the free Pandoc on Windows is often assumed to be about typing arcane command lines into a command prompt, and hoping for the best.

But now there’s a nice free and open source front-end for Windows, for those who just want the markdown. The 2019 PanWriter is a delightfully elegant and simple markdown writing pad for the rest of us.[1] It’s a speedy Windows desktop install, and it didn’t even need to be told where Pandoc was on my PC. It just ‘knew’.

Even if you don’t need another sweet text editor, PanWriter also serves as a document importer and converter utility. Just load up a HTML page you saved from the Web, and it instantly converts all the HTML code to markdown. I loaded a few really complex pages and it didn’t blink, instantly presenting a clean markdown conversion.

Why would one want to convert HTML to markdown, you ask? Because it places the HTML and other elements onto single lines, while retaining Web links in place. Such lines are far easier to extract data blocks from than fiendishly nested HTML code that sprawls across multiple lines.[3]
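For those who would rather script the same conversion, here is a minimal sketch that calls Pandoc directly from Python. It is not how PanWriter works internally (that is not documented here), and it assumes Pandoc is installed and on the PATH. The function names are hypothetical; the `--from`, `--to`, `--wrap` and `--output` options are standard Pandoc options.

```python
import shutil
import subprocess
from pathlib import Path


def build_pandoc_command(src: Path, dst: Path) -> list[str]:
    """Return the Pandoc command line for an HTML-to-Markdown conversion."""
    return [
        "pandoc",
        "--from=html",
        "--to=markdown",
        "--wrap=none",  # do not hard-wrap, so each paragraph stays on one line
        f"--output={dst}",
        str(src),
    ]


def html_to_markdown(src: Path) -> Path:
    """Convert one saved HTML page to Markdown, next to the original file."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("Pandoc was not found on the PATH")
    dst = src.with_suffix(".md")
    subprocess.run(build_pandoc_command(src, dst), check=True)
    return dst
```

Calling `html_to_markdown(Path("results.html"))` on a saved page (the file name is only an example) should leave a `results.md` beside it, ready to open in Notepad++.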


Picture: the above web page (just some random Met Museum image search results) in Notepad++ as markdown.

Once the markdown text is in Notepad++, a macro can have its way with it. This can ‘Find and Mark’ the repeating line-blocks[2] containing the data you want to extract and clean (e.g. search-engine results). These can then be copied out to a new tab (Top Menu: Search | Bookmarks | Copy Bookmarked lines). Then the macro can run various operations that fix up the text a bit more,[4] before finally saving the cleaned list back to HTML via the user-friendly MarkdownViewerPlusPlus plugin for Notepad++.[5]
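The ‘Find and Mark’ then ‘Copy Bookmarked lines’ idea can also be roughed out in a few lines of Python, for anyone who wants to test a marking pattern outside Notepad++. This is only a sketch of the idea, not a Notepad++ macro. The function name, the sample text and the link pattern are all assumptions made for illustration.

```python
import re


def copy_marked_lines(text: str, pattern: str) -> list[str]:
    """Return every line containing a match, like 'Copy Bookmarked lines'."""
    marker = re.compile(pattern)
    return [line for line in text.splitlines() if marker.search(line)]


sample = """\
Some page header
[First result](https://example.org/1)
navigation clutter
[Second result](https://example.org/2)
"""

# Mark every line holding a Markdown link, then copy those lines out.
results = copy_marked_lines(sample, r"\[[^\]]+\]\([^)]+\)")
print("\n".join(results))
```

Only the two link lines survive. The header and the navigation clutter are left behind, much as they would be after copying bookmarked lines to a new tab.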

So the basic workflow here is:

1. Get your search results and save your Web page(s) as usual. There’s no need to painstakingly select-copy-paste just part of the page.

2. Use a joiner utility (TXTcollector) to join all the saved pages.[6] Then open the saved HTML with PanWriter and it instantly auto-converts into markdown.

3. Open the saved markdown with Notepad++ and run your cleaning and text-sorting custom macro on it.

4. Copy-paste the resulting linked data list as HTML to your blog etc. (Or to .CSV for Excel import and sorting).
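The .CSV option in the last step could be sketched like this, assuming the cleaned list consists of ordinary Markdown links of the form `[title](url)`. The function name, the link pattern and the sample data are my assumptions, not part of any of the tools named above.

```python
import csv
import io
import re

# Assumed shape of a cleaned result: [title](http... url)
LINK = re.compile(r"\[([^\]]+)\]\((https?://[^)\s]+)\)")


def links_to_csv(markdown: str) -> str:
    """Return 'title,url' CSV text for every Markdown link found."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["title", "url"])
    for title, url in LINK.findall(markdown):
        writer.writerow([title.strip(), url])
    return buffer.getvalue()


page = "[Vase, blue](https://example.org/a) and [Bowl](https://example.org/b)"
print(links_to_csv(page))
```

The `csv` module quotes any title that contains a comma, so the columns should stay aligned when the file is imported into Excel.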

The above is a more advanced and robust version of my recent home-brew workflow, which suggested a browser addon and manual copy-paste. That was more suitable for occasional use by bloggers and academics who can’t afford sophisticated data scrapers (and the proxies to run them).

This workflow has the advantage that: i) it’s all free software; ii) it doesn’t need you to pay for and burn through proxies in your paid scraping software; iii) as long as you have the HTML in your browser it can be grabbed, and it basically doesn’t matter how complex and nested the page code is, as it’s all going into Markdown; and iv) collection can be automated with Windows automation software (JitBit) etc, and processing can be automated with Notepad++ macros. But it is obviously not suited to automated scraping of millions of records from multiple shopping sites — if you’re into that game then you should have the cash to buy in those datasets.



Notes:

1. There are also two old GUIs for Pandoc, over on GitHub. One there is nice and simple, and has batch… but it crashes and fails on 64-bit Windows, as the developer admits in his readme. The other GUI at GitHub was tried and runs on 64-bit Windows, but seems far less user-friendly. There are also a half dozen Python script projects that do this.

2. Marking and exporting lines in Notepad++ can’t currently be done for multi-line nested HTML code, which is why an HTML-to-Markdown conversion is so useful. Multi-line block marking can be done between two keywords (Find | Mark | Regex with Newline), by pasting in…

(?<=STARTWORD)([\s\S]*?)(?=ENDWORD)

…but this only places a single mark at the top of each marked and highlighted block. It does not run a line of marks down the entire block.

3. Yes, I know about XPath, but with a complex Web page it's: i) fiendishly tricky to do the initial puzzling out of what needs to be captured; ii) often fails to then grab what’s needed; iii) and has even more difficulty in aligning data fields when used as a browser addon.

4. Note that, in macros, multiline search-replace needs to be done with \n commands rather than with plugins. Also note that Ctrl + Home will get your cursor back to the top of the text.

5. Sadly one can’t yet use Notepad++ as the initial importer/converter, as it has no such plugin at present. I’ve looked. There are a couple of possible Python scripts but support for the Python plugin in the latest Notepad++ is a bit of a mess at present, with plugin structures being swopped around and then reverted. So that’s not really an option, unless you want to fall right back to version 5.9 or thereabouts to use a script.

6. Update: If you have no absolute need to keep the saved HTML pages as backup, then Clipboard Magic is a lovely little Windows freeware utility that keeps copies of each clipboard. When done you “Copy all clips to clipboard” and paste to Notepad++. Or if you still use an older 32-bit Notepad++ you can use the fine MultiClipboard plugin.
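On note 2: the same STARTWORD / ENDWORD regex works unchanged in Python, and there one can work out which lines each matched block covers. That is the ‘line of marks down the entire block’ that Notepad++ does not give. This is a rough sketch only, with a made-up function name and sample text.

```python
import re

# The regex from note 2, unchanged.
BLOCK = re.compile(r"(?<=STARTWORD)([\s\S]*?)(?=ENDWORD)")


def marked_line_numbers(text: str) -> list[int]:
    """Return the 1-based number of every line touched by a matched block."""
    marked = set()
    for match in BLOCK.finditer(text):
        # Count the newlines before each end of the match to get line numbers.
        first = text.count("\n", 0, match.start()) + 1
        last = text.count("\n", 0, match.end()) + 1
        marked.update(range(first, last + 1))
    return sorted(marked)


sample = "intro\nSTARTWORD\nalpha\nbeta\nENDWORD\noutro\nSTARTWORD gamma ENDWORD\n"
print(marked_line_numbers(sample))  # prints [2, 3, 4, 5, 7]
```

The lines holding the two keywords are included in the marks, since each match starts and ends on those lines.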

Freeware for cleaning and manipulation of text lists

02 Thursday May 2019

Posted by futurilla in JURN tips and tricks

These are all Windows PC freeware, with graphical user interfaces, tested and working on Windows 8.1.x. They may be useful for those who occasionally have to sort and clean and combine lists in text form, and who do not have access to paid tools such as the sophisticated TextPipe Pro or the Sobolsoft utilities, or to advanced training in Excel and regex commands.



The relatively simple:

1. Text Cleanup 2.0.

It “fixes” text automatically when you copy-paste it, according to various cleaning options you can save presets for. Its main use is to unwrap a chunk of text that has hard line-breaks, when copied to the clipboard. Or to place a new blank line between each line. This vital software only recently went freeware.

Can be used in combination with the free Clipboard Magic which keeps copies of all Clipboard text, and then allows you to “Copy all clips to clipboard”.


2. List Numberer.

This can do what Notepad++ can’t yet do, and does easily what Excel can seemingly only do with complex and fiddly formulas and macros. Most useful for dealing with repeated blocks of data in a list (e.g. labelling them 1234, 1234, 1234), to enable mass deletion of certain lines in a text editor.


3. Text Magician 1.3.

Various operations including append text to the start and end of each line, delete multi-line blocks between X and Y, and more. (If you have ‘.DLL missing’ problems, either go find the required .DLL file, or use Version 1.0 which does not have that problem).


4. Duplicated Finder from AKS-Labs.

Easily find and extract the duplicates from a single list. Useful for checking for the presence of a few duplicate URLs in a long list of uniques, for instance. (See also the free Duplicate Master addin for Excel).


5. Excel example sheet: compare two lists and extract non-duplicates.

My free ready-made .XLS sheet for Excel, with formula. The second list is a jumbled up variant of the first, with some new additions in it. These additions are extracted and placed alongside. (Excel is not free, admittedly, but my guess is you could probably get the same formula working in whatever LibreOffice has as its Excel equivalent).
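For readers comfortable with a little scripting, three of the list operations above (repeating block numbers, finding duplicates, and extracting the new additions from a second list) can be sketched in Python. This is not how any of the named tools or my Excel sheet work internally. The function names and sample data are hypothetical.

```python
from collections import Counter
from itertools import cycle


def number_in_cycles(lines: list[str], block_size: int) -> list[str]:
    """Prefix lines 1..block_size repeatedly, e.g. 1, 2, 3, 4, 1, 2, 3, 4."""
    labels = cycle(range(1, block_size + 1))
    return [f"{label} {line}" for label, line in zip(labels, lines)]


def find_duplicates(items: list[str]) -> list[str]:
    """Return each item that occurs more than once, in first-seen order."""
    counts = Counter(items)
    return [item for item in counts if counts[item] > 1]


def additions(old: list[str], new: list[str]) -> list[str]:
    """Return items in `new` that are absent from `old`, keeping their order."""
    seen = set(old)
    return [item for item in new if item not in seen]


print(number_in_cycles(["Title A", "url-a", "Title B", "url-b"], 2))
print(find_duplicates(["a.org", "b.org", "a.org", "c.org"]))
print(additions(["a", "b", "c"], ["c", "x", "a", "y", "b"]))
```

Once lines carry a repeating label, every line labelled ‘2’ (say) can be deleted in one pass in a text editor, which is the use suggested for List Numberer.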



The potentially quite complex:

1. Notepad++.

The code programmer’s text editor. Column numbering (though it can’t do what List Numberer, above, can do); sophisticated Regex (though the more sophisticated it is, the more difficult it is to remember and to get working); Remove blank lines (provided you can remember the menu sequence within its complex UI); and much more. Intensive research is often needed to learn how to do a particular bit of sophisticated text manipulation, and it’s also easy to overlook its most powerful features such as per-line list bookmarking. The devs have recently fumbled a move to a different plugin structure, and thus you may need to run the latest 64-bit version alongside an older 32-bit version in order to run PythonScript and older plugins such as Multiline Search/Replace (appears under ‘Tool Bucket’), Column Sorting and Line Filter 2.


2. WildGem 1.3.

A tool for building and testing ‘regular expression’ or ‘regex’ commands. Find and replace with these commands, and see the resulting changes (if any) in realtime. This software can hide some of the more common ‘regex’ snippets behind more user-friendly visual icons. Useful for instantly testing ‘regex’ command formulas you find, to see if they work, without having to wrestle with Notepad++. This is portable Windows software. In order to save your UI layout preferences, it must be run in Administrator mode.


3. CSVed.

A CSV file editor, an alternative you may prefer to behemoth software such as Excel. Move lists and sections around, split lists, add to lines. Appears to lack ability to do column numbering for lists (for which see List Numberer, above).


4. Openview’s Index Generator 7.0.

Dedicated to creating a back-of-the-book index for a book. This one is more about the creation of the list, admittedly, but it has various filtering options while doing this. The curious lack of a ‘capitalised words only’ filter makes it far less useful than its paid competitor. Asks for a donation on exit. (Note that Softpedia’s review states you “upload a document to the program”, and this wording may mislead the casual reader into thinking this is partly online cloud-linked software. It isn’t, it’s standalone Windows software).


5. There’s a free Selected HTML page-content to Markdown addon for Chrome-based browsers, and also a Markdown to BBCode converter.

May be useful if either Markdown or BBCode is easier to work with, re: sorting and cleaning list-shaped content grabbed from a Web page. The latter is a self-contained JavaScript-based Web page and can work offline: just save the page locally and re-open it.
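The ‘test a regex and see what changes’ idea behind WildGem can be roughed out in Python too. The sketch below previews a find-and-replace, showing only the lines that would change. As its test rule it turns a Markdown link into a BBCode `[url=…]` link. The function name and both patterns are assumptions for illustration, and are not taken from WildGem or from the converter mentioned above.

```python
import re


def preview_replace(pattern: str, replacement: str,
                    lines: list[str]) -> list[tuple[str, str]]:
    """Return (before, after) pairs for the lines the replacement changes."""
    regex = re.compile(pattern)
    changes = []
    for line in lines:
        result = regex.sub(replacement, line)
        if result != line:
            changes.append((line, result))
    return changes


# Assumed example rule: Markdown link -> BBCode [url=...]...[/url]
MD_LINK = r"\[([^\]]+)\]\(([^)\s]+)\)"
BBCODE = r"[url=\2]\1[/url]"

lines = ["[JURN](https://example.org/jurn)", "plain text, no link"]
for before, after in preview_replace(MD_LINK, BBCODE, lines):
    print(before, "->", after)
```

Only the first sample line is reported, since the second has nothing for the pattern to match.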


Review of Cabell’s Predatory Journal Blacklist

02 Thursday May 2019

Posted by futurilla in Academic search, Spotted in the news

A new review of a paywalled up-to-date blacklist of predatory journals, “Cabell’s Predatory Journal Blacklist: An Updated Review”, at the Scholarly Kitchen.

JURN pagination links fixed

01 Wednesday May 2019

Posted by futurilla in JURN's Google watch, My general observations

In the last week or so Google has made some slight changes to the default styling templates for CSEs, resulting in the numbered pagination links at the foot of the search results becoming very small and grey. This has now been fixed on JURN, and your per-page links to more search results should now look like this. They should be far more easily selectable now, and especially for touch-screen users…

My thanks to Amit Agarwal of India, for the elegant snippet of commented CSS for the .gsc-cursor-page element. If you have the same problem with your own CSE, this snippet goes in the style header of your page. Colours are controlled elsewhere, in the ‘Look & Feel’ | Customise | Refinement section of your CSE admin dashboard.

Changes may not show up until you and your users refresh your main page a few times, due to Web browser caching.

GRAFT has also had the same fix applied.


Update:

Also add padding for the pagination row, by adding the following to your CSS style (I have mine in the page itself)…
