CC Search

05 Sunday May 2019

Posted by futurilla in Spotted in the news

The apparently new front-end for CC Search’s image search, for speedily finding re-usable Creative Commons images. There are said to be 289 million pictures here, mostly via Common Crawl, from sources including DeviantArt, though it seems Flickr is not yet completely incorporated. Given the size it’s delightfully quick, and as you can see here it’s possible to ‘stack and chain’ the filters by selecting them repeatedly…

The drawback, compared to Google Images, is that in this incarnation of CC Search (the old one is still available) there’s no size filter and the relevancy ranking of results appears to be ‘easily distracted’. My very first search for Mongolian folk song got me a whole lot of Latvian folk dance, scrolling down into a vast amount of Indian folk music. Very nice, but it took a lot of scrolling to eventually get down to some Mongolian content…

CC Search is under continual improvement through 2019 and more features are planned. Looking down the list in their forward plan I see that searching for CC texts is said to be coming to the new interface later in the summer (“incorporate open texts from major providers”), along with another design makeover (a new “distinct visual look and feel for landing page”).

There’s also talk of future delivery of a front-end for displaying “3D designs”, which suggests that 2020 or 2021 could see a very useful feature, a unified search for all CC 3D model files. I’d suggest that what’s vital in such a tool is a ‘can be re-textured’ search filter, as 3D models are not much use for quick re-use if (as is often the case with CC freebies) the material zones are either missing entirely or screwed up, which means they can’t be re-textured without specialist software and arcane skills. Perhaps a public user-feedback button could be used to indicate “I had success with this great model” / “Don’t waste your time on this”.

PanWriter – a free open source Markdown editor and HTML-Markdown converter

04 Saturday May 2019

Posted by futurilla in JURN tips and tricks

Pandoc is a useful universal document converter utility. Yet using the free Pandoc on Windows is often assumed to be about typing arcane command lines into a command prompt, and hoping for the best.

But now there’s a nice free and open source front-end for Windows, for those who just want the markdown. The 2019 PanWriter is a delightfully elegant and simple markdown writing pad for the rest of us.¹ It’s a speedy Windows desktop install, and it didn’t even need to be told where Pandoc was on my PC. It just ‘knew’.

Even if you don’t need another sweet text editor, PanWriter also serves as a document importer and converter utility. Just load up a HTML page you saved from the Web, and it instantly converts all the HTML code to markdown. I loaded a few really complex pages and it didn’t blink, instantly presenting a clean markdown conversion.

Why would one want to convert HTML to markdown, you ask? Because it places the HTML and other elements onto single lines, while retaining Web links in place. Such lines are far easier to extract data blocks from than fiendishly nested HTML code that sprawls across multiple lines.³
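For those who would rather script the conversion step, here is a minimal Python sketch that calls Pandoc directly. It assumes Pandoc is installed and on the PATH. The function names and file names are hypothetical, and this is not necessarily how PanWriter itself invokes Pandoc. The `--wrap=none` option asks Pandoc not to hard-wrap its output, which helps to keep each link on a single line.

```python
import shutil
import subprocess
from pathlib import Path


def build_pandoc_command(src: Path, dst: Path) -> list:
    """Build the Pandoc command line for an HTML-to-Markdown conversion."""
    return [
        "pandoc",
        "--from=html",
        "--to=markdown",
        "--wrap=none",          # do not hard-wrap long lines in the output
        f"--output={dst}",
        str(src),
    ]


def html_to_markdown(src: Path, dst: Path) -> None:
    """Convert one saved Web page to Markdown by running Pandoc."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc was not found on the PATH")
    # check=True raises CalledProcessError if Pandoc reports a failure.
    subprocess.run(build_pandoc_command(src, dst), check=True)
```

Calling `html_to_markdown(Path("saved_results.html"), Path("saved_results.md"))` would then write the Markdown version alongside the saved page.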

Picture: the above web page (just some random Met Museum image search results) in Notepad++ as markdown.

Once the markdown text is in Notepad++, then a macro can have its way with it. This can ‘Find and Mark’ the repeating line-blocks² containing the data you want to extract and clean (e.g. search-engine results). These can then be copied out to a new tab (Top Menu: Search | Bookmarks | Copy Bookmarked lines). Then the macro can run various operations that fix up the text a bit more,⁴ before finally saving the cleaned list back to HTML via the user-friendly MarkdownViewerPlusPlus plugin for Notepad++.⁵

So the basic workflow here is:

1. Get your search results and save your Web page(s) as usual. There’s no need to painstakingly select-copy-paste just part of the page.

2. Use a joiner utility (TXTcollector) to join all the saved pages.⁶ Open the saved HTML with PanWriter and it instantly auto-converts into markdown.

3. Open the saved markdown with Notepad++ and run your cleaning and text-sorting custom macro on it.

4. Copy-paste the resulting linked data list as HTML to your blog etc. (Or to .CSV for Excel import and sorting).

The above is a more advanced and robust version of my recent home-brew workflow, which suggested a browser addon and manual copy-paste. That was more suitable for occasional use by bloggers and academics who can’t afford sophisticated data scrapers (and the proxies to run them).

This workflow has the advantage that: i) it’s all free software; ii) it doesn’t need you to pay for and burn through proxies in your paid scraping software; iii) as long as you have the HTML in your browser it can be grabbed, and it basically doesn’t matter how complex and nested the page code is, as it’s all going into Markdown; and iv) collection can be automated with Windows automation software (JitBit) etc, and processing can be automated with Notepad++ macros. But it is obviously not suited to automated scraping of millions of records from multiple shopping sites — if you’re into that game then you should have the cash to buy in those datasets.
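As a rough illustration of the extraction and .CSV steps, the sketch below pulls the inline links out of a block of Markdown and renders them as CSV text. It is a scripted stand-in for the Notepad++ macro rather than a copy of it. The function names are hypothetical, and the pattern assumes simple `[label](url)` links, so it will not cope with images nested inside links.

```python
import csv
import io
import re

# Inline Markdown links of the form [label](url), optionally with a title.
# The (?<!!) lookbehind skips images, which are written as ![alt](src).
LINK_PATTERN = re.compile(r"(?<!!)\[([^\]]+)\]\(([^)\s]+)[^)]*\)")


def extract_links(markdown_text):
    """Return (label, url) pairs for every inline link, in document order."""
    return [
        (match.group(1).strip(), match.group(2))
        for match in LINK_PATTERN.finditer(markdown_text)
    ]


def links_to_csv(links):
    """Render the (label, url) pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["label", "url"])
    writer.writerows(links)
    return buffer.getvalue()
```

The CSV text can be written to a file and opened in Excel for sorting, in the same way as the macro-cleaned list.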

Notes:

1. There are also two old GUIs for Pandoc, over on GitHub. One there is nice and simple, and has batch… but it crashes and fails on 64-bit Windows, as the developer admits in his readme. The other GUI at GitHub was tried and runs on 64-bit Windows, but seems far less user-friendly. There are also a half dozen Python script projects that do this.

2. Marking and exporting lines in Notepad++ can’t currently be done for multi-line nested HTML code, which is why a HTML-to-Markdown conversion is so useful. While multi-line block marking can be done between two keywords [ Find | Mark | Regex with Newline | then paste in…

(?<=STARTWORD)([\s\S]*?)(?=ENDWORD)

…this only places a single mark at the top of each marked and highlighted block. It does not run a line of marks down the entire block.

3. Yes, I know about XPath, but with a complex Web page it's: i) fiendishly tricky to do the initial puzzling out of what needs to be captured; ii) often fails to then grab what’s needed; iii) and has even more difficulty in aligning data fields when used as a browser addon.

4. Note that in macros, multiline search-replace needs to be done with \n commands rather than plugins. Also, Ctrl + Home will get your cursor back to the top of the text.

5. Sadly one can’t yet use Notepad++ as the initial importer/converter, as it has no such plugin at present. I’ve looked. There are a couple of possible Python scripts but support for the Python plugin in the latest Notepad++ is a bit of a mess at present, with plugin structures being swopped around and then reverted. So that’s not really an option, unless you want to fall right back to version 5.9 or thereabouts to use a script.

6. Update: If you have no absolute need to keep the saved HTML pages as backup, then Clipboard Magic is a lovely little Windows freeware that keeps copies of each clipboard, then when done you “Copy all clips to clipboard” and paste to Notepad++. Or if you still use an older 32-bit Notepad++ you can use the fine MultiClipboard plugin.
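To illustrate the limitation described in note 2, here is a small Python sketch that uses the same lookbehind/lookahead pattern but records every line a block touches, rather than a single mark at the top. The function name is hypothetical, and STARTWORD and ENDWORD stand in for whatever keywords bracket your data. This shows the idea only, and is not a Notepad++ feature.

```python
import re


def mark_block_lines(text, start_word, end_word):
    """Return the 1-based numbers of every line touched by a START...END block."""
    # Same shape as (?<=STARTWORD)([\s\S]*?)(?=ENDWORD), with the keywords escaped.
    pattern = re.compile(
        r"(?<=" + re.escape(start_word) + r")([\s\S]*?)(?=" + re.escape(end_word) + r")"
    )
    marked = set()
    for match in pattern.finditer(text):
        # Count the newlines before each end of the match to get line numbers.
        first = text.count("\n", 0, match.start()) + 1
        last = text.count("\n", 0, match.end()) + 1
        marked.update(range(first, last + 1))
    return sorted(marked)
```

The returned line numbers include the lines holding the start and end keywords, so a whole block can be copied out or deleted in one go.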

Freeware for cleaning and manipulation of text lists

02 Thursday May 2019

Posted by futurilla in JURN tips and tricks

These are all Windows PC freeware, with graphical user interfaces, tested and working on Windows 8.1.x. They may be useful for those who occasionally have to sort and clean and combine lists in text form, and who do not have access to paid tools such as the sophisticated TextPipe Pro or the Sobolsoft utilities, or to advanced training in Excel and regex commands.

The relatively simple:

1. Text Cleanup 2.0.

It “fixes” text automatically when you copy-paste it, according to various cleaning options you can save presets for. Its main use is to unwrap a chunk of text that has hard line-breaks, when copied to the clipboard. Or to place a new blank line between each line. This vital software only recently went freeware.

Can be used in combination with the free Clipboard Magic which keeps copies of all Clipboard text, and then allows you to “Copy all clips to clipboard”.

2. List Numberer.

This can do what Notepad++ can’t yet do, and does easily what Excel can seemingly only do with complex, fiddly formulas and macros. Most useful for dealing with repeated blocks of data in a list (e.g. labelling them 1234, 1234, 1234), to enable mass deletion of certain lines in a text editor.

3. Text Magician 1.3.

Various operations including append text to the start and end of each line, delete multi-line blocks between X and Y, and more. (If you have ‘.DLL missing’ problems, either go find the required .DLL file, or use Version 1.0 which does not have that problem).

4. Duplicated Finder from AKS-Labs.

Easily find and extract the duplicates from a single list. Useful for checking for the presence of a few duplicate URLs in a long list of uniques, for instance. (See also the free Duplicate Master addin for Excel).

5. Excel example sheet: compare two lists and extract non-duplicates.

My free ready-made .XLS sheet for Excel, with formula. The second list is a jumbled up variant of the first, with some new additions in it. These additions are extracted and placed alongside. (Excel is not free, admittedly, but my guess is you could probably get the same formula working in whatever LibreOffice has as its Excel equivalent).
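For readers who are comfortable with a little scripting, several of the simpler operations above can be roughly approximated in a few lines of Python. This is a sketch under my own assumptions about what each operation does, not a description of how the named tools work internally, and all the function names are hypothetical.

```python
import re
from collections import Counter
from itertools import cycle


def unwrap_paragraphs(text):
    """Join hard-wrapped lines, keeping blank lines as paragraph breaks."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)


def number_in_cycles(lines, block_size):
    """Prefix each line with its position inside a repeating block (1..block_size)."""
    positions = cycle(range(1, block_size + 1))
    return [f"{pos}\t{line}" for pos, line in zip(positions, lines)]


def find_duplicates(items):
    """Return the items that occur more than once, in first-seen order."""
    counts = Counter(items)
    return [item for item in dict.fromkeys(items) if counts[item] > 1]


def new_items(old_list, new_list):
    """Return entries of new_list that are absent from old_list, keeping order."""
    seen = set(old_list)
    return [item for item in new_list if item not in seen]
```

After `number_in_cycles`, every line labelled with a given position (say, 2) can be found and deleted in bulk in a text editor, which is the repeated-blocks use described for List Numberer.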

The potentially quite complex:

1. Notepad++.

The code programmer’s text editor. Column numbering (though it can’t do what List Numberer, above, can do); sophisticated Regex (though the more sophisticated, the more difficult to remember and to get it working); Remove blank lines (provided you can remember the menu sequence within its complex UI); and much more. Intensive research is often needed to learn how to do a particular bit of sophisticated text manipulation, and it’s also easy to overlook its most powerful features such as per-line list bookmarking. The devs have recently fumbled a move to a different plugin structure, and thus you may need to run the latest 64-bit version alongside an older 32-bit version in order to run PythonScript and older plugins such as Multiline Search/Replace (appears under ‘Tool Bucket’), Column Sorting and Line Filter 2.

2. WildGem 1.3.

A tool for building and testing ‘regular expression’ or ‘regex’ commands. Find and replace with these commands, and see the resulting changes (if any) in realtime. This software can hide some of the more common ‘regex’ snippets behind more user-friendly visual icons. Useful for instantly testing ‘regex’ command formulas you find, to see if they work, without having to wrestle with Notepad++. This is portable Windows software. In order to save your UI layout preferences, it must be run in Administrator mode.

3. CSVed.

A CSV file editor, an alternative you may prefer to behemoth software such as Excel. Move lists and sections around, split lists, add to lines. Appears to lack ability to do column numbering for lists (for which see List Numberer, above).

4. Openview’s Index Generator 7.0.

Dedicated to creating a back-of-the-book index for a book. This one is more about the creation of the list, admittedly, but it has various filtering options while doing this. The curious lack of a ‘capitalised words only’ filter makes it far less useful than its paid competitor. Asks for a donation on exit. (Note that Softpedia’s review states you “upload a document to the program”, and this wording may mislead the casual reader into thinking this is partly online cloud-linked software. It isn’t, it’s standalone Windows software).

5. There’s a free Selected HTML page-content to Markdown addon for Chrome-based browsers, and also a Markdown to BBCode converter.

May be useful if either Markdown or BBCode is easier to work with, re: sorting and cleaning list-shaped content grabbed from a Web page. The latter is a self-contained JavaScript-based Web page and can work offline: just save the page locally and re-open it.
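To give a feel for how lightweight a Markdown to BBCode conversion can be, here is a minimal Python sketch covering only inline links, bold and italic. It is my own illustration with a hypothetical function name, not the code of the converter mentioned above, and it ignores headings, lists, images and escaped asterisks.

```python
import re


def markdown_to_bbcode(text):
    """Convert a small subset of Markdown (links, bold, italic) to BBCode."""
    # [label](url) becomes [url=url]label[/url]
    text = re.sub(r"\[([^\]]+)\]\(([^)\s]+)\)", r"[url=\2]\1[/url]", text)
    # **bold** becomes [b]bold[/b]; done before italic so the pairs are not split
    text = re.sub(r"\*\*(.+?)\*\*", r"[b]\1[/b]", text)
    # *italic* becomes [i]italic[/i]
    text = re.sub(r"\*(.+?)\*", r"[i]\1[/i]", text)
    return text
```

The order of the substitutions matters here: handling bold first stops the italic pattern from consuming one asterisk of each `**` pair.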

Review of Cabell’s Predatory Journal Blacklist

02 Thursday May 2019

Posted by futurilla in Academic search, Spotted in the news

A new review of a paywalled up-to-date blacklist of predatory journals, “Cabell’s Predatory Journal Blacklist: An Updated Review”, at the Scholarly Kitchen.

JURN pagination links fixed

01 Wednesday May 2019

Posted by futurilla in JURN's Google watch, My general observations

In the last week or so Google has made some slight changes to the default styling templates for CSEs, resulting in the numbered pagination links at the foot of the search results becoming very small and grey. This has now been fixed on JURN, and your per-page links to more search results should now look like this. They should be far more easily selectable now, and especially for touch-screen users…

My thanks to Amit Agarwal of India, for the elegant snippet of commented CSS for the .gsc-cursor-page element. If you have the same problem with your own CSE, this snippet goes in the style header of your page. Colours are controlled elsewhere, in the ‘Look & Feel’ | Customise | Refinement section of your CSE admin dashboard.

Changes may not show up until you and your users refresh your main page a few times, due to Web browser caching.

GRAFT has also had the same fix applied.

Update:

Also add padding for the pagination row, by adding the following to your CSS style (I have mine in the page itself)…

News from JURN

~ search tool for open access content

Monthly Archives: May 2019
