07 Monday Dec 2009
Posted in New titles added to JURN
Added to the JURN site-index today:—
Medical Historian : Bulletin of the Liverpool Medical History Society (1988-2008) (Substantial scholarly articles)
07 Monday Dec 2009
Posted in Spotted in the news
Kyle Grayson summarises his thinking as… “part of an ad hoc working group with colleagues from Newcastle and Durham Universities that has been exploring the future of academic publishing” …
“in the social sciences and humanities, low citation rates and impact factors — even for leading journals — that in part reflect the inability to capture a broad audience within an academic discipline, let alone establish a readership with practitioners and/or the general public” […] “our research findings on the broader trends in media publishing in general, and scholarly publishing in particular, demonstrate that there are problems emerging over the horizon” […] “‘staying the course’ — in terms of content, public interface, and revenue models — will lead to negative outcomes within a decade’s time.”
He suggests certain immediate remedies…
* implementing a dynamic journal website … where content is regularly updated
* audio and video recordings of keynote speeches, lectures, interviews, or discussions
* on-line book reviews […] invite contributions from the wider readership
* blogs run by the editorial team and/or other members at large
* alerting potential users of content [with] updates through social networking tools like email, Twitter, Facebook, and RSS feeds
To which I might add things like…
* a collaborative subject-specific Custom Search Engine
* simple “plain English” summaries of all articles (not the same thing as abstracts)
* a curated “overlay” ejournal, linking to free repository content
* Amazon pages for all monographs
* translating all abstracts into Chinese, Japanese, and Spanish
* a concerted campaign to get backlinks to your website
* considering the purchase of a good $50 template for the journal (it’s not just about the frequency of updating, but about how stylish it feels)
* really good photography of the participants
Backlinks are particularly important. Take, for instance, the journal Quaderno, which I found yesterday: six full issues of a free academic journal from a reputable university, on interesting aspects of early American history, in a country teeming with re-enactors and amateur historians. Yet, according to Google, it has not a single inbound link — not even from other academic sites. It has been online since 2004.
06 Sunday Dec 2009
Posted in New titles added to JURN
Added to the JURN site-index today:—
Quaderno (early American history)
06 Sunday Dec 2009
In this simple tutorial I’ll show you how to rip a page of search result links into a .csv file, along with their link titles, using nothing more than Notepad and a simple bit of JavaScript.
(Update: January 2011. This tutorial has been superseded by a new and better one.)
1) Have Google run your search in advanced mode, selecting “100 results on a page”. If you prefer Bing, choose Preferences / Results, and select “50 on a page”.
2) Run the search. Once you have your big page o’ results, just leave the page alone and save it locally — doing things like right-clicking on the links will trigger Google’s “url wrapping” behaviour on the clicked link, which you don’t want. So just save the page (In Firefox: File / Save Page As…), renaming it from search.html to something-more-memorable.html
3) Now open up your saved results page in your favourite web page editor, which will probably add some handy colour-coding to tags so you can see what you’re doing. But you can also just open it up in Notepad, if that’s all you have available. Right click on the file, and “Open with…”.
4) Locate the page header (it’s at the very top of the page, where the other scripts are), make some space in there, and then paste in this JavaScript…
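The script itself hasn’t survived in this copy of the post, so here is a rough sketch of the kind of thing intended, in plain 2009-era JavaScript: it walks every <a> tag on the saved page and builds one numbered, comma-delimited line per link. The function name extractLinksAsCSV and the csvOutput id are my own labels, chosen to match the button-and-textarea snippet in step 5 below (they are not the names used in the original).

<script type="text/javascript">
// Sketch only (not richarduie's original): extract every link on the page
// as numbered, comma-delimited lines of URL and anchor text.
function extractLinksAsCSV() {
    var links = document.getElementsByTagName("a");
    var lines = [];
    for (var i = 0; i < links.length; i++) {
        // Strip any markup inside the anchor, and turn double quotes into
        // single ones, so that the CSV columns stay intact.
        var text = links[i].innerHTML.replace(/<[^>]*>/g, "").replace(/"/g, "'");
        lines.push((i + 1) + ',"' + links[i].href + '","' + text + '"');
    }
    // Dump the whole list into the textarea added in step 5.
    document.getElementById("csvOutput").value = lines.join("\n");
}
</script>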
A hat-tip to richarduie for the original script. I just hacked it a bit, so as to output the results in handy comma-delimited form.
5) Now locate the start of the BODY of your web page, and paste in this code after the body tag…
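Again, the original markup isn’t preserved here; something along these lines (re-using the made-up extractLinksAsCSV and csvOutput names from the step 4 sketch) would do the job:

<input type="button" value="Extract all links and anchor titles as a CSV list" onclick="extractLinksAsCSV();" />
<br />
<textarea id="csvOutput" rows="25" cols="120"></textarea>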
Save and exit.
6) Now load up your modified page in your web browser (I’m using Firefox). You’ll see a new button marked “Extract all links and anchor titles as a CSV list”…
Press it, and you’ll get a comma-delimited list of all the links on the page, alongside all the anchor text (aka “link titles”), in this standard format…
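Something like this, with example.com standing in for the real result URLs:

1,"http://www.example.com/papers/something.pdf","Some article title"
2,"http://www.example.com/papers/another.pdf","Another article title"
3,"http://www.example.com/papers/third.pdf","A third article title"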
Highlight and copy the whole list, and then paste it into a new Notepad document. Save it as a .csv file rather than a .txt file. You can do this by manually changing the file extension when saving a file from Notepad.
7) Now you have a normal .csv file that will open up in MS Excel, with all the database columns correctly and automatically filled (if you don’t own MS Office, the free Open Office Calc should work as an alternative). In Excel, highlight the third column (by clicking so as to highlight its top bar), then choose “Sort and Filter” and then “A-Z”…
You’ll then be asked if you want to “Expand the selection”. Agree to the expansion (important!), and the column with the anchor text in it will be sorted A-Z. Expansion means that all the columns stay in sync when one is re-sorted like this.
Now you can select and delete all the crufty links in the page that came from Google’s “Cached”, “Similar”, “Translate this page” links, etc. These links will all have the same name, so by listing A-Z we’ve made them easy to delete in one fell swoop.
8) You’re done, other than spending a few minutes ferreting out some more unwanted results. Feel free to paste in more such results from Bing, de-duplicate, etc.
If you wanted to re-create a web page of links from the data, delete the first column of numbers, and then save. Open up your saved .csv in Notepad. Now you can do some very simple search and replace operations, to change the list back into HTML…
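The exact operations aren’t shown in this copy of the post, but one way to do it, assuming an editor that supports regular-expression search and replace (plain Notepad doesn’t, though Notepad++ and most others do), is three replace-alls:

replace  ^"    with  <a href="
replace  ","   with  ">
replace  "$    with  </a>

which turns a line such as

"http://www.example.com/papers/something.pdf","Some article title"

into

<a href="http://www.example.com/papers/something.pdf">Some article title</a>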
(Note: you can also use the excellent £20 Sobelsoft Excel Add Data, Text & Characters To All Cells add-in for complex search & replace operations in Excel)
Ideally there would be free Firefox Greasemonkey scripts, simple freeware utilities, etc, that could do all of this automatically. But, believe me, I’ve looked and there aren’t. Shareware Windows URL extractors are ten-a-penny (don’t waste good money on them, use the free URL Extractor), but not one of them also extracts the anchor text and saves the output as .csv.
Yes, I do know there’s the free Firefox addon Outwit Hub, which via its Data / Lists … option can capture URLs and anchors — but it jumbles everything in the link together, anchor text, snippet, Google gunk, etc, and so the link text requires major cleaning and editing for every link. Even with the hit-and-miss home-brew scraping filters, it’s not a reliable solution.
05 Saturday Dec 2009
Posted in New titles added to JURN
Added to the JURN site-index today:—
American Research Institute in Turkey : newsletter
[ Hat-tip for the above: AWOL blog ]
Paul Mellon Centre for Studies in British Art : newsletter (has short book reviews)
03 Thursday Dec 2009
Posted in New titles added to JURN
Added to the JURN site-index today:—
[ Hat-tip: AWOL blog ]
+
Fixed JURN’s content URL for Japanese Review
03 Thursday Dec 2009
1) Google often doesn’t seem to index quite everything at a site. Nor does it always index everything on a page or in a PDF file. Or perhaps it does index everything, but the algorithm that shapes each set of search results jettisons a few results for various reasons? The other possibility is that Google’s results are drawn from a pool of ‘shards’ of previous results, rather than direct from the core crawl data.
Solution: Google “Caffeine” and subsequent revamps?
2) Results from the main Google search can sometimes differ from those in your CSE. Your CSE will occasionally return radically fewer results from a site than the main Google does. Google doesn’t explain why this is, or the mechanism behind it. Perhaps there are several different versions of the Google index. Results are often much better when using a more sophisticated search method than simple keywords, such as searching for exact phrases. Sometimes you have to give up on trying to get your CSE to “see” the PDFs you want (although these are visible to the main Google) — and instead find a way to index just the linked table-of-contents pages (which will usually show up in your CSE).
Solution: A lot of extra work. Google could offer a “full Google” CSE to worthy non-profits.
3) Academics love to store the real content at a location that has a different URL from their home page. An unoptimised CSE may thus index a website containing ten pages, but not the 10,000 articles that those pages point to.
Solution: A lot of extra work, of the sort that JURN has undertaken, to find and then optimise the real “content location URL”.
4) Initial URL gathering can be arduous. Techies and web editorial staff at universities love to juggle directory structures, often for no discernible reason, and thus break links. Link-rot is severe in ejournal lists from more than two years ago, and lists over four years old often have around 80% dead links.
Solution: Techies need to set up robust redirects if they really have to break URLs. “Self-destruct tags” that delete a links-list page after a certain date, if it hasn’t been updated for more than two years.
5) Google CSEs cannot pick specific content (e.g.: a run of journal issues) from the meaningless database-driven URLs commonly found in academic repositories, since there is no repeating URL structure to grab onto. It’s a question of indexing “all or nothing”.
Solution: URL re-mapping services that are recognised and can be “unwrapped” by Google? Plain HTML “overlay” TOCs.
6) Editors don’t enforce proper file-names on published documents, which means many CSE search results are titled in the Google results as something like “&63! print_only sh4d7gh.indd” rather than “My Useful Title”. Nor do people add the home location URL and website title to the body of their document — which means that scholars can waste several minutes per article trying to find out where it came from. Some students may never manage to find the journal title for the article they downloaded.
Solution: Better publication standards at open access and independent ejournals.
7) Large Google CSEs are easy to make, but take a lot of hand-crafting to properly optimise and maintain. “Dead” CSEs from late 2006, when the CSE service first appeared, litter the web. Most of these were also un-optimised. Despite the potential of CSEs, it’s really hard to find large subject-specific CSEs that are both optimised and maintained. Most people now seem to use CSEs for indexing a single site or a small cluster of sites that they own.
Solution: Users should remove old circa-2006 CSEs from the web. Subject-specific academic and business groups should consider building a collaborative CSE rather than a wiki.
8) Google’s search result ranking doesn’t work as well as it might in tightly defined academic searches. PageRank wants to “spread the results” evenly across a variety of sites, and thus you’ll rarely see results from just one site dominating the first ten hits – although that may be exactly what a tight academic search requires.
Solution: For some types of CSE, this could probably be solved by delving into the optimisation features that Google offers for linked CSEs. Update: Google appears to have tweaked the algorithms to fix this problem.
9) Google searches have a problem with finding text at the end of long article titles, of the kind which are common in academia.
Solution: Authors and publishers should work to keep article and page titles under 50 characters.
10) You can’t have your CSE do a “search within search results”.
Solution: Manually build a set of pages containing the result URLs you want indexed, then get Google to see these as static pages which can then be added to your CSE.
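To make the solutions to (5) and (10) concrete: the sort of plain HTML “overlay” page meant here is just a hand-made (or script-generated) table of contents, parked at a stable URL that the CSE can be pointed at, carrying ordinary links out to the database-driven repository URLs. A minimal sketch, with invented URLs:

<html>
<head><title>Example Journal : contents of issues 1-3</title></head>
<body>
<h1>Example Journal : contents overlay</h1>
<!-- plain, crawlable links out to the repository's database-driven URLs -->
<ul>
<li><a href="http://repository.example.edu/handle/123456789/101">Vol. 1 (2007) : Some article title</a></li>
<li><a href="http://repository.example.edu/handle/123456789/102">Vol. 1 (2007) : Another article title</a></li>
<li><a href="http://repository.example.edu/handle/123456789/103">Vol. 2 (2008) : A third article title</a></li>
</ul>
</body>
</html>

Once such a page is live and added to the CSE (or, for problem 10, built from a harvested set of result URLs), Google sees a stable, crawlable page even though the documents themselves sit behind meaningless database addresses.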
03 Thursday Dec 2009
Posted in New titles added to JURN
Added to the JURN site-index today. JURN is now indexing over 3,500 titles:—
Site/Lines (“a literary forum for essays and reviews of books, exhibitions, and designs dealing with landscape themes and projects” – publication of the Foundation for Landscape Studies)
Pli : the Warwick Journal of Philosophy
Ivy Journal of Ethics (applied bioethics, published by the Bioethics Society of Cornell)
Ecclesiology Today (British church buildings and furnishings)
02 Wednesday Dec 2009
A new report from the UK, JISC national e-books observatory project: Key findings and recommendations (PDF link, 1.2Mb)…
“Behavioural evidence from the Observatory project strongly suggests that [university] course text e-books are currently used for quick fact extraction and brief viewing rather than for continuous reading, which may conflict with the assumptions about their use made by publishers (and authors). They are being used as though they are encyclopedias or dictionaries rather than extended continuous text.”
02 Wednesday Dec 2009
Brian Eno in Prospect magazine, on the death of uncool…
“There’s a whole generation of people able to access almost anything from almost anywhere, and they don’t have the same localised stylistic sense that my generation grew up with. It’s all alive, all “now,” in an ever-expanding present, be it Hildegard of Bingen or a Bollywood soundtrack. The idea that something is uncool because it’s old or foreign has left the collective consciousness.”
Why is this interesting here on the JURN blog? Because Eno relates this apparent change to increasingly nuanced classifications of cultural products. That change must arise partly from our ability to tag and generally re-clump cultural products online into ever finer categories (Amazon Listmania lists, Spotify playlists, etc.), although one can see ample evidence that this was starting to happen in music before 1995 and the Web. Possibly there’s also some spillover from huge genre blockbusters, since better classification and cultural navigation routes mean that far more people can now migrate out from quality blockbuster experiences to similar but much more obscure product (e.g. from Harry Potter to The Giant Under The Snow).
Eno perhaps misses some subtleties. Category-proliferation is inclusive in the online world (Wikipedia pages that easily explain the finer points of a classification to the uninitiated, searches that quickly offer up frictionless samples of it, easy-access online communities of interest). This plenitude helps to spread the range of sustained interests people have, which means British politeness has to go into overdrive to keep up when we meet someone in person and they start talking about their interests — thus possibly contributing to the demise of “uncool”. But the real-world groups forming around and promoting these categories remain exclusionary, since age-related group dynamics and simple shyness kick in (you won’t see many over-40s at your 8-bit electropop game-music night, or groups of eager adolescents at a classical concert). Perhaps they are even more exclusionary because the categories are so niche, so the fragile boundaries need all the more patrolling. “Uncool” still potently exists in the real world of cultural events, and in musical terms it’s still tightly intertwined with social class, age, and personal prettiness.
More hopefully, though, Eno concludes by suggesting that…
“The sharing of art is a precursor to the sharing of other human experiences” … “what is pleasurable in art becomes thinkable in life”
I’m not sure that’s likely, at least not in the British context. The British climate has always been conducive to us drawing the curtains and “living in our imaginations” for six months of the year, often while sampling all sorts of exotic and fantastical influences and stories, but it doesn’t seem to have made the national character any the less reserved.
And I think it might be more useful to consider “old” and “foreign” as separate issues. Eno is being quietly political by casually conflating them. Although, in the end, it’s true that they’re part of the same process of cultural assimilation and re-invention.
The British have always seen “the foreign” as potential material to be quietly appropriated and re-worked into the national culture and national identity. Be wary when the British start to pay serious cultural attention to “the foreign” — we usually want to assimilate it and neuter it. We don’t openly talk much about that process, though — hence the social usefulness of “uncool” at the moment of appropriation, while under the surface we are quietly exoticising the material so as to extract all the cool we can, ready for eventual re-shaping and re-deployment in the “taste wars” that have long served as a useful proxy for all sorts of other polite social conflicts in the British Isles. Then, 30 years on, once it’s safely drained, we claim bits of it as our own and forget its origin.
And popular, unashamed interest in “the old” is nothing new. This neo-romantic antiquarian strain can be seen perennially in British pop culture since circa 1966/7: from Pink Floyd weaving references to Hereward the Wake into their lyrics, to the Beatles’ neo-Victorian dress and moustaches on Sgt. Pepper, Peter Gabriel on “Solsbury Hill”, Jarman’s re-imagining of Shakespeare, Morrissey’s love of graveyards, Vivienne Westwood’s clothes, Edward Larrikin warbling “everything that I adore came well before 1984”, through to modern antiquarians such as Julian Cope. There are many parallels in art, film, and literature. There has always been a sense that the past is a mine to be plundered for contemporary cultural production. What has changed recently in the culture is perhaps the sudden breakdown of the Blairite hegemony around Englishness and history, and that may be what Eno is picking up on when he talks of…
“The idea that something is uncool because it’s old … has left the collective consciousness.”
Although this is certainly not the case with our architecture, where the credo among planners is still very much “old = neglect it, so we can demolish it”.