In this simple tutorial I’ll show you how to rip a page of search result links into a .csv file, along with their link titles, using nothing more than Notepad and a simple bit of javascript.
(Update: January 2011. This tutorial superseded by a new and better one)
1) Have Google run your search in advanced mode, selecting “100 results on a page”. If you prefer Bing, choose Preferences / Results, and select “50 on a page”.
2) Run the search. Once you have your big page o’ results, just leave the page alone and save it locally — doing things like right-clicking on the links will trigger Google’s “url wrapping” behaviour on the clicked link, which you don’t want. So just save the page (In Firefox: File / Save Page As…), renaming it from search.html to something-more-memorable.html
3) Now open up your saved results page in your favourite web page editor, which will probably add some handy colour-coding to tags so you can see what you’re doing. But you can also just open it up in Notepad, if that’s all you have available. Right click on the file, and “Open with…”.
4) Locate the page header (it’s at the very top of the page, where the other scripts are), make some space in there, and then paste in this javascript script…
A hat-tip to richarduie for the original script. I just hacked it a bit, so as to output the results in handy comma-delimited form.
5) Now locate the start of the BODY of your web page, and paste in this code after the body tag…
Save and exit.
6) Now load up your modified page in your web browser (I’m using Firefox). You’ll see a new button marked “Extract all links and anchor titles as a CSV list”…
Press it, and you’ll get a comma-delimited list of all the links on the page, alongside all the anchor text (aka “link titles”), in this standard format…
Highlight and copy the whole list, and then paste it into a new Notepad document. Save it as a .csv file rather than a .txt file. You can do this by manually changing the file extension when saving a file from Notepad.
7) Now you have a normal .csv file that will open up in MS Excel, with all the database columns correctly and automatically filled (if you don’t own MS Office, the free Open Office Calc should work as an alternative). In Excel, highlight the third column (by clicking so as to highlight its top bar) , then choose “Sort and Filter” and then “A-Z”…
You’ll then be asked if you want “Expand the selection”. Agree to expansion (important!), and the column with the anchor text in it will be sorted by A-Z. Expansion means that all the columns stay in sync, when one is re-sorted like this.
Now you can select and delete all the crufty links in the page that came from Google’s “Cached”, “Similar”, “Translate this page” links, etc. These links will all have the same name, so by listing A-Z we’ve made them easy to delete in one fell swoop.
8) You’re done, other than spending a few minutes ferreting out some more unwanted results. Feel free to paste in more such results from Bing, de-duplicate, etc.
If you wanted to re-create a web page of links from the data, delete the first column of numbers, and then save. Open up your saved .csv in Notepad. Now you can do some very simple search and replace operations, to change the list back into HTML…
(Note: you can also use the excellent £20 Sobelsoft Excel Add Data, Text & Characters To All Cells add-in for complex search & replace operations in Excel)
Ideally there would be free Firefox Greasemonkey scripts, simple freeware utilities, etc, that could do all of this automatically. But, believe me, I’ve looked and there aren’t. Shareware Windows URL extractors are ten-a-penny (don’t waste good money on them, use the free URL Extractor), but not one of them also extracts the anchor text and saves the output as .csv.
Yes, I do know there’s the free Firefox addon Outwit Hub, which via its Data / Lists … option can capture URLs and anchors — but it jumbles everything in the link together, anchor text, snippet, Google gunk, etc, and so the link text requires major cleaning and editing for every link. Even with the hit-and-miss home-brew scraping filters, it’s not a reliable solution.
Borrowind said:
Update: there is now the shareware Excel Import Multiple Google Search Results software. It only works on Google Search — not on Bing, Google Scholar, etc. The free Firefox plugin SEM Tools will also extract to Excel with anchor titles, and do so on any web page. Neither will reformat the results as viable clickable HTML Web links.
Borrowind said:
Update: there’s now a Firefox addon solution that works on any page:
http://jurnsearch.wordpress.com/2011/01/08/how-to-copy-urls-with-their-anchor-text-alongside-them/
It doesn’t re-work the links as valid HTML links, though. You’d still need to use my Excel spreadsheet to do that (see above link).
MegaMustang said:
The OutWit plugin for FF seems to do this now. And it also grabs emails
Borrowind said:
Yes, but Outwit still generates too many columns that then have to be deleted. And it also grabs Google junk, which also has to be sorted and deleted. This new method is faster…
http://jurnsearch.wordpress.com/2011/01/08/how-to-copy-urls-with-their-anchor-text-alongside-them/
W130SN said:
Please tell me, what is the easiest method of extracting just URL’s from a CSV file ? I just want the URLs and nothing else.
David Haden said:
Simply typing the words URL + extractor in Google might have saved you some time here, W130.
http://www.focalmedia.net/urlextract.html
Asif Khan said:
Thank you for that. I did try various searches, but of course not that one. Doh ! Anyway I will try out that software. Cheers.