Juan Michelini
Back Up Blogspot using WGET

Say you want to download an entire Blogspot blog. Wget treats every link containing a ? character as a separate page, so a plain recursive download fetches all of them. Telling wget to reject those links does not help either: it still downloads them. (They are HTML files, so wget fetches them to follow their links recursively and only deletes them afterwards.)

Fortunately, there is a way to download every page of a Blogspot blog without downloading the pages that have a question mark in their URL. It depends on the blog having an archive widget. Most do, so this shouldn’t be a problem.

Here is the code:

wget -q -O- "http://googleblog.blogspot.com" | grep "archive.html" | sed -e 's/>[^<]*<//g' | sed -e "s/<a\ class='post-count-link' href='//g" | sed -e "s/'\/a>//g" | xargs wget -np -nc -l 1 -r

Now for a little explanation:

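First, it downloads the main page, which contains the archive widget with all the links, and prints it to standard output (-q -O-).
Then, it keeps only the lines that mention archive.html and strips the link text and the tags surrounding each link, leaving one URL per line.
Finally, it pipes those links into a second wget, which downloads them (-r -l 1 follows links one level deep, -np never ascends to the parent directory, and -nc skips files that were already downloaded).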
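
To see what the extraction step does without touching the network, here is a minimal sketch that wraps the same grep and sed expressions in a function and runs them on a single line of widget markup. The function name extract_archive_links, the example.blogspot.com address, and the sample markup itself are assumptions made for illustration; a real archive widget may format its links differently.

```shell
#!/bin/sh
# Hypothetical helper: read blog HTML on stdin, print one archive URL per line.
# It applies the same grep and sed expressions as the one-liner above.
extract_archive_links() {
    grep "archive.html" |
        sed -e 's/>[^<]*<//g' \
            -e "s/<a class='post-count-link' href='//g" \
            -e "s/'\/a>//g"
}

# Assumed sample of archive widget markup (made up, not fetched from a real blog).
sample="<a class='post-count-link' href='http://example.blogspot.com/2011_01_01_archive.html'>January</a>"

# Prints the bare URL with the link text and tags removed.
printf '%s\n' "$sample" | extract_archive_links
```

For the sample line, the first expression removes ">January<", the second removes everything up to the opening quote of the href, and the third removes the trailing "'/a>", so only the URL is left. Feeding it real pages would look like: wget -q -O- "http://googleblog.blogspot.com" | extract_archive_links | xargs wget -np -nc -l 1 -r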

And that is it.

This still doesn’t work when the blog has no archive widget. Of course the archive pages themselves are still there, so it should be achievable in principle. If you manage it, tell me about it!
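
One untested starting point is to generate the archive URLs directly instead of scraping them from the widget. The sketch below assumes that monthly archive pages are named YYYY_MM_01_archive.html; that naming is a guess based on the archive.html links the command above looks for, so check it against the blog you are backing up. The function name monthly_archive_urls and the example.blogspot.com address are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print one candidate monthly archive URL per line.
# ASSUMPTION: archive pages are named YYYY_MM_01_archive.html.
# Usage: monthly_archive_urls BASE_URL FIRST_YEAR LAST_YEAR
monthly_archive_urls() {
    base=$1
    year=$2
    last_year=$3
    while [ "$year" -le "$last_year" ]; do
        for month in 01 02 03 04 05 06 07 08 09 10 11 12; do
            printf '%s/%s_%s_01_archive.html\n' "$base" "$year" "$month"
        done
        year=$((year + 1))
    done
}

# Show the first few candidates for a made-up blog.
monthly_archive_urls "http://example.blogspot.com" 2010 2011 | head -n 3
```

The list can then be piped into the same download step: monthly_archive_urls "http://example.blogspot.com" 2010 2011 | xargs wget -np -nc -l 1 -r. I have not checked what a blog returns for months that have no posts, so expect some failed or empty downloads.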