Friday, 10 February 2012

A very stupid wget crawler

As I mentioned in the last post, my gf wanted to analyse some online texts, and she spent ten hours manually extracting 100 samples from the Internet (basically Ctrl+C and Ctrl+V), so I decided to help her.

In the shell, there are two very powerful download tools, wget and curl, and personally I prefer wget. So I tried using wget to crawl that website, but unfortunately I retrieved only the site's index file. I modified my wget crawler, but still got nothing. A friend helped me check the site and found that it is possibly a dynamic JavaScript page, so it would be very difficult to retrieve with a simple wget crawler.

Then I had two options: map the site and download it, or write an address file for wget. For me, the first option was too challenging. Fortunately, the site's addresses are very regular: they can be regarded as an arithmetic series, although a very long one (about 400,000 terms). Hence I chose the second option and used the shell to create an address list for wget.

The format of the address is like this: 

http://blabla.com/1 
http://blabla.com/2
...

So I first create two separate lists: one contains "http://blabla.com/" repeated on every line, and the other is a very long arithmetic series. Then I can combine them line by line. It sounds very simple, but it is a bit tough.

In the shell, jot can print the same string as many times as you like, so this part is simple:

$ jot -b http://blabla.com/ 400000 # -b repeats the given word

This creates a prefix list with 400000 lines.

Similarly, seq can print a sequence of numbers of any length:

$ seq 1 400000 # the default increment is 1

This creates a list of the numbers from 1 to 400000.
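jot ships with BSD and macOS but is often missing on Linux, so here is a minimal sketch of the same two lists using yes, head and seq instead. The file names prefix.txt and numbers.txt are my own placeholders, and n is kept small for the demo (the real list used 400000).

```shell
# Hypothetical file names; n is small here, the post used 400000.
n=5

# yes repeats its argument forever; head keeps the first n lines.
yes 'http://blabla.com/' | head -n "$n" > prefix.txt

# seq prints 1, 2, ..., n, one number per line.
seq 1 "$n" > numbers.txt

# Both files should report the same number of lines.
wc -l prefix.txt numbers.txt
```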

Fairly simple, right?

The third step is to combine the two big lists.

In the shell, several commands can combine files, such as join and paste, but I failed to use them (maybe I am too stupid, but please do let me know if it is possible to combine two lists from two separate files with these two commands).

So I turned to the Internet and found a very helpful post in Linux Journal: "Bash: Redirecting Input from Multiple Files". It provides three scripts, and I tried the second one:

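#!/bin/bash
# Usage: script FILE1 FILE2
# Prints line i of FILE1 and line i of FILE2 side by side, separated by a space.

# Read the first file into array f1 (IFS= and -r keep each line unchanged).
i=0
while IFS= read -r line
do
    f1[$i]="$line"
    let i++
done <"$1"

# Read the second file into array f2.
i=0
while IFS= read -r line
do
    f2[$i]="$line"
    let i++
done <"$2"

# Print the pairs; the loop stops at the first empty entry of f1.
i=0
while [[ "${f1[$i]}" ]]
do
    echo "${f1[$i]} ${f2[$i]}"
    let i++
done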

(You can save it as a .sh file and run it with bash in Terminal; it uses arrays and [[ ]], which a plain sh may not support.)
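For comparison, here is a sketch of a variant that reads both files in lockstep through two extra file descriptors instead of loading them into arrays first. The function name merge_lines and the demo file names are my own; it prints the two halves with no space in between, and it may also be faster on big lists, though I have not timed it.

```shell
# merge_lines FILE1 FILE2
# Print line i of FILE1 directly followed by line i of FILE2.
# Stops as soon as either file runs out of lines.
merge_lines() {
    while IFS= read -r left <&3 && IFS= read -r right <&4
    do
        printf '%s%s\n' "$left" "$right"
    done 3< "$1" 4< "$2"
}

# Tiny demo files (hypothetical names).
printf 'http://blabla.com/\nhttp://blabla.com/\n' > prefix_demo.txt
printf '1\n2\n' > numbers_demo.txt

merge_lines prefix_demo.txt numbers_demo.txt
```

File descriptors 3 and 4 are opened once for the whole loop, so each read picks up where the previous one stopped.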

It worked perfectly, but a space was left in each line, which was very annoying:

http://blabla.com/ 2 

By default the output has two columns, so I needed to remove the space between them. For that I used another shell tool: sed.

Sed is really powerful, but I feel it is not very easy to handle. To delete the space in between, I wrote:

$ sed -e 's/ //' # I did not use g at the end, but it worked. Can anyone explain why?
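On the missing g: without it, sed's s command replaces only the first match on each line. Every line here contains exactly one space, so the first match is the only one. A small sketch with made-up input lines shows the difference:

```shell
# One space per line: the first match is the only match, so g is not needed.
printf 'http://blabla.com/ 2\n' | sed 's/ //'    # prints http://blabla.com/2

# Several spaces per line: without g only the first one is removed.
printf 'a b c\n' | sed 's/ //'                   # prints ab c
printf 'a b c\n' | sed 's/ //g'                  # prints abc
```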

However, I was very stupid: when I created the prefix list, I wrote http:blabla.com. Oh, I forgot the two slashes! So I used sed again:

$ sed 's_:_://_g'
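Here the underscores are just the delimiter of the s command, chosen so the slashes need no escaping. Because of the g flag, every colon on a line would be rewritten, which is fine as long as each address contains only one. A slightly more cautious sketch (my own variant, not what I ran) anchors the pattern to the start of the line:

```shell
# The command from the post: replaces every ":" with "://".
printf 'http:blabla.com/1\n' | sed 's_:_://_g'           # prints http://blabla.com/1

# Anchored variant: only touches a leading "http:".
printf 'http:blabla.com/1\n' | sed 's_^http:_http://_'   # prints http://blabla.com/1
```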

OK, the 12 MB address file is finished.

Then, using wget's -i option, I could easily crawl the site. To avoid being blocked, I also added the --random-wait option to my wget crawler, although that made it very slow.
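A minimal sketch of such a call, wrapped in a function so nothing is downloaded until it is actually invoked. The function name crawl, the file name urls.txt, the one-second base delay and the pages directory are all my own assumptions for illustration. According to the wget manual, --random-wait varies the delay that --wait sets, so the two are used together here.

```shell
# crawl FILE: fetch every URL listed in FILE (one per line).
crawl() {
    # --input-file      read the URLs from the address file
    # --wait=1          base delay of one second between requests (assumed value)
    # --random-wait     vary that delay around the base value
    # --directory-prefix save everything under ./pages (assumed directory)
    wget --input-file="$1" --wait=1 --random-wait --directory-prefix=pages
}

# Example call (not run here because it needs network access):
# crawl urls.txt
```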

Now, this stupid wget crawler is still working, hmm.


Please do let me know if there is an easier way to do any of the above. Thank you!
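One candidate for an easier way, as far as I can tell: seq accepts a printf-style format through its -f option, so the prefix can be attached while the numbers are generated and no merging is needed at all. This is only a sketch with a small range and a made-up output file name, urls_demo.txt.

```shell
# %.0f prints each number without a decimal part; the rest of the format is literal text.
# The real list would use 400000 instead of 5.
seq -f 'http://blabla.com/%.0f' 1 5 > urls_demo.txt

cat urls_demo.txt
```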

Note: the script from Linux Journal is fairly slow: it took about 30 minutes to process the two lists. I hope I can figure out how to use join or paste soon.
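A sketch of how paste could do it, as far as I understand the tool: it glues line i of each file together with a delimiter, and in the -d list the sequence \0 stands for an empty string, so nothing is put between the two halves. join, in contrast, matches lines on a shared key field, which these two lists do not have. The demo file names are made up.

```shell
# Two tiny demo lists (hypothetical file names).
printf 'http://blabla.com/\nhttp://blabla.com/\n' > prefix_p.txt
printf '1\n2\n' > numbers_p.txt

# -d '\0' means "use an empty delimiter", so no space or tab is inserted.
paste -d '\0' prefix_p.txt numbers_p.txt
```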








