Daily Archive for July 20th, 2007

7 Lines Of Linux Shell Scripting

A cousin asked me to help him find a way to extract the HTML from hundreds of links into Excel cells.

After I explained the limitations of doing that, he asked me to put them in CSV format instead.

I started up PuTTY, SSHed into my Linux server, and coded this.

run.sh

echo URL,HTML > csv
for link in $(cat links)
do
        wget -O tmp "$link"
        perl -pi -e 's/"/""/g' tmp
        printf '%s,"%s"\n' "$link" "$(cat tmp)" >> csv
done

Let's take a closer look.
Line 1: Creates (or overwrites) a file called csv containing the text "URL,HTML", which is the header row of the CSV file.
Line 2: I placed all the URLs in a file called links, one per line. The for ... in loop goes through each link.
Line 3: The do … done block (see line 7) repeats the commands inside it until all the links from line 2 are processed.
Line 4: Uses wget to save the URL (the variable $link) to a file named tmp.
Line 5: Replaces every double quote (") with two double quotes (""), which is how CSV escapes a quote inside a quoted field. This amazing line uses perl to perform a regex (regular expression) substitution (search and replace) on the tmp file in place. Take s/baba/haha/g: it means substitute (s) baba (/baba/) with haha (/haha/) on all matches (g).
Line 6: Appends a row to the output file csv: the URL, a comma (hence comma-separated values, or CSV), and the HTML enclosed in double quotes (which is why line 5 is needed).
Line 7: See line 3.

Run it with the usual commands:
chmod +x run.sh
./run.sh

Then retrieve the resulting csv file via ftp.

Although this is not the only way to code it, I'm pretty satisfied that I could write something efficient that does real work, in a few lines of code and in a short time.
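
As one example of another way to write it, here is a rough sketch that reads the links file line by line, skips blank lines and failed downloads, and avoids the tmp file. The function names fetch_page and links_to_csv are hypothetical, and I swapped perl for sed to do the quote doubling on a pipe, so treat it as a variation under those assumptions rather than a better script.

```shell
#!/bin/sh
# fetch_page is a hypothetical helper: it prints the page for URL $1 on stdout.
fetch_page() {
    wget -q -O - "$1"
}

# links_to_csv LINKS_FILE OUT_FILE (hypothetical name)
links_to_csv() {
    echo 'URL,HTML' > "$2"
    # The "|| [ -n ... ]" part keeps a last line that has no trailing newline.
    while IFS= read -r link || [ -n "$link" ]; do
        [ -n "$link" ] || continue                    # skip blank lines
        html=$(fetch_page "$link") || continue        # skip failed downloads
        html=$(printf '%s' "$html" | sed 's/"/""/g')  # double every quote
        printf '%s,"%s"\n' "$link" "$html" >> "$2"
    done < "$1"
}
```

Calling links_to_csv links csv would then play the role of run.sh, with the same input and output file names as before.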

This is what computers and programming are for: to do work for us efficiently, although many times in the real world the opposite happens.