A cousin asked me to help him find a way to extract the HTML from hundreds of links into Excel cells.
After I explained the limitations of that, he asked me to put them in CSV format instead.
I started up PuTTY, SSH'd into my Linux server, and coded this.
run.sh
echo URL,HTML > csv
for link in $(cat links)
do
    wget -Otmp $link
    perl -pi -e "s/\"/\"\"/g;" tmp
    echo $link,\"`cat tmp`\" >> csv
done
Let's take a closer look.
Line 1: Creates/overwrites a file called csv with the text "URL,HTML", which is the header for the CSV file.
Line 2: I placed all the URLs in a file called links, one per line. The for ... in structure loops over each link.
Line 3: The do … done block (see line 7) repeats the commands until all the links from line 2 are processed.
Line 4: Uses wget to save the URL (the variable $link) to a file named tmp.
Line 5: This line replaces every single " with a doubled "" so that quotes inside the HTML are not mistaken for the end of the CSV field. This amazing line uses perl to perform a regex (regular expression) substitution (search & replace) on the tmp file in place. For example, s/baba/haha/g means substitute (s) baba (/baba/) with haha (/haha/) on all matches (g).
Line 6: Writes the URL and the HTML code enclosed in quotes (hence the need for line 5), separated by a comma (hence comma-separated values, CSV), and appends the result to the output file csv.
Line 7: See line 3
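To see the quote doubling from line 5 on its own, here is a minimal sketch that runs it on a short string instead of a downloaded page. The helper name csv_field is made up for this illustration, and it uses sed rather than perl for the same substitution, so treat it as an equivalent under those assumptions and not as the script's actual line.

```shell
#!/bin/sh
# csv_field (hypothetical helper): read text on stdin and print it as one
# quoted CSV field. sed stands in for the perl one-liner; the substitution
# s/"/""/g is the same in both tools.
csv_field() {
    printf '"%s"' "$(sed 's/"/""/g')"
}

# Example: a small HTML snippet that contains double quotes.
printf '<a href="x">hi</a>' | csv_field
echo
```

The outer quotes mark the start and end of the field, and every quote that was inside the HTML shows up doubled, which is how a CSV reader tells the two apart.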
Run it with the usual commands:
chmod +x run.sh
./run.sh
Then retrieve the resulting csv file via ftp.
Although this is not the only way to code it, I'm pretty satisfied that I could write something efficient that does some real work in a few lines of code and in a short time.
This is what computers and programming are for: to do work for us efficiently, although many a time in the real world the opposite happens.
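As one possible variation, the same idea can be sketched with a while read loop and quoted variables, which avoids the tmp file and keeps the page's line breaks inside the quoted cell. The function names fetch and build_csv are hypothetical, and the sed call is a stand-in for the perl substitution; this is a sketch under those assumptions, not a replacement for the script above.

```shell
#!/bin/sh
# fetch (hypothetical helper): print the body of one URL on stdout.
# -q silences wget's progress output, -O - writes the page to stdout.
fetch() {
    wget -q -O - "$1"
}

# build_csv (hypothetical helper): read URLs on stdin, one per line,
# and print a two-column CSV (URL, HTML) on stdout.
build_csv() {
    echo 'URL,HTML'
    while IFS= read -r link; do
        # Skip blank lines in the links file.
        [ -n "$link" ] || continue
        # Double the quotes in the page, then wrap the whole page in quotes.
        printf '%s,"%s"\n' "$link" "$(fetch "$link" </dev/null | sed 's/"/""/g')"
    done
}

# Usage, with the same file names as run.sh:
#   build_csv < links > csv
```

Nothing is downloaded until build_csv is called, so the functions can be pasted into a shell and tried on a links file with one or two URLs first.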

