Image grabbing script
[PLEASE make sure you are using this script for LEGAL purposes and that images are downloadable as per permission of the site maintainer/owner/copyright holder. I claim no legal responsibility for the use of the following information.]
THIS IS A WORK IN PROGRESS. This is something I am working on for grabbing images from a website. I haven't put together a finished script yet, but I have successfully done most of what I want to achieve with this project: grab images from a site by extension, using wget with an input file parsed from the page of interest.
So for now I will just give the basic commands and explain what they do. This is still under development, so any input would be much appreciated.
Create a directory for the images:
mkdir image_page && cd image_page
Grab the index file:
wget http://www.pageofinterest.com
*Now to strip the href links of their HTML syntax:
cat index.html | sed -e 's/>/>\
/g' | grep '<a' | while IFS='"' read a b c ; do echo "$b"; done > sed-index.html
Now look for the image files. You could grep for all JPEGs (grep jpg), but in my example I only wanted the images that linked to static files on flickr, and I wanted to make sure they were JPEGs, so:
grep 'flickr' sed-index.html | grep 'jpg' > input-file
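To see what the two-stage grep keeps and drops, here is a minimal sketch run against a few made-up links. The URLs and the filter_links name are hypothetical, invented for illustration. Note that this is a slightly stricter variation than the plain grep 'jpg' above: it assumes you want the extension at the very end of the URL, and it ignores case so that .JPG and .jpeg also pass.

```shell
#!/usr/bin/env bash
# Hypothetical sample of extracted links, invented for illustration.
links='http://static.flickr.example/1/a.jpg
http://static.flickr.example/1/b.png
http://other.example/c.jpg
http://static.flickr.example/2/d.JPG'

# Keep links that mention flickr AND end in .jpg or .jpeg (any case).
# Assumption: the extension is the last thing in the URL.
filter_links() {
    grep 'flickr' | grep -iE '\.jpe?g$'
}

printf '%s\n' "$links" | filter_links
```

Only the a.jpg and d.JPG links should survive. Links with a query string after the extension (for example .jpg?size=large) would be dropped by the end-of-line anchor, so the looser grep 'jpg' may suit some pages better.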
Now use the input file for wget and clean up:
wget --input-file=input-file && rm index.html sed-index.html input-file
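Since there is no finished script yet, here is one rough sketch of how the steps above might be strung together. It is only a starting point under some assumptions: the function names (extract_links, select_images, grab_images), the keyword argument, and the default directory name are all my own inventions, and wget -O index.html is used so the saved page always has a predictable name.

```shell
#!/usr/bin/env bash
# Rough sketch only; all function names here are hypothetical.

# Print the first double-quoted value of every line containing "<a".
extract_links() {
    sed -e 's/>/>\
/g' | grep '<a' | while IFS='"' read -r a b c; do
        echo "$b"
    done
}

# Keep links that contain the given keyword and the string "jpg".
select_images() {
    grep "$1" | grep 'jpg'
}

# Usage: grab_images URL KEYWORD [DIRECTORY]
grab_images() {
    local url=$1 keyword=$2 workdir=${3:-image_page}
    mkdir -p "$workdir" && cd "$workdir" || return 1
    wget -O index.html "$url" || return 1
    extract_links < index.html > sed-index.html
    select_images "$keyword" < sed-index.html > input-file
    # Download every URL in the list, then clean up the scratch files.
    wget --input-file=input-file && rm index.html sed-index.html input-file
}

# Only touch the network when a URL and a keyword were actually given.
if [ "$#" -ge 2 ]; then
    grab_images "$@"
fi
```

Saved under a name of your choosing, it could be run as: bash grab_images.sh http://www.pageofinterest.com flickr. As in the one-liners above, the scratch files are only removed if the final wget succeeds, which leaves them around for inspection when something goes wrong.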
* The true elegance of this (imho) is in the HTML parsing that strips the links of their HTML syntax. I cannot take credit for it: it was adapted from BASH COOKBOOK by Carl Albing (O'Reilly publishing), page 253, Recipe 13.3, Parsing Some HTML. Since this is published, I won't go into detail on the inner workings here, but you can send me a message if it is unclear. In essence, the sed command inserts a line break after every > so that each line holds at most one tag, and we don't have to treat pages with many links on one line differently from pages with one link per line. Then we grep for lines containing <a, followed by a while loop that splits each line on double quotes, takes what is between the first pair of quotes as the golden URL we are looking for, and writes it to the sed-index.html file.
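For anyone who wants to watch the parse work, here is a small sketch that runs the same three stages one at a time on a single line of HTML. The sample HTML and the stage1/stage2/stage3 variable names are made up for illustration; the commands are the ones from the pipeline above, with read -r added so backslashes in the input are left alone.

```shell
#!/usr/bin/env bash
# Hypothetical one-line HTML snippet, invented for illustration.
html='<p>See <a href="http://example.com/one.jpg">one</a> and <a href="http://example.com/two.jpg">two</a></p>'

# Stage 1: break the line after every ">" so each tag ends its own line.
stage1=$(printf '%s\n' "$html" | sed -e 's/>/>\
/g')

# Stage 2: keep only the lines that contain "<a".
stage2=$(printf '%s\n' "$stage1" | grep '<a')

# Stage 3: split each line on double quotes; the second field ($b) is
# whatever sits between the first pair of quotes.
stage3=$(printf '%s\n' "$stage2" | while IFS='"' read -r a b c; do echo "$b"; done)

printf 'after sed:\n%s\n\nafter grep:\n%s\n\nafter read:\n%s\n' \
    "$stage1" "$stage2" "$stage3"
```

The last stage should print just the two URLs. One limitation worth knowing: the loop takes the first quoted value on the line, so a link written with another attribute before href (a class, say) would yield that attribute's value instead of the URL, and grep '<a' will also match other tags whose names start with the letter a.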

