Mutaku.com Fresh brewed daily
Welcome to Mutaku.com
Thursday, September 09 2010 @ 12:22 PM EDT
   

Image grabbing script

TUTORIALS/GUIDES

[PLEASE make sure you are using this script for LEGAL purposes and that images are downloadable as per permission of the site maintainer/owner/copyright holder. I claim no legal responsibility for the use of the following information.]

THIS IS A WORK IN PROGRESS: something I am working on for grabbing images from a website. I haven't put together a full script yet, but I have successfully done most of what I would like to achieve with this particular project: grab images from a site by extension, using wget with an input file parsed from the page of interest.

So for now I will just give some of the basics and what they do. This is still under development so any input would be much appreciated.

Create a directory for the images:

mkdir image_page && cd image_page


Grab the index file:

wget http://www.pageofinterest.com


*Now to strip the href links of their HTML syntax:



sed -e 's/>/>\
/g' index.html | grep '<a' | while IFS='"' read -r a b c ; do echo "$b"; done > sed-index.html


Now look for the image files. You could simply grep for JPEGs (grep jpg), but in my example I only wanted the images that link to static files on flickr, and I wanted to make sure they were JPEGs, so:

grep 'flickr' sed-index.html | grep 'jpg' > input-file
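
One thing to watch with a bare grep jpg is that it matches "jpg" anywhere in the URL, not just as the extension. A slightly stricter filter might look like the sketch below. The function name filter_images and the sample URLs are made up for illustration, and this is only one way to do it:

```shell
#!/bin/bash
# filter_images is a hypothetical helper, not part of the original steps.
# It keeps URLs that contain the given string AND end in .jpg or .jpeg.
filter_images() {
    # $1: fixed string the URL must contain (for example a host name)
    grep -F -- "$1" | grep -iE '\.jpe?g$'
}

# Made-up sample list: only the first URL is on flickr and ends in .jpg.
printf '%s\n' \
    'http://static.flickr.com/1/photo_one.jpg' \
    'http://www.flickr.com/photos/jpgfan/' \
    'http://example.com/other.JPG' | filter_images flickr
```

With the sample list above, only the first URL should come out: the second mentions both flickr and "jpg" but is not an image file, and the third is not on flickr.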


Now use the input file for wget and clean up:

wget --input-file=input-file && rm index.html sed-index.html input-file


* The true elegance of this (imho) is in the HTML parsing that strips the links of their HTML syntax. I cannot take credit for it: it was adapted from the bash Cookbook by Carl Albing (O'Reilly), page 253, Recipe 13.3, Parsing Some HTML. Since this is published, I won't go into detail on the inner workings here, but you can send me a message if it is unclear. In essence, the sed command inserts a line break after every '>' so that several links packed onto one line end up on separate lines and can all be handled the same way. Then we grep for the '<a' tag and pipe the result into a while loop that reads whatever sits between the first pair of double quotes as the golden URL we are looking for, and the output goes to the sed-index.html file.
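
To watch the parse work without touching a real site, here is a minimal sketch that wraps the same pipeline in a function and feeds it a made-up line of HTML. The function name extract_hrefs and the example.com URLs are my own placeholders, not anything from the book:

```shell
#!/bin/bash
# extract_hrefs is a hypothetical wrapper around the pipeline described above.
extract_hrefs() {
    # 1. sed adds a line break after every '>', so no line holds two tags.
    # 2. grep keeps only the lines that contain an anchor tag.
    # 3. read splits each line on double quotes; the second field is the URL.
    sed -e 's/>/>\
/g' | grep '<a' | while IFS='"' read -r a b c; do echo "$b"; done
}

# Made-up sample: two links sitting on the same line.
sample='<p><a href="http://example.com/a.jpg">A</a> <a href="http://example.com/b.html">B</a></p>'
printf '%s\n' "$sample" | extract_hrefs
```

This should print the two URLs, one per line. Note the assumption baked in: href has to be the first quoted attribute in the tag. Something like <a class="x" href="..."> would print x instead of the URL, and other tags starting with '<a' (such as <area>) get picked up too.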

Image grabbing script | 1 comment
Image grabbing script
Authored by: xiao_haozi on Wednesday, June 20 2007 @ 01:44 AM EDT
So I have been working on scripting this up a bit -- crudely at first to get something to work with -- and found that it only works part of the time.

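#!/bin/bash
# image_rip : rip images from given website
# usage : image_rip site [image-format] (jpg, gif, png, etc.; defaults to jpg)

SITE=$1
IMFORM=${2:-jpg}

if [ -z "$SITE" ]; then
    echo 'usage: image_rip site [image-format]' >&2
    exit 1
fi

if [ -e 'index.html' ]; then
    echo 'index.html exists...please choose another directory or output to another file' >&2
    exit 1
fi

# fetch the page; stop if the download fails
wget --output-document=index.html "$SITE" || exit 1

# one tag per line, keep the anchor tags, print the first quoted value (the href)
sed -e 's/>/>\
/g' index.html | grep '<a' | while IFS='"' read -r a b c ; do echo "$b"; done > sed-index

# keep only the links that mention the requested image format
grep "$IMFORM" sed-index > input-file

wget --input-file=input-file #&& rm index.html sed-index input-file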

When a site links to its images, this seems to work. However, when the images are embedded with an img src attribute, the script doesn't find them. The next step would be to fix the parse to find these as well, and to handle the wget accordingly for relative paths such as '../img/img_file.jpg'.
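
A rough sketch of what that next step might look like, untested against real sites. The helper names extract_img_srcs and absolutize are hypothetical, the sample HTML is made up, and it assumes you pass in the URL of the directory the page lives in, ending in a slash:

```shell
#!/bin/bash
# Hypothetical helpers, only a sketch of the idea.

extract_img_srcs() {
    # One tag per line, keep the <img> tags, print what sits inside src="...".
    sed -e 's/>/>\
/g' | grep '<img' | sed -n 's/.*src="\([^"]*\)".*/\1/p'
}

absolutize() {
    # $1: URL of the page's directory, with a trailing slash (assumed).
    base=$1
    # scheme and host only, e.g. http://example.com
    root=$(printf '%s\n' "$base" | sed -e 's|^\([a-z]*://[^/]*\).*|\1|')
    while IFS= read -r link; do
        case $link in
            http://*|https://*) echo "$link" ;;       # already absolute
            /*)                 echo "$root$link" ;;  # relative to the site root
            *)                  echo "$base$link" ;;  # relative to the page
        esac
    done
}

# Made-up sample with one page-relative and one root-relative image.
sample='<div><img src="../img/img_file.jpg" alt="x"><img alt="y" src="/pics/two.jpg"></div>'
printf '%s\n' "$sample" | extract_img_srcs | absolutize 'http://example.com/gallery/'
```

The result could then be filtered by extension and handed to wget --input-file like before. The '..' is left in the URL as is (http://example.com/gallery/../img/img_file.jpg), so whether the download tool collapses it properly is something to check. wget also has a --base option for resolving relative links in an input file, which may turn out to be the simpler route.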