Beyond Browsing the Web
Getting files from the Web
When I want to save the contents of a URL to a file, I often use
GNU wget to do it. It keeps the file's original
timestamp, it's smaller and faster to use than a browser, and it
displays the download progress as it goes. (You can get it from the
Debian wget package or direct from any GNU archive.)
So if I'm grabbing a webcam image, I'll do something like:
wget http://example.org/cam/cam.jpeg
This will save a copy of the image file as cam.jpeg, which will
have the same timestamp attributes as the file on the
example.org server.
If you interrupt a download before it's finished, use the -c option
to resume from the point it left off:
wget -c http://example.org/cam/image.jpeg
Archiving an entire site
To archive a single Web site, use the -m ("mirror") option, which
saves files with the exact timestamp of the originals, if possible,
and sets the "recursive retrieval" option to download everything. To
specify the number of retries to use when an error occurs in
retrieval, use the -t option with a numeric argument -- -t3 is usually
good for retrieving across the net; use -t0 to allow unlimited
retries when your network connection is unreliable but you still
want to archive the site, no matter how
long it takes. Finally, use the -o option with a filename as an
argument to write a progress log to the file -- it can be useful to
examine in case anything goes wrong. Once the archival process is
complete and you've determined that it was successful, you can delete
the logfile.
For example, to mirror the Web site at http://www.bloofga.org, giving up to three retries for retrieval of files and putting error messages in a logfile called mirror.log, type:
wget -m -t3 -o mirror.log http://www.bloofga.org/
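That final check-then-delete step on the logfile can be scripted. Here's a minimal sketch, assuming the log was written as mirror.log by the -o option above; the patterns it greps for are an illustrative heuristic, not an exact list of wget's error messages:

```shell
#!/bin/sh
# After "wget -m -t3 -o mirror.log http://www.bloofga.org/"
# finishes, scan the log for trouble before discarding it.
# The patterns matched below are a rough heuristic.
log=mirror.log
if [ ! -f "$log" ]; then
    echo "no $log here -- did the wget run write one?" >&2
elif grep -qiE 'error|failed' "$log"; then
    echo "some retrievals failed; keeping $log for inspection" >&2
else
    rm "$log"   # the mirror looks clean, so the log can go
fi
```

If anything in the log matches, the script keeps it around so you can track down which files need to be fetched again.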
To resume an archive that you've left unfinished, add the -nc ("no
clobber") option; it skips files that have already been
downloaded. For this option to work the way you want it to, be sure to
run the command from the same directory you were in when you started
to archive the site.
For example, to continue an interrupted mirror of the www.bloofga.org site, while making sure that existing files aren't downloaded and giving up to three retries for retrieval of files, type:
wget -nc -m -t3 http://www.bloofga.org/
Next week: Quick tools for command-line image transformations.
Michael Stutz was one of the first reporters to cover Linux and the free software movement in the mainstream press.