WebCopy - copy a remote web subtree to the local disk

Fetch the software (Zip file, 87096 bytes).

Jef Poskanzer's original WebCopy Java program was wonderful, but it would not follow pages nested in FRAMESET FRAMEs, and it would not fetch images used as a page's BODY BACKGROUND. So I fixed it so that it does, by extending his hand-written HTML-parsing finite state automaton. Along the way I simplified his HtmlObserver and HtmlEditObserver interfaces, making them easier to extend. Beyond that, the zip file contains all the Acme Java support classes necessary to compile and run WebCopy. Enjoy!
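
To make the two extra link sources concrete, here is a minimal sketch that pulls the SRC of FRAME tags and the BACKGROUND of BODY tags out of a page. It is not the parser shipped in the zip file: it swaps in regular expressions in place of the hand-written finite state automaton, and the class and method names are made up for illustration. It assumes no attribute value contains a '>' character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a regex stand-in for the real state-machine parser.
// FrameAndBackgroundLinks and extract() are hypothetical names.
class FrameAndBackgroundLinks {

    // A whole <FRAME ...> or <BODY ...> tag; <FRAMESET ...> does not match
    // because \b requires the tag name to end right after "frame".
    private static final Pattern TAG =
        Pattern.compile("<(frame|body)\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // SRC=... or BACKGROUND=..., double-quoted, single-quoted, or unquoted.
    private static final Pattern ATTR = Pattern.compile(
        "\\b(src|background)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
        Pattern.CASE_INSENSITIVE);

    // Returns the URLs referenced by FRAME SRC and BODY BACKGROUND, in page order.
    static List<String> extract(String html) {
        List<String> urls = new ArrayList<>();
        Matcher tag = TAG.matcher(html);
        while (tag.find()) {
            boolean isFrame = tag.group(1).equalsIgnoreCase("frame");
            Matcher attr = ATTR.matcher(tag.group());
            while (attr.find()) {
                boolean isSrc = attr.group(1).equalsIgnoreCase("src");
                // Only FRAME+SRC and BODY+BACKGROUND pairs count.
                if (isFrame != isSrc) {
                    continue;
                }
                String value = attr.group(2) != null ? attr.group(2)
                             : attr.group(3) != null ? attr.group(3)
                             : attr.group(4);
                urls.add(value);
            }
        }
        return urls;
    }
}
```

A regular expression is easier to read in a short example, but it is less forgiving of sloppy HTML than a state machine, which is presumably one reason the original parser was written by hand.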

Given one or more URLs as arguments, enumerates the files reachable at or below those URLs and copies them to the local disk, creating subdirectories as necessary.
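
As a rough picture of what "reachable at or below" means, the sketch below walks a link table breadth-first, keeps only URLs that start with the root URL, and can stop after a given number of link hops. Everything in it is an assumption made for illustration: the class and method names are invented, and an in-memory map stands in for fetching and parsing real pages. It is not taken from the WebCopy source.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; SubtreeWalk is a hypothetical name.
class SubtreeWalk {

    // "At or below": the candidate URL starts with the root URL.
    // Assumes the root ends in '/' when it names a directory.
    static boolean inScope(String root, String url) {
        return url.startsWith(root);
    }

    // links maps a page URL to the URLs it references (a stand-in for
    // fetching and parsing). maxDepth < 0 means no limit; 0 means only
    // the root itself is returned.
    static List<String> enumerate(String root, Map<String, List<String>> links, int maxDepth) {
        List<String> found = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        List<String> level = new ArrayList<>();
        level.add(root);
        seen.add(root);
        for (int depth = 0; !level.isEmpty(); depth++) {
            found.addAll(level);
            if (maxDepth >= 0 && depth >= maxDepth) {
                break; // depth limit reached: do not follow further links
            }
            List<String> next = new ArrayList<>();
            for (String page : level) {
                for (String target : links.getOrDefault(page, List.of())) {
                    // Skip out-of-tree URLs and anything already visited.
                    if (inScope(root, target) && seen.add(target)) {
                        next.add(target);
                    }
                }
            }
            level = next;
        }
        return found;
    }
}
```

The seen set matters: pages inside a subtree usually link back to their parent, and without it the walk would never terminate.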

Options:

-v
Verbose. Shows names of files being copied.
-f
Force overwriting of existing files. Otherwise they are left alone.
-d
Maximum depth to copy. Depth refers to how many links to follow. A depth of 0 means just copy the file given on the command line and don't follow any links at all. Without this flag there is no limit on the depth; the entire subtree is copied.
-e
Edit local URLs. If an HTML file contains a URL that is *unnecessarily* absolute - i.e. it's absolute but it refers to a location within the tree being copied - then convert it to a relative URL. Without this flag, all files are copied verbatim. With it, the copied tree is a self-contained functional snapshot of the remote.
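
To illustrate the kind of rewrite -e performs, here is a minimal sketch that turns an unnecessarily absolute URL into one relative to the page that contains it. The class and method names are hypothetical, and this is not the code from the zip file. It assumes the root URL ends in '/', compares URLs as plain strings, and ignores query strings.

```java
// Illustrative sketch only; UrlEditor and relativize() are hypothetical names.
class UrlEditor {

    // root:    URL of the tree being copied, ending in '/'
    // pageUrl: URL of the HTML file that contains the link
    // link:    the URL found in that file
    static String relativize(String root, String pageUrl, String link) {
        // Links outside the tree, and links that are already relative,
        // are left exactly as they are.
        if (!link.startsWith(root) || !pageUrl.startsWith(root)) {
            return link;
        }
        // Split both paths below the root; the last element is the file
        // name, which is empty for a directory URL.
        String[] from = pageUrl.substring(root.length()).split("/", -1);
        String[] to = link.substring(root.length()).split("/", -1);

        // Count the leading directories the two paths share.
        int common = 0;
        while (common < from.length - 1 && common < to.length - 1
                && from[common].equals(to[common])) {
            common++;
        }
        StringBuilder rel = new StringBuilder();
        // Climb out of the page's remaining directories...
        for (int i = common; i < from.length - 1; i++) {
            rel.append("../");
        }
        // ...then descend into the target's.
        for (int i = common; i < to.length - 1; i++) {
            rel.append(to[i]).append('/');
        }
        rel.append(to[to.length - 1]);
        return rel.length() == 0 ? "./" : rel.toString();
    }
}
```

With the tree rooted at http://www.acme.com/jef/flow/, a link to http://www.acme.com/jef/flow/cdec.html found in snapshots/16may96.html would become ../cdec.html, while a link to http://www.acme.com/jef/ would be left alone because it points outside the tree.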

Sample run:


% mkdir flow
% cd flow
% WebCopy -v http://www.acme.com/jef/flow/
Copying http://www.acme.com/jef/flow/ to index.html
Copying http://www.acme.com/jef/flow/troublemaker_small.jpg to troublemaker_small.jpg
Copying http://www.acme.com/jef/flow/cdec.html to cdec.html
Copying http://www.acme.com/jef/flow/snapshots/ to snapshots/index.html
Copying http://www.acme.com/jef/flow/snapshots/16may96.html to snapshots/16may96.html
Copying http://www.acme.com/jef/flow/snapshots/16may96_namerican.gif to snapshots/16may96_namerican.gif
% ls -l
-rw-r--r--   1 jef         39759 Jul  5 14:40 cdec.html
-rw-r--r--   1 jef           993 Jul  5 14:40 index.html
drwxr-x--x   2 jef           512 Jul  5 14:40 snapshots
-rw-r--r--   1 jef          3107 Jul  5 14:40 troublemaker_small.jpg
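
The sample run suggests how remote URLs map to local file names: the part of the URL below the starting URL becomes the local path, and a URL ending in '/' is saved as index.html in that directory. A minimal sketch of that mapping follows. It is inferred from the output above only, the names are hypothetical, and the real program may handle more cases.

```java
// Illustrative sketch only; LocalPath and forUrl() are hypothetical names.
class LocalPath {

    // root: the starting URL given on the command line, ending in '/'
    // url:  a URL at or below root
    static String forUrl(String root, String url) {
        if (!url.startsWith(root)) {
            throw new IllegalArgumentException("not at or below " + root + ": " + url);
        }
        String rel = url.substring(root.length());
        // A directory URL has no file name of its own, so use index.html,
        // as the sample run does.
        if (rel.isEmpty() || rel.endsWith("/")) {
            rel += "index.html";
        }
        return rel;
    }
}
```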

See the pages of the original creator, Jef Poskanzer:
Jef Poskanzer <jef@acme.com>

George Ruban <gruban@oocities.com>
