WEBCRAWL(1)
NAME
WebCrawl - download web sites, following links
SYNOPSIS
webcrawl [ options ] host[:port]/filename directory
DESCRIPTION
WebCrawl is a program designed to download an entire web site without
user interaction (although an interactive mode is available).
WebCrawl downloads the page given by host[:port]/filename into the
named directory under the compiled-in server root directory (which can
be changed with the -o option, see below). The URL given on the command
line should not contain a leading http://.
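For instance, a minimal invocation might look like this (the host name
and directory are purely illustrative):
webcrawl www.example.com/index.html mirror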
It works simply by starting with a single web page, and following all
links from that page to attempt to recreate the directory structure on
the remote server.
As well as downloading the pages, it also rewrites them to use a local
URL wherever a URL that would otherwise not work on the local system
is used in the page (e.g. URLs that begin with http:// or with a /).
It stores the downloaded files in a directory structure that mirrors
the original site's, under a directory called server.domain.com:port.
This way, multiple sites can all be loaded into the same directory
structure, and if they link to each other, they can be rewritten to
link to the local, rather than remote, versions.
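As an illustration, a crawl of a hypothetical site www.example.com
might leave files laid out roughly as follows (the exact directory
name, including the port, depends on the command line given):
mirror/www.example.com:80/index.html
mirror/www.example.com:80/images/logo.gif
mirror/www.example.com:80/docs/intro.html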
Comprehensive URL selection facilities allow you to describe what docu‐
ments you want to download, so that you don't end up downloading much
more than you need.
WebCrawl is written in ANSI C, and should work on any POSIX system.
With minor modifications, it should be possible to make it work on any
operating system that supports TCP/IP sockets. It has been tested only
on Linux.
OPTIONS
URL selection
-a     This causes the program to ask the user whether to download a
page that it has not otherwise been instructed to fetch (by default,
this means off-site pages).
-f string
This causes the program to always follow links to URLs that
contain the string. You can use this, for example, to prevent a
crawl from going up beyond a single directory on a site (in
conjunction with the -x option below); say you wanted to get
http://www.web-sites.co.uk/jules but nothing else hosted on the same
server. You could use the command line:
webcrawl -x -f /jules www.web-sites.co.uk/jules/ mirror
Another use would be if a site contained links to (e.g.) pictures,
videos or sound clips on a remote server; you could use the following
command line to get them:
webcrawl -f .jpg -f .gif -f .mpg -f .wav -f .au www.site.com/ mirror
Note that webcrawl always downloads inline images.
-d string
The opposite of -f, this option tells webcrawl never to get a
URL containing the string. -d takes priority over all other
URL selection options (except that it cannot prevent the download of
inline images, which are always fetched).
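For example, to mirror a site while skipping anything whose URL
contains /cgi-bin/ (the site name is hypothetical):
webcrawl -d /cgi-bin/ www.example.com/ mirror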
-u filename
Causes webcrawl to log unfollowed links to the file filename.
-x Causes webcrawl not to automatically follow links to pages on
the same server. This is useful in conjunction with the -f
option to specify a subsection of an entire site to download.
-X Causes webcrawl not to automatically download inline images
(which it would otherwise do even when other options did not
indicate that the image should be loaded). This is useful in
conjunction with the -f option to specify a subsection of an
entire site to download, when even the images concerned need
careful selection.
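As a sketch of how these options combine (site and paths are
hypothetical), the following restricts the crawl to URLs containing
/gallery and fetches only the JPEG images linked from those pages:
webcrawl -x -X -f /gallery -f .jpg www.example.com/gallery/ mirror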
Page re-writing:
-n Turns off page rewriting completely.
-rx Select which URLs to rewrite. Only URLs that begin with / or
http: are considered for rewriting; all others are always left
unchanged. This option selects which of these URLs are
rewritten to point to local files, depending on the value of x.
a all absolute URLs are rewritten
l Only URLs that point to pages on the same site are rewrit‐
ten.
f (default)
A URL is rewritten only if the file that the rewritten URL
would point to actually exists. Note that rewriting occurs
after all links in a page have been followed (if required), so
this is probably the most sensible option, and is therefore
the default.
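For example, to rewrite every absolute URL whether or not the target
file was actually downloaded (hypothetical site name):
webcrawl -ra www.example.com/ mirror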
-k Keep original filenames - disables changing of filenames to
remove metacharacters that may confuse a web server, and to
ensure that the extension on the end of the filename is a cor‐
rect .html or .htm whenever the page has a text/html content
type. (See Configuration Files below for a discussion of how
to achieve this with other file types).
-q Disable process ID insertion into query filenames. Without
this flag, and whenever -k is not in use, webcrawl rewrites the
filenames of queries (defined as any fetch from a web server
that includes a '?' character in the filename) to include the
process ID of the webcrawl fetching the query in hexadecimal
after the (escaped) '?' in the filename; this may be desirable
if performing the same query multiple times to get different
results. This flag disables this behaviour.
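As an illustration only (the exact name also depends on the
metacharacter and quote settings described under Configuration Files
below), a query such as search?q=linux fetched by a webcrawl process
with ID 0x1a2b might be stored under a name along the lines of
search@3F1a2bq=linux; with -q the process ID part is omitted.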
Recursion limiting:
-l[x] number
This option is used to limit the depth to which webcrawl will
search the tree (forest) of interlinked pages. There are two
parameters that may be set: with x as l, the initial limit is
set; with x as r, the limit used after jumping to a remote site
is set. If x is omitted, both limits are set.
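For example, to follow links five levels deep on the starting site but
only one level deep after jumping to a remote site (hypothetical site
name):
webcrawl -ll 5 -lr 1 www.example.com/ mirror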
-v Increases the program's verbosity. Without this option, no
reports on status are made unless errors occur, etc. Used once,
webcrawl will report which URLs it is trying to download, and
also which links it has decided not to follow. -v may be used
more than once, but this is probably only useful for debugging
purposes.
-o dir Change the server root directory. This is the directory that
the path specified at the end of the command line is relative
to.
-p dir Change the URL rewriting prefix. This is prepended to rewritten
URLs, and should be a (relative) URL that points to the current
server root directory. An example of the use of the -o and -p
options is given below:
webcrawl -o /home/jules/public_html -p /~jules www.site.com/page.html mirrors
HTTP-related options
-A string
Causes webcrawl to send the specified string as the HTTP 'User-
Agent' value, rather than the compiled-in default (normally
`Mozilla/4.05 [en] (X11; I; Linux 2.0.27 i586; Nav)', although
this can be changed in the file web.h at compile time).
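For example, to identify the crawl with an arbitrary agent string of
your own choosing:
webcrawl -A 'MyMirror/1.0' www.example.com/ mirror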
-t n Specifies a timeout, in seconds. Default behaviour is to give
up after this length of time from the initial connection
attempt.
-T Changes the timeout behaviour. With this flag, the timeout
occurs only if no data is received from the server for the
specified length of time.
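For example, to give up only if no data at all arrives from the server
for 30 seconds (hypothetical site name):
webcrawl -t 30 -T www.example.com/ mirror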
CONFIGURATION FILES
webcrawl uses configuration files at present to specify rules for the
rewriting of filenames. It searches for files in /etc/webcrawl.conf,
/usr/local/etc/webcrawl.conf, and $HOME/.webcrawl and processes all
files it finds in that order. Parameters set in one file may be
overridden by subsequent files. Note that it is perfectly possible to use
webcrawl without a configuration file - it is only for advanced fea‐
tures that are too complex to configure on the command line that it is
required.
The overall syntax of the webcrawl file is a set of sections, each
headed by a line of the form [section-name].
At present, only the [rename] section is defined. This may contain the
following commands:
meta string
Sets metacharacter list. Any character in the list specified
will be quoted in filenames produced (unless filename rewriting
is disabled with the -k option). Quoting is performed by
prepending the quoting character (default @) to the hexadecimal
ASCII value of the character being quoted. The default
metacharacter list is: ?&*%=#
quote char
Sets the quoting character, as described above. The default is:
@
type content/type preferred [extra extra ...]
Sets the list of acceptable extensions for the specified MIME
content type. The first item in the list is the preferred
extension; if renaming is not disabled (with the -k option) and
the extension of a file of this type is not on the list, then
the first extension on the list will be appended to its name.
An implicit line is defined internally, which reads:
type text/html html htm
This could be overridden; if, say, you preferred the 'htm'
extension over 'html', you could use:
type text/html htm html
in a configuration file to cause .htm extensions to be used
whenever a new extension was added.
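Putting these commands together, a hypothetical $HOME/.webcrawl might
look like the following; it keeps the default quoting character,
extends the metacharacter list, prefers the 'htm' extension for HTML
and adds a rule for JPEG images (all values shown are illustrative,
not recommendations):
[rename]
meta ?&*%=#~
quote @
type text/html htm html
type image/jpeg jpg jpeg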
AUTHOR
WebCrawl was written by Julian R. Hall <jules@acris.co.uk> with sugges‐
tions and prompting by Andy Smith.
Bugs should be submitted to Julian Hall at the address above. Please
include information about what architecture, version, etc. you are
using.