w3mir man page on DragonFly

W3MIR(1)	      User Contributed Perl Documentation	      W3MIR(1)

NAME
       w3mir - all purpose HTTP-copying and mirroring tool

SYNOPSIS
       w3mir [options] [HTTP-URL]

       w3mir -B [options] <HTTP-URLS>

       w3mir is an all-purpose HTTP copying and mirroring tool.  The main
       focus of w3mir is to create and maintain a browsable copy of one, or
       several, remote WWW site(s).

       Used to the max, w3mir can retrieve the contents of several related
       sites and leave the mirror browsable via a local web server, or from
       a filesystem, such as directly from a CDROM.

       w3mir has command line options for all operations that are simple
       enough to be expressed as options.  For authentication and passwords,
       multiple site retrievals and such you will have to resort to a
       "CONFIGURATION-FILE".  If browsing from a filesystem, references
       ending in '/' need to be rewritten to end in '/index.html', and in
       any case redirected URLs will need to be changed to make the mirror
       browsable; see the documentation of Fixup in the
       "CONFIGURATION-FILE" section.

       w3mir's default behavior is to do as little as possible and to be as
       nice as possible to the server(s) it is getting documents from.  You
       will need to read through the options list to make w3mir do more
       complex, and useful, things.  Most of the things w3mir can do are
       also documented in the w3mir-HOWTO which is available at the w3mir
       home-page (http://www.math.uio.no/~janl/w3mir/) as well as in the
       w3mir distribution bundle.

DESCRIPTION
       You may specify many options and one HTTP-URL on the w3mir command
       line.

       A single HTTP URL must be specified either on the command line or in a
       URL directive in a configuration file.  If the URL refers to a
       directory it must end with a "/", otherwise you might get surprised at
       what gets retrieved (e.g. rather more than you expect).

       Options must be prefixed with at least one - as shown below, you can
       use more if you want to. -cfgfile is equivalent to --cfgfile or even
       ------cfgfile.  Options cannot be clustered, i.e., -r -R is not
       equivalent to -rR.

       -h | -help | -?
	   prints a brief summary of all command line options and exits.

       -cfgfile file
	   Makes w3mir read the given configuration file.  See the next
	   section for how to write such a file.

       -r  Puts w3mir into recursive mode.  The default is to fetch only one
           document and then quit.  'Recursive' mode means that all the
           documents linked to from the given document are fetched, and all
           the documents they link to in turn, and so on.  But only if they
           are in the same directory or under the same directory as the
           start document.  Any document that is in or under the starting
           document's directory is said to be within the scope of retrieval.
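The scope rule above amounts to a prefix test on the URL path. Below is an illustrative Python reconstruction of the described behavior, not w3mir's own code (w3mir itself is written in Perl):

```python
from urllib.parse import urlsplit

def within_scope(start_url, candidate_url):
    """True if candidate_url is in or under the start document's directory."""
    start = urlsplit(start_url)
    cand = urlsplit(candidate_url)
    # Scope never extends to a different server.
    if (start.scheme, start.netloc) != (cand.scheme, cand.netloc):
        return False
    # The start document's directory: everything up to the last '/'.
    start_dir = start.path.rsplit('/', 1)[0] + '/'
    return cand.path.startswith(start_dir)
```

Starting from http://host/a/index.html, both /a/x.html and /a/b/y.html are within scope, while /c/z.html is not.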

       -fa Fetch All.  Normally w3mir will only get the document if it has
	   been updated since the last time it was fetched.  This switch turns
	   that check off.

       -fs Fetch Some.	Not the opposite of -fa, but rather, fetch the ones we
	   don't have already.	This is handy to restart copying of a site
	   incompletely copied by earlier, interrupted, runs of w3mir.

       -p n
	   Pause for n seconds between getting each document.  The default is
	   30 seconds.

       -rp n
	   Retry Pause, in seconds.  When w3mir fails to get a document for
	   some technical reason (timeout mainly) the document will be queued
	   for a later retry.  The retry pause is how long w3mir waits between
	   finishing a mirror pass before starting a new one to get the still
	   missing documents.  This should be a long time, so network
           conditions have a chance to get better.  The default is 600
           seconds (10 minutes), which might be a bit too short; for batch
           running w3mir I would suggest an hour (3600 seconds) or more.

       -t n
	   Number of reTries.  If w3mir cannot get all the documents by the
	   nth retry w3mir gives up.  The default is 3.

       -drr
	   Disable Robot Rules.	 The robot exclusion standard is described in
	   http://info.webcrawler.com/mak/projects/robots/norobots.html.  By
	   default w3mir honors this standard.	This option causes w3mir to
	   ignore it.

       -nnc
	   No Newline Conversion.  Normally w3mir converts the newline format
           of all files that the web server says are text files.  However, not
	   all web servers are reliable, and so binary files may become
	   corrupted due to the newline conversion w3mir performs.  Use this
	   option to stop w3mir from converting newlines.  This also causes
	   the file to be regarded as binary when written to disk, to disable
	   the implicit newline conversion when saving text files on most non-
	   Unix systems.

	   This will probably be on by default in version 1.1 of w3mir, but
	   not in version 1.0.

       -R  Remove files.  Normally w3mir will not remove files that are no
	   longer on the server/part of the retrieved web of files.  When this
           option is specified all files no longer needed or found on the
           servers will be removed.  If w3mir fails to get a document for
           some other reason (a timeout, for instance) the file will not be
           removed.

       -B  Batch fetch documents whose URLs are given on the commandline.

	   In combination with the -r and/or -l switch all HTML and PDF
	   documents will be mined for URLs, but the documents will be saved
	   on disk unchanged.  When used with the -r switch only one single
	   URL is allowed.  When not used with the -r switch no HTML/URL
	   processing will be performed at all.	 When the -B switch is used
	   with -r w3mir will not do repeated mirrorings reliably since the
	   changes w3mir needs to do, in the documents, to work reliably are
	   not done.  In any case it's best not to use -R in combination with
	   -B since that can result in deleting rather more documents than
           expected.  However, if the person writing the documents being
           copied is good about making references relative and placing the
           <HTML> tag at the beginning of documents there is a fair chance
           that things will work even so.  But I wouldn't bet on it.  It will,
	   however, work reliably for repeated mirroring if the -r switch is
	   not used.

	   When the -B switch is specified redirects for a given document will
	   be followed no matter where they point.  The redirected-to document
	   will be retrieved in the place of the original document.  This is a
	   potential weakness, since w3mir can be directed to fetch any
	   document anywhere on the web.

           Unless used with -r all retrieved files will be stored in one
           directory using the remote filename as the local filename.  I.e.,
           http://foo/bar/gazonk.html will be saved as gazonk.html.
           http://foo/bar/ will be saved as bar-index.html so as to avoid
           name collisions for the common case of URLs ending in /.
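The local-name mapping just described can be sketched like this (an illustrative Python reconstruction of the stated rule, not w3mir's own code):

```python
from urllib.parse import urlsplit

def batch_local_name(url):
    """Local filename used in batch mode: the remote basename, or
    'dir-index.html' when the URL ends in '/'."""
    path = urlsplit(url).path
    if path.endswith('/'):
        # Use the last directory component to avoid collisions between
        # the many URLs that end in '/'.
        parent = path.rstrip('/').rsplit('/', 1)[-1]
        return parent + '-index.html'
    return path.rsplit('/', 1)[-1]
```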

       -I  This switch can only be used with the -B switch, and only after it
	   on the commandline or configuration file.  When given w3mir will
	   get URLs from standard input (i.e., w3mir can be used as the end of
           a pipe that produces URLs).  There should be only one URL per
           line of input.

       -q  Quiet.  Turns off all informational messages, only errors will be
	   output.

       -c  Chatty.  w3mir will output more progress information.  This can be
	   used if you're watching w3mir work.

       -v  Version.  Output w3mir's version.

       -s  Copy the given document(s) to STDOUT.

       -f  Forget.  The retrieved documents are not saved on disk, they are
	   just forgotten.  This can be used to prime the cache in proxy
	   servers, or not save documents you just want to list the URLs in
	   (see -l).

       -l  List the URLs referred to in the retrieved document(s) on STDOUT.

       -umask n
           Sets the umask, i.e., the file-creation mask that controls the
           permission bits of all retrieved files.  The number is taken as
           octal unless it starts with 0x, in which case it's taken as
           hexadecimal.  No matter what you set this to
	   make sure you get write as well as read access to created files and
	   directories.

	   Typical values are:

	   022	   let everyone read the files (and directories), only you can
		   change them.

	   027	   you and everyone in the same file-group as you can read,
		   only you can change them.

	   077	   only you can read the files, only you can change them.

	   0	   everyone can read, write and change everything.

	   The default is whatever was set when w3mir was invoked.  022 is a
	   reasonable value.

	   This option has no meaning, or effect, on Win32 platforms.
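The number handling described above amounts to the following (an illustrative Python sketch of the stated octal/hexadecimal rule; w3mir itself is Perl):

```python
def parse_umask(text):
    """Octal by default, hexadecimal when prefixed with 0x."""
    if text.startswith('0x') or text.startswith('0X'):
        return int(text, 16)
    return int(text, 8)
```

So '022' parses as octal 022 while '0x12' parses as the same value written in hex.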

       -P server:port
           Use the given server and port as an HTTP proxy server.  If no
           port is given, port 80 is assumed (this is the normal HTTP port).
           This
	   is useful if you are inside a firewall, or use a proxy server to
	   save bandwidth.

       -pflush
           Proxy flush; force the proxy server to flush its cache and re-get
	   the document from the source.  The Pragma: no-cache HTTP/1.0 header
	   is used to implement this.

       -ir referrer
	   Initial Referrer.  Set the referrer of the first retrieved
	   document.  Some servers are reluctant to serve certain documents
	   unless this is set right.

       -agent agent
           Set the HTTP User-Agent field's value.  Some servers will serve
           different documents according to the WWW browser's capabilities.
           w3mir normally has w3mir/version in this header field.  Netscape
           uses things like Mozilla/3.01 (X11; I; Linux 2.0.30 i586) and
           MSIE uses things like Mozilla/2.0 (compatible; MSIE 3.02;
           Windows NT).  Remember to enclose agent strings containing
           spaces in double quotes (").

       -lc Lower Case URLs. Some OSes, like W95 and NT, are not case sensitive
	   when it comes to filenames.	Thus web masters using such OSes can
	   case filenames differently in different places (apps.html,
	   Apps.html, APPS.HTML).  If you mirror to a Unix machine this can
	   result in one file on the server becoming many in the mirror.  This
	   option lowercases all filenames so the mirror corresponds better
	   with the server.

	   If given it must be the first option on the command line.

	   This option does not work perfectly.	 Most especially for mixed
	   case host-names.

       -d n
	   Set the debug level.	 A debug level higher than 0 will produce lots
	   of extra output for debugging purposes.

       -abs
           Force all URLs to be absolute.  If you retrieve
           http://www.ifi.uio.no/~janl/index.html and it references
           foo.html, the reference is made absolute as
           http://www.ifi.uio.no/~janl/foo.html.  In other words, you get
           absolute references to the origin site if you use this option.

CONFIGURATION-FILE
       Most things can be mirrored with a (long) command line.	But multi
       server mirroring, authentication and some other things are only
       available through a configuration file.  A configuration file can be
       specified with the -cfgfile switch, but w3mir also looks for .w3mirc
       (w3mir.ini on Win32 platforms) in the directory where w3mir is
       started from.

       The configuration file consists of lines of comments and directives.  A
       directive consists of a keyword followed by a colon (:) and then one or
       several arguments.

	# This is a comment.  And the next line is a directive:
	Options: recurse, remove

       A comment can only start at the beginning of a line.  The directive
       keywords are not case-sensitive, but the arguments might be.

       Options: recurse | no-date-check | only-nonexistent | list-urls |
       lowercase | remove | batch | input-urls | no-newline-conv | list-
       nonmirrored
	   This must be the first directive in a configuration file.

	   recurse see -r switch.

	   no-date-check
		   see -fa switch.

	   only-nonexistent
		   see -fs switch.

	   list-urls
		   see -l option.

	   lowercase
		   see -lc option.

	   remove  see -R option.

	   batch   see -B option.

	   input-urls
		   see -I option.

	   no-newline-conv
		   see -nnc option.

	   list-nonmirrored
		   List URLs not mirrored in a file called .notmirrored
                   ('notmir' on win32).  It will contain a lot of duplicate
                   lines and quite possibly be quite large.

       URL: HTTP-URL [target-directory]
	   The URL directive may only appear once in any configuration file.

	   Without the optional target directory argument it corresponds
	   directly to the single-HTTP-URL argument on the command line.

	   If the optional target directory is given all documents from under
	   the given URL will be stored in that directory, and under.  The
	   target directory is most likely only specified if the Also
	   directive is also specified.

	   If the URL given refers to a directory it must end in a "/",
	   otherwise you might get quite surprised at what gets retrieved.

	   Either one URL: directive or the single-HTTP-URL at the command-
	   line must be given.

       Also: HTTP-URL directory
	   This directive is only meaningful if the recurse (or -r) option is
	   given.

	   The directive enlarges the scope of a recursive retrieval to
	   contain the given HTTP-URL and all documents in the same directory
	   or under.  Any documents retrieved because of this directive will
	   be stored in the given directory of the mirror.

           In practice this means that if the documents to be retrieved are
           stored on several servers, or in several hierarchies on one
           server, or any combination of those, then the Also directive
           ensures that we get everything into one single mirror.

	   This also means that if you're retrieving

	     URL: http://www.foo.org/gazonk/

	   but it has inline icons or images stored in
	   http://www.foo.org/icons/ which you will also want to get, then
	   that will be retrieved as well by entering

	     Also: http://www.foo.org/icons/ icons

	   As with the URL directive, if the URL refers to a directory it must
	   end in a "/".

	   Another use for it is when mirroring sites that have several names
	   that all refer to the same (logical) server:

	     URL: http://www.midifest.com/
	     Also: http://midifest.com/ .

	   At this point in time w3mir has no mechanism to easily enlarge the
	   scope of a mirror after it has been established.  That means that
	   you should survey the documents you are going to retrieve to find
	   out what icons, graphics and other things they refer to that you
	   want.  And what other sites you might like to retrieve.  If you
	   find out that something is missing you will have to delete the
	   whole mirror, add the needed Also directives and then reestablish
	   the mirror.	This lack of flexibility in what to retrieve will be
	   addressed at a later date.

	   See also the Also-quene directive.

       Also-quene: HTTP-URL directory
           This is like Also, except that the URL itself is also queued.
           The Also directive will not cause any documents to be retrieved
           UNLESS they are referenced by some other document w3mir has
           already retrieved.

       Quene: HTTP-URL
           This queues the URL for retrieval, but does not enlarge the
           scope of the retrieval.  If the URL is outside the scope of
           retrieval it will not be retrieved.

	   The observant reader will see that Also-quene is like Also combined
	   with Quene.

       Initial-referer: referer
	   see -ir option.

       Ignore: wildcard
       Fetch: wildcard
       Ignore-RE: regular-expression
       Fetch-RE: regular-expression
	   These four are used to set up rules about which documents, within
	   the scope of retrieval, should be gotten and which not.  The
	   default is to get anything that is within the scope of retrieval.
	   That may not be practical though.  This goes for CGI scripts, and
	   especially server side image maps and other things that are
	   executed/evaluated on the server.  There might be other things you
	   want unfetched as well.

	   w3mir stores the Ignore/Fetch rules in a list.  When a document is
	   considered for retrieval the URL is checked against the list in the
	   same order that the rules appeared in the configuration file.  If
	   the URL matches any rule the search stops at once.  If it matched a
	   Ignore rule the document is not fetched and any URLs in other
	   documents pointing to it will point to the document at the original
	   server (not inside the mirror).  If it matched a Fetch rule the
           document is gotten.  If not matched by any rules the document is
           gotten.

	   The wildcards are a very limited subset of Unix-wildcards.  w3mir
	   understands only '?', '*', and '[x-y]' ranges.
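Taken together, the rule list and the limited wildcards can be sketched roughly as follows. This is an illustrative Python approximation, not w3mir's code; in particular, matching patterns unanchored (anywhere in the URL) is an assumption here:

```python
import re

def wildcard_to_re(pat):
    """Translate the limited wildcards ('?', '*', '[x-y]') to a regex."""
    out = []
    i = 0
    while i < len(pat):
        c = pat[i]
        if c == '?':
            out.append('.')
        elif c == '*':
            out.append('.*')
        elif c == '[':
            j = pat.index(']', i)      # copy a [x-y] range through unchanged
            out.append(pat[i:j + 1])
            i = j
        else:
            out.append(re.escape(c))
        i += 1
    return ''.join(out)

def decide(url, rules):
    """First matching (kind, pattern) rule wins; unmatched URLs are fetched."""
    for kind, pat in rules:
        if re.search(wildcard_to_re(pat), url):
            return kind
    return 'fetch'

rules = [('ignore', '*.cgi'), ('ignore', '*.map')]
```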

           The perl-regular-expression is Perl's superset of the normal
           Unix regular expression syntax.  It must be completely
           specified, including the prefixed m, a delimiter of your choice
           (except the paired delimiters: parentheses, brackets and
           braces), and any of the RE modifiers.  E.g.,

             Ignore-RE: m/\.gif$/i

	   or

	     Ignore-RE: m~/.*/.*/.*/~

	   and so on.  "#" cannot be used as delimiter as it is the comment
	   character in the configuration file.	 This also has the bad side-
	   effect of making you unable to match fragment names (#foobar)
	   directly.  Fortunately perl allows writing ``#'' as ``\043''.

           You must be very careful when using the RE anchors (``^'' and
           ``$'') with the RE versions of these and the Apply directive.
           Given the rules:

	     Fetch-RE: m/foobar.cgi$/
	     Ignore: *.cgi

           then all files called ``foobar.cgi'' will be fetched.  However, if
	   the file is referenced as ``foobar.cgi?query=mp3'' it will not be
	   fetched since the ``$'' anchor will prevent it from matching the
	   Fetch-RE directive and then it will match the Ignore directive
	   instead. If you want to match ``foobar.cgi'' but not
           ``foobar.cgifu'' you can use Perl's ``\b'' character class which
           matches a word boundary:

	     Fetch-RE: m/foobar.cgi\b/
	     Ignore: *.cgi

	   which will get ``foobar.cgi'' as well as ``foobar.cgi?query=mp3''
	   but not ``foobar.cgifu''. BUT, you must keep in mind that a lot of
	   diffetent characters make a word boundrary, maybe something more
	   subtle is needed.
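The difference the \b makes can be checked directly. Python's re engine is shown here for illustration; its \b agrees with Perl's for these strings:

```python
import re

pattern = re.compile(r'foobar\.cgi\b')

# 'i' followed by '?' (a non-word character) is a word boundary, and so is
# end-of-string, but 'i' followed by the word character 'f' is not.
print(bool(pattern.search('foobar.cgi')))            # matches
print(bool(pattern.search('foobar.cgi?query=mp3')))  # matches
print(bool(pattern.search('foobar.cgifu')))          # no match
```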

       Apply: regular-expression
	   This is used to change a URL into another URL.  It is a potentially
	   very powerful feature, and it also provides ample chance for you to
           shoot your own foot.  The whole apparatus is somewhat tentative; if
	   you find there is a need for changes in how Apply rules work please
	   E-mail. If you are going to use this feature please read the
	   documentation for Fetch-RE and Ignore-RE first.

	   The Apply expressions are applied, in sequence, to the URLs in
	   their absolute form. I.e., with the whole
	   http://host:port/dir/ec/tory/file URL. It is only after this w3mir
	   checks if a document is within the scope of retrieval or not. That
	   means that Apply rules can be used to change certain URLs to fall
	   inside the scope of retrieval, and vice versa.

           The regular-expression is Perl's superset of the usual Unix regular
	   expressions for substitution.  As with Fetch and Ignore rules it
	   must be specified fully, with the s and delimiting character.  It
	   has the same restrictions with regards to delimiters. E.g.,

	     Apply: s~/foo/~/bar/~i

	   to translate the path element foo to bar in all URLs.

	   "#" cannot be used as delimiter as it is the comment character in
	   the configuration file.

	   Please note that w3mir expects that URLs identifying 'directories'
           keep identifying directories after application of Apply rules.
	   Ditto for files.
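That sequencing can be sketched as follows, mirroring the Apply: s~/foo/~/bar/~i example above. This is an illustrative Python sketch (w3mir itself is Perl), with a hypothetical rule list:

```python
import re

# Hypothetical rule list: each entry is (pattern, replacement, flags),
# corresponding to one "Apply: s~pattern~replacement~flags" directive.
apply_rules = [(r'/foo/', '/bar/', re.IGNORECASE)]

def apply_all(url, rules):
    """Run every Apply rule, in order, over the absolute URL.  Only after
    this would the scope-of-retrieval check be performed."""
    for pat, repl, flags in rules:
        url = re.sub(pat, repl, url, flags=flags)
    return url
```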

       Agent: agent
	   see -agent option.

       Pause: n
	   see -p option.

       Retry-Pause: n
	   see -rp option.

       Retries: n
	   see -t option.

       debug: n
	   see -d option.

       umask: n
	   see -umask option.

       Robot-Rules: on | off
           Turn robot rules on or off.  See the -drr option.

       Remove-Nomirror: on | off
	   If this is enabled sections between two consecutive

	     <!--NO MIRROR-->

	   comments in a mirrored document will be removed.  This editing is
	   performed even if batch getting is specified.

       Header: html/text
	   Insert this complete html/text into the start of the document.
	   This will be done even if batch is specified.

       File-Disposition: save | stdout | forget
	   What to do with a retrieved file.  The save alternative is default.
	   The two others correspond to the -s and -f options.	Only one may
	   be specified.

       Verbosity: quiet | brief | chatty
           How much w3mir informs you of its progress.  Brief is the default.
	   The two others correspond to the -q and -c switches.

       Cd: directory
	   Change to given directory before starting work.  If it does not
	   exist it will be quietly created.  Using this option breaks the
	   'fixup' code so consider not using it, ever.

       HTTP-Proxy: server:port
	   see the -P switch.

       HTTP-Proxy-user: username
       HTTP-Proxy-passwd: password
           These two are used to activate authentication with the proxy
           server.  w3mir only supports basic proxy authentication, and is
           quite simpleminded about it; if proxy authentication is on,
           w3mir will always give it to the proxy.  The domain concept is not
	   supported with proxy-authentication.

       Proxy-Options: no-pragma | revalidate | refresh | no-store
	   Set proxy options.  There are two ways to pass proxy options,
	   HTTP/1.0 compatible and HTTP/1.1 compatible.	 Newer proxy-servers
	   will understand the 1.1 way as well as 1.0.	With old proxy-servers
	   only the 1.0 way will work.	w3mir will prefer the 1.0 way.

	   The only 1.0 compatible proxy-option is refresh, it corresponds to
	   the -pflush option and forces the proxy server to pass the request
           to an upstream server to retrieve a fresh copy of the document.

	   The no-pragma option forces w3mir to use the HTTP/1.1 proxy control
	   header, use this only with servers you know to be new, otherwise it
	   won't work at all.  Use of any option but refresh will also cause
	   HTTP/1.1 to be used.

	   revalidate forces the proxy server to contact the upstream server
	   to validate that it has a fresh copy of the document.  This is
           nicer to the net than the refresh option, which forces a re-get of the
	   document no matter if the server has a fresh copy already.

	   no-store forbids the proxy from storing the document in other than
	   in transient storage.  This can be used when transferring sensitive
	   documents, but is by no means any warranty that the document can't
	   be found on any storage device on the proxy-server after the
           transfer.  Cryptography, if legal in your country, is the solution
	   if you want the contents to be secret.

	   refresh corresponds to the HTTP/1.0 header Pragma: no-cache or the
	   identical HTTP/1.1 Cache-control option.  revalidate and no-store
	   corresponds to max-age=0 and no-store respectively.

       Authorization
	   w3mir supports only the basic authentication of HTTP/1.0.  This
	   method can assign a password to a given user/server/realm.  The
	   "user" is your user-name on the server.  The "server" is the
	   server.  The realm is a HTTP concept.  It is simply a grouping of
	   files and documents.	 One file or a whole directory hierarchy can
	   belong to a realm.  One server may have many realms.	 A user may
	   have separate passwords for each realm, or the same password for
	   all the realms the user has access to.  A combination of a server
	   and a realm is called a domain.

	   Auth-Domain: server:port/realm
		   Give the server and port, and the belonging realm (making a
		   domain) that the following authentication data holds for.
		   You may specify "*" wildcard for either of server:port and
                   realm; this will work well if you only have one username and
		   password on all the servers mirrored.

	   Auth-User: user
		   Your user-name.

	   Auth-Passwd: password
		   Your password.

	   These three directives may be repeated, in clusters, as many times
           as needed to give the necessary authentication information.

       Disable-Headers: referer | user
	   Stop w3mir from sending the given headers.  This can be used for
	   anonymity, making your retrievals harder to track.  It will be even
	   harder if you specify a generic Agent, like Netscape.

       Fixup: ...
	   This directive controls some aspects of the separate program
	   w3mfix.  w3mfix uses the same configuration file as w3mir since it
	   needs a lot of the information in the w3mir configuration file to
           do its work correctly.  w3mfix is used to make mirrors more
           browsable on filesystems (disk or CDROM), and to fix redirected
           URLs and some other URL editing.  If you want a mirror to be
           browsable on disk or CDROM you almost certainly need to run
           w3mfix.  It is often not necessary for a mirror that will be
           used through a WWW server.

	   To make w3mir write the data files w3mfix needs, and do nothing
	   else, simply put

	     Fixup: on

	   in the configuration file.  To make w3mir run w3mfix automatically
	   after each time w3mir has completed a mirror run specify

	     Fixup: run

           w3mfix is documented in a separate man page in an effort not to
           prolong this manpage unnecessarily.

       Index-name: name-of-index-file
           When retrieving URLs ending in '/' w3mir needs to append a
           filename to store them locally.  The default value for this is
           'index.html' (this is the most used; its use originated in the
           NCSA HTTPD as far as I know).  Some WWW servers use the filename
           'Welcome.html' or 'welcome.html' instead (this was the default
           in the old CERN HTTPD).  And servers running on limited OSes
           frequently use 'index.htm'.  To keep things consistent and sane
           w3mir and the server should use the same name.  Put

	     Index-name: welcome.html

	   when mirroring from a site that uses that convention.

           When doing a multiserver retrieval where the servers use two or
           more different names for this, you should use Apply rules to
           make the names consistent within the mirror.

           When making a mirror for use with a WWW server, the mirror
           should use the same name as the new server for this; to
           accomplish that, Index-name should be combined with Apply.

           Here is an example of its use in the two latter cases, when
           Welcome.html is the preferred index name:

	     Index-name: Welcome.html
	     Apply: s~/index.html$~/Welcome.html~

           Similarly, if index.html is the preferred index name:

             Apply: s~/Welcome.html$~/index.html~

	   Index-name is not needed since index.html is the default index
	   name.

EXAMPLES
       ·   Just get the latest Dr-Fun if it has been changed since the last
	   time

	    w3mir http://sunsite.unc.edu/Dave/Dr-Fun/latest.jpg

       ·   Recursively fetch everything on the Star Wars site, remove what is
	   no longer at the server from the mirror:

	    w3mir -R -r http://www.starwars.com/

       ·   Fetch the contents of the Sega site through a proxy, pausing for 30
	   seconds between each document

	    w3mir -r -p 30 -P www.foo.org:4321 http://www.sega.com/

       ·   Do everything according to w3mir.cfg

	    w3mir -cfgfile w3mir.cfg

       ·   A simple configuration file

	    # Remember, options first, as many as you like, comma separated
	    Options: recurse, remove
	    #
	    # Start here:
	    URL: http://www.starwars.com/
	    #
	    # Speed things up
	    Pause: 0
	    #
	    # Don't get junk
	    Ignore: *.cgi
	    Ignore: *-cgi
	    Ignore: *.map
	    #
	    # Proxy:
	    HTTP-Proxy: www.foo.org:4321
	    #
	    # You _should_ cd away from the directory where the config file is.
	    cd: starwars
	    #
	    # Authentication:
	    Auth-domain: server:port/realm
	    Auth-user: me
	    Auth-passwd: my_password
	    #
	    # You can use '*' in place of server:port and/or realm:
	    Auth-domain: */*
	    Auth-user: otherme
            Auth-passwd: otherpassword

       ·   Also:

            # Retrieve all of janl's home pages:
	    Options: recurse
	    #
	    # This is the two argument form of URL:.  It fetches the first into the second
	    URL: http://www.math.uio.no/~janl/ math/janl
	    #
            # These say that any documents referred to that live under these places
            # should be gotten too, into the named directories.  Two arguments are
            # required for 'Also:'.
	    Also: http://www.math.uio.no/drift/personer/ math/drift
	    Also: http://www.ifi.uio.no/~janl/ ifi/janl
	    Also: http://www.mi.uib.no/~nicolai/ math-uib/nicolai
	    #
	    # The options above will result in this directory hierarchy under
	    # where you started w3mir:
	    # w3mir/math/janl		   files from http://www.math.uio.no/~janl
	    # w3mir/math/drift		   from http://www.math.uio.no/drift/personer/
	    # w3mir/ifi/janl		   from http://www.ifi.uio.no/~janl/
	    # w3mir/math-uib/nicolai	   from http://www.mi.uib.no/~nicolai/

       ·   Ignore-RE and Fetch-RE

	    # Get only jpeg/jpg files, no gifs
	    Fetch-RE: m/\.jp(e)?g$/
	    Ignore-RE: m/\.gif$/

       ·   Apply

           As I said earlier, Apply has not been used for Real Work yet,
           that I know of.  But Apply could be used to map all web servers
           at the University of Oslo inside the scope of retrieval very
           easily:

	     # Start at the main server
	     URL: http://www.uio.no/
	     # Change http://*.uio.no and http://129.240.* to be a subdirectory
	     # of http://www.uio.no/.
             Apply: s~^http://(.*\.uio\.no(?::\d+)?)/~http://www.uio.no/$1/~i
             Apply: s~^http://(129\.240\.[^:]*(?::\d+)?)/~http://www.uio.no/$1/~i
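The first rule's effect can be tried in isolation. This is an illustrative Python translation of that substitution; note that a port suffix like ':8080' needs the group written as (?::\d+)? to be matched:

```python
import re

rule = re.compile(r'^http://(.*\.uio\.no(?::\d+)?)/', re.IGNORECASE)

def rewrite(url):
    # Fold any *.uio.no server into a subdirectory of http://www.uio.no/.
    return rule.sub(r'http://www.uio.no/\1/', url)
```

A URL such as http://www.math.uio.no/~janl/ becomes http://www.uio.no/www.math.uio.no/~janl/, which is inside the scope of retrieval; other hosts pass through unchanged.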

       There are two rather extensive example files in the w3mir distribution.

BUGS
       The -lc switch does not work too well.

FEATURES
       These are not bugs.

       URLs with two /es ('//') in the path component do not work as some
       might expect.  According to my reading of the URL spec. it is an
       illegal construct, which is a Good Thing, because I don't know how to
       handle it if it's legal.
       If you start at http://foo/bar/ then index.html might be gotten twice.
       Some documents point to a point above the server root, i.e.,
       http://some.server/../stuff.html.  Netscape, and other browsers, in
       defiance of the URL standard, will change the URL to
       http://some.server/stuff.html.  W3mir will not.
       Authentication is only tried if the server requests it.  This might
       lead to a lot of extra connections going up and down, but that's the
       way it's gotta work for now.

SEE ALSO
       w3mfix

AUTHORS
       w3mir's authors can be reached at w3mir-core@usit.uio.no.  w3mir's
       home page is at http://www.math.uio.no/~janl/w3mir/

perl v5.20.2			  2015-08-31			      W3MIR(1)