[plug] Need utility/plugin to download links
Arie Hol
arie99 at ozemail.com.au
Sun Dec 11 17:32:05 WST 2005
On 11 Dec 2005 at 14:02, Jim Householder wrote:
> Hi
>
> In the past, I have seen utilities that, given a url, will download that
> page and all the pages to which it has links.
>
> Can someone point me in the right direction? It would be nice if the
> pages acquired could be restricted to the site of the specified url so I
> don't end up downloading the entire net.
>
In the past I have used HTTrack - it was available both as a command line
utility and as a plugin for the old version of Firefox known as Firebird.
I used it on my Windows PC - I do not know if it is available for Linux.
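For what Jim describes (grab a page plus the pages it links to, but stay on
the one site), a command along these lines should do it - I am going from the
3.30 readme quoted below rather than a current install, so treat it as a
rough sketch and substitute your own site for www.somesite.com:

  httrack www.somesite.com/ -r3 +www.somesite.com/*

The 'a' behaviour (stay on the same address) is the default, -r3 limits the
link depth to 3, and the +www.somesite.com/* filter keeps the spider from
wandering off onto other hosts.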
Following are extracts from the (14k) user readme file for the command line utility:
========================================================
HTTrack version 3.30+swf (compiled Oct 11 2003)
usage: C:\TLUMAC~1\SPIDER~1\WINDA\CONTENT\HTTRACK\HTTRACK.EXE <URLs> [-option] [+<FILTERs>] [-<FILTERs>]
with options listed below: (* is the default value)
8<------------- snip ---------------->8
Action options:
w *mirror web sites (--mirror)
W mirror web sites, semi-automatic (asks questions) (--mirror-wizard)
g just get files (saved in the current directory) (--get-files)
i continue an interrupted mirror using the cache (--continue)
Y mirror ALL links located in the first level pages (mirror links) (--mirrorlinks)
8<------------- snip ---------------->8
Limits options:
rN set the mirror depth to N (* r9999) (--depth[=N])
%eN set the external links depth to N (* %e0) (--ext-depth[=N])
mN maximum file length for a non-html file (--max-files[=N])
mN,N2 maximum file length for non html (N) and html (N2)
MN maximum overall size that can be uploaded/scanned (--max-size[=N])
EN maximum mirror time in seconds (60=1 minute, 3600=1 hour) (--max-time[=N])
AN maximum transfer rate in bytes/seconds (1000=1KB/s max) (--max-rate[=N])
%cN maximum number of connections/seconds (*%c10) (--connection-per-second[=N])
GN pause transfer if N bytes reached, and wait until lock file is deleted (--max-pause[=N])
8<------------- snip ---------------->8
Spider options:
bN accept cookies in cookies.txt (0=do not accept,* 1=accept) (--cookies[=N])
u check document type if unknown (cgi,asp..) (u0 don't check, * u1 check but /, u2 check always) (--check-type[=N])
j *parse Java Classes (j0 don't parse) (--parse-java[=N])
sN follow robots.txt and meta robots tags (0=never,1=sometimes,* 2=always) (--robots[=N])
%h force HTTP/1.0 requests (reduce update features, only for old servers or proxies) (--http-10)
%k use keep-alive if possible, greatly reducing latency for small files and test requests (%k0 don't use) (--keep-alive)
%B tolerant requests (accept bogus responses on some servers, but not standard!) (--tolerant)
%s update hacks: various hacks to limit re-transfers when updating (identical size, bogus response..) (--updatehack)
%u url hacks: various hacks to limit duplicate URLs (strip //, www.foo.com==foo.com..) (--urlhack)
%A assume that a type (cgi,asp..) is always linked with a mime type (-%A php3,cgi=text/html;dat,bin=application/x-zip) (--assume <param>)
shortcut: '--assume standard' is equivalent to -%A php2,php3,php4,php,cgi,asp,jsp,pl,cfm,nsf=text/html
@iN internet protocol (0=both ipv6+ipv4, 4=ipv4 only, 6=ipv6 only) (--protocol[=N])
Browser ID:
F user-agent field (-F "user-agent name") (--user-agent <param>)
%F footer string in Html code (-%F "Mirrored [from host %s [file %s [at %s]]]") (--footer <param>)
%l preferred language (-%l "fr, en, jp, *") (--language <param>)
8<------------- snip ---------------->8
Expert options:
pN priority mode: (* p3) (--priority[=N])
p0 just scan, don't save anything (for checking links)
p1 save only html files
p2 save only non html files
*p3 save all files
p7 get html files before, then treat other files
S stay on the same directory (--stay-on-same-dir)
D *can only go down into subdirs (--can-go-down)
U can only go to upper directories (--can-go-up)
B can both go up&down into the directory structure (--can-go-up-and-down)
a *stay on the same address (--stay-on-same-address)
d stay on the same principal domain (--stay-on-same-domain)
l stay on the same TLD (eg: .com) (--stay-on-same-tld)
e go everywhere on the web (--go-everywhere)
%H debug HTTP headers in logfile (--debug-headers)
8<------------- snip ---------------->8
Shortcuts:
--mirror <URLs> *make a mirror of site(s) (default)
--get <URLs> get the files indicated, do not seek other URLs (-qg)
--list <text file> add all URLs located in this text file (-%L)
--mirrorlinks <URLs> mirror all links in 1st level pages (-Y)
--testlinks <URLs> test links in pages (-r1p0C0I0t)
--spider <URLs> spider site(s), to test links: reports Errors & Warnings (-p0C0I0t)
--testsite <URLs> identical to --spider
--skeleton <URLs> make a mirror, but gets only html files (-p1)
--update update a mirror, without confirmation (-iC2)
--continue continue a mirror, without confirmation (-iC1)
--catchurl create a temporary proxy to capture a URL or a form post URL
--clean erase cache & log files
--http10 force http/1.0 requests (-%h)
8<------------- snip ---------------->8
example: httrack www.someweb.com/bob/
means: mirror site www.someweb.com/bob/ and only this site
example: httrack www.someweb.com/bob/ www.anothertest.com/mike/ +*.com/*.jpg
means: mirror the two sites together (with shared links) and accept any .jpg files on .com sites
example: httrack www.someweb.com/bob/bobby.html +* -r6
means: get all files starting from bobby.html, with 6 link-depth, and possibility of going everywhere on the web
example: httrack www.someweb.com/bob/bobby.html --spider -P proxy.myhost.com:8080
runs the spider on www.someweb.com/bob/bobby.html using a proxy
example: httrack --update
updates a mirror in the current folder
example: httrack
will bring you to the interactive mode
example: httrack --continue
continues a mirror in the current folder
HTTrack version 3.30+swf (compiled Oct 11 2003)
Copyright (C) Xavier Roche and other contributors
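If Jim also wants the pages that the site links to on other hosts (one hop
out, nothing further), the external-depth limit from the Limits section above
can be added - again only a sketch based on the 3.30 options, so check it
against whatever version you end up with:

  httrack www.somesite.com/ -r3 -%e1

%e1 fetches externally linked pages to a depth of 1 while the mirror itself
still stays on the same address.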
HTH
Regards Arie
------------------------------------------------------------------
For the concert of life, nobody has a program.
------------------------------------------------------------------