[plug] Need utility/plugin to download links

Arie Hol arie99 at ozemail.com.au
Sun Dec 11 17:32:05 WST 2005



On 11 Dec 2005 at 14:02, Jim Householder wrote:

> Hi
> 
> In the past, I have seen utilities that, given a url, will download that
> page and all the pages to which it has links.
> 
> Can someone point me in the right direction?  It would be nice if the
> pages acquired could be restricted to the site of the specified url so I
> don't end up downloading the entire net.
> 

In the past I have used HTTrack. It was available both as a command line
utility and as a plugin for Firebird, the old version of Firefox.

I used it on my Windows PC; I do not know if it is available for Linux.

Following are extracts from the (14k) user readme file for the command 
line utility:

========================================================
HTTrack version 3.30+swf (compiled Oct 11 2003)
	usage: C:\TLUMAC~1\SPIDER~1\WINDA\CONTENT\HTTRACK\HTTRACK.EXE <URLs> [-option] [+<FILTERs>] [-<FILTERs>]
	with options listed below: (* is the default value)

8<------------- snip ---------------->8

Action options:
  w *mirror web sites (--mirror)
  W  mirror web sites, semi-automatic (asks questions) (--mirror-wizard)
  g  just get files (saved in the current directory) (--get-files)
  i  continue an interrupted mirror using the cache (--continue)
  Y   mirror ALL links located in the first level pages (mirror links) (--mirrorlinks)

8<------------- snip ---------------->8

Limits options:
  rN set the mirror depth to N (* r9999) (--depth[=N])
 %eN set the external links depth to N (* %e0) (--ext-depth[=N])
  mN maximum file length for a non-html file (--max-files[=N])
  mN,N2 maximum file length for non html (N) and html (N2)
  MN maximum overall size that can be uploaded/scanned (--max-size[=N])
  EN maximum mirror time in seconds (60=1 minute, 3600=1 hour) (--max-time[=N])
  AN maximum transfer rate in bytes/seconds (1000=1KB/s max) (--max-rate[=N])
 %cN maximum number of connections/seconds (*%c10) (--connection-per-second[=N])
  GN pause transfer if N bytes reached, and wait until lock file is deleted (--max-pause[=N])
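
As a rough illustration of how the limit options above combine, here is a
minimal Python sketch that turns a few limits into the long-form options
from the readme. The helper name limit_args and its parameter names are
made up for this example; only the option names come from the readme. The
sketch only builds an argument list, it does not run HTTrack.

```python
def limit_args(depth=None, max_time_s=None, max_rate_bps=None):
    """Build HTTrack limit options using the long forms from the readme.

    limit_args is a hypothetical helper, not part of HTTrack itself.
    """
    args = []
    if depth is not None:
        args.append(f"--depth={depth}")            # long form of -rN
    if max_time_s is not None:
        args.append(f"--max-time={max_time_s}")    # long form of -EN, in seconds
    if max_rate_bps is not None:
        args.append(f"--max-rate={max_rate_bps}")  # long form of -AN, in bytes/second
    return args


# Depth 3, at most one hour, at most 25000 bytes/second.
print(limit_args(depth=3, max_time_s=3600, max_rate_bps=25000))
```

Limits that are left unset are simply omitted, so HTTrack falls back to its
own defaults (for example r9999 for the depth).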

8<------------- snip ---------------->8

Spider options:
  bN accept cookies in cookies.txt (0=do not accept,* 1=accept) (--cookies[=N])
  u  check document type if unknown (cgi,asp..) (u0 don't check, * u1 check but /, u2 check always) (--check-type[=N])
  j *parse Java Classes (j0 don't parse) (--parse-java[=N])
  sN follow robots.txt and meta robots tags (0=never,1=sometimes,* 2=always) (--robots[=N])
 %h  force HTTP/1.0 requests (reduce update features, only for old servers or proxies) (--http-10)
 %k  use keep-alive if possible, greatly reducing latency for small files and test requests (%k0 don't use) (--keep-alive)
 %B  tolerant requests (accept bogus responses on some servers, but not standard!) (--tolerant)
 %s  update hacks: various hacks to limit re-transfers when updating (identical size, bogus response..) (--updatehack)
 %u  url hacks: various hacks to limit duplicate URLs (strip //, www.foo.com==foo.com..) (--urlhack)
 %A  assume that a type (cgi,asp..) is always linked with a mime type (-%A php3,cgi=text/html;dat,bin=application/x-zip) (--assume <param>)
     shortcut: '--assume standard' is equivalent to -%A php2,php3,php4,php,cgi,asp,jsp,pl,cfm,nsf=text/html
 @iN internet protocol (0=both ipv6+ipv4, 4=ipv4 only, 6=ipv6 only) (--protocol[=N])

Browser ID:
  F  user-agent field (-F "user-agent name") (--user-agent <param>)
 %F  footer string in Html code (-%F "Mirrored [from host %s [file %s [at %s]]]" (--footer <param>)
 %l  preferred language (-%l "fr, en, jp, *" (--language <param>)

8<------------- snip ---------------->8

Expert options:
  pN priority mode: (* p3) (--priority[=N])
      p0 just scan, don't save anything (for checking links)
      p1 save only html files
      p2 save only non html files
     *p3 save all files
      p7 get html files before, then treat other files
  S  stay on the same directory (--stay-on-same-dir)
  D *can only go down into subdirs (--can-go-down)
  U  can only go to upper directories (--can-go-up)
  B  can both go up&down into the directory structure (--can-go-up-and-down)
  a *stay on the same address (--stay-on-same-address)
  d  stay on the same principal domain (--stay-on-same-domain)
  l  stay on the same TLD (eg: .com) (--stay-on-same-tld)
  e  go everywhere on the web (--go-everywhere)
 %H  debug HTTP headers in logfile (--debug-headers)
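
The a/d/l/e options above are the ones that address the original question
of not wandering off the starting site. A minimal Python sketch of choosing
between them follows. The scope names ("address", "domain", and so on) and
the function site_mirror_command are invented for this illustration; the
option strings are the long forms listed in the readme. It only assembles
the command line and does not start a download.

```python
# Scope names on the left are made up for this sketch;
# the option strings on the right are from the readme extract.
SCOPE_OPTIONS = {
    "address": "--stay-on-same-address",  # -a, the readme's default
    "domain": "--stay-on-same-domain",    # -d
    "tld": "--stay-on-same-tld",          # -l
    "everywhere": "--go-everywhere",      # -e
}


def site_mirror_command(url, scope="address", depth=None):
    """Return an argv list that mirrors url without leaving the chosen scope."""
    if scope not in SCOPE_OPTIONS:
        raise ValueError(f"unknown scope: {scope!r}")
    cmd = ["httrack", url, SCOPE_OPTIONS[scope]]
    if depth is not None:
        cmd.append(f"--depth={depth}")  # long form of -rN
    return cmd


# Stay on the starting address and follow links at most 4 levels deep.
print(" ".join(site_mirror_command("www.someweb.com/bob/", depth=4)))
```

Since "stay on the same address" is marked as the default, passing it
explicitly should be redundant; it is spelled out here only to make the
intent visible.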

8<------------- snip ---------------->8

Shortcuts:
--mirror      <URLs> *make a mirror of site(s) (default)
--get         <URLs>  get the files indicated, do not seek other URLs (-qg)
--list   <text file>  add all URL located in this text file (-%L)
--mirrorlinks <URLs>  mirror all links in 1st level pages (-Y)
--testlinks   <URLs>  test links in pages (-r1p0C0I0t)
--spider      <URLs>  spider site(s), to test links: reports Errors & Warnings (-p0C0I0t)
--testsite    <URLs>  identical to --spider
--skeleton    <URLs>  make a mirror, but gets only html files (-p1)
--update              update a mirror, without confirmation (-iC2)
--continue            continue a mirror, without confirmation (-iC1)

--catchurl            create a temporary proxy to capture an URL or a form post URL
--clean               erase cache & log files

--http10              force http/1.0 requests (-%h)

8<------------- snip ---------------->8

example: httrack www.someweb.com/bob/
means:   mirror site www.someweb.com/bob/ and only this site

example: httrack www.someweb.com/bob/ www.anothertest.com/mike/ +*.com/*.jpg
means:   mirror the two sites together (with shared links) and accept any
         .jpg files on .com sites

example: httrack www.someweb.com/bob/bobby.html +* -r6
means:   get all files starting from bobby.html, with 6 link-depth, and
         possibility of going everywhere on the web

example: httrack www.someweb.com/bob/bobby.html --spider -P proxy.myhost.com:8080
runs the spider on www.someweb.com/bob/bobby.html using a proxy

example: httrack --update
updates a mirror in the current folder

example: httrack
will bring you to the interactive mode

example: httrack --continue
continues a mirror in the current folder

HTTrack version 3.30+swf (compiled Oct 11 2003)
Copyright (C) Xavier Roche and other contributors
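
For anyone who wants to drive the command line utility from a script, the
readme's --spider example could be wrapped roughly as below. This is only a
sketch: spider_command and run_if_installed are hypothetical helper names,
and it assumes the executable is called httrack and is on the PATH. The
options used (--spider and -P) are taken from the readme example.

```python
import shutil
import subprocess


def spider_command(url, proxy=None):
    """argv for the readme's --spider example; proxy is 'host:port' or None."""
    cmd = ["httrack", url, "--spider"]
    if proxy:
        cmd += ["-P", proxy]  # proxy option as used in the readme example
    return cmd


def run_if_installed(cmd):
    """Run cmd only if its program is found on PATH.

    Returns the exit code, or None when the program is not installed.
    """
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, check=False).returncode


cmd = spider_command("www.someweb.com/bob/bobby.html", proxy="proxy.myhost.com:8080")
print(" ".join(cmd))
# To actually test the links, call: run_if_installed(cmd)
```

Building the argument list separately from running it makes it easy to
print and check the command before any pages are fetched.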


HTH


Regards Arie
------------------------------------------------------------------
 For the concert of life, nobody has a program.
------------------------------------------------------------------


