[plug] mapping out a website
Arie Hol
arie99 at ozemail.com.au
Thu Jun 10 20:24:55 WST 2004
On 10 Jun 2004 at 13:35, David Buddrige wrote:
> Hi all,
>
> I have been asked to map out all the pages in a given intranet website. So
> for example, given website url:
>
> http://abc.com/
>
> They want a list of every url that can be got at from the links on the
> initial page, sort of like this:
>
8<---------------------------snip-------------------------------->8
David, in the past I have made use of a piece of JavaScript that lists all URLs contained in a web page. The URLs are listed in a separate window and can be saved to disk as a regular HTML page, or you can highlight and copy them and paste them into a text file. This JavaScript is known as "List all links" and is available as a "bookmarklet" from:
http://www.bookmarklets.com/tools/categor.html
There is an amazing array of these bookmarklets available - some of them are real gems.
What I would suggest you try is:
In a GUI environment
1 - Download the bookmarklet.
2 - Save it as a bookmark that sits on your 'Bookmarks' toolbar (so it's easy to get at)
3 - Access the web page where you want to start gathering URLs.
4 - Click on "List all links"
5 - Highlight and copy all the URLs from the newly opened window
6 - Paste into a new text file
7 - Repeat this sequence for as many web pages as you want to collect URLs from
At the command line
8 - Sort the text file in 'ascending alphabetical order'
9 - Then run the text file through the 'uniq' utility (to remove duplicate lines)
10 - Peruse the text file, which will hopefully give you the results you seek.
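Steps 8 and 9 can be sketched at the command line like this (assuming the URLs you pasted ended up in a file called urls.txt - the filenames are just placeholders):

```shell
# Sort the collected URLs alphabetically, then drop duplicate lines.
# uniq only removes *adjacent* duplicates, which is why the sort comes first.
sort urls.txt | uniq > unique-urls.txt

# Equivalent one-step form: sort -u does the sort and the de-duplication together.
sort -u urls.txt > unique-urls.txt
```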
I have used this procedure many times to collect and collate URLs into a list, which I then feed to a script that does a 'bulk' download of the pages on the list (HTML only - leaving all the junk on the servers where it belongs). This allows me to cut down on the amount of data that I have to download.
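I haven't included my script here, but one way to do that kind of bulk download is with wget. A minimal sketch, assuming the de-duplicated list is in a file called unique-urls.txt (a made-up name) and you want the pages dropped into a pages/ directory:

```shell
# Fetch every URL listed in the file, one URL per line.
# --input-file reads the list; --no-directories keeps all the fetched pages
# in one flat directory instead of recreating the site's directory tree;
# --directory-prefix says where to put them. Nothing is fetched recursively,
# so images, scripts and the rest of the junk stay on the server.
wget --input-file=unique-urls.txt --no-directories --directory-prefix=pages/
```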
I hope this advice is helpful.
More information about the plug mailing list