[plug] mapping out a website

Arie Hol arie99 at ozemail.com.au
Thu Jun 10 20:24:55 WST 2004



On 10 Jun 2004 at 13:35, David Buddrige wrote:

> Hi all, 
> 
> I have been asked to map out all the pages in a given intranet website.  So 
> for example, given website url: 
> 
> http://abc.com/ 
> 
> They want a list of every url that can be got at from the links on the 
> initial page, sort of like this: 
> 
8<---------------------------snip-------------------------------->8

David, in the past I have used a piece of JavaScript that lists all the URLs contained in a web page.
The URLs are listed in a separate window, which can be saved to disk as a regular HTML page, or you
can highlight and copy the URLs and paste them into a text file. This JavaScript is known as "List all
links" and is a "bookmarklet", available from:

http://www.bookmarklets.com/tools/categor.html

There is an amazing array of these bookmarklets available - some of them are real gems.

What I would suggest you try is:

In a GUI environment
1 - Download the bookmarklet.
2 - Save it as a bookmark that sits on your 'Bookmarks' toolbar (so it's easy to get at)
3 - Access the web page where you want to start gathering URLs.
4 - Click on "List all links"
5 - Highlight and copy all the URLs from the newly opened window
6 - Paste into a new text file
7 - Repeat this sequence for as many web pages as you want to collect URLs from
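
For anyone who would rather script steps 3 to 6 than click through them, here is a minimal sketch in
Python of what a "list all links" tool does: it walks the HTML of one page and collects the href of
every <a> tag. This is not the bookmarklet's own code. The names LinkCollector and list_all_links are
made up for illustration, and the sample page is hypothetical. Relative links are resolved against the
page's URL so that the output is a list of full URLs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL.

    Hypothetical helper for illustration; not part of the bookmarklet.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links such as "/about.html" into full URLs.
                self.links.append(urljoin(self.base_url, value))


def list_all_links(html, base_url):
    """Return every link found in the given HTML text, in page order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


if __name__ == "__main__":
    # Hypothetical page content standing in for a real intranet page.
    sample = '<a href="/about.html">About</a> <a href="http://abc.com/news/">News</a>'
    for url in list_all_links(sample, "http://abc.com/"):
        print(url)
```

The function only looks at one page at a time, just like the bookmarklet, so it still has to be
repeated for each page of interest.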

At the command line
8 - Sort the text file in 'ascending alphabetical order'
9 - Then run the text file through the 'uniq' utility (to remove duplicate lines)
10 - Peruse the text file, which hopefully will give you the results you seek.
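
Steps 8 and 9 are simply 'sort' followed by 'uniq' on the command line. For anyone doing the whole job
in one script, the same effect can be sketched in Python as below. The function names are hypothetical.
One assumption goes beyond what 'uniq' does: blank lines and surrounding whitespace are dropped, since
text pasted from a browser window tends to contain both.

```python
def sort_and_uniq(lines):
    """Return the non-blank lines in ascending order with duplicates removed.

    Assumption: surrounding whitespace is stripped and blank lines are
    dropped, which plain 'sort' and 'uniq' would not do.
    """
    cleaned = {line.strip() for line in lines if line.strip()}
    return sorted(cleaned)


def sort_and_uniq_file(in_path, out_path):
    """Read URLs from in_path, one per line, and write the cleaned list to out_path."""
    with open(in_path, encoding="utf-8") as src:
        urls = sort_and_uniq(src)
    with open(out_path, "w", encoding="utf-8") as dst:
        for url in urls:
            dst.write(url + "\n")
    return urls
```

Note that the ordering here is by character code, so upper-case letters sort before lower-case ones.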

I have used this procedure many times to collect and collate URLs into a list, which I then feed to
a script that does a 'BULK' download of the pages on the list (HTML only - leaving all the junk on
the servers where it belongs). This allows me to cut down on the amount of data that I have to
download.
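
The bulk download script itself is not shown here, so the following is only a sketch of what such a
script might look like in Python, not the script described above. It takes a list of URLs, fetches each
one, and keeps only the responses whose content type is text/html. The names fetch and bulk_download
and the page_0000.html file naming are all assumptions made for illustration.

```python
import os
import urllib.request


def fetch(url):
    """Return (content_type, body_bytes) for url using the standard library."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.headers.get_content_type(), response.read()


def bulk_download(urls, out_dir, fetcher=fetch):
    """Save every URL that serves text/html into out_dir; return the saved paths.

    The fetcher parameter is a hypothetical hook so that the download step
    can be swapped out, for example when trying the function offline.
    """
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for index, url in enumerate(urls):
        try:
            content_type, body = fetcher(url)
        except OSError as err:
            # urllib's URLError and HTTPError are both subclasses of OSError.
            print(f"skipped {url}: {err}")
            continue
        if content_type != "text/html":
            # Leave images, archives and other non-HTML content on the server.
            continue
        path = os.path.join(out_dir, f"page_{index:04d}.html")
        with open(path, "wb") as out:
            out.write(body)
        saved.append(path)
    return saved
```

As written, this still transfers each body before checking its type, so it saves disk space rather
than bandwidth. A script that really wants to cut down on data transferred would need to check the
type before reading the body.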

I hope this advice is helpful. 


