[plug] Best platform/language to setup a simple web scraper

Luke Woollard luke.woollard at osmahi.com
Wed Mar 13 04:41:44 UTC 2013


If you know a little JavaScript and have used jQuery at all, it's
fairly easy to put something simple together in node.js with jsdom and
jQuery.


example-reiwa-title-scraper.js
---
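var jsdom = require('jsdom');

// Fetch the page and inject jQuery so it can be queried with CSS selectors
jsdom.env({
    html: "http://reiwa.com.au/home/default.aspx",
    scripts: ['http://code.jquery.com/jquery-1.6.min.js']
  }, function(err, window){
    // Bail out if the page or the jQuery script could not be loaded
    if (err) { console.error(err); return; }
    var $ = window.jQuery;
    // Text of the link in the site header
    var reiwatitle = $("#wrapReiwaMenuCtrl #header a span").text();
    console.log(reiwatitle);
  });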
---

To get node.js and npm going on Ubuntu quickly:

sudo apt-get install python-software-properties python g++ make
sudo add-apt-repository ppa:chris-lea/node.js
sudo apt-get update
sudo apt-get install nodejs npm
// from https://github.com/joyent/node/wiki/Installing-Node.js-via-package-manager

and then you just install jsdom with

npm install jsdom -g

// -g installs it as a global module. To have it installed locally to
your scraper instead, just run npm install jsdom (without -g) from your
project directory. The local install is the one that require('jsdom')
will find by default.

More info on jsdom is available at https://github.com/tmpvar/jsdom, in
particular on how to avoid fetching resources like images, stylesheets
and scripts.
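
Once the scraper has an address and a price string for each listing,
the filtering for the alert is plain JavaScript. The sketch below is
only an illustration under assumptions: the sample listings, the field
names (address, price) and the helper names parsePrice and
matchListings are made up here and are not taken from the reiwa markup.
It prints nothing when there is no match, so a cron job that mails its
output would stay quiet on those days.

```javascript
// Hypothetical listing data. In a real scraper these objects would be
// built from the elements selected with jQuery; the field names are
// assumptions for this sketch.
var sampleListings = [
  { address: '124 Terrace Road, East Perth', price: '$385,000' },
  { address: '126 Terrace Road, East Perth', price: 'From $420,000' },
  { address: '5/128 Terrace Road, East Perth', price: '$399,000' },
  { address: '9 Example Street, Perth', price: '$350,000' },
  { address: '130 Terrace Road, East Perth', price: 'Offers' }
];

// Pull the first dollar amount out of a price string, or return null
// when the string has no number in it (e.g. "Offers").
function parsePrice(text) {
  var match = /\$?\s*(\d[\d,]*)/.exec(text);
  if (!match) { return null; }
  return parseInt(match[1].replace(/,/g, ''), 10);
}

// Keep listings on the given street whose street number is within
// [minNo, maxNo] and whose price is known and below maxPrice.
// An optional unit prefix such as "5/" is skipped.
function matchListings(listings, street, minNo, maxNo, maxPrice) {
  return listings.filter(function (listing) {
    var m = /^(?:\d+\/)?(\d+)\s+(.+?),/.exec(listing.address);
    if (!m) { return false; }
    var number = parseInt(m[1], 10);
    var price = parsePrice(listing.price);
    return m[2].toLowerCase() === street.toLowerCase() &&
           number >= minNo && number <= maxNo &&
           price !== null && price < maxPrice;
  });
}

var matches = matchListings(sampleListings, 'Terrace Road', 120, 130, 400000);

// Only produce output when something matched.
if (matches.length > 0) {
  matches.forEach(function (listing) {
    console.log(listing.address + ' ' + listing.price);
  });
}
```

Listings without a numeric price are dropped here; whether that is the
right call for "Offers" style listings is a judgment call.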

Kind Regards
Luke John


On Wed, Mar 13, 2013 at 3:16 PM, Michael Van Delft <michael at hybr.id.au> wrote:
> I’ve been using the reiwa website (and others) to look for houses. In
> particular the apartments on 120~130 Terrace Road that sometimes come
> up for < $400,000 but usually sell in a week or less. reiwa has a way
> you can save advanced searches and set up email alerts. Unfortunately
> when nothing matches your search, instead of not sending an email or even
> an email that says “No matches found today” it spams you with a bunch
> of houses that have nothing to do with your search.
>
> I thought, I can fix this: I’ll just set up a simple web scraping script
> to do the job for me and I can have fun learning a new tool at the
> same time. So far the three options that I am looking at are Yahoo
> Pipes, Google App Engine and Scrapy/cron job on a Linode VPS I have.
>
> I’ve never used any of those before so I’m looking for advice. Is
> there something else I should be looking at? Or is there any reason to
> pick one of those methods over another? How would you approach this?
>
> Regards,
> Michael
> _______________________________________________
> PLUG discussion list: plug at plug.org.au
> http://lists.plug.org.au/mailman/listinfo/plug
> Committee e-mail: committee at plug.org.au
> PLUG Membership: http://www.plug.org.au/membership