[plug] Best platform/language to setup a simple web scraper

Michael Van Delft michael at hybr.id.au
Wed Mar 13 13:58:50 UTC 2013


Hi Guys,

Thanks for the advice. Just to clarify why I'd run it on a server:
half the reason I'm doing this is that I'd like to get an alert
when a new house is posted online. I could just run my saved search
every day, but that seems very manual; I'd like something that is
set and forget.
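
Roughly, the set-and-forget part could look like the sketch below. It
assumes each listing has some stable ID (a URL would do) and that the
script can keep a small JSON file between runs; the function name and
the file name are made up for illustration, and the alerting itself
(email or whatever) is left out.

```python
import json
from pathlib import Path


def find_new_listings(current_ids, seen_file):
    """Return the IDs that were not seen on earlier runs, then remember them.

    current_ids: listing IDs scraped on this run (hypothetical input).
    seen_file:   path to a JSON file holding IDs from previous runs.
    """
    path = Path(seen_file)
    # First run: no file yet, so nothing has been seen.
    seen = set(json.loads(path.read_text())) if path.exists() else set()

    # Keep the scrape order and drop duplicates within this run.
    new_ids = [i for i in dict.fromkeys(current_ids) if i not in seen]

    # Store the union so the next run only reports genuinely new listings.
    path.write_text(json.dumps(sorted(seen | set(current_ids))))
    return new_ids
```

Run from cron (or a scheduled task on whatever host I end up using),
anything the function returns would be what gets sent as the alert.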

The other half of the reason is learning: I want to know a bit more
about what Google App Engine is/does, and I find I learn best if I've
got a hands-on project. So far I think I'll go with a Python script
and maybe play with BeautifulSoup. I'd say I have a basic
understanding of Python, but I'd like to be dangerous.
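
As a first stab at the parsing side, here is a minimal sketch. It uses
the standard library's html.parser as a stand-in for BeautifulSoup so
it runs with nothing installed, and it assumes the listings show up as
links with class="listing"; that class name and the sample markup are
invented, since I haven't looked at the real page structure yet.

```python
from html.parser import HTMLParser


class ListingLinkParser(HTMLParser):
    """Collect href values from <a> tags carrying the (assumed) 'listing' class."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "listing" in classes and attrs.get("href"):
            self.links.append(attrs["href"])


def extract_listing_links(html_text):
    """Return listing hrefs in the order they appear in html_text."""
    parser = ListingLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Made-up markup standing in for a locally saved copy of the search page.
sample = (
    '<ul>'
    '<li><a class="listing" href="/house/1">One</a></li>'
    '<li><a href="/about">About</a></li>'
    '</ul>'
)
print(extract_listing_links(sample))
```

Feeding it the contents of a page saved locally first means the
selectors can be tweaked without hitting the real site each time.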

Cheers,
Michael

On Wed, Mar 13, 2013 at 8:19 PM, Andrew Elwell <andrew.elwell at gmail.com> wrote:
>> To be honest, I'd use whatever language you are naturally proficient in. Web
>> scraping is not exactly a black art, or overly difficult. I've done plenty
>> of it with wget and bash.
>
> +1 to this (I used to use Perl before becoming more confident with
> BeautifulSoup)
> Test on a local copy of the page first (curl -Lo test.html
> http://example.com/foo.html) so you can tweak your parsing without
> having to wait for the remote end (especially if page generation takes
> a while).
>
> Use an interactive session with whatever tool you use to tweak, and
> keep checking that variables match what you expect.
>
> Finally, double-check there's not a machine-readable version of the
> info you want hidden away on the site. It's nicer to both ends if you
> can simply pull in some JSON without having to parse tables. If the
> remote site has a decent (ha!) web team, they may be happy to work
> with you. YMMV.
>
> Andrew
> _______________________________________________
> PLUG discussion list: plug at plug.org.au
> http://lists.plug.org.au/mailman/listinfo/plug
> Committee e-mail: committee at plug.org.au
> PLUG Membership: http://www.plug.org.au/membership

