[plug] Best platform/language to setup a simple web scraper

Michael Bramwell mbramwell at gmail.com
Wed Apr 3 07:03:59 UTC 2013


I'm a little late to this thread, but if you like learning new things I
highly recommend using Go (golang) on GAE. It's a rather nice language and is
made by the likes of Ken Thompson.

On 13/03/13 10:44 PM, Fred Janon wrote:
> Sounds like a simple enough project for GAE. You can develop your app
> locally and then deploy it with the GAE tools.
>
> Fred
>
> --- On Wed, 3/13/13, Michael Van Delft <michael at hybr.id.au> wrote:
>
>
>     From: Michael Van Delft <michael at hybr.id.au>
>     Subject: Re: [plug] Best platform/language to setup a simple web
>     scraper
>     To: plug at plug.org.au
>     Date: Wednesday, March 13, 2013, 2:58 PM
>
>     Hi Guys,
>
>     Thanks for the advice. Just to clarify why I'd run it on a server:
>     half the reason I'm doing this is that I'd like to get an alert
>     when a new house is posted online. I could just run my saved search
>     every day, but that seems very manual; I'd like something that is
>     set and forget.
>
>     The other half of the reason is learning. I want to know a bit more
>     about what Google App Engine is/does, and I find I learn best if I've
>     got a hands-on project. So far I think I'll go with a Python script
>     and maybe play with BeautifulSoup. I'd say I have a basic
>     understanding of Python, but I'd like to be dangerous.
>
>     Cheers,
>     Michael
>
>     On Wed, Mar 13, 2013 at 8:19 PM, Andrew Elwell
>     <andrew.elwell at gmail.com> wrote:
>     >> To be honest, I'd use whatever language you are naturally
>     >> proficient in. Web scraping is not exactly a black art, or overly
>     >> difficult. I've done plenty of it with wget and bash.
>     >
>     > +1 to this (I used to use Perl before becoming more confident with
>     > BeautifulSoup).
>     >
>     > Test on a local copy of the page first (curl -Lo test.html
>     > http://example.com/foo.html) so you can tweak your parsing without
>     > having to wait for the remote end (especially if page generation
>     > takes a while).
>     >
>     > Use an interactive session with whatever tool you use to tweak, and
>     > keep checking that variables match what you expect.
>     >
>     > Finally, double-check there's not a machine-readable version of the
>     > info you want hidden away on the site. It's nicer to both ends if
>     > you can simply pull in some JSON without having to parse tables. If
>     > the remote site has a decent (ha!) web team, they may be happy to
>     > work with you. YMMV.
>     >
>     > Andrew
>     > _______________________________________________
>     > PLUG discussion list: plug at plug.org.au
>     > http://lists.plug.org.au/mailman/listinfo/plug
>     > Committee e-mail: committee at plug.org.au
>     > PLUG Membership: http://www.plug.org.au/membership
>
>
>
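
For what it's worth, the two ideas quoted above (tweak the parsing against
a saved copy of the page, and only alert on houses not seen before) might
look roughly like the sketch below. It is only an illustration under
assumptions: the `<a class="listing">` markup, the SAMPLE page, and the
names ListingParser, extract_listings and new_listings are all made up
here, not taken from any real site. It uses Python's standard-library
html.parser so it runs without installing anything; BeautifulSoup would
likely make the extraction shorter.

```python
from html.parser import HTMLParser


class ListingParser(HTMLParser):
    """Collect href values from <a class="listing"> tags.

    The "listing" class is a hypothetical example of site markup.
    """

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        # Attribute values can be None (e.g. a bare attribute), so guard.
        classes = (attr.get("class") or "").split()
        if tag == "a" and "listing" in classes and attr.get("href"):
            self.links.append(attr["href"])


def extract_listings(html):
    """Return listing links in the order they appear in the page."""
    parser = ListingParser()
    parser.feed(html)
    parser.close()
    return parser.links


def new_listings(current, seen):
    """Return links in `current` that are not in `seen`, keeping page order."""
    seen = set(seen)
    return [link for link in current if link not in seen]


# Stand-in for a locally saved page; the content is invented for the example.
SAMPLE = """
<html><body>
<a class="listing" href="/house/101">3 bed</a>
<a class="nav" href="/about">About</a>
<a class="listing" href="/house/102">2 bed</a>
</body></html>
"""

if __name__ == "__main__":
    found = extract_listings(SAMPLE)
    # Pretend /house/101 was already reported on an earlier run.
    print(new_listings(found, seen=["/house/101"]))
```

To try it against a real saved page, read the file written by the curl
command (test.html) and pass its contents to extract_listings instead of
SAMPLE. Storing the seen links between runs (a small file would do) and
sending the actual alert are left out on purpose.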


