[RndTbl] Scrape active web page
ummar143 at shaw.ca
Wed Apr 4 09:30:00 CDT 2012
Thanks for your reply, Trevor.
I re-read your article from Feb 2011. Mechanize + MozRepl looks like a good combination - Mechanize (normally headless) to allow scripts, and MozRepl (appears to allow brief commands telnet style) to show the browser working.
I completed a working prototype using Ruby and Watir and I was able (with difficulty) to port it to an XP machine.
I was not able to get exactly the behaviour I wanted. I wanted the user to launch my program after the browser had displayed the information and all the correct checkboxes, etc. had been selected (a la Ajax). The program would then attach to the open page and scrape it, producing output (possibly on a newly launched second page). I have been able to do this in the past using FireWatir, which uses JSSh to control Firefox, but this only works with Firefox 3 or earlier.
What I have now gives nearly the same result, with the order reversed. The user must launch the program first, which in turn launches the browser. Then the user interacts with the browser as usual while the program 'watches'. An event triggered by the user (in this case leaving the target page and then coming back to it) triggers the scraping.
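For anyone curious, the 'watching' part can be sketched roughly as below. This is not my actual program, just a minimal illustration of the leave-then-return trigger. The class name ReturnWatcher, the method names, and the example URLs are made up for this sketch. The only thing it assumes about the browser object is that it responds to `url`, which a Watir browser does; the exact string comparison of URLs is a simplification that a real page with query strings or redirects may not tolerate.

```ruby
# Hypothetical sketch: detect "user left the target page, then came back".
# `browser` can be any object that responds to #url (e.g. a Watir browser).
class ReturnWatcher
  def initialize(browser, target_url)
    @browser = browser
    @target_url = target_url
    @left_target = false   # becomes true once the user navigates away
  end

  # Check the current URL once. Returns true only on the first check
  # that finds the user back on the target page after having left it.
  def poll
    on_target = (@browser.url == @target_url)
    if on_target && @left_target
      @left_target = false # reset so the next leave/return fires again
      true
    else
      @left_target = true unless on_target
      false
    end
  end

  # Poll forever, yielding the browser each time a return is detected.
  # The block is where the scraping would go.
  def watch(interval: 1.0)
    loop do
      yield @browser if poll
      sleep interval
    end
  end
end
```

With Watir installed, something like `ReturnWatcher.new(Watir::Browser.new, target).watch { |b| puts b.text }` would be the idea, where `target` is the page to be scraped. Polling the URL is the crudest possible trigger, but it needs nothing special from the browser.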
One nice thing is that it should work equally well (or better) with IE, though I haven't tried it.
The Ruby Watir API is by far the easiest way to scrape that I've seen. No need for XPath - I can get anywhere on a complex page I'm scraping in 2 hops, in part of a line of code. For this reason I plan to use Watir even when I don't need to drive the browser.
On 2012-04-02, at 8:15 AM, Trevor Cordes wrote:
> On 2012-03-22 Dan Martin wrote:
>> A programmable browser would be ideal. Does anyone know of one that
>> is multi-platform and can be installed without special services /
>> privileges? Has anyone used XUL for something like this?
> I do this type of thing all the time:
> perl's WWW::Mechanize::Firefox on cpan. It remote controls FF using
> MozRepl. You could probably have it run on Windows in cygwin (MozRepl
> should run natively but perl w/weird modules is tougher).
> I've had to hack the suite a touch to get it to work with complicated
> pages that don't always fire their "done loading" event, though that
> should not be required and in most cases it works well OOTB.
> (And if you need some help, my programming time can be for-hire.)
GP Hospital Practitioner
ummar143 at shaw.ca
answering machine always on