Pathetic Python Blogging

Dear Lazyweb – can anyone work out why I can’t get useful data out of this page with BeautifulSoup and Python 2.5?

The information is in an HTML table, enclosed by td tags nested in tr tags, and governed by three CSS classes, “flight-data”, “data-head” and “data-row2”. The latter pair are used only within the first. So you would think something like this would work:

for item in soup.findAll('td', {'class': 'flight-data'}):
...output.append(item)

The ellipsis is there only to make the indentation obvious in this post, and soup is, naturally, an instance of BeautifulSoup that’s been fed the webpage as a file-like object. But it doesn’t work: it does grab some of the data, but it also grabs much of the webpage as raw HTML, including the header and a gaggle of JavaScript. And it’s slow, dammit. I can’t be too far off beam, because I’m successfully parsing another very similar website using a near-identical parse command.

I’ve tried various interlocking restrictions, and searching for both data-head and data-row2, but these usually find nothing.

4 Comments on "Pathetic Python Blogging"


  1. Oh yes, as for speed, first narrow the soup down to the table:

    soup = soup.find('table', {'id': 'dgArrivals', 'class': 'flight-data'})

    (the 'id' alone is enough, though)

    If you need more speed, you’d want to use lxml.
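
    To show why scoping to the one table helps, here is a rough sketch of the same idea. It is not BeautifulSoup: it swaps in the standard library's html.parser so it runs with nothing installed, and it assumes Python 3. The sample page is invented; only the id 'dgArrivals' and the class names come from this thread, and TableCells and extract_rows are hypothetical names.

```python
from html.parser import HTMLParser

# Hypothetical sample page: the id and class names come from the post and the
# comment above, but the markup and the flight values are invented.
SAMPLE_HTML = """
<html><head><script>var tracker = 1;</script></head>
<body>
<table id="nav"><tr><td>Home</td><td>Contact</td></tr></table>
<table id="dgArrivals" class="flight-data">
  <tr class="data-head"><td>Flight</td><td>From</td></tr>
  <tr class="data-row2"><td>XX123</td><td>Exampleville</td></tr>
</table>
</body></html>
"""


class TableCells(HTMLParser):
    """Collect the text of <td> cells, but only inside one table chosen by id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.depth = 0        # > 0 while inside the wanted table (counts nesting)
        self.in_cell = False  # True between <td> and </td> in the wanted table
        self.rows = []        # one list of cell strings per <tr>

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            # Enter the wanted table, or track a table nested inside it.
            if self.depth or dict(attrs).get("id") == self.table_id:
                self.depth += 1
        elif self.depth and tag == "tr":
            self.rows.append([])
        elif self.depth and tag == "td" and self.rows:
            self.in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1
        elif tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Text outside the wanted table (scripts, navigation) is ignored.
        if self.depth and self.in_cell:
            self.rows[-1][-1] += data.strip()


def extract_rows(html, table_id="dgArrivals"):
    """Return the cell text of every row in the table with the given id."""
    parser = TableCells(table_id)
    parser.feed(html)
    parser.close()
    return parser.rows
```

    Calling extract_rows(SAMPLE_HTML) on the sample gives just the header row and the one data row; the navigation table and the script never get in. The same narrowing is what the soup.find call above does before you loop over the td tags.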
