Parsing HTML with Perl

Efficiently manipulate documents on the Web

The need to extract interesting bits of an HTML document comes up often enough that by now we have all seen many ways of doing it wrong and some ways of doing it right for some values of “right”.

One might think that one of the most fascinating answers on Stack Overflow would have put an end to the desire to parse HTML using regular expressions, but time and again that desire proves too tempting.

Let’s say you want to check all the links on a page to identify stale ones, using regular expressions:
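
A minimal sketch of such a script might look like this (the document in the __DATA__ section and the URLs are, of course, made up for illustration):

    #!/usr/bin/env perl
    use strict;
    use warnings;

    # Slurp the document and naively pull out every href with a regex
    my $html = do { local $/; <DATA> };

    while ( $html =~ m{ href \s* = \s* "([^"]+)" }gx ) {
        print "$1\n";
    }

    __DATA__
    <html>
    <head><title>Example</title></head>
    <body>
    <p>Check out
    <!-- <a href="http://www.example.com/broken-link.html">this link</a> -->
    <a href="http://www.example.com/working-link.html">this link</a>
    for the details.</p>
    </body>
    </html>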

In this self-contained example, I put a small document in the __DATA__ section. This example corresponds to a situation where the maintainer of the page commented out a previously broken link, and replaced it with the correct link.

When run, this script produces the output:
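
(with the illustrative document above)

    http://www.example.com/broken-link.html
    http://www.example.com/working-link.html

Both URLs are reported, including the stale one hiding inside the HTML comment.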

It is surprisingly easy to fix using HTML::TokeParser::Simple. Just replace the body of the script above with:
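
Something along these lines (a sketch; HTML::TokeParser::Simple distinguishes real tags from text and comments, so only genuine anchors are seen):

    use HTML::TokeParser::Simple;

    my $parser = HTML::TokeParser::Simple->new( handle => \*DATA );

    # Only real <a> start tags are reported; the commented-out link
    # never shows up as a tag token.
    while ( my $anchor = $parser->get_tag('a') ) {
        next unless $anchor->is_start_tag('a');
        my $href = $anchor->get_attr('href');
        print "$href\n" if defined $href;
    }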

When run, this script correctly prints:
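
(again with the illustrative document)

    http://www.example.com/working-link.html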

And, it looks like we made it much more readable in the process!

Of course, interesting HTML parsing jobs involve more than just extracting links. While even that task can be made ever more complicated for the regular expression jockey by, say, adding some interesting attributes between the a and the href, code using HTML::TokeParser::Simple would not be affected.

Another specialized HTML parsing module is HTML::TableExtract. In most cases, it makes going through tables on a page a breeze. For example, the State Actions to Address Health Insurance Exchanges page contains State Table 2: Snapshot of State Actions and Figures. The contents of this page may change with new developments, so here is a screenshot of the first few lines of the table:

[Screenshot: the first few lines of State Table 2: Snapshot of State Actions and Figures]

Parsing this table using HTML::TableExtract is straightforward:
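
Here is a sketch of what such a script might look like (the header substrings are illustrative; pick substrings that match the column headings of the table you are after):

    use strict;
    use warnings;
    use HTML::TableExtract;
    use LWP::Simple qw(get);

    my $url  = shift @ARGV or die "Usage: $0 URL\n";
    my $html = get($url)   or die "Failed to fetch '$url'\n";

    # Identify the table of interest by substrings of its column headings
    my $te = HTML::TableExtract->new(
        headers => [ 'State', 'Action' ],    # illustrative substrings
    );
    $te->parse($html);

    for my $table ( $te->tables ) {
        for my $row ( $table->rows ) {
            print join( "\t", map { defined($_) ? $_ : q{} } @$row ), "\n";
        }
    }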

Running this script prints each row of the table on its own line, with the cells separated by tabs.

Note that I did not even have to look at the underlying HTML code at all for this script to work. If it hadn't worked, I would have had to delve into that mess to find the specific problem, but in this case, as in many others in my experience, HTML::TableExtract gave me just what I wanted. So long as the substrings I picked continue to match the table's headings, my script will extract the desired columns even if some of the underlying HTML changes.

Both HTML::TokeParser::Simple (based on HTML::PullParser) and HTML::TableExtract (which subclasses HTML::Parser) parse a stream rather than loading the entire document into memory and building a tree. This has made them performant enough for whatever I was able to throw at them in the past.

With HTML::TokeParser::Simple, it is also easy to stop processing a file once you have extracted what you need, as in the sketch below. That helps when you are dealing with thousands of documents, each several megabytes in size, where the interesting content is located towards the beginning. With HTML::TableExtract, performance can be improved by switching to less robust table identifiers such as depths and counts. However, under certain pathological conditions I seem to run into a lot, you may need to play with regexes to first extract the exact region of the HTML source that contains the content of interest.
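
For instance, with a parser like the one above, the loop can bail out as soon as it reaches a marker that comes after the interesting part (the footer id here is hypothetical):

    while ( my $token = $parser->get_token ) {
        # Stop as soon as we hit the (hypothetical) page footer;
        # everything we care about comes before it.
        if ( $token->is_start_tag('div')
            and ( $token->get_attr('id') // q{} ) eq 'footer' ) {
            last;
        }
        # ... handle the tokens of interest here ...
    }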

In one case, I had to process a large set of HTML files, each about 8 MB in size. The interesting table occurred about three-quarters of the way through the HTML source, and it was clearly separated from the rest of the page by <!-- interesting content here --> style comments. In this particular case, slurping each file, extracting the interesting bit, and passing just that content to HTML::TableExtract helped. Throw a little Parallel::ForkManager into the mix, and a task that used to take a few hours went down to less than half an hour.
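
A rough sketch of that combination, with made-up marker comments, file locations, and header substrings:

    use strict;
    use warnings;
    use HTML::TableExtract;
    use Parallel::ForkManager;

    my @files = glob 'reports/*.html';            # hypothetical input files
    my $pm    = Parallel::ForkManager->new(4);    # number of parallel workers

    for my $file (@files) {
        $pm->start and next;                      # fork; the parent moves on

        # Slurp the whole file, then cut out just the region between the
        # marker comments before handing it to HTML::TableExtract.
        open my $fh, '<', $file or die "Cannot open '$file': $!";
        my $html = do { local $/; <$fh> };
        close $fh;

        my ($region) = $html =~ m{
            <!--\ interesting\ content\ here\ -->
            (.*?)
            <!--\ end\ interesting\ content\ -->
        }sx;

        if ( defined $region ) {
            my $te = HTML::TableExtract->new( headers => ['State'] );  # illustrative
            $te->parse($region);
            # ... do something with $te->tables here ...
        }

        $pm->finish;
    }
    $pm->wait_all_children;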

Sometimes, you just need to be able to extract the contents of the third span within the sixth paragraph of the first content div on the right. Especially if you need to extract multiple pieces of information depending on various parts of the document, creating a tree structure will make that task simpler. It may have a huge performance cost, however, depending on the size of the document. Building trees out of the smallest possible HTML fragments can help here.

Once you have the tree structure, you can address individual elements or sets of elements. XPath is a way of addressing those elements. HTML::TreeBuilder builds a tree representation of HTML documents. HTML::TreeBuilder::XPath adds the ability to locate nodes in that representation using XPath expressions. So, if I wanted to get the table of contents of the same document, I could have used something along the lines of:
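
Something along the lines of this sketch (the file name and the XPath expression are illustrative; the real expression depends on how the page marks up its table of contents):

    use strict;
    use warnings;
    use HTML::TreeBuilder::XPath;

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse_file('state-actions.html');    # a saved copy of the page

    # Illustrative XPath: links inside an element whose class mentions "toc"
    for my $link ( $tree->findnodes('//*[contains(@class, "toc")]//a') ) {
        printf "%s => %s\n", $link->as_text, $link->attr('href') // q{};
    }

    $tree->delete;                              # free the tree when done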

Mojo::DOM is an excellent module that uses jQuery-style CSS selectors to address individual elements. It is extremely helpful when dealing with documents where HTML elements, classes, and ids were used in intelligent ways.
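
A tiny sketch of the flavor of it (the markup and the selector are made up):

    use strict;
    use warnings;
    use Mojo::DOM;

    my $dom = Mojo::DOM->new(
        '<div class="toc"><a href="/intro">Intro</a> <a href="/news">News</a></div>'
    );

    # CSS selectors, jQuery style: every anchor inside div.toc
    $dom->find('div.toc a')->each( sub {
        my $link = shift;
        printf "%s => %s\n", $link->text, $link->attr('href');
    } );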

XML::Twig will also work for some HTML documents, but in general, using an XML parser to parse HTML documents found in the wild is perilous. On the other hand, if you do have well-formed documents, or if HTML::Tidy can make them nice, XML::Twig is a joy to use. Unfortunately, it is depressingly common to find documents pretending to be HTML, using a mish-mash of XML and HTML styles, and doing all sorts of things which browsers can accommodate, but XML parsers cannot.

And, if your purpose is just to clean up some wild HTML document, use HTML::Tidy. It gives you an interface to tidyp, a fork of the HTML Tidy library. For really convoluted HTML, it sometimes pays to pass the document through tidyp first before feeding it to one of the higher-level modules.
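
The clean-up step itself is tiny; a sketch:

    use HTML::Tidy;

    my $tidy  = HTML::Tidy->new;
    my $clean = $tidy->clean($html);    # $html holds the messy document
    # $clean can now be handed to XML::Twig, HTML::TreeBuilder::XPath, etc.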

Thanks to others who have built on HTML::Parser, I have never had to write a line of event handler code myself for real work. It is not that they are difficult to write. I do recommend you study the examples bundled with the distribution to see how the underlying machinery works. It is just that the modules others have built on top of and beyond HTML::Parser make life so much easier that I never had to worry much about going to the lowest possible level.

That’s a good thing.

Comments

  • Sillymoos

    I agree it’s simpler to use a parser, but for regex parsing of HTML, have you seen this Tom Christiansen classic: http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234582#answer-4234491

    • A. Sinan Unur (http://www.unur.com/)

      As usual, his treatment is insightful and beautiful. My advice is for the rest of us who’d just like to extract that bit of information from a messy chunk of HTML. As he also notes, most people underestimate the difficulty of parsing HTML with regular expressions and therefore do so poorly.

  • Grant McLean

    I’ve long been a fan of using XPath to extract data from HTML – usually via XML::LibXML’s HTML support. An interesting twist is to add Miyagawa’s HTML::Selector::XPath into the mix (perhaps via something like HTML::Grabber) which allows you to identify the target elements with CSS-style selectors. The syntax is much simpler than XPath and more natural to people familiar with HTML, CSS and jQuery.

  • jpr65

    Hello, this is a nice and easy approach for getting input. For more comfort on the output side, I started developing the “Perl Open Report Framework” last year. Text, HTML, and CSV are already possible. See my web page http://www.jupiter-programs.de/prj_public/porf/example.pl.txt and the next German Perl Workshop (2014).
    Greetings, Ralf.

  • Parthiban Nagarajan

    Hi … What about the Marpa package?

  • Craig Roush

    This is good stuff, but I am curious how you cleaned up all the spaces in the output of the table parser; removing whitespace, tabs, etc. doesn’t seem to do the trick.