Rick's Tech Talk

All Tech Talk, Most of the Time

Creating RSS Feeds (Part 1 of 3)

I have a separate BLOG on sports at The Sporting News. However, if you wanted to point your RSS feed reader (say, Bloglines or Google Reader) to that BLOG, you'd be out of luck. For some reason, Sporting News doesn't provide RSS feeds for their blogging public. So I set out to create my own RSS feed.

Creating your own RSS feed out of some web page means reading and processing that web page programmatically. You'll need to suck down the page to your computer, walk the HTML of that page, pull out the interesting bits, and assemble them into a valid RSS XML document.
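
To make the end goal concrete, here is a rough sketch of what that last step has to produce: a minimal RSS 2.0 document built from plain Perl data. This is not the code from my script. The helper names (xml_escape, build_rss) and the sample data are made up for illustration, and it uses only core Perl.

```perl
use strict;
use warnings;

# Escape the five characters that are special in XML text and attributes.
sub xml_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;    # must run first so later entities are not re-escaped
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    $s =~ s/'/&apos;/g;
    return $s;
}

# Build a minimal RSS 2.0 document from a channel hash and a list of item hashes.
# (Hypothetical helper, not part of any library.)
sub build_rss {
    my ($channel, @items) = @_;
    my $xml = qq{<?xml version="1.0" encoding="UTF-8"?>\n};
    $xml .= qq{<rss version="2.0">\n<channel>\n};
    # RSS 2.0 requires these three elements on the channel.
    for my $key (qw(title link description)) {
        $xml .= "<$key>" . xml_escape($channel->{$key}) . "</$key>\n";
    }
    for my $item (@items) {
        $xml .= "<item>\n";
        for my $key (qw(title pubDate description)) {
            $xml .= "<$key>" . xml_escape($item->{$key}) . "</$key>\n";
        }
        $xml .= "</item>\n";
    }
    $xml .= "</channel>\n</rss>\n";
    return $xml;
}

# Made-up sample data, for illustration only.
my $rss = build_rss(
    {
        title       => "Example blog",
        link        => "http://example.com/blog",
        description => "Hand-built feed",
    },
    {
        title       => "Hello & welcome",
        pubDate     => "Mon, 01 Jan 2007 12:00:00 GMT",
        description => "First <b>post</b>",
    },
);
print $rss;
```

Escaping matters here because blog text is full of ampersands and angle brackets, and a single unescaped one is enough to make a feed reader reject the whole document.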

The first two steps, sucking down the page and walking its HTML, were done with two Perl modules. (Yes, I used Perl.) WWW::Mechanize sucks down the page, and HTML::TokeParser pulls out the interesting bits. There are lots of code examples that show how to do this ([1] [2]). Here's an excerpt from my script:

  1. my $agent = WWW::Mechanize->new();
  2. $agent->get("http://www.sportingnews.com/blog/rickumali");
  3.  
  4. my $stream = HTML::TokeParser->new(\$agent->content);
  5.  
  6. while (my $div_tag = $stream->get_tag("div")) {
  7.         if ($div_tag->[1]{class} && $div_tag->[1]{class} eq "MBEntry") {
  8.                 my $id = $div_tag->[1]{id};
  9.                 $subject{$id} = get_subject($stream);
  10.                 $pubDate{$id} = get_pubdate($stream);
  11.                 $text{$id} = get_entry($stream);
  12.         }
  13. }
Following along, you can see that I'm getting the web page (line 2), then walking down the HTML (lines 4 and 6). Specifically, I walk down all the div elements, looking for each BLOG post (a div with a class of "MBEntry"). (I got that by looking at the BLOG's HTML.) From there, all that's left is the crying: I pull out the subject, the publication date, and the text itself. I put all of this into some hashes, keyed on the ID of the BLOG post. At the completion of this while loop, I have hashes that I can loop through to construct the RSS XML. I'll save that explanation for another day.
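
The excerpt leaves out the helper subs, so here is a small, self-contained sketch of the same walking technique that you can run without touching the network. Everything specific in it is an assumption: the markup is invented (the real Sporting News page looks different), and text_of is a hypothetical stand-in rather than the get_subject, get_pubdate, or get_entry from feedsn.pl. It needs only HTML::TokeParser.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Made-up markup standing in for one blog post; the real page differs.
my $html = <<'HTML';
<div class="MBEntry" id="post42">
  <h2 class="subject">Opening Day</h2>
  <span class="date">April 2, 2007</span>
  <p class="body">Baseball is back.</p>
</div>
<div class="sidebar">Not a post</div>
HTML

# Hypothetical helper: advance to the next <$tag> whose class is $class,
# then return the trimmed text found before its closing tag.
sub text_of {
    my ($stream, $tag, $class) = @_;
    while (my $t = $stream->get_tag($tag)) {
        # $t->[1] is the hash of attributes on the start tag.
        next unless ($t->[1]{class} || "") eq $class;
        return $stream->get_trimmed_text("/$tag");
    }
    return;    # ran off the end of the document without a match
}

# HTML::TokeParser accepts a reference to a string holding the document.
my $stream = HTML::TokeParser->new(\$html);
my (%subject, %pubDate, %text);

while (my $div_tag = $stream->get_tag("div")) {
    # Skip any div that is not a blog post.
    next unless ($div_tag->[1]{class} || "") eq "MBEntry";
    my $id = $div_tag->[1]{id};
    $subject{$id} = text_of($stream, "h2",   "subject");
    $pubDate{$id} = text_of($stream, "span", "date");
    $text{$id}    = text_of($stream, "p",    "body");
}

print "$_: $subject{$_} ($pubDate{$_})\n" for sort keys %subject;
```

Because the parser is a forward-only stream, each helper call picks up where the previous one stopped. That keeps the code short, but it also means a post missing one of the expected tags would make the helper read ahead into the next post, so a real script should guard against that.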

The full source code is at feedsn.pl.
