Terminally Incoherent

Utterly random, incoherent and disjointed rants and ramblings...

Friday, January 27, 2006

Screen Scraping for RSS

I like to read online comics. Unfortunately, some of them do not publish RSS feeds, which is maddening. I ranted about this on Monday. But hey, if they won't make one, I will do it for them.

I wrote a nice little Perl script that screen scrapes a page for an image and then generates an RSS feed. It requires the WWW::Mechanize and XML::RSS modules, which can be downloaded from CPAN or some other repository.
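To give a rough idea of the feed-generation half, here is a minimal sketch in Python (not the Perl script itself). It assumes a one-item RSS 2.0 feed whose description embeds the comic image; the function name build_feed and the channel title are made up for illustration.

```python
import xml.etree.ElementTree as ET

def build_feed(page_url, image_url, title="Comic feed"):
    """Return a one-item RSS 2.0 document as a string (hypothetical helper)."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = page_url
    ET.SubElement(channel, "description").text = "Scraped feed for " + page_url

    # One item pointing at the scraped image. ElementTree escapes the
    # embedded <img> markup when the document is serialized.
    item = ET.SubElement(channel, "item")
    ET.SubElement(item, "title").text = image_url.rsplit("/", 1)[-1]
    ET.SubElement(item, "link").text = image_url
    ET.SubElement(item, "description").text = '<img src="%s" />' % image_url
    return ET.tostring(rss, encoding="unicode")
```

Most feed readers should render the escaped img tag in the description as the actual picture, which is the whole point of the exercise.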

How does it work? You simply call it with:

perl grab.pl url pattern

Where url is the URL of your web comic, and pattern is some string that is unique to the URL of the actual comic image. For example, extralife is easy because the front page image is always current.gif (you can use this as a pattern). DorkTower, on the other hand, uses variable image names, but all the pictures are stored in the /comics/dorktower/images/comics/ directory. Furthermore, none of the advertisement or background images are stored in a dir called comics, so I picked "comics" as a pattern.
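The pattern idea described above can be sketched like this, in Python rather than Perl and using only the standard library. This is not how grab.pl is written; it assumes the pattern is a plain substring checked against each img tag's src attribute as written in the page, and the names ImageCollector and find_comic_image are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class ImageCollector(HTMLParser):
    """Collect the src attribute of every <img> tag, in page order."""
    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)

def find_comic_image(html, base_url, pattern):
    """Return the absolute URL of the first image whose src contains pattern.

    Assumption: simple substring match; returns None if nothing matches.
    """
    collector = ImageCollector()
    collector.feed(html)
    for src in collector.sources:
        if pattern in src:
            return urljoin(base_url, src)
    return None
```

With a page that has a banner in /ads/ and a strip somewhere under /comics/, calling find_comic_image(html, url, "comics") should skip the banner and hand back the strip.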

Essentially, you have to look closely at the source of the page you are scraping once, and pick a good pattern argument. The feed is created in the same directory as the script. To generate the file name, I drop the http:// part from the URL, remove all the slashes and append .xml at the end. I could add another optional argument to specify the feed name, but I don't really care about it. Feel free to do it yourself.
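The file naming rule is simple enough to sketch in a few lines. Again this is Python, not the actual Perl, and feed_filename is a made-up name; it just follows the three steps described above.

```python
def feed_filename(url):
    """Derive the feed file name: drop http://, remove slashes, append .xml."""
    name = url
    if name.startswith("http://"):
        name = name[len("http://"):]
    return name.replace("/", "") + ".xml"
```

So a hypothetical http://www.example.com/comics/ would end up as www.example.comcomics.xml. Ugly, but good enough to keep the feeds apart.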

Just a side note: if you plan on running this on Windows with ActiveState Perl and you use ppm for your module management, make sure you get WWW::Mechanize 1.4 or higher. The 0.72 package that can be downloaded from the ActiveState repository does not support the find_image function I'm using.

You might want to add http://theoryx5.uwinnipeg.ca/ppms/ to the ppm repository list. You can download a more recent version from there.
