SYNOPSIS

  use XML::RSSLite;

  . . .

  parseRSS(\%result, \$content);

  print "=== Channel ===\n",
        "Title: $result{'title'}\n",
        "Desc:  $result{'description'}\n",
        "Link:  $result{'link'}\n\n";

  foreach my $item (@{$result{'item'}}) {
    print "  --- Item ---\n",
          "  Title: $item->{'title'}\n",
          "  Desc:  $item->{'description'}\n",
          "  Link:  $item->{'link'}\n\n";
  }

DESCRIPTION

This module attempts to extract the maximum amount of content from available documents, and is less concerned with XML compliance than alternatives. Rather than rely on XML::Parser, it uses heuristics and good old-fashioned Perl regular expressions. It stores the data in a simple hash structure, and "aliases" certain tags so that when done, you can count on having the minimal data necessary for reconstructing a valid RSS file. This means you get the basic title, description, and link for a channel and its items.
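
The regex-and-alias approach can be illustrated with a small standalone sketch. It does not use XML::RSSLite and is not the module's actual code; the helper names first_field and extract_items are hypothetical, and the patterns only cope with simple, non-nested elements.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::RSSLite): return the content of
# the first simple, non-nested <tag>...</tag> found in a chunk of markup.
sub first_field {
    my ($chunk, $tag) = @_;
    return $chunk =~ m{<\Q$tag\E\b[^>]*>(.*?)</\Q$tag\E>}is ? $1 : undef;
}

# Hypothetical helper: collect title, description and link for each item.
sub extract_items {
    my ($doc) = @_;
    my @items;
    while ($doc =~ m{<item\b[^>]*>(.*?)</item>}gis) {
        my $chunk = $1;
        my %item;
        $item{$_} = first_field($chunk, $_) for qw(title description link);
        # "Alias" <url> to link when <link> is missing or empty.
        $item{link} = first_field($chunk, 'url')
            unless defined $item{link} && length $item{link};
        push @items, \%item;
    }
    return \@items;
}

my $feed = <<'XML';
<channel>
<item><title>First</title><link>http://example.com/1</link></item>
<item><title>Second</title><url>http://example.com/2</url></item>
</channel>
XML

my $items = extract_items($feed);
print "$_->{title}: $_->{link}\n" for @$items;
```

The second item has no <link>, so its <url> value is used instead. This mirrors the idea of aliasing tags so that every item ends up with a link.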

This module extracts more usable links by parsing "scriptingNews" and "weblog" formats in addition to RDF & RSS. It also "sanitizes" the output for best results. The munging includes:

  • Remove HTML tags to leave plain text
  • Remove characters other than 0-9~!@#$%^&*()-+=a-zA-Z[];',.:
  • Remove leading whitespace from URIs
  • Use <url> tags when <link> is empty
  • Use misplaced URLs in <title> when <link> is empty
  • Extract links from <a href=...> if required
  • Limit links to ftp and http(s)
  • Join relative item URLs (beginning with / or #) to the site base
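
A few of the steps above can be sketched in plain Perl. This is an illustration under stated assumptions, not the module's implementation: strip_tags and clean_link are hypothetical names, the base URL is supplied by the caller, and the character filter and the <title>/<a href> fallbacks are left out.

```perl
use strict;
use warnings;

# Hypothetical helper: remove HTML tags to leave plain text.
sub strip_tags {
    my ($text) = @_;
    $text =~ s/<[^>]*>//g;
    return $text;
}

# Hypothetical helper: trim leading whitespace, join relative links
# (beginning with / or #) to the site base, and accept only ftp and
# http(s) links. Returns undef for anything else.
sub clean_link {
    my ($link, $base) = @_;
    $link =~ s/^\s+//;
    if ($link =~ m{^/}) {
        (my $root = $base) =~ s{/+$}{};   # avoid a doubled slash
        $link = $root . $link;
    }
    elsif ($link =~ /^#/) {
        $link = $base . $link;
    }
    return $link =~ m{^(?:ftp|https?)://}i ? $link : undef;
}

my $text = strip_tags('<b>Hello</b>, <i>world</i>');
my $good = clean_link('  /news/1.html', 'http://example.com/');
my $bad  = clean_link('javascript:alert(1)', 'http://example.com/');

print "$text\n";
print "$good\n";
print defined($bad) ? "kept\n" : "rejected\n";
```

Returning undef for a rejected link is a design choice of this sketch; it lets the caller fall back to another source for the link.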

EXPORT

parseRSS($outHashRef, $inScalarRef)

$inScalarRef is a reference to a scalar containing the document to be parsed; its contents will effectively be destroyed. $outHashRef is a reference to the hash in which to store the parsed content.

EXPORTABLE

parsedTree - required

Reference to the hash in which to store the parsed document.

parseThis - required

Reference to scalar containing the document to parse.

topTag - optional

Tag to consider the root node. Leaving this undefined is not recommended.

comments - optional

  • false: comments are removed from parseThis
  • true: comments are left in parseThis
  • array reference: treated as true, and the comments are stored in the referenced array
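
The three settings can be pictured with a standalone sketch. It follows a literal reading of the descriptions above and is not taken from the module's source; handle_comments is a hypothetical name, and the exact behaviour of the real function for an array reference may differ.

```perl
use strict;
use warnings;

# Hypothetical helper (not the module's code): apply the three documented
# settings of the comments argument to a document held by reference.
sub handle_comments {
    my ($doc_ref, $comments) = @_;
    if (ref($comments) eq 'ARRAY') {
        # Array reference: counts as true, so the document is left
        # as-is and each comment body is stored in the array.
        push @$comments, $1 while $$doc_ref =~ /<!--(.*?)-->/gs;
    }
    elsif (!$comments) {
        # False: strip comments from the document.
        $$doc_ref =~ s/<!--.*?-->//gs;
    }
    # Any other true value: leave the document untouched.
    return;
}

my $xml = '<a><!-- one -->x<!-- two --></a>';
my @found;

handle_comments(\$xml, \@found);   # collect comments, keep the document
handle_comments(\$xml, 0);         # now strip them

print scalar(@found), " comments; document: $xml\n";
```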

CAVEATS

This is not a conforming parser. It does not handle the following:

  • <foo bar=">">

  • <foo><bar> <bar></bar> <bar></bar> </bar></foo>

  • <![CDATA[ ]]>

  • Processing instructions (PI)
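
The first two cases are easy to reproduce with regular expressions. The sketch below uses deliberately naive patterns chosen for illustration. They are assumptions, not the patterns the module uses, but they show why such input trips up a regex-based parser.

```perl
use strict;
use warnings;

# A naive tag pattern (an assumption for illustration): a tag name
# followed by everything up to the first ">".
my $tag_re = qr/<(\w+)([^>]*)>/;

# Case 1: a ">" inside an attribute value ends the tag too early.
my $input = '<foo bar=">">';
my ($name, $attrs) = $input =~ $tag_re;
my $rest = substr($input, $+[0]);   # text after the end of the match

print "name: $name\n";
print "attributes:$attrs\n";
print "left over: $rest\n";

# Case 2: with same-named nested elements, a non-greedy match pairs the
# outer opening tag with the first closing tag it meets.
my $nested = '<bar> <bar></bar> <bar></bar> </bar>';
my ($inner) = $nested =~ m{<bar>(.*?)</bar>}s;

print "inner: [$inner]\n";
```

In the first case the attribute value is cut off at the quoted ">" and the tail of the tag is left behind as text. In the second the captured content is only the fragment up to the first closing tag, not the full body of the outer element.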

It is non-validating; without a DTD, the following cannot be properly addressed:

  • entities

  • namespaces

Support for these may or may not arrive in a future release.

SEE ALSO

perl(1), XML::RSS, XML::SAX::PurePerl, XML::Parser::Lite, XML::Parser

AUTHOR

Jerrad Pierce

Scott Thomason

LICENSE

Portions Copyright (c) 2002,2003,2009 Jerrad Pierce, (c) 2000 Scott Thomason. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.