MKDoc::XML::TreeBuilder: Builds a parsed tree from XML data

SYNOPSIS

  my @top_nodes = MKDoc::XML::TreeBuilder->process_data ($some_xml);

SUMMARY

MKDoc::XML::TreeBuilder uses MKDoc::XML::Tokenizer to turn XML data into a parsed tree. Basically it smells like an XML parser, looks like an XML parser, and overlaps awfully with XML parsers.

But it's not an XML parser.

XML parsers are required to die if the XML data is not well-formed. MKDoc::XML::TreeBuilder doesn't give a rip: it'll parse whatever as long as it's good enough for it to parse.

XML parsers expand entities. MKDoc::XML::TreeBuilder doesn't. At least not yet.

XML parsers generally support namespaces. MKDoc::XML::TreeBuilder doesn't - and probably won't.
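
Since entities are left alone, text pulled out of the tree can still contain references such as &amp; or &lt;. The sketch below shows one way a caller might expand them afterwards. The helper name expand_predefined_entities is hypothetical and not part of MKDoc::XML::TreeBuilder, and it only handles the five entities predefined by XML; anything else is left untouched.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of MKDoc::XML::TreeBuilder: expands only the
# five entities predefined by XML and leaves every other reference as-is.
my %PREDEFINED = (
    lt   => '<',
    gt   => '>',
    amp  => '&',
    quot => '"',
    apos => "'",
);

sub expand_predefined_entities {
    my ($text) = @_;
    # A single pass over the string, so "&amp;lt;" becomes "&lt;" and not "<".
    $text =~ s/&(lt|gt|amp|quot|apos);/$PREDEFINED{$1}/g;
    return $text;
}

print expand_predefined_entities('Fish &amp; chips &lt; 5 &euro;'), "\n";
```

Doing the substitution in one pass matters: replacing &amp; first and &lt; second would decode "&amp;lt;" twice.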

DISCLAIMER

This module does low-level XML manipulation. It will somehow parse even broken XML and try to do something with it. Do not use it unless you know what you're doing.

API

my @top_nodes = MKDoc::XML::TreeBuilder->process_data ($some_xml);

Returns all the top nodes of the $some_xml parsed tree.

Although the XML spec says that there can be only one top element in an XML file, you have to take two things into account:

1. Pseudo-elements such as XML declarations, processing instructions, and comments.

2. MKDoc::XML::TreeBuilder is not an XML parser; it's not its job to care about the XML specification, so having multiple top elements is just fine.

my @top_nodes = MKDoc::XML::TreeBuilder->process_file ('/some/file.xml');

Same as MKDoc::XML::TreeBuilder->process_data ($some_xml), except that it reads $some_xml from '/some/file.xml'.
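
To make the "several top nodes" point concrete, here is a rough sketch that feeds process_data a fragment with a declaration, a comment and two top elements, then lists what comes back. The describe_node helper is hypothetical, and it assumes that element-like nodes are hash references carrying a _tag key while text segments are plain strings; check MKDoc::XML::Token before relying on that. The module is loaded at run time so the script still runs where it is not installed.

```perl
use strict;
use warnings;
use Scalar::Util qw(reftype);

# Hypothetical helper. Assumption: element-like nodes are hash references
# with a '_tag' key, and text segments are plain strings.
sub describe_node {
    my ($node) = @_;
    my $type = reftype($node) || '';
    return 'text segment' unless $type eq 'HASH';
    return defined $node->{_tag} ? "node <$node->{_tag}>" : 'node (no _tag)';
}

my $xml = '<?xml version="1.0"?><!-- two top elements --><p>One</p><p>Two</p>';

# Load the module at run time so the sketch degrades gracefully without it.
if (eval { require MKDoc::XML::TreeBuilder; 1 }) {
    my @top_nodes = MKDoc::XML::TreeBuilder->process_data($xml);
    print scalar(@top_nodes), " top node(s)\n";
    print '  ', describe_node($_), "\n" for @top_nodes;
}
else {
    print "MKDoc::XML::TreeBuilder is not installed; skipping\n";
}
```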

Returned parsed tree - data structure

I have tried to make MKDoc::XML::TreeBuilder look enormously like HTML::TreeBuilder. So most of this section is stolen and slightly adapted from the HTML::Element man page.

START PLAGIARISM HERE

It may occur to you to wonder what exactly a "tree" is, and how it's represented in memory. Consider this HTML document:

  <html lang='en-US'>
    <head>
      <title>Stuff</title>
      <meta name='author' content='Jojo' />
    </head>
    <body>
      <h1>I like potatoes!</h1>
    </body>
  </html>

Building a syntax tree out of it makes a tree-structure in memory that could be diagrammed as:

                     html (lang='en-US')
                      / \
                    /     \
                  /         \
                head        body
               /\               \
             /    \               \
           /        \               \
         title     meta              h1
          |       (name='author',     |
       "Stuff"    content='Jojo')    "I like potatoes!"

This is the traditional way to diagram a tree, with the "root" at the top, and it's this kind of diagram that people have in mind when they say, for example, that "the meta element is under the head element instead of under the body element". (The same is also said with "inside" instead of "under"; the use of "inside" makes more sense when you're looking at the HTML source.)

Another way to represent the above tree is with indenting:

  html (attributes: lang='en-US')
    head
      title
        "Stuff"
      meta (attributes: name='author' content='Jojo')
    body
      h1
        "I like potatoes!"

Incidentally, diagramming with indenting works much better for very large trees, and is easier for a program to generate. HTML::Element's $tree->dump method uses indentation in just that way.
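
As an illustration of how little code an indented diagram takes, here is a minimal sketch. The tree is built by hand rather than by MKDoc::XML::TreeBuilder, and its shape is an assumption: hash references with _tag and _content keys, every other key treated as an attribute, and text as plain strings. The outline function is a hypothetical helper, not something the module provides.

```perl
use strict;
use warnings;

# Hand-built stand-in for a parsed tree (assumed shape, see the text above).
my $tree = {
    _tag => 'html', lang => 'en-US',
    _content => [
        { _tag => 'head', _content => [
            { _tag => 'title', _content => ['Stuff'] },
            { _tag => 'meta', name => 'author', content => 'Jojo', _content => [] },
        ] },
        { _tag => 'body', _content => [
            { _tag => 'h1', _content => ['I like potatoes!'] },
        ] },
    ],
};

# Hypothetical helper: returns one outline line per node, indented by depth.
sub outline {
    my ($node, $depth) = @_;
    $depth ||= 0;
    my $pad = '  ' x $depth;
    return $pad . '"' . $node . '"' unless ref $node;    # text segment

    # Keys starting with an underscore are structural, the rest are attributes.
    my @names = grep { !/^_/ } sort keys %$node;
    my @attrs = map { sprintf("%s='%s'", $_, $node->{$_}) } @names;
    my $line  = $pad . $node->{_tag};
    $line .= ' (attributes: ' . join(' ', @attrs) . ')' if @attrs;
    return ($line, map { outline($_, $depth + 1) } @{ $node->{_content} || [] });
}

print "$_\n" for outline($tree);
```

Attributes come out in alphabetical order here because a Perl hash does not remember the order in which they appeared in the source.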

However you diagram the tree, it's stored the same in memory: it's a network of objects, each of which has attributes like so:

  element #1:  _tag: 'html'
               _parent: none
               _content: [element #2, element #5]
               lang: 'en-US'

  element #2:  _tag: 'head'
               _parent: element #1
               _content: [element #3, element #4]

  element #3:  _tag: 'title'
               _parent: element #2
               _content: [text segment "Stuff"]

  element #4:  _tag: 'meta'
               _parent: element #2
               _content: none
               name: 'author'
               content: 'Jojo'

  element #5:  _tag: 'body'
               _parent: element #1
               _content: [element #6]

  element #6:  _tag: 'h1'
               _parent: element #5
               _content: [text segment "I like potatoes!"]

The "treeness" of the tree-structure that these elements comprise is not an aspect of any particular object, but is emergent from the relatedness attributes (_parent and _content) of these element-objects and from how you use them to get from element to element.

STOP PLAGIARISM HERE

This is pretty much the kind of data structure MKDoc::XML::TreeBuilder returns. More information on the different node types is available in MKDoc::XML::Token.
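
To show how _parent and _content let you move from element to element, here is a small sketch that rebuilds the six elements listed above by hand and walks upwards from one of them. Everything in it (new_element, add_child, path_to) is hypothetical and written for illustration only; it does not call MKDoc::XML::TreeBuilder, and whether the module fills in _parent on the nodes it returns is something to check against MKDoc::XML::Token.

```perl
use strict;
use warnings;
use Scalar::Util qw(weaken);

# Hypothetical constructor mirroring the listing above: '_tag', '_parent' and
# '_content' are structural keys, any other key is an attribute.
sub new_element {
    my ($tag, %attrs) = @_;
    return { _tag => $tag, _parent => undef, _content => [], %attrs };
}

# Appends a child (element or text string) and sets its '_parent' link.
sub add_child {
    my ($parent, $child) = @_;
    push @{ $parent->{_content} }, $child;
    if (ref $child) {
        $child->{_parent} = $parent;
        weaken($child->{_parent});    # avoid a parent <-> child reference cycle
    }
    return $child;
}

# Follows '_parent' links upwards and returns the tags from the root down.
sub path_to {
    my ($node) = @_;
    my @tags;
    for (my $n = $node; $n; $n = $n->{_parent}) {
        unshift @tags, $n->{_tag};
    }
    return join '/', @tags;
}

my $html  = new_element('html', lang => 'en-US');
my $head  = add_child($html, new_element('head'));
my $title = add_child($head, new_element('title'));
add_child($title, 'Stuff');
my $meta  = add_child($head, new_element('meta', name => 'author', content => 'Jojo'));
my $body  = add_child($html, new_element('body'));
my $h1    = add_child($body, new_element('h1'));
add_child($h1, 'I like potatoes!');

print path_to($meta), "\n";                          # html/head/meta
print scalar @{ $html->{_content} }, " children\n";  # 2 children
```

The _parent link is weakened because a parent that holds its children while each child holds its parent forms a reference cycle, which Perl's reference counting would never free.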

NOTES

Did I mention that MKDoc::XML::TreeBuilder is NOT an XML parser?

AUTHOR

Author: Jean-Michel Hiver

This module is free software and is distributed under the same license as Perl itself. Use it at your own risk.
