Sophie

Sophie

distrib > Mageia > 5 > x86_64 > media > core-release > by-pkgid > dfce8336dd70c3b935b893f091ca354c > files > 3

php-pear-XML_SaxFilters-0.3.0-18.mga5.noarch.rpm

$Id: SaxFiltersGuide,v 1.1.1.1 2003/08/08 22:31:41 harryf Exp $

+++XML_SaxFilters: A Rough Guide
The purpose of SaxFilters is to provide a stucture to make it easy to parse
XML documents with the SAX API. This is a rough guide to working with
Sax Filters

+++What is SAX?
SAX is the Simple API for XML. It is not an official standard but rather an 
approach to parsing XML documents which has proved effective.

The approach used by the SAX API is to regard an XML document as a list of
"events". A SAX parser reads an XML document from start to finished and every
time it encounters an "event", it calls a function or class method which is
has been told is the "handler" for that event.

The SAX API distinguishes between different "parts" of an XML document as
being different types of event. For example the following snippet of an
XML document is equivalent to three events;

<myTag>Hello World!</myTag>

The opening tag, <myTag>, the data between the tags (Hello World!) and the
closing tag </myTag> are all distinguisable events. A SAX parser will generally
have three different handlers defined to deal with these, for example;

function open($name,$attribs) {
    echo ( 'The opening tag name is: '.$name ); // outputs "myTag"
    echo ( 'The tag attributes are:' );
    echo ( '<pre>' );
    print_r($attribs);  // No attributes in this case
    echo ( '</pre>' );
}

data($data) {
    echo ( 'The data is: '.$data ); // output "Hello World!"
}

close($name) {
    echo ( 'The close tag is: '.$name ); // outputs "myTag"
}

Note that other elements of an XML document such as entities may also trigger
seperate events, depending on the parser

+++Parsing with PHP and Expat
PHP comes with it's own SAX parser (James Clarks Expat parser), described at
http://www.php.net/xml. This provides a fully functional (and very reliable)
SAX XML parser.

+++The Problem
The "normal" approach to parsing XML with the native SAX parser in PHP typically
involves implementing a "concrete" parser for a given document, the handlers being
effectively "embedded" in the code that does the parsing.

A common example in PHP is parsing an RSS feed such as described at
http://www.sitepoint.com/print/560, the completed concrete parser being available
at http://www.sitepoint.com/article/560/4

While there's nothing wrong with the approach prescribed, an RSS document
has a fairly simple structure which is also predicable. For more complex XML documents
with a greater depth in nodes or a document where the heirarchy itself
may vary from document to document (e.g. an [X]HTML page), attempting to tie
handlers directly to the parser will result in giant switch statements an
complex if/else conditions and quickly lead to an unmaintainable and easy to break
parser.

+++Sax Filters
The general approach taken by Sax Filters is to seperate the XML event handlers (referred
to as a Filter from now on) from the code which reads the incoming XML "stream" and
triggers the handlers (referred to as the Parser from now on).

By doing so, it's possible to "plug in" a specific set of Filters to a generic Parser,
meaning the coding effort concentrates purely on Filter and not the Parser.

What's more it's possible to chain of multiple Filters together, making it easier
to deal with XML documents with greater depth in nodes.

For example a document like;

<?xml version="1.0"?>
<books>
  <book>
    <title>Programming PHP</title>
    <authors>
      <name>Rasmus Lerdorf</name>
      <name>Kevin Tatroe</name>
    </authors>
  </book>
  <book>
    <title>PHP and MySQL Web Development, Second Edition</title>
    <authors>
      <name>Luke Welling</name>
      <name>Laura Thomson</name>
    </authors>
  </book>
</books>

Using SAX filters the solution might be to define two handlers, one for <book />
elements, say BookFilter and another for <authors /> elements; AuthorsFilter.
The AuthorsFilter would then by chained to the BookFilter, the latter allowing
the AuthorsFilter to look at the events it receives and respond if it finds
events it's interested in handling. From a code perspective, setting this up 
with XML_SaxFilters might look like;

<?php
// Create a steamer for fetching XML
$reader = & new FileReader('books.xml');

// Instantiate the generic parsing, giving it the XML stream
$parser = & new ExpatParser($reader);

// Create the BookFilter object
$books = & new BooksFilter();

// Create the AuthorsFilter object
$authors = & new AuthorsFilter();

// Chain the BookFilter to the parser
$parser->setChild($books);

// Chain the AuthorsFilter to the BookFilter
$books->setChild($authors);

// Parse the document
$parser->parse();
?>

+++XML_SaxFilters specifics
PEAR::XML_SaxFilters differs a little from SaxFilters implemented in
other languages in that it allows for "dynamic" chaining of Filters.
This makes it possible to deal with XML documents of a given format
where the heirarchy of XML elements can vary significantly from document
to document (e.g. an XHTML document). More on that in a moment.

-Readers
From the perspective of how XML_SaxFilters "receives" an XML document,
a set of "Reader" classes has been defined, which implement the
ReaderInterface. These are used to stream the XML data source to a parser,
and make it possible to break up the XML stream so the parser does not
need hold alot of data in memory.

In practice the readers act as a simple Iterator, providing a read() method
to get some data an an inFinal() method which is really only there to help
the native PHP SAX parser, which likes to know when the XML stream has finished.

Right now the following Readers have been defined;

FileReader - this uses PHP's fopen to stream data.
StringReader - streams the data from a string
ListReader - this is NOT meant to used by a parser (see Writers below)
               but only for getting data from a ListWriter

-Writers
The Writer classes, which implement the WriterInterface, provide an 
"outgoing data stream" to which the Filters can "write" data.

Although there's no requirement to use a writer, it makes an effective
mechanism to have a single object which all filters can use, without needing
to resort to global variables.

To fetch data from a Writer class, the getReader() method returns an instance
of one of the Readers (above). This provides the additional option of having
a second (or more) Parser (and Filters) which reads from the outgoing data stream,
should that be required.

The available methods in a writer are write() to add data, getReader() to return
a Reader and close() which is provided specifically for the file writer, to close
the file.

The available writers right now are;
FileWriter - uses fopen to write to a file - getReader() returns a FileReader
StringWriter - writes data to a string - getReader() returns a StringReader
ListWriter - used to write objects to an internal array- getReader() returns
               a ListReader

Note: Will be looking at possibilities to build a writer for XML::Tree at
some point, so there's a writer for building heirarchical data structures.

- Parsers
XML_SaxFilters provides two concrete parsers today, ExpatParser which is
uses the native PHP Sax parser and HTMLSaxParser, which uses PEAR::XML_HTMLSax,
a parser for badly formed formed XML written purely in PHP (be warned,
HTMLSax is alot slower than then native Sax parser - use only when you have to).

The parsers define three handlers, startElementHandler, endElementHandler
and characterDataHandler which when triggered, do nothing but pass on the
event to the filter chained to the parser. These are implemented in the 
AbstractParser class.

The AbstractParser also provides two methods, setChild() and unsetChild() which
are used to chain a filter to the parser and later detach it, respectively.

The ParserInterface class defines two methods, setParserOption() and parse()
which every concrete parser must implement. The ParserInterface is not actually
used but is provided to guide developers writing their own concrete parsers.

Note: the parsers do not check to see if you've already set a child filter
when instructed to parse right now, so make sure you've set one before calling
parse.

- Filters
For the filters, the AbstractFilter provides the following methods;

- setChild(): for chaining another filter to the current one. Note that the
              child has no "awareness" of it's parent unless setParent() is
              used (see below)
- unsetChild(): for removing a chained filter

- setParent(): this connects a child back to the filter it's chained to. This
               allows the child to manipulate it's parent, perhaps detaching itself
               when it's finished filtering an XML element (after it's final
               endElementHandler() method).
- unsetParent(): this breaks the connect from child to parent established with
                 setParent().

- attachToParent(): this is a "shortcut" which allows a child to connect another
                    child to it's parent (assuming setParent() was already used),
                    by calling the parent setChild() method.
- detachFromParent(): another "shortcut" which allows a child to disconnect a child
                      from it's parent.

- setWriter(): this adds a Writer class (see above) to a filter
- getWriter(): gets the Writer

Note that in most cases (fairly simple XML documents) it will only be necessary to
use the setChild() and optionally the setWriter() methods. This rest are used for
more complex XML documents where one filter may instantiate another filter
"internally" depending on the XML events it encounters.

***Dynamic Filtering
For complex XML documents, where the element names are known but not the precise structure,
filters can be used to "launch" new filters based on the encountered elements.

The example provided with XML_SaxFilters, template.php, gives some idea of how this works.

Look at that example in a little detail, the document to be parsed looks like this;

<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<title> Runtime Components Example </title>
</head>

<body>
<h1>Template for building runtime components</h1>
<a href="http://www.php.net">
    <img
        id=phpLogo
        src=http://static.php.net/www.php.net/images/php.gif
        border=0
        alt=PHP
        width=120
        height=67
        hspace=3
        runat=server>
</a>
<p>Tags marked with the attribute runat="server" are parsed into runtime components.</p>
<table>
<caption>The Founding Fathers</caption>
<list id="myList" dataSource="myIterator" runat="server">
    <row>
        <tr>
            <td><item id="first"></td>
            <td><item id="last"></td>
        </tr>
    </row>
</list>
</caption>
</body>
</html>

Now I want my filters to deal with any tag that contains the attribute: runat="server". But
I need different filters depending on what element has that attribute.

First I define the filter which will be chained directly to the parser, HTMLFilter. In the 
constructor I "register" an array which will be used to map element names to the Filter to
be instantiated;

<?php
    function HTMLFilter()
    {
        $this->registry['img']='HTMLImageFilter';
        $this->registry['list']='ListFilter';
    }
?>

Next in the startElementHandler I check to see if this filter already has a child, in
which case I delegate any event to it;

<?php
    function open($name,$attribs)
    {
        if ( isset ( $this->child ) )
        {
            $this->child->startElementHandler($name,$attribs);
        }
?>

Simplifying the next step a little, I check for the runat="server" attribute which,
if I find, tells me I need to instantiate a new filter;

<?php
        else if ( isset($attribs['runat']) && $attribs['runat']=='server' )
        {
            $childFilter = & new $this->registry[$name]();
            $childFilter->setParent($this);
            $childFilter->setWriter($this->getWriter());
            $this->setChild($childFilter);
        }
?>

Above I first find the class name from the registered elements then create an object
using that name. Then I make a connection *from* the child *to* the parent, using the
childs setParent() method. Following that I add the Writer to the child then
make a connection *from* the parent *to* the child using setChild().

Once the child filter is set, the next time HTMLFilter's startElementHandler is called,
it will delegate the event to the child, rather than handling itself.

Looking at one of the child filters, ListFilter, it's worth examing then endElementHandler.
Remember that ListFilter will know about it's parent, HTMLFilter, because I used it's
setParent() method just after it was created.

<?php
    function close($name)
    {
        if ( $name == 'list' )
        {
            $this->detachFromParent();
        }
?>

What the end element handler does is watch for the closing </list> tag which, if
encountered, results in the detachFromParent() method being called, breaking the
connection *from* Parent *to* Child. By doing so the parent filter, HTMLFilter,
will no longer delegate events to the child, returning the responsibility to
it for handling events.

While this may seem confusing at first, once grasped it can be a very powerful
approach to parsing XML, allowing the kind of flexibility with SAX that is
normally only possible with DOM.