xmlEventParse              package:XML              R Documentation

_X_M_L _E_v_e_n_t/_C_a_l_l_b_a_c_k _e_l_e_m_e_n_t-_w_i_s_e _P_a_r_s_e_r

_D_e_s_c_r_i_p_t_i_o_n:

     This is the event-driven or SAX (Simple API for XML) style parser,
     which processes XML without building the tree. Instead, it
     identifies tokens in the stream of characters and passes them to
     handlers which can make sense of them in context. It reads and
     processes the contents of an XML file or string by invoking
     user-level functions associated with different components of the
     XML tree. These components include the beginning and end of XML
     elements, e.g. '<myTag x="1">' and '</myTag>' respectively,
     comments, CDATA (escaped character data), entities, processing
     instructions, etc. This allows the caller to create the
     appropriate data structure from the XML document contents rather
     than the default tree (see 'xmlTreeParse') and so avoids having
     the entire document in memory. This is important for large
     documents, where we would otherwise end up with essentially two
     copies of the data in memory at once, i.e. the tree and the R data
     structure containing the information taken from the tree. When
     dealing with classes of XML documents whose instances could be
     large, this approach is desirable but a little more cumbersome to
     program than the standard DOM (Document Object Model) approach
     provided by 'xmlTreeParse'.

     Note that 'xmlTreeParse' does allow a hybrid style of processing
     in which handlers are applied to nodes in the tree as they are
     being converted to R objects. This is a style of event-driven or
     asynchronous calling.

     In addition to the generic token event handlers such as "begin an
     XML element" (the 'startElement' handler), one can also provide
     handler functions for specific tags/elements such as '<myTag>'
     by giving the handler element the same name as the XML element of
     interest, i.e. '"myTag" = function(x, attrs)'.
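
     As a minimal sketch of the tag-specific handlers described above,
     the following combines a handler named after an element with the
     generic 'startElement' handler. The sample document, the tag names
     and the variables 'seen' and 'generic' are illustrative
     assumptions, not part of the package.

```r
library(XML)

# Hypothetical sample document; the tag names are illustrative only.
doc <- '<doc><myTag x="1"/><other/><myTag x="2"/></doc>'

seen <- character()      # values of the x attribute on <myTag> elements
generic <- character()   # names of elements handled by startElement

handlers <- list(
  # Tag-specific handler, looked up by element name when useTagName = TRUE.
  # attrs defaults to NULL so the handler is safe if called without it.
  myTag = function(name, attrs = NULL, ...) {
    if ("x" %in% names(attrs))
      seen <<- c(seen, attrs[["x"]])
  },
  # Generic handler for elements that have no tag-specific handler.
  startElement = function(name, attrs, ...) {
    generic <<- c(generic, name)
  }
)

invisible(xmlEventParse(doc, handlers = handlers, asText = TRUE,
                        useTagName = TRUE, addContext = FALSE))

print(seen)
print(generic)
```

     Collecting results with '<<-' keeps the sketch short; a closure
     that holds the variables privately is usually tidier in real code.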

_U_s_a_g_e:

     xmlEventParse(file, handlers=xmlEventHandler(), ignoreBlanks = FALSE, addContext=TRUE,
                    useTagName=TRUE, asText = FALSE, trim=TRUE, useExpat=FALSE, isURL = FALSE,
                     state = NULL, replaceEntities = TRUE, validate = FALSE,
                      saxVersion = 1, branches = NULL,
                      useDotNames = length(grep("^\\.", names(handlers))) > 0)

_A_r_g_u_m_e_n_t_s:

    file: the source of the XML content. This can be a string giving
          the name of a file or remote URL, the XML itself, a
          connection object, or a function. If this is a string, and
          'asText' is 'TRUE', the value is the XML content. This allows
          one to read the content separately from parsing without
          having to write it to a file. If 'asText' is 'FALSE' and a
          string is passed for 'file', this is taken as the name of a
          file or remote URI. If one is using the libxml parser (i.e.
          not expat), this can be a URI accessed via HTTP or FTP or a
          compressed local file. If it is the name of a local file, it
          can include '~', environment variables, etc. which will be
          expanded by R. (Note this is not the case in S-Plus, as far
          as I know.)

          If a connection is given, the parser incrementally reads one
          line at a time by calling the function 'readLines' with the
          connection as the first argument (and '1' as the number of
          lines to read).  The parser calls this function each time it
          needs more input.

          If invoking the 'readLines' function to get each line is
          excessively slow or is inappropriate, one can provide a
          function as the value of 'file'. Again, when the XML
          parser needs more content to process, it invokes this
          function to get a string. This function is called with a
          single argument, the maximum size of the string that can be
          returned. The function is responsible for accessing the
          correct connection(s), etc. which is typically done via
          lexical scoping/environments. This mechanism allows the user
          to control how the XML content is retrieved in very general
          ways. For example, one might read from a set of files,
          starting one when the contents of the previous file have been
          consumed. This allows for the use of hybrid connection
          objects.

          Support for connections and functions in this form is only
          provided if one is using libxml2 and not libxml version 1. 

handlers: a closure object that contains functions which will be
          invoked as the XML components in the document are encountered
          by the parser.  The standard function or handler names are
          'startElement()', 'endElement()', 'comment()', 'getEntity()',
          'entityDeclaration()', 'processingInstruction()', 'text()',
          'cdata()', 'startDocument()', and 'endDocument()', or
          alternatively and preferably, these names prefixed with a
          '.', i.e. '.startElement', '.comment', ...

          The call signature for the entityDeclaration function was
          changed in version 1.7-0.  Note that in earlier versions, the
          C routine did not invoke any R function and so no code will
          actually break. Also, we have renamed 'externalEntity' to
          'getEntity'. These were based on the expat parser.

          The new signature is 'c(name = "character", type = "integer",
          content = "", system = "character", public = "character" )'.
          'name' gives the name of the entity being defined. The 'type'
          identifies the type of the entity using the value of a
          C-level enumerated constant used in libxml2, but also gives
          the human-readable form as the name of the single element in
          the integer vector. The possible values are
          '"Internal_General"', '"External_General_Parsed"',
          '"External_General_Unparsed"', '"Internal_Parameter"',
          '"External_Parameter"', '"Internal_Predefined"'.

          If we are dealing with an internal entity, the content will
          be the string containing the value of the entity. If we are
          dealing with an external entity, then 'content' will be a
          character vector of length 0, i.e. empty. Instead, either or
          both of the system and public arguments will be non-empty and
          identify the location of the external content. 'system' will
          be a string containing a URI, if non-empty, and 'public'
          corresponds to the PUBLIC identifier used to identify content
          using an SGML-like approach. The use of PUBLIC identifiers is
          less common. 

ignoreBlanks: a logical value indicating whether text elements made up
          entirely of white space should be ignored (i.e. not passed to
          the handlers) rather than included in the results.

addContext: logical value indicating whether the callback functions in
          'handlers' should be invoked with contextual information
          about the parser and the position in the tree, such as node
          depth, path indices for the node relative to the root, etc.
          If this is 'TRUE', each callback function should support
          '...'.

useTagName: logical value indicating whether the callback mechanism
          should look for a function matching the tag name in the
          'startElement' and 'endElement' events, before calling the
          default handler functions. This allows the caller to handle
          different element types for a particular DTD with their own
          functions directly, rather than performing a second dispatch
          in 'startElement()'.

  asText: logical value indicating that the first argument, 'file',
          should be treated as the XML text to parse, not the name of
          a file. This allows the contents of documents to be retrieved
          from different sources (e.g. HTTP servers, XML-RPC, etc.)
          and still use this parser.

    trim: whether to strip white space from the beginning and end of
          text strings. 

useExpat: a logical value indicating whether to use the expat SAX
          parser, or to default to the libxml parser. If this is
          'TRUE', the library must have been compiled with support for
          expat. See 'supportsExpat'.

   isURL: indicates whether the 'file' argument refers to a URL
          (accessible via ftp or http) or a regular file on the system.
          If 'asText' is 'TRUE', this should not be specified.

   state: an optional S object that is passed to the callbacks and can
          be modified to communicate state between the callbacks. If
          this is given, each callback should accept an argument named
          '.state' and should return an object that will be used as
          the updated value of this state object. The new value can be
          any S object and will be passed to the next callback, where
          again it will be updated by that function's return value, and
          so on.  If this is not specified in the call to
          'xmlEventParse', no '.state' argument is passed to the
          callbacks. This makes the interface compatible with previous
          releases.

replaceEntities: logical value indicating whether to substitute entity
          references with their text directly. This should be left as
          'FALSE'. The text still appears as the value of the node, but
          there is more information about its source, allowing the
          parse to be reversed with full reference information. 

saxVersion: an integer value which should be either 1 or 2. This
          specifies which SAX interface to use in the C code. The
          essential difference is the number of arguments passed to the
          'startElement' handler function(s). With version 2, in
          addition to the name of the element and the named-attributes
          vector, two additional arguments are provided. The first of
          these identifies the namespace of the element. This is a
          named character vector of length 1, with the value being the
          URI of the namespace and the name being the prefix that
          identifies that namespace within the document. For example,
          'xmlns:r="http://www.r-project.org"' would be passed as 'c(r
          = "http://www.r-project.org")'. If there is no prefix because
          the namespace is being used as the default, the result of
          calling 'names' on the string is '""'. The fourth argument
          gives the collection of all the namespaces defined within
          this element. Again, this is a named character vector.

validate: a logical indicating whether to use a validating parser or
          not, or in other words whether to check the contents against
          the DTD specification. If this is 'TRUE', warning messages
          will be displayed about errors in the DTD and/or document,
          but the parsing will proceed except for the presence of
          terminal errors. Currently, this has no effect as the
          libxml2 parser uses a document structure to do validation.

branches: a named list of functions. Each element identifies an XML
          element name. If an XML element of that name is encountered
          in the SAX stream, the stream is processed until the end of
          that element and an internal node (see 'xmlTreeParse' and its
          'useInternalNodes' parameter) is created. The function in our
          branches list corresponding to this XML element is then
          invoked with the (internal) node as the only argument. This
          allows one to use the DOM model on a sub-tree of the entire
          document and thus use both SAX and DOM together to get the
          efficiency of SAX and the simpler programming model of DOM.

          Note that the branches mechanism works top-down and does not
          work for nested tags. If one specifies an element name in the
          'branches' argument, e.g. myNode, and there is a nested
          myNode instance within a branch, the branches handler will
          not be called for that nested instance. If there is an
          instance where this is problematic, please contact the
          maintainer of this package. 

useDotNames: a logical value indicating whether to use the newer format
          for identifying general element function handlers with the
          '.' prefix, e.g. '.text', '.comment', '.startElement'. If
          this is 'FALSE', then the older format 'text', 'comment',
          'startElement', ... is used. This causes problems when there
          are indeed nodes named text or comment or startElement, as a
          node-specific handler is then confused with the corresponding
          general handler of the same name. Using 'TRUE' means that
          your list of handlers should have names that use the '.'
          prefix for these general element handlers. This is the
          preferred way to write new code.
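
     To illustrate how the 'ignoreBlanks', 'trim' and 'useDotNames'
     arguments fit together, here is a small sketch that collects the
     text content of a document with a '.text' handler. The sample
     document and the variable 'texts' are assumptions made for
     illustration, and the number of pieces the text arrives in is up
     to the underlying parser.

```r
library(XML)

# Hypothetical sample document with white space around the text.
doc <- "<a>\n  <b> hello </b>\n  <b>world</b>\n</a>"

texts <- character()

handlers <- list(
  # The '.' prefix marks this as the general text handler, so it cannot
  # be confused with a handler for an element that is named <text>.
  .text = function(content, ...) {
    texts <<- c(texts, content)
  }
)

# trim strips leading and trailing white space from each text string;
# ignoreBlanks asks the parser to skip strings that are only white space.
invisible(xmlEventParse(doc, handlers = handlers, asText = TRUE,
                        ignoreBlanks = TRUE, trim = TRUE,
                        useTagName = FALSE, addContext = FALSE,
                        useDotNames = TRUE))

print(texts)
```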

_D_e_t_a_i_l_s:

     This is now implemented using the libxml parser. Originally, this
     was implemented via the Expat XML parser by Jim Clark (<URL:
     http://www.jclark.com>).

_V_a_l_u_e:

     The return value is the 'handlers' argument. It is assumed that
     this is a closure and that the callback functions have manipulated
     variables local to it and that the caller knows how to extract
     this.
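
     A minimal sketch of this closure pattern follows. The generator
     'tagCounter', its 'counts' accessor and the sample document are
     hypothetical names introduced here for illustration. The handlers
     share a private variable, and the caller extracts it afterwards
     through the accessor.

```r
library(XML)

# Hypothetical generator: the returned functions share the private
# variable 'counts' through their common enclosing environment.
tagCounter <- function() {
  counts <- integer()
  list(
    # General start-of-element handler: tally elements by name.
    .startElement = function(name, attrs, ...) {
      previous <- if (name %in% names(counts)) counts[[name]] else 0L
      counts[[name]] <<- previous + 1L
    },
    # Accessor called by the user after parsing, not by the parser
    # (useTagName = FALSE below, so a <counts> element would not hit it).
    counts = function() counts
  )
}

doc <- "<a><b/><b/><c/></a>"   # illustrative document
h <- tagCounter()
result <- xmlEventParse(doc, handlers = h, asText = TRUE,
                        useTagName = FALSE, addContext = FALSE,
                        useDotNames = TRUE)

# No 'state' was supplied, so the handlers are returned and the
# accessor can be called on the result.
print(result$counts())
```

     Because the functions are closures, calling 'h$counts()' on the
     original object gives the same answer as calling it on the
     returned value.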

_N_o_t_e:

     The libxml parser can read URLs via http or ftp. It does not
     require the support of wget as used in other parts of R, but uses
     its own facilities to connect to remote servers.

     The idea for the hybrid SAX/DOM mode where we consume tokens in
     the stream to create an entire node for a sub-tree of the document
     was first suggested to me by Seth Falcon at the Fred Hutchinson
     Cancer Research Center.  It is similar to the XML::Twig module in
     Perl by Michel Rodriguez.

_A_u_t_h_o_r(_s):

     Duncan Temple Lang

_R_e_f_e_r_e_n_c_e_s:

     <URL: http://www.w3.org/XML>, <URL: http://www.jclark.com/xml>

_S_e_e _A_l_s_o:

     xmlTreeParse

_E_x_a_m_p_l_e_s:

      fileName <- system.file("exampleData", "mtcars.xml", package="XML")

        # Print the name of each XML tag encountered at the beginning of each
        # tag.
        # Uses the libxml SAX parser.
      xmlEventParse(fileName,
                     list(startElement=function(name, attrs){
                                         cat(name,"\n")
                                       }),
                     useTagName=FALSE, addContext = FALSE)

     ## Not run: 
       # Parse the text rather than a file or URL by reading the URL's contents
       # and making it a single string. Then call xmlEventParse
     xmlURL <- "http://www.omegahat.org/Scripts/Data/mtcars.xml"
     xmlText <- paste(readLines(xmlURL), collapse = "\n")
     xmlEventParse(xmlText, asText=TRUE)
     ## End(Not run)

         # Using a state object to share mutable data across callbacks
     f <- system.file("exampleData", "gnumeric.xml", package = "XML")
     zz <- xmlEventParse(f,
                         handlers = list(startElement=function(name, atts, .state) {
                                                          .state = .state + 1
                                                          print(.state)
                                                          .state
                                                      }), state = 0)
     print(zz)



         # Illustrate the startDocument and endDocument handlers.
     xmlEventParse(fileName,
                    handlers = list(startDocument = function() {
                                                      cat("Starting document\n")
                                                    },
                                    endDocument = function() {
                                                      cat("ending document\n")
                                                  }),
                    saxVersion = 2)



     if(libxmlVersion()$major >= 2) {

      startElement = function(x, ...) cat(x, "\n")

      xmlEventParse(file(f), handlers = list(startElement = startElement))

      # Parse with a function providing the input as needed.
      xmlConnection = 
       function(con) {

        if(is.character(con))
          con = file(con, "r")
       
        if(!isOpen(con, "r"))
          open(con, "r")

        function(len) {

          if(len < 0) {
             close(con)
             return(character(0))
          }

          # Read lines until a non-empty one is found or the input
          # is exhausted.
          x = character(0)
          tmp = ""
          while(length(tmp) > 0 && nchar(tmp) == 0) {
            tmp = readLines(con, 1)
            if(length(tmp) == 0)
              break
            if(nchar(tmp) == 0)
              x = append(x, "\n")
            else
              x = tmp
          }
          if(length(tmp) == 0)
            return(tmp)

          x = paste(x, collapse="")

          x
        }
      }

      ff = xmlConnection(f)
      xmlEventParse(ff, handlers = list(startElement = startElement))

       # Parse from a connection. Each time the parser needs more input, it
       # calls readLines(<con>, 1)
      xmlEventParse(file(f),  handlers = list(startElement = startElement))

       # using SAX 2
      h = list(startElement = function(name, attrs, namespace, allNamespaces){ 
                                      cat("Starting", name,"\n")
                                      if(length(attrs))
                                          print(attrs)
                                      print(namespace)
                                      print(allNamespaces)
                              },
               endElement = function(name, uri) {
                               cat("Finishing", name, "\n")
                 }) 
      xmlEventParse(system.file("exampleData", "namespaces.xml", package="XML"), handlers = h, saxVersion = 2)

      # This example is not very realistic but illustrates how to use the
      # branches argument. It forces the creation of complete nodes for
      # elements named <b> and extracts the id attribute.
      # This could be done directly on the startElement, but this just
      # illustrates the mechanism.
      filename = system.file("exampleData", "branch.xml", package="XML")
      b.counter = function() {
                     nodes <- character()
                     f = function(node) { nodes <<- c(nodes, xmlGetAttr(node, "id"))}
                     list(b = f, nodes = function() nodes)
                  }

       b = b.counter()
       invisible(xmlEventParse(filename, branches = b["b"]))
       b$nodes()

       filename = system.file("exampleData", "branch.xml", package="XML")
        
       invisible(xmlEventParse(filename, branches = list(b = function(node) {print(names(node))})))
       invisible(xmlEventParse(filename, branches = list(b = function(node) {print(xmlName(xmlChildren(node)[[1]]))})))
     }

