getNodeSet                package:XML                R Documentation

_F_i_n_d _m_a_t_c_h_i_n_g _n_o_d_e_s _i_n _a_n _i_n_t_e_r_n_a_l _X_M_L _t_r_e_e/_D_O_M

_D_e_s_c_r_i_p_t_i_o_n:

     These functions provide a way to find XML nodes that match a
     particular criterion. It uses the XPath syntax and allows quite
     powerful expressions for identifying nodes of interest.  The XPath
     language requires some knowledge, but tutorials are available on
     the Web and in books. XPath queries can result in different types
     of values such as numbers, strings, and node sets. It allows
     simple identification of nodes by name, by path (i.e. hierarchies
     or sequences of node-child-child...), with a particular attribute 
     or matching a particular attribute with a given value. It also
     supports functionality for navigating nodes in the tree within a
     query (e.g. 'ancestor()', 'child()', 'self()'), and also for
     manipulating the content of one or more nodes (e.g. 'text'). And
     it allows for criteria identifying nodes by position, etc. using
     some counting operations.  Combining  XPath with R allows for
     quite flexible node identification and manipulation. XPath offers
     an alternative way to find nodes of interest  than recursively or
     iteratively navigating the entire tree in R and performing the
     navigation explicitly.

     The set of matching nodes corresponding to an XPath expression are
     returned in R as a list.  One can then  iterate over these
     elements to process the  nodes in whatever way one wants.
     Unfortunately, this involves two loops - one in the XPath query
     over the entire tree, and another in R. Typically, this is fine as
     the number of matching nodes is reasonably small. However, if
     repeating this on numerous files, speed may become an issue. We
     can avoid the second loop (i.e. the one in R) by applying a
     function to each node before it is returned to R as part of the
     node set.  The result of the function call is then returned,
     rather than the node itself.

     One can provide an R expression rather than an R function for
     'fun'. This is expected to be a call and the first argument of the
     call will be replaced with the node.

     Dealing with expressions that relate to the default namespaces in
     the XML document can be confusing. 

     'xpathSApply' is a version of 'xpathApply' which attempts to
     simplify the result if it can be converted to a vector or matrix
     rather than left as a list. In this way, it has the same
     relationship to  'xpathApply' as 'sapply' has to  'lapply'.

     'matchNamespaces' is a separate function that is used to
     facilitate specifying the mappings from namespace prefix  used in
     the XPath expression and their definitions, i.e. URIs, and
     connecting these with the namespace definitions in the target XML
     document in which the XPath expression will be evaluated.

     'matchNamespaces' uses rules that are very slightly awkard or
     specifically involve a special case. This is because this mapping
     of namespaces from  XPath to XML targets is difficult, involving
     prefixes in the XPath expression, definitions in the XPath
     evaluation context and matches of URIs with those in the XML
     document. The function aims to avoid having to specify all the
     prefix=uri pairs by using "sensible" defaults and also matching
     the prefixes in the XPath expression to the  corresponding
     definitions in the XML document.

     The rules are as follows. 'namespaces' is a character vector. Any
     element that has a non-trivial name (i.e. other than "") is left
     as is and the name and value define the prefix = uri mapping. Any
     elements that have a trivial name (i.e. no name at all or "") are
     resolved by first matching the prefix to those of the defined
     namespaces anywhere within the target document, i.e. in any node
     and not just the root one. If there is no match for the first
     element of the 'namespaces' vector, this is treated specially and
     is mapped to the default namespace of the target document. If
     there is no default namespace defined, an error occurs.

     It is best to give explicit the argument in the form 'c(prefix =
     uri, prefix = uri)'. However, one can use the same namespace
     prefixes as in the document if one wants.  And one can use an
     arbitrary namespace prefix for the default namespace URI of the
     target document provided it is the first element of 'namespaces'.

     See the 'Details' section below for some more information.

_U_s_a_g_e:

     getNodeSet(doc, path, namespaces = xmlNamespaceDefinitions(doc, simplify = TRUE), 
                         fun = NULL, ...)
     xpathApply(doc, path, fun, ... , namespaces =  xmlNamespaceDefinitions(doc, simplify = TRUE),
                         resolveNamespaces = TRUE)
     xpathSApply(doc, path, fun = NULL, ... , namespaces = xmlNamespaceDefinitions(doc, simplify = TRUE),
                         resolveNamespaces = TRUE, simplify = TRUE)
     matchNamespaces(doc, namespaces,
                                    nsDefs = xmlNamespaceDefinitions(doc, recursive = TRUE),
                                      defaultNs = getDefaultNamespace(doc))

_A_r_g_u_m_e_n_t_s:

     doc: an object of class 'XMLInternalDocument'

    path: a string (character vector of length 1) giving the XPath
          expression to evaluate.

namespaces: a named character vector giving the namespace prefix and
          URI pairs that are to be used in the XPath expression and
          matching of nodes. The prefix is just a simple string that
          acts as a short-hand  or alias for the URI that is the unique
          identifier for the namespace. The URI is the element in this
          vector and the prefix is the corresponding element name. One
          only needs to specify the namespaces in the XPath expression
          and for the nodes of interest rather than requiring all the
          namespaces for the entire document. Also note that the prefix
          used in this vector is local only to the path. It does not
          have to be the same as the prefix used in the document to
          identify the namespace. However, the URI in this argument
          must be identical to the target namespace URI in the
          document.  It is the namespace URIs that are matched
          (exactly) to find correspondence. The prefixes are used only
          to refer to that URI. 

     fun: a function object, or an expression or call, which is used
          when the result is a node set and evaluated for each node
          element in the node set.  If this is a call, the first
          argument is replaced  with the current node. 

     ...: any additional arguments to be passed to 'fun' for each node
          in the node set.

resolveNamespaces: a logical value indicating whether to process the
          collection of namespaces and resolve those that have no name
          by looking in the default namespace and the namespace
          definitions within the target document to match by prefix.

  nsDefs: a list giving the namespace definitions in which to match any
          prefixes. This is typically computed directly from the target
          document and the default value is most appropriate.

defaultNs: the default namespace prefix-URI mapping given as a named
          character vector. This is not a namespace definition object.
          This is used when matching a simple prefix that has no
          corresponding entry in 'nsDefs' and is the first element in
          the 'namespaces' vector. 

simplify: a logical value indicating whether the function should
          attempt to perform the simplification of the result into a
          vector rather  than leaving it as a list. This is the same as
          'sapply' does in comparison to 'lapply'. 

_D_e_t_a_i_l_s:

     When a namespace is defined on a node in the XML document, an
     XPath expressions must use a namespace, even if it is the default
     namespace for the XML document/node. For example, suppose we have
     an XML document '<help
     xmlns="http://www.r-project.org/Rd"><topic>...</topic></help>'  
     To find all the topic nodes, we might want to use the XPath
     expression '"/help/topic"'. However, we must use an explicit
     namespace prefix that is associated with the URI
     'http://www.r-project.org/Rd' corresponding to the one in the XML
     document. So we would use 'getNodeSet(doc, "/r:help/r:topic", c(r
     = "http://www.r-project.org/Rd"))'.

     As described above, the functions attempt to allow the namespaces
     to be specified easily by the R user and matched to the namespace
     definitions in the target document.

     This calls the libxml routine 'xmlXPathEval'.

_V_a_l_u_e:

     The results can currently be different based on the returned value
     from the XPath expression evaluation: 

    list: a node set

 numeric: a number

 logical: a boolean

character: a string, i.e. a single character element.


     If 'fun' is supplied and the result of the XPath query is a node
     set,  the result in R is a list.

_N_o_t_e:

     In order to match nodes in the default name space for documents
     with a non-trivial default namespace, e.g. given as
     'xmlns="http://www.omegahat.org"', you will need to use a prefix
     for the default namespace in this call. When specifying the
     namespaces, give a name - any name - to the default namespace URI
     and then use this as the prefix in the XPath expression, e.g.
     'getNodeSet(d, "//d:myNode", c(d = "http://www.omegahat.org"))' to
     match myNode in the default name space 'http://www.omegahat.org'.

     This default namespace of the document is now computed for us and
     is the default value for the namespaces argument. It can be
     referenced using the prefix 'd', standing for default but
     sufficiently short to be easily used within the XPath expression.

     More of the XPath functionality provided by libxml can and may be
     made available to the R package. Facilities such as compiled XPath
     expressions, functions, ordered node information are examples.

     Please send requests to the package maintainer.

_A_u_t_h_o_r(_s):

     Duncan Temple Lang <duncan@wald.ucdavis.edu>

_R_e_f_e_r_e_n_c_e_s:

     <URL: http://xmlsoft.org>,  <URL: http://www.w3.org/xml> <URL:
     http://www.w3.org/TR/xpath> <URL: http://www.omegahat.org/RSXML>

_S_e_e _A_l_s_o:

     'xmlTreeParse' with 'useInternalNodes' as 'TRUE'.

_E_x_a_m_p_l_e_s:

      doc = xmlTreeParse(system.file("exampleData", "tagnames.xml", package = "XML"), useInternalNodes = TRUE)
      
      els = getNodeSet(doc, "/doc//a[@status]")
      sapply(els, function(el) xmlGetAttr(el, "status"))

        # use of namespaces on an attribute.
      getNodeSet(doc, "/doc//b[@x:status]", c(x = "http://www.omegahat.org"))
      getNodeSet(doc, "/doc//b[@x:status='foo']", c(x = "http://www.omegahat.org"))

        # Because we know the namespace definitions are on /doc/a
        # we can compute them directly and use them.
      nsDefs = xmlNamespaceDefinitions(getNodeSet(doc, "/doc/a")[[1]])
      ns = structure(sapply(nsDefs, function(x) x$uri), names = names(nsDefs))
      getNodeSet(doc, "/doc//b[@omegahat:status='foo']", ns)[[1]]

      # free(doc) 

      #####
      f = system.file("exampleData", "eurofxref-hist.xml.gz", package = "XML") 
      e = xmlTreeParse(f, useInternal = TRUE)
      ans = getNodeSet(e, "//o:Cube[@currency='USD']", "o")
      sapply(ans, xmlGetAttr, "rate")

       # or equivalently
      ans = xpathApply(e, "//o:Cube[@currency='USD']", xmlGetAttr, "rate", namespaces = "o")
      # free(e)


       # Using a namespace
      f = system.file("exampleData", "SOAPNamespaces.xml", package = "XML") 
      z = xmlTreeParse(f, useInternal = TRUE)
      getNodeSet(z, "/a:Envelope/a:Body", c("a" = "http://schemas.xmlsoap.org/soap/envelope/"))
      getNodeSet(z, "//a:Body", c("a" = "http://schemas.xmlsoap.org/soap/envelope/"))
      # free(z)

       # Get two items back with namespaces
      f = system.file("exampleData", "gnumeric.xml", package = "XML") 
      z = xmlTreeParse(f, useInternalNodes = TRUE)
      getNodeSet(z, "//gmr:Item/gmr:name", c(gmr="http://www.gnome.org/gnumeric/v2"))

      #free(z)

      #####
      # European Central Bank (ECB) exchange rate data

       # Data is available from "http://www.ecb.int/stats/eurofxref/eurofxref-hist.xml"
       # or locally.

      uri = system.file("exampleData", "eurofxref-hist.xml.gz", package = "XML")
      doc = xmlTreeParse(uri, useInternalNodes = TRUE)

        # The default namespace for all elements is given by
      namespaces <- c(ns="http://www.ecb.int/vocabulary/2002-08-01/eurofxref")

          # Get the data for Slovenian currency for all time periods.
          # Find all the nodes of the form <Cube currency="SIT"...>

      slovenia = getNodeSet(doc, "//ns:Cube[@currency='SIT']", namespaces )

         # Now we have a list of such nodes, loop over them 
         # and get the rate attribute
      rates = as.numeric( sapply(slovenia, xmlGetAttr, "rate") )
         # Now put the date on each element
         # find nodes of the form <Cube time=".." ... >
         # and extract the time attribute
      names(rates) = sapply(getNodeSet(doc, "//ns:Cube[@time]", namespaces ), 
                           xmlGetAttr, "time")

         #  Or we could turn these into dates with strptime()
      strptime(names(rates), "%Y-%m-%d")

        #  Using xpathApply, we can do
      rates = xpathApply(doc, "//ns:Cube[@currency='SIT']", xmlGetAttr, "rate", namespaces = namespaces )
      rates = as.numeric(unlist(rates))

        # Using an expression rather than  a function and ...
      rates = xpathApply(doc, "//ns:Cube[@currency='SIT']", quote(xmlGetAttr(x, "rate")), namespaces = namespaces )

      #free(doc)

        #
       uri = system.file("exampleData", "namespaces.xml", package = "XML")
       d = xmlTreeParse(uri, useInternalNodes = TRUE)
       getNodeSet(d, "//c:c", c(c="http://www.c.org"))

       getNodeSet(d, "/o:a//c:c", c("o" = "http://www.omegahat.org", "c" = "http://www.c.org"))

        # since http://www.omegahat.org is the default namespace, we can
        # just the prefix "o" to map to that.
       getNodeSet(d, "/o:a//c:c", c("o", "c" = "http://www.c.org"))

        # the following, perhaps unexpectedly but correctly, returns an empty
        # with no matches
        
       getNodeSet(d, "//defaultNs", "http://www.omegahat.org")

        # But if we create our own prefix for the evaluation of the XPath
        # expression and use this in the expression, things work as one
        # might hope.
       getNodeSet(d, "//dummy:defaultNs", c(dummy = "http://www.omegahat.org"))

        # And since the default value for the namespaces argument is the
        # default namespace of the document, we can refer to it with our own
        # prefix given as 
       getNodeSet(d, "//d:defaultNs", "d")

        # And the syntactic sugar is 
       d["//d:defaultNs", namespaces = "d"]

        # this illustrates how we can use the prefixes in the XML document
        # in our query and let getNodeSet() and friends map them to the
        # actual namespace definitions.
        # "o" is used to represent the default namespace for the document
        # i.e. http://www.omegahat.org, and "r" is mapped to the same
        # definition that has the prefix "r" in the XML document.

       tmp = getNodeSet(d, "/o:a/r:b/o:defaultNs", c("o", "r"))
       xmlName(tmp[[1]])

       #free(d)

        # Work with the nodes and their content (not just attributes) from the node set.
        # From bondsTables.R in examples/

       doc = htmlTreeParse("http://finance.yahoo.com/bonds/composite_bond_rates", useInternalNodes = TRUE)
       if(is.null(xmlRoot(doc))) 
          doc = htmlTreeParse("http://finance.yahoo.com/bonds", useInternalNodes = TRUE)

          # Use XPath expression to find the nodes 
          #  <div><table class="yfirttbl">..
          # as these are the ones we want.

       if(!is.null(xmlRoot(doc))) {

        o = getNodeSet(doc, "//div/table[@class='yfirttbl']")

         # Write a function that will extract the information out of a given table node.
        readHTMLTable =
        function(tb)
         {
               # get the header information.
           colNames = sapply(tb[["thead"]][["tr"]]["th"], xmlValue)
           vals = sapply(tb[["tbody"]]["tr"],  function(x) sapply(x["td"], xmlValue))
           matrix(as.numeric(vals[-1,]),
                   nrow = ncol(vals),
                   dimnames = list(vals[1,], colNames[-1]),
                   byrow = TRUE
                 )
         }  

          # Now process each of the table nodes in the o list.
         tables = lapply(o, readHTMLTable)
         names(tables) = lapply(o, function(x) xmlValue(x[["caption"]]))
       }

          # this illustrates an approach to doing queries on a sub tree
          # within the document.
          # Note that there is a memory leak incurred here as we create a new
          # XMLInternalDocument in the getNodeSet().

         f = system.file("exampleData", "book.xml", package = "XML")
         doc = xmlTreeParse(f, useInternal = TRUE)
         ch = getNodeSet(doc, "//chapter")
         xpathApply(ch[[2]], "//section/title", xmlValue)

           # To fix the memory leak, we explicitly create a new document for
           # the subtree, perform the query and then free it _when_ we are done
           # with the resulting nodes.
         subDoc = xmlDoc(ch[[2]])
         xpathApply(subDoc, "//section/title", xmlValue)
         free(subDoc)

         txt = '<top xmlns="http://www.r-project.org" xmlns:r="http://www.r-project.org"><r:a><b/></r:a></top>'
         doc = xmlInternalTreeParse(txt, asText = TRUE)

     ## Not run: 
          # Will fail because it doesn't know what the namespace x is
          # and we have to have one eventhough it has no prefix in the document.
         xpathApply(doc, "//x:b")
     ## End(Not run)    
           # So this is how we do it - just  say x is to be mapped to the
           # default unprefixed namespace which we shall call x!
         xpathApply(doc, "//x:b", namespaces = "x")

            # Here r is mapped to the the corresponding definition in the document.
         xpathApply(doc, "//r:a", namespaces = "r")
            # Here, xpathApply figures this out for us, but will raise a warning.
         xpathApply(doc, "//r:a")

            # And here we use our own binding.
         xpathApply(doc, "//x:a", namespaces = c(x = "http://www.r-project.org"))

