getLongestSubstring        package:Rlibstree        R Documentation

_C_o_m_p_u_t_e _l_o_n_g_e_s_t _r_e_p_e_a_t_e_d _o_r _c_o_m_m_o_n _s_u_b_s_t_r_i_n_g _i_n _a _S_u_f_f_i_x_T_r_e_e

_D_e_s_c_r_i_p_t_i_o_n:

     This function works with a suffix tree, either passed to it
     directly or by building one from a character vector or a
     StringSet. The function can be used to find the longest common
     substring shared by two or more words, or alternatively to find the
     longest substring that is repeated, i.e. occurs at least twice,
     within a word or across two or more words.

     When finding the common substring, the string must be present in
     each of the words. When finding the repeated substring, the
     substring can be found across two or more words, or more than
     once within a single word.

     If one is going to do multiple operations on the same collection
     of strings, it is sensible to first build the SuffixTree (using
     'SuffixTree') and then pass this object in each of the calls.

     This function is a relatively straightforward interface to the
     libstree routines 'lst_alg_longest_repeated_substring'  and
     'lst_alg_longest_common_substring'. More information can
     therefore be found in their documentation.

_U_s_a_g_e:

     getLongestRepeatedSubstring(words, range = c(1, 0), asCharacter = TRUE)
     getLongestCommonSubstring(words, range = c(1, 0), asCharacter = TRUE)
     getLongestSubstring(stree, repeated = TRUE, range = c(1, 0), asCharacter = TRUE)

_A_r_g_u_m_e_n_t_s:

stree, words: the collection of strings which are to be searched for
          the longest substring. This can be a character vector, a
          'StringSet' or a 'SuffixTree'. 

repeated: a logical value.  If this is 'TRUE', then we look for
          repeated substrings. If it is 'FALSE', then we look for
          common substrings.  See the documentation for libstree
          (listed under References).

   range: a pair of integers giving the minimum and maximum length of
          the substrings over which to search.  If the second value is
          0, substrings of all possible lengths are considered, i.e.
          up to the length of the longest string in the set. If the
          caller supplies just a single integer, the trailing 0 is
          assumed. 

asCharacter: a logical value indicating whether the result should be 
          converted to a character vector in R or, alternatively
          (FALSE), left as a 'StringSet-class'.

_D_e_t_a_i_l_s:

     This uses the libstree routines
     'lst_alg_longest_repeated_substring' and 
     'lst_alg_longest_common_substring'.

_V_a_l_u_e:

     If 'asCharacter' is 'TRUE', the default, the result  is a
     character vector. Otherwise, it is an object of class
     'StringSet-class'.

_N_o_t_e:

     The libstree distribution has some bugs. If possible, test any
     anomalies with the executables in libstree's 'test' directory to
     determine if they are due to the code in this package or libstree
     itself.

_A_u_t_h_o_r(_s):

     Duncan Temple Lang <duncan@wald.ucdavis.edu>

_R_e_f_e_r_e_n_c_e_s:

     libstree: <URL: http://www.cl.cam.ac.uk/~cpk25/libstree/>

     <URL: http://www.omegahat.org/Rlibstree>

_S_e_e _A_l_s_o:

     'SuffixTree' 'StringSet' 'getCommonPrefix'

_E_x_a_m_p_l_e_s:

      els = c("aaabbbaaabbb", "aaa", "aabb")
       # "aaabbb"
      getLongestRepeatedSubstring(els)

       # "aa" 
      getLongestCommonSubstring(els)
       # Same call but with the general getLongestSubstring() function.
      getLongestSubstring(els, repeated = FALSE)
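
       # A brute-force cross-check, in base R only, of what "common" means
       # above: the substring must occur in every word.  This is merely an
       # illustrative sketch.  'bruteCommon' is a hypothetical helper, not
       # part of Rlibstree, and it does not use a suffix tree, so it is
       # only sensible for short inputs.
      bruteCommon = function(words) {
          first = words[1]
          n = nchar(first)
          # Try candidate lengths from the longest to the shortest.
          for (len in rev(seq_len(n))) {
              starts = seq_len(n - len + 1)
              # All distinct substrings of the first word with this length.
              cands = unique(substring(first, starts, starts + len - 1))
              # Keep the candidates that occur in every word.
              ok = vapply(cands,
                          function(s) all(grepl(s, words, fixed = TRUE)),
                          logical(1), USE.NAMES = FALSE)
              if (any(ok))
                  return(cands[ok])
          }
          character(0)
      }

       # "aa"
      bruteCommon(c("aaabbbaaabbb", "aaa", "aabb"))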

      
       words = c("stemming", "boing", "springs")
       tree = SuffixTree(words)

         # The longest common or repeated substring for these is the same - "ing"
         # Longest repeated substring
      getLongestRepeatedSubstring(tree)

         # Longest common substring.
      getLongestCommonSubstring(tree)

      # Find the repeated substring. 
      # Note that it finds "aaaa" twice within the second string "aaaaa":
      # once as "aaaa" followed by x and once as x followed by "aaaa",
      # where x is an arbitrary character (here, admittedly, also "a").
     getLongestRepeatedSubstring(c("aaa sdsd", "aaaaa", "xyz"))


       # This returns "aa" which is repeated as subsequences 1:2 and 2:3,
       # i.e. repeating the use of the middle "a"
     getLongestRepeatedSubstring("aaa")
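
       # The overlap rule described above can be illustrated with a
       # brute-force sketch in base R for a single string.  'bruteRepeated'
       # is a hypothetical helper, not part of Rlibstree; it counts
       # overlapping occurrences and does not use a suffix tree.
      bruteRepeated = function(word) {
          n = nchar(word)
          # A repeated substring must be shorter than the word itself.
          for (len in rev(seq_len(max(n - 1, 0)))) {
              starts = seq_len(n - len + 1)
              subs = substring(word, starts, starts + len - 1)
              # Substrings that occur at more than one starting position.
              dup = unique(subs[duplicated(subs)])
              if (length(dup) > 0)
                  return(dup)
          }
          character(0)
      }

       # "aa", found at positions 1:2 and 2:3
      bruteRepeated("aaa")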

      # Get the return value as a StringSet
     set = getLongestSubstring(tree, asCharacter = FALSE)
     length(set)

      # Given the word mississippi and the same word backward, we can find
      # the longest palindrome.  Taken from the Perl module Tree::Suffix by Gray

      # First, a function to reverse the order of the characters in each word
      reverseWord = function(word)
                       sapply(strsplit(word, ""), function(x) paste(rev(x), collapse = ""))

      # Just check it does it correctly, round trip the word
     "mississippi" == reverseWord(reverseWord("mississippi"))

       # We get "ississi"
      getLongestSubstring(c("mississippi", reverseWord("mississippi")), TRUE, c(0, 0))
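
       # A substring shared by a word and its reverse is not guaranteed to
       # be a palindrome in general, so it can be worth checking the result.
       # A minimal base-R sketch; 'revStr' and 'isPalindrome' are
       # hypothetical helpers, not part of Rlibstree.
      revStr = function(s) paste(rev(strsplit(s, "")[[1]]), collapse = "")
      isPalindrome = function(s) identical(s, revStr(s))

       # TRUE
      isPalindrome("ississi")

       # "abacdfgdcaba" and its reverse both contain "abacd", which is
       # not a palindrome.
       # FALSE
      isPalindrome("abacd")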


      # The longest repeated substring of just the word itself:
      #   "issi"
     getLongestSubstring("mississippi", TRUE, c(0, 0))

     # Longest repeated substring is esday
     getLongestSubstring(c("Monday", "Tuesday", "Wednesday"), TRUE)

     # Longest common substring is day
     getLongestSubstring(c("Monday", "Tuesday", "Wednesday"), FALSE)
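
     # A sketch of the 'range' argument combined with a reused SuffixTree,
     # as recommended in the Description.  Only calls shown in Usage are
     # used.  The results are deliberately not shown, since they depend on
     # how libstree applies the length limits.  The guard and the variable
     # names are illustrative; the guard only matters when this is run
     # where Rlibstree may not be installed.
     if (requireNamespace("Rlibstree", quietly = TRUE)) {
         library(Rlibstree)
         days = SuffixTree(c("Monday", "Tuesday", "Wednesday"))

         # Only consider substrings of between 2 and 4 characters.
         bounded = getLongestSubstring(days, repeated = FALSE, range = c(2, 4))
         print(bounded)

         # A single integer gives the minimum length; the trailing 0
         # (no upper limit) is assumed.
         atLeast4 = getLongestSubstring(days, repeated = TRUE, range = 4)
         print(atLeast4)
     }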

       # We get the common prefix as the longest substring
       # [1] "ABCDEF_"
      getLongestSubstring(paste("ABCDEF_", c("Monday", "Tuesday", "Wednesday"), sep = ""), TRUE, c(0, 0))


      # The names of enumerated constants in Microsoft Word's
      # scripting interface.  We want to find the common prefix.

     enumNames = c('wdSummaryModeHighlight',
                   'wdSummaryModeHideAllButSummary',
                   'wdSummaryModeInsert',
                   'wdSummaryModeCreateNew')

      # common substring
     x = getLongestCommonSubstring(enumNames)

     x == "wdSummaryMode"

      # longest repeated substring
      # This is "wdSummaryModeHi" shared by the first two elements.

     x = getLongestSubstring(enumNames)

     x == "wdSummaryModeHi"

     # A series of examples of repeated substrings within a single string

      # "first a"
     getLongestSubstring("first and first again and again")

      # [1] "first " " again"
     getLongestSubstring("first then first again and again")

      # [1] "first " " again"
     getLongestSubstring(c("first then first again and again", "first"))

      # This finds " again and again" 
     getLongestSubstring(c("first then first again and again", "Or again and again"))


       # We take this very long place name in New Zealand and find the
       # repeated substrings.
       # "ata" "aka" "ang" "mat" "tan" "nga" 
       nzPlaceName = "Taumatawhakatangihangakoauauotamateaturipukakapikimaungahoronukupokaiwhenuakitanatahu"
       getLongestRepeatedSubstring(nzPlaceName)

