Type: | Package |
Title: | Hyphenation and Syllable Counting for Text Analysis |
Description: | Provides the hyphenation algorithm used for 'TeX'/'LaTeX' and similar software, as proposed by Liang (1983, https://tug.org/docs/liang/). Mainly contains the function hyphen() to be used for hyphenation/syllable counting of text objects. It was originally developed for and part of the 'koRpus' package, but later released as a separate package so it's lighter to have this particular functionality available for other packages. Support for various languages needs to be added on-the-fly or by plugin packages (https://undocumeantit.github.io/repos/); this package does not include any language specific data. Due to some restrictions on CRAN, the full package sources are only available from the project homepage. To ask for help, report bugs, request features, or discuss the development of the package, please subscribe to the koRpus-dev mailing list (http://korpusml.reaktanz.de). |
Depends: | R (≥ 3.0.0) |
Imports: | methods |
Suggests: | testthat,knitr,rmarkdown,sylly.de,sylly.en,sylly.es |
VignetteBuilder: | knitr |
URL: | https://reaktanz.de/?c=hacking&s=sylly |
BugReports: | https://github.com/unDocUMeantIt/sylly/issues |
Additional_repositories: | https://undocumeantit.github.io/repos/l10n |
License: | GPL (≥ 3) |
Encoding: | UTF-8 |
LazyLoad: | yes |
Version: | 0.1-6 |
Date: | 2020-09-19 |
RoxygenNote: | 7.1.1 |
Collate: | '00_environment.R' '01_class_01_kRp.hyph.pat.R' '01_class_02_kRp.hyphen.R' '02_method_correct.R' '02_method_hyphen.R' '02_method_kRp.hyphen.R' '02_method_show.kRp.hyphen.R' '02_method_summary.kRp.hyphen.R' 'available.sylly.lang.R' 'get.sylly.env.R' 'install.sylly.lang.R' 'manage.hyph.pat.R' 'read.hyph.pat.R' 'set.hyph.support.R' 'set.sylly.env.R' 'sylly-internal.R' 'sylly-internal_langpack_generator.R' 'sylly-package.R' |
NeedsCompilation: | no |
Packaged: | 2020-09-19 21:36:38 UTC; m |
Author: | Meik Michalke [aut, cre] |
Maintainer: | Meik Michalke <meik.michalke@hhu.de> |
Repository: | CRAN |
Date/Publication: | 2020-09-20 04:40:02 UTC |
Hyphenation and Syllable Counting for Text Analysis
Description
Provides the hyphenation algorithm used for 'TeX'/'LaTeX' and similar software, as proposed by Liang (1983, <https://tug.org/docs/liang/>). Mainly contains the function hyphen() to be used for hyphenation/syllable counting of text objects. It was originally developed for and part of the 'koRpus' package, but later released as a separate package so it's lighter to have this particular functionality available for other packages. Support for various languages needs to be added on-the-fly or by plugin packages (<https://undocumeantit.github.io/repos/>); this package does not include any language specific data. Due to some restrictions on CRAN, the full package sources are only available from the project homepage. To ask for help, report bugs, request features, or discuss the development of the package, please subscribe to the koRpus-dev mailing list (<http://korpusml.reaktanz.de>).
Details
The DESCRIPTION file:
Package: | sylly |
Type: | Package |
Version: | 0.1-6 |
Date: | 2020-09-19 |
Depends: | R (>= 3.0.0) |
Encoding: | UTF-8 |
License: | GPL (>= 3) |
LazyLoad: | yes |
URL: | https://reaktanz.de/?c=hacking&s=sylly |
Author(s)
Meik Michalke
Maintainer: Meik Michalke <meik.michalke@hhu.de>
See Also
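Useful links:
https://reaktanz.de/?c=hacking&s=sylly
Report bugs at https://github.com/unDocUMeantIt/sylly/issues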
List available language packages
Description
Get a list of all currently available language packages for sylly from the official l10n repository.
Usage
available.sylly.lang(repos = "https://undocumeantit.github.io/repos/l10n/")
Arguments
repos |
The URL to additional repositories to query. You should probably leave this to the
default, but if you would like to use a third party repository, you're free to do so. The
value is temporarily appended to the repos currently returned by |
Details
sylly's language support is modular by design, meaning you can load an extension package for each language you want to work with in a given session. These language support packages are named sylly.**, where ** is replaced by a valid language identifier (like en for English or de for German). See set.hyph.support for more details.
This function downloads the package list from the official localization repository for sylly (in addition to the repositories already configured) and lists all currently available language packages that you could install and load. Apart from that, it does not download or install anything.
You can install the packages by either calling the convenient wrapper function install.sylly.lang, or install.packages (see examples).
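The naming scheme described above can be sketched in base R. This is only an illustration: the variable names are made up for this example, and the check merely looks at the local library without contacting any repository.

```r
# Language identifiers of interest (chosen for illustration).
lang_ids <- c("en", "de", "es")

# Language support packages follow the scheme "sylly.<identifier>".
pkg_names <- paste0("sylly.", lang_ids)

# system.file() returns "" for packages that are not installed,
# so this only inspects the local library and downloads nothing.
installed <- vapply(
  pkg_names,
  function(p) nzchar(system.file(package = p)),
  logical(1)
)
print(installed)
```

Packages reported as FALSE here are candidates for install.sylly.lang.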
Value
Returns an invisible character vector with all available language packages.
Examples
## Not run:
# see all available language packages
available.sylly.lang()
# install support for German
install.sylly.lang("de")
# alternatively, you could call install.packages directly
install.packages("sylly.de", repos="https://undocumeantit.github.io/repos/l10n/")
## End(Not run)
Correct kRp.hyphen objects
Description
The method correct.hyph can be used to alter objects of class kRp.hyphen.
Usage
correct.hyph(obj, word = NULL, hyphen = NULL, cache = TRUE)
## S4 method for signature 'kRp.hyphen'
correct.hyph(obj, word = NULL, hyphen = NULL, cache = TRUE)
Arguments
obj |
An object of class |
word |
A character string,
the (possibly incorrectly hyphenated) |
hyphen |
A character string,
the new manually hyphenated version of |
cache |
Logical, if |
Details
Although hyphenation should turn out to be rather accurate, the algorithm does usually produce some errors. If you want to correct these flaws, this method can be of help, because it might prevent you from introducing new errors. That is, it will do some sanity checks before the object is actually manipulated and returned.
Specifically, correct.hyph checks whether word and hyphen are actually hyphenations of the same token before proceeding. If so, it will also recalculate the number of syllables and update the syll field.
If both word and hyphen are NULL, correct.hyph will try to simply recalculate the syllable count for each word, by counting the hyphenation marks (and adding 1 to the number). This can be useful if you changed hyphenation some other way, e.g. in a spreadsheet GUI, but don't want to have to correct the syllable count yourself as well.
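The recalculation rule (number of hyphenation marks plus one) can be sketched in base R as follows. The helper count_syllables is a hypothetical name for this illustration, not a function of the package, and the sample words were hyphenated by hand.

```r
# Hypothetical helper: syllables = number of "-" marks + 1.
count_syllables <- function(hyphenated) {
  # gregexpr() finds every "-"; regmatches() yields character(0) if none.
  marks <- lengths(regmatches(hyphenated, gregexpr("-", hyphenated, fixed = TRUE)))
  marks + 1L
}

# Hand-hyphenated sample words (for illustration only).
count_syllables(c("Hil-fe", "dem-on-stra-tion", "is"))
# → 2 4 1
```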
Value
An object of the same class as obj.
Examples
## Not run:
hyphenated.txt <- correct.hyph(hyphenated.txt, "Hilfe", "Hil-fe")
## End(Not run)
Getter/setter methods for sylly objects
Description
These methods should be used to get or set values of hyphenated text objects generated by functions like hyphen().
Usage
describe(obj, ...)
## S4 method for signature 'kRp.hyphen'
describe(obj)
describe(obj, ...) <- value
## S4 replacement method for signature 'kRp.hyphen'
describe(obj, ...) <- value
language(obj)
## S4 method for signature 'kRp.hyphen'
language(obj)
language(obj) <- value
## S4 replacement method for signature 'kRp.hyphen'
language(obj) <- value
hyphenText(obj)
## S4 method for signature 'kRp.hyphen'
hyphenText(obj)
hyphenText(obj) <- value
## S4 replacement method for signature 'kRp.hyphen'
hyphenText(obj) <- value
## S4 method for signature 'kRp.hyphen'
x[i, j]
## S4 replacement method for signature 'kRp.hyphen'
x[i, j] <- value
## S4 method for signature 'kRp.hyphen'
x[[i]]
## S4 replacement method for signature 'kRp.hyphen'
x[[i]] <- value
Arguments
obj |
An object of class |
... |
Additional arguments as defined by respective methods. |
value |
A value to set. |
x |
An object of class |
i |
Row index. |
j |
Column index. |
Details
describe() returns the desc slot.
language() returns the lang slot.
hyphenText() returns the hyphen slot from objects of class kRp.hyphen.
[ / [[ can be used as a shortcut to index the results of hyphenText().
Examples
## Not run:
hyphenText(hyphenated.txt)
## End(Not run)
Get sylly session settings
Description
The function get.sylly.env returns information on your session environment regarding the sylly package, e.g. whether a cache file should be used, if it was set before using set.sylly.env.
Usage
get.sylly.env(..., errorIfUnset = TRUE)
Arguments
... |
Named parameters to get from the sylly environment. Valid arguments are:
|
errorIfUnset |
Logical, if |
Details
For the most part, get.sylly.env is a convenient wrapper for getOption.
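What "a wrapper for getOption" can look like is sketched below in base R. It assumes the settings live in a list stored under the option name sylly, as the .Rprofile example for set.sylly.env suggests. The helper get_sylly_option and its error message are hypothetical and do not reproduce the package's code.

```r
# Hypothetical helper, assuming settings are kept in options(sylly = list(...)).
get_sylly_option <- function(name, errorIfUnset = TRUE) {
  value <- getOption("sylly")[[name]]   # NULL if the option or entry is missing
  if (is.null(value) && errorIfUnset) {
    stop("No value set for '", name, "'")
  }
  value
}

old <- options(sylly = list(lang = "de"))   # set a value for this demo
get_sylly_option("lang")
# → "de"
get_sylly_option("hyph.cache.file", errorIfUnset = FALSE)
# → NULL
options(old)                                # restore the previous state
```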
Value
A character string or list, possibly including:
lang |
The specified language |
hyph.cache.file |
The specified hyphenation cache file for |
Examples
set.sylly.env(hyph.cache.file=file.path(tempdir(), "cache_file.RData"))
get.sylly.env(hyph.cache.file=TRUE)
Automatic hyphenation
Description
These methods implement word hyphenation, based on Liang's algorithm.
Usage
hyphen(words, ...)
## S4 method for signature 'character'
hyphen(
words,
hyph.pattern = NULL,
min.length = 4,
rm.hyph = TRUE,
quiet = FALSE,
cache = TRUE,
as = "kRp.hyphen"
)
hyphen_df(words, ...)
## S4 method for signature 'character'
hyphen_df(
words,
hyph.pattern = NULL,
min.length = 4,
rm.hyph = TRUE,
quiet = FALSE,
cache = TRUE
)
hyphen_c(words, ...)
## S4 method for signature 'character'
hyphen_c(
words,
hyph.pattern = NULL,
min.length = 4,
rm.hyph = TRUE,
quiet = FALSE,
cache = TRUE
)
Arguments
words |
Either a character vector with words/tokens to be hyphenated,
or any tagged text object generated with the |
... |
Only used for the method generic. |
hyph.pattern |
Either an object of class |
min.length |
Integer, number of letters a word must have for considering a hyphenation. |
rm.hyph |
Logical, whether appearing hyphens in words should be removed before pattern matching. |
quiet |
Logical. If |
cache |
Logical. |
as |
A character string defining the class of the object to be returned. Defaults to |
Details
For this to work, the function must be told which pattern set it should use to find the right hyphenation spots. The most straightforward way to add support for a particular language during a session is to load an appropriate language package (e.g., the package sylly.en for English or sylly.de for German). See available.sylly.lang and install.sylly.lang for more information on how to get language support packages.
After such a package has been loaded, you can simply use the language abbreviation as the value for the hyph.pattern argument (like "en" for the English pattern set). If words is an object that was tokenized and tagged with the koRpus package, its language definition can be used instead, i.e., you don't need to specify hyph.pattern; hyphen will pick the language automatically.
In case you'd rather use your own pattern set, hyph.pattern can alternatively be an object of class kRp.hyph.pat.
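To make the role of a pattern set more concrete, here is a minimal base R sketch of the core idea behind Liang's pattern matching. It is not the package's implementation: the function names are hypothetical, the pattern set is a toy (only "r3ticl" is borrowed from the manage.hyph.pat example, the others are invented), and refinements such as minimum fragment lengths and caching are left out. Each pattern carries digits for the gaps between its characters. For every gap in the word the highest digit of all matching patterns wins, and an odd result marks a permitted break.

```r
# Split a pattern like "r3ticl" into its characters and one number per gap.
parse_pattern <- function(pattern) {
  chars <- strsplit(pattern, "", fixed = TRUE)[[1]]
  is_digit <- grepl("^[0-9]$", chars)
  nums <- integer(sum(!is_digit) + 1L)        # gaps: before, between, after
  gap <- cumsum(!is_digit)[is_digit] + 1L     # gap index of each digit
  nums[gap] <- as.integer(chars[is_digit])
  list(char = paste(chars[!is_digit], collapse = ""), nums = nums)
}

hyphenate_toy <- function(word, patterns) {
  parsed <- lapply(patterns, parse_pattern)
  w <- strsplit(paste0(".", tolower(word), "."), "", fixed = TRUE)[[1]]
  n <- length(w)
  scores <- integer(n + 1L)                   # scores[k]: gap before w[k]
  for (p in parsed) {
    len <- nchar(p$char)
    if (len > n) next
    for (start in seq_len(n - len + 1L)) {
      if (paste(w[start:(start + len - 1L)], collapse = "") == p$char) {
        idx <- start:(start + len)
        scores[idx] <- pmax(scores[idx], p$nums)   # highest number wins
      }
    }
  }
  out <- w[2:(n - 1L)]                        # drop the "." word boundaries
  if (length(out) > 1L) {
    brk <- which(scores[3:(n - 1L)] %% 2L == 1L)   # odd = break allowed
    out[brk] <- paste0(out[brk], "-")
  }
  paste(out, collapse = "")
}

toy_patterns <- c("r3ticl", "i1c", "a1r", ".a2")  # toy set, not real TeX data
hyphenate_toy("article", toy_patterns)
# → "ar-ti-cle"
```

In this toy set, "a1r" alone would allow a break after the first letter, but ".a2" assigns the same gap a higher, even number, which suppresses it.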
Value
An object of class kRp.hyphen, data.frame or a numeric vector, depending on the value of the as argument.
References
Liang, F.M. (1983). Word Hy-phen-a-tion by Com-put-er. Dissertation, Stanford University, Dept. of Computer Science.
See Also
read.hyph.pat, manage.hyph.pat, available.sylly.lang, and install.sylly.lang
Examples
## Not run:
library(sylly.en)
sampleText <- c("This", "is", "a", "rather", "stupid", "demonstration")
hyphen(sampleText, hyph.pattern="en")
hyphen_df(sampleText, hyph.pattern="en")
hyphen_c(sampleText, hyph.pattern="en")
# using a koRpus object
hyphen(tagged.text)
## End(Not run)
Install language support packages
Description
This is a wrapper for install.packages, making it more convenient to install additional language support packages for sylly.
Usage
install.sylly.lang(
lang,
repos = "https://undocumeantit.github.io/repos/l10n/",
...
)
Arguments
lang |
Character vector,
one or more valid language identifiers (like |
repos |
The URL to additional repositories to query. You should probably leave this to the
default, but if you would like to use a third party repository, you're free to do so. The
value is temporarily appended to the repos currently returned by |
... |
Additional options for |
Details
For a list of currently available language packages see available.sylly.lang. See set.hyph.support for more details on sylly's language support in general.
Value
Does not return any useful objects, just calls install.packages.
See Also
install.packages, available.sylly.lang
Examples
## Not run:
# install support for German
install.sylly.lang("de")
# load the package
library("sylly.de")
## End(Not run)
S4 Class kRp.hyph.pat
Description
This class is used for objects that are returned by read.hyph.pat.
Details
Since this package used to be a part of the koRpus package, you might run into old pattern files. You will know that this is the case if using them automatically tries to load the koRpus package. In these cases, you might want to strip the defunct reference to koRpus by calling the private function sylly:::koRpus2sylly, which takes the path to the old file as its first argument. Be aware that calling this function will overwrite the old file in-place, so you should make a backup first!
Slots
lang
A character string, naming the language that is assumed for the patterns in this object.
pattern
A matrix with three columns:
orig: The unchanged patgen patterns.
char: Only the characters used for matching.
nums: The hyphenation number code for the pattern.
Constructor function
Should you need to manually generate objects of this class (which should rarely be the case), the constructor function kRp_hyph_pat(...) can be used instead of new("kRp.hyph.pat", ...). Whenever possible, stick to read.hyph.pat.
S4 Class kRp.hyphen
Description
This class is used for objects that are returned by hyphen.
Slots
lang
A character string, naming the language that is assumed for the analyzed text in this object.
desc
Descriptive statistics of the analyzed text.
hyphen
A data.frame with two columns:
syll: Number of recognized syllables.
word: The hyphenated word.
Constructor function
Should you need to manually generate objects of this class (which should rarely be the case), the constructor function kRp_hyphen(...) can be used instead of new("kRp.hyphen", ...). Whenever possible, stick to hyphen.
Handling hyphenation pattern objects
Description
This function can be used to examine and change hyphenation pattern objects to be used with hyphen.
Usage
manage.hyph.pat(
hyph.pattern,
get = NULL,
set = NULL,
rm = NULL,
word = NULL,
min.length = 3L,
rm.hyph = TRUE
)
Arguments
hyph.pattern |
Either an object of class |
get |
A character string, part of a word to look up in the pattern set, i.e., without the numbers indicating split probability. |
set |
A character string, a full pattern to be added to the pattern set, i.e., including the numbers indicating split probability. |
rm |
A character string, part of a word to remove from the pattern set, i.e., without the numbers indicating split probability. |
word |
A character string, a full word to hyphenate using the given pattern set. |
min.length |
Integer, number of letters a word must have for considering a hyphenation. |
rm.hyph |
Logical, whether appearing hyphens in words should be removed before pattern matching. |
Details
You can only run one of the possible actions at a time. If any of these arguments is not NULL, the corresponding action is done in the following order, and every additional one is discarded:
get: Searches the pattern set for a given word part
set: Adds or replaces a pattern in the set (duplicates are removed)
rm: Removes a word part and its pattern from the set
word: Hyphenates a word and returns all parts examined as well as all matching patterns
If all action arguments are NULL, manage.hyph.pat returns the full pattern object.
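The precedence rule can be sketched in base R as follows. The helper pick_action and its "none" return value are hypothetical. The sketch only shows which single action would win when several arguments are given and does not touch any pattern set.

```r
# Hypothetical helper: returns the name of the first non-NULL action
# in the documented order get, set, rm, word, or "none" if all are NULL.
pick_action <- function(get = NULL, set = NULL, rm = NULL, word = NULL) {
  actions <- list(get = get, set = set, rm = rm, word = word)
  given <- names(actions)[!vapply(actions, is.null, logical(1))]
  if (length(given) == 0L) "none" else given[1]
}

pick_action(rm = "rticl", get = "rticl")
# → "get"
pick_action(word = "article")
# → "word"
pick_action()
# → "none"
```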
Value
If all action arguments are NULL, returns an object of class kRp.hyph.pat. The same is true if set or rm are set and hyph.pattern is itself an object of that class; if you refer to a language instead, pattern changes will be done internally for the running session and take effect immediately. The get argument will return a character vector, and word a data frame.
References
[1] http://tug.ctan.org/tex-archive/language/hyph-utf8/tex/generic/hyph-utf8/patterns/txt/
Examples
## Not run:
manage.hyph.pat("en", set="r3ticl")
manage.hyph.pat("en", get="rticl")
manage.hyph.pat("en", word="article")
manage.hyph.pat("en", rm="rticl")
## End(Not run)
Reading patgen-compatible hyphenation pattern files
Description
This function reads hyphenation pattern files, to be used with hyphen.
Usage
read.hyph.pat(file, lang, fileEncoding = "UTF-8")
Arguments
file |
A connection or character string with a valid path to a file with hyphenation patterns (one pattern per line). |
lang |
A character string, usually two letters short, naming the language the patterns are meant to be used with (e.g. "es" for Spanish). |
fileEncoding |
A character string defining the character encoding of the file to be read. Unless you have a really good reason to do otherwise, your pattern files should all be UTF-8 encoded. |
Details
Hyphenation patterns that can be used are available from CTAN[1]. But actually any file with only the patterns themselves, one per line, should work.
The language designation is of no direct consequence here, but if the resulting pattern object is to be used by other functions in this package or koRpus, it should resemble the designation that's used for the same language there.
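As an illustration of the "one pattern per line" file format, the following base R sketch writes a tiny pattern file and turns it into a three-column matrix with the columns orig, char and nums, mirroring the pattern slot described for class kRp.hyph.pat. The file contents are toy data and split_pattern is a hypothetical helper. The nums encoding used here (one digit per gap, 0 where the pattern has none) is an assumption of this sketch, not necessarily the encoding the package uses.

```r
# Write a toy pattern file, one pattern per line (plus a blank line).
pat_file <- tempfile(fileext = ".txt")
writeLines(c("r3ticl", ".ar4", "1cl", ""), pat_file)

orig <- readLines(pat_file, encoding = "UTF-8")
orig <- orig[nzchar(trimws(orig))]            # drop empty lines

# Hypothetical helper: returns c(orig, char, nums) for one pattern.
split_pattern <- function(p) {
  chars <- strsplit(p, "", fixed = TRUE)[[1]]
  is_digit <- grepl("^[0-9]$", chars)
  nums <- integer(sum(!is_digit) + 1L)        # one slot per gap
  nums[cumsum(!is_digit)[is_digit] + 1L] <- as.integer(chars[is_digit])
  c(p, paste(chars[!is_digit], collapse = ""), paste(nums, collapse = ""))
}

pat <- t(vapply(orig, split_pattern, character(3), USE.NAMES = FALSE))
colnames(pat) <- c("orig", "char", "nums")
print(pat)
```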
Value
An object of class kRp.hyph.pat.
References
[1] http://tug.ctan.org/tex-archive/language/hyph-utf8/tex/generic/hyph-utf8/patterns/txt/
Examples
## Not run:
read.hyph.pat("~/patterns/hyph-en-us.pat.txt", lang="en_us")
## End(Not run)
Add support for new languages
Description
You can use this function to add new languages to be used with sylly.
Usage
set.hyph.support(value)
Arguments
value |
A named list that upholds exactly the structure described in the Details section. |
Details
Language support in this package is designed to be extended easily. You could call it modular, although it's actually more "environmental", but never mind.
To add new language support, say for Xyzedish, you basically have to call this function once and provide respective hyphenation patterns. If you would like to re-use this language support, you should consider making it a package.
If it succeeds, it will fill an internal environment with the information you have defined. hyphen will then know which language patterns are available as data files (which you must also provide).
You provide the meta data as a named list. It usually has one single entry to tell the new language abbreviation, e.g., set.hyph.support(list("xyz"="xyz")). However, this will only work if a) the language support script is a part of the sylly package itself, and b) the hyphen pattern is located in its data subdirectory.
For your custom hyphenation patterns to be found automatically, provide them as the value in the named list, e.g., set.hyph.support(list("xyz"=hyph.xyz)). This will directly add the patterns to sylly's environment, so they will be found when hyphenation is requested for language "xyz".
If you would like to provide hyphenation as part of a third party language package, you must name the object hyph.<lang>, save it to your package's data subdirectory as hyph.<lang>.rda, and append package="<yourpackage>" to the named list; e.g., set.hyph.support(list("xyz"=c("xyz", package="koRpus.lang.xyz"))). Only then will sylly look for the pattern object in your package, not its own data directory.
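The second variant, handing over a pattern object directly, might look like the sketch below. It combines read.hyph.pat and set.hyph.support as documented in this manual. The pattern file is toy data written to a temporary location, "xyz" is the made-up language from the text, and the whole block is skipped when sylly is not installed.

```r
# Runs only if the sylly package is available.
if (requireNamespace("sylly", quietly = TRUE)) {
  # Toy pattern file, one pattern per line (not real language data).
  pat_file <- tempfile(fileext = ".txt")
  writeLines(c("r3ticl", ".ar4", "1cl"), pat_file)

  # Turn the file into a pattern object for the made-up language "xyz" ...
  hyph.xyz <- sylly::read.hyph.pat(pat_file, lang = "xyz")

  # ... and register it for the running session.
  sylly::set.hyph.support(list("xyz" = hyph.xyz))
}
```

According to the description above, hyphenation requested for language "xyz" should then find these patterns for the rest of the session.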
Hyphenation patterns
To be able to also do syllable count with the newly added language, you should add a hyphenation pattern file as well. Refer to the documentation of read.hyph.pat() to learn how to produce a pattern object from a downloaded hyphenation pattern file. Make sure you use the correct name scheme (e.g. "hyph.xyz.rda") and good compression.
Examples
## Not run:
set.hyph.support(
list("xyz"="xyz")
)
## End(Not run)
A function to set information on your sylly environment
Description
The function set.sylly.env can be called before any of the hyphenation functions. It writes information on your current session's settings to your global .Options.
Usage
set.sylly.env(..., validate = TRUE)
Arguments
... |
Named parameters to set in the sylly environment. Valid arguments are:
To explicitly unset a value again, set it to an empty character string (e.g.,
|
validate |
Logical,
if |
Details
To get the current settings, the function get.sylly.env should be used. For the most part, set.sylly.env is a convenient wrapper for options. To permanently set some defaults, you could also add respective options calls to an .Rprofile file.
Value
Returns an invisible NULL.
Examples
set.sylly.env(hyph.cache.file=file.path(tempdir(), "cache_file.RData"))
get.sylly.env(hyph.cache.file=TRUE)
## Not run:
# example for setting permanent default values in an .Rprofile file
options(
sylly=list(
hyph.cache.file=file.path(tempdir(), "cache_file.RData"),
lang="de"
)
)
# be aware that setting a permanent default language without loading
# the respective language support package might trigger errors
## End(Not run)
Show method for sylly objects
Description
Show method for S4 objects of class kRp.hyphen.
Usage
## S4 method for signature 'kRp.hyphen'
show(object)
Arguments
object |
An object of class |
Examples
## Not run:
hyphen(tagged.text)
## End(Not run)
Summary method for sylly objects
Description
Summary method for S4 objects of class kRp.hyphen.
Usage
## S4 method for signature 'kRp.hyphen'
summary(object)
Arguments
object |
An object of class |
Examples
## Not run:
summary(hyphen(tagged.text))
## End(Not run)