Type: | Package |
Title: | Full Corpus Support for the 'koRpus' Package |
Description: | Enhances 'koRpus' text object classes and methods to also support large corpora. Hierarchical ordering of corpus texts into arbitrary categories will be preserved. Provided classes and methods also improve the ability of using the 'koRpus' package together with the 'tm' package. To ask for help, report bugs, suggest feature improvements, or discuss the global development of the package, please subscribe to the koRpus-dev mailing list (https://korpusml.reaktanz.de). |
Author: | m.eik michalke [aut, cre] |
Maintainer: | m.eik michalke <meik.michalke@hhu.de> |
Depends: | R (≥ 3.5.0),koRpus (≥ 0.13-1),sylly (≥ 0.1-6) |
Imports: | methods,parallel,tm,NLP |
Suggests: | koRpus.lang.en,testthat,knitr,rmarkdown |
VignetteBuilder: | knitr |
URL: | https://reaktanz.de/?c=hacking&s=koRpus |
BugReports: | https://github.com/unDocUMeantIt/tm.plugin.koRpus/issues |
License: | GPL (≥ 3) |
Encoding: | UTF-8 |
LazyLoad: | yes |
Version: | 0.4-2 |
Date: | 2021-05-17 |
RoxygenNote: | 7.1.1 |
Collate: | '01_class_01_kRp.corpus.R' '02_method_01_kRp.corpus-class_readability.R' '02_method_02_kRp.corpus-class_hyphen.R' '02_method_03_kRp.corpus-class_lex.div.R' '02_method_04_kRp.corpus-class_read.corp.custom.R' '02_method_05_kRp.corpus-class_freq.analysis.R' '02_method_06_kRp.corpus-class_summary.R' '02_method_07_kRp.corpus-class_correct.R' '02_method_08_kRp.corpus-class_query.R' '02_method_09_kRp.corpus-class_filterByClass.R' '02_method_10_kRp.corpus-class_jumbleWords.R' '02_method_11_kRp.corpus-class_clozeDelete.R' '02_method_12_kRp.corpus-class_cTest.R' '02_method_13_kRp.corpus-class_textTransform.R' '02_method_14_kRp.corpus-class_docTermMatrix.R' '02_method_15_kRp.corpus-class_split_by_doc_id.R' '02_method_20_kRp.corpus_get_set_is.R' '02_method_21_kRp.corpus-class_show.R' 'corpus_files.R' 'deprecated.R' 'kRpSource.R' 'readCorpus.R' 'tm.plugin.koRpus-internal.R' 'tm.plugin.koRpus-package.R' |
NeedsCompilation: | no |
Packaged: | 2021-05-18 11:08:16 UTC; m |
Repository: | CRAN |
Date/Publication: | 2021-05-18 12:50:02 UTC |
Full Corpus Support for the 'koRpus' Package
Description
Enhances 'koRpus' text object classes and methods to also support large corpora. Hierarchical ordering of corpus texts into arbitrary categories will be preserved. Provided classes and methods also improve the ability of using the 'koRpus' package together with the 'tm' package. To ask for help, report bugs, suggest feature improvements, or discuss the global development of the package, please subscribe to the koRpus-dev mailing list (<https://korpusml.reaktanz.de>).
Details
The DESCRIPTION file:
Package: | tm.plugin.koRpus |
Type: | Package |
Version: | 0.4-2 |
Date: | 2021-05-17 |
Depends: | R (>= 3.5.0),koRpus (>= 0.13-1),sylly (>= 0.1-6) |
Encoding: | UTF-8 |
License: | GPL (>= 3) |
LazyLoad: | yes |
URL: | https://reaktanz.de/?c=hacking&s=koRpus |
Author(s)
m.eik michalke [aut, cre]
Maintainer: m.eik michalke <meik.michalke@hhu.de>
See Also
Useful links:
Report bugs at https://github.com/unDocUMeantIt/tm.plugin.koRpus/issues
Apply cTest() to all texts in kRp.corpus objects
Description
This method calls cTest on all tagged text objects inside the given obj object (using mclapply).
Usage
## S4 method for signature 'kRp.corpus'
cTest(obj, mc.cores = getOption("mc.cores", 1L), ...)
Arguments
obj |
An object of class |
mc.cores |
The number of cores to use for parallelization,
see |
... |
options to pass through to |
Value
An object of the same class as obj.
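The per-document pattern described above (apply a function to each text, in parallel, then reassemble the result) can be sketched in base R as follows. This is only an illustration under stated assumptions: the tokens data frame and the mask_second_half() helper are hypothetical stand-ins, not the package's internal code and not the real cTest().

```r
# Hypothetical stand-in for a tagged-token data frame; the token tables
# of real kRp.corpus objects have many more columns.
tokens <- data.frame(
  doc_id = c("a", "a", "b", "b", "b"),
  token  = c("Hello", "world", "Good", "morning", "all"),
  stringsAsFactors = FALSE
)

# Hypothetical per-document transformation (NOT the real cTest()):
# keep the first half of each word and replace the rest with underscores.
mask_second_half <- function(df) {
  n_keep <- ceiling(nchar(df$token) / 2)
  df$token <- paste0(
    substr(df$token, 1, n_keep),
    strrep("_", nchar(df$token) - n_keep)
  )
  df
}

# Split by document, process each piece, and bind the pieces together again.
by_doc <- split(tokens, tokens$doc_id)
processed <- parallel::mclapply(
  by_doc,
  mask_second_half,
  mc.cores = getOption("mc.cores", 1L)  # same default as in the Usage section
)
result <- do.call(rbind, processed)
rownames(result) <- NULL
```

With the default of one core the call behaves like lapply(). Note that mclapply() relies on forking, so values of mc.cores greater than 1 are not supported on Windows.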
Examples
# use readCorpus() to create an object of class kRp.corpus
# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
myCorpus <- readCorpus(
dir=file.path(
path.package("tm.plugin.koRpus"), "examples", "corpus", "Edwards"
),
hierarchy=list(
Source=c(
Wikipedia_prev="Wikipedia (old)",
Wikipedia_new="Wikipedia (new)"
)
),
# use tokenize() so examples run without a TreeTagger installation
tagger="tokenize",
lang="en"
)
taggedText(myCorpus)[20:30,]
myCorpus <- cTest(myCorpus)
taggedText(myCorpus)[20:30,]
} else {}
Apply clozeDelete() to all texts in kRp.corpus objects
Description
This method calls clozeDelete on all tagged text objects inside the given obj object (using mclapply).
Usage
## S4 method for signature 'kRp.corpus'
clozeDelete(obj, mc.cores = getOption("mc.cores", 1L), ...)
Arguments
obj |
An object of class |
mc.cores |
The number of cores to use for parallelization,
see |
... |
options to pass through to |
Value
An object of the same class as obj.
Examples
# use readCorpus() to create an object of class kRp.corpus
# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
myCorpus <- readCorpus(
dir=file.path(
path.package("tm.plugin.koRpus"), "examples", "corpus", "Edwards"
),
hierarchy=list(
Source=c(
Wikipedia_prev="Wikipedia (old)",
Wikipedia_new="Wikipedia (new)"
)
),
# use tokenize() so examples run without a TreeTagger installation
tagger="tokenize",
lang="en"
)
head(taggedText(myCorpus), n=10)
myCorpus <- clozeDelete(myCorpus)
head(taggedText(myCorpus), n=10)
} else {}
Deprecated functions and methods
Description
These functions were used in earlier versions of the package but have since been either replaced or removed.
Usage
corpusTagged(obj, ...)
corpusTTR(obj, ...)
corpusLevel(...)
corpusCategory(...)
corpusID(...)
corpusPath(...)
Arguments
obj |
No longer used. |
... |
No longer used. |
Get a comprehensive data frame describing the files of your corpus
Description
The function translates the given hierarchy definition into a data frame with one row for each file, including the generated document ID.
Usage
corpus_files(
dir,
hierarchy = list(),
fsep = .Platform$file.sep,
full_list = FALSE
)
Arguments
dir |
File path to the root directory of the text corpus, or a TIF[1] compliant data frame. |
hierarchy |
A named list of named character vectors describing the directory hierarchy level by level.
If |
fsep |
Character string defining the path separator to use. |
full_list |
Logical, see return value. |
Value
Either a data frame with columns doc_id, file, path and one further factor column for each hierarchy level, or (if full_list=TRUE) a list containing that data frame (all_files) and also data frames describing the hierarchy by given names (hier_names), directories (hier_dirs) and relative paths (hier_paths).
References
[1] Text Interchange Formats (https://github.com/ropensci/tif)
Examples
myCorpusFiles <- corpus_files(
dir=file.path(
path.package("tm.plugin.koRpus"), "examples", "corpus"
),
hierarchy=list(
Topic=c(
Winner="Reality Winner",
Edwards="Natalie Edwards"
),
Source=c(
Wikipedia_prev="Wikipedia (old)",
Wikipedia_new="Wikipedia (new)"
)
)
)
Methods to correct kRp.corpus objects
Description
These methods enable you to correct errors that occurred during automatic processing, e.g., wrong hyphenation.
Usage
## S4 method for signature 'kRp.corpus'
correct.hyph(obj, word = NULL, hyphen = NULL, cache = TRUE)
Arguments
obj |
An object of class |
word |
A character string,
the (possibly incorrectly hyphenated) |
hyphen |
A character string,
the new manually hyphenated version of |
cache |
Logical, if |
Details
For details on what these methods do on a per text object basis, please refer to the documentation of correct.hyph in the sylly package.
Value
An object of the same class as obj.
Generate a document-term matrix from a corpus object
Description
Calculates a sparse document-term matrix from a given object of class kRp.corpus and adds it to the object's feature list. You can also calculate the term frequency-inverse document frequency (tf-idf) value for each term.
Usage
## S4 method for signature 'kRp.corpus'
docTermMatrix(
obj,
terms = "token",
case.sens = FALSE,
tfidf = FALSE,
as.feature = TRUE
)
Arguments
obj |
An object of class |
terms |
A character string defining the |
case.sens |
Logical, whether terms should be counted case sensitive. |
tfidf |
Logical,
if |
as.feature |
Logical,
whether the output should be just the sparse matrix or the input object with
that matrix added as a feature. Use |
Details
The settings of terms, case.sens, and tfidf will be stored in the object's meta slot, so you can use corpusMeta(..., "doc_term_matrix") to fetch them.
See the examples to learn how to limit the analysis to desired word classes.
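To make the difference between absolute frequencies and tf-idf values concrete, here is a minimal base R sketch on a toy corpus. It assumes a common tf-idf variant (absolute term frequency multiplied by the base 10 logarithm of the number of documents divided by the document frequency); the exact formula used by docTermMatrix() is not stated here, and the sketch builds a dense matrix rather than a sparse dgCMatrix.

```r
# Toy corpus: one character vector of tokens per document (hypothetical data).
docs <- list(
  doc1 = c("the", "cat", "sat"),
  doc2 = c("the", "dog", "sat", "down"),
  doc3 = c("the", "Cat", "ran")
)

# Mirror case.sens = FALSE by folding all tokens to lower case.
case.sens <- FALSE
if (!case.sens) docs <- lapply(docs, tolower)

# Rows are documents, columns are terms, cells are absolute frequencies.
terms <- sort(unique(unlist(docs)))
dtm <- t(vapply(
  docs,
  function(tokens) as.numeric(table(factor(tokens, levels = terms))),
  numeric(length(terms))
))
colnames(dtm) <- terms

# Assumed tf-idf variant: tf * log10(N / df).
doc_freq <- colSums(dtm > 0)           # number of documents containing each term
idf <- log10(nrow(dtm) / doc_freq)     # terms found in every document get idf 0
tfidf <- sweep(dtm, 2, idf, `*`)
```

In this toy corpus "the" occurs in every document, so its tf-idf value is zero throughout, while a term that occurs in a single document keeps a positive weight.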
Value
Either an object of the input class or a sparse matrix of class dgCMatrix.
Examples
# use readCorpus() to create an object of class kRp.corpus
# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
myCorpus <- readCorpus(
dir=file.path(path.package("tm.plugin.koRpus"), "examples", "corpus"),
hierarchy=list(
Topic=c(
Winner="Reality Winner",
Edwards="Natalie Edwards"
),
Source=c(
Wikipedia_prev="Wikipedia (old)",
Wikipedia_new="Wikipedia (new)"
)
),
# use tokenize() so examples run without a TreeTagger installation
tagger="tokenize",
lang="en"
)
# get the document-term frequencies in a sparse matrix
myDTMatrix <- docTermMatrix(myCorpus, as.feature=FALSE)
# combine with filterByClass() to, e.g., exclude all punctuation
myDTMatrix <- docTermMatrix(filterByClass(myCorpus), as.feature=FALSE)
# instead of absolute frequencies, get the tf-idf values
myDTMatrix <- docTermMatrix(
filterByClass(myCorpus),
tfidf=TRUE,
as.feature=FALSE
)
} else {}
Apply filterByClass() to all texts in kRp.corpus objects
Description
This method calls filterByClass on all tagged text objects inside the given txt object (using mclapply).
Usage
## S4 method for signature 'kRp.corpus'
filterByClass(txt, mc.cores = getOption("mc.cores", 1L), ...)
Arguments
txt |
An object of class |
mc.cores |
The number of cores to use for parallelization,
see |
... |
options to pass through to |
Value
An object of the same class as txt.
Examples
# use readCorpus() to create an object of class kRp.corpus
# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
myCorpus <- readCorpus(
dir=file.path(
path.package("tm.plugin.koRpus"), "examples", "corpus", "Edwards"
),
hierarchy=list(
Source=c(
Wikipedia_prev="Wikipedia (old)",
Wikipedia_new="Wikipedia (new)"
)
),
# use tokenize() so examples run without a TreeTagger installation
tagger="tokenize",
lang="en"
)
head(taggedText(myCorpus), n=10)
# remove all punctuation
myCorpus <- filterByClass(myCorpus)
head(taggedText(myCorpus), n=10)
} else {}
Apply freq.analysis() to all texts in kRp.corpus objects
Description
This method calls freq.analysis on all tagged text objects inside the given txt.file object.
Usage
## S4 method for signature 'kRp.corpus'
freq.analysis(txt.file, ...)
Arguments
txt.file |
An object of class |
... |
options to pass through to |
Details
If corp.freq was not specified but a valid object of class kRp.corp.freq is found in the freq slot of txt.file, it is used automatically. That is the case if you called read.corp.custom on the object previously.
Value
An object of the same class as txt.file.
Examples
# use readCorpus() to create an object of class kRp.corpus
# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
myCorpus <- readCorpus(
dir=file.path(
path.package("tm.plugin.koRpus"), "examples", "corpus", "Edwards"
),
hierarchy=list(
Source=c(
Wikipedia_prev="Wikipedia (old)",
Wikipedia_new="Wikipedia (new)"
)
),
# use tokenize() so examples run without a TreeTagger installation
tagger="tokenize",
lang="en"
)
myCorpus <- read.corp.custom(myCorpus)
myCorpus <- freq.analysis(myCorpus)
corpusFreq(myCorpus)
} else {}
Apply hyphen() to all texts in kRp.corpus objects
Description
This method calls hyphen on all tagged text objects inside the given words object (using mclapply).
Usage
## S4 method for signature 'kRp.corpus'
hyphen(words, mc.cores = getOption("mc.cores", 1L), quiet = TRUE,
...)
Arguments
words |
An object of class |
mc.cores |
The number of cores to use for parallelization,
see |
quiet |
Logical,
if |
... |
options to pass through to |
Value
An object of the same class as words.
Examples
# use readCorpus() to create an object of class kRp.corpus
# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
myCorpus <- readCorpus(
dir=file.path(
path.package("tm.plugin.koRpus"), "examples", "corpus", "Winner", "Wikipedia_new"
),
# use tokenize() so examples run without a TreeTagger installation
tagger="tokenize",
lang="en"
)
myCorpus <- hyphen(myCorpus)
} else {}
Apply jumbleWords() to all texts in kRp.corpus objects
Description
This method calls jumbleWords on all tagged text objects inside the given words object (using mclapply).
Usage
## S4 method for signature 'kRp.corpus'
jumbleWords(words, mc.cores = getOption("mc.cores", 1L), ...)
Arguments
words |
An object of class |
mc.cores |
The number of cores to use for parallelization,
see |
... |
options to pass through to |
Value
An object of the same class as words.
Examples
# use readCorpus() to create an object of class kRp.corpus
# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
myCorpus <- readCorpus(
dir=file.path(
path.package("tm.plugin.koRpus"), "examples", "corpus", "Edwards"
),
hierarchy=list(
Source=c(
Wikipedia_prev="Wikipedia (old)",
Wikipedia_new="Wikipedia (new)"
)
),
# use tokenize() so examples run without a TreeTagger installation
tagger="tokenize",
lang="en"
)
head(taggedText(myCorpus), n=10)
myCorpus <- jumbleWords(myCorpus)
head(taggedText(myCorpus), n=10)
} else {}
S4 Class kRp.corpus
Description
Objects of this class can contain full text corpora in a hierarchical structure. The class supports both the tm package's Corpus class and koRpus' own object classes and stores them in separate slots.
Details
Objects should be created using the readCorpus function.
Slots
lang
A character string, naming the language that is assumed for the tokenized texts in this object.
desc
A named list of descriptive statistics of the tagged texts.
meta
A named list. Can be used to store meta information. Currently, no particular format is defined.
raw
A list of objects of class Corpus.
tokens
A data frame as used for the tokens slot in objects of class kRp.text. In addition to the columns usually found in those objects, this data frame also has a factor column for each hierarchical category defined (if any).
features
A named logical vector, indicating which features are available in this object's feat_list slot. Common features are listed in the description of the feat_list slot.
feat_list
A named list with optional analysis results or other content as used by the defined features:
  hierarchy: A named list of named character vectors describing the directory hierarchy level by level.
  hyphen: A named list of objects of class kRp.hyphen.
  readability: A named list of objects of class kRp.readability.
  lex_div: A named list of objects of class kRp.TTR.
  freq: The freq.analysis slot of a kRp.txt.freq class object after freq.analysis was called.
  corp_freq: An object of class kRp.corp.freq, e.g., results of a call to read.corp.custom.
  diff: A named list of diff features of a kRp.text object after a method like textTransform was called.
  summary: A summary data frame for the full corpus, including descriptive statistics on all texts, as well as results of analyses like readability and lexical diversity, if available.
  doc_term_matrix: A sparse document-term matrix, as produced by docTermMatrix.
  stopwords: A numeric vector with the total number of stopwords in each text, if stopwords were analyzed during tokenizing or POS tagging.
See the getter and setter methods for easy access to these sub-slots. There can actually be any number of additional features; the above is just a list of those already defined by this package.
Constructor function
Should you need to manually generate objects of this class (which should rarely be the case), the constructor function kRp.corpus(...) can be used instead of new("kRp.corpus", ...). Whenever possible, stick to readCorpus.
Note
There are also getter and setter methods for objects of this class.
Examples
# use readCorpus() to create an object of class kRp.corpus
# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
myCorpus <- readCorpus(
dir=file.path(path.package("tm.plugin.koRpus"), "examples", "corpus"),
hierarchy=list(
Topic=c(
Winner="Reality Winner",
Edwards="Natalie Edwards"
),
Source=c(
Wikipedia_prev="Wikipedia (old)",
Wikipedia_new="Wikipedia (new)"
)
),
# use tokenize() so examples run without a TreeTagger installation
tagger="tokenize",
lang="en"
)
} else {}
# manual creation
emptyCorpus <- kRp.corpus()
A source function for tm
Description
A rather untested attempt to sketch a Source function for tm. It is intended to be used to translate tagged koRpus objects into tm objects.
Usage
kRpSource(obj, encoding = "UTF-8")
Arguments
obj |
An object of class |
encoding |
Character string, defining the character encoding of the object. |
Details
Also provided are the methods getElem and pGetElem for S3 class kRpSource.
Value
An object of class Source, also inheriting class kRpSource.
Apply lex.div() to all texts in kRp.corpus objects
Description
This method calls lex.div on all tagged text objects inside the given txt object (using mclapply).
Usage
## S4 method for signature 'kRp.corpus'
lex.div(
txt,
summary = TRUE,
mc.cores = getOption("mc.cores", 1L),
char = "",
quiet = TRUE,
...
)
Arguments
txt |
An object of class |
summary |
Logical, determines if the |
mc.cores |
The number of cores to use for parallelization,
see |
char |
Character vector to specify measures of which characteristics should be computed,
see
|
quiet |
Logical, if |
... |
options to pass through to |
Value
An object of the same class as txt.
Examples
# use readCorpus() to create an object of class kRp.corpus
# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
myCorpus <- readCorpus(
dir=file.path(path.package("tm.plugin.koRpus"), "examples", "corpus"),
hierarchy=list(
Topic=c(
Winner="Reality Winner",
Edwards="Natalie Edwards"
),
Source=c(
Wikipedia_prev="Wikipedia (old)",
Wikipedia_new="Wikipedia (new)"
)
),
# use tokenize() so examples run without a TreeTagger installation
tagger="tokenize",
lang="en"
)
myCorpus <- lex.div(myCorpus)
corpusSummary(myCorpus)
} else {}
Apply query() to all texts in kRp.corpus objects
Description
This method calls query on all tagged text objects inside the given object.
Usage
## S4 method for signature 'kRp.corpus'
query(
obj,
var,
query,
rel = "eq",
as.df = TRUE,
ignore.case = TRUE,
perl = FALSE,
regexp_var = "token"
)
Arguments
obj |
An object of class |
var |
A character string naming a column in the tagged text. If set to
|
query |
A character vector (for words), regular expression,
or single number naming values to be matched in the variable.
Can also be a vector of two numbers to query a range of frequency data,
or a list of named lists for multiple queries (see
"Query lists" section of |
rel |
A character string defining the relation of the queried value and desired results.
Must either be |
as.df |
Logical, if |
ignore.case |
Logical, passed through to |
perl |
Logical, passed through to |
regexp_var |
A character string naming the column to query if |
Value
Depending on the arguments, might include whole objects, lists, single values etc.
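As a rough illustration of what a numeric query with rel="gt" and a regular expression query amount to, the following base R sketch filters a token table directly. The tokens data frame is hypothetical, and the subsetting shown is an assumption about the general idea, not the package's implementation.

```r
# Hypothetical token table with a letter-count column, similar in spirit to
# the "lttr" column of tagged text.
tokens <- data.frame(
  doc_id = c("a", "a", "b", "b"),
  token  = c("whistleblower", "was", "intelligence", "leak"),
  stringsAsFactors = FALSE
)
tokens$lttr <- nchar(tokens$token)

# Rough equivalent of var="lttr", query=7, rel="gt": tokens longer than 7 letters.
long_words <- tokens[tokens$lttr > 7, ]

# Rough equivalent of a regular expression query on the "token" column,
# using the same ignore.case and perl switches that are passed through to grepl().
starts_with_w <- tokens[grepl("^w", tokens$token, ignore.case = TRUE, perl = FALSE), ]
```

Because the letter counts are numeric, the comparison value is given as a number rather than a character string.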
Examples
# use readCorpus() to create an object of class kRp.corpus
# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
myCorpus <- readCorpus(
dir=file.path(
path.package("tm.plugin.koRpus"), "examples", "corpus", "Edwards"
),
hierarchy=list(
Source=c(
Wikipedia_prev="Wikipedia (old)",
Wikipedia_new="Wikipedia (new)"
)
),
# use tokenize() so examples run without a TreeTagger installation
tagger="tokenize",
lang="en"
)
query(myCorpus, var="lttr", query=7, rel="gt")
} else {}
Apply read.corp.custom() to all texts in kRp.corpus objects
Description
This method calls read.corp.custom on all tagged text objects inside the given corpus object.
Usage
## S4 method for signature 'kRp.corpus'
read.corp.custom(corpus, caseSens = TRUE, log.base = 10,
keep_dtm = FALSE, ...)
Arguments
corpus |
An object of class |
caseSens |
Logical. If |
log.base |
A numeric value defining the base of the logarithm used for inverse document frequency (idf). See
|
keep_dtm |
Logical. If |
... |
Options to pass through to the |
Details
Since the analysis is based on a document-term matrix, a pre-existing matrix stored as a feature of the corpus object will be used if it matches the case sensitivity setting. Otherwise a new matrix will be generated (but will not replace the existing one). If no document-term matrix is present yet, one will also be generated and can be kept as an additional feature of the resulting object.
Value
An object of the same class as corpus.
Examples
# use readCorpus() to create an object of class kRp.corpus
# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
myCorpus <- readCorpus(
dir=file.path(
path.package("tm.plugin.koRpus"), "examples", "corpus", "Edwards"
),
hierarchy=list(
Source=c(
Wikipedia_prev="Wikipedia (old)",
Wikipedia_new="Wikipedia (new)"
)
),
# use tokenize() so examples run without a TreeTagger installation
tagger="tokenize",
lang="en"
)
myCorpus <- read.corp.custom(myCorpus)
corpusCorpFreq(myCorpus)
} else {}
Create kRp.corpus objects from text files or data frames
Description
You can either read a corpus from text files (one file per text, also see the Hierarchy section below) or from TIF compliant data frames (see the Data frames section below).
Usage
readCorpus(
dir,
hierarchy = list(),
lang = "kRp.env",
tagger = "kRp.env",
encoding = "",
pattern = NULL,
recursive = FALSE,
ignore.case = FALSE,
mode = "text",
format = "file",
mc.cores = getOption("mc.cores", 1L),
id = "",
...
)
Arguments
dir |
Either a file path to the root directory of the text corpus,
or a TIF compliant data frame.
If a directory path (character string),
texts can be recursively ordered into subfolders named
exactly as defined by |
hierarchy |
A named list of named character vectors describing the directory hierarchy level by level.
If |
lang |
A character string naming the language of the analyzed corpus.
See |
tagger |
A character string pointing to the tokenizer/tagger command you want to use for basic text analysis.
Defaults to |
encoding |
Character string describing the current encoding.
See |
pattern |
A regular expression for file matching.
See |
recursive |
Logical, indicates whether directories should be read recursively.
See |
ignore.case |
Logical, indicates whether |
mode |
Character string defining the reading mode.
See |
format |
Either "file" or "obj",
depending on whether you want to scan files or analyze the text in a given object,
like a character vector. If the latter and |
mc.cores |
The number of cores to use for parallelization,
see |
id |
A character string describing the main subject/purpose of the text corpus. |
... |
Additional options which are passed through to the defined |
Value
An object of class kRp.corpus.
Hierarchy
To import a hierarchically structured text corpus you must categorize all texts in a directory structure that resembles the hierarchy. If, for example, you would like to import a corpus on two different topics and two different sources, your hierarchy has two nested levels (topic and source). The root directory dir would then need to have two subdirectories (one for each topic) which in turn must have two subdirectories (one for each source), and the actual text files are found in those.
To use this hierarchical structure in readCorpus, the hierarchy argument is used. It is a named list, where each list item represents one hierarchical level (here again topic and source), and its value is a named character vector describing the actual topics and sources to be used. It is important to understand how these character vectors are treated: The names of elements must exactly match the corresponding subdirectory name, whereas the value is a free text description. The names of the list items, however, describe the hierarchical level and are not matched with directory names.
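The relationship between the hierarchy list and the directory tree can be sketched in base R. The snippet below derives the relative paths implied by a two-level hierarchy and creates them under a temporary root; the directory name corpus_sketch is made up for illustration, and no text files are written.

```r
# Names of the vector elements are directory names, values are free text labels.
hierarchy <- list(
  Topic  = c(Winner = "Reality Winner", Edwards = "Natalie Edwards"),
  Source = c(Wikipedia_prev = "Wikipedia (old)", Wikipedia_new = "Wikipedia (new)")
)

# Every combination of directory names, one hierarchy level per column.
combos <- expand.grid(lapply(hierarchy, names), stringsAsFactors = FALSE)

# Nest the levels in the order given by the list: <Topic>/<Source>.
rel_paths <- do.call(file.path, unname(as.list(combos)))

# Create the layout below a temporary root directory (hypothetical name).
root <- file.path(tempdir(), "corpus_sketch")
for (p in file.path(root, rel_paths)) {
  dir.create(p, recursive = TRUE, showWarnings = FALSE)
}
```

After placing text files into the four leaf directories, a call such as readCorpus(dir=root, hierarchy=hierarchy, ...) would be expected to pick them up. Note that the list names Topic and Source appear nowhere in the paths.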
Data frames
In order to import a corpus from a data frame, the object must be in Text Interchange Format (TIF) as described by [1]. As a minimum, the data frame must have two character columns, doc_id and text.
You can provide additional information on hierarchical categories by using further columns, where the column name must match the category name (hierarchical level). The order of those columns in the data frame is not important, as you must still fully define the hierarchical structure using the hierarchy argument. All columns you omit from the hierarchy argument are ignored, but the values used in the hierarchy list and the respective columns must match, as rows with unmatched category levels will also be ignored.
Note that the special column names path and file will also be imported automatically.
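A minimal sketch of such a data frame, together with two plausibility checks, might look like this in base R. The example data and the has_tif_columns() helper are hypothetical. The row filter assumes, purely for illustration, that cell values are compared against the names of each hierarchy vector; the text above does not state whether names or descriptions are matched.

```r
hierarchy <- list(
  Topic  = c(Winner = "Reality Winner", Edwards = "Natalie Edwards"),
  Source = c(Wikipedia_prev = "Wikipedia (old)", Wikipedia_new = "Wikipedia (new)")
)

# Hypothetical corpus: doc_id and text are character columns, plus one
# column per hierarchy level. The last row uses an undefined topic.
corpus_df <- data.frame(
  doc_id = c("winner_old", "winner_new", "edwards_old", "other"),
  text   = c("First text.", "Second text.", "Third text.", "Fourth text."),
  Topic  = c("Winner", "Winner", "Edwards", "Unknown"),
  Source = c("Wikipedia_prev", "Wikipedia_new", "Wikipedia_prev", "Wikipedia_new"),
  stringsAsFactors = FALSE
)

# Minimum requirement: character columns doc_id and text.
has_tif_columns <- function(df) {
  is.data.frame(df) &&
    all(c("doc_id", "text") %in% colnames(df)) &&
    is.character(df[["doc_id"]]) &&
    is.character(df[["text"]])
}

# ASSUMPTION: a row counts as matched if, for every hierarchy level, its
# value is one of the names defined for that level.
levels_match <- Reduce(
  `&`,
  lapply(names(hierarchy), function(level) {
    corpus_df[[level]] %in% names(hierarchy[[level]])
  })
)
kept <- corpus_df[levels_match, ]
```

Under that assumption the row with the undefined topic would be dropped, which mirrors the statement that rows with unmatched category levels are ignored.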
References
[1] Text Interchange Formats (https://github.com/ropensci/tif)
Examples
# use readCorpus() to create an object of class kRp.corpus
# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
# "flat" corpus, parse all texts in the given dir
myCorpus <- readCorpus(
dir=file.path(
path.package("tm.plugin.koRpus"), "examples", "corpus", "Winner", "Wikipedia_prev"
),
# use tokenize() so examples run without a TreeTagger installation
tagger="tokenize",
lang="en"
)
# corpus with one category named "Source"
myCorpus <- readCorpus(
dir=file.path(
path.package("tm.plugin.koRpus"), "examples", "corpus", "Winner"
),
hierarchy=list(
Source=c(
Wikipedia_prev="Wikipedia (old)",
Wikipedia_new="Wikipedia (new)"
)
),
tagger="tokenize",
lang="en"
)
# two hierarchical levels, "Topic" and "Source"
myCorpus <- readCorpus(
dir=file.path(path.package("tm.plugin.koRpus"), "examples", "corpus"),
hierarchy=list(
Topic=c(
Winner="Reality Winner",
Edwards="Natalie Edwards"
),
Source=c(
Wikipedia_prev="Wikipedia (old)",
Wikipedia_new="Wikipedia (new)"
)
),
tagger="tokenize",
lang="en"
)
# get hierarchy from directory tree
myCorpus <- readCorpus(
dir=file.path(path.package("tm.plugin.koRpus"), "examples", "corpus"),
hierarchy=TRUE,
tagger="tokenize",
lang="en"
)
## Not run:
# if the same corpus is available as TIF compliant data frame
myCorpus <- readCorpus(
dir=myCorpus_df,
hierarchy=list(
Topic=c(
Winner="Reality Winner",
Edwards="Natalie Edwards"
),
Source=c(
Wikipedia_prev="Wikipedia (old)",
Wikipedia_new="Wikipedia (new)"
)
),
lang="en",
format="obj"
)
## End(Not run)
} else {}
Apply readability() to all texts in kRp.corpus objects
Description
This method calls readability on all tagged text objects inside the given txt.file object (using mclapply).
Usage
## S4 method for signature 'kRp.corpus'
readability(
txt.file,
summary = TRUE,
mc.cores = getOption("mc.cores", 1L),
quiet = TRUE,
...
)
Arguments
txt.file |
An object of class |
summary |
Logical, determines if the |
mc.cores |
The number of cores to use for parallelization,
see |
quiet |
Logical,
if |
... |
options to pass through to |
Value
An object of the same class as txt.file.
Examples
# use readCorpus() to create an object of class kRp.corpus
# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
myCorpus <- readCorpus(
dir=file.path(path.package("tm.plugin.koRpus"), "examples", "corpus"),
hierarchy=list(
Topic=c(
Winner="Reality Winner",
Edwards="Natalie Edwards"
),
Source=c(
Wikipedia_prev="Wikipedia (old)",
Wikipedia_new="Wikipedia (new)"
)
),
# use tokenize() so examples run without a TreeTagger installation
tagger="tokenize",
lang="en"
)
myCorpus <- readability(myCorpus)
corpusSummary(myCorpus)
} else {}
Show methods for kRp.corpus objects
Description
Show methods for S4 objects of class kRp.corpus.
Usage
## S4 method for signature 'kRp.corpus'
show(object)
Arguments
object |
An object of class |
Turn a kRp.corpus object into a list of kRp.text objects
Description
For some analysis steps it might be important to have individual tagged texts instead of one large corpus object. This method produces just that.
Usage
## S4 method for signature 'kRp.corpus'
split_by_doc_id(obj, keepFeatures = TRUE)
Arguments
obj |
An object of class |
keepFeatures |
Either logical, whether to keep all features or drop them, or a character vector of names of features to keep if present. |
Value
A named list of objects of class kRp.text. Elements are named by their doc_id.
Examples
# use readCorpus() to create an object of class kRp.corpus
# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
myCorpus <- readCorpus(
dir=file.path(path.package("tm.plugin.koRpus"), "examples", "corpus"),
hierarchy=list(
Topic=c(
Winner="Reality Winner",
Edwards="Natalie Edwards"
),
Source=c(
Wikipedia_prev="Wikipedia (old)",
Wikipedia_new="Wikipedia (new)"
)
),
# use tokenize() so examples run without a TreeTagger installation
tagger="tokenize",
lang="en"
)
myCorpusList <- split_by_doc_id(myCorpus)
} else {}
Apply summary() to all texts in kRp.corpus objects
Description
This method performs a summary call on all text objects inside the given corpus object. Contrary to what other summary methods do, this method always returns the full object with an updated summary slot.
Usage
## S4 method for signature 'kRp.corpus'
summary(object, missing = NA, ...)
corpusSummary(obj)
## S4 method for signature 'kRp.corpus'
corpusSummary(obj)
corpusSummary(obj) <- value
## S4 replacement method for signature 'kRp.corpus'
corpusSummary(obj) <- value
Arguments
object |
An object of class |
missing |
Character string to use for missing values. |
... |
Used for internal processes. |
obj |
An object of class |
value |
The new value to replace the current with. |
Details
The summary slot contains a data.frame with aggregated information of all texts that the respective object contains.
corpusSummary is a simple method to get or set the summary slot in kRp.corpus objects directly.
Value
An object of the same class as object.
Examples
# use readCorpus() to create an object of class kRp.corpus
# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
myCorpus <- readCorpus(
dir=file.path(
path.package("tm.plugin.koRpus"), "examples", "corpus", "Edwards"
),
hierarchy=list(
Source=c(
Wikipedia_prev="Wikipedia (old)",
Wikipedia_new="Wikipedia (new)"
)
),
# use tokenize() so examples run without a TreeTagger installation
tagger="tokenize",
lang="en"
)
# calculate readability, but prevent a summary table from being added
myCorpus <- readability(myCorpus, summary=FALSE)
corpusSummary(myCorpus)
# add summaries
myCorpus <- summary(myCorpus)
corpusSummary(myCorpus)
} else {}
Getter/setter methods for kRp.corpus objects
Description
These methods should be used to get or set values of text objects generated by functions like readCorpus.
Usage
## S4 method for signature 'kRp.corpus'
taggedText(obj)
## S4 replacement method for signature 'kRp.corpus'
taggedText(obj) <- value
## S4 method for signature 'kRp.corpus'
doc_id(obj, has_id = NULL)
## S4 method for signature 'kRp.corpus'
describe(obj, doc_id = NULL, simplify = TRUE, ...)
## S4 replacement method for signature 'kRp.corpus'
describe(obj, doc_id = NULL, ...) <- value
## S4 method for signature 'kRp.corpus'
language(obj)
## S4 replacement method for signature 'kRp.corpus'
language(obj) <- value
## S4 method for signature 'kRp.corpus'
hasFeature(obj, feature = NULL)
## S4 replacement method for signature 'kRp.corpus'
hasFeature(obj, feature) <- value
## S4 method for signature 'kRp.corpus'
feature(obj, feature, doc_id = NULL)
## S4 replacement method for signature 'kRp.corpus'
feature(obj, feature) <- value
## S4 method for signature 'kRp.corpus'
corpusReadability(obj, doc_id = NULL)
## S4 replacement method for signature 'kRp.corpus'
corpusReadability(obj) <- value
corpusTm(obj)
## S4 method for signature 'kRp.corpus'
corpusTm(obj)
corpusTm(obj) <- value
## S4 replacement method for signature 'kRp.corpus'
corpusTm(obj) <- value
corpusMeta(obj, meta = NULL, fail = TRUE)
## S4 method for signature 'kRp.corpus'
corpusMeta(obj, meta = NULL, fail = TRUE)
corpusMeta(obj, meta = NULL) <- value
## S4 replacement method for signature 'kRp.corpus'
corpusMeta(obj, meta = NULL) <- value
## S4 method for signature 'kRp.corpus'
corpusHyphen(obj, doc_id = NULL)
## S4 replacement method for signature 'kRp.corpus'
corpusHyphen(obj) <- value
## S4 method for signature 'kRp.corpus'
corpusLexDiv(obj, doc_id = NULL)
## S4 replacement method for signature 'kRp.corpus'
corpusLexDiv(obj) <- value
## S4 method for signature 'kRp.corpus'
corpusFreq(obj)
## S4 replacement method for signature 'kRp.corpus'
corpusFreq(obj) <- value
## S4 method for signature 'kRp.corpus'
corpusCorpFreq(obj)
## S4 replacement method for signature 'kRp.corpus'
corpusCorpFreq(obj) <- value
corpusHierarchy(obj, ...)
## S4 method for signature 'kRp.corpus'
corpusHierarchy(obj)
corpusHierarchy(obj) <- value
## S4 replacement method for signature 'kRp.corpus'
corpusHierarchy(obj) <- value
corpusFiles(obj, paths = FALSE, ...)
## S4 method for signature 'kRp.corpus'
corpusFiles(obj, paths = FALSE)
corpusFiles(obj) <- value
## S4 replacement method for signature 'kRp.corpus'
corpusFiles(obj) <- value
corpusDocTermMatrix(obj, ...)
## S4 method for signature 'kRp.corpus'
corpusDocTermMatrix(obj)
corpusDocTermMatrix(obj, terms = NULL, case.sens = NULL, tfidf = NULL) <- value
## S4 replacement method for signature 'kRp.corpus'
corpusDocTermMatrix(obj, terms = NULL, case.sens = NULL,
tfidf = NULL) <- value
## S4 method for signature 'kRp.corpus'
corpusStopwords(obj)
## S4 replacement method for signature 'kRp.corpus'
corpusStopwords(obj) <- value
## S4 method for signature 'kRp.corpus'
diffText(obj, doc_id = NULL)
## S4 replacement method for signature 'kRp.corpus'
diffText(obj) <- value
## S4 method for signature 'kRp.corpus'
originalText(obj)
is.corpus(obj)
## S4 method for signature 'kRp.corpus,ANY,ANY,ANY'
x[i, j, ..., drop = TRUE]
## S4 replacement method for signature 'kRp.corpus,ANY,ANY,ANY'
x[i, j, ...] <- value
## S4 method for signature 'kRp.corpus'
x[[i, doc_id = NULL, ...]]
## S4 replacement method for signature 'kRp.corpus'
x[[i, doc_id = NULL, ...]] <- value
## S4 method for signature 'kRp.corpus'
tif_as_tokens_df(tokens)
tif_as_corpus_df(corpus)
## S4 method for signature 'kRp.corpus'
tif_as_corpus_df(corpus)
Arguments
obj |
An object of class |
value |
A new value to replace the current with. |
has_id |
A character vector with |
doc_id |
A character vector to limit the scope to one or more particular document IDs. |
simplify |
If |
... |
Additional arguments to pass through, depending on the method. |
feature |
Character string naming the object feature to look for. |
meta |
If not NULL, the |
fail |
Logical, whether the method should fail with an error if |
paths |
Logical, indicates for |
terms |
A character string defining the |
case.sens |
Logical, whether terms were counted case-sensitively. Stored in the object's meta data slot. |
tfidf |
Logical, use |
x |
See |
i |
Defines the row selector ( |
j |
Defines the column selector in the tokens slot. |
drop |
See |
tokens |
An object of class |
corpus |
An object of class |
Details
taggedText() returns the tokens slot.
describe() returns the desc slot.
hasFeature() returns TRUE or FALSE, depending on whether the requested feature is present or not.
feature() returns the list entry of the feat_list slot for the requested feature.
corpusReadability() returns the list of kRp.readability objects.
corpusTm() returns the VCorpus object.
corpusMeta() returns the list with meta information.
corpusHyphen() returns the list of kRp.hyphen objects.
corpusLexDiv() returns the list of kRp.TTR objects.
corpusFiles() returns the character vector of file names of the object.
corpusFreq() returns the frequency analysis data from the feat_list slot.
corpusCorpFreq() returns the kRp.corp.freq object of the feat_list slot.
corpusHierarchy() returns the corpus' hierarchy structure.
corpusDocTermMatrix() returns the sparse document term matrix of the feat_list slot.
corpusStopwords() returns the number of stopwords found in each text (if analyzed) from the feat_list slot.
diffText() returns the diff element of the feat_list slot.
originalText() regenerates the original text before text transformations and returns it as a data frame.
[/[[ can be used as a shortcut to index the results of taggedText().
tif_as_corpus_df() returns the whole corpus in a single TIF[1] compliant data.frame.
tif_as_tokens_df() returns the tokens slot in a TIF[1] compliant data.frame, i.e., doc_id is not a factor but a character vector.
References
[1] Text Interchange Formats (https://github.com/ropensci/tif)
Examples
# use readCorpus() to create an object of class kRp.corpus
# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
  myCorpus <- readCorpus(
    dir=file.path(
      path.package("tm.plugin.koRpus"), "examples", "corpus", "Winner", "Wikipedia_new"
    ),
    # use tokenize() so examples run without a TreeTagger installation
    tagger="tokenize",
    lang="en"
  )
  taggedText(myCorpus)
  corpusMeta(myCorpus, "note") <- "an interesting read!"
  # export object to TIF compliant data frame
  myCorpus_df <- tif_as_corpus_df(myCorpus)
} else {}
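A few more of the getters from the Usage section can be combined as in the following sketch. It only uses calls listed on this page. The variable names tokens_col and tokens_df are made up for illustration, and "token" is assumed to be a column name of the tokens slot.

```r
# Sketch only: the block is skipped if the packages are not installed.
if(require("tm.plugin.koRpus", quietly = TRUE) &&
   require("koRpus.lang.en", quietly = TRUE)){
  myCorpus <- readCorpus(
    dir=file.path(
      path.package("tm.plugin.koRpus"), "examples", "corpus", "Winner", "Wikipedia_new"
    ),
    # use tokenize() so the sketch runs without a TreeTagger installation
    tagger="tokenize",
    lang="en"
  )
  # basic information about the object
  is.corpus(myCorpus)
  doc_id(myCorpus)
  language(myCorpus)
  corpusFiles(myCorpus)
  # [[ as a shortcut to one column of taggedText()
  # (assumption: the tokens slot has a column named "token")
  tokens_col <- myCorpus[["token"]]
  head(tokens_col)
  # set a meta entry and read it back
  corpusMeta(myCorpus, "note") <- "an interesting read!"
  corpusMeta(myCorpus, "note")
  # TIF compliant token table: doc_id is a character vector, not a factor
  tokens_df <- tif_as_tokens_df(myCorpus)
  class(tokens_df[["doc_id"]])
} else {}
```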
Apply textTransform() to all texts in kRp.corpus objects
Description
This method calls textTransform on all tagged text objects inside the given txt object (using mclapply).
Usage
## S4 method for signature 'kRp.corpus'
textTransform(txt, mc.cores = getOption("mc.cores", 1L), ...)
Arguments
txt |
An object of class |
mc.cores |
The number of cores to use for parallelization, see |
... |
options to pass through to |
Value
An object of the same class as txt.
Examples
# use readCorpus() to create an object of class kRp.corpus
# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
  myCorpus <- readCorpus(
    dir=file.path(
      path.package("tm.plugin.koRpus"), "examples", "corpus", "Edwards"
    ),
    hierarchy=list(
      Source=c(
        Wikipedia_prev="Wikipedia (old)",
        Wikipedia_new="Wikipedia (new)"
      )
    ),
    # use tokenize() so examples run without a TreeTagger installation
    tagger="tokenize",
    lang="en"
  )
  head(taggedText(myCorpus), n=10)
  myCorpus <- textTransform(myCorpus, scheme="minor")
  head(taggedText(myCorpus), n=10)
} else {}
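Because the method relies on mclapply, the mc.cores argument decides how many worker processes are used. The following sketch sets it explicitly. mclapply works by forking, so values above 1 are not supported on Windows; the variable n_cores and the choice of two workers are illustrative assumptions, not a recommendation from the package.

```r
# Sketch only: the block is skipped if the packages are not installed.
if(require("tm.plugin.koRpus", quietly = TRUE) &&
   require("koRpus.lang.en", quietly = TRUE)){
  myCorpus <- readCorpus(
    dir=file.path(
      path.package("tm.plugin.koRpus"), "examples", "corpus", "Edwards"
    ),
    hierarchy=list(
      Source=c(
        Wikipedia_prev="Wikipedia (old)",
        Wikipedia_new="Wikipedia (new)"
      )
    ),
    # use tokenize() so the sketch runs without a TreeTagger installation
    tagger="tokenize",
    lang="en"
  )
  # hypothetical choice: two workers, but only one on Windows,
  # where mclapply() cannot fork
  n_cores <- if(identical(.Platform$OS.type, "windows")) 1L else 2L
  myCorpus <- textTransform(myCorpus, scheme="minor", mc.cores=n_cores)
  head(taggedText(myCorpus), n=10)
} else {}
```

Setting options(mc.cores=n_cores) once per session would have the same effect for calls that rely on the default getOption("mc.cores", 1L).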