---
title: "Using the tm.plugin.koRpus Package for Corpus Analysis"
author: "m.eik michalke"
date: "`r Sys.Date()`"
output:
  html_document:
    theme: cerulean
    highlight: kate
    toc: true
    toc_float:
      collapsed: false
      smooth_scroll: false
    toc_depth: 3
    includes:
      in_header: vignette_header.html
abstract: >
  The R package `tm.plugin.koRpus` is an extension to the `koRpus` package, enhancing its usability for actual corpus analysis. It adds object classes and methods, inheriting from those provided by `koRpus`,
  which are designed to work with complete text corpora in both `koRpus` and `tm` formats. This vignette gives you a quick overview.
vignette: >
  %\VignetteIndexEntry{Using the tm.plugin.koRpus Package for Corpus Analysis}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8x]{inputenc}
  \usepackage{lmodern}
  % \usepackage[apaciteclassic]{apacite}
---
```{r setup, include=FALSE}
# write an empty header file which is referenced by "in_header"
# in the YAML front matter above
header_con <- file("vignette_header.html")
writeLines('', header_con)
close(header_con)
```
# What is tm.plugin.koRpus?
While the `koRpus` package focusses mostly on analysis steps of individual texts, `tm.plugin.koRpus` adds a new object class and respective methods, which can be used
to analyse complete text corpora in a single step. The object class can also be a first step to building a bridge between the `koRpus` and `tm` packages.
At the core of this package there is one particular object class -- `kRp.corpus` -- which can be used to construct simple corpus objects or even hierarchically nested corpora.
That is, you are able to categorize corpora on as many levels as you need. The examples in this vignette use two levels, one being different *topics* the texts
in the sample corpus deal with, and the other different *sources* the texts come from.
If you don't need these hierarchical levels, you can just use the method `readCorpus()` to create a corpus object, i.e., a simple collection of texts (see the sketch below).
To distinguish texts which come from different sources or deal with different topics, use the `hierarchy` argument, which will add categorical columns to
the tagged text objects. Note that these objects will only be valid if there are texts of each topic from each source.
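If all you need is such a flat corpus, a call might look like this. A minimal sketch, assuming a hypothetical directory `~/texts` that contains plain text files:
```{r, eval=FALSE}
# "~/texts" is a placeholder for any directory containing
# plain text files, without further subdirectories
flatTexts <- readCorpus(
  dir="~/texts",
  tagger="tokenize",
  lang="de"
)
```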
Now, if this still confuses you, let's look at a small example.
# Tokenizing corpora
As with `koRpus`, the first step for text analysis is tokenizing and possibly POS tagging. This step is performed by the `readCorpus()` method mentioned above.
The package includes four sample texts taken from Wikipedia^[See the file `tests/testthat/samples/License_of_sample_texts.txt` for details.] in its `tests` directory, which
we can use for a demonstration:
```{r, eval=FALSE}
library(tm.plugin.koRpus)
library(koRpus.lang.de)
# set the root path to the sample files
sampleRoot <- file.path(path.package("tm.plugin.koRpus"), "tests", "testthat", "samples")
# the next call uses "hierarchy" to describe the directory structure
# and its meaning; see below
sampleTexts <- readCorpus(
  dir=sampleRoot,
  hierarchy=list(
    Topic=c(
      C3S="C3S SCE",
      GEMA="GEMA e.V."
    ),
    Source=c(
      Wikipedia_alt="Wikipedia (alt)",
      Wikipedia_neu="Wikipedia (neu)"
    )
  ),
  tagger="tokenize",
  lang="de"
)
```
```
Processing corpus...
Topic "C3S SCE", 2 texts...
Source "Wikipedia (alt)", 1 text...
Source "Wikipedia (neu)", 1 text...
Topic "GEMA e.V.", 2 texts...
Source "Wikipedia (alt)", 1 text...
Source "Wikipedia (neu)", 1 text...
```
## The `hierarchy` argument
The `hierarchy` argument describes our corpus in a very condensed format. It is a named list of named character vectors,
where each list entry represents a hierarchical level. In this case, the top level is called *"Topic"*, and below that is the
level *"Source"*. These hierarchical levels must also be represented by the directory structure of the texts to parse,
and the *names* of the character vectors must be identical to the *directory names* below the root directory specified by `dir`.^[
Future versions of this package might add further ways of describing your corpus, like using a configuration file or providing
a full corpus in XML or JSON format. But don't hold your breath.]
So on your file system, what the `hierarchy` argument above describes is the following layout:
```
.../samples/
  C3S/
    Wikipedia_alt/
      Text01.txt
      Text02.txt
      ...
    Wikipedia_neu/
      Text03.txt
      Text04.txt
      ...
  GEMA/
    Wikipedia_alt/
      Text05.txt
      Text06.txt
      ...
    Wikipedia_neu/
      Text07.txt
      Text08.txt
      ...
```
Since we're using the `koRpus` package for all actual analysis, you can also set up your environment with `set.kRp.env()` and POS-tag all texts with `TreeTagger`^[See the `koRpus`
documentation for details.].
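For example, such a TreeTagger run could look like the following. This is only a sketch: the installation path `~/bin/treetagger` is an assumption and has to be adjusted to your system, and it relies on `readCorpus()` taking its tagger settings from the environment via `tagger="kRp.env"`:
```{r, eval=FALSE}
# sketch only: adjust the path to your local TreeTagger installation
set.kRp.env(
  TT.cmd="manual",
  lang="de",
  TT.options=list(path="~/bin/treetagger", preset="de")
)
sampleTextsTagged <- readCorpus(
  dir=sampleRoot,
  hierarchy=list(
    Topic=c(C3S="C3S SCE", GEMA="GEMA e.V."),
    Source=c(Wikipedia_alt="Wikipedia (alt)", Wikipedia_neu="Wikipedia (neu)")
  ),
  tagger="kRp.env"
)
```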
# Analysing corpora
After the initial tokenizing, we can analyse the corpus by calling the provided methods, for instance lexical diversity:
```{r, eval=FALSE}
sampleTexts <- lex.div(sampleTexts)
corpusSummary(sampleTexts)
```
```
doc_id Topic
C3S-Wikipedia_alt-C3S_2013-09-24.txt C3S-Wikipedia_alt-C3S_2013-09-24.txt C3S SCE
C3S-Wikipedia_neu-C3S_2015-07-05.txt C3S-Wikipedia_neu-C3S_2015-07-05.txt C3S SCE
GEMA-Wikipedia_alt-GEMA_2013-09-26.txt GEMA-Wikipedia_alt-GEMA_2013-09-26.txt GEMA e.V.
GEMA-Wikipedia_neu-GEMA_2015-07-05.txt GEMA-Wikipedia_neu-GEMA_2015-07-05.txt GEMA e.V.
Source a C CTTR HDD K lgV0 MATTR MSTTR
C3S-Wikipedia_alt-C3S_2013-09-24.txt Wikipedia (alt) 0.16 0.95 6.13 38.14 49.92 6.21 0.81 0.79
C3S-Wikipedia_neu-C3S_2015-07-05.txt Wikipedia (neu) 0.17 0.94 6.82 38.05 54.88 6.10 0.82 0.76
GEMA-Wikipedia_alt-GEMA_2013-09-26.txt Wikipedia (alt) 0.17 0.94 7.07 37.61 65.08 6.11 0.80 0.78
GEMA-Wikipedia_neu-GEMA_2015-07-05.txt Wikipedia (neu) 0.16 0.94 7.13 37.87 60.14 6.24 0.81 0.79
MTLD MTLDMA R S TTR U
C3S-Wikipedia_alt-C3S_2013-09-24.txt 100.16 NA 8.68 0.93 0.78 39.92
C3S-Wikipedia_neu-C3S_2015-07-05.txt 123.01 NA 9.65 0.92 0.73 36.46
GEMA-Wikipedia_alt-GEMA_2013-09-26.txt 106.94 192 10.00 0.92 0.71 35.96
GEMA-Wikipedia_neu-GEMA_2015-07-05.txt 111.64 NA 10.08 0.92 0.73 37.47
```
As you can see, `corpusSummary()` returns a `data.frame` object with the summarised results of all
texts. Here's an example of how to use this to plot interactions:
```{r, eval=FALSE}
library(sciplot)
lineplot.CI(
  x.factor=corpusSummary(sampleTexts)[["Source"]],
  response=corpusSummary(sampleTexts)[["MTLD"]],
  group=corpusSummary(sampleTexts)[["Topic"]],
  type="l",
  main="MTLD",
  xlab="Media source",
  ylab="Lexical diversity score",
  col=c("grey", "black"),
  lwd=2
)
```
Well, the example texts aren't so impressive here, as there's not much variance with only one text per source and topic.
There are quite a number of `corpus*()` getter/setter methods for slots of these objects, e.g.,
`corpusReadability()` to get the `readability()` results from objects of class `kRp.corpus`.
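For instance, a readability analysis could look like this. A minimal sketch, assuming the German hyphenation patterns needed by some measures are available (they are pulled in by `koRpus.lang.de`):
```{r, eval=FALSE}
# run readability analyses on all texts of the corpus ...
sampleTexts <- readability(sampleTexts)
# ... and fetch the results with the respective getter
corpusReadability(sampleTexts)
```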
The S4 object class provided by `tm.plugin.koRpus` directly inherits its structure from `kRp.text` of the `koRpus` package,
adding slots for meta information and for `Corpus` objects of the `tm` package which hold the raw data.
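Should you want to continue working with the raw data in `tm` itself, here is a sketch, assuming the `corpusTm()` getter for that slot:
```{r, eval=FALSE}
# fetch the raw texts as a tm Corpus object
corpusTm(sampleTexts)
```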
Two methods can be especially helpful for further analysis. The first one, `tif_as_tokens_df()`, returns
a `data.frame` including all texts of the tokenized corpus in a format that is compatible with the
[Text Interchange Formats](https://github.com/ropensci/tif) standard.
The second one is a family of `[`, `[<-`, `[[` and `[[<-` shortcuts to directly interact with the
`data.frame` object you would get via `taggedText()`.
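A short sketch of both, assuming the `"token"` column of the `taggedText()` `data.frame`:
```{r, eval=FALSE}
# all tokens of the corpus in a TIF compliant data.frame
sampleTokens <- tif_as_tokens_df(sampleTexts)
# the shortcuts index the data.frame you would get via taggedText(),
# e.g., fetch the token column
head(sampleTexts[["token"]])
```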
## Frequency analysis
The object class makes it quite comfortable to analyse type frequencies in corpora. There is a method
`read.corp.custom()` for these classes which will run this analysis recursively on all levels:
```{r, eval=FALSE}
sampleTexts <- read.corp.custom(sampleTexts, case.sens=FALSE)
sampleTextsWordFreq <- query(
  corpusCorpFreq(sampleTexts),
  var="wclass",
  query=kRp.POS.tags(lang="de", list.classes=TRUE, tags="words")
)
head(sampleTextsWordFreq, 10)
```
```
num word lemma tag wclass lttr freq pct pmio log10 rank.avg rank.min
3 3 die word.kRp word 3 30 0.037220844 37220 4.570776 263.0 263
4 4 der word.kRp word 3 21 0.026054591 26054 4.415874 262.0 262
5 5 gema word.kRp word 4 17 0.021091811 21091 4.324097 260.5 260
6 6 und word.kRp word 3 17 0.021091811 21091 4.324097 260.5 260
7 7 einer word.kRp word 5 12 0.014888337 14888 4.172836 258.5 258
8 8 von word.kRp word 3 12 0.014888337 14888 4.172836 258.5 258
11 11 ist word.kRp word 3 10 0.012406948 12406 4.093632 256.0 255
12 12 bei word.kRp word 3 9 0.011166253 11166 4.047898 254.0 254
13 13 das word.kRp word 3 8 0.009925558 9925 3.996731 252.5 252
14 14 urheber word.kRp word 7 8 0.009925558 9925 3.996731 252.5 252
rank.rel.avg rank.rel.min inDocs idf
3 99.24528 99.24528 4 0.00000
4 98.86792 98.86792 4 0.00000
5 98.30189 98.11321 4 0.00000
6 98.30189 98.11321 4 0.00000
7 97.54717 97.35849 4 0.00000
8 97.54717 97.35849 4 0.00000
11 96.60377 96.22642 4 0.00000
12 95.84906 95.84906 4 0.00000
13 95.28302 95.09434 4 0.00000
14 95.28302 95.09434 2 0.30103
```
In combination with the `wordcloud` package, this can directly be used to plot
word clouds:
```{r, eval=FALSE}
library(wordcloud)
colors <- brewer.pal(8, "RdGy")
wordcloud(
  head(sampleTextsWordFreq[["word"]], 200),
  head(sampleTextsWordFreq[["freq"]], 200),
  random.color=TRUE,
  colors=colors
)
```
*The 200 most frequent words in the example corpus.*