Type: | Package |
Title: | Prototype of Multiple Latent Dirichlet Allocation Runs |
Version: | 0.3.1 |
Date: | 2021-09-01 |
Description: | Determine a Prototype from a number of runs of Latent Dirichlet Allocation (LDA) by measuring their pairwise similarities with S-CLOP (Similarity of multiple sets by Clustering with Local Pruning): the procedure selects the LDA run with the highest mean pairwise similarity to all other runs. LDA runs are specified by their assignments, which lead to estimators for the distribution parameters. Repeated runs lead to different results; this is addressed by choosing the most representative LDA run as the Prototype. |
URL: | https://github.com/JonasRieger/ldaPrototype |
BugReports: | https://github.com/JonasRieger/ldaPrototype/issues |
License: | GPL (≥ 3) |
Encoding: | UTF-8 |
Depends: | R (≥ 3.5.0) |
Imports: | batchtools (≥ 0.9.11), checkmate (≥ 1.8.5), colorspace (≥ 1.4-1), data.table (≥ 1.11.2), dendextend, fs (≥ 1.2.0), future, lda (≥ 1.4.2), parallelMap, progress (≥ 1.1.1), stats, utils |
Suggests: | covr, RColorBrewer (≥ 1.1-2), testthat, tosca |
RoxygenNote: | 7.1.1 |
LazyData: | true |
NeedsCompilation: | no |
Packaged: | 2021-09-01 15:55:37 UTC; riege |
Author: | Jonas Rieger |
Maintainer: | Jonas Rieger <jonas.rieger@tu-dortmund.de> |
Repository: | CRAN |
Date/Publication: | 2021-09-02 11:20:02 UTC |
ldaPrototype: Prototype of Multiple Latent Dirichlet Allocation Runs
Description
Determine a Prototype from a number of runs of Latent Dirichlet
Allocation (LDA) by measuring their pairwise similarities with S-CLOP
(Similarity of multiple sets by Clustering with Local Pruning): the procedure
selects the LDA run with the highest mean pairwise similarity to all other
runs. LDA runs are specified by their assignments, which lead to estimators for
the distribution parameters. Repeated runs lead to different results; this is
addressed by choosing the most representative LDA run as the Prototype.
For bug reports and feature requests please use the issue tracker:
https://github.com/JonasRieger/ldaPrototype/issues. Also have a look at
the (detailed) example at https://github.com/JonasRieger/ldaPrototype.
Data
reuters
Example Dataset (91 articles from Reuters) for testing.
Constructor
LDA
LDA objects used in this package.
as.LDARep
LDARep objects.
as.LDABatch
LDABatch objects.
Getter
getTopics
Getter for LDA objects.
getJob
Getter for LDARep and LDABatch objects.
getSimilarity
Getter for TopicSimilarity objects.
getSCLOP
Getter for PrototypeLDA objects.
getPrototype
Determine the Prototype LDA.
Performing multiple LDAs
LDARep
Performing multiple LDAs locally (using parallelization).
LDABatch
Performing multiple LDAs on Batch Systems.
Calculation Steps (Workflow) to determine the Prototype LDA
mergeTopics
Merge topic matrices from multiple LDAs.
jaccardTopics
Calculate topic similarities using the Jaccard coefficient (see Similarity Measures for other possible measures).
dendTopics
Create a dendrogram from topic similarities.
SCLOP
Determine various S-CLOP values.
pruneSCLOP
Prune TopicDendrogram objects.
Similarity Measures
cosineTopics
Cosine Similarity.
jaccardTopics
Jaccard Coefficient.
jsTopics
Jensen-Shannon Divergence.
rboTopics
Rank-Biased Overlap.
Shortcuts
getPrototype
Shortcut which includes all calculation steps.
LDAPrototype
Shortcut which performs multiple LDAs and determines their Prototype.
Author(s)
Maintainer: Jonas Rieger jonas.rieger@tu-dortmund.de (ORCID)
References
Rieger, Jonas (2020). "ldaPrototype: A method in R to get a Prototype of multiple Latent Dirichlet Allocations". Journal of Open Source Software, 5(51), 2181, doi: 10.21105/joss.02181.
Rieger, Jonas, Jörg Rahnenführer and Carsten Jentsch (2020). "Improving Latent Dirichlet Allocation: On Reliability of the Novel Method LDAPrototype". In: Natural Language Processing and Information Systems, NLDB 2020. LNCS 12089, pp. 118–125, doi: 10.1007/978-3-030-51310-8_11.
Rieger, Jonas, Lars Koppers, Carsten Jentsch and Jörg Rahnenführer (2020). "Improving Reliability of Latent Dirichlet Allocation by Assessing Its Stability using Clustering Techniques on Replicated Runs". arXiv 2003.04980, URL https://arxiv.org/abs/2003.04980.
See Also
Useful links:
https://github.com/JonasRieger/ldaPrototype
Report bugs at https://github.com/JonasRieger/ldaPrototype/issues
LDA Object
Description
Constructor for LDA objects used in this package.
Usage
LDA(
x,
param,
assignments,
topics,
document_sums,
document_expects,
log.likelihoods
)
as.LDA(
x,
param,
assignments,
topics,
document_sums,
document_expects,
log.likelihoods
)
is.LDA(obj, verbose = FALSE)
Arguments
x |
[ |
param |
[ |
assignments |
Individual element for LDA object. |
topics |
Individual element for LDA object. |
document_sums |
Individual element for LDA object. |
document_expects |
Individual element for LDA object. |
log.likelihoods |
Individual element for LDA object. |
obj |
[ |
verbose |
[ |
Details
The functions LDA and as.LDA do exactly the same thing. If you call LDA on an
object x that already has the structure of an LDA object (in particular an LDA
object itself), the additional arguments param, assignments, ... may be used to
override the corresponding elements.
Value
[named list
] LDA object.
See Also
Other constructor functions:
as.LDABatch()
,
as.LDARep()
Other LDA functions:
LDABatch()
,
LDARep()
,
getTopics()
Examples
res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 1, K = 10)
lda = getLDA(res)
LDA(lda)
# does not change anything
LDA(lda, assignments = NULL)
# creates a new LDA object without the assignments element
LDA(param = getParam(lda), topics = getTopics(lda))
# creates a new LDA object with elements param and topics
LDA Replications on a Batch System
Description
Performs multiple runs of Latent Dirichlet Allocation on a batch system using
the batchtools-package
.
Usage
LDABatch(
docs,
vocab,
n = 100,
seeds,
id = "LDABatch",
load = FALSE,
chunk.size = 1,
resources,
...
)
Arguments
docs |
[ |
vocab |
[ |
n |
[ |
seeds |
[ |
id |
[ |
load |
[ |
chunk.size |
[ |
resources |
[ |
... |
additional arguments passed to |
Details
The function generates multiple LDA runs with the possibility of
using a batch system. The integration is done by the
batchtools-package
. After all jobs of the
corresponding registry are terminated, the whole registry can be ported to
your local computer for further analysis.
The function returns an LDABatch object. You can retrieve the results and
all other elements of this object with getter functions (see getJob).
Value
[named list] with entries id for the registry's folder name,
jobs for the submitted jobs' ids and their parameter settings, and
reg for the registry itself.
See Also
Other batch functions:
as.LDABatch()
,
getJob()
,
mergeBatchTopics()
Other LDA functions:
LDARep()
,
LDA()
,
getTopics()
Examples
## Not run:
batch = LDABatch(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 15)
batch
getRegistry(batch)
getJob(batch)
getLDA(batch, 2)
batch2 = LDABatch(docs = reuters_docs, vocab = reuters_vocab, K = 15, chunk.size = 20)
batch2
head(getJob(batch2))
## End(Not run)
Determine the Prototype LDA
Description
Performs multiple runs of LDA and computes the Prototype LDA of this set of LDAs.
Usage
LDAPrototype(
docs,
vocabLDA,
vocabMerge = vocabLDA,
n = 100,
seeds,
id = "LDARep",
pm.backend,
ncpus,
limit.rel,
limit.abs,
atLeast,
progress = TRUE,
keepTopics = FALSE,
keepSims = FALSE,
keepLDAs = FALSE,
...
)
Arguments
docs |
[ |
vocabLDA |
[ |
vocabMerge |
[ |
n |
[ |
seeds |
[ |
id |
[ |
pm.backend |
[ |
ncpus |
[ |
limit.rel |
[0,1] |
limit.abs |
[ |
atLeast |
[ |
progress |
[ |
keepTopics |
[ |
keepSims |
[ |
keepLDAs |
[ |
... |
additional arguments passed to |
Details
While LDAPrototype marks the overall shortcut for performing multiple LDA runs
and choosing their Prototype, getPrototype starts at the step of determining
the Prototype; the multiple LDAs have to be generated before getPrototype is used.
To save memory, a lot of interim calculations are discarded by default.
If you use parallel computation, no progress bar is shown.
For details see the details sections of the workflow functions at getPrototype
.
Value
[named list] with entries:
id [character(1)]: See above.
protoid [character(1)]: Name (ID) of the determined Prototype LDA.
lda: List of LDA objects of the determined Prototype LDA and - if keepLDAs is TRUE - all considered LDAs.
jobs [data.table]: Parameter specifications for the LDAs.
param [named list]: Parameter specifications for limit.rel [0,1], limit.abs [integer(1)] and atLeast [integer(1)]. See above for explanation.
topics [named matrix]: Count of vocabularies (row wise) in topics (column wise).
sims [lower triangular named matrix]: All pairwise Jaccard similarities of the given topics.
wordslimit [integer]: Counts of words determined as relevant based on limit.rel and limit.abs.
wordsconsidered [integer]: Counts of considered words for similarity calculation. Could differ from wordslimit if atLeast is greater than zero.
sclop [symmetrical named matrix]: All pairwise S-CLOP scores of the given LDA runs.
See Also
Other shortcut functions:
getPrototype()
Other PrototypeLDA functions:
getPrototype()
,
getSCLOP()
Other replication functions:
LDARep()
,
as.LDARep()
,
getJob()
,
mergeRepTopics()
Examples
res = LDAPrototype(docs = reuters_docs, vocabLDA = reuters_vocab,
n = 4, K = 10, num.iterations = 30)
res
getPrototype(res) # = getLDA(res)
getSCLOP(res)
res = LDAPrototype(docs = reuters_docs, vocabLDA = reuters_vocab,
n = 4, K = 10, num.iterations = 30, keepLDAs = TRUE)
res
getLDA(res, all = TRUE)
getPrototypeID(res)
getParam(res)
LDA Replications
Description
Performs multiple runs of Latent Dirichlet Allocation.
Usage
LDARep(docs, vocab, n = 100, seeds, id = "LDARep", pm.backend, ncpus, ...)
Arguments
docs |
[ |
vocab |
[ |
n |
[ |
seeds |
[ |
id |
[ |
pm.backend |
[ |
ncpus |
[ |
... |
additional arguments passed to |
Details
The function generates multiple LDA runs with the possibility of
using parallelization. The integration is done by the
parallelMap-package
.
The function returns an LDARep object. You can retrieve the results and
all other elements of this object with getter functions (see getJob).
Value
[named list] with entries id for the computation's name,
jobs for the parameter settings and lda for the results themselves.
See Also
Other replication functions:
LDAPrototype()
,
as.LDARep()
,
getJob()
,
mergeRepTopics()
Other LDA functions:
LDABatch()
,
LDA()
,
getTopics()
Other workflow functions:
SCLOP()
,
dendTopics()
,
getPrototype()
,
jaccardTopics()
,
mergeTopics()
Examples
res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, seeds = 1:4,
id = "myComputation", K = 7:10, alpha = 1, eta = 0.01, num.iterations = 20)
res
getJob(res)
getID(res)
getLDA(res, 4)
LDARep(docs = reuters_docs, vocab = reuters_vocab,
K = 10, num.iterations = 100, pm.backend = "socket")
Similarity/Stability of multiple sets of Objects using Clustering with Local Pruning
Description
The function SCLOP
calculates the S-CLOP value for the best possible
local pruning state of a dendrogram from dendTopics
.
The function pruneSCLOP
supplies the corresponding pruning state itself.
To get all pairwise S-CLOP scores of a set of LDA runs, the function SCLOP.pairwise
can be used. It returns a matrix of the pairwise S-CLOP scores.
All three functions use the function disparitySum
to calculate the
least possible sum of disparities (on the best possible local pruning state)
on a given dendrogram.
Usage
SCLOP(dend)
disparitySum(dend)
SCLOP.pairwise(sims)
Arguments
dend |
[ |
sims |
[ |
Details
For one specific cluster g and R LDA runs the disparity is calculated by

U(g) := \frac{1}{R} \sum_{r=1}^R \vert t_r^{(g)} - 1 \vert \cdot \sum_{r=1}^R t_r^{(g)},

where \bm t^{(g)} = (t_1^{(g)}, ..., t_R^{(g)})^T contains the number of topics
that belong to the different LDA runs and that occur in cluster g.

The function disparitySum returns the least possible sum of disparities
U_{\Sigma}(G^*) for the best possible pruning state G^*, with
U_{\Sigma}(G) = \sum_{g \in G} U(g) \to \min.

The highest possible value for U_{\Sigma}(G^*) is bounded by

U_{\Sigma,\textsf{max}} := \sum_{g \in \tilde{G}} U(g) = N \cdot \frac{R-1}{R},

where \tilde{G} denotes the corresponding worst-case pruning state. This
worst-case scenario is used for normalizing the S-CLOP scores.

The function SCLOP then calculates the value

\textsf{S-CLOP}(G^*) := 1 - \frac{1}{U_{\Sigma,\textsf{max}}} \cdot \sum_{g \in G^*} U(g) ~\in [0,1],

where \sum_{g \in G^*} U(g) = U_{\Sigma}(G^*).
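As an illustration of the disparity formula above, the following is a minimal sketch (not the package's internal implementation) computing U(g) for a single hypothetical cluster, given the vector of topic counts per LDA run:
disparity = function(t_g) {
  R = length(t_g)                  # number of LDA runs
  sum(abs(t_g - 1)) / R * sum(t_g)
}
disparity(c(1, 1, 1, 1))   # perfectly mixed cluster: disparity 0
disparity(c(2, 0, 1, 1))   # 0.5 * 4 = 2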
Value
SCLOP: [0,1] value specifying the S-CLOP for the best possible local pruning state of the given dendrogram.
disparitySum [numeric(1)]: Least possible sum of disparities on the given dendrogram.
SCLOP.pairwise [symmetrical named matrix]: All pairwise S-CLOP scores of the given LDA runs.
See Also
Other SCLOP functions:
pruneSCLOP()
Other workflow functions:
LDARep()
,
dendTopics()
,
getPrototype()
,
jaccardTopics()
,
mergeTopics()
Examples
res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 10, num.iterations = 30)
topics = mergeTopics(res, vocab = reuters_vocab)
jacc = jaccardTopics(topics, atLeast = 2)
dend = dendTopics(jacc)
SCLOP(dend)
disparitySum(dend)
SCLOP.pairwise(jacc)
SCLOP.pairwise(getSimilarity(jacc))
LDABatch Constructor
Description
Constructs a LDABatch
object for given elements reg
,
job
and id
.
Usage
as.LDABatch(reg, job, id)
is.LDABatch(obj, verbose = FALSE)
Arguments
reg |
|
job |
[ |
id |
[ |
obj |
[ |
verbose |
[ |
Details
Given a Registry the function returns an LDABatch object, which can be handled
using the getter functions at getJob.
Value
[named list] with entries id for the registry's folder name,
jobs for the submitted jobs' ids and their parameter settings, and
reg for the registry itself.
See Also
Other constructor functions:
LDA()
,
as.LDARep()
Other batch functions:
LDABatch()
,
getJob()
,
mergeBatchTopics()
Examples
## Not run:
batch = LDABatch(docs = reuters_docs, vocab = reuters_vocab, K = 15, chunk.size = 20)
batch
batch2 = as.LDABatch(reg = getRegistry(batch))
batch2
head(getJob(batch2))
batch3 = as.LDABatch()
batch3
### one way of loading an existing registry ###
batchtools::loadRegistry("LDABatch")
batch = as.LDABatch()
## End(Not run)
LDARep Constructor
Description
Constructs a LDARep
object for given elements lda
,
job
and id
.
Usage
as.LDARep(...)
## Default S3 method:
as.LDARep(lda, job, id, ...)
## S3 method for class 'LDARep'
as.LDARep(x, ...)
is.LDARep(obj, verbose = FALSE)
Arguments
... |
additional arguments |
lda |
[ |
job |
[ |
id |
[ |
x |
|
obj |
[ |
verbose |
[ |
Details
Given a list of LDA objects the function returns an LDARep object, which can be
handled using the getter functions at getJob.
Value
[named list] with entries id for the computation's name,
jobs for the parameter settings and lda for the results themselves.
See Also
Other constructor functions:
LDA()
,
as.LDABatch()
Other replication functions:
LDAPrototype()
,
LDARep()
,
getJob()
,
mergeRepTopics()
Examples
res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 7, num.iterations = 20)
lda = getLDA(res)
res2 = as.LDARep(lda, id = "newName")
res2
getJob(res2)
getJob(res)
## Not run:
batch = LDABatch(docs = reuters_docs, vocab = reuters_vocab, n = 4, id = "TEMP", K = 30)
res3 = as.LDARep(batch)
res3
getJob(res3)
## End(Not run)
Pairwise Cosine Similarities
Description
Calculates the similarity of all pairwise topic combinations using the Cosine Similarity.
Usage
cosineTopics(topics, progress = TRUE, pm.backend, ncpus)
Arguments
topics |
[ |
progress |
[ |
pm.backend |
[ |
ncpus |
[ |
Details
The Cosine Similarity for two topics \bm z_{i} and \bm z_{j} is calculated by

\cos(\theta \mid \bm z_{i}, \bm z_{j}) = \frac{ \sum_{v=1}^{V}{n_{i}^{(v)} n_{j}^{(v)}} }{ \sqrt{\sum_{v=1}^{V}{\left(n_{i}^{(v)}\right)^2}} \sqrt{\sum_{v=1}^{V}{\left(n_{j}^{(v)}\right)^2}} },

where \theta denotes the angle between the corresponding count vectors
\bm z_{i} and \bm z_{j}, V is the vocabulary size and n_k^{(v)} is the count of
assignments of the v-th word to the k-th topic.
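To make the formula concrete, here is a minimal sketch for two hypothetical topic count vectors over the same vocabulary (the package computes this for all topic pairs of a merged topic matrix):
n_i = c(5, 0, 2, 1)
n_j = c(4, 1, 0, 3)
sum(n_i * n_j) / (sqrt(sum(n_i^2)) * sqrt(sum(n_j^2)))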
Value
[named list] with entries:
sims [lower triangular named matrix]: All pairwise similarities of the given topics.
wordslimit [integer]: = vocabulary size. See jaccardTopics for the original purpose.
wordsconsidered [integer]: = vocabulary size. See jaccardTopics for the original purpose.
param [named list]: With parameter type [character(1)] = "Cosine Similarity".
See Also
Other TopicSimilarity functions:
dendTopics()
,
getSimilarity()
,
jaccardTopics()
,
jsTopics()
,
rboTopics()
Examples
res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 10, num.iterations = 30)
topics = mergeTopics(res, vocab = reuters_vocab)
cosine = cosineTopics(topics)
cosine
sim = getSimilarity(cosine)
dim(sim)
Topic Dendrogram
Description
Builds a dendrogram for topics based on their pairwise similarities using the
clustering algorithm hclust.
Usage
dendTopics(sims, ind, method = "complete")
## S3 method for class 'TopicDendrogram'
plot(x, pruning, pruning.par, ...)
Arguments
sims |
[ |
ind |
[ |
method |
[ |
x |
an R object. |
pruning |
[ |
pruning.par |
[ |
... |
additional arguments. |
Details
The labels' colors are determined by the run they belong to, using
rainbow_hcl by default. Colors can be manipulated
using labels_colors. Analogously, the labels
themselves can be manipulated using labels.
For both, the function order.dendrogram is useful.
The resulting dendrogram can be plotted. In addition,
it is possible to mark a pruning state in the plot, either by color or by
separator lines (or both), by setting pruning.par. For the default values
of pruning.par call the corresponding function on any
PruningSCLOP object.
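A hedged sketch of the label manipulation described above (object names and the chosen color are only illustrative):
library(dendextend)
res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 2, K = 5, num.iterations = 20)
topics = mergeTopics(res, vocab = reuters_vocab)
dend = dendTopics(jaccardTopics(topics, atLeast = 2))
labels_colors(dend) = "darkgrey"            # overwrite the default rainbow_hcl colors
head(labels(dend)[order.dendrogram(dend)])  # label names in plotting order
plot(dend)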
Value
[dendrogram
] TopicDendrogram
object
(and dendrogram
object) of all considered topics.
See Also
Other plot functions:
pruneSCLOP()
Other TopicSimilarity functions:
cosineTopics()
,
getSimilarity()
,
jaccardTopics()
,
jsTopics()
,
rboTopics()
Other workflow functions:
LDARep()
,
SCLOP()
,
getPrototype()
,
jaccardTopics()
,
mergeTopics()
Examples
res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 10, num.iterations = 30)
topics = mergeTopics(res, vocab = reuters_vocab)
jacc = jaccardTopics(topics, atLeast = 2)
sim = getSimilarity(jacc)
dend = dendTopics(jacc)
dend2 = dendTopics(sim)
plot(dend)
plot(dendTopics(jacc, ind = c("Rep2", "Rep3")))
pruned = pruneSCLOP(dend)
plot(dend, pruning = pruned)
plot(dend, pruning = pruned, pruning.par = list(type = "color"))
plot(dend, pruning = pruned, pruning.par = list(type = "both", lty = 1, lwd = 2, col = "red"))
dend2 = dendTopics(jacc, ind = c("Rep2", "Rep3"))
plot(dend2, pruning = pruneSCLOP(dend2), pruning.par = list(lwd = 2, col = "darkgrey"))
Getter and Setter for LDARep and LDABatch
Description
Returns the job ids and their parameter set (getJob) or the (registry's)
id (getID) for an LDABatch or LDARep object.
getRegistry returns the registry itself for an LDABatch
object. getLDA returns the list of LDA objects for an
LDABatch or LDARep object. In addition, you can
specify one or more LDAs by their id(s).
setFileDir sets the registry's file directory for an
LDABatch object. This is useful if you move the registry's folder,
e.g. if you do your calculations on a batch system, but want to do your
evaluation on your desktop computer (see the sketch after the arguments list below).
Usage
getJob(x)
getID(x)
getRegistry(x)
getLDA(x, job, reduce, all)
setFileDir(x, file.dir)
Arguments
x |
|
job |
[ |
reduce |
[ |
all |
|
file.dir |
[Vector to be coerced to a |
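A hedged usage sketch of setFileDir (the registry path below is hypothetical; it assumes setFileDir updates the LDABatch object's registry in place):
## Not run:
# port a registry computed elsewhere, then point the LDABatch object to its new location
batch = as.LDABatch(reg = batchtools::loadRegistry("LDABatch"))
setFileDir(batch, file.dir = "/path/to/moved/registry")  # hypothetical path
getID(batch)
getJob(batch)
## End(Not run)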
See Also
Other getter functions:
getSCLOP()
,
getSimilarity()
,
getTopics()
Other replication functions:
LDAPrototype()
,
LDARep()
,
as.LDARep()
,
mergeRepTopics()
Other batch functions:
LDABatch()
,
as.LDABatch()
,
mergeBatchTopics()
Determine the Prototype LDA
Description
Returns the Prototype LDA of a set of LDAs. This set is given as an
LDABatch object, an LDARep object, or as a list of LDAs.
If the matrix of S-CLOP scores sclop is passed, no calculation is needed.
Usage
getPrototype(...)
## S3 method for class 'LDARep'
getPrototype(
x,
vocab,
limit.rel,
limit.abs,
atLeast,
progress = TRUE,
pm.backend,
ncpus,
keepTopics = FALSE,
keepSims = FALSE,
keepLDAs = FALSE,
sclop,
...
)
## S3 method for class 'LDABatch'
getPrototype(
x,
vocab,
limit.rel,
limit.abs,
atLeast,
progress = TRUE,
pm.backend,
ncpus,
keepTopics = FALSE,
keepSims = FALSE,
keepLDAs = FALSE,
sclop,
...
)
## Default S3 method:
getPrototype(
lda,
vocab,
id,
job,
limit.rel,
limit.abs,
atLeast,
progress = TRUE,
pm.backend,
ncpus,
keepTopics = FALSE,
keepSims = FALSE,
keepLDAs = FALSE,
sclop,
...
)
Arguments
... |
additional arguments |
x |
|
vocab |
[ |
limit.rel |
[0,1] |
limit.abs |
[ |
atLeast |
[ |
progress |
[ |
pm.backend |
[ |
ncpus |
[ |
keepTopics |
[ |
keepSims |
[ |
keepLDAs |
[ |
sclop |
[ |
lda |
[ |
id |
[ |
job |
[ |
Details
While LDAPrototype marks the overall shortcut for performing
multiple LDA runs and choosing their Prototype, getPrototype
starts at the step of determining the Prototype. The generation of multiple LDAs
has to be done before this function is used. The function is flexible enough
to be used at (at least) two steps of the analysis: after generating the
LDAs (no matter whether as LDABatch or LDARep object) or after determining
the pairwise S-CLOP values.
To save memory, a lot of interim calculations are discarded by default.
If you use parallel computation, no progress bar is shown.
For details see the details sections of the workflow functions.
Value
[named list] with entries:
id [character(1)]: See above.
protoid [character(1)]: Name (ID) of the determined Prototype LDA.
lda: List of LDA objects of the determined Prototype LDA and - if keepLDAs is TRUE - all considered LDAs.
jobs [data.table]: Parameter specifications for the LDAs.
param [named list]: Parameter specifications for limit.rel [0,1], limit.abs [integer(1)] and atLeast [integer(1)]. See above for explanation.
topics [named matrix]: Count of vocabularies (row wise) in topics (column wise).
sims [lower triangular named matrix]: All pairwise Jaccard similarities of the given topics.
wordslimit [integer]: Counts of words determined as relevant based on limit.rel and limit.abs.
wordsconsidered [integer]: Counts of considered words for similarity calculation. Could differ from wordslimit if atLeast is greater than zero.
sclop [symmetrical named matrix]: All pairwise S-CLOP scores of the given LDA runs.
See Also
Other shortcut functions:
LDAPrototype()
Other PrototypeLDA functions:
LDAPrototype()
,
getSCLOP()
Other workflow functions:
LDARep()
,
SCLOP()
,
dendTopics()
,
jaccardTopics()
,
mergeTopics()
Examples
res = LDARep(docs = reuters_docs, vocab = reuters_vocab,
n = 4, K = 10, num.iterations = 30)
topics = mergeTopics(res, vocab = reuters_vocab)
jacc = jaccardTopics(topics, atLeast = 2)
dend = dendTopics(jacc)
sclop = SCLOP.pairwise(jacc)
getPrototype(lda = getLDA(res), sclop = sclop)
proto = getPrototype(res, vocab = reuters_vocab, keepSims = TRUE,
limit.abs = 20, atLeast = 10)
proto
getPrototype(proto) # = getLDA(proto)
getConsideredWords(proto)
# > 10 if there is more than one word which is the 10-th often word (ties)
getRelevantWords(proto)
getSCLOP(proto)
Getter for PrototypeLDA
Description
Returns the corresponding element of a PrototypeLDA
object.
Usage
getSCLOP(x)
## S3 method for class 'PrototypeLDA'
getSimilarity(x)
## S3 method for class 'PrototypeLDA'
getRelevantWords(x)
## S3 method for class 'PrototypeLDA'
getConsideredWords(x)
getMergedTopics(x)
getPrototypeID(x)
## S3 method for class 'PrototypeLDA'
getLDA(x, job, reduce = TRUE, all = FALSE)
## S3 method for class 'PrototypeLDA'
getID(x)
## S3 method for class 'PrototypeLDA'
getParam(x)
## S3 method for class 'PrototypeLDA'
getJob(x)
Arguments
x |
[ |
job |
[ |
reduce |
[ |
all |
[ |
See Also
Other getter functions:
getJob()
,
getSimilarity()
,
getTopics()
Other PrototypeLDA functions:
LDAPrototype()
,
getPrototype()
Getter for TopicSimilarity
Description
Returns the corresponding element of a TopicSimilarity
object.
Usage
getSimilarity(x)
getRelevantWords(x)
getConsideredWords(x)
## S3 method for class 'TopicSimilarity'
getParam(x)
Arguments
x |
[ |
See Also
Other getter functions:
getJob()
,
getSCLOP()
,
getTopics()
Other TopicSimilarity functions:
cosineTopics()
,
dendTopics()
,
jaccardTopics()
,
jsTopics()
,
rboTopics()
Getter for LDA
Description
Returns the corresponding element of a LDA
object.
getEstimators
computes the estimators for phi
and theta
.
Usage
getTopics(x)
getAssignments(x)
getDocument_sums(x)
getDocument_expects(x)
getLog.likelihoods(x)
getParam(x)
getK(x)
getAlpha(x)
getEta(x)
getNum.iterations(x)
getEstimators(x)
Arguments
x |
[ |
Details
The estimators for phi and theta in

w_n^{(m)} \mid T_n^{(m)}, \bm\phi_k \sim \textsf{Discrete}(\bm\phi_k),
\bm\phi_k \sim \textsf{Dirichlet}(\eta),
T_n^{(m)} \mid \bm\theta_m \sim \textsf{Discrete}(\bm\theta_m),
\bm\theta_m \sim \textsf{Dirichlet}(\alpha)

are calculated, following Griffiths and Steyvers (2004), by

\hat{\phi}_{k, v} = \frac{n_k^{(v)} + \eta}{n_k + V \eta},
\hat{\theta}_{m, k} = \frac{n_k^{(m)} + \alpha}{N^{(m)} + K \alpha},

where V is the vocabulary size, K is the number of modeled topics,
n_k^{(v)} is the count of assignments of the v-th word to the k-th topic and,
analogously, n_k^{(m)} is the count of assignments of the m-th text to the
k-th topic. N^{(m)} is the total number of assigned tokens in text m and
n_k is the total number of assigned tokens to topic k.
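As a worked illustration of the estimators above (not the package's getEstimators implementation), a minimal sketch with hypothetical count vectors:
eta = 0.1; alpha = 0.5
n_k_v = c(12, 0, 3, 5)                        # counts of each word assigned to topic k
V = length(n_k_v)
phi_hat_k = (n_k_v + eta) / (sum(n_k_v) + V * eta)
n_m_k = c(7, 1, 2)                            # counts of tokens in document m assigned to each topic
K = length(n_m_k)
theta_hat_m = (n_m_k + alpha) / (sum(n_m_k) + K * alpha)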
References
Griffiths, Thomas L. and Mark Steyvers (2004). "Finding scientific topics". In: Proceedings of the National Academy of Sciences 101 (suppl 1), pp. 5228–5235, doi: 10.1073/pnas.0307752101.
See Also
Other getter functions:
getJob()
,
getSCLOP()
,
getSimilarity()
Other LDA functions:
LDABatch()
,
LDARep()
,
LDA()
Pairwise Jaccard Coefficients
Description
Calculates the similarity of all pairwise topic combinations using a modified Jaccard Coefficient.
Usage
jaccardTopics(
topics,
limit.rel,
limit.abs,
atLeast,
progress = TRUE,
pm.backend,
ncpus
)
Arguments
topics |
[ |
limit.rel |
[0,1] |
limit.abs |
[ |
atLeast |
[ |
progress |
[ |
pm.backend |
[ |
ncpus |
[ |
Details
The modified Jaccard Coefficient for two topics \bm z_{i} and \bm z_{j} is calculated by

J_m(\bm z_{i}, \bm z_{j} \mid \bm c) = \frac{\sum_{v = 1}^{V} 1_{\left\{n_{i}^{(v)} > c_i ~\wedge~ n_{j}^{(v)} > c_j\right\}}\left(n_{i}^{(v)}, n_{j}^{(v)}\right)}{\sum_{v = 1}^{V} 1_{\left\{n_{i}^{(v)} > c_i ~\vee~ n_{j}^{(v)} > c_j\right\}}\left(n_{i}^{(v)}, n_{j}^{(v)}\right)},

where V is the vocabulary size and n_k^{(v)} is the count of assignments of the
v-th word to the k-th topic. The threshold vector \bm c is determined by the
maximum threshold of the user-given lower bounds limit.rel and limit.abs. In
addition, at least atLeast words per topic are considered for calculation.
Accordingly, if fewer than atLeast words are considered relevant after applying
limit.rel and limit.abs, the atLeast most common words per topic are taken
to determine topic similarities.
The procedure of determining relevant words is executed for each topic individually.
The values wordslimit and wordsconsidered describe the number
of relevant words per topic.
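A minimal sketch of the modified Jaccard Coefficient for two hypothetical topic count vectors; the thresholds c_i and c_j stand in for the limits derived from limit.rel, limit.abs and atLeast (this is not the package's internal code):
n_i = c(10, 4, 0, 1, 6)
n_j = c(8, 0, 3, 2, 5)
c_i = 2; c_j = 2
rel_i = n_i > c_i
rel_j = n_j > c_j
sum(rel_i & rel_j) / sum(rel_i | rel_j)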
Value
[named list] with entries:
sims [lower triangular named matrix]: All pairwise Jaccard similarities of the given topics.
wordslimit [integer]: Counts of words determined as relevant based on limit.rel and limit.abs.
wordsconsidered [integer]: Counts of considered words for similarity calculation. Could differ from wordslimit if atLeast is greater than zero.
param [named list]: Parameter specifications for type [character(1)] = "Jaccard Coefficient", limit.rel [0,1], limit.abs [integer(1)] and atLeast [integer(1)]. See above for explanation.
See Also
Other TopicSimilarity functions:
cosineTopics()
,
dendTopics()
,
getSimilarity()
,
jsTopics()
,
rboTopics()
Other workflow functions:
LDARep()
,
SCLOP()
,
dendTopics()
,
getPrototype()
,
mergeTopics()
Examples
res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 10, num.iterations = 30)
topics = mergeTopics(res, vocab = reuters_vocab)
jacc = jaccardTopics(topics, atLeast = 2)
jacc
n1 = getConsideredWords(jacc)
n2 = getRelevantWords(jacc)
(n1 - n2)[n1 - n2 != 0]
sim = getSimilarity(jacc)
dim(sim)
# Comparison to Cosine and Jensen-Shannon (more interesting on large datasets)
cosine = cosineTopics(topics)
js = jsTopics(topics)
sims = list(jaccard = sim, cosine = getSimilarity(cosine), js = getSimilarity(js))
pairs(do.call(cbind, lapply(sims, as.vector)))
Pairwise Jensen-Shannon Similarities (Divergences)
Description
Calculates the similarity of all pairwise topic combinations using the Jensen-Shannon Divergence.
Usage
jsTopics(topics, epsilon = 1e-06, progress = TRUE, pm.backend, ncpus)
Arguments
topics |
[ |
epsilon |
[ |
progress |
[ |
pm.backend |
[ |
ncpus |
[ |
Details
The Jensen-Shannon Similarity for two topics \bm z_{i} and \bm z_{j} is calculated by

JS(\bm z_{i}, \bm z_{j}) = 1 - \left( KLD\left(\bm p_i, \frac{\bm p_i + \bm p_j}{2}\right) + KLD\left(\bm p_j, \frac{\bm p_i + \bm p_j}{2}\right) \right)/2
= 1 - KLD(\bm p_i, \bm p_i + \bm p_j)/2 - KLD(\bm p_j, \bm p_i + \bm p_j)/2 - \log(2),

where V is the vocabulary size, \bm p_k = \left(p_k^{(1)}, ..., p_k^{(V)}\right),
and p_k^{(v)} is the proportion of assignments of the v-th word to the k-th topic.
KLD denotes the Kullback-Leibler Divergence, calculated by

KLD(\bm p_{k}, \bm p_{\Sigma}) = \sum_{v=1}^{V} p_k^{(v)} \log{\frac{p_k^{(v)}}{p_{\Sigma}^{(v)}}}.

An epsilon is added to every n_k^{(v)}, i.e. to the counts (not the proportions)
of assignments, to ensure computability in the presence of zeros.
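A minimal sketch of the Jensen-Shannon Similarity above for two hypothetical count vectors, with epsilon added to the counts as described (not the package's vectorized implementation):
js_sim = function(n_i, n_j, epsilon = 1e-06) {
  p_i = (n_i + epsilon) / sum(n_i + epsilon)
  p_j = (n_j + epsilon) / sum(n_j + epsilon)
  m = (p_i + p_j) / 2
  kld = function(p, q) sum(p * log(p / q))
  1 - (kld(p_i, m) + kld(p_j, m)) / 2
}
js_sim(c(5, 0, 2, 1), c(4, 1, 0, 3))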
Value
[named list] with entries:
sims [lower triangular named matrix]: All pairwise similarities of the given topics.
wordslimit [integer]: = vocabulary size. See jaccardTopics for the original purpose.
wordsconsidered [integer]: = vocabulary size. See jaccardTopics for the original purpose.
param [named list]: Parameter specifications for type [character(1)] = "Jensen-Shannon Divergence" and epsilon [numeric(1)]. See above for explanation.
See Also
Other TopicSimilarity functions:
cosineTopics()
,
dendTopics()
,
getSimilarity()
,
jaccardTopics()
,
rboTopics()
Examples
res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 10, num.iterations = 30)
topics = mergeTopics(res, vocab = reuters_vocab)
js = jsTopics(topics)
js
sim = getSimilarity(js)
dim(sim)
js1 = jsTopics(topics, epsilon = 1)
sim1 = getSimilarity(js1)
summary((sim1-sim)[lower.tri(sim)])
plot(sim, sim1, xlab = "epsilon = 1e-6", ylab = "epsilon = 1")
Merge LDA Topic Matrices
Description
Collects LDA results from a given registry and merges their topic matrices for a given set of vocabularies.
Usage
mergeBatchTopics(...)
## S3 method for class 'LDABatch'
mergeBatchTopics(x, vocab, progress = TRUE, ...)
## Default S3 method:
mergeBatchTopics(vocab, reg, job, id, progress = TRUE, ...)
Arguments
... |
additional arguments |
x |
[ |
vocab |
[ |
progress |
[ |
reg |
[ |
job |
[ |
id |
[ |
Details
For details and examples see mergeTopics
.
Value
[named matrix
] with the count of vocabularies (row wise) in topics (column wise).
See Also
Other merge functions:
mergeRepTopics()
,
mergeTopics()
Other batch functions:
LDABatch()
,
as.LDABatch()
,
getJob()
Merge LDA Topic Matrices
Description
Collects LDA results from a list of replicated runs and merges their topic matrices for a given set of vocabularies.
Usage
mergeRepTopics(...)
## S3 method for class 'LDARep'
mergeRepTopics(x, vocab, progress = TRUE, ...)
## Default S3 method:
mergeRepTopics(lda, vocab, id, progress = TRUE, ...)
Arguments
... |
additional arguments |
x |
[ |
vocab |
[ |
progress |
[ |
lda |
[ |
id |
[ |
Details
For details and examples see mergeTopics
.
Value
[named matrix
] with the count of vocabularies (row wise) in topics (column wise).
See Also
Other merge functions:
mergeBatchTopics()
,
mergeTopics()
Other replication functions:
LDAPrototype()
,
LDARep()
,
as.LDARep()
,
getJob()
Merge LDA Topic Matrices
Description
Generic function, which collects LDA results and merges their topic matrices for a given set of vocabularies.
Usage
mergeTopics(x, vocab, progress = TRUE)
Arguments
x |
|
vocab |
[ |
progress |
[ |
Details
This function uses the function mergeRepTopics or
mergeBatchTopics. The topic matrices are transposed and column-bound (cbind),
so that the resulting matrix contains the counts of vocabularies/words (row wise)
in topics (column wise).
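A toy sketch of this merging step with two hypothetical topic matrices over a shared vocabulary (the package does this for all runs of an LDARep or LDABatch object):
vocab = c("oil", "price", "trade")
topics_run1 = matrix(c(3, 1, 0, 0, 2, 5), nrow = 2, byrow = TRUE,
                     dimnames = list(NULL, vocab))   # 2 topics (rows) x 3 words (columns)
topics_run2 = matrix(c(1, 1, 4), nrow = 1,
                     dimnames = list(NULL, vocab))   # 1 topic
merged = cbind(t(topics_run1), t(topics_run2))
dim(merged)   # 3 words (rows) x 3 topics (columns)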
Value
[named matrix
] with the count of vocabularies (row wise) in topics (column wise).
See Also
Other merge functions:
mergeBatchTopics()
,
mergeRepTopics()
Other workflow functions:
LDARep()
,
SCLOP()
,
dendTopics()
,
getPrototype()
,
jaccardTopics()
Examples
res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 10, num.iterations = 30)
topics = mergeTopics(res, vocab = reuters_vocab)
dim(topics)
length(reuters_vocab)
## Not run:
res = LDABatch(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 10, num.iterations = 30)
topics = mergeTopics(res, vocab = reuters_vocab)
dim(topics)
length(reuters_vocab)
## End(Not run)
Local Pruning State of Topic Dendrograms
Description
The function SCLOP
calculates the S-CLOP value for the best possible
local pruning state of a dendrogram from dendTopics
.
The function pruneSCLOP
supplies the corresponding pruning state itself.
Usage
pruneSCLOP(dend)
## S3 method for class 'PruningSCLOP'
plot(x, dend, pruning.par, ...)
pruning.par(pruning)
Arguments
dend |
[ |
x |
an R object. |
pruning.par |
[ |
... |
additional arguments. |
pruning |
[ |
Details
For details of computing the S-CLOP values see SCLOP
.
For details and examples of plotting the pruning state see dendTopics
.
Value
[list of dendrograms
]
PruningSCLOP
object specifying the best possible
local pruning state.
See Also
Other plot functions:
dendTopics()
Other SCLOP functions:
SCLOP()
Pairwise RBO Similarities
Description
Calculates the similarity of all pairwise topic combinations using the rank-biased overlap (RBO) Similarity.
Usage
rboTopics(topics, k, p, progress = TRUE, pm.backend, ncpus)
Arguments
topics |
[ |
k |
[ |
p |
[0,1] |
progress |
[ |
pm.backend |
[ |
ncpus |
[ |
Details
The RBO Similarity for two topics \bm z_{i} and \bm z_{j} is calculated by

RBO(\bm z_{i}, \bm z_{j} \mid k, p) = 2p^k\frac{\left|Z_{i}^{(k)} \cap Z_{j}^{(k)}\right|}{\left|Z_{i}^{(k)}\right| + \left|Z_{j}^{(k)}\right|} + \frac{1-p}{p} \sum_{d=1}^k 2 p^d\frac{\left|Z_{i}^{(d)} \cap Z_{j}^{(d)}\right|}{\left|Z_{i}^{(d)}\right| + \left|Z_{j}^{(d)}\right|},

where Z_{i}^{(d)} is the vocabulary set of topic \bm z_{i} down to
rank d. Ties in ranks are resolved by taking the minimum.
The value wordsconsidered describes the number of words per topic
ranked at rank k or above.
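A minimal sketch of the RBO Similarity above for two hypothetical ranked word lists with depth k and weight p; ties and the count-based ranking used by the package are ignored here:
rbo = function(words_i, words_j, k, p) {
  overlap = function(d) {
    a = words_i[seq_len(d)]; b = words_j[seq_len(d)]
    2 * length(intersect(a, b)) / (length(a) + length(b))
  }
  p^k * overlap(k) + (1 - p) / p * sum(sapply(seq_len(k), function(d) p^d * overlap(d)))
}
rbo(c("oil", "price", "trade", "barrel"), c("oil", "trade", "market", "price"), k = 3, p = 0.9)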
Value
[named list] with entries:
sims [lower triangular named matrix]: All pairwise similarities of the given topics.
wordslimit [integer]: = vocabulary size. See jaccardTopics for the original purpose.
wordsconsidered [integer]: = vocabulary size. See jaccardTopics for the original purpose.
param [named list]: With parameter type [character(1)] = "RBO Similarity", k [integer(1)] and p [0,1]. See above for explanation.
References
Webber, William, Alistair Moffat and Justin Zobel (2010). "A similarity measure for indefinite rankings". In: ACM Transactions on Information Systems 28(4), pp. 20:1–20:38, doi: 10.1145/1852102.1852106, URL https://doi.acm.org/10.1145/1852102.1852106.
See Also
Other TopicSimilarity functions:
cosineTopics()
,
dendTopics()
,
getSimilarity()
,
jaccardTopics()
,
jsTopics()
Examples
res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 10, num.iterations = 30)
topics = mergeTopics(res, vocab = reuters_vocab)
rbo = rboTopics(topics, k = 12, p = 0.9)
rbo
sim = getSimilarity(rbo)
dim(sim)
A Snippet of the Reuters Dataset
Description
Example dataset from Reuters consisting of 91 articles. It can be used to familiarize yourself with the functions offered by this package.
Usage
data(reuters_docs)
data(reuters_vocab)
Format
reuters_docs is a list of 91 documents prepared by LDAprep.
reuters_vocab is a character vector of length 2141.
Source
temporarily unavailable: http://ronaldo.cs.tcd.ie/esslli07/data/reuters21578-xml/
References
Lewis, David (1997). Reuters-21578 Text Categorization Collection Distribution 1.0. http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
Luz, Saturnino. XML-encoded version of Reuters-21578. http://ronaldo.cs.tcd.ie/esslli07/data/reuters21578-xml/ (temporarily unavailable)