WDL Model

library(rwig) |> suppressPackageStartupMessages()

Let’s say we have some documents as character vectors, and we want to discover the underlying topics. This task is called “topic modeling”, and Latent Dirichlet Allocation (LDA) is probably the most famous topic model. Here, we consider the Wasserstein Dictionary Learning (WDL) model instead.

# a very simple example
sentences <- c("this is a sentence", "this is another one", "yet another sentence")
wdl_fit <- wdl(sentences, specs = wdl_specs(
  wdl_control = list(num_topics = 2),
  word2vec_control = list(min_count = 1)
))
#> Preprocessing the data...
#> Running tokenizer on the sentences...
#> Running Word2Vec for the embeddings and distance matrix...
#> `method` is automatically switched to "log"
#> Running WDL in CUDA mode...
#> This might take a while depending on the problem size...
#> Initializing WDL model with 5 vocabs, 3 docs, and 2 topics...
#> Training WDL model with 2 epochs, 1 batches
#> Epoch 1 of 2, batch 1 of 1
#>   batch time: 0.01 sec
#> Epoch 2 of 2, batch 1 of 1
#>   batch time: 0.00 sec
#> Inference on the dataset
#>   Inference: 3 of 3 docs done

wdl_fit
#> WDL model topics:
#> 
#> Topic 1:
#> sentenc     yet   anoth     one    </s> 
#>   0.482   0.201   0.197   0.071   0.048 
#> 
#> Topic 2:
#> sentenc     one   anoth    </s>     yet 
#>   0.419   0.255   0.176   0.112   0.039

We can see from the output that each topic is a probability distribution over the tokens (words) in the vocabulary. If you want to access the topics directly, you can do this:

wdl_fit$topics
#>             topic1     topic2
#> one     0.07082487 0.25479149
#> yet     0.20149596 0.03862156
#> anoth   0.19719629 0.17623633
#> sentenc 0.48236498 0.41868685
#> </s>    0.04811789 0.11166377
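Since the topics are stored as a plain numeric matrix with token rownames, you can post-process them with base R. The sketch below rebuilds the matrix by hand from the printed values (so it runs standalone, without a fitted model object) and pulls out the top tokens per topic:

```r
# Rebuild the topic matrix shown above, so this snippet runs standalone
topics <- matrix(
  c(0.07082487, 0.25479149,
    0.20149596, 0.03862156,
    0.19719629, 0.17623633,
    0.48236498, 0.41868685,
    0.04811789, 0.11166377),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("one", "yet", "anoth", "sentenc", "</s>"),
    c("topic1", "topic2")
  )
)

# Each column is a probability distribution over the vocabulary,
# so the columns sum to 1 (up to rounding)
colSums(topics)

# Top 3 tokens per topic
apply(topics, 2, function(p) head(sort(p, decreasing = TRUE), 3))
```

In a real session you would use `wdl_fit$topics` in place of the hand-built matrix.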

You can also obtain the weights of the topics used to reconstruct the input data:

wdl_fit$weights
#>             [,1]      [,2]      [,3]
#> topic1 0.5644257 0.6153338 0.4026389
#> topic2 0.4355743 0.3846662 0.5973611
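The weight matrix has one column per document, and each column lies on the probability simplex. Note that in WDL (Schmitz et al., 2018) a document is reconstructed as the Wasserstein barycenter of the topics under these weights, not as a plain linear mixture. A quick standalone check on the printed values:

```r
# Weights from the output above: one column per document
weights <- matrix(
  c(0.5644257, 0.6153338, 0.4026389,
    0.4355743, 0.3846662, 0.5973611),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("topic1", "topic2"), NULL)
)

# Each document's topic weights sum to 1
colSums(weights)

# Document 3 leans most heavily on topic 2
which.max(weights[, 3])
```

As before, in a real session you would inspect `wdl_fit$weights` directly.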

See Also

See also vignette("specs").

References

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993–1022.

Peyré, G., & Cuturi, M. (2019). Computational Optimal Transport: With Applications to Data Science. Foundations and Trends® in Machine Learning, 11(5–6), 355–607. https://doi.org/10.1561/2200000073

Schmitz, M. A., Heitz, M., Bonneel, N., Ngolè, F., Coeurjolly, D., Cuturi, M., Peyré, G., & Starck, J.-L. (2018). Wasserstein dictionary learning: Optimal transport-based unsupervised nonlinear dictionary learning. SIAM Journal on Imaging Sciences, 11(1), 643–678. https://doi.org/10.1137/17M1140431

Xie, F. (2025). Deriving the Gradients of Some Popular Optimal Transport Algorithms (No. arXiv:2504.08722). arXiv. https://doi.org/10.48550/arXiv.2504.08722