| Type: | Package |
| Title: | Import Texts from Files in the 'Alceste' Format Using the 'tm' Text Mining Framework |
| Version: | 1.1.2 |
| Date: | 2025-02-27 |
| Imports: | NLP, tm (≥ 0.6) |
| Suggests: | stringi |
| Description: | Provides a 'tm' Source to create corpora from a corpus prepared in the format used by the 'Alceste' application (i.e. a single text file with inline meta-data). It is able to import both text contents and meta-data (starred) variables. |
| License: | GPL-2 | GPL-3 [expanded from: GPL (≥ 2)] |
| URL: | https://github.com/nalimilan/R.TeMiS |
| BugReports: | https://github.com/nalimilan/R.TeMiS/issues |
| NeedsCompilation: | no |
| Packaged: | 2025-02-27 18:19:51 UTC; milan |
| Author: | Milan Bouchet-Valat [aut, cre] |
| Maintainer: | Milan Bouchet-Valat <nalimilan@club.fr> |
| Repository: | CRAN |
| Date/Publication: | 2025-02-28 09:50:02 UTC |
A plug-in for the tm text mining framework to import corpora from Alceste files
Description
This package provides a tm Source to create corpora from files in the format used by the Alceste application.
Details
Typical usage is to create a corpus from an Alceste file
prepared manually (here called myAlcesteCorpus.txt).
Frequently, it is necessary to specify the encoding of the texts
via AlcesteSource's encoding argument.
# Load the tm framework and this plug-in
library(tm)
library(tm.plugin.alceste)
# Import corpus
source <- AlcesteSource("myAlcesteCorpus.txt")
corpus <- Corpus(source)
# See how many articles were imported
corpus
# See the contents of the first article and its meta-data
inspect(corpus[1])
meta(corpus[[1]])
See AlcesteSource for more details and real examples.
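Since the encoding often has to be given explicitly, the following is a minimal sketch of passing it through the encoding argument. The sample texts, the variable names (author, year) and the temporary file are made up for illustration, and the header layout ("****" followed by starred variables) is an assumption based on the format description in this manual.

```r
library(tm)
library(tm.plugin.alceste)

# Hypothetical two-text sample; the starred variables are illustrative only
sample_lines <- c(
  "**** *author_dupont *year_2010",
  "Un texte en fran\u00e7ais, avec des caract\u00e8res accentu\u00e9s.",
  "**** *author_martin *year_2012",
  "Second text."
)

# Write the sample as UTF-8 bytes so the declared encoding matches the file
path <- tempfile(fileext = ".txt")
writeLines(enc2utf8(sample_lines), path, useBytes = TRUE)

# Declare the encoding explicitly instead of relying on encoding = "auto"
src <- AlcesteSource(path, encoding = "UTF-8")
corpus <- Corpus(src)

# Check the imported texts and the meta-data of the first one
corpus
meta(corpus[[1]])
```

If accented characters come out garbled, the declared encoding most likely does not match the one the file was saved in.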
Author(s)
Milan Bouchet-Valat <nalimilan@club.fr>
References
https://image-zafar.com/Logicieluk.html
Alceste Source
Description
Construct a source for an input containing a set of texts saved in the Alceste format in a single text file.
Usage
AlcesteSource(x, encoding = "auto")
Arguments
x |
Either a character identifying the file or a connection. |
encoding |
A character string: if non-empty declares the encoding
used when reading the file, so the character data can be
re-encoded. See the ‘Encoding’ section of the help for
|
Details
Several texts are saved in a single Alceste-formatted file, separated
by lines starting with “***” or digits, followed by starred
variables (see links below). These variables are set as document
meta-data that can be accessed via the meta function.
Currently, “theme” lines starting with “-*” are ignored.
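The splitting described above can be illustrated with a short base R sketch. This is not the package's actual implementation: the sample lines, the regular expressions and the helper parse_text are assumptions chosen only to mirror the description (header lines starting with asterisks or digits, starred variables of the form *name_value, theme lines starting with "-*" dropped).

```r
# Hypothetical sample laid out as described above (not taken from the package)
lines <- c(
  "**** *author_smith *year_2010",
  "First text, line one.",
  "-*theme_intro",
  "First text, line two.",
  "0002 *author_jones *year_2012",
  "Second text."
)

# A header line starts with asterisks or digits, followed by a starred variable
is_header <- grepl("^(\\*{3,}|[[:digit:]]+) +\\*", lines)

# "Theme" lines start with "-*" and are dropped
is_theme <- grepl("^-\\*", lines)

# Group every remaining line with the most recent header line
groups <- split(lines[!is_theme], cumsum(is_header)[!is_theme])

# Hypothetical helper: turn one group of lines into meta-data plus content
parse_text <- function(chunk) {
  # Starred variables look like *name_value on the header line
  starred <- regmatches(chunk[1], gregexpr("\\*[^ *]+", chunk[1]))[[1]]
  pairs <- strsplit(sub("^\\*", "", starred), "_", fixed = TRUE)
  # Everything after the first underscore is the value
  vars <- vapply(pairs, function(p) paste(p[-1], collapse = "_"), character(1))
  names(vars) <- vapply(pairs, `[`, character(1), 1)
  list(meta = vars, content = chunk[-1])
}

texts <- lapply(unname(groups), parse_text)
str(texts)
```

In the package itself, the equivalent meta-data ends up attached to each document and is read back with meta rather than from a plain list.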
Value
An object of class AlcesteSource which extends the class
Source representing a set of articles from Alceste.
Author(s)
Milan Bouchet-Valat
See Also
https://image-zafar.com/sites/default/files/telechargements/formatage_alceste.pdf (in French) about the Alceste format
readAlceste for the function actually parsing
individual articles.
getSources to list available sources.
Examples
library(tm)
file <- system.file("texts", "alceste_test.txt",
package = "tm.plugin.alceste")
corpus <- Corpus(AlcesteSource(file))
# See the contents of the documents
inspect(corpus)
# See meta-data associated with first article
meta(corpus[[1]])
Read in a text in the Alceste format
Description
Read in a text in the Alceste format using starred variables.
Usage
readAlceste(elem, language, id)
Arguments
elem |
A |
language |
A |
id |
A |
Value
A PlainTextDocument with the contents of the article and the available meta-data set.
Author(s)
Milan Bouchet-Valat
See Also
getReaders to list available reader functions.