- Adding the remove_non_ascii parameter in textEmbed().

# text 1.6.1

- Renaming textTrainExamples() to textExamples() and improving the filter_word function.
- Fixes in textTrainExamples().
- Fixes in textTopics().
- Adding save_output = "no_plot" in textTrainRegression() for "logistic" and "multinomial" to reduce the model size of saved objects.
- Checking that word_embeddings meet the model requirements in the textPredict() function. This is controlled via the new check_matching_word_embeddings parameter, which validates compatibility of model type, layers, and aggregation settings (see the sketch below).
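A minimal sketch of the new check, assuming a model object from textTrainRegression(); trained_model and the embedding-object layout are assumptions for illustration:

``` r
library(text)
# Embed texts (example data shipped with the text package).
embeddings <- textEmbed(Language_based_assessment_data_8["satisfactiontexts"])

# With check_matching_word_embeddings = TRUE, textPredict() validates
# that the embeddings' model type, layers, and aggregation settings
# match what the trained model requires before predicting.
predictions <- textPredict(
  model_info = trained_model,  # assumed: object returned by textTrainRegression()
  word_embeddings = embeddings$texts,
  check_matching_word_embeddings = TRUE
)
```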
- Updating the textDimName() function, allowing users to specify or change the name suffix for word embedding dimensions.
- Updating the dim_names = FALSE behavior in the textDimName() function to also ignore model-required dimension suffixes. Now includes clearer and more informative warnings when dimension mismatches occur (see the sketch below).
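A minimal sketch of the suffix handling, assuming textDimName() takes the embedding object first; the string value for dim_names is an illustrative assumption:

``` r
embeddings <- textEmbed(Language_based_assessment_data_8["harmonytexts"])

# Give the embedding dimensions a custom suffix; dim_names = FALSE
# would instead ignore suffixes (now also model-required ones).
renamed <- textDimName(embeddings$texts, dim_names = "harmony")
```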
- Updating the rsample function validation_split() to initial_validation_split(). However, this changes some results in textTrainRegression() and textTrainRandomForest().
- Updating textLBAM() to take a construct_start parameter.
- Updating textTrainRegression() to reduce saved model sizes.
- Fixing layer selection in textEmbedRawLayers() (when using the default -2, layer 11 was selected even for large models). This was never a problem in textEmbed().
- Adding dlatk_method to the textEmbed() function.
- Adding cv_method = "group_cv" in the textTrainRegression() function (see the sketch below).
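A minimal sketch of grouped cross-validation, using the x/y training interface named elsewhere in this file; how the grouping variable is supplied is not shown here:

``` r
embeddings <- textEmbed(Language_based_assessment_data_8["satisfactiontexts"])

# Grouped CV keeps observations from the same group (e.g., the same
# participant) within one fold, avoiding leakage across folds.
model_group_cv <- textTrainRegression(
  x = embeddings$texts$satisfactiontexts,
  y = Language_based_assessment_data_8$swlstotal,
  cv_method = "group_cv"
)
```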
- Adding plot_n_word_random and legend_number_colour in textPlot.
- Fixing the nltk warning when running the functions requiring Python.
- Fixes in the textProjection() function.
- Fixes in textTrainExamples().
- Fixing highest_parameter and lowest_parameter when parameters are tied.
- Fixes in textPredict(), textAssess() and textClassify().
- Fixes in textLBAM().
- Adding textClean() (removing common personal information).
- textLBAM() returns the library as a dataframe.
- textPredict() detects model_type.
- Adding the textFindNonASCII() function and a feature in textEmbed() to warn about and clean non-ASCII characters. This may change results slightly.
- Removing the type parameter in textPredict() and instead giving both probability and class.
- textClassify() is now called textClassifyPipe().
- textPredict() is now called textPredictR().
- textAssess(), textPredict() and textClassify() work the same, now taking the parameter method with the string "text" to use textPredict(), and "huggingface" to use textClassifyPipe() (see the sketch below).
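A sketch of the shared front end; the texts come from the package's example data, and the model specification is omitted here (an assumption of this illustration):

``` r
# method = "text" routes the call through textPredict()/textPredictR();
# method = "huggingface" routes it through textClassifyPipe().
out_text <- textAssess(
  texts = Language_based_assessment_data_8["satisfactiontexts"],
  method = "text"
)
out_hf <- textAssess(
  texts = Language_based_assessment_data_8["satisfactiontexts"],
  method = "huggingface"
)
```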
- Adding hg_gated, hg_token, and trust_remote_code.
- Renaming return_incorrect_results to force_return_results.
- Setting function_to_apply = NULL instead of "none"; this is to mimic the huggingface default.
- Removing textWordPrediction since it is under development and not tested.
- Updates to textTrainN() including subsets sampling (new: default change from random to subsets), use_same_penalty_mixture (new: default change from FALSE to TRUE) and std_err (new output); see the sketch after this list.
- Updates to textTrainPlot().
- Updates to textPredict() functionality.
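A minimal sketch of textTrainN() with the new defaults; sample_percents is an assumed argument name for the case-subset sizes:

``` r
embeddings <- textEmbed(Language_based_assessment_data_8["satisfactiontexts"])

# Evaluate prediction accuracy across increasing numbers of cases,
# drawing subsets (new default) rather than random samples, and
# reusing the same penalty/mixture across runs (new default TRUE).
# The output now also includes std_err.
results_n <- textTrainN(
  x = embeddings$texts$satisfactiontexts,
  y = Language_based_assessment_data_8$hilstotal,
  sample_percents = c(25, 50, 75, 100),  # assumed argument name
  use_same_penalty_mixture = TRUE
)
```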
- Adding BERTopic-based topic modeling around textTopics() (see the sketch below):
  - textTopics() trains a BERTopic model with different modules and returns the model, data, and topic-document distributions based on c-tf-idf.
  - textTopicsTest() can perform multiple tests (correlation, t-test, regression) between a BERTopic model from textTopics() and data.
  - textTopicsWordcloud() can plot word clouds of topics tested with textTopicsTest().
  - textTopicsTree() prints out a tree structure of the hierarchical topic structure.
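A sketch of the workflow across these four functions; the argument names (data, variable_name, model, pred_var, test) are assumptions for illustration, so consult each function's documentation for the exact interface:

``` r
topic_model <- textTopics(
  data = Language_based_assessment_data_8,
  variable_name = "harmonytexts"  # assumed: text column to model
)
topic_tests <- textTopicsTest(
  model = topic_model,
  pred_var = "hilstotal"          # assumed: variable to test topics against
)
textTopicsWordcloud(model = topic_model, test = topic_tests)
textTopicsTree(topic_model)       # print the hierarchical topic structure
```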
- textEmbed() is now fully embedding one column at a time and reducing word_types for each column. This can break some code and produce different results in plots where word_types are based on several embedded columns.
- textTrainN() and textTrainNPlot() evaluate prediction accuracy across the number of cases.
- textTrainRegression() and textTrainRandomForest() now take a tibble as input in strata.
- Updates to textTrainRegression().
- textPredictTest() can handle AUC.
- textEmbed() is faster (thanks to faster handling of aggregating layers).
- Adding a sort parameter in textEmbedRawLayers().
- Possibility to use the GPU on macOS with M1 and M2 chips using device = "mps" in textEmbed() (see the sketch below).
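A minimal sketch; the model is an illustrative choice, while device = "mps" is the new option named above:

``` r
# Run embeddings on the Apple-silicon GPU via Metal Performance Shaders.
embeddings_mps <- textEmbed(
  texts = Language_based_assessment_data_8["harmonytexts"],
  model = "bert-base-uncased",
  device = "mps"
)
```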
- textFineTune() is implemented as an experimental function.
- max_length implemented in textTranslate().
- textEmbedReduce() implemented.
- Fixing textEmbed(decontextualize = TRUE), which gave an error.
- Deprecating textSimilarityTest() for version 1.0 because it needs more evaluations.
- Fixing the handling of the model's number of layers, so that layers = -2 works in textEmbed().
- Adding set_verbosity.
- Changing sorting_xs_and_x_append from Dim to Dim0 when renaming x_appended variables.
- Renaming first to append_first and making it an option in textTrainRegression() and textTrainRandomForest().
- In textEmbed(), layers = 11:12 is now second_to_last.
- The textEmbedRawLayers default is now second_to_last.
- In textEmbedLayerAggregation(), layers = 11:12 is now layers = "all".
- In textEmbed() and textEmbedRawLayers(), x is now called texts.
- textEmbedLayerAggregation() now uses layers = "all", aggregation_from_layers_to_tokens, and aggregation_from_tokens_to_texts (see the sketch below).
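A sketch of the renamed interface; the aggregation values and the first argument name of textEmbedLayerAggregation() are assumptions for illustration:

``` r
raw_layers <- textEmbedRawLayers(
  texts = Language_based_assessment_data_8["harmonytexts"],
  layers = "second_to_last"             # new default
)
aggregated <- textEmbedLayerAggregation(
  word_embeddings_layers = raw_layers,  # assumed argument name
  layers = "all",                       # new default
  aggregation_from_layers_to_tokens = "concatenate",
  aggregation_from_tokens_to_texts = "mean"
)
```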
- Adding textDistanceNorm() and textDistanceMatrix().
- textDistance() can compute cosine distance (see the sketch below).
- textModelLayers() provides the number of layers for a given model.
- Adding max_token_to_sentence in textEmbed().
- aggregate_layers is now called aggregation_from_layers_to_tokens.
- aggregate_tokens is now called aggregation_from_tokens_to_texts.
- single_word_embeddings is now called word_types_embeddings.
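A minimal sketch of a cosine distance between two embedded texts; the argument names besides method are assumptions for illustration:

``` r
embeddings <- textEmbed(Language_based_assessment_data_8["harmonytexts"])

# Cosine distance between the embeddings of the first two texts.
d <- textDistance(
  x = embeddings$texts$harmonytexts[1, ],
  y = embeddings$texts$harmonytexts[2, ],
  method = "cosine"
)
```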
- textEmbedLayersOutput() is now called textEmbedRawLayers().
- Adding textDimName().
- In textEmbed(): dim_name = TRUE.
- In textEmbed(): single_context_embeddings = TRUE.
- In textEmbed(): device = "gpu".
- Adding explore_words in textPlot().
- Adding x_append_target in the textPredict() function.
- Updates to textClassify(), textGeneration(), textNER(), textSum(), textQA(), and textTranslate().
- Renaming x_add to x_append across functions.
- Adding set_seed to language analysis tasks.
- Changes to x in training and prediction.
- textPredict() now takes word_embeddings and x_append (not new_data).
- textClassify() (under development).
- textGeneration() (under development).
- textNER() (under development).
- textSum() (under development).
- textQA() (under development).
- textTranslate() (under development).
- Adding textSentiment(), from huggingface transformers models.
- Updates to textEmbed(), textTrainRegression(), textTrainRandomForest() and textProjection().
- Adding dim_names to set unique dimension names in textEmbed() and textEmbedStatic().
- Adding the textPredictAll() function that can take several models, word embeddings, and variables as input to provide multiple outputs.
- Updating textTrain() functions with x_append.
- textPredict-related functions are located in their own file.
- Adding a text_version number.
- textEmbedLayersOutput and textEmbed can provide single_context_embeddings.
- Removing the return_tokens option from textEmbed (since it is only relevant for textEmbedLayersOutput).
- Removing $single_we when decontexts is FALSE.
- Logistic regression is the default for classification in textTrain.
- Adding model_max_length in textEmbed().
- textModels() shows downloaded models.
- textModelsRemove() deletes specified models.
- Fixing textSimilarityTest() when an uneven number of cases is tested.
- Adding the textDistance() function with distance measures.
- Updates to textSimilarity().
- Using textSimilarity() in textSimilarityTest(), textProjection() and textCentrality() for plotting.
- textTrainRegression() concatenates word embeddings when provided with a list of several word embeddings (see the sketch below).
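A minimal sketch of the list interface described in the entry above; the example data and outcome ship with the package:

``` r
embeddings <- textEmbed(
  Language_based_assessment_data_8[c("harmonytexts", "satisfactiontexts")]
)

# Passing a list of word-embedding objects makes textTrainRegression()
# concatenate them before fitting.
model_both <- textTrainRegression(
  x = list(
    embeddings$texts$harmonytexts,
    embeddings$texts$satisfactiontexts
  ),
  y = Language_based_assessment_data_8$swlstotal
)
```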
- Updating the example data word_embeddings_4$singlewords_we.
- In textCentrality(), words to be plotted are selected with word_data1_all$extremes_all_x >= 1 (rather than == 1).
- textSimilarityMatrix() computes semantic similarity among all combinations in a given word embedding.
- textDescriptives() gets options to remove NA and compute total scores.
- Adding textDescriptives().
- Fixes in textrpp_initiate().
- Tokenization is made with NLTK from Python.
- Adding textWordPredictions() (which has a trial period/is not fully developed and might be removed in future versions); p-values are not yet implemented.
- Adding textPlot() for objects from both textProjection() and textWordPredictions().
- textrpp_initiate() runs automatically in library(text) when the default environment exists.
- Fixes in textSimilarityTest().
- Changing from stringr to stringi (and removing tokenizer) as imported package.
- textrpp_install() installs a conda environment with text's required Python packages (see the sketch below).
- textrpp_install_virtualenv() installs a virtual environment with text's required Python packages.
- textrpp_initialize() initializes the installed environment.
- textrpp_uninstall() uninstalls the conda environment.
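A typical first-time setup using the functions named above, run with their defaults:

``` r
library(text)
textrpp_install()     # one-time: conda environment with required Python packages
textrpp_initialize()  # per session: initialize the installed environment
```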
- textEmbed() and textEmbedLayersOutput() support the use of a GPU using the device setting.
- remove_words makes it possible to remove specific words from textProjectionPlot().
- In textProjection() and textProjectionPlot() it is now possible to add points of the aggregated word embeddings in the plot.
- In textProjection() it is now possible to manually add words to the plot in order to explore them in the word embedding space.
- In textProjection() it is possible to add color or remove words that are more frequent on the opposite "side" of its dot product projection.
- In textProjection() with split == quartile, the comparison distribution is now based on the quartile data (rather than the data for the mean).
- Fixing textEmbed() with decontexts = TRUE.
- textSimilarityTest() no longer gives an error when using method = unpaired with an unequal number of participants in each group.
- Adding the textPredictTest() function to significance-test correlations of different models.
# text 0.9.11

This version is now on CRAN.

### New Features

- Adding an option to deselect the step_centre and step_scale in training.
- The cross-validation method in textTrainRegression() and textTrainRandomForest() has two options: cv_folds and validation_split. (0.9.02)
- Better handling of NA in step_naomit in training.
- The DistilBert model works. (0.9.03)

### Bug Fixes

- textProjectionPlot() plots words extreme in more than just one feature (i.e., words are now plotted that satisfy, for example, both plot_n_word_extreme and plot_n_word_frequency). (0.9.01)
- textTrainRegression() and textTrainRandomForest() also have a function that selects the maximum evaluation-measure results (before, only the minimum was selected, which, e.g., was correct for rmse but not for r). (0.9.02)
- Fixing id_nr in training and predict by using workflows. (0.9.02)