Please check the latest news (change log) and keep this package updated.
- New function BERT_remove(): Remove models from the local cache.
- New functions fill_mask() and fill_mask_check(): These functions are only for technical checks (i.e., checking the raw results of the fill-mask pipeline). Normal users should usually use FMAT_run().
- New pattern.special argument for FMAT_run(): Regular expression patterns (matching model names) for special model cases that are uncased or require a special prefix character in certain situations:
  - prefix.u2581: adds the prefix "\u2581" for all mask words
  - prefix.u0120: adds the prefix "\u0120" for only non-starting mask words
- Improved set_cache_folder(), BERT_download(), BERT_info(), and BERT_info_date().
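As a usage sketch for the cache-management functions above (the model names are purely illustrative, and the exact argument lists are assumed from the descriptions in this change log):

```r
library(FMAT)

models <- c("bert-base-uncased", "bert-base-cased")  # illustrative model names

BERT_download(models)   # download to the HuggingFace local cache (needs Internet)
BERT_info(models)       # model information, cached under /.info/
BERT_info_date(models)  # initial commit dates from HuggingFace, cached under /.date/

BERT_remove("bert-base-cased")  # remove a model from the local cache
```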
- Model information from BERT_info() and model initial commit dates scraped from HuggingFace by BERT_info_date() will be saved in subfolders of the local cache: /.info/ and /.date/, respectively.
- Deprecated FMAT_load().
- library(FMAT) now sets these environment variables on load:
  Sys.setenv("HF_HUB_DISABLE_SYMLINKS_WARNING" = "1")
  Sys.setenv("TF_ENABLE_ONEDNN_OPTS" = "0")
  Sys.setenv("KMP_DUPLICATE_LIB_OK" = "TRUE")
  Sys.setenv("OMP_NUM_THREADS" = "1")
- New function set_cache_folder(): Set (change) the HuggingFace cache folder temporarily.
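For example, to point FMAT at a different cache location for the current session only (the path below is a hypothetical example):

```r
library(FMAT)

# Temporarily change the HuggingFace cache folder for this R session;
# "D:/huggingface_cache" is a hypothetical path, not a default.
set_cache_folder("D:/huggingface_cache")
```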
- New function BERT_info_date(): Scrape the initial commit date of BERT models from HuggingFace.
- Improved BERT_download() and BERT_info().
- Only BERT_download() connects to the Internet, while all the other functions run in an offline way.
- New function BERT_info().
- New add.tokens and add.method arguments for BERT_vocab() and FMAT_run(): An experimental functionality to add new tokens (e.g., out-of-vocabulary words, compound words, or even phrases) as [MASK] options. Validation is still needed for this novel practice (one of my ongoing projects), so currently please use it only at your own risk, waiting until the publication of my validation work.
- FMAT_load() now imports local model files only, without automatically downloading models. Users must first use BERT_download() to download models.
- Deprecated FMAT_load(): Better to use FMAT_run() directly.
- New functions BERT_vocab() and ICC_models().
- Improved summary.fmat(), FMAT_query(), and FMAT_run() (significantly faster because it can now simultaneously estimate all [MASK] options for each unique query sentence, with running time depending only on the number of unique queries, not on the number of [MASK] options).
- If the reticulate package version is ≥ 1.36.1, then FMAT should be updated to ≥ 2024.4. Otherwise, out-of-vocabulary [MASK] words may not be identified and marked. FMAT_run() now directly uses the model vocabulary and token IDs to match [MASK] words. To check whether a [MASK] word is in a model's vocabulary, please use BERT_vocab().
- New function BERT_download() (downloading models to the local cache folder "%USERPROFILE%/.cache/huggingface") to differentiate it from FMAT_load() (loading saved models from the local cache). But indeed FMAT_load() can also download models silently if they have not been downloaded.
- New gpu argument (see Guidance for GPU Acceleration) in FMAT_run() to allow for specifying an NVIDIA GPU device on which the fill-mask pipeline will be allocated. GPU performs roughly 3x faster than CPU for the fill-mask pipeline. By default, FMAT_run() automatically detects and uses any available GPU if a CUDA-supported Python torch package is installed (otherwise, it uses the CPU).
- Improved FMAT_run().
- Improved BERT_download(), FMAT_load(), and FMAT_run().
- Deprecated parallel in FMAT_run(): FMAT_run(model.names, data, gpu=TRUE) is the fastest.
- Deprecated progress in FMAT_run().
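A minimal end-to-end sketch of the workflow implied by these entries. The model name, query text, and FMAT_query() construction below are illustrative assumptions (check the package documentation for the exact query syntax):

```r
library(FMAT)

model <- "bert-base-uncased"  # illustrative model name
BERT_download(model)          # download once (Internet needed); later runs are offline

# Check that the intended [MASK] words are in the model vocabulary
BERT_vocab(model, c("doctor", "nurse"))

# Hypothetical query; the template/placeholder syntax is assumed here
query <- FMAT_query(
  "[MASK] is a {TARGET}.",
  MASK = .(Male = "He", Female = "She"),
  TARGET = .(Occupation = c("doctor", "nurse"))
)

# gpu = TRUE uses an available NVIDIA GPU (requires CUDA-supported Python torch);
# otherwise FMAT_run() falls back to CPU
results <- FMAT_run(model, query, gpu = TRUE)
summary(results)
```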