- `embed_llamar()` — high-level embedding provider compatible with `ragnar_store_create(embed = ...)`. Supports partial application (lazy model loading), direct calls returning a matrix, and data.frame input. L2 normalization is on by default.
- `llama_embed_batch()` — embed multiple texts in one call. Uses true pooled batch decode (`llama_get_embeddings_seq`) for embedding models, with automatic fallback to sequential last-token decode for generative models.
- `llama_get_embeddings_ith()` — get the embedding vector for the i-th token (supports negative indexing).
- `llama_get_embeddings_seq()` — get the pooled embedding for a sequence ID.
- `llama_new_context()` gains an `embedding` parameter. When `TRUE`, it sets `cparams.embeddings = true` and disables causal attention at creation time. `llama_embed_batch()` uses this flag to choose the optimal code path.
- `llama_load_model()` gains a `devices` parameter for explicit backend selection. Accepts device names from `llama_backend_devices()`, type keywords (`"cpu"`, `"gpu"`), or numeric indices. Multiple devices enable a multi-GPU split.
- `llama_backend_devices()` — list all available compute devices (CPU, GPU, iGPU, accelerator) as a data.frame.
- `llama_numa_init()` — NUMA optimization with strategies: disabled, distribute, isolate, numactl, mirror.
- `llama_time_us()` — current time in microseconds.
- `llama_token_to_piece()` — convert a single token ID to its text piece.
- `llama_encode()` — run the encoder pass for encoder-decoder models (e.g. T5, BART).
- `llama_batch_init()` / `llama_batch_free()` — low-level batch allocation and release with an automatic GC finalizer.
- Fixed: `extern "C"` block wrapping `#include <R.h>` in `r_llama_compat.h` (C++ templates cannot appear inside `extern "C"` linkage).
- Fixed: clash between the `Rinternals.h` macro `#define length(x)` and `std::codecvt::length()` in `r_llama_interface.cpp`; C++ standard headers are now included before R headers, followed by `#undef length`.
- `llama_token_to_piece`, `llama_batch_init`, `llama_batch_free`, and `llama_encode`, including GPU context variants.
- `llama_hf_list()` — list GGUF files in a Hugging Face repository.
- `llama_hf_download()` — download a GGUF model with local caching. Supports exact filename, glob pattern, or Ollama-style tag selection.
- `llama_load_model_hf()` — download and load a model in one step.
- `llama_hf_cache_dir()` — get the cache directory path.
- `llama_hf_cache_info()` — inspect cached models.
- `llama_hf_cache_clear()` — clear the model cache.
- Added `jsonlite` and `utils` to `Imports`.
- Added `configure.win` and `Makevars.win.in`; GPU features are enabled when `ggmlR` is built with GPU support.
- Added `exit()` / `_Exit()` overrides to `r_llama_compat.h` to prevent process termination (redirects to `Rf_error()`).
- Now requires `ggmlR` (>= 0.5.4).
- Added `\value` tags to all exported functions, describing the return class, structure, and meaning.
- Replaced `\dontrun{}` with `\donttest{}` in all examples.
- Added a copyright-holder (`cph`) role for the bundled ‘llama.cpp’ code.
- `NEWS.md` is now included in the package tarball (removed from `.Rbuildignore`).
- Added `cran-comments.md` to `.Rbuildignore`.

Full LLM inference cycle is now available from R:
- `llama_load_model()` / `llama_free_model()` — load and free GGUF models.
- `llama_new_context()` / `llama_free_context()` — context management.
- `llama_tokenize()` / `llama_detokenize()` — tokenization and detokenization.
- `llama_generate()` — text generation with `temperature`, `top_k`, `top_p`, and greedy support.
- `llama_embeddings()` — embedding extraction.
- `llama_model_info()` — model metadata.

Model and context are wrapped as `ExternalPtr` objects with automatic GC finalizers. The context holds a reference to the model `ExternalPtr`, preventing premature collection.
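A minimal end-to-end sketch of this cycle. The model path is a placeholder, and argument positions or names not stated in this file are assumptions, not documented API:

```r
# Sketch under assumptions: the path is a placeholder; devices,
# temperature, top_k, and top_p are named in this changelog, but the
# exact call shapes are illustrative guesses.
model <- llama_load_model("path/to/model.gguf", devices = "gpu")
ctx <- llama_new_context(model)

# Inspect tokenization of a prompt.
tokens <- llama_tokenize(ctx, "The capital of France is")
llama_detokenize(ctx, tokens)

# One-call generation: tokenize -> decode loop -> detokenize, all in C++.
out <- llama_generate(ctx, "The capital of France is",
                      temperature = 0.8, top_k = 40, top_p = 0.95)
cat(out)

# Explicit cleanup; the GC finalizers would otherwise reclaim both handles.
llama_free_context(ctx)
llama_free_model(model)
```

Because the context holds a reference to the model's `ExternalPtr`, the model cannot be collected while the context is alive, so the cleanup order above is not critical.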
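The embedding path might be used along these lines. The `embedding` flag comes from this file; the `model` argument name for `embed_llamar()` and the placeholder paths are assumptions:

```r
# Pooled-embedding context: sets cparams.embeddings = true and
# disables causal attention at creation time.
model <- llama_load_model("path/to/embedding-model.gguf")  # placeholder path
ctx <- llama_new_context(model, embedding = TRUE)

# Batch embedding: returns a numeric matrix, one row per input text;
# embed_llamar() applies L2 normalization by default.
emb <- llama_embed_batch(ctx, c("first document", "second document"))

# Partial application for ragnar: the model is loaded lazily on first call.
embed_fn <- embed_llamar(model = "path/to/embedding-model.gguf")
# store <- ragnar::ragnar_store_create("my_store", embed = embed_fn)
```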
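The Hugging Face helpers compose naturally; in this sketch, the repository id, filenames, and tag are all hypothetical:

```r
repo <- "someuser/some-model-GGUF"   # hypothetical repository id

llama_hf_list(repo)                  # list GGUF files in the repo

# The three selection styles described above (illustrative patterns):
llama_hf_download(repo, "model-Q4_K_M.gguf")  # exact filename
llama_hf_download(repo, "*Q4_K_M.gguf")       # glob pattern
llama_hf_download(repo, "q4_k_m")             # Ollama-style tag

# Download (with local caching) and load in one step; inspect the cache.
model <- llama_load_model_hf(repo)
llama_hf_cache_info()
# llama_hf_cache_clear()               # drop cached models when done
```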
llama_generate() runs the full pipeline in a single C++
call: prompt tokenization → encode → autoregressive decode loop with a
sampler chain → detokenization of generated tokens.
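Given a context `ctx`, the stages of that pipeline can also be driven one step at a time from R. Argument order in these calls is an assumption:

```r
# Sketch: walk the pipeline by hand instead of via llama_generate().
tokens <- llama_tokenize(ctx, "Hello, world")   # prompt -> token ids
pieces <- vapply(tokens, function(t)
  llama_token_to_piece(ctx, t), character(1))   # each id -> its text piece
paste(pieces, collapse = "")                    # reassemble manually...
llama_detokenize(ctx, tokens)                   # ...or detokenize in one call
```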
Tests: 19 assertions across 7 test blocks, all passing.
- Links against `libggml.a` from the `ggmlR` package.
- `ggml_build_forward_select` replaced with simplified branch selection.