---
title: "Getting Started with pairwiseLLM"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started with pairwiseLLM}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = TRUE
)
library(pairwiseLLM)
library(dplyr)
```

# 1. Introduction

`pairwiseLLM` provides a unified workflow for generating and analyzing **pairwise comparisons of writing quality** using LLM APIs (OpenAI, Anthropic, Gemini, Together) and local models via Ollama.

A typical workflow:

1. Select writing samples
2. Construct pairwise comparison sets
3. Submit comparisons to an LLM (live or batch API)
4. Parse model outputs
5. Fit Bradley–Terry or Elo models to obtain latent writing-quality scores

For prompt evaluation and positional-bias diagnostics, see:

* [`vignette("prompt-template-bias")`](https://shmercer.github.io/pairwiseLLM/articles/prompt-template-bias.html)

For advanced batch-processing workflows, see:

* [`vignette("advanced-batch-workflows")`](https://shmercer.github.io/pairwiseLLM/articles/advanced-batch-workflows.html)

---

# 2. Setting API Keys

`pairwiseLLM` reads provider keys **only from environment variables**, never from R options or global variables.

| Provider | Environment Variable |
|----------|----------------------|
| [OpenAI](https://openai.com/api/) | OPENAI_API_KEY |
| [Anthropic](https://console.anthropic.com/) | ANTHROPIC_API_KEY |
| [Gemini](https://aistudio.google.com/) | GEMINI_API_KEY |
| [Together](https://www.together.ai/) | TOGETHER_API_KEY |

You should put these in your `~/.Renviron`:

```
OPENAI_API_KEY="sk-..."
ANTHROPIC_API_KEY="..."
GEMINI_API_KEY="..."
TOGETHER_API_KEY="..."
```

Check which keys are available:

```
library(pairwiseLLM)
check_llm_api_keys()
#> All known LLM API keys are set: OPENAI_API_KEY, ANTHROPIC_API_KEY, GEMINI_API_KEY, TOGETHER_API_KEY.
#> # A tibble: 4 × 4
#>   backend   service       env_var           has_key
#> 1 openai    OpenAI        OPENAI_API_KEY    TRUE
#> 2 anthropic Anthropic     ANTHROPIC_API_KEY TRUE
#> 3 gemini    Google Gemini GEMINI_API_KEY    TRUE
#> 4 together  Together.ai   TOGETHER_API_KEY  TRUE
```

[Ollama](https://ollama.com/) runs locally and does not require an API key; it only requires that the Ollama server is running.

---

# 3. Example Writing Data

The package ships with 20 authentic student writing samples:

```{r}
data("example_writing_samples", package = "pairwiseLLM")

dplyr::slice_head(example_writing_samples, n = 3)
```

Each sample has:

- `ID`
- `text`

---

# 4. Constructing Pairwise Comparisons

Create all unordered pairs:

```{r}
pairs <- example_writing_samples |>
  make_pairs()

dplyr::slice_head(pairs, n = 5)
```

Sample a subset of pairs:

```{r}
pairs_small <- sample_pairs(pairs, n_pairs = 10, seed = 123)
```

Randomize SAMPLE_1 / SAMPLE_2 order:

```{r}
pairs_small <- randomize_pair_order(pairs_small, seed = 99)
```

---

# 5. Traits and Prompt Templates

## 5.1 Using a built-in trait

```{r}
td <- trait_description("overall_quality")
td
```

Or define your own:

```{r}
td_custom <- trait_description(
  custom_name = "Clarity",
  custom_description = "How clearly and effectively ideas are expressed."
)
```

## 5.2 Using or customizing prompt templates

Load the default prompt template:

```{r}
tmpl <- set_prompt_template()
cat(substr(tmpl, 1, 300))
```

Every template must contain these placeholders (a minimal custom template is sketched at the end of this section):

- `{TRAIT_NAME}`
- `{TRAIT_DESCRIPTION}`
- `{SAMPLE_1}`
- `{SAMPLE_2}`

Load a template from file:

```{r, eval=FALSE}
set_prompt_template(file = "my_template.txt")
```
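For illustration, here is a minimal sketch of a custom template. The wording is hypothetical; it simply shows all four required placeholders being written to a file and loaded with `set_prompt_template(file = ...)`. A production template should also instruct the model to report its decision in the same machine-readable form as the default template, so the package can parse the output.

```{r, eval=FALSE}
# Hypothetical template text: illustrative wording only, but it contains
# all four required placeholders.
my_template <- "You are comparing two writing samples on {TRAIT_NAME}.
{TRAIT_DESCRIPTION}

Sample 1:
{SAMPLE_1}

Sample 2:
{SAMPLE_2}

Decide which sample is better on this trait."

# Write the template to a file, then load it like any file-based template.
tmpl_file <- file.path(tempdir(), "my_template.txt")
writeLines(my_template, tmpl_file)
tmpl_custom <- set_prompt_template(file = tmpl_file)
```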
---

# 6. Live Pairwise Comparisons

The unified wrapper works for **OpenAI, Anthropic, Gemini, Together, and Ollama**.

```{r, eval=FALSE}
res_live <- submit_llm_pairs(
  pairs = pairs_small,
  backend = "openai",  # also "anthropic", "gemini", "together", "ollama"
  model = "gpt-4o",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl
)
```

Preview results:

```{r, eval=FALSE}
dplyr::slice_head(res_live, n = 5)
```

Each row includes:

- `pair_id`
- `sample1_id`, `sample2_id`
- the parsed decision tag, giving `better_sample` and `better_id`
- (optionally) the raw model output

---

# 7. Preparing Data for BT or Elo Modeling

Convert the LLM output to a three-column BT dataset:

```{r, eval=FALSE}
# res_live: output from submit_llm_pairs()
bt_data <- build_bt_data(res_live)

dplyr::slice_head(bt_data, n = 5)
```

and/or a dataset for Elo modeling:

```{r, eval=FALSE}
# res_live: output from submit_llm_pairs()
elo_data <- build_elo_data(res_live)
```

---

# 8. Bradley–Terry Modeling

Fit the model:

```{r, eval=FALSE}
bt_fit <- fit_bt_model(bt_data)
```

Summarize the results:

```{r, eval=FALSE}
summarize_bt_fit(bt_fit)
```

The output includes:

- latent θ ability scores
- standard errors
- reliability (sirt engine)

---

# 9. Elo Modeling

```{r, eval=FALSE}
elo_fit <- fit_elo_model(elo_data, runs = 5)
elo_fit
```

Outputs:

- Elo ratings for each sample
- unweighted and weighted reliability
- trial counts

---

# 10. Batch APIs (Large Jobs)

## 10.1 Submit a batch

```{r, eval=FALSE}
batch <- llm_submit_pairs_batch(
  backend = "openai",
  model = "gpt-4o",
  pairs = pairs_small,
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl
)
```

## 10.2 Download results

```{r, eval=FALSE}
res_batch <- llm_download_batch_results(batch)
head(res_batch)
```

---

# 11. Backend-Specific Tools

Most users can rely on the unified interface, but backend-specific helpers are also available.

## 11.1 OpenAI

- `submit_openai_pairs_live()`
- `build_openai_batch_requests()`
- `run_openai_batch_pipeline()`
- `parse_openai_batch_output()`

## 11.2 Anthropic

- `submit_anthropic_pairs_live()`
- `build_anthropic_batch_requests()`
- `run_anthropic_batch_pipeline()`
- `parse_anthropic_batch_output()`

## 11.3 Google Gemini

- `submit_gemini_pairs_live()`
- `build_gemini_batch_requests()`
- `run_gemini_batch_pipeline()`
- `parse_gemini_batch_output()`

## 11.4 Together.ai (live only)

- `together_compare_pair_live()`
- `submit_together_pairs_live()`

## 11.5 Ollama (local, live only)

- `ollama_compare_pair_live()`
- `submit_ollama_pairs_live()`

---

# 12. Troubleshooting

### Missing API keys

```{r}
check_llm_api_keys()
```

### Chain-of-thought leakage

Use the default template or set `include_thoughts = FALSE`.

### Timeouts

Use the batch APIs for jobs with more than 40 pairs.

### Positional bias

Use `compute_reverse_consistency()` + `check_positional_bias()` (see [`vignette("prompt-template-bias")`](https://shmercer.github.io/pairwiseLLM/articles/prompt-template-bias.html) for a full example).

---

# 13. Citation

> Mercer, S. (2025). *Getting started with pairwiseLLM* (Version 1.0.0) [R package vignette]. In *pairwiseLLM: Pairwise Comparison Tools for Large Language Model-Based Writing Evaluation*. https://shmercer.github.io/pairwiseLLM/
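---

# 14. Appendix: Full Workflow Sketch

To close, here is a consolidated sketch of the workflow from Sections 3–8, chaining only functions shown above: build and sample pairs, submit them live to one backend, and fit a Bradley–Terry model. The backend, model, seeds, and number of pairs are illustrative choices, and the code assumes a valid `OPENAI_API_KEY`.

```{r, eval=FALSE}
library(pairwiseLLM)

# 1. Example data and a sampled, order-randomized pairwise comparison set
data("example_writing_samples", package = "pairwiseLLM")

pairs_small <- example_writing_samples |>
  make_pairs() |>
  sample_pairs(n_pairs = 10, seed = 123) |>
  randomize_pair_order(seed = 99)

# 2. Trait and prompt template
td   <- trait_description("overall_quality")
tmpl <- set_prompt_template()

# 3. Live comparisons (requires OPENAI_API_KEY)
res_live <- submit_llm_pairs(
  pairs = pairs_small,
  backend = "openai",
  model = "gpt-4o",
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl
)

# 4. Bradley–Terry scores
bt_data <- build_bt_data(res_live)
bt_fit  <- fit_bt_model(bt_data)
summarize_bt_fit(bt_fit)
```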