This app displays evaluation results comparing how well various LLMs generate R code.
We used the ellmer package to create connections to various models and the vitals package to evaluate model performance.
Models were evaluated on An R Eval (the are dataset, included in the vitals package), which contains challenging R coding problems and their solutions.
Each model’s solution was scored by Claude 4.6 Sonnet as Incorrect, Partially Correct, or Correct.
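
For readers who want to see roughly how such an evaluation fits together, here is a minimal sketch using ellmer and vitals. It is not the code behind this app: the model identifiers are hypothetical placeholders, and using chat_anthropic() for the model under test is an assumption made only for illustration. Running it requires API credentials for the chosen provider.

```r
library(ellmer)
library(vitals)

# Hypothetical placeholders: substitute real model identifiers.
solver_model <- "<model-under-test>"
grader_model <- "<grader-model>"

# ellmer creates the connection to each model.
solver_chat <- chat_anthropic(model = solver_model)
grader_chat <- chat_anthropic(model = grader_model)

# vitals combines a dataset, a solver, and a scorer into a Task.
# partial_credit = TRUE lets the grader assign Partially Correct
# in addition to Correct and Incorrect.
tsk <- Task$new(
  dataset = are,
  solver = generate(solver_chat),
  scorer = model_graded_qa(partial_credit = TRUE, scorer_chat = grader_chat),
  name = "An R Eval"
)

# Generate a solution for every problem, then grade each one.
tsk$eval()
```

Swapping in a different ellmer chat constructor for solver_chat would, under the same assumptions, evaluate a model from another provider against the same dataset and grader.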
Costs for the open-weight models (Qwen, Gemma, GPT-OSS) are listed as $0. These models can be downloaded and run locally for free, but you may incur costs if you use a hosted inference service. For this analysis, the open-weight models were run via OpenRouter, with the following costs: