How well do LLMs generate R code?

Score distribution by model

Compare accuracy and total cost

Models: Claude Opus 4.5, Claude Haiku 4.5, Claude Sonnet 4.5, Claude Opus 4.1, Claude Sonnet 4, Claude Sonnet 4 (No Thinking), Gemini 3, Gemini 2.5 Pro, GPT-5.1 Codex, GPT-5.1, GPT-5, GPT-5 mini, GPT-5 nano, gpt-oss-120b, gpt-oss-20b, GPT-4.1, o4-mini, o3-mini

About This Evaluation

This app displays evaluation results comparing how well various LLMs generate R code.

We used the ellmer package to connect to each model and the vitals package to evaluate its performance.
Models were evaluated on An R Eval, a set of challenging R coding problems and their solutions that ships with the vitals package as the dataset "are".
Each model's solution was scored by Claude 3.7 Sonnet as Incorrect, Partially Correct, or Correct.
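
The setup described above might look roughly like the following sketch. It is not the app's actual code: the helper name run_are_eval is hypothetical, the model IDs in the usage comment are illustrative, and running an evaluation requires API credentials for the providers involved.

```r
library(ellmer)
library(vitals)

# Hypothetical helper (not part of ellmer or vitals): evaluates one model
# on the `are` dataset and has a second model grade each response.
#   solver_chat: an ellmer chat object for the model under evaluation
#   grader_chat: an ellmer chat object for the grading model
run_are_eval <- function(solver_chat, grader_chat) {
  task <- Task$new(
    dataset = are,                     # "An R Eval", bundled with vitals
    solver = generate(solver_chat),    # model writes an answer per problem
    scorer = model_graded_qa(
      partial_credit = TRUE,           # allow a "partially correct" grade
      scorer_chat = grader_chat
    ),
    name = "An R Eval"
  )
  task$eval()                          # runs the solver, then the scorer
  invisible(task)
}

# Example call (assumed model IDs; needs ANTHROPIC_API_KEY to be set):
# task <- run_are_eval(
#   solver_chat = chat_anthropic(model = "claude-sonnet-4-20250514"),
#   grader_chat = chat_anthropic(model = "claude-3-7-sonnet-latest")
# )
```

Passing the grader as a separate chat object keeps the judge fixed while the solver varies, which is what makes scores comparable across models.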