How well do LLMs generate R code?

Score distribution by model

Compare accuracy and total cost

Claude Opus 4.7

Claude Opus 4.6

Claude Haiku 4.5

Claude Sonnet 4.5

Gemini 3.5 Flash

Gemini 3.1 Pro

Gemma 4 26B-A4B

Gemini 3 Flash

Gemini 3 Pro

GPT-5.5

GPT-5.4 mini

GPT-5.4 nano

GPT-5.4

GPT-OSS 20B

Qwen 3.6 35B-A3B

Qwen 3.5 35B-A3B

About this evaluation

This app displays evaluation results comparing how well various LLMs generate R code.

We used the ellmer package to create connections to various models and the vitals package to evaluate model performance.
Models were evaluated on are (An R Eval), a dataset of challenging R coding problems and their solutions that is included in the vitals package.
Each model’s solution was scored by Claude Sonnet 4.6 as Incorrect, Partially Correct, or Correct.
Costs for the open-weight models (Qwen, Gemma, GPT-OSS) are listed as $0. These models can be downloaded and run locally for free, but you may incur costs if you use a hosted inference service. For this analysis, the open-weight models were run via OpenRouter at the following costs:
- GPT-OSS 20B: $0.06
- Gemma 4 26B-A4B: $0.02
- Qwen 3.5 35B-A3B: $0.30
- Qwen 3.6 35B-A3B: $0.34
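
For readers who want to try something similar, the setup described above (an ellmer chat as the solver, the are dataset from vitals, and a model-graded scorer with partial credit) could look roughly like the sketch below. This is not the exact code behind this app: the helper name run_are_eval is hypothetical, and the model identifiers in the usage comments are placeholders to be replaced with real ones.

```r
library(ellmer)
library(vitals)

# Hypothetical helper: build and run An R Eval for a single solver model.
# `solver_chat` and `scorer_chat` are ellmer chat objects supplied by the caller.
run_are_eval <- function(solver_chat, scorer_chat) {
  tsk <- Task$new(
    dataset = are,                    # the are dataset that ships with vitals
    solver  = generate(solver_chat),  # the model under evaluation writes each answer
    scorer  = model_graded_qa(
      partial_credit = TRUE,          # allow a Partially Correct grade
      scorer_chat    = scorer_chat    # the model that grades each answer
    ),
    name = "An R Eval"
  )
  tsk$eval()
  tsk
}

# Example usage (needs API credentials; model IDs below are placeholders):
# solver <- chat_anthropic(model = "<solver-model-id>")
# scorer <- chat_anthropic(model = "<scorer-model-id>")
# tsk <- run_are_eval(solver, scorer)
# tsk$get_samples()  # one row per sample, including its score
```

Taking the chat objects as arguments keeps the task definition the same across providers, so comparing models only requires passing in a different solver chat.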