Score distribution by model
Compare accuracy and total cost

About this evaluation

This app displays evaluation results comparing how well various LLMs generate R code.

Methodology

  • We used the ellmer package to create connections to various models and the vitals package to evaluate model performance.

  • Models were evaluated on An R Eval, a dataset of challenging R coding problems and their solutions. The dataset is included in the vitals package under the name "are".

  • Each model’s solution was scored by Claude 4.6 Sonnet as Incorrect, Partially Correct, or Correct.

  • Costs for the open-weight models (Qwen, Gemma, GPT-OSS) are listed as $0 because these models can be downloaded and run locally for free. A hosted inference service may still charge for them. For this analysis, the open-weight models were run via OpenRouter at the following costs:

    • GPT-OSS-20B: $0.06
    • Gemma 4 26B-A4B: $0.02
    • Qwen 3.5-35B-A3B: $0.30
    • Qwen 3.6-35B-A3B: $0.34
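
For readers who want to run a similar evaluation, the methodology above can be sketched roughly as follows. This is a minimal sketch, not the code behind this app: the helper name run_are_eval is hypothetical, the model IDs in the usage comments are placeholders, and the use of partial_credit = TRUE is an assumption based on the three-level grading described above.

```r
library(ellmer)
library(vitals)

# Hypothetical helper (not taken from this app's source): evaluate one model
# on the `are` dataset that ships with vitals.
# `solver_chat` and `scorer_chat` are ellmer chat objects.
run_are_eval <- function(solver_chat, scorer_chat) {
  tsk <- Task$new(
    dataset = are,
    # The model under test generates a solution for each problem.
    solver = generate(solver_chat),
    # A second model grades each solution; partial_credit = TRUE lets the
    # grader return "Partially Correct" in addition to Correct and Incorrect.
    scorer = model_graded_qa(
      partial_credit = TRUE,
      scorer_chat = scorer_chat
    )
  )
  tsk$eval()
  tsk
}

# Usage (requires API keys; the model IDs below are placeholders, not the
# ones used for this app):
# solver <- chat_openrouter(model = "<provider/model-id>")
# grader <- chat_anthropic(model = "<claude-model-id>")
# tsk <- run_are_eval(solver, grader)
# tsk$get_samples()
```

Wrapping the task in a function keeps the grader fixed while the solver changes, so every model is scored by the same judge.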