Score distribution by model
Compare accuracy and total cost
This app displays evaluation results comparing how well various LLMs generate R code.
We used the ellmer package to create connections to various models and the vitals package to evaluate model performance.
Models were evaluated on "are" (An R Eval), a dataset of challenging R coding problems and their solutions that is included in the vitals package.
Each solution a model produced was scored by Claude 3.7 Sonnet as Incorrect, Partially Correct, or Correct.
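For readers who want to reproduce one of these runs, the workflow described above can be sketched roughly as follows. This is a minimal sketch, not the app's actual code. The model identifier strings and the choice of `chat_openai()` as the model under test are assumptions made for illustration. Running it requires API keys for the providers involved, and it makes paid API calls.

```r
library(ellmer)
library(vitals)

# Grader: the text says solutions were scored by Claude 3.7 Sonnet.
# The exact model ID string below is an assumption.
grader <- chat_anthropic(model = "claude-3-7-sonnet-latest")

# Model under test: a hypothetical choice for illustration only.
solver_chat <- chat_openai(model = "gpt-4o")

are_task <- Task$new(
  dataset = are,                    # the "An R Eval" dataset shipped with vitals
  solver = generate(solver_chat),   # send each problem to the model under test
  scorer = model_graded_qa(
    partial_credit = TRUE,          # allow Partially Correct as a grade
    scorer_chat = grader
  ),
  name = "An R Eval"
)

# Runs the solver and the scorer over every sample in the dataset.
are_task$eval()

# One row per sample, including the assigned score.
are_task$get_samples()
```

Repeating this with a different chat object in `solver_chat` gives one set of scores per model, which is the kind of data the score distribution plot summarizes.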