Model Performance on R Code Generation
Model Performance vs. Cost
Model Pricing and Token Usage
This app displays evaluation results comparing how well various LLMs generate R code.
We used the ellmer package to connect to each model and the vitals package to evaluate its performance.
Models were evaluated on the are dataset (An R Eval), which contains challenging R coding problems and their solutions. The are dataset ships with the vitals package.
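As a rough illustration of how ellmer and vitals fit together, the sketch below wraps a single evaluation run in a function. It is not the code behind this app. The function name run_are_eval, the default model strings, and the use of partial credit are assumptions made for the example, and running it requires both packages plus an Anthropic API key.

```r
# Hypothetical helper: evaluate one model on the `are` dataset.
# Assumes the ellmer and vitals packages are installed and that
# ANTHROPIC_API_KEY is set in the environment.
run_are_eval <- function(solver_model = "claude-3-7-sonnet-latest",
                         grader_model = "claude-3-7-sonnet-latest") {
  # The chat object that answers each R coding problem
  solver_chat <- ellmer::chat_anthropic(model = solver_model)

  # The chat object that grades each answer against the reference solution
  grader_chat <- ellmer::chat_anthropic(model = grader_model)

  task <- vitals::Task$new(
    dataset = vitals::are,
    solver = vitals::generate(solver_chat),
    # partial_credit = TRUE allows a middle grade between incorrect and correct
    scorer = vitals::model_graded_qa(
      partial_credit = TRUE,
      scorer_chat = grader_chat
    )
  )

  # Sends every sample to the solver, then to the grader (makes API calls)
  task$eval()
  task
}
```

Calling run_are_eval() once per model, with a different solver chat each time, would give one graded task per model to compare.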
Each model’s solution was scored by Claude 3.7 Sonnet as either Incorrect, Partially Correct, or Correct.
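To make the three-level grading concrete, here is a small base R sketch that turns a vector of grades into counts and percentages. The grades below are invented for illustration and are not results from this evaluation; the variable names are likewise hypothetical.

```r
# Made-up grades for one hypothetical model (not real results)
grades <- c("Correct", "Incorrect", "Partially Correct", "Correct", "Correct")

# Order the levels from worst to best so tables and plots sort sensibly
grade_levels <- c("Incorrect", "Partially Correct", "Correct")
grades <- factor(grades, levels = grade_levels, ordered = TRUE)

# Count how many solutions fall in each grade
counts <- table(grades)

# Convert the counts to percentages of all graded solutions
pct <- round(100 * prop.table(counts), 1)

print(counts)
print(pct)
```

Keeping the grades as an ordered factor means a grade with zero solutions still appears in the table, which keeps per-model summaries aligned when they are compared side by side.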