Test, compare, and improve model outputs.
Evaluations help you verify that model responses meet your expectations for style, accuracy, and safety. This playbook shows how to build simple evals using the Luntrex API and your own test set.
When to use evals
- Switching models and validating quality before a rollout.
- Testing prompt changes against a representative sample set.
- Tracking regressions in accuracy, tone, or safety over time.
Step 1: Define a task + rubric
Create a clear objective and a scoring rubric. Keep it simple at first.
{
"task": "Categorize support tickets into Hardware, Software, or Other",
"rubric": "Output must be exactly one of: Hardware, Software, Other"
}
Step 2: Build a test set
Prepare a JSONL file with input + expected output.
{ "ticket_text": "My monitor won't turn on", "expected_label": "Hardware" }
{ "ticket_text": "Can't log into the app", "expected_label": "Software" }
{ "ticket_text": "Best restaurants in Dhaka?", "expected_label": "Other" }
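Once saved (here we assume the file name `tickets.jsonl`, which you can change), the test set can be loaded with a short Python helper:

```python
import json

def load_test_set(path):
    """Read a JSONL test set: one JSON object per non-empty line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Write the three sample cases from above, then read them back.
sample = [
    {"ticket_text": "My monitor won't turn on", "expected_label": "Hardware"},
    {"ticket_text": "Can't log into the app", "expected_label": "Software"},
    {"ticket_text": "Best restaurants in Dhaka?", "expected_label": "Other"},
]
with open("tickets.jsonl", "w") as f:
    for case in sample:
        f.write(json.dumps(case) + "\n")

cases = load_test_set("tickets.jsonl")
```

JSONL keeps each case on its own line, so corrupt rows fail individually instead of invalidating the whole file.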
Step 3: Run your eval with Luntrex
Loop through your dataset and send each test case to Luntrex using the unified API.
curl https://luntrex.com/api/v1/chat/completions \
-H "Authorization: Bearer YOUR_LUNTREX_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-4.1",
"messages": [
{ "role": "system", "content": "Categorize the ticket into Hardware, Software, or Other." },
{ "role": "user", "content": "My monitor won'\''t turn on." }
]
}'
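The same request can be issued from a Python loop. This is a minimal sketch using only the standard library: the endpoint, model name, and message shape mirror the curl example above, while `normalize_label`, `classify`, and `run_eval` are helper names we introduce here, and how you supply the API key is up to you.

```python
import json
import urllib.request

VALID_LABELS = {"Hardware", "Software", "Other"}

def normalize_label(raw):
    """Map raw model output to one of the rubric labels, or None if it strays."""
    cleaned = raw.strip().strip(".").capitalize()
    return cleaned if cleaned in VALID_LABELS else None

def classify(ticket_text, api_key):
    """Send one test case through the Luntrex chat completions endpoint."""
    payload = json.dumps({
        "model": "openai/gpt-4.1",
        "messages": [
            {"role": "system", "content": "Categorize the ticket into Hardware, Software, or Other."},
            {"role": "user", "content": ticket_text},
        ],
    }).encode()
    req = urllib.request.Request(
        "https://luntrex.com/api/v1/chat/completions",
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = json.load(resp)
    # Assumes an OpenAI-style response shape, as in the curl example.
    return normalize_label(body["choices"][0]["message"]["content"])

def run_eval(cases, api_key):
    """Return (ticket_text, expected, predicted) tuples for later analysis."""
    return [
        (c["ticket_text"], c["expected_label"], classify(c["ticket_text"], api_key))
        for c in cases
    ]
```

Normalizing the raw completion before comparing it keeps trivial formatting drift (trailing periods, casing) from being counted as a failure.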
Analyze results
Compare the model output to your expected label. Track pass/fail rates and examples.
- Start with 20-50 samples, then expand.
- Log failures to refine prompts or switch models.
- Run weekly to detect regressions.
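The pass/fail bookkeeping above can be sketched as a small scoring helper that takes (input, expected, predicted) tuples and returns the accuracy plus the failing examples to log:

```python
def score(results):
    """results: list of (ticket_text, expected, predicted) tuples."""
    failures = [r for r in results if r[1] != r[2]]
    accuracy = 1 - len(failures) / len(results)
    return accuracy, failures

# Demo with one pass and one failure.
demo = [
    ("My monitor won't turn on", "Hardware", "Hardware"),
    ("Can't log into the app", "Software", "Other"),
]
acc, fails = score(demo)
print(f"accuracy={acc:.0%}, failures={len(fails)}")  # accuracy=50%, failures=1
```

Keeping the raw failure tuples, not just the rate, is what lets you refine prompts later.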
Best practices
- Keep prompts deterministic during evals (temperature 0-0.3).
- Use the same dataset across models for fair comparisons.
- Track latency + cost alongside quality for a full picture.
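Determinism in practice means pinning the sampling parameters in every eval request. One way to enforce that, assuming the OpenAI-style parameter names the curl example uses (`build_request` is a helper name we pick here):

```python
EVAL_PARAMS = {
    "model": "openai/gpt-4.1",
    "temperature": 0,  # pinned low so reruns are comparable
    "max_tokens": 5,   # labels are single words; cap output cost
}

def build_request(messages, params=EVAL_PARAMS):
    """Merge fixed eval parameters into every request body."""
    return {**params, "messages": messages}

req = build_request([{"role": "user", "content": "My monitor won't turn on."}])
```

Centralizing the parameters in one dict means a model swap or temperature change touches one place, keeping cross-model comparisons fair.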
Optional: Automate weekly runs
Schedule a weekly job that runs your eval suite and writes a summary report. This can feed your internal dashboards or model comparison pages.
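A scheduler such as cron can invoke a script that appends one summary record per run. A minimal sketch of the report-writing step, with a hypothetical record format and file name:

```python
import json
import time

def write_summary(accuracy, failures, path="eval_summary.jsonl"):
    """Append one timestamped summary record per eval run (hypothetical format)."""
    record = {
        "run_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "accuracy": accuracy,
        "failure_count": len(failures),
        "failures": failures[:10],  # keep a few examples for debugging
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

rec = write_summary(
    0.92,
    [{"ticket_text": "Can't log into the app", "expected": "Software", "got": "Other"}],
)
```

Appending JSONL records gives your dashboard a time series to plot, so week-over-week regressions show up as a drop in the accuracy column.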