Test, compare, and improve model outputs.
Evaluations help you verify that model responses meet your expectations for style, accuracy, and safety. This playbook shows how to build simple evals using the Luntrex API and your own test set.
When to use evals
- Switching models and validating quality before a rollout.
- Testing prompt changes against a representative sample set.
- Tracking regressions in accuracy, tone, or safety over time.
Step 1: Define a task + rubric
Create a clear objective and a scoring rubric. Keep it simple at first.
{
"task": "Categorize support tickets into Hardware, Software, or Other",
"rubric": "Output must be exactly one of: Hardware, Software, Other"
}
Step 2: Build a test set
Prepare a JSONL file with input + expected output.
{ "ticket_text": "My monitor won't turn on", "expected_label": "Hardware" }
{ "ticket_text": "Can't log into the app", "expected_label": "Software" }
{ "ticket_text": "Best restaurants in Dhaka?", "expected_label": "Other" }
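Once saved (here we assume the file name `tickets.jsonl`, which you can change), the test set can be loaded with a short Python helper:

```python
import json

def load_test_set(path):
    """Read a JSONL test set: one JSON object per non-empty line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Write the three sample cases from above, then read them back.
sample = [
    {"ticket_text": "My monitor won't turn on", "expected_label": "Hardware"},
    {"ticket_text": "Can't log into the app", "expected_label": "Software"},
    {"ticket_text": "Best restaurants in Dhaka?", "expected_label": "Other"},
]
with open("tickets.jsonl", "w") as f:
    for case in sample:
        f.write(json.dumps(case) + "\n")

cases = load_test_set("tickets.jsonl")
```

JSONL keeps each case on its own line, so corrupt rows fail individually instead of invalidating the whole file.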
Step 3: Run your eval with Luntrex
Loop through your dataset and send each test case to Luntrex using the unified API.
curl https://luntrex.com/api/v1/chat/completions \
-H "Authorization: Bearer YOUR_LUNTREX_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-4.1",
"messages": [
{ "role": "system", "content": "Categorize the ticket into Hardware, Software, or Other." },
{ "role": "user", "content": "My monitor won'\''t turn on." }
]
}'
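The same request can be issued from a Python loop. This is a minimal sketch using only the standard library: the endpoint, model name, and message shape mirror the curl example above, while `normalize_label`, `classify`, and `run_eval` are helper names we introduce here, and how you supply the API key is up to you.

```python
import json
import urllib.request

VALID_LABELS = {"Hardware", "Software", "Other"}

def normalize_label(raw):
    """Map raw model output to one of the rubric labels, or None if it strays."""
    cleaned = raw.strip().strip(".").capitalize()
    return cleaned if cleaned in VALID_LABELS else None

def classify(ticket_text, api_key):
    """Send one test case through the Luntrex chat completions endpoint."""
    payload = json.dumps({
        "model": "openai/gpt-4.1",
        "messages": [
            {"role": "system", "content": "Categorize the ticket into Hardware, Software, or Other."},
            {"role": "user", "content": ticket_text},
        ],
    }).encode()
    req = urllib.request.Request(
        "https://luntrex.com/api/v1/chat/completions",
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = json.load(resp)
    # Assumes an OpenAI-style response shape, as in the curl example.
    return normalize_label(body["choices"][0]["message"]["content"])

def run_eval(cases, api_key):
    """Return (ticket_text, expected, predicted) tuples for later analysis."""
    return [
        (c["ticket_text"], c["expected_label"], classify(c["ticket_text"], api_key))
        for c in cases
    ]
```

Normalizing the raw completion before comparing it keeps trivial formatting drift (trailing periods, casing) from being counted as a failure.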
Analyze results
Compare the model output to your expected label. Track pass/fail rates and examples.
- Start with 20-50 samples, then expand.
- Log failures to refine prompts or switch models.
- Run weekly to detect regressions.
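The pass/fail bookkeeping above can be sketched as a small scoring helper that takes (input, expected, predicted) tuples and returns the accuracy plus the failing examples to log:

```python
def score(results):
    """results: list of (ticket_text, expected, predicted) tuples."""
    failures = [r for r in results if r[1] != r[2]]
    accuracy = 1 - len(failures) / len(results)
    return accuracy, failures

# Demo with one pass and one failure.
demo = [
    ("My monitor won't turn on", "Hardware", "Hardware"),
    ("Can't log into the app", "Software", "Other"),
]
acc, fails = score(demo)
print(f"accuracy={acc:.0%}, failures={len(fails)}")  # accuracy=50%, failures=1
```

Keeping the raw failure tuples, not just the rate, is what lets you refine prompts later.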
Best practices
- Keep prompts deterministic during evals (temperature 0-0.3).
- Use the same dataset across models for fair comparisons.
- Track latency + cost alongside quality for a full picture.
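Determinism in practice means pinning the sampling parameters in every eval request. One way to enforce that, assuming the OpenAI-style parameter names the curl example uses (`build_request` is a helper name we pick here):

```python
EVAL_PARAMS = {
    "model": "openai/gpt-4.1",
    "temperature": 0,  # pinned low so reruns are comparable
    "max_tokens": 5,   # labels are single words; cap output cost
}

def build_request(messages, params=EVAL_PARAMS):
    """Merge fixed eval parameters into every request body."""
    return {**params, "messages": messages}

req = build_request([{"role": "user", "content": "My monitor won't turn on."}])
```

Centralizing the parameters in one dict means a model swap or temperature change touches one place, keeping cross-model comparisons fair.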
Optional: Automate weekly runs
Schedule a weekly job that runs your eval suite and writes a summary report. This can feed your internal dashboards or model comparison pages.
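A scheduler such as cron can invoke a script that appends one summary record per run. A minimal sketch of the report-writing step, with a hypothetical record format and file name:

```python
import json
import time

def write_summary(accuracy, failures, path="eval_summary.jsonl"):
    """Append one timestamped summary record per eval run (hypothetical format)."""
    record = {
        "run_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "accuracy": accuracy,
        "failure_count": len(failures),
        "failures": failures[:10],  # keep a few examples for debugging
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

rec = write_summary(
    0.92,
    [{"ticket_text": "Can't log into the app", "expected": "Software", "got": "Other"}],
)
```

Appending JSONL records gives your dashboard a time series to plot, so week-over-week regressions show up as a drop in the accuracy column.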