Evaluation Overview
Evaluating LLM outputs for quality and accuracy.
What is LLM Evaluation?
LLM evaluation is the process of assessing the quality, accuracy, and usefulness of LLM outputs. Good evaluation helps you:
- Improve prompt quality
- Compare model performance
- Catch regressions
- Build user trust
Evaluation Approaches
1. Automated Evaluation
Use rules, metrics, or other LLMs to evaluate outputs:
| Method | Best For | Pros | Cons |
|---|---|---|---|
| Heuristics | Format validation | Fast, cheap | Limited scope |
| Embeddings | Semantic similarity | Good for retrieval | Needs reference |
| LLM-as-Judge | General quality | Flexible | Costly, slow |
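For the heuristic row above, a format check can be a few lines of plain code. The `checkFormat` helper below is a minimal sketch with hypothetical names, not part of any SDK:
// Hypothetical heuristic check: fast, deterministic validation of output shape
interface HeuristicResult {
  passed: boolean;
  issues: string[];
}
function checkFormat(output: string, maxLength = 2000): HeuristicResult {
  const issues: string[] = [];
  // Empty or over-long responses fail fast
  if (output.trim().length === 0) issues.push('empty response');
  if (output.length > maxLength) issues.push('response exceeds length limit');
  // If the output claims to be JSON, it must parse
  if (output.trim().startsWith('{')) {
    try {
      JSON.parse(output);
    } catch {
      issues.push('invalid JSON');
    }
  }
  return { passed: issues.length === 0, issues };
}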
2. Human Evaluation
Manual review of outputs:
| Method | Best For | Pros | Cons |
|---|---|---|---|
| Expert review | Complex tasks | High quality | Expensive |
| User feedback | Production | Real usage | Biased sample |
| Crowdsourcing | Scale | Volume | Quality variance |
3. User Feedback
Collect ratings from actual users:
- Thumbs up/down
- Star ratings
- Detailed feedback
- Implicit signals (copy, share, retry)
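One way to capture these signals is a single feedback event shape that covers both explicit ratings and implicit actions. The types and endpoint below are illustrative assumptions, not a specific SDK API:
// Hypothetical feedback event covering explicit ratings and implicit signals
type FeedbackSignal =
  | { kind: 'thumbs'; value: 'up' | 'down' }
  | { kind: 'stars'; value: 1 | 2 | 3 | 4 | 5 }
  | { kind: 'comment'; text: string }
  | { kind: 'implicit'; action: 'copy' | 'share' | 'retry' };
interface FeedbackEvent {
  traceId: string;
  signal: FeedbackSignal;
  timestamp: number;
}
// Persist the event wherever your scores live (assumed endpoint shown)
async function recordFeedback(event: FeedbackEvent): Promise<void> {
  await fetch('/api/feedback', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(event),
  });
}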
Evaluation Dimensions
Relevance
Is the response relevant to the query?
// Evaluation prompt
`Is this response relevant to the question?
Question: ${question}
Response: ${response}
Rating: 1-5`
Accuracy
Is the information factually correct?
// Fact checking
`Verify the facts in this response:
Response: ${response}
Context: ${groundTruth}
Are all facts correct? Yes/No
Issues: [list any errors]`
Coherence
Is the response well-organized and logical?
// Coherence check
`Rate the coherence of this response:
Response: ${response}
1 = Incoherent, contradictory
5 = Well-organized, logical flow`
Helpfulness
Does it actually help the user?
// Helpfulness evaluation
`Did this response help answer the user's question?
Question: ${question}
Response: ${response}
Rating: Not helpful / Partially helpful / Very helpful`
Safety
Is the content safe and appropriate?
// Safety check
`Check this response for safety issues:
Response: ${response}
Issues: [harmful content, PII, bias, etc.]
Safe: Yes/No`
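Any of the prompts above can be run through an LLM-as-judge pipeline. The sketch below assumes a generic `callModel(prompt)` helper that returns the judge model's raw text; it is a placeholder, not a specific provider API:
// Sketch of an LLM-as-judge call using the relevance prompt above
// `callModel` is a hypothetical helper that sends a prompt to your judge model
declare function callModel(prompt: string): Promise<string>;
async function judgeRelevance(question: string, response: string): Promise<number> {
  const prompt = `Is this response relevant to the question?
Question: ${question}
Response: ${response}
Rating: 1-5
Reply with only the number.`;
  const raw = await callModel(prompt);
  // Parse the first digit 1-5; treat anything else as a failed evaluation
  const match = raw.match(/[1-5]/);
  if (!match) throw new Error(`Unparseable judge output: ${raw}`);
  return Number(match[0]);
}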
Setting Up Evaluation
1. Define Criteria
What matters for your use case?
| Use Case | Key Criteria |
|---|---|
| Customer support | Helpfulness, accuracy, tone |
| Code generation | Correctness, efficiency, style |
| Content creation | Creativity, relevance, quality |
| RAG | Groundedness, relevance, completeness |
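If you want these criteria to live in code, a simple mapping like the one below (names are illustrative only) can drive which scores get computed per use case:
// Hypothetical mapping from use case to the criteria scored for it
const evaluationCriteria: Record<string, string[]> = {
  'customer-support': ['helpfulness', 'accuracy', 'tone'],
  'code-generation': ['correctness', 'efficiency', 'style'],
  'content-creation': ['creativity', 'relevance', 'quality'],
  rag: ['groundedness', 'relevance', 'completeness'],
};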
2. Choose Methods
Select evaluation methods:
// Example evaluation pipeline
const evaluation = await evaluateResponse({
  response,
  methods: [
    { type: 'llm-judge', criteria: ['relevance', 'helpfulness'] },
    { type: 'user-feedback', enabled: true },
    { type: 'heuristic', checks: ['length', 'format'] },
  ],
});
3. Collect Scores
Track scores in Observability:
// Add evaluation score to generation
await generation.end({
  output: response,
  scores: {
    relevance: 4.5,
    helpfulness: 4.0,
    userRating: 5,
  },
});
4. Analyze Results
Review in the dashboard:
- Average scores over time
- Score distribution
- Low-scoring examples
- Correlation with other metrics
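If you export scores for offline analysis, the basics take only a few lines. The `ScoredTrace` shape below is a hypothetical export format, not a specific dashboard schema:
// Hypothetical exported score records
interface ScoredTrace {
  traceId: string;
  scores: Record<string, number>; // e.g. { relevance: 4.5, helpfulness: 4.0 }
}
// Average of one criterion across traces
function averageScore(traces: ScoredTrace[], criterion: string): number {
  const values = traces
    .map((t) => t.scores[criterion])
    .filter((v): v is number => v !== undefined);
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}
// Low-scoring examples worth a manual look
function lowScoring(traces: ScoredTrace[], criterion: string, threshold = 3): ScoredTrace[] {
  return traces.filter((t) => (t.scores[criterion] ?? Infinity) < threshold);
}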
Integration with Observability
Automatic Scoring
Configure automatic evaluation:
- Go to Settings > Evaluation
- Enable Auto-Evaluate
- Configure criteria and judge model
- Set sampling rate (e.g., 10% of traces)
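If you trigger evaluations from your own code rather than the dashboard setting, a sampling rate is just a random gate. This is a generic sketch, not platform behavior:
// Evaluate roughly 10% of traces by sampling client-side
const SAMPLE_RATE = 0.1;
function shouldEvaluate(): boolean {
  return Math.random() < SAMPLE_RATE;
}
// Usage: if (shouldEvaluate()) { run the judge on this trace }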
Manual Annotation
Add scores in the dashboard:
- View a trace
- Click Add Score
- Select criteria
- Enter score (1-5 or custom scale)
API Scoring
Add scores programmatically:
await obs.score({
  traceId: trace.id,
  name: 'relevance',
  value: 4.5,
  comment: 'Good response but slightly verbose',
});
Best Practices
1. Start Simple
Begin with basic metrics:
- User feedback (thumbs up/down)
- Response length
- Error rate
2. Iterate on Criteria
Refine criteria based on findings:
- What makes a "good" response?
- What issues are most common?
- What matters most to users?
3. Sample Wisely
Evaluate a representative sample:
- Random sampling
- Stratified by query type
- Focus on edge cases
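A minimal stratified sampler can group traces by query type and draw a fixed number from each group. The `Trace` shape and grouping key are assumptions for illustration:
// Hypothetical trace record with a query-type label for stratification
interface Trace {
  id: string;
  queryType: string;
}
// Fisher-Yates shuffle so each group is sampled uniformly
function shuffle<T>(items: T[]): T[] {
  const copy = [...items];
  for (let i = copy.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [copy[i], copy[j]] = [copy[j], copy[i]];
  }
  return copy;
}
// Draw up to `perGroup` random traces from each query type
function stratifiedSample(traces: Trace[], perGroup: number): Trace[] {
  const groups = new Map<string, Trace[]>();
  for (const trace of traces) {
    const group = groups.get(trace.queryType) ?? [];
    group.push(trace);
    groups.set(trace.queryType, group);
  }
  const sample: Trace[] = [];
  for (const group of groups.values()) {
    sample.push(...shuffle(group).slice(0, perGroup));
  }
  return sample;
}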
4. Track Over Time
Monitor evaluation metrics:
- Weekly averages
- Trend analysis
- Regression detection
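Regression detection can start as a comparison of the current window's average against a baseline. The threshold below is an illustrative default, not a recommendation from this platform:
// Flag a regression when the current average drops more than `tolerance`
// below the baseline average (e.g. this week vs. last week)
function detectRegression(
  baselineScores: number[],
  currentScores: number[],
  tolerance = 0.3,
): boolean {
  const mean = (xs: number[]) => xs.reduce((sum, x) => sum + x, 0) / xs.length;
  return mean(currentScores) < mean(baselineScores) - tolerance;
}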
Next Steps
- LLM as Judge - Automated evaluation
- User Feedback - Collecting ratings
- Custom Scores - Custom metrics