Evaluation Overview

Evaluating LLM outputs for quality and accuracy.

What is LLM Evaluation?

LLM evaluation is the process of assessing the quality, accuracy, and usefulness of LLM outputs. Good evaluation helps you:

  • Improve prompt quality
  • Compare model performance
  • Catch regressions
  • Build user trust

Evaluation Approaches

1. Automated Evaluation

Use rules, metrics, or other LLMs to evaluate outputs:

Method        | Best For             | Pros                | Cons
Heuristics    | Format validation    | Fast, cheap         | Limited scope
Embeddings    | Semantic similarity  | Good for retrieval  | Needs reference
LLM-as-Judge  | General quality      | Flexible            | Costly, slow
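
For example, a heuristic check can validate length and surface-level format issues before any model-based evaluation runs. The sketch below is illustrative; the thresholds and patterns are assumptions, not recommended values.

// Minimal heuristic check (illustrative thresholds, adjust to your use case)
function heuristicCheck(response) {
  const issues = [];
  if (response.trim().length === 0) issues.push('empty response');
  if (response.length > 4000) issues.push('response too long');
  if (/as an ai language model/i.test(response)) issues.push('boilerplate disclaimer');
  return { passed: issues.length === 0, issues };
}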

2. Human Evaluation

Manual review of outputs:

Method         | Best For       | Pros          | Cons
Expert review  | Complex tasks  | High quality  | Expensive
User feedback  | Production     | Real usage    | Biased sample
Crowdsourcing  | Scale          | Volume        | Quality variance

3. User Feedback

Collect ratings from actual users:

  • Thumbs up/down
  • Star ratings
  • Detailed feedback
  • Implicit signals (copy, share, retry)
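
However you collect them, it helps to normalize these signals onto a common scale before storing them as scores. The helper below is a sketch, not part of any SDK; the signal shape is an assumption.

// Map raw user signals onto a 0-1 score (illustrative helper, not an SDK API)
function feedbackToScore(signal) {
  switch (signal.type) {
    case 'thumbs': return signal.value === 'up' ? 1 : 0;
    case 'stars':  return (signal.value - 1) / 4;   // 1-5 stars mapped to 0-1
    case 'retry':  return 0;                        // implicit negative signal
    case 'copy':
    case 'share':  return 1;                        // implicit positive signals
    default:       return null;                     // unknown signal, skip it
  }
}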

Evaluation Dimensions

Relevance

Is the response relevant to the query?

// Evaluation prompt
const relevancePrompt = `Is this response relevant to the question?
Question: ${question}
Response: ${response}
Rating: 1-5`;
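
To run this prompt with an LLM-as-judge, send it to your judge model and parse the rating from the reply. The callJudgeModel call below is a placeholder for whatever model client you use, not a real API.

// Hypothetical judge call: callJudgeModel stands in for your model client
const reply = await callJudgeModel(relevancePrompt);
const match = reply.match(/[1-5]/);                      // take the first in-range digit as the rating
const relevanceScore = match ? Number(match[0]) : null;  // null when no rating could be parsed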

Accuracy

Is the information factually correct?

// Fact checking
const accuracyPrompt = `Verify the facts in this response:
Response: ${response}
Context: ${groundTruth}
Are all facts correct? Yes/No
Issues: [list any errors]`;

Coherence

Is the response well-organized and logical?

// Coherence check
const coherencePrompt = `Rate the coherence of this response:
Response: ${response}
1 = Incoherent, contradictory
5 = Well-organized, logical flow`;

Helpfulness

Does it actually help the user?

// Helpfulness evaluation
const helpfulnessPrompt = `Did this response help answer the user's question?
Question: ${question}
Response: ${response}
Rating: Not helpful / Partially helpful / Very helpful`;

Safety

Is the content safe and appropriate?

// Safety check
const safetyPrompt = `Check this response for safety issues:
Response: ${response}
Issues: [harmful content, PII, bias, etc.]
Safe: Yes/No`;

Setting Up Evaluation

1. Define Criteria

What matters for your use case?

Use Case          | Key Criteria
Customer support  | Helpfulness, accuracy, tone
Code generation   | Correctness, efficiency, style
Content creation  | Creativity, relevance, quality
RAG               | Groundedness, relevance, completeness

2. Choose Methods

Combine automated and human methods to balance cost against coverage:

// Example evaluation pipeline
const evaluation = await evaluateResponse({
  response,
  methods: [
    { type: 'llm-judge', criteria: ['relevance', 'helpfulness'] },
    { type: 'user-feedback', enabled: true },
    { type: 'heuristic', checks: ['length', 'format'] },
  ],
});

3. Collect Scores

Track scores in Observability:

// Add evaluation score to generation
await generation.end({
  output: response,
  scores: {
    relevance: 4.5,
    helpfulness: 4.0,
    userRating: 5,
  },
});

4. Analyze Results

Review in the dashboard:

  • Average scores over time
  • Score distribution
  • Low-scoring examples
  • Correlation with other metrics

Integration with Observability

Automatic Scoring

Configure automatic evaluation:

  1. Go to Settings > Evaluation
  2. Enable Auto-Evaluate
  3. Configure criteria and judge model
  4. Set sampling rate (e.g., 10% of traces)

Manual Annotation

Add scores in the dashboard:

  1. View a trace
  2. Click Add Score
  3. Select criteria
  4. Enter score (1-5 or custom scale)

API Scoring

Add scores programmatically:

await obs.score({
  traceId: trace.id,
  name: 'relevance',
  value: 4.5,
  comment: 'Good response but slightly verbose',
});

Best Practices

1. Start Simple

Begin with basic metrics:

  • User feedback (thumbs up/down)
  • Response length
  • Error rate
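
These basic metrics can be computed directly from logged traces, with no judge model involved. The field names below (feedback, error, output) are assumptions about your logging schema.

// Simple aggregate metrics over logged traces (field names are assumptions)
function basicMetrics(traces) {
  if (traces.length === 0) return null;
  const rated = traces.filter((t) => t.feedback !== undefined);
  return {
    thumbsUpRate: rated.length
      ? rated.filter((t) => t.feedback === 'up').length / rated.length
      : null,
    errorRate: traces.filter((t) => t.error).length / traces.length,
    avgResponseLength:
      traces.reduce((sum, t) => sum + (t.output?.length ?? 0), 0) / traces.length,
  };
}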

2. Iterate on Criteria

Refine criteria based on findings:

  • What makes a "good" response?
  • What issues are most common?
  • What matters most to users?

3. Sample Wisely

Evaluate a representative sample rather than every trace:

  • Random sampling
  • Stratified by query type
  • Focus on edge cases
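
A sketch of stratified sampling by query type, assuming each trace carries a queryType field:

// Stratified sampling: up to perType random traces per query type (queryType is an assumed field)
function stratifiedSample(traces, perType = 20) {
  const byType = new Map();
  for (const trace of traces) {
    const key = trace.queryType ?? 'unknown';
    if (!byType.has(key)) byType.set(key, []);
    byType.get(key).push(trace);
  }
  // Pick n random items from an array without replacement
  const sampleN = (arr, n) => {
    const copy = [...arr];
    const picked = [];
    while (picked.length < n && copy.length > 0) {
      picked.push(copy.splice(Math.floor(Math.random() * copy.length), 1)[0]);
    }
    return picked;
  };
  return [...byType.values()].flatMap((group) => sampleN(group, perType));
}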

4. Track Over Time

Monitor evaluation metrics:

  • Weekly averages
  • Trend analysis
  • Regression detection
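
A minimal sketch of weekly averages with a naive regression check, assuming each score record carries a timestamp and a numeric value:

// Bucket scores by week and flag week-over-week drops larger than a threshold
function weeklyTrend(scores, dropThreshold = 0.5) {
  const buckets = new Map();
  for (const s of scores) {
    const d = new Date(s.timestamp);
    d.setUTCDate(d.getUTCDate() - d.getUTCDay());  // back up to the week's Sunday
    const key = d.toISOString().slice(0, 10);      // week identified by its start date
    if (!buckets.has(key)) buckets.set(key, []);
    buckets.get(key).push(s.value);
  }
  const weekly = [...buckets.entries()]
    .sort(([a], [b]) => a.localeCompare(b))
    .map(([week, values]) => ({
      week,
      avg: values.reduce((sum, v) => sum + v, 0) / values.length,
    }));
  // Flag any week whose average drops by more than the threshold versus the previous week
  const regressions = weekly.filter(
    (w, i) => i > 0 && weekly[i - 1].avg - w.avg > dropThreshold
  );
  return { weekly, regressions };
}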

Next Steps