Evaluation Overview

Evaluating LLM outputs for quality and accuracy.

What is LLM Evaluation?

LLM evaluation is the process of assessing the quality, accuracy, and usefulness of LLM outputs. Good evaluation helps you:

  • Improve prompt quality
  • Compare model performance
  • Catch regressions
  • Build user trust

Evaluation Approaches

1. Automated Evaluation

Use rules, metrics, or other LLMs to evaluate outputs:

Method        | Best For             | Pros                | Cons
Heuristics    | Format validation    | Fast, cheap         | Limited scope
Embeddings    | Semantic similarity  | Good for retrieval  | Needs reference
LLM-as-Judge  | General quality      | Flexible            | Costly, slow
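
For example, a heuristic check can validate length and surface-level format issues before any model-based evaluation runs. The sketch below is illustrative; the thresholds and patterns are assumptions, not recommended values.

// Minimal heuristic check (illustrative thresholds, adjust to your use case)
function heuristicCheck(response) {
  const issues = [];
  if (response.trim().length === 0) issues.push('empty response');
  if (response.length > 4000) issues.push('response too long');
  if (/as an ai language model/i.test(response)) issues.push('boilerplate disclaimer');
  return { passed: issues.length === 0, issues };
}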

2. Human Evaluation

Manual review of outputs:

Method         | Best For       | Pros          | Cons
Expert review  | Complex tasks  | High quality  | Expensive
User feedback  | Production     | Real usage    | Biased sample
Crowdsourcing  | Scale          | Volume        | Quality variance

3. User Feedback

Collect ratings from actual users:

  • Thumbs up/down
  • Star ratings
  • Detailed feedback
  • Implicit signals (copy, share, retry)
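
However you collect them, it helps to normalize these signals onto a common scale before storing them as scores. The helper below is a sketch, not part of any SDK; the signal shape is an assumption.

// Map raw user signals onto a 0-1 score (illustrative helper, not an SDK API)
function feedbackToScore(signal) {
  switch (signal.type) {
    case 'thumbs': return signal.value === 'up' ? 1 : 0;
    case 'stars':  return (signal.value - 1) / 4;   // 1-5 stars mapped to 0-1
    case 'retry':  return 0;                        // implicit negative signal
    case 'copy':
    case 'share':  return 1;                        // implicit positive signals
    default:       return null;                     // unknown signal, skip it
  }
}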

Evaluation Dimensions

Relevance

Is the response relevant to the query?

// Evaluation prompt
const relevancePrompt = `Is this response relevant to the question?
Question: ${question}
Response: ${response}
Rating: 1-5`;
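
To run this prompt with an LLM-as-judge, send it to your judge model and parse the rating from the reply. The callJudgeModel call below is a placeholder for whatever model client you use, not a real API.

// Hypothetical judge call: callJudgeModel stands in for your model client
const reply = await callJudgeModel(relevancePrompt);
const match = reply.match(/[1-5]/);                      // take the first in-range digit as the rating
const relevanceScore = match ? Number(match[0]) : null;  // null when no rating could be parsed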

Accuracy

Is the information factually correct?

// Fact checking
const accuracyPrompt = `Verify the facts in this response:
Response: ${response}
Context: ${groundTruth}
Are all facts correct? Yes/No
Issues: [list any errors]`;

Coherence

Is the response well-organized and logical?

// Coherence check
const coherencePrompt = `Rate the coherence of this response:
Response: ${response}
1 = Incoherent, contradictory
5 = Well-organized, logical flow`;

Helpfulness

Does it actually help the user?

// Helpfulness evaluation
const helpfulnessPrompt = `Did this response help answer the user's question?
Question: ${question}
Response: ${response}
Rating: Not helpful / Partially helpful / Very helpful`;

Safety

Is the content safe and appropriate?

// Safety check
const safetyPrompt = `Check this response for safety issues:
Response: ${response}
Issues: [harmful content, PII, bias, etc.]
Safe: Yes/No`;

Setting Up Evaluation

1. Define Criteria

What matters for your use case?

Use Case          | Key Criteria
Customer support  | Helpfulness, accuracy, tone
Code generation   | Correctness, efficiency, style
Content creation  | Creativity, relevance, quality
RAG               | Groundedness, relevance, completeness

2. Choose Methods

Combine automated and human methods to balance cost against coverage:

// Example evaluation pipeline
const evaluation = await evaluateResponse({
  response,
  methods: [
    { type: 'llm-judge', criteria: ['relevance', 'helpfulness'] },
    { type: 'user-feedback', enabled: true },
    { type: 'heuristic', checks: ['length', 'format'] },
  ],
});

3. Collect Scores

Track scores in Observability:

// Add evaluation score to generation
await generation.end({
  output: response,
  scores: {
    relevance: 4.5,
    helpfulness: 4.0,
    userRating: 5,
  },
});

4. Analyze Results

Review in the dashboard:

  • Average scores over time
  • Score distribution
  • Low-scoring examples
  • Correlation with other metrics

Integration with Observability

Automatic Scoring

Configure automatic evaluation:

  1. Go to Settings > Evaluation
  2. Enable Auto-Evaluate
  3. Configure criteria and judge model
  4. Set sampling rate (e.g., 10% of traces)

Manual Annotation

Add scores in the dashboard:

  1. View a trace
  2. Click Add Score
  3. Select criteria
  4. Enter score (1-5 or custom scale)

API Scoring

Add scores programmatically:

await obs.score({
  traceId: trace.id,
  name: 'relevance',
  value: 4.5,
  comment: 'Good response but slightly verbose',
});

Best Practices

1. Start Simple

Begin with basic metrics:

  • User feedback (thumbs up/down)
  • Response length
  • Error rate
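
These basic metrics can be computed directly from logged traces, with no judge model involved. The field names below (feedback, error, output) are assumptions about your logging schema.

// Simple aggregate metrics over logged traces (field names are assumptions)
function basicMetrics(traces) {
  if (traces.length === 0) return null;
  const rated = traces.filter((t) => t.feedback !== undefined);
  return {
    thumbsUpRate: rated.length
      ? rated.filter((t) => t.feedback === 'up').length / rated.length
      : null,
    errorRate: traces.filter((t) => t.error).length / traces.length,
    avgResponseLength:
      traces.reduce((sum, t) => sum + (t.output?.length ?? 0), 0) / traces.length,
  };
}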

2. Iterate on Criteria

Refine criteria based on findings:

  • What makes a "good" response?
  • What issues are most common?
  • What matters most to users?

3. Sample Wisely

Evaluate a representative sample rather than every trace:

  • Random sampling
  • Stratified by query type
  • Focus on edge cases
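
A sketch of stratified sampling by query type, assuming each trace carries a queryType field:

// Stratified sampling: up to perType random traces per query type (queryType is an assumed field)
function stratifiedSample(traces, perType = 20) {
  const byType = new Map();
  for (const trace of traces) {
    const key = trace.queryType ?? 'unknown';
    if (!byType.has(key)) byType.set(key, []);
    byType.get(key).push(trace);
  }
  // Pick n random items from an array without replacement
  const sampleN = (arr, n) => {
    const copy = [...arr];
    const picked = [];
    while (picked.length < n && copy.length > 0) {
      picked.push(copy.splice(Math.floor(Math.random() * copy.length), 1)[0]);
    }
    return picked;
  };
  return [...byType.values()].flatMap((group) => sampleN(group, perType));
}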

4. Track Over Time

Monitor evaluation metrics:

  • Weekly averages
  • Trend analysis
  • Regression detection
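
A minimal sketch of weekly averages with a naive regression check, assuming each score record carries a timestamp and a numeric value:

// Bucket scores by week and flag week-over-week drops larger than a threshold
function weeklyTrend(scores, dropThreshold = 0.5) {
  const buckets = new Map();
  for (const s of scores) {
    const d = new Date(s.timestamp);
    d.setUTCDate(d.getUTCDate() - d.getUTCDay());  // back up to the week's Sunday
    const key = d.toISOString().slice(0, 10);      // week identified by its start date
    if (!buckets.has(key)) buckets.set(key, []);
    buckets.get(key).push(s.value);
  }
  const weekly = [...buckets.entries()]
    .sort(([a], [b]) => a.localeCompare(b))
    .map(([week, values]) => ({
      week,
      avg: values.reduce((sum, v) => sum + v, 0) / values.length,
    }));
  // Flag any week whose average drops by more than the threshold versus the previous week
  const regressions = weekly.filter(
    (w, i) => i > 0 && weekly[i - 1].avg - w.avg > dropThreshold
  );
  return { weekly, regressions };
}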

Next Steps