LLM as Judge
Using LLMs to automatically evaluate response quality.
Overview
LLM-as-judge uses an LLM to evaluate the outputs of another LLM. It's a scalable way to assess quality, relevance, and accuracy across many responses.
How It Works
Response → Judge LLM → Score + Reasoning
- Take the original response
- Send it to a judge LLM along with evaluation criteria
- Receive a score and an explanation
- Store the result for analysis
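The whole pattern fits in one extra model call. Here is a minimal sketch of it without any SDK; the prompt wording and criteria are illustrative, and `judgeResponse` is a hypothetical helper, not part of this library:

```typescript
import OpenAI from 'openai';

const openai = new OpenAI();

// Hypothetical helper showing the bare judge pattern: forward the original
// question and response to a judge model and parse its JSON verdict.
async function judgeResponse(question: string, answer: string) {
  const judgment = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    response_format: { type: 'json_object' },
    messages: [{
      role: 'user',
      content:
        `You are evaluating an AI assistant's response.\n\n` +
        `Question: ${question}\nResponse: ${answer}\n\n` +
        `Score relevance, helpfulness, and accuracy from 1-5 with brief reasoning. ` +
        `Output JSON: {"scores": {"<criterion>": {"score": 0, "reasoning": ""}}}`,
    }],
  });
  // Returns { scores: { relevance: { score, reasoning }, ... } }
  return JSON.parse(judgment.choices[0].message.content ?? '{}');
}
```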
Setting Up LLM Judge
Basic Configuration
```typescript
import { getObservability } from '@transactional/observability';

const obs = getObservability();

// After getting a response
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [...],
});

// Evaluate with LLM judge
const evaluation = await obs.evaluate({
  traceId: trace.id,
  type: 'llm-judge',
  criteria: ['relevance', 'helpfulness', 'accuracy'],
  model: 'gpt-4o-mini', // Judge model
  input: {
    question: userQuestion,
    response: response.choices[0].message.content,
  },
});

console.log(evaluation.scores);
// { relevance: 4.5, helpfulness: 4.0, accuracy: 5.0 }
```

Dashboard Configuration
- Go to Settings > Evaluation
- Enable LLM-as-Judge
- Configure:
- Judge model
- Criteria
- Sampling rate
- Score scale
Evaluation Criteria
Built-in Criteria
| Criterion | Description | Scale |
|---|---|---|
| relevance | Response addresses the query | 1-5 |
| helpfulness | Response is useful to the user | 1-5 |
| accuracy | Information is factually correct | 1-5 |
| coherence | Response is well-organized | 1-5 |
| conciseness | Response is appropriately brief | 1-5 |
| safety | Response is safe and appropriate | Pass/Fail |
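All built-in criteria can be requested in a single call. A sketch following the `evaluate` API shown above; note that safety is pass/fail rather than numeric, and the `'pass'` string below is an assumption about how that result is represented:

```typescript
const evaluation = await obs.evaluate({
  traceId: trace.id,
  type: 'llm-judge',
  criteria: ['relevance', 'helpfulness', 'accuracy', 'coherence', 'conciseness', 'safety'],
  input: { question: userQuestion, response: responseText },
});

// Numeric criteria come back on the 1-5 scale; safety is pass/fail.
if (evaluation.scores.safety !== 'pass') {
  console.warn('Response flagged by the safety check');
}
```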
Custom Criteria
Define your own criteria:
```typescript
const evaluation = await obs.evaluate({
  traceId: trace.id,
  type: 'llm-judge',
  customCriteria: [
    {
      name: 'brand_voice',
      description: 'Response matches our friendly, professional brand voice',
      scale: { min: 1, max: 5 },
    },
    {
      name: 'technical_accuracy',
      description: 'Technical details are correct for our product',
      scale: { min: 1, max: 5 },
    },
  ],
  context: {
    brand_guidelines: 'Friendly, professional, helpful...',
    product_docs: relevantDocs,
  },
});
```

Judge Prompts
Default Prompt Template
```text
You are evaluating an AI assistant's response.

Question: {{question}}
Response: {{response}}

Evaluate the response on these criteria:
{{#each criteria}}
- {{name}}: {{description}}
{{/each}}

For each criterion, provide:
1. A score from 1-5
2. Brief reasoning

Output as JSON:
{
  "scores": {
    "criterion_name": { "score": X, "reasoning": "..." }
  }
}
```
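If you parse the judge's output yourself, the JSON shape above maps to a small TypeScript type. The names here are one reasonable choice, not part of the SDK:

```typescript
// Shape of the judge's JSON output, per the default template above.
interface JudgeVerdict {
  scores: Record<string, { score: number; reasoning: string }>;
}

// rawJudgeOutput is whatever text the judge model returned.
function parseVerdict(rawJudgeOutput: string): JudgeVerdict {
  return JSON.parse(rawJudgeOutput) as JudgeVerdict;
}
```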
Custom Prompt
```typescript
const evaluation = await obs.evaluate({
  traceId: trace.id,
  type: 'llm-judge',
  customPrompt: `
    You are a customer support quality evaluator.

    Customer question: {{question}}
    Agent response: {{response}}

    Evaluate on:
    1. Did the agent resolve the customer's issue?
    2. Was the tone appropriate?
    3. Were next steps clear?

    Score each 1-5 and explain.
  `,
});
```

Judge Model Selection
Recommended Models
| Model | Speed | Cost | Quality |
|---|---|---|---|
| gpt-4o | Slow | $$$ | Best |
| gpt-4o-mini | Fast | $ | Good |
| claude-3-haiku | Fast | $ | Good |
Guidelines
- Use a capable model for complex criteria
- Use faster models for simple checks
- Consider cost at scale
- Test judge accuracy on known examples
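One way to apply these guidelines in code is to route each evaluation to a judge model based on the criteria involved. A sketch where the complexity split is purely illustrative:

```typescript
// Criteria that tend to need a stronger judge (illustrative list).
const COMPLEX_CRITERIA = new Set(['accuracy', 'brand_voice', 'technical_accuracy']);

// Use the capable (slower, pricier) judge only when a complex criterion
// is present; otherwise fall back to the fast, cheap model.
function pickJudgeModel(criteria: string[]): string {
  return criteria.some((c) => COMPLEX_CRITERIA.has(c)) ? 'gpt-4o' : 'gpt-4o-mini';
}
```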
Sampling Strategies
Random Sampling
Evaluate a percentage of all responses:
```typescript
// 10% of responses
samplingRate: 0.1
```

Stratified Sampling
Sample by category:
```typescript
// More sampling for complex queries
samplingRules: [
  { filter: { tags: ['complex'] }, rate: 0.5 },
  { filter: { tags: ['simple'] }, rate: 0.05 },
]
```

Threshold Sampling
Evaluate uncertain responses:
```typescript
// Evaluate when model confidence is low
samplingRules: [
  { filter: { 'metadata.confidence': { $lt: 0.8 } }, rate: 1.0 },
]
```

Analyzing Results
Dashboard Views
- Score Distribution: Histogram of scores
- Score Over Time: Trends and regressions
- Low Scores: Examples needing attention
- Criteria Breakdown: Performance by criterion
Insights
Identify patterns:
- Which query types score lowest?
- What criteria fail most often?
- Are there user segments with lower satisfaction?
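These questions are answerable in the dashboard, but if you export evaluation results you can also answer them in code. A sketch assuming a hypothetical `obs.listEvaluations` call (not a documented API) that returns scored, tagged evaluations:

```typescript
// Hypothetical: fetch recent evaluations and average relevance per tag.
const evals = await obs.listEvaluations({ since: '7d' });

const byTag = new Map<string, { total: number; count: number }>();
for (const e of evals) {
  for (const tag of e.tags ?? []) {
    const agg = byTag.get(tag) ?? { total: 0, count: 0 };
    agg.total += e.scores.relevance;
    agg.count += 1;
    byTag.set(tag, agg);
  }
}

// Lowest-scoring query types surface at the top.
const ranked = [...byTag.entries()]
  .map(([tag, { total, count }]) => ({ tag, avg: total / count }))
  .sort((a, b) => a.avg - b.avg);
console.table(ranked);
```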
Actions
Based on evaluation:
- Update prompts
- Add examples for edge cases
- Improve context retrieval
- Consider model changes
Example: RAG Evaluation
```typescript
async function evaluateRAGResponse(
  question: string,
  response: string,
  retrievedDocs: Document[]
) {
  const obs = getObservability();

  // Create trace for evaluation
  const trace = obs.trace({
    name: 'rag-evaluation',
    input: { question },
  });

  // Evaluate groundedness (is the response supported by the docs?)
  const groundedness = await obs.evaluate({
    traceId: trace.id,
    type: 'llm-judge',
    customPrompt: `
      Question: ${question}
      Response: ${response}

      Source Documents:
      ${retrievedDocs.map(d => d.content).join('\n---\n')}

      Is every claim in the response supported by the source documents?
      Score 1-5 where:
      1 = Contains unsupported claims
      5 = Fully grounded in sources
    `,
    model: 'gpt-4o-mini',
  });

  // Evaluate completeness
  const completeness = await obs.evaluate({
    traceId: trace.id,
    type: 'llm-judge',
    customPrompt: `
      Question: ${question}
      Response: ${response}

      Does the response fully answer the question?
      Score 1-5 where:
      1 = Misses key information
      5 = Complete and comprehensive
    `,
    model: 'gpt-4o-mini',
  });

  await trace.end({
    output: {
      scores: {
        groundedness: groundedness.score,
        completeness: completeness.score,
      },
    },
  });

  return {
    groundedness: groundedness.score,
    completeness: completeness.score,
  };
}
```
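Calling it looks like this, assuming `retrievedDocs` comes from your retrieval step (values illustrative):

```typescript
const scores = await evaluateRAGResponse(
  'What is the refund window?',
  'Refunds are available within 30 days of purchase.',
  retrievedDocs,
);
console.log(scores); // e.g. { groundedness: 5, completeness: 4 }
```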
Best Practices
1. Validate Judge Accuracy
Test on known examples:
```typescript
// Create test set with human-scored examples
const testSet = [
  { question: '...', response: '...', humanScore: 4 },
  // ...
];

// Compare judge scores to human scores
for (const example of testSet) {
  const judgeScore = await evaluate(example);
  console.log(`Human: ${example.humanScore}, Judge: ${judgeScore}`);
}
```
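To reduce the comparison to a single agreement number, mean absolute error over the test set is a simple start (rank correlations such as Spearman's are a stronger follow-up):

```typescript
// Mean absolute error between human and judge scores; lower means closer agreement.
function meanAbsoluteError(pairs: { human: number; judge: number }[]): number {
  const totalError = pairs.reduce((sum, p) => sum + Math.abs(p.human - p.judge), 0);
  return totalError / pairs.length;
}
```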
2. Include Reasoning
Always capture judge reasoning:
```typescript
const evaluation = await obs.evaluate({
  traceId: trace.id, // as in the calls above
  type: 'llm-judge',
  includeReasoning: true, // Get explanations
});

// Review reasoning for low scores
if (evaluation.score < 3) {
  console.log(evaluation.reasoning);
}
```

3. Monitor Judge Costs
Track evaluation costs:
- Cost per evaluation
- Total monthly spend
- Cost vs. benefit
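A back-of-the-envelope estimate keeps the sampling rate honest. Every number below is a placeholder; substitute your own traffic and your provider's current prices:

```typescript
// Rough monthly judge spend: evaluations run x tokens per eval x token price.
const responsesPerMonth = 1_000_000;  // placeholder traffic
const samplingRate = 0.1;             // 10% of responses evaluated
const tokensPerEval = 1_500;          // prompt + judge output, estimated
const pricePerMillionTokens = 0.60;   // placeholder blended USD price

const monthlyCost =
  (responsesPerMonth * samplingRate * tokensPerEval * pricePerMillionTokens) / 1_000_000;

console.log(`~$${monthlyCost.toFixed(2)} / month`); // ~$90.00 with these numbers
```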
4. Iterate on Criteria
Refine criteria based on results:
- Are criteria capturing what matters?
- Is the scale appropriate?
- Are descriptions clear to the judge?
Next Steps
- User Feedback - Combine with user ratings
- Custom Scores - Add your own metrics
- Dashboard - View evaluation results