LLM as Judge

Using LLMs to automatically evaluate response quality.

Overview

LLM-as-judge uses one LLM (the judge) to score the outputs of another. It's a scalable way to assess quality, relevance, and accuracy across large volumes of responses.

How It Works

Response → Judge LLM → Score + Reasoning
  1. Take the original response
  2. Send to a judge LLM with evaluation criteria
  3. Receive a score and explanation
  4. Store for analysis

Setting Up LLM Judge

Basic Configuration

import { getObservability } from '@transactional/observability';
 
const obs = getObservability();
 
// After getting a response
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [...],
});
 
// Evaluate with LLM judge
const evaluation = await obs.evaluate({
  traceId: trace.id,  // ID of the trace that recorded this LLM call
  type: 'llm-judge',
  criteria: ['relevance', 'helpfulness', 'accuracy'],
  model: 'gpt-4o-mini',  // Judge model
  input: {
    question: userQuestion,
    response: response.choices[0].message.content,
  },
});
 
console.log(evaluation.scores);
// { relevance: 4.5, helpfulness: 4.0, accuracy: 5.0 }

Dashboard Configuration

  1. Go to Settings > Evaluation
  2. Enable LLM-as-Judge
  3. Configure:
    • Judge model
    • Criteria
    • Sampling rate
    • Score scale

Evaluation Criteria

Built-in Criteria

Criteria | Description | Scale
relevance | Response addresses the query | 1-5
helpfulness | Response is useful to the user | 1-5
accuracy | Information is factually correct | 1-5
coherence | Response is well-organized | 1-5
conciseness | Response is appropriately brief | 1-5
safety | Response is safe and appropriate | Pass/Fail
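
Built-in criteria are passed by name. The sketch below is illustrative: it reuses the earlier basic example and adds the pass/fail safety check (the exact shape of a pass/fail result is an assumption, not confirmed above).

const evaluation = await obs.evaluate({
  traceId: trace.id,
  type: 'llm-judge',
  criteria: ['relevance', 'safety'],
  model: 'gpt-4o-mini',
  input: {
    question: userQuestion,
    response: response.choices[0].message.content,
  },
});
 
console.log(evaluation.scores);
// Assumed shape: { relevance: 4.0, safety: 'pass' }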

Custom Criteria

Define your own criteria:

const evaluation = await obs.evaluate({
  traceId: trace.id,
  type: 'llm-judge',
  customCriteria: [
    {
      name: 'brand_voice',
      description: 'Response matches our friendly, professional brand voice',
      scale: { min: 1, max: 5 },
    },
    {
      name: 'technical_accuracy',
      description: 'Technical details are correct for our product',
      scale: { min: 1, max: 5 },
    },
  ],
  context: {
    brand_guidelines: 'Friendly, professional, helpful...',
    product_docs: relevantDocs,
  },
});
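
Scores for custom criteria come back keyed by criterion name, following the same pattern as the built-in criteria (the values below are illustrative):

console.log(evaluation.scores);
// { brand_voice: 4.0, technical_accuracy: 4.5 }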

Judge Prompts

Default Prompt Template

You are evaluating an AI assistant's response.

Question: {{question}}
Response: {{response}}

Evaluate the response on these criteria:
{{#each criteria}}
- {{name}}: {{description}}
{{/each}}

For each criterion, provide:
1. A score from 1-5
2. Brief reasoning

Output as JSON:
{
  "scores": {
    "criterion_name": { "score": X, "reasoning": "..." }
  }
}
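
As an illustration, with criteria: ['relevance', 'accuracy'] and a hypothetical question, the template above renders to a prompt like this:

You are evaluating an AI assistant's response.

Question: How do I reset my password?
Response: Open Settings > Account and choose "Reset password".

Evaluate the response on these criteria:
- relevance: Response addresses the query
- accuracy: Information is factually correct

(followed by the scoring and JSON output instructions from the template)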

Custom Prompt

const evaluation = await obs.evaluate({
  traceId: trace.id,
  type: 'llm-judge',
  customPrompt: `
    You are a customer support quality evaluator.
 
    Customer question: {{question}}
    Agent response: {{response}}
 
    Evaluate on:
    1. Did the agent resolve the customer's issue?
    2. Was the tone appropriate?
    3. Were next steps clear?
 
    Score each 1-5 and explain.
  `,
});

Judge Model Selection

Model | Speed | Cost | Quality
gpt-4o | Slow | $$$ | Best
gpt-4o-mini | Fast | $ | Good
claude-3-haiku | Fast | $ | Good

Guidelines

  • Use a capable model for complex criteria
  • Use faster models for simple checks
  • Consider cost at scale
  • Test judge accuracy on known examples

Sampling Strategies

Random Sampling

Evaluate a percentage of all responses:

// 10% of responses
samplingRate: 0.1

Stratified Sampling

Sample by category:

// More sampling for complex queries
samplingRules: [
  { filter: { tags: ['complex'] }, rate: 0.5 },
  { filter: { tags: ['simple'] }, rate: 0.05 },
]

Threshold Sampling

Evaluate uncertain responses:

// Evaluate when model confidence is low
samplingRules: [
  { filter: { 'metadata.confidence': { $lt: 0.8 } }, rate: 1.0 },
]
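
These fragments can be combined into a single evaluation configuration. The sketch below is illustrative: the surrounding object shape is an assumption, while samplingRate and samplingRules mirror the fragments above, with rules assumed to take precedence over the default rate.

// Illustrative sketch; the enclosing config shape is an assumption
const evaluationConfig = {
  type: 'llm-judge',
  model: 'gpt-4o-mini',
  criteria: ['relevance', 'helpfulness'],
  samplingRate: 0.1,  // default: 10% of responses
  samplingRules: [
    { filter: { tags: ['complex'] }, rate: 0.5 },
    { filter: { 'metadata.confidence': { $lt: 0.8 } }, rate: 1.0 },
  ],
};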

Analyzing Results

Dashboard Views

  1. Score Distribution: Histogram of scores
  2. Score Over Time: Trends and regressions
  3. Low Scores: Examples needing attention
  4. Criteria Breakdown: Performance by criterion

Insights

Identify patterns:

  • Which query types score lowest?
  • What criteria fail most often?
  • Are there user segments with lower satisfaction?

Actions

Based on evaluation:

  • Update prompts
  • Add examples for edge cases
  • Improve context retrieval
  • Consider model changes

Example: RAG Evaluation

async function evaluateRAGResponse(
  question: string,
  response: string,
  retrievedDocs: Document[]
) {
  const obs = getObservability();
 
  // Create trace for evaluation
  const trace = obs.trace({
    name: 'rag-evaluation',
    input: { question },
  });
 
  // Evaluate groundedness (is response supported by docs?)
  const groundedness = await obs.evaluate({
    traceId: trace.id,
    type: 'llm-judge',
    customPrompt: `
      Question: ${question}
      Response: ${response}
 
      Source Documents:
      ${retrievedDocs.map(d => d.content).join('\n---\n')}
 
      Is every claim in the response supported by the source documents?
      Score 1-5 where:
      1 = Contains unsupported claims
      5 = Fully grounded in sources
    `,
    model: 'gpt-4o-mini',
  });
 
  // Evaluate completeness
  const completeness = await obs.evaluate({
    traceId: trace.id,
    type: 'llm-judge',
    customPrompt: `
      Question: ${question}
      Response: ${response}
 
      Does the response fully answer the question?
      Score 1-5 where:
      1 = Misses key information
      5 = Complete and comprehensive
    `,
    model: 'gpt-4o-mini',
  });
 
  await trace.end({
    output: {
      scores: {
        groundedness: groundedness.score,
        completeness: completeness.score,
      },
    },
  });
 
  return {
    groundedness: groundedness.score,
    completeness: completeness.score,
  };
}
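
Calling the helper looks like this (the question, response, and documents are placeholders):

const scores = await evaluateRAGResponse(
  'What is our refund window?',
  'Refunds are available within 30 days of purchase.',
  retrievedDocs,
);
console.log(scores);
// e.g. { groundedness: 5, completeness: 4 }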

Best Practices

1. Validate Judge Accuracy

Test on known examples:

// Create test set with human-scored examples
const testSet = [
  { question: '...', response: '...', humanScore: 4 },
  // ...
];
 
// Compare judge scores to human scores
// (evaluate() is a small wrapper around obs.evaluate() for one example)
for (const example of testSet) {
  const judgeScore = await evaluate(example);
  console.log(`Human: ${example.humanScore}, Judge: ${judgeScore}`);
}
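
To quantify agreement, one option is the mean absolute error between judge and human scores. This is a minimal sketch, assuming evaluate() resolves to a single numeric score on the same 1-5 scale:

// Mean absolute error between judge and human scores
let totalError = 0;
for (const example of testSet) {
  const judgeScore = await evaluate(example);
  totalError += Math.abs(judgeScore - example.humanScore);
}
console.log(`MAE: ${(totalError / testSet.length).toFixed(2)}`);
// A large gap (e.g. approaching a full point on a 1-5 scale) suggests
// the judge prompt, criteria descriptions, or model may need tuning.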

2. Include Reasoning

Always capture judge reasoning:

const evaluation = await obs.evaluate({
  // ...traceId, type, and criteria/customPrompt as in the earlier examples
  includeReasoning: true,  // Return the judge's explanation alongside each score
});
 
// Review reasoning for low scores
// (single-criterion and custom-prompt evaluations return one score, as in the RAG example)
if (evaluation.score < 3) {
  console.log(evaluation.reasoning);
}

3. Monitor Judge Costs

Track evaluation costs:

  • Cost per evaluation
  • Total monthly spend
  • Cost vs. benefit
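
As a rough sketch, cost per evaluation can be estimated from token counts and your judge model's per-token pricing. The token counts and prices below are placeholders, not actual rates:

// Back-of-the-envelope estimate; substitute your judge model's real prices
const inputTokens = 800;    // question + response + criteria prompt
const outputTokens = 200;   // scores + reasoning
const pricePerMTokIn = 0.15;   // placeholder USD per 1M input tokens
const pricePerMTokOut = 0.60;  // placeholder USD per 1M output tokens
 
const costPerEval =
  (inputTokens / 1_000_000) * pricePerMTokIn +
  (outputTokens / 1_000_000) * pricePerMTokOut;
 
// Scale by volume and sampling rate to estimate monthly spend
const monthlyEvals = 100_000 * 0.1;  // 100k responses at 10% sampling
console.log(`~$${(costPerEval * monthlyEvals).toFixed(2)}/month`);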

4. Iterate on Criteria

Refine criteria based on results:

  • Are criteria capturing what matters?
  • Is the scale appropriate?
  • Are descriptions clear to the judge?

Next Steps