LLM as Judge

Using LLMs to automatically evaluate response quality.

Overview

LLM-as-judge uses one LLM (the judge) to score the outputs of another. It's a scalable way to assess quality, relevance, and accuracy across large volumes of responses.

How It Works

Response → Judge LLM → Score + Reasoning
  1. Take the original response
  2. Send to a judge LLM with evaluation criteria
  3. Receive a score and explanation
  4. Store for analysis

Setting Up LLM Judge

Basic Configuration

import { getObservability } from '@transactional/observability';
 
const obs = getObservability();
 
// After getting a response
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [...],
});
 
// Evaluate with LLM judge
const evaluation = await obs.evaluate({
  traceId: trace.id,  // ID of the trace that recorded this LLM call
  type: 'llm-judge',
  criteria: ['relevance', 'helpfulness', 'accuracy'],
  model: 'gpt-4o-mini',  // Judge model
  input: {
    question: userQuestion,
    response: response.choices[0].message.content,
  },
});
 
console.log(evaluation.scores);
// { relevance: 4.5, helpfulness: 4.0, accuracy: 5.0 }

Dashboard Configuration

  1. Go to Settings > Evaluation
  2. Enable LLM-as-Judge
  3. Configure:
    • Judge model
    • Criteria
    • Sampling rate
    • Score scale

Evaluation Criteria

Built-in Criteria

Criteria | Description | Scale
relevance | Response addresses the query | 1-5
helpfulness | Response is useful to the user | 1-5
accuracy | Information is factually correct | 1-5
coherence | Response is well-organized | 1-5
conciseness | Response is appropriately brief | 1-5
safety | Response is safe and appropriate | Pass/Fail
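
Built-in criteria are passed by name. The sketch below is illustrative: it reuses the earlier basic example and adds the pass/fail safety check (the exact shape of a pass/fail result is an assumption, not confirmed above).

const evaluation = await obs.evaluate({
  traceId: trace.id,
  type: 'llm-judge',
  criteria: ['relevance', 'safety'],
  model: 'gpt-4o-mini',
  input: {
    question: userQuestion,
    response: response.choices[0].message.content,
  },
});
 
console.log(evaluation.scores);
// Assumed shape: { relevance: 4.0, safety: 'pass' }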

Custom Criteria

Define your own criteria:

const evaluation = await obs.evaluate({
  traceId: trace.id,
  type: 'llm-judge',
  customCriteria: [
    {
      name: 'brand_voice',
      description: 'Response matches our friendly, professional brand voice',
      scale: { min: 1, max: 5 },
    },
    {
      name: 'technical_accuracy',
      description: 'Technical details are correct for our product',
      scale: { min: 1, max: 5 },
    },
  ],
  context: {
    brand_guidelines: 'Friendly, professional, helpful...',
    product_docs: relevantDocs,
  },
});
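
Scores for custom criteria come back keyed by criterion name, following the same pattern as the built-in criteria (the values below are illustrative):

console.log(evaluation.scores);
// { brand_voice: 4.0, technical_accuracy: 4.5 }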

Judge Prompts

Default Prompt Template

You are evaluating an AI assistant's response.

Question: {{question}}
Response: {{response}}

Evaluate the response on these criteria:
{{#each criteria}}
- {{name}}: {{description}}
{{/each}}

For each criterion, provide:
1. A score from 1-5
2. Brief reasoning

Output as JSON:
{
  "scores": {
    "criterion_name": { "score": X, "reasoning": "..." }
  }
}
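
As an illustration, with criteria: ['relevance', 'accuracy'] and a hypothetical question, the template above renders to a prompt like this:

You are evaluating an AI assistant's response.

Question: How do I reset my password?
Response: Open Settings > Account and choose "Reset password".

Evaluate the response on these criteria:
- relevance: Response addresses the query
- accuracy: Information is factually correct

(followed by the scoring and JSON output instructions from the template)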

Custom Prompt

const evaluation = await obs.evaluate({
  traceId: trace.id,
  type: 'llm-judge',
  customPrompt: `
    You are a customer support quality evaluator.
 
    Customer question: {{question}}
    Agent response: {{response}}
 
    Evaluate on:
    1. Did the agent resolve the customer's issue?
    2. Was the tone appropriate?
    3. Were next steps clear?
 
    Score each 1-5 and explain.
  `,
});

Judge Model Selection

Model | Speed | Cost | Quality
gpt-4o | Slow | $$$ | Best
gpt-4o-mini | Fast | $ | Good
claude-3-haiku | Fast | $ | Good

Guidelines

  • Use a capable model for complex criteria
  • Use faster models for simple checks
  • Consider cost at scale
  • Test judge accuracy on known examples

Sampling Strategies

Random Sampling

Evaluate a percentage of all responses:

// 10% of responses
samplingRate: 0.1

Stratified Sampling

Sample by category:

// More sampling for complex queries
samplingRules: [
  { filter: { tags: ['complex'] }, rate: 0.5 },
  { filter: { tags: ['simple'] }, rate: 0.05 },
]

Threshold Sampling

Evaluate uncertain responses:

// Evaluate when model confidence is low
samplingRules: [
  { filter: { 'metadata.confidence': { $lt: 0.8 } }, rate: 1.0 },
]
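
These fragments can be combined into a single evaluation configuration. The sketch below is illustrative: the surrounding object shape is an assumption, while samplingRate and samplingRules mirror the fragments above, with rules assumed to take precedence over the default rate.

// Illustrative sketch; the enclosing config shape is an assumption
const evaluationConfig = {
  type: 'llm-judge',
  model: 'gpt-4o-mini',
  criteria: ['relevance', 'helpfulness'],
  samplingRate: 0.1,  // default: 10% of responses
  samplingRules: [
    { filter: { tags: ['complex'] }, rate: 0.5 },
    { filter: { 'metadata.confidence': { $lt: 0.8 } }, rate: 1.0 },
  ],
};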

Analyzing Results

Dashboard Views

  1. Score Distribution: Histogram of scores
  2. Score Over Time: Trends and regressions
  3. Low Scores: Examples needing attention
  4. Criteria Breakdown: Performance by criterion

Insights

Identify patterns:

  • Which query types score lowest?
  • What criteria fail most often?
  • Are there user segments with lower satisfaction?

Actions

Based on evaluation:

  • Update prompts
  • Add examples for edge cases
  • Improve context retrieval
  • Consider model changes

Example: RAG Evaluation

async function evaluateRAGResponse(
  question: string,
  response: string,
  retrievedDocs: Document[]
) {
  const obs = getObservability();
 
  // Create trace for evaluation
  const trace = obs.trace({
    name: 'rag-evaluation',
    input: { question },
  });
 
  // Evaluate groundedness (is response supported by docs?)
  const groundedness = await obs.evaluate({
    traceId: trace.id,
    type: 'llm-judge',
    customPrompt: `
      Question: ${question}
      Response: ${response}
 
      Source Documents:
      ${retrievedDocs.map(d => d.content).join('\n---\n')}
 
      Is every claim in the response supported by the source documents?
      Score 1-5 where:
      1 = Contains unsupported claims
      5 = Fully grounded in sources
    `,
    model: 'gpt-4o-mini',
  });
 
  // Evaluate completeness
  const completeness = await obs.evaluate({
    traceId: trace.id,
    type: 'llm-judge',
    customPrompt: `
      Question: ${question}
      Response: ${response}
 
      Does the response fully answer the question?
      Score 1-5 where:
      1 = Misses key information
      5 = Complete and comprehensive
    `,
    model: 'gpt-4o-mini',
  });
 
  await trace.end({
    output: {
      scores: {
        groundedness: groundedness.score,
        completeness: completeness.score,
      },
    },
  });
 
  return {
    groundedness: groundedness.score,
    completeness: completeness.score,
  };
}
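
Calling the helper looks like this (the question, response, and documents are placeholders):

const scores = await evaluateRAGResponse(
  'What is our refund window?',
  'Refunds are available within 30 days of purchase.',
  retrievedDocs,
);
console.log(scores);
// e.g. { groundedness: 5, completeness: 4 }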

Best Practices

1. Validate Judge Accuracy

Test on known examples:

// Create test set with human-scored examples
const testSet = [
  { question: '...', response: '...', humanScore: 4 },
  // ...
];
 
// Compare judge scores to human scores
// (evaluate() is a small wrapper around obs.evaluate() for one example)
for (const example of testSet) {
  const judgeScore = await evaluate(example);
  console.log(`Human: ${example.humanScore}, Judge: ${judgeScore}`);
}
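
To quantify agreement, one option is the mean absolute error between judge and human scores. This is a minimal sketch, assuming evaluate() resolves to a single numeric score on the same 1-5 scale:

// Mean absolute error between judge and human scores
let totalError = 0;
for (const example of testSet) {
  const judgeScore = await evaluate(example);
  totalError += Math.abs(judgeScore - example.humanScore);
}
console.log(`MAE: ${(totalError / testSet.length).toFixed(2)}`);
// A large gap (e.g. approaching a full point on a 1-5 scale) suggests
// the judge prompt, criteria descriptions, or model may need tuning.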

2. Include Reasoning

Always capture judge reasoning:

const evaluation = await obs.evaluate({
  // ...traceId, type, and criteria/customPrompt as in the earlier examples
  includeReasoning: true,  // Return the judge's explanation alongside each score
});
 
// Review reasoning for low scores
// (single-criterion and custom-prompt evaluations return one score, as in the RAG example)
if (evaluation.score < 3) {
  console.log(evaluation.reasoning);
}

3. Monitor Judge Costs

Track evaluation costs:

  • Cost per evaluation
  • Total monthly spend
  • Cost vs. benefit
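
As a rough sketch, cost per evaluation can be estimated from token counts and your judge model's per-token pricing. The token counts and prices below are placeholders, not actual rates:

// Back-of-the-envelope estimate; substitute your judge model's real prices
const inputTokens = 800;    // question + response + criteria prompt
const outputTokens = 200;   // scores + reasoning
const pricePerMTokIn = 0.15;   // placeholder USD per 1M input tokens
const pricePerMTokOut = 0.60;  // placeholder USD per 1M output tokens
 
const costPerEval =
  (inputTokens / 1_000_000) * pricePerMTokIn +
  (outputTokens / 1_000_000) * pricePerMTokOut;
 
// Scale by volume and sampling rate to estimate monthly spend
const monthlyEvals = 100_000 * 0.1;  // 100k responses at 10% sampling
console.log(`~$${(costPerEval * monthlyEvals).toFixed(2)}/month`);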

4. Iterate on Criteria

Refine criteria based on results:

  • Are criteria capturing what matters?
  • Is the scale appropriate?
  • Are descriptions clear to the judge?

Next Steps