
Legal AI Benchmarks Should Evaluate Scaffolds, Not Just Models

Most legal AI leaderboards answer one question: which model performs best. But lawyers never use a raw model. They use it inside a system. We ran the same model through three setups (plain chat and two scaffolds) to see how much the system around the model actually matters.

LN Labs Research · Published June 2026 · 8 min read

Most legal AI leaderboards today answer one question: “Which model performs best on legal tasks?” That is an important question. But it is no longer enough.

In real legal workflows, lawyers and legal teams rarely use a raw model directly. They use a model inside an application that sends prompts to the model, lets it use tools, provides document retrieval and parsing, and adds validation logic and complex agentic workflows. In other words, they use a scaffold. And if the scaffold changes the quality of the output, then evaluating the model alone gives us only part of the picture.

Are we evaluating the model, or the full system that helps the model perform the task?

01 · Precedent

What coding evals already showed

Earlier this year, Princeton’s Holistic Agent Leaderboard highlighted an important point from coding-agent evaluations: running the same model through different scaffolds can produce materially different results.

Princeton HAL · CORE-Bench

Running Opus 4.5 with a Claude Code-based scaffold substantially outperformed the earlier CORE-Agent scaffold: same model, different surrounding system, different score.

hal.cs.princeton.edu

This is not just a coding story. The model matters, but so does the surrounding agentic system: the tools, the execution loop, the file access, the test-running environment, and the way the task is decomposed. Legal AI is now reaching the same point.

02 · Context

Why this matters for legal AI

Legal AI is moving from simple Q&A into long-horizon legal work: reviewing a vendor data processing agreement against internal privacy requirements, identifying compliance gaps against a new regulatory update, mapping legal obligations to internal teams. For this type of work, the model is only one component. The final performance depends on the full configuration around the model.

That is why model-only legal benchmarks are useful, but incomplete.

Harvey · Legal Agent Benchmark

An open-source benchmark to evaluate legal agents on realistic work. Each task includes an instruction, a client matter with relevant materials, and a required work product for review, plus an execution harness for running and scoring agents.

It moves legal evals closer to how legal work is actually assigned.

If legal agents are becoming the unit of work, why are so many legal AI comparisons still focused mainly on the model?

03 · Hypothesis

Our hypothesis

The same model may perform differently depending on the legal scaffold wrapped around it, and the cost of completing those tasks may differ too.

If scaffold choice changes performance in coding tasks, it may also change performance in legal tasks.

If the hypothesis is correct, then the fastest way to improve legal AI performance may not always be fine-tuning. In many cases, it may be better scaffold, harness, retrieval, and workflow engineering. This matters for:

·BigLaw innovation teams
·Legal data vendors
·Legal AI companies
·Enterprise legal departments

04 · Method

What we tested

We ran a small internal legal evaluation experiment. The goal was not a universal benchmark but something narrower and more practical: to test whether different legal scaffolds change performance on recurring legal and compliance tasks.

Model: Opus 4.8
Dataset: 40 tasks
Domain: Data protection & operational resilience

Three system configurations

CONFIG 01: Claude Chat (direct chat, no scaffold)
CONFIG 02: Cowork + Legal Plugin (Cowork + Privacy / Regulatory plugin)
CONFIG 03: MikeOSS (open-source scaffold)

Scaffold: performs the task

The surrounding system that helps the model complete a legal task: instructions, retrieval, tools, workflow logic, and output structure.

Harness: evaluates the task

The evaluation infrastructure used to run the tasks and score the outputs, independent of the system doing the work.
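The separation between the two roles can be sketched in a few lines of Python. This is a minimal illustration, not the harness used in the experiment: the `Scaffold` and `Judge` type aliases, the `run_harness` function, and the toy scaffolds and judge are all hypothetical names introduced here.

```python
from typing import Callable, Dict, List

# Hypothetical types: a scaffold turns a task instruction into a work product;
# a judge turns (task, output) into a list of pass/fail rubric verdicts.
Scaffold = Callable[[str], str]
Judge = Callable[[str, str], List[bool]]


def run_harness(
    tasks: List[str], scaffolds: Dict[str, Scaffold], judge: Judge
) -> Dict[str, float]:
    """Run every task through every scaffold and return each scaffold's pass rate."""
    results: Dict[str, float] = {}
    for name, scaffold in scaffolds.items():
        verdicts: List[bool] = []
        for task in tasks:
            output = scaffold(task)               # the scaffold performs the task
            verdicts.extend(judge(task, output))  # the harness evaluates it
        results[name] = sum(verdicts) / len(verdicts) if verdicts else 0.0
    return results


# Toy stand-ins for illustration only; real scaffolds would call a model.
def plain_chat(task: str) -> str:
    return f"answer: {task}"


def with_sources(task: str) -> str:
    return f"answer: {task}\nsources: cited"


def toy_judge(task: str, output: str) -> List[bool]:
    return [task in output, "sources:" in output]


scores = run_harness(
    ["review DPA"], {"chat": plain_chat, "scaffold": with_sources}, toy_judge
)
```

Because the harness only sees a task going in and a work product coming out, the same tasks and the same judge can be reused unchanged across configurations, which is what makes a same-model comparison meaningful.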

LLM-as-a-judge: Fable 5

Evaluated during the week it was still publicly available.

Scoring method: Rubric-based

Share of rubric checks graded “Pass” across all tasks.
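Read literally, this metric pools every rubric check across all tasks before dividing. A minimal sketch of that calculation follows, assuming grades are stored as "Pass"/"Fail" strings per task; the function names and sample grades are hypothetical. The second function shows the alternative of averaging per-task rates, which gives a different number when tasks carry different numbers of checks.

```python
from typing import Dict, List


def pooled_pass_rate(grades: Dict[str, List[str]]) -> float:
    """Share of all rubric checks graded "Pass", pooled across tasks."""
    total = sum(len(checks) for checks in grades.values())
    if total == 0:
        raise ValueError("no rubric checks to score")
    passed = sum(checks.count("Pass") for checks in grades.values())
    return passed / total


def mean_task_pass_rate(grades: Dict[str, List[str]]) -> float:
    """Alternative: average of per-task pass rates (each task weighted equally)."""
    rates = [checks.count("Pass") / len(checks) for checks in grades.values() if checks]
    if not rates:
        raise ValueError("no rubric checks to score")
    return sum(rates) / len(rates)


# Hypothetical grades for two tasks with different rubric lengths.
sample = {
    "task-01": ["Pass", "Pass", "Fail", "Pass"],
    "task-02": ["Fail", "Pass"],
}
pooled = pooled_pass_rate(sample)      # 4 passes out of 6 checks
per_task = mean_task_pass_rate(sample) # mean of 0.75 and 0.50
```

Pooling gives longer rubrics more weight; averaging per task treats every task equally. Either is defensible, but a published score should say which one it uses.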

05 · Dataset

What was inside the eval dataset

The dataset covers tasks that financial-market companies face regularly in privacy, compliance, and operational resilience work. For example:

Example task

“Draft a CCPA/CPRA client consultation response advising on a described California consumer privacy scenario, using three supporting documents: a drafting playbook, a case context document, and a regulatory rules reference.”

Capabilities tested
·Extracting obligations from legal and regulatory materials
·Identifying affected internal teams
·Assessing practical business impact
·Spotting missing information
·Mapping requirements to internal controls
·Producing structured outputs for legal review
Scored against
·Legal soundness
·Completeness
·Faithfulness to provided materials
·Quality of reasoning
·Cost of end-to-end execution

Small by design: a practice-specific eval, not a general legal-reasoning benchmark.
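One way to represent a task of this shape is a small record holding the instruction, the supporting documents, and rubric checks tagged by scoring dimension. The sketch below is an assumption about structure, not the dataset's actual schema: `EvalTask`, `RubricCheck`, the field names, and the two sample checks are hypothetical. Cost is left out because it is measured during execution rather than graded.

```python
from dataclasses import dataclass, field
from typing import List, Set

# Graded dimensions taken from the "Scored against" list above.
DIMENSIONS = (
    "legal soundness",
    "completeness",
    "faithfulness to provided materials",
    "quality of reasoning",
)


@dataclass
class RubricCheck:
    dimension: str
    description: str

    def __post_init__(self) -> None:
        if self.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {self.dimension}")


@dataclass
class EvalTask:
    task_id: str
    instruction: str
    supporting_documents: List[str]
    rubric: List[RubricCheck] = field(default_factory=list)

    def dimensions_covered(self) -> Set[str]:
        """Return the set of dimensions that at least one rubric check targets."""
        return {check.dimension for check in self.rubric}


# The example task above, with two hypothetical rubric checks.
task = EvalTask(
    task_id="ccpa-consultation",
    instruction="Draft a CCPA/CPRA client consultation response.",
    supporting_documents=[
        "drafting playbook",
        "case context document",
        "regulatory rules reference",
    ],
    rubric=[
        RubricCheck("legal soundness", "Applies the correct statutory provisions."),
        RubricCheck("faithfulness to provided materials", "Follows the drafting playbook."),
    ],
)
```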

06 · Results

The scaffold changed the score

One key finding stood out: the legal-workflow scaffold (Cowork + Legal Plugin) led on every quality dimension, even though the underlying model was identical across all three configurations.

Quality

Quality leaderboard
Pass rate · 40 tasks · Final results

Rank  Configuration                                    Evidence  Soundness  Reasoning  Overall
1     Cowork + Legal Plugin (Claude Cowork scaffold)   83.2%     84.8%      80.1%      82.7
2     Claude Chat (direct chat, no scaffold)           77.7%     79.7%      69.3%      75.6
3     MikeOSS (open-source scaffold)                   77.4%     75.9%      64.9%      72.7

Scaffold choice materially affected performance on legal tasks.

The point is not that one scaffold wins universally; that would be too strong a claim from 40 tasks. It is that evaluation should not stop at the model layer.
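The article does not say how the Overall figure is derived. The reported numbers are consistent with an unweighted mean of the three dimension scores rounded to one decimal place, and the short check below tests that assumption against the leaderboard values. The variable and function names are illustrative.

```python
from typing import Dict, Tuple

# (Evidence, Soundness, Reasoning) pass rates in percent, as reported above.
DIMENSION_SCORES: Dict[str, Tuple[float, float, float]] = {
    "Cowork + Legal Plugin": (83.2, 84.8, 80.1),
    "Claude Chat": (77.7, 79.7, 69.3),
    "MikeOSS": (77.4, 75.9, 64.9),
}


def overall(dimensions: Tuple[float, ...]) -> float:
    """Assumed aggregation: unweighted mean, rounded to one decimal place."""
    return round(sum(dimensions) / len(dimensions), 1)


overall_scores = {name: overall(dims) for name, dims in DIMENSION_SCORES.items()}

# Gap between the legal-workflow scaffold and plain chat, in percentage points.
overall_gap = round(overall_scores["Cowork + Legal Plugin"] - overall_scores["Claude Chat"], 1)
reasoning_gap = round(
    DIMENSION_SCORES["Cowork + Legal Plugin"][2] - DIMENSION_SCORES["Claude Chat"][2], 1
)
```

Under this assumption the computed values match the published Overall column, and the largest gap between the scaffold and plain chat sits in the Reasoning dimension.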

Cost

A second finding emerged on cost: the average spend to complete one task. The scores and the price tags did not line up the way you might expect.

Cost per task
USD · averaged over 40 tasks · lower is cheaper

Configuration            Cost per task
MikeOSS                  $0.30
Cowork + Legal Plugin    $0.80
Claude Chat              $2.80

Claude Cowork with Legal Plugin executed tasks more cost-efficiently than Claude Chat. The reason is simple: the scaffold reached a good-quality output in fewer attempts than the raw model baseline, and fewer attempts mean fewer tokens used. Scaffold engineering shapes not just the quality of model performance, but its cost-efficiency too.
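One way to read quality and cost together is a simple dominance check: a configuration is dominated if another one is at least as good on both quality and cost and strictly better on at least one. The sketch below applies that check to the reported figures. It is an illustrative analysis added here, not part of the original experiment, and the names are hypothetical.

```python
from typing import Dict, List, Tuple

# (overall quality score, average USD per task), as reported above.
CONFIGS: Dict[str, Tuple[float, float]] = {
    "Cowork + Legal Plugin": (82.7, 0.80),
    "Claude Chat": (75.6, 2.80),
    "MikeOSS": (72.7, 0.30),
}


def dominators(name: str) -> List[str]:
    """Configurations that are no worse on quality and cost, and better on one."""
    quality, cost = CONFIGS[name]
    return [
        other
        for other, (q, c) in CONFIGS.items()
        if other != name and q >= quality and c <= cost and (q > quality or c < cost)
    ]


# Configurations that no other configuration dominates.
frontier = [name for name in CONFIGS if not dominators(name)]
```

On these numbers, plain chat is dominated by the legal-workflow scaffold (lower score, higher cost), while MikeOSS stays on the frontier as the cheapest option despite the lowest score.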

07 · Implications

What this means for legal AI teams

The question is no longer only “which model should we use?” It becomes “which model + scaffold configuration performs best for this workflow?”

BigLaw AI teams

Which model + scaffold configuration performs best for this specific legal workflow?

Legal data vendors

How do our content, taxonomy, and retrieval layer improve downstream AI performance?

Legal AI vendors

How much of our performance comes from the base model, and how much from the scaffold?

Frontier labs

Legal capability should be evaluated in realistic agentic environments, not model-only benchmarks.

08 · Takeaway

Why scaffold engineering matters now

Fine-tuning will become increasingly important for legal AI, especially as open-source and open-weight models improve. But for many applied teams today, scaffold and harness engineering may still be the fastest and most cost-efficient way to improve performance.

Before investing in fine-tuning, understand how much performance can be unlocked through better task design, retrieval, output schemas, validation logic, and evals.

Before tuning the model, evaluate the system.

Our next step

We plan to open-source our legal eval dataset soon, so other teams can validate our results and run their own evaluations, and we will continue testing other models, scaffolds, and legal-agent configurations.

The broader goal: a more practical evaluation layer for legal AI, one that reflects how legal work is actually done. Not just “which model is best?” but “which legal AI system performs best for this task?”

Work with us

Legal AI needs benchmarks that evaluate the full system, not just the model.

If you are building legal AI evals, legal-agent workflows, or applied AI systems for law firms and legal teams, we would be happy to compare notes.
