Legal AI Benchmarks Should Evaluate Scaffolds, Not Just Models
Most legal AI leaderboards answer one question: which model performs best. But lawyers never use a raw model. They use it inside a system. We ran the same model through three setups (plain chat and two scaffolds) to see how much the system around the model actually matters.
Most legal AI leaderboards today answer one question: “Which model performs best on legal tasks?” That is an important question. But it is no longer enough.
In real legal workflows, lawyers and legal teams rarely use a raw model directly. They use a model inside an application that sends prompts to the model, lets it use tools, provides document retrieval and parsing, and adds validation logic and complex agentic workflows. In other words, they use a scaffold. And if the scaffold changes the quality of the output, then evaluating the model alone gives us only part of the picture.
Are we evaluating the model, or the full system that helps the model perform the task?
What coding evals already showed
Earlier this year, Princeton’s Holistic Agent Leaderboard highlighted an important point from coding-agent evaluations: running the same model through different scaffolds can produce materially different results.
Running Opus 4.5 with a Claude Code-based scaffold substantially outperformed the earlier CORE-Agent scaffold: same model, different surrounding system, different score.
Source: hal.cs.princeton.edu
This is not just a coding story. The model matters, but so does the surrounding agentic system: the tools, the execution loop, the file access, the test-running environment, and the way the task is decomposed. Legal AI is now reaching the same point.
Why this matters for legal AI
Legal AI is moving from simple Q&A into long-horizon legal work: reviewing a vendor data processing agreement against internal privacy requirements, identifying compliance gaps against a new regulatory update, mapping legal obligations to internal teams. For this type of work, the model is only one component. The final performance depends on the full configuration around the model.
That is why model-only legal benchmarks are useful, but incomplete.
An open-source benchmark to evaluate legal agents on realistic work. Each task includes an instruction, a client matter with relevant materials, and a required work product for review, plus an execution harness for running and scoring agents.
It moves legal evals closer to how legal work is actually assigned.
If legal agents are becoming the unit of work, why are so many legal AI comparisons still focused mainly on the model?
Our hypothesis
The same model may perform differently depending on the legal scaffold wrapped around it, and the cost of completing those tasks may differ too.
If scaffold choice changes performance in coding tasks, it may also change performance in legal tasks.
If the hypothesis is correct, then the fastest way to improve legal AI performance may not always be fine-tuning. In many cases, it may be better scaffold, harness, retrieval, and workflow engineering.
What we tested
We ran a small internal legal evaluation experiment. The goal was not to build a universal benchmark; it was narrower and more practical: to test whether different legal scaffolds change performance on recurring legal and compliance tasks.
Scaffold: the surrounding system that helps the model complete a legal task: instructions, retrieval, tools, workflow logic, and output structure.
Harness: the evaluation infrastructure used to run the tasks and score the outputs, independent of the system doing the work.
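The split between scaffold and harness can be made concrete with a minimal sketch. Everything in it is an illustrative assumption rather than the setup used in the experiment: the `Task` fields, the two toy scaffolds (`plain_chat`, `with_materials`), the `term_coverage` scorer, and the `echo_model` stand-in that replaces a real model call. The point is only the shape: the harness holds the model fixed, swaps the scaffold, and scores every output the same way.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical types: a "model" maps a prompt to text, a "scaffold" wraps a
# model and turns a task into a work product, a "scorer" grades that output.
Model = Callable[[str], str]
Scaffold = Callable[[Model, "Task"], str]
Scorer = Callable[["Task", str], float]


@dataclass(frozen=True)
class Task:
    instruction: str
    materials: List[str]
    required_terms: List[str]  # toy rubric, for illustration only


def plain_chat(model: Model, task: Task) -> str:
    """Baseline: send only the bare instruction to the model."""
    return model(task.instruction)


def with_materials(model: Model, task: Task) -> str:
    """Toy scaffold: prepend the supporting materials to the instruction."""
    context = "\n".join(task.materials)
    return model(f"{context}\n\n{task.instruction}")


def term_coverage(task: Task, output: str) -> float:
    """Toy scorer: fraction of required terms that appear in the output."""
    hits = sum(term.lower() in output.lower() for term in task.required_terms)
    return hits / len(task.required_terms)


def run_harness(
    model: Model,
    scaffolds: Dict[str, Scaffold],
    tasks: List[Task],
    scorer: Scorer,
) -> Dict[str, float]:
    """Run every scaffold on every task with the same model; mean score per scaffold."""
    return {
        name: mean(scorer(task, scaffold(model, task)) for task in tasks)
        for name, scaffold in scaffolds.items()
    }


def echo_model(prompt: str) -> str:
    """Stand-in for a real model call: it simply repeats its input."""
    return prompt


tasks = [
    Task(
        instruction="Draft a consultation response.",
        materials=["Playbook: address data retention.", "Rules: mention the notice requirement."],
        required_terms=["data retention", "notice requirement"],
    )
]
scaffolds = {"plain_chat": plain_chat, "with_materials": with_materials}
results = run_harness(echo_model, scaffolds, tasks, term_coverage)
print(results)
```

Because the harness only sees a callable that takes a model and a task, a real scaffold (retrieval, tools, validation steps) could be dropped in without touching the scoring code, which is what keeps the comparison about the scaffold rather than the grader.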
What was inside the eval dataset
The dataset consists of tasks that financial-market companies face regularly in privacy, compliance, and operational resilience work. For example:
“Draft a CCPA/CPRA client consultation response advising on a described California consumer privacy scenario, using three supporting documents: a drafting playbook, a case context document, and a regulatory rules reference.”
Small by design: a practice-specific eval, not a general legal-reasoning benchmark.
The scaffold changed the score
One key finding stood out: the legal-workflow scaffold (Cowork + Legal Plugin) led on every dimension, even though the underlying model was identical across all three setups.
Quality
Scaffold choice materially affected performance on legal tasks.
The point is not that one scaffold wins universally; that would be too strong a claim from 40 tasks. It is that evaluation should not stop at the model layer.
Cost
A second finding emerged on cost: the average spend to complete one task. The scores and the price tags did not line up the way you might expect.
Claude Cowork with Legal Plugin executed tasks more cost-efficiently than Claude Chat. The reason is simple: the scaffold reached a good-quality output in fewer attempts than the raw model baseline, and fewer attempts mean fewer tokens used. Scaffold engineering shapes not just the quality of model performance, but its cost-efficiency too.
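The arithmetic behind "fewer attempts mean fewer tokens" can be sketched as follows. The token counts and per-million-token prices are made-up placeholders chosen for easy arithmetic, not measurements from the experiment, and the `Attempt` record and function names are hypothetical. The sketch only shows how a setup that spends more per attempt can still spend less per task.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Attempt:
    input_tokens: int
    output_tokens: int


def attempt_cost(a: Attempt, usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Cost of one attempt, given prices in USD per million tokens."""
    return (a.input_tokens * usd_per_m_input + a.output_tokens * usd_per_m_output) / 1_000_000


def avg_cost_per_task(
    tasks: List[List[Attempt]], usd_per_m_input: float, usd_per_m_output: float
) -> float:
    """tasks[i] holds every attempt made on task i; returns the mean spend per task."""
    totals = [
        sum(attempt_cost(a, usd_per_m_input, usd_per_m_output) for a in attempts)
        for attempts in tasks
    ]
    return sum(totals) / len(totals)


# Placeholder prices and token counts (assumptions, not real data).
IN_PRICE, OUT_PRICE = 2.0, 10.0

# Baseline: cheap attempts, but three of them before the output is acceptable.
plain_chat_runs = [[Attempt(10_000, 5_000)] * 3]
# Scaffold: one larger attempt (more context in, slightly more out).
scaffold_runs = [[Attempt(20_000, 6_000)]]

chat_cost = avg_cost_per_task(plain_chat_runs, IN_PRICE, OUT_PRICE)
scaffold_cost = avg_cost_per_task(scaffold_runs, IN_PRICE, OUT_PRICE)
print(round(chat_cost, 2), round(scaffold_cost, 2))
```

With these placeholder inputs the single scaffolded attempt costs more than any one chat attempt, yet the per-task total is lower because the retries disappear. That is why cost is best reported per completed task rather than per call.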
What this means for legal AI teams
The question is no longer only “which model should we use?” It becomes “which model + scaffold configuration performs best for this workflow?”
Which model + scaffold configuration performs best for this specific legal workflow?
How do our content, taxonomy, and retrieval layer improve downstream AI performance?
How much of our performance comes from the base model, and how much from the scaffold?
Legal capability should be evaluated in realistic agentic environments, not model-only benchmarks.
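One simple way to approach the "base model versus scaffold" question is to score a grid of model and scaffold combinations and subtract the baseline scaffold's score for each model. The sketch below assumes such a grid already exists; the model names, scaffold names, and scores are placeholders, not results from any experiment, and `scaffold_uplift` is a hypothetical helper.

```python
from typing import Dict, Tuple

# (model, scaffold) -> mean task score on the same eval set
Scores = Dict[Tuple[str, str], float]


def scaffold_uplift(scores: Scores, baseline: str) -> Dict[Tuple[str, str], float]:
    """Score gained over the baseline scaffold, holding the model fixed."""
    return {
        (model, scaffold): round(score - scores[(model, baseline)], 4)
        for (model, scaffold), score in scores.items()
        if scaffold != baseline
    }


# Placeholder numbers only (assumptions for illustration).
scores = {
    ("model_a", "plain_chat"): 0.50,
    ("model_a", "legal_scaffold"): 0.75,
    ("model_b", "plain_chat"): 0.60,
    ("model_b", "legal_scaffold"): 0.70,
}
uplift = scaffold_uplift(scores, baseline="plain_chat")
print(uplift)
```

Reading the grid this way separates two effects: the difference between rows with the same scaffold reflects the model, and the uplift within a row reflects the scaffold. A real analysis would also need enough tasks and repeated runs to tell an uplift from noise.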
Why scaffold engineering matters now
Fine-tuning will become increasingly important for legal AI, especially as open-source and open-weight models improve. But for many applied teams today, scaffold and harness engineering may still be the fastest and most cost-efficient way to improve performance.
Before investing in fine-tuning, understand how much performance can be unlocked through better task design, retrieval, output schemas, validation logic, and evals.
Before tuning the model, evaluate the system.
We plan to open-source our legal eval dataset soon, so other teams can validate our results and run their own evaluations, and we will continue testing other models, scaffolds, and legal-agent configurations.
The broader goal: a more practical evaluation layer for legal AI, one that reflects how legal work is actually done. Not just “which model is best?” but “which legal AI system performs best for this task?”
Legal AI needs benchmarks that evaluate the full system, not just the model.
If you are building legal AI evals, legal-agent workflows, or applied AI systems for law firms and legal teams, we would be happy to compare notes.