Legal AI Benchmarks Should Evaluate Scaffolds, Not Just Models

Most legal AI leaderboards answer one question: which model performs best. But lawyers never use a raw model. They use it inside a system. We ran the same model through three setups (plain chat and two scaffolds) to see how much the system around the model actually matters.

LN Labs Research · Published June 2026 · 8 min read

Most legal AI leaderboards today answer one question: “Which model performs best on legal tasks?” That is an important question. But it is no longer enough.

In real legal workflows, lawyers and legal teams rarely use a raw model directly. They use a model inside an application that sends prompts to the model, lets it use tools, provides document retrieval and parsing, and adds validation logic and complex agentic workflows. In other words, they use a scaffold. And if the scaffold changes the quality of the output, then evaluating the model alone gives us only part of the picture.

Are we evaluating the model, or the full system that helps the model perform the task?

01 · Precedent

What coding evals already showed

Earlier this year, Princeton’s Holistic Agent Leaderboard highlighted an important point from coding-agent evaluations: running the same model through different scaffolds can produce materially different results.

Princeton HAL · CORE-Bench

Opus 4.5 run with a Claude Code-based scaffold substantially outperformed the same model run with the earlier CORE-Agent scaffold: same model, different surrounding system, different score.

Source: hal.cs.princeton.edu

This is not just a coding story. The model matters, but so does the surrounding agentic system: the tools, the execution loop, the file access, the test-running environment, and the way the task is decomposed. Legal AI is now reaching the same point.

02 · Context

Why this matters for legal AI

Legal AI is moving from simple Q&A into long-horizon legal work: reviewing a vendor data processing agreement against internal privacy requirements, identifying compliance gaps against a new regulatory update, mapping legal obligations to internal teams. For this type of work, the model is only one component. The final performance depends on the full configuration around the model.

That is why model-only legal benchmarks are useful, but incomplete.

Harvey · Legal Agent Benchmark

An open-source benchmark to evaluate legal agents on realistic work. Each task includes an instruction, a client matter with relevant materials, and a required work product for review, plus an execution harness for running and scoring agents.

It moves legal evals closer to how legal work is actually assigned.

If legal agents are becoming the unit of work, why are so many legal AI comparisons still focused mainly on the model?

03 · Hypothesis

Our hypothesis

The same model may perform differently depending on the legal scaffold wrapped around it, and the cost of completing those tasks may differ too.

If scaffold choice changes performance in coding tasks, it may also change performance in legal tasks.

If the hypothesis is correct, then the fastest way to improve legal AI performance may not always be fine-tuning. In many cases, it may be better scaffold, harness, retrieval, and workflow engineering. This matters for:

·BigLaw innovation teams

·Legal data vendors

·Legal AI companies

·Enterprise legal departments

04 · Method

What we tested

We ran a small internal legal evaluation experiment. The goal was not to build a universal benchmark. It was narrower and more practical: to test whether different legal scaffolds change performance on recurring legal and compliance tasks.

Model: Opus 4.8

Dataset: 40 tasks

Domain: Data protection & operational resilience

Three system configurations

CONFIG 01 · Claude Chat: direct chat, no scaffold

CONFIG 02 · Cowork + Legal Plugin: Cowork with the Privacy / Regulatory plugin

CONFIG 03 · MikeOSS: open-source scaffold

Scaffold: performs the task

The surrounding system that helps the model complete a legal task: instructions, retrieval, tools, workflow logic, and output structure.

Harness: evaluates the task

The evaluation infrastructure used to run the tasks and score the outputs, independent of the system doing the work.
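The split between the two roles can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the experiment: `Task`, `Scaffold`, `PlainChat`, `run_harness`, and the `model` and `judge` callables are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass(frozen=True)
class Task:
    """One eval task: an instruction plus supporting documents (hypothetical shape)."""
    task_id: str
    instruction: str
    documents: tuple[str, ...]


class Scaffold(Protocol):
    """Anything that performs a task and returns a work product."""
    def run(self, task: Task) -> str: ...


class PlainChat:
    """Baseline with no scaffold: send the instruction and documents straight to the model."""
    def __init__(self, model: Callable[[str], str]) -> None:
        self.model = model

    def run(self, task: Task) -> str:
        prompt = task.instruction + "\n\n" + "\n\n".join(task.documents)
        return self.model(prompt)


def run_harness(
    scaffolds: dict[str, Scaffold],
    tasks: list[Task],
    judge: Callable[[Task, str], list[bool]],
) -> dict[str, list[bool]]:
    """Run every scaffold on every task and collect the judge's pass/fail verdicts.

    The harness only sees `run()` and the returned text, so it stays
    independent of how each scaffold does the work.
    """
    verdicts: dict[str, list[bool]] = {}
    for name, scaffold in scaffolds.items():
        verdicts[name] = [
            verdict
            for task in tasks
            for verdict in judge(task, scaffold.run(task))
        ]
    return verdicts
```

Because every configuration goes through the same `run_harness` call with the same tasks and the same judge, any difference in the verdicts can be attributed to the scaffold rather than to the evaluation setup.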

LLM-as-a-judge: Fable 5 (evaluated during the week it was still publicly available)

Scoring method: rubric-based, reported as the share of rubric checks graded “Pass” across all tasks.
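As a rough sketch of that scoring rule, the pass rate can be computed by pooling every rubric check across tasks and dividing passes by the total. The verdict records below are made-up placeholders, and grouping checks by dimension is an assumption based on how the results are reported.

```python
from collections import defaultdict

# Hypothetical judge verdicts: (task_id, dimension, passed). Not real eval data.
verdicts = [
    ("task-01", "evidence", True),
    ("task-01", "soundness", True),
    ("task-01", "reasoning", False),
    ("task-02", "evidence", True),
    ("task-02", "soundness", False),
    ("task-02", "reasoning", True),
    ("task-02", "reasoning", True),
]


def pass_rates(records):
    """Return {dimension: share of checks graded Pass}, pooled across all tasks."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for _task_id, dimension, ok in records:
        total[dimension] += 1
        passed[dimension] += int(ok)
    return {dimension: passed[dimension] / total[dimension] for dimension in total}


rates = pass_rates(verdicts)
```

Pooling means a task with many rubric checks weighs more than a task with few. Averaging per-task pass rates instead would weight every task equally, so a published result should state which of the two was used.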

05 · Dataset

What was inside the eval dataset

The dataset consists of tasks that financial-market companies face regularly in privacy, compliance, and operational resilience work. For example:

Example task

“Draft a CCPA/CPRA client consultation response advising on a described California consumer privacy scenario, using three supporting documents: a drafting playbook, a case context document, and a regulatory rules reference.”

Capabilities tested

·Extracting obligations from legal and regulatory materials

·Identifying affected internal teams

·Assessing practical business impact

·Spotting missing information

·Mapping requirements to internal controls

·Producing structured outputs for legal review

Scored against

·Legal soundness

·Completeness

·Faithfulness to provided materials

·Quality of reasoning

·Cost of end-to-end execution

Small by design: a practice-specific eval, not a general legal-reasoning benchmark.

06 · Results

The scaffold changed the score

One key finding stood out: the legal-workflow scaffold (Cowork + Legal Plugin) led on every quality dimension, even though the underlying model was identical across all three configurations.

Quality


Quality leaderboard

Pass rate · 40 data-protection and operational-resilience tasks · final results

| System | Evidence | Legal soundness | Reasoning | Overall (avg) |
|---|---|---|---|---|
| Cowork + Legal Plugin (Claude Cowork scaffold, top) | 83.2% | 84.8% | 80.1% | 82.7 |
| Claude Chat (direct chat, no scaffold) | 77.7% | 79.7% | 69.3% | 75.6 |
| MikeOSS (open-source scaffold) | 77.4% | 75.9% | 64.9% | 72.7 |

Scaffold choice materially affected performance on legal tasks.

The point is not that one scaffold wins universally; that would be too strong a claim from 40 tasks. It is that evaluation should not stop at the model layer.
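The reported numbers are consistent with the Overall score being the unweighted mean of the three dimension pass rates. The short check below recomputes it from the published figures and measures each system's gap to the plain-chat baseline. The variable names are ours, and the equal weighting is an inference from the arithmetic, not a documented formula.

```python
# Dimension pass rates (%) copied from the quality leaderboard.
scores = {
    "Cowork + Legal Plugin": {"evidence": 83.2, "soundness": 84.8, "reasoning": 80.1},
    "Claude Chat": {"evidence": 77.7, "soundness": 79.7, "reasoning": 69.3},
    "MikeOSS": {"evidence": 77.4, "soundness": 75.9, "reasoning": 64.9},
}
reported_overall = {"Cowork + Legal Plugin": 82.7, "Claude Chat": 75.6, "MikeOSS": 72.7}

# Unweighted mean of the three dimensions, rounded to one decimal place.
overall = {
    name: round(sum(dims.values()) / len(dims), 1)
    for name, dims in scores.items()
}

# Gap to the no-scaffold baseline, in percentage points.
baseline = overall["Claude Chat"]
gaps = {name: round(value - baseline, 1) for name, value in overall.items()}

for name in scores:
    print(f"{name}: overall {overall[name]}, gap vs Claude Chat {gaps[name]:+.1f}")
```

On these figures the recomputed means match the reported Overall column, Cowork + Legal Plugin sits 7.1 points above plain chat, and MikeOSS sits 2.9 points below it. A scaffold can therefore lower the score as well as raise it.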

Cost

A second finding emerged on cost: the average spend to complete one task. The scores and the price tags did not line up the way you might expect.

Cost per task

USD · averaged over 40 tasks · lower is cheaper

| System | Cost per task |
|---|---|
| MikeOSS | $0.30 |
| Cowork + Legal Plugin | $0.80 |
| Claude Chat | $2.80 |

Claude Cowork with the Legal Plugin executed tasks more cost-efficiently than Claude Chat ($0.80 versus $2.80 per task). The reason is simple: the scaffold reached a good-quality output in fewer attempts than the raw model baseline, and fewer attempts mean fewer tokens used. Scaffold engineering shapes not just the quality of model performance, but its cost-efficiency too.
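To put the per-task averages in context, the sketch below scales them to the full 40-task run and expresses each system's cost relative to the plain-chat baseline. The per-task costs are the published figures. The function and field names are illustrative.

```python
TASKS = 40  # size of the eval set

# Average cost per task in USD, copied from the published figures.
cost_per_task = {
    "MikeOSS": 0.30,
    "Cowork + Legal Plugin": 0.80,
    "Claude Chat": 2.80,
}


def cost_summary(costs, tasks, baseline="Claude Chat"):
    """Return total run cost and how many times cheaper each system is than the baseline."""
    summary = {}
    for name, cost in costs.items():
        summary[name] = {
            "total_usd": round(cost * tasks, 2),
            "times_cheaper_than_baseline": round(costs[baseline] / cost, 2),
        }
    return summary


summary = cost_summary(cost_per_task, TASKS)
for name, row in summary.items():
    print(
        f"{name}: ${row['total_usd']:.2f} for {TASKS} tasks, "
        f"{row['times_cheaper_than_baseline']}x cheaper than Claude Chat"
    )
```

For the whole run this works out to about $12 for MikeOSS, $32 for Cowork + Legal Plugin, and $112 for Claude Chat. The top-scoring configuration was 3.5 times cheaper than plain chat, while the cheapest configuration, at roughly 9.3 times cheaper, also had the lowest quality score.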

07 · Implications

What this means for legal AI teams

The question is no longer only “which model should we use?” It becomes “which model + scaffold configuration performs best for this workflow?”

BigLaw AI teams

Which model + scaffold configuration performs best for this specific legal workflow?

Legal data vendors

How do our content, taxonomy, and retrieval layer improve downstream AI performance?

Legal AI vendors

How much of our performance comes from the base model, and how much from the scaffold?

Frontier labs

Legal capability should be evaluated in realistic agentic environments, not model-only benchmarks.

08 · Takeaway

Why scaffold engineering matters now

Fine-tuning will become increasingly important for legal AI, especially as open-source and open-weight models improve. But for many applied teams today, scaffold and harness engineering may still be the fastest and most cost-efficient way to improve performance.

Before investing in fine-tuning, understand how much performance can be unlocked through better task design, retrieval, output schemas, validation logic, and evals.

Before tuning the model, evaluate the system.

Our next step

We plan to open-source our legal eval dataset soon, so other teams can validate our results and run their own evaluations, and we will continue testing other models, scaffolds, and legal-agent configurations.

The broader goal: a more practical evaluation layer for legal AI, one that reflects how legal work is actually done. Not just “which model is best?” but “which legal AI system performs best for this task?”

Work with us

Legal AI needs benchmarks that evaluate the full system, not just the model.

If you are building legal AI evals, legal-agent workflows, or applied AI systems for law firms and legal teams, we would be happy to compare notes.
