
[EDRM Editor’s Note: The opinions and positions are those of John Tredennick.]
Between November 12th and November 24th, all three major AI labs released new flagship models. OpenAI shipped GPT-5.1. Google launched Gemini 3 Pro. Anthropic dropped Claude Opus 4.5. Each represents a meaningful leap in capability. Each claims to be “state of the art.” And each one landed in the market faster than most legal tech vendors deploy minor updates.
If your discovery platform architecture assumes a single “best” model, you just watched your vendor’s calculus become obsolete three times in twelve days.
What Just Changed
GPT-5.1 introduced adaptive reasoning—the model itself decides whether to allocate “thinking time” to complex problems or move quickly through straightforward tasks. OpenAI is positioning this as the conversational, intelligent default for natural interaction.
Gemini 3 Pro achieved top rankings on multiple academic benchmarks, including PhD-level reasoning performance with a 1 million-token context window. Google also previewed Deep Think mode, which demonstrated even stronger reasoning on complex analytical tasks.
Claude Opus 4.5 became the first model to exceed 80% on SWE-Bench Verified, a rigorous software engineering benchmark. More significantly, Anthropic cut pricing substantially—making flagship-tier capabilities available at roughly one-third of previous costs.
None of these are cosmetic improvements. Each represents architectural decisions about what “intelligence” means for production systems.
The Patterns That Matter for Legal Teams
Here’s what jumped out across all three releases:
Reasoning depth is now table stakes. All three labs are competing on who can make models “think longer” or “reason more carefully” on hard problems—whether that’s called Thinking mode, Deep Think, or an effort parameter. The competition has moved from raw accuracy to deliberative accuracy: how well can a model slow down, reconsider, and give you the right answer, not just the fast one?
Pricing is collapsing at the top tier. Claude Opus 4.5 costs roughly one-third to one-half of previous Opus models for similar or better quality. OpenAI and Google are both scaling down inference costs faster than anyone expected. The economic floor for “best in class” is dropping, which means you can afford to use better models for more tasks—but it also means what’s “worth using” changes month to month.
No one is winning permanently. Google claims top benchmarks. Anthropic claims best coding performance. OpenAI claims best conversational reasoning. They’re all right—and all wrong—depending on your task, your latency budget, and your preference for depth versus speed. There is no single “best” model. There’s only “best for this task, right now.”
The Fundamental Problem: Legal Discovery Isn’t One Task
The legal technology market has historically assumed vendors pick a model, build their platform around it, and that’s the end of the architecture discussion. That assumption is now dangerous because legal discovery requires two fundamentally different optimization profiles:
Bulk Document Processing
You have millions of pages to process. Cost per document and throughput dominate. Accuracy matters, but speed and expense per document are the limiting factors. You don’t need adaptive reasoning overhead on every page—you need something fast and economical. This is first-pass privilege review, broad categorization, initial summaries.
Strategic Synthesis and Analysis
You’re synthesizing thousands of documents into a narrative, finding contradictions, building arguments. Reasoning depth matters enormously. A model that thinks longer and finds nuance is worth the extra latency. You’re willing to pay more (and wait longer) for reliability and depth on high-stakes tasks. This is closing arguments, investigation reports, fraud pattern analysis, expert witness preparation.
A single model forces you to choose: overpay for reasoning capabilities you don’t need on bulk work, or sacrifice analytical depth on high-stakes synthesis to maintain acceptable bulk processing costs.
This isn’t a minor technical tradeoff. It’s a structural business decision that affects every matter you handle.
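The two-profile split can be made concrete with a small routing sketch. This is an illustration of the concept, not any platform's actual API: the model names, the task taxonomy, and the per-token rates are all assumptions chosen for clarity.

```python
# Hypothetical task-based model routing: cheap, fast models for bulk passes;
# a deeper (and pricier) reasoning model for synthesis. All names and rates
# below are illustrative assumptions, not real vendor pricing.

from dataclasses import dataclass

@dataclass
class ModelProfile:
    name: str
    cost_per_mtok: float   # blended USD per million tokens (assumed)
    deep_reasoning: bool

BULK = ModelProfile("fast-economical-model", cost_per_mtok=1.0, deep_reasoning=False)
SYNTHESIS = ModelProfile("flagship-reasoning-model", cost_per_mtok=15.0, deep_reasoning=True)

def route(task_type: str) -> ModelProfile:
    """Pick a model tier per task, rather than one platform-wide default."""
    bulk_tasks = {"privilege_screen", "categorize", "summarize_first_pass"}
    return BULK if task_type in bulk_tasks else SYNTHESIS

print(route("categorize").name)              # bulk tier handles first-pass work
print(route("fraud_pattern_analysis").name)  # synthesis tier handles deep analysis
```

The point of the sketch is that the routing decision lives in one place: a single-model platform hard-codes the answer to `route()` for every task at once.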
The Velocity Problem
Now layer in the release cadence we just witnessed. When three flagship models ship in twelve days—each with different strengths, weaknesses, and pricing models—single-model architectures create three equally problematic scenarios:
Locked in and falling behind. Your vendor chose OpenAI in October. Claude Opus 4.5 now offers comparable performance at one-third the cost. You’re stuck paying more than competitors while your vendor renegotiates contracts.
Perpetual migration. Your vendor commits to “always using the best model.” That means platform-wide migrations every few weeks as leadership shifts between labs. Each migration risks workflow disruption and retraining overhead.
Frozen in time. Your vendor picks one model and stays there to maintain stability. You’re now paying 2025 prices for 2024 capabilities while the technology moves past you.
None of these outcomes serve your interests. All of them are architectural artifacts, not business necessities.
A Different Approach: Multi-Model Architecture
Merlin Alchemy was designed from inception around a different premise: there is no permanently “best” model—there are multiple models, each optimized for different tasks, and users should control which model handles which work.
For bulk document processing:
- Select models optimized for speed and economy
- Current options include Claude Sonnet variants, Gemini 3 Pro, GPT-5.1 Instant
- When pricing drops or efficiency improves, swap models without changing workflows
- Decision criteria: cost per page processed, throughput, accuracy thresholds
For strategic synthesis:
- Deploy flagship reasoning models where depth justifies cost
- Current options include GPT-5.1 Thinking, Gemini 3 Deep Think, Claude Opus 4.5
- A/B test new releases against established baselines
- Decision criteria: analytical sophistication, citation accuracy, reasoning coherence
You don’t tell clients “Alchemy uses Claude” or “Alchemy uses GPT.” Alchemy uses whatever combination delivers optimal results for each specific task, and that combination evolves as the technology advances.
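One way to picture the "swap models without changing workflows" claim is a registry that binds task roles to model identifiers, so workflow code never names a model directly. This is a minimal sketch under assumed names; it is not Alchemy's implementation.

```python
# Hypothetical role-to-model registry. Workflows reference a role ("bulk",
# "synthesis"); the registry decides which model currently fills that role.
# Model identifiers are made up for illustration.

REGISTRY = {
    "bulk": "economy-model-v1",
    "synthesis": "flagship-model-v1",
}

def run_workflow(role: str, document: str) -> str:
    model = REGISTRY[role]  # workflow code never hard-codes a model name
    return f"[{model}] processed: {document[:20]}"

# When a new release wins an A/B test or a price cut lands, the change is
# one registry update, not a platform-wide migration.
REGISTRY["synthesis"] = "flagship-model-v2"
```

The design choice is indirection: because workflows depend on roles rather than vendors, a model swap is a configuration change instead of a re-architecture.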
The Economic Dimension
This isn’t purely an architectural or capability discussion. There’s a material cost implication.
When Anthropic dropped Opus 4.5 pricing by roughly two-thirds while claiming improved performance, single-model platforms faced an immediate arbitrage problem. Multi-model platforms simply redirected synthesis work to the more cost-effective option while maintaining bulk processing on already-optimized models.
For organizations processing millions of pages annually, that flexibility translates to six-figure cost differences—savings you can return to clients or capture as operational efficiency.
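The six-figure claim is easy to sanity-check with back-of-the-envelope arithmetic. The volumes and per-token rates below are assumptions for illustration only, not actual vendor pricing or any firm's real workload.

```python
# Back-of-the-envelope annual cost comparison. Every number here is an
# assumed, illustrative figure.
pages_per_year = 10_000_000    # "millions of pages annually"
tokens_per_page = 1_000        # rough average, assumed
flagship_rate = 15.0           # USD per million tokens before a price cut (assumed)
discounted_rate = 5.0          # after a roughly two-thirds cut (assumed)

total_mtok = pages_per_year * tokens_per_page / 1_000_000
single_model_cost = total_mtok * flagship_rate   # locked into the old rate
rerouted_cost = total_mtok * discounted_rate     # multi-model platform reroutes

# At these assumed volumes the gap is six figures per year.
print(f"Annual savings: ${single_model_cost - rerouted_cost:,.0f}")
```

Change the assumptions and the absolute number moves, but the structure of the argument holds: the savings scale linearly with volume, so the locked-in premium grows with every page processed.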
Framework for Evaluating Your Platform
If you’re managing discovery operations, litigation technology, or legal department innovation, ask your current vendor:
Architecture questions:
- Can we select different models for different tasks within a single workflow?
- What’s your process for evaluating and integrating new models?
- How long between model release and availability in your platform?
Economic questions:
- If a model vendor drops pricing 50%, when do we see that savings?
- Can we A/B test new models against our current baseline?
- What’s the migration cost if we want to change models?
Risk questions:
- What happens if our current model becomes significantly more expensive than alternatives?
- How do you handle model deprecation by vendors?
- What’s our exposure if one vendor’s capabilities fall behind?
The answers reveal whether you’re on a platform designed for this moment or one that assumed the technology landscape would remain stable.
What This Means for Legal Practice
The question facing legal departments and law firms isn’t “which AI vendor will win?” It’s “which platform architecture lets us benefit from rapid innovation without accepting structural lock-in?”
The vendors who succeed over the next 18 months won’t be the ones who guessed correctly about which model would dominate in November 2025. They’ll be the ones who build systems that treat models as composable components—letting you experiment, compare, and optimize without friction, without migration overhead, and without renegotiating contracts every time a lab announces a breakthrough.
The pace of model releases isn’t slowing. Neither should your platform’s ability to take advantage of them.
Resources
Understanding the Technology: Dr. William Webber and I just released the Fourth Edition of Generative AI for Smart Discovery Professionals—a comprehensive 200+ page guide covering how these models work, what they can accomplish, and where their limitations require human judgment.
Download your free copy: https://www.merlin.tech/genai-book/
See Multi-Model Architecture in Practice: Merlin Alchemy currently supports the latest models from Anthropic, Google and OpenAI—with continuous evaluation of new releases as they become available.
Schedule a demonstration: https://www.merlin.tech
Assisted by GAI and LLM Technologies per EDRM GAI and LLM Policy.

