How Waldo reduces latency and token usage without compromising quality
Waldo, Glean's agentic search model, reduces latency and token usage by separating retrieval work from reasoning work, letting a smaller, purpose-built model handle query decomposition and evidence gathering before a frontier model ever sees the prompt. In Glean's production deployments, this architectural split cuts response times roughly in half and trims token consumption by about 25%, with no drop in answer accuracy or citation correctness.
The idea is straightforward: frontier large language models (LLMs) are excellent reasoners, but reducing latency means not asking them to do everything. Retrieval, tool planning, and evidence selection don't require the same computational weight as synthesizing a final answer. A dedicated agent can handle those steps faster and cheaper.
For enterprise teams running thousands of search queries a day, the difference compounds quickly. Lower latency means employees get answers in under a second instead of waiting several seconds per query.
Fewer tokens per request means lower infrastructure cost at scale. And because the reasoning model still handles the final synthesis, answer accuracy and citation correctness stay intact. As enterprise AI search becomes central to how organizations operate, optimizing that cost-performance ratio matters more than ever.
How Waldo reduces latency and token usage in the search pipeline
Waldo is an agentic search model that runs before the frontier reasoning model, handling query decomposition, tool routing, and evidence gathering so the larger model focuses only on synthesis and response generation.
Within Glean's Agentic Engine, Waldo acts as a lightweight orchestrator at the front of the AI search pipeline. When a query arrives, Waldo breaks it into sub-queries, decides which tools and data sources to call, and gathers relevant evidence.
Only after that retrieval phase is finished does the frontier model receive a focused, pre-filtered prompt for final reasoning. This separation means the most expensive model in the stack never touches retrieval work.
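To make the handoff concrete, here is a minimal Python sketch of the two-stage flow described above. It is an illustration under stated assumptions, not Glean's implementation: `ContextPackage`, `run_pipeline`, and the two toy functions are hypothetical names, and the keyword-matching "retrieval" merely stands in for a real retrieval agent.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ContextPackage:
    """Hypothetical container for the filtered evidence handed to the reasoner."""
    query: str
    passages: list[str]


def run_pipeline(
    query: str,
    retrieval_agent: Callable[[str], ContextPackage],
    reasoning_model: Callable[[ContextPackage], str],
) -> str:
    """Finish retrieval first, then call the reasoning model exactly once."""
    context = retrieval_agent(query)   # smaller model: decompose, route, gather
    return reasoning_model(context)    # frontier model: synthesis only


def toy_retrieval_agent(query: str) -> ContextPackage:
    # Stand-in for real retrieval: keep passages whose key appears in the query.
    corpus = {
        "revenue": "Q1 revenue actuals: see finance system export.",
        "forecast": "Q1 forecast: see planning tool.",
        "holiday": "Office holiday schedule.",
    }
    passages = [text for key, text in corpus.items() if key in query.lower()]
    return ContextPackage(query=query, passages=passages)


def toy_reasoning_model(context: ContextPackage) -> str:
    # Stand-in for the frontier model: it only ever sees the filtered package.
    return f"Answer to {context.query!r} grounded in {len(context.passages)} passage(s)."


answer = run_pipeline(
    "How did revenue compare to forecast?", toy_retrieval_agent, toy_reasoning_model
)
print(answer)
```

The point of the shape is that `reasoning_model` has no access to tools or raw search results. Everything it receives has already passed through the retrieval stage.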
The performance gains are concrete. On a per-call basis, Waldo runs roughly 10x faster than a default reasoning model, with a P50 latency of about 250 ms compared to roughly three seconds. Research from Stevens Institute of Technology is consistent with this pattern: a single LLM call takes approximately 800 milliseconds, but multi-step orchestration flows can take 10 to 30 seconds without a specialized retrieval layer.
Across the full pipeline, that translates to approximately 50% lower end-to-end latency and about 25% fewer tokens consumed per query. These numbers hold without regression in answer accuracy, citation correctness, or relevance scoring.
Waldo is built on a compact base model, post-trained specifically on search planning tasks rather than general instruction-following. That specialization is the key to its speed.
Query decomposition and tool routing are pattern-matching jobs that a purpose-built model can handle at a fraction of the cost of a frontier reasoner. By reserving the large model for synthesis and reasoning, you get the same answer accuracy while cutting the time and compute spent on every request.
For enterprise workloads where search volume is high and latency tolerance is low, this separation of concerns turns an architectural choice into a measurable operational advantage.
How monolithic model architectures create retrieval bottlenecks
Most AI systems funnel every step through a single frontier model. Query understanding, tool selection, document reading, evidence evaluation, and response generation all run on the same heavyweight architecture. This monolithic design means the most expensive, highest-latency model in your stack handles work that doesn't require its full reasoning capability.
Retrieval is a well-defined, high-frequency job. Breaking down a question, choosing which data sources to query, reading results, and deciding when you have enough evidence are tasks that demand precision and speed. They don't require the kind of multi-step reasoning that justifies a frontier model's computational cost.
The mismatch creates a compounding problem. Organizations pay frontier-model prices for what amounts to search-planning work.
Users wait through frontier-model latency on tasks that a purpose-built model could resolve in milliseconds. At enterprise scale, where Glean Search handles thousands of queries daily, this overhead shows up in both infrastructure bills and employee wait times.
Consider the arithmetic. If a frontier model takes roughly three seconds per call and a typical agentic query requires four to six retrieval steps before final synthesis, the retrieval phase alone can consume 12 to 18 seconds of LLM inference time.
A specialized model running those same steps at 250 ms each completes the retrieval phase in 1 to 1.5 seconds. The frontier model's deep reasoning capability goes unused during retrieval while the model burns tokens on pattern-matching work it's overqualified to perform. This is why enterprises increasingly use small models for 80% of their API calls and reserve large models for complex reasoning, cutting compute costs by up to 70%.
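The arithmetic can be checked in a few lines of Python. The per-call figures come from the text above; the function name is illustrative, and the sketch assumes retrieval steps run strictly one after another with no parallelism.

```python
FRONTIER_SECONDS_PER_CALL = 3.0      # "roughly three seconds per call"
SPECIALIZED_SECONDS_PER_CALL = 0.25  # "250 ms each"


def retrieval_phase_seconds(steps: int, seconds_per_call: float) -> float:
    """Total retrieval time when the steps run sequentially."""
    return steps * seconds_per_call


for steps in (4, 6):
    frontier = retrieval_phase_seconds(steps, FRONTIER_SECONDS_PER_CALL)
    specialized = retrieval_phase_seconds(steps, SPECIALIZED_SECONDS_PER_CALL)
    print(f"{steps} steps: frontier {frontier:.1f} s, specialized {specialized:.2f} s")
```

Under these assumptions, four to six steps take 12 to 18 seconds on the frontier model and 1 to 1.5 seconds on the specialized one.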
Prompt tuning and caching strategies don't address the root cause. The pipeline architecture determines how compute is allocated across retrieval and reasoning, and changing that allocation requires a structural redesign.
You need an architecture where each model handles the work it was built for.
How Waldo separates search planning from reasoning
Query decomposition and tool routing
Waldo breaks a complex question into sub-queries and determines which tools and data sources to call before the frontier model is invoked. A question like "How did our Q1 revenue compare to forecast, and what drove the gap?" becomes a structured plan: pull revenue actuals from the finance system, retrieve the forecast from the planning tool, and search internal documents for variance commentary.
The planning step includes deciding order of operations and which of Glean's 100-plus connectors to hit. Waldo evaluates dependencies between sub-queries, sequences them for efficiency, and routes each one to the appropriate data source. This orchestration happens at the Agentic Engine layer, where Waldo coordinates retrieval across enterprise systems without requiring the frontier model to manage tool selection or call sequencing.
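One way to picture a structured plan with dependencies is as a small graph that gets sorted into batches. The sketch below uses Python's standard `graphlib` module and is only an illustration: `SubQuery`, the source labels, and the assumed dependency of the commentary search on the two figures are hypothetical, not a description of how Waldo represents plans internally.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class SubQuery:
    name: str
    source: str                       # hypothetical data-source label
    depends_on: tuple[str, ...] = ()  # sub-queries that must finish first


# Plan for: "How did our Q1 revenue compare to forecast, and what drove the gap?"
PLAN = [
    SubQuery("revenue_actuals", source="finance_system"),
    SubQuery("forecast", source="planning_tool"),
    # Assumed dependency: look for commentary once both figures are known.
    SubQuery("variance_commentary", source="internal_documents",
             depends_on=("revenue_actuals", "forecast")),
]


def plan_batches(plan: list[SubQuery]) -> list[list[str]]:
    """Group sub-queries into batches; each batch depends only on earlier ones."""
    sorter = TopologicalSorter({sq.name: set(sq.depends_on) for sq in plan})
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable, repeatable order
        batches.append(ready)
        sorter.done(*ready)
    return batches


routes = {sq.name: sq.source for sq in PLAN}  # which source each sub-query hits
print(plan_batches(PLAN))
```

Sub-queries in the same batch have no dependencies on each other, so they could be issued in parallel. Later batches wait for earlier ones to finish.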
A non-obvious advantage of this separation: Waldo's planning decisions are deterministic for a given query structure. The same type of question consistently routes through the same retrieval pattern, which makes debugging and auditing straightforward. This deterministic behavior is one of the principles behind agentic reasoning in enterprise settings.
When a frontier model handles planning, its non-deterministic nature means identical queries can produce different tool-call sequences on different runs.
Evidence gathering and sufficiency detection
After routing sub-queries, Waldo reads results from each retrieval step and evaluates whether the gathered evidence is sufficient to answer the original question. If results from one source are incomplete, Waldo decides whether to search further or whether the existing evidence is strong enough to hand off.
This "stop when you have enough" behavior is critical. Without it, retrieval agents tend to over-fetch, pulling documents from every available source regardless of whether the first two results already contain a definitive answer. Each unnecessary retrieval loop adds latency and inflates token usage without improving the final response.
When Waldo determines it has sufficient evidence, it assembles a pre-constructed, grounded context package for the frontier model. The reasoning model receives filtered, relevant passages rather than raw search results. This approach builds on the principles of retrieval-augmented generation, where grounding model outputs in retrieved evidence improves accuracy and reduces hallucination.
Waldo's filtering step means the frontier model reasons over cleaner input and spends its tokens on synthesis rather than sifting through irrelevant documents.
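The "stop when you have enough" loop can be sketched as follows. This is a simplified illustration: in a real system the sufficiency check would be a learned judgment, whereas here `is_sufficient` is a placeholder rule, and `gather_evidence`, `max_calls`, and the sample passages are all hypothetical.

```python
from typing import Callable, Iterable

Fetch = Callable[[], list[str]]


def gather_evidence(
    sources: Iterable[Fetch],
    is_sufficient: Callable[[list[str]], bool],
    max_calls: int = 6,
) -> tuple[list[str], int]:
    """Call sources in order, stopping once the evidence is judged sufficient."""
    evidence: list[str] = []
    calls = 0
    for fetch in sources:
        if calls >= max_calls or is_sufficient(evidence):
            break  # enough evidence, or budget exhausted: skip remaining sources
        evidence.extend(fetch())
        calls += 1
    return evidence, calls


# Placeholder sources; the rule "two passages is enough" is purely illustrative.
sources = [
    lambda: ["passage A", "passage B"],
    lambda: ["passage C"],
    lambda: ["passage D"],
]
evidence, calls = gather_evidence(sources, is_sufficient=lambda ev: len(ev) >= 2)
print(calls, evidence)
```

Here the first source already satisfies the rule, so the remaining two sources are never called. That is the over-fetching the section describes avoiding.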
What makes Waldo's architecture different from standard AI search
The structural differences between a monolithic AI search system and Waldo's two-model approach show up across every stage of the query lifecycle. Here's how the two architectures compare on six dimensions that matter for enterprise deployments:
| Capability | Standard approach | Waldo's approach |
|---|---|---|
| Query planning | Frontier model decomposes query | Specialized search model plans retrieval |
| Tool selection | Frontier model selects and calls tools | Waldo routes to the right tools at ~250 ms P50 |
| Evidence reading | Frontier model reads all retrieved documents | Waldo reads, filters, and stops when evidence is sufficient |
| Context handoff | No separation — one model does everything | Grounded context handed to frontier model for reasoning only |
| Latency profile | ~3 s per LLM call | ~250 ms per Waldo call; ~50% lower end-to-end latency |
| Token cost | Full frontier-model token spend on retrieval + reasoning | ~25% fewer tokens by offloading retrieval to a smaller model |
The comparison highlights a principle that extends beyond any single product: composable architectures, where specialized models own distinct stages of a pipeline, reflect where enterprise AI is heading. Monolithic systems hit a ceiling when you try to optimize one dimension (speed, cost, accuracy) without degrading another. A two-model design lets you tune each stage independently.
Within Glean's architecture, this composability preserves quality by keeping the frontier model focused on what it does best. It receives pre-filtered evidence grounded in the Enterprise Graph's permission-aware retrieval, then synthesizes that evidence into accurate, cited answers. The reasoning model's token budget goes entirely toward analysis and response generation rather than being split across retrieval and reasoning tasks.
One detail worth noting: Waldo's architecture doesn't require replacing your frontier model. It sits upstream in the pipeline, which means you can swap the downstream reasoning model as new versions ship without retraining or reconfiguring the retrieval layer.
How Waldo maintains answer quality with fewer tokens
Quality regression is the typical risk when you optimize for speed or cost. Teams that reduce token budgets or switch to smaller models often see accuracy drop and hallucination rates climb. Waldo avoids this tradeoff by changing who does retrieval, not how much evidence gets gathered.
The same body of relevant evidence reaches the frontier model in both architectures. Waldo gathers it faster and with less token overhead, but the evidence the reasoning model needs to answer the question remains equivalent.
Think of Waldo's role as changing the delivery vehicle without reducing the payload. The frontier model still sees the full body of relevant passages it needs to construct an accurate, grounded response.
Post-training on search planning tasks is what makes this possible. Waldo learned domain-specific retrieval patterns through targeted training on query decomposition, tool routing, and evidence sufficiency.
A general-purpose frontier model handles these tasks adequately, but a model post-trained specifically for retrieval planning handles them with higher precision at lower computational cost. The specialization is similar to how compiled code typically runs faster than interpreted code, even though both produce the same output.
The quality story also benefits from cleaner input to the reasoning model. When a frontier model handles its own retrieval, it receives raw document dumps that may include irrelevant passages or low-relevance results.
Waldo's filtering step removes noise before handoff. The reasoning model works with a tighter evidence set, which reduces the surface area for hallucination and improves citation accuracy. Research on context compression supports this approach: the FinOps Foundation reports that semantic filtering can achieve 70–80% reductions in tokens sent to the underlying language model, with measurable improvements in answer quality.
In Glean's enterprise deployments, switching retrieval to Waldo produced no regression in answer accuracy or citation correctness across thousands of production queries. Within Glean Assistant, the Agentic Engine's permission-aware pipeline feeds Waldo a scoped evidence set, and Waldo passes only what's relevant and accessible to the frontier model.
The reasoning model produces equivalent or marginally better outputs because it works with pre-filtered context rather than doing its own document triage.
What Waldo means for enterprise AI performance and cost
A 25% reduction in tokens per query compounds into significant savings on LLM inference spend. For an organization running 50,000 AI search queries per day, that reduction translates to millions fewer tokens consumed each month. At current frontier-model pricing, the savings can offset a meaningful portion of an enterprise AI deployment's operating cost.
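A back-of-the-envelope calculation shows how the savings scale. The query volume and the 25% reduction come from the text; the baseline of 1,000 tokens per query is a placeholder assumption, not a measured figure, so substitute your own number.

```python
def monthly_tokens_saved(
    queries_per_day: int,
    tokens_per_query: int,
    reduction: float,
    days: int = 30,
) -> int:
    """Tokens no longer consumed per month after a fractional per-query reduction."""
    return round(queries_per_day * tokens_per_query * reduction * days)


QUERIES_PER_DAY = 50_000           # from the text
REDUCTION = 0.25                   # from the text
ASSUMED_TOKENS_PER_QUERY = 1_000   # hypothetical baseline; replace with your own

saved = monthly_tokens_saved(QUERIES_PER_DAY, ASSUMED_TOKENS_PER_QUERY, REDUCTION)
print(f"{saved:,} tokens saved per 30-day month")
```

Because the result is a straight product, doubling either the query volume or the baseline tokens per query doubles the monthly savings.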
Lower latency drives a less obvious but equally important outcome: adoption. Across Glean's 2025 enterprise deployments, internal usage data shows that response times above two seconds correlate with lower repeat usage and reduced task completion rates.
When Glean Assistant returns answers in under a second instead of three to four seconds, employees build the habit of reaching for AI search instead of defaulting to manual workflows. Faster answers change whether people use the tool at all. Understanding how input token count impacts latency helps explain why architectural choices at the retrieval layer have such a direct effect on user experience.
The specialized-model approach also scales more predictably than a monolithic architecture. As query volume grows, the frontier model handles a proportionally smaller share of total compute because Waldo absorbs the retrieval workload. This means your most expensive model's usage grows more slowly with demand than it would in a monolithic design, a cost curve that finance teams can forecast and plan around.
IT and security teams benefit from a structural advantage: Waldo's architecture doesn't change the permission model or governance layer. Permission-aware retrieval through the Enterprise Graph works the same regardless of whether a frontier model or Waldo handles the retrieval step.
Access controls, audit logs, and data governance remain intact. Adding a specialized model to the pipeline introduces no new attack surface.
Token management is becoming a board-level conversation for large enterprises scaling AI across thousands of employees. According to Menlo Ventures' survey of 495 enterprise AI decision-makers, enterprise generative AI spending reached $37 billion in 2025, and multi-model deployment has become the standard as organizations recognize that different models excel at different tasks. Waldo is an architectural change that reduces per-query cost at the model level, something prompt engineering and caching alone cannot achieve.
How to evaluate whether an agentic search model fits your workflow
Start by measuring your current baseline. Track end-to-end latency from query submission to answer delivery, token consumption per query across all model calls, and answer quality scores including accuracy, citation grounding, and hallucination rate. Without these numbers, you can't quantify the impact of any architectural change.
Identify retrieval-heavy patterns in your query mix. Workloads that involve frequent query decomposition, multiple tool calls per question, or reading across several data sources benefit most from a specialized retrieval model. If most of your queries are simple lookups that resolve in a single retrieval step, the latency and cost gains from a two-model architecture will be smaller.
Compare per-call latency using a P50 benchmark. Waldo runs at approximately 250 ms P50, compared to roughly three seconds for a frontier model performing the same retrieval task.
Per-call numbers don't tell the full story. Measure end-to-end latency across the complete pipeline, including all retrieval steps, context assembly, and final reasoning. The cumulative difference across multi-step queries is where the architectural advantage compounds.
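A minimal way to collect these measurements uses only Python's standard library. The helper names are illustrative, `run` stands for whatever function issues a query against your pipeline, and the sample values are made up to show how the median behaves.

```python
import statistics
import time
from typing import Callable


def p50_ms(samples_ms: list[float]) -> float:
    """P50 is the median: half of the calls finish faster, half slower."""
    return statistics.median(samples_ms)


def time_calls_ms(run: Callable[[str], object], queries: list[str]) -> list[float]:
    """Wall-clock latency in milliseconds for each query, measured end to end."""
    samples = []
    for query in queries:
        start = time.perf_counter()
        run(query)
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


# Made-up samples: one slow outlier barely moves the median.
samples = [240.0, 250.0, 3100.0, 260.0, 255.0]
print(p50_ms(samples))
```

Timing the whole pipeline call rather than individual model calls gives the end-to-end figure. Timing each retrieval call separately gives the per-call P50.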
Evaluate answer accuracy, citation grounding, and hallucination rates together rather than relying on a single metric. All three should remain stable or improve after introducing a specialized retrieval model. For a structured approach to measuring these dimensions, see this guide on evaluating AI agents in production environments.
Run A/B comparisons on a representative sample of your actual query traffic, not synthetic benchmarks. Glean Agents provide built-in evaluation against these dimensions, which simplifies the comparison process.
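A simple regression check over the three quality metrics might look like the sketch below. The metric names, the tolerance, and every number are placeholders for illustration; they are not Glean's evaluation API or real results.

```python
# True means a higher value is better; False means lower is better.
HIGHER_IS_BETTER = {
    "answer_accuracy": True,
    "citation_grounding": True,
    "hallucination_rate": False,
}


def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.01,
) -> list[str]:
    """Return the metrics where the candidate is worse by more than the tolerance."""
    regressed = []
    for metric, higher_is_better in HIGHER_IS_BETTER.items():
        delta = candidate[metric] - baseline[metric]
        worse_by = -delta if higher_is_better else delta
        if worse_by > tolerance:
            regressed.append(metric)
    return regressed


# Placeholder scores for the two arms of an A/B comparison.
baseline = {"answer_accuracy": 0.90, "citation_grounding": 0.88, "hallucination_rate": 0.04}
candidate = {"answer_accuracy": 0.90, "citation_grounding": 0.89, "hallucination_rate": 0.04}
print(find_regressions(baseline, candidate))
```

Checking all three metrics in one pass reflects the advice above: a gain in one dimension should not hide a loss in another.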
Consider cost trajectory over a 12- to 18-month horizon. The gap between "frontier model does everything" and "specialized models handle retrieval" widens as query volume increases.
Model a scenario where your AI search usage doubles or triples, then compare the token cost curves of both architectures. The inflection point where a two-model approach becomes dramatically cheaper often arrives faster than teams expect. Broadening your understanding of how enterprise search with LLM technology is evolving can help frame that analysis.
Frequently asked questions
What specific techniques does Waldo use to reduce latency?
Waldo uses query decomposition to break complex questions into sub-queries, then routes each sub-query to the appropriate data source through Glean's connector framework. Its sufficiency detection mechanism stops retrieval as soon as evidence is adequate, preventing unnecessary additional calls. These techniques combined produce a P50 latency of approximately 250 ms per retrieval call.
How does Waldo compare to other search models in terms of performance?
Waldo operates roughly 10x faster per call than a frontier model performing the same retrieval tasks. The end-to-end pipeline shows approximately 50% lower latency and 25% fewer tokens consumed. Unlike general-purpose small models, Waldo is post-trained specifically on search planning tasks, which gives it higher precision on query decomposition and tool routing than a model trained for general instruction-following.
What metrics should teams use to evaluate Waldo's effectiveness?
Track four dimensions: end-to-end latency (time from query to answer), token consumption per query (across all model calls in the pipeline), answer accuracy (correctness of the final response), and citation grounding (whether cited sources actually support the claims in the answer). Compare these metrics with and without the specialized retrieval model to isolate its impact.
Does Waldo work for simple queries or only complex multi-step searches?
Waldo handles both, but the performance advantage is most pronounced on complex, multi-step queries that require decomposition and multiple tool calls. For simple single-source lookups, Waldo still processes the retrieval step faster than a frontier model would, but the absolute time savings per query are smaller because there's less retrieval work to offload.
What are the practical implications of using Waldo in enterprise settings?
The primary implications are lower per-query cost, faster response times that drive higher adoption rates, and a more predictable cost curve as usage scales. From a governance perspective, Waldo operates within the same permission-aware retrieval framework as the rest of Glean's architecture, so existing access controls and audit capabilities remain unchanged.
The shift from monolithic AI pipelines to composable, specialized-model architectures is already changing how enterprise teams think about search performance and cost. Waldo represents that shift in practice: a purpose-built model that handles retrieval faster and cheaper while the frontier model focuses on delivering accurate, cited answers.
Request a demo to explore how Glean and AI can transform your workplace. We can walk you through how Waldo fits into your existing infrastructure and what the performance gains look like on your data.