Technology | January 7, 2025

How Agentic Hybrid Search Creates Smarter RAG Apps

If you’re building a retrieval-augmented generation (RAG) application, you know how powerful they can be — when they work well. But semantic embedding models aren’t magic. Most RAG implementations rely on semantic similarity as the sole retrieval mechanism, throwing every document into a vector database and applying the same retrieval logic for every query.

This approach works for straightforward questions but often retrieves contextually irrelevant (but semantically similar) documents. When nuanced queries require precise answers, semantic similarity alone leads to confusing or incorrect responses.

The problem isn’t your model — it’s your retrieval process.

Here, we’ll introduce a better way: agentic hybrid search. By using structured metadata and letting a large language model (LLM) choose the best retrieval operations for each query, you can turn your RAG app into a truly intelligent assistant. We’ll start by introducing the core concepts, then walk through an example where we transform a simple “credit card policy QA bot” into an agentic system that dynamically adapts to user needs.

Say goodbye to cookie-cutter retrieval and hello to a smarter RAG experience.

Why your RAG app isn’t delivering

At its core, RAG connects LLMs to external knowledge. You index your documents, use a vector search to retrieve semantically similar ones, and let the LLM generate responses based on those results. Sounds simple enough, right?

But simplicity can be a double-edged sword. While many developers focus on improving the knowledge base — enriching it with more documents or better embeddings — or fine-tuning prompts for their LLMs, the real bottleneck is often the retrieval process itself. Most RAG implementations rely on semantic similarity as a one-size-fits-all strategy.

This approach often retrieves the wrong documents: Either it pulls in contextually irrelevant results because semantic similarity isn’t the right method for the query, or it retrieves too many overlapping or redundant documents, diluting the usefulness of the response. Without a smarter way to filter and prioritize results, nuanced queries that depend on subtle distinctions will continue to fail.

Imagine a QA bot tasked with answering specific questions, such as, What happens if I pay my Premium Card bill 10 days late? or Does Bank A’s Basic Card offer purchase protection? These queries demand precise answers that hinge on subtle distinctions between policies. Similarly, consider a support bot for a company like Samsung, which offers a wide range of products from smartphones to refrigerators.

A question like, How do I reset my Galaxy S23? requires retrieving instructions specific to that model, while a query about a fridge’s warranty would need entirely different documents. With naive vector search, the bot might pull in semantically related but contextually irrelevant documents, muddying the response or causing hallucinations by blending information meant for entirely different products or use cases.

This issue persists no matter how advanced your LLM or embeddings are. Developers often respond by fine-tuning models or tweaking prompts, but the real solution lies in improving the way documents are retrieved before generation. Naive retrieval systems either retrieve too much — forcing the LLM to sift through irrelevant information, which can sometimes be mitigated with clever prompting — or retrieve too little, leaving the LLM “flying blind” without the necessary context to generate a meaningful response.

By making retrieval smarter and more context-aware, hybrid search addresses both problems: It reduces irrelevant noise by constraining searches to relevant topics and ensures that the retrieved documents contain more of the precise information the LLM needs. This dramatically improves the accuracy and reliability of your RAG application.
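Concretely, “hybrid” here means pairing vector similarity with structured constraints instead of relying on similarity alone. Here’s a minimal sketch of the difference using the Samsung example above; the `support_docs` collection and the `product` metadata key are illustrative assumptions, and the credit card walkthrough later in this post builds the idea out properly:

from langchain_openai import OpenAIEmbeddings
from langchain_astradb import AstraDBVectorStore

# Connection details (API endpoint, token) are assumed to come from the environment.
store = AstraDBVectorStore(collection_name="support_docs", embedding=OpenAIEmbeddings())

# Pure semantic search: returns whatever is embedded nearby, regardless of product.
docs = store.similarity_search("How do I reset my Galaxy S23?", k=4)

# Hybrid search: the same similarity query, constrained to documents tagged with the right product.
docs = store.similarity_search(
    "How do I reset my Galaxy S23?",
    k=4,
    filter={"product": "Galaxy S23"},
)

The second call can only surface Galaxy S23 documents, so refrigerator warranties never enter the context window.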

The solution: Agentic hybrid search

The solution is surprisingly simple yet transformative: combine hybrid search backed by structured metadata with the agentic decision-making capabilities of an LLM to implement agentic hybrid search. This approach doesn’t require overhauling your architecture or discarding your existing investments; it builds on what you already have to unlock new levels of intelligence and flexibility.

From naive to agentic: A smarter flow

A typical RAG app follows a straightforward process: question → search → generate. The user’s question is passed to a retrieval engine — often a vector search — which retrieves the most semantically similar documents. These documents are then passed to the LLM to generate a response. This works well for simple queries but stumbles when nuanced retrieval strategies are required.

Agentic hybrid search replaces this rigid flow with a smarter, more adaptable one: question → analyze → search(es) → generate. Instead of jumping straight to retrieval, the LLM analyzes the question to determine the best retrieval strategy. This flexibility empowers the system to handle a wider variety of use cases with greater accuracy.

Capabilities unlocked

With agentic hybrid search, your RAG app becomes far more capable:

  • Multiple knowledge bases — The LLM can dynamically decide which knowledge base to query based on the question. For example, a QA bot might pull general policy information from one database and bank-specific FAQs from another (see the sketch after this list).
  • Tailored search queries — Instead of relying solely on semantic similarity, the LLM can craft custom search queries. For instance, a question like Which cards from Bank A offer purchase protection? might trigger a metadata-filtered search for cards with the “purchase protection” tag.
  • Metadata filters — By enriching your documents with structured metadata (such as card name, bank name, section, date), you enable precise, targeted searches that avoid irrelevant results.
  • Multiple search operations — Some questions require breaking down the query into subparts. For example, What are the eligibility requirements and benefits of the Premium Card? might involve one search for eligibility criteria and another for benefits.

These capabilities expand the types of queries your application can handle. Instead of being limited to simple fact-finding, your RAG app can now tackle exploratory research, multistep reasoning and domain-specific tasks — all while maintaining accuracy.
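To make the first capability concrete, here’s a minimal sketch of exposing two knowledge bases to the LLM as separate retrieval tools. The collection names, tool names and descriptions are illustrative, and connection details are assumed to come from the environment:

from langchain_core.tools import StructuredTool
from langchain_openai import OpenAIEmbeddings
from langchain_astradb import AstraDBVectorStore

embeddings = OpenAIEmbeddings()

# Two knowledge bases, indexed as separate collections.
policy_store = AstraDBVectorStore(collection_name="card_policies", embedding=embeddings)
faq_store = AstraDBVectorStore(collection_name="bank_faqs", embedding=embeddings)

def search_policies(question: str) -> list:
    """Look up general credit card policy documents."""
    return policy_store.as_retriever().invoke(question)

def search_faqs(question: str) -> list:
    """Look up bank-specific FAQ entries."""
    return faq_store.as_retriever().invoke(question)

# Each knowledge base becomes its own tool, so the agent can decide which one(s) to call.
policy_kb_tool = StructuredTool.from_function(
    func=search_policies,
    name="SearchPolicies",
    description="Retrieve general credit card policy documents.",
)
faq_kb_tool = StructuredTool.from_function(
    func=search_faqs,
    name="SearchFAQs",
    description="Retrieve bank-specific FAQ entries.",
)

The walkthrough below wires up a single tool like this end to end; supporting additional knowledge bases is just a matter of registering more tools with the same agent.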

How it works: Transforming a credit card policy QA bot

Let’s walk through an example. Suppose you’re building a bot to answer questions about credit card policies for multiple banks. Here’s what a naive implementation looks like:

The naive approach

The documents are indexed in a vector database, and the bot performs a simple semantic search to retrieve the most similar ones. Whether the user asks about eligibility requirements, fees or cancellation policies, the retrieval logic is the same.

from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_astradb import AstraDBVectorStore

llm = ChatOpenAI()
embeddings = OpenAIEmbeddings()

# Connection details (API endpoint, token) are assumed to come from the environment.
vectorstore = AstraDBVectorStore(
    collection_name="knowledge_store",
    embedding=embeddings,
)

ANSWER_PROMPT = (
    "Use the information in the results to provide a concise answer to the original question.\n\n"
    "Original Question: {question}\n\n"
    "Vector Store Results:\n{context}\n\n"
)

def format_docs(docs):
    # Concatenate the retrieved chunks into a single context block for the prompt.
    return "\n\n".join(doc.page_content for doc in docs)

retriever = vectorstore.as_retriever()

# Construct the LLM execution chain: retrieve, format, prompt, generate.
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | ChatPromptTemplate.from_messages([("human", ANSWER_PROMPT)])
    | llm
)

The result? For a question like, How much is my annual membership fee? the system might retrieve policies from unrelated cards because the embeddings prioritize broad similarity over specificity.

chain.invoke("How much is my annual membership fee?")

# > Response: Your annual membership fee could be $250, $95, $695, or $325, depending on the specific plan or card you have chosen. Please refer to your specific card member agreement or plan details to confirm the exact amount of your annual membership fee.

The agentic approach

In the agentic hybrid search approach, we improve this system by:

  1. Enriching documents with metadata — When indexing policies, we add structured metadata (see the indexing sketch after this list) like:
    • Card name (“Premium Card”)
    • Bank name (“Bank A”)
    • Policy sections (“Fees,” “Rewards,” “Eligibility”)
    • Effective dates
  2. Using an LLM to choose retrieval operations — Instead of blindly performing a vector search, the bot uses the query context to decide:
    • Should it search for semantically similar policies?
    • Should it filter by card or bank metadata?
    • Should it issue multiple queries for specific policy sections?
  3. Composing a response from multiple searches — The bot combines results intelligently to generate precise and trustworthy answers.
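Before looking at the retrieval code, here’s a minimal sketch of step 1, the indexing side. It assumes the same `vectorstore` as in the naive example and a `pages` dictionary keyed by card type (the tool definition below describes its `card_type` parameter in terms of `pages.keys()`); the document contents, bank and dates are illustrative:

from langchain_core.documents import Document

# `pages` maps each card type to its policy document chunks (contents are illustrative).
pages = {
    "gold": [
        Document(
            page_content="The annual membership fee for the Gold Card is $325. ...",
            metadata={"bank": "Bank A", "section": "Fees", "effective-date": "2025-01-01"},
        ),
    ],
    "platinum": [
        Document(
            page_content=(
                "The annual membership fee for the Platinum Card is $695. "
                "Each Additional Platinum Card has an annual fee of $195. ..."
            ),
            metadata={"bank": "Bank A", "section": "Fees", "effective-date": "2025-01-01"},
        ),
    ],
}

# Tag every chunk with its card type, then embed and store it alongside the metadata
# so later searches can be filtered by card type, bank, section or date.
for card_type, documents in pages.items():
    for doc in documents:
        doc.metadata["card-type"] = card_type
    vectorstore.add_documents(documents)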

Here’s how this looks in practice:

Example code

from typing import List
from pydantic import BaseModel, Field
from langchain_core.documents import Document
from langchain_core.tools import StructuredTool
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

prompt = ChatPromptTemplate.from_messages([
    ("system", "Concisely answer the following question, using information retrieved from tools and the provided information about the user."),
    ("system", "The following card types are associated with the user: {cards}"),
    ("system", "Always use the provided tools to retrieve information needed to answer policy-related questions."),
    ("human", "{question}"),
    MessagesPlaceholder("agent_scratchpad"),
])

# First we define the parameters to our search operation
class RetrieveInput(BaseModel):
    question: str = Field(description="Question to retrieve content for. Should be a simple question describing the starting point for retrieval likely to have content.")
    card_type: str = Field(description=f"Search for documents related to this card type. Value must be one of {list(pages.keys())}")

# Next, create a "tool" that implements the search logic
def retrieve_policy(question: str, card_type: str) -> List[Document]:
    print(f"retrieve_policy(card_type: {card_type}, question: {question})")
    # Constrain the similarity search to documents tagged with the requested card type.
    retriever = vectorstore.as_retriever(
        search_type="similarity",
        search_kwargs={"filter": {"card-type": card_type}},
    )
    return list(retriever.invoke(question))

policy_tool = StructuredTool.from_function(
    func=retrieve_policy,
    name="RetrievePolicy",
    description="Retrieve information about a specific card policy.",
    args_schema=RetrieveInput,
    return_direct=False,
)

# Finally, construct an agent to use the tool we created
agent = create_tool_calling_agent(llm, [policy_tool], prompt)
agent_executor = AgentExecutor(agent=agent, tools=[policy_tool], verbose=True)

In this example, the bot recognizes that the query is highly specific and uses metadata filters to retrieve the exact policy based on the user profile provided. The LLM also rewrites the user’s question so it focuses narrowly on the information needed to retrieve the relevant documents.

agent_executor.invoke({
   "question": "How much is my annual membership fee?",
   "cards": ["gold"],
})

# > Agent: Invoking: `RetrievePolicy` with `{'question': 'annual membership fee', 'card_type': 'gold'}`

# > Response: The annual membership fee for your Gold Card is $325.

Because the LLM is choosing how to use the search tool, we’re not limited to using the same filters for every question. For example, the LLM can dynamically recognize that the user is asking a question about a policy that’s different from their own and create an appropriate filter.

agent_executor.invoke({
   "question": "What's the annual membership fee for platinum cards?",
   "cards": ["gold"],
})

# > Agent: Invoking: `RetrievePolicy` with `{'question': 'annual membership fee for platinum cards', 'card_type': 'platinum'}`

# > Response: The annual membership fee for Platinum cards is $695. Additionally, each Additional Platinum Card has an annual fee of $195, but there is no annual fee for Companion Platinum Cards.

The LLM may even choose to use a given tool multiple times. For example, the following question requires the LLM to know about the user’s current policy as well as the policy mentioned in the question.

agent_executor.invoke({
   "question": "How much would my membership fee change if I upgraded to a platinum card?",
   "cards": ["gold"],
})

# > Agent: Invoking: `RetrievePolicy` with `{'question': 'membership fee for gold card', 'card_type': 'gold'}`
# > Agent: Invoking: `RetrievePolicy` with `{'question': 'membership fee for platinum card', 'card_type': 'platinum'}`

# > Response: The annual membership fee for your current American Express® Gold Card is $325. If you were to upgrade to a Platinum Card, the annual fee would be $695. Therefore, upgrading from a Gold Card to a Platinum Card would increase your annual membership fee by $370.

Try the code out for yourself in this notebook: Agentic_Retrieval.ipynb.

Why this works

The magic lies in leveraging the LLM as a decision-maker. Instead of hardcoding retrieval logic, you allow the LLM to analyze the query and dynamically select the best approach. This flexibility makes your system smarter and more adaptable, without requiring massive changes to your infrastructure.

The payoff: Smarter retrieval, better responses

Adopting agentic hybrid search transforms your RAG application into a system capable of handling complex, nuanced queries. By introducing smarter retrieval, you can provide several key benefits:

  • Improved accuracy — Smarter retrieval ensures the right documents are surfaced for each query, reducing hallucinations and irrelevant results. This directly improves the quality of the LLM’s responses.
  • Enhanced trust — By pulling in only contextually appropriate information, you avoid embarrassing errors like mixing up critical details, ensuring user confidence in your system.
  • Broader use cases — Dynamic search strategies allow your app to tackle more complex queries, integrate multiple knowledge sources and serve a wider range of users and scenarios.
  • Simplified maintenance — Instead of hardcoding retrieval rules or manually curating filters, you let the LLM dynamically adapt retrieval strategies, reducing the need for ongoing manual intervention.
  • Future-proof scalability — As your data sets grow or your knowledge bases diversify, the agentic approach scales to handle new challenges without requiring fundamental changes to your system.

By making retrieval smarter and more adaptive, you enhance the system’s overall performance without the need for major overhauls.

The trade-offs: Balancing flexibility and cost

Adding an agentic layer to your retrieval process does come with some trade-offs:

  • Increased latency — Each query analysis involves an additional LLM call, and issuing multiple tailored searches can take longer than a single operation. This may slightly delay response times, particularly for latency-sensitive applications.
  • Higher inference costs — Query analysis and orchestrating multiple searches add computational overhead, which could increase costs for systems with high query volumes.
  • Complexity in orchestration — While the implementation is straightforward, maintaining a system that dynamically selects retrieval strategies may introduce additional debugging or testing considerations.

Despite these trade-offs, the benefits of agentic hybrid search typically outweigh the costs. For most applications, the added flexibility and precision significantly improve user satisfaction and system reliability, making the investment worthwhile. Additionally, latency and cost concerns can often be mitigated through optimizations like caching, precomputing filters or limiting analysis to complex queries.
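As one illustration of that last point, a lightweight router can send simple, general questions through the plain chain and reserve the agent for personalized or comparative ones. This is a sketch rather than part of the original example: the keyword heuristic is deliberately naive, and it reuses the `chain` and `agent_executor` objects defined earlier.

def answer(question: str, cards: list[str]) -> str:
    """Route cheap questions to plain vector search; use the agent only when needed."""
    q = question.lower()
    # Illustrative heuristic only: personalized or comparative questions go to the agent.
    needs_agent = "my" in q.split() or any(
        keyword in q for keyword in ("compare", "upgrade", "versus", "difference")
    )
    if needs_agent:
        return agent_executor.invoke({"question": question, "cards": cards})["output"]
    # Single retrieval, single LLM call: lower latency and cost for simple lookups.
    return chain.invoke(question).content

# General questions skip the extra analysis step; comparisons still get agentic retrieval.
print(answer("What is purchase protection?", cards=["gold"]))
print(answer("How much would my fee change if I upgraded to a platinum card?", cards=["gold"]))

In practice you might replace the keyword check with a small classifier, or cache the agent’s retrieval plans for frequently repeated questions.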

By understanding and managing these trade-offs, you can harness the full potential of agentic hybrid search to build smarter, more capable RAG applications.

Conclusion

Agentic hybrid search is the key to unlocking the full potential of your RAG app. By enriching your documents with structured metadata and letting the LLM intelligently decide retrieval strategies, you can go beyond simple semantic similarity and build an assistant that users can truly rely on.

It’s an easy-to-adopt change with a surprisingly large payoff. Why not give it a try in your next project? Your users — and your future self — will thank you.

This story originally appeared in The New Stack.
