Graph RAG by Example

For some generative AI use cases, you can get better results by taking a graph-based approach to retrieval-augmented generation (RAG) than you can using traditional vector search. “Graph RAG” can yield greater accuracy in many scenarios where your underlying dataset consists of highly interlinked sources.
The good news is that, if you already use LangChain and a vector database, you don’t have to make major changes to your existing data stack to incorporate graph RAG. In that case, you can get started with just a few lines of code.
Here, we’ll walk through some simple examples of generating, storing, and searching knowledge graphs. Using these code samples, you can easily get started adding graph RAG to your GenAI app toolbox.
Why use graph RAG?
Large language models (LLMs) like ChatGPT are great general-purpose language parsers and generators. However, they usually need additional context to perform well on domain-specific queries. RAG takes information from a GenAI query and supplements it with domain-specific context that’s both relevant and current.
The most commonly used storage format for this is a vector database, which converts information into numeric vector embeddings and retrieves matches based on proximity in n-dimensional vector space. In some use cases, however, vector search may miss relevant information.
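To make that concrete, here’s a minimal, illustrative sketch of plain vector retrieval: embed the query and the candidate chunks, then rank the chunks by cosine similarity. It assumes an OpenAI API key is configured and uses the same OpenAIEmbeddings class that appears later in this post; the chunk texts and query are placeholders.

```python
import numpy as np
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

# Placeholder domain-specific chunks.
chunks = [
    "Astra DB is a managed vector database built on Apache Cassandra.",
    "LangChain provides document transformers for building graph vector stores.",
    "KeyBERT extracts keywords from text using BERT embeddings.",
]

query = "Which database is built on Cassandra?"

chunk_vectors = np.array(embeddings.embed_documents(chunks))
query_vector = np.array(embeddings.embed_query(query))

# Cosine similarity between the query and each chunk.
scores = chunk_vectors @ query_vector / (
    np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(query_vector)
)

# The best-scoring chunk is what a vector-only RAG pipeline would retrieve.
print(chunks[int(np.argmax(scores))])
```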
A graph structure and graph traversal can compensate for vector search’s blind spots by breaking a set of sources into a series of connected nodes consisting of entities and the relationships between them. This can yield greater accuracy with data sets such as technical documentation, highly cross-referenced legal and scientific documents, and strongly interlinked web pages.
Graphs can contain structured, unstructured, and semi-structured text. That makes them a versatile tool for searching a wide variety of content.
Knowledge graphs come in two forms:
- Entity graphs identify real-world entities in the source, reconcile them against existing entity definitions, and ascertain relationships between them. Developers traditionally search these types of graphs using a graph query language such as Cypher.
- Content-centric graphs break source documents into smaller chunks and represent explicit links between them as relationships.
Here we’ll focus on building content-centric graphs. Entity graphs are usually labor-intensive to build, requiring extensive manual definition and tweaking. By contrast, we can build content-centric graphs that improve GenAI queries automatically.
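To make the distinction concrete, here’s a loose sketch of how the same source material might look in each form. These are plain Python data structures for illustration only, not any particular library’s schema; the entity names and chunk text are placeholders.

```python
# Entity graph: real-world entities plus typed relationships between them,
# typically curated by hand and queried with a language like Cypher.
entity_graph = [
    ("Ketanji Brown Jackson", "NOMINATED_TO", "U.S. Supreme Court"),
    ("Ketanji Brown Jackson", "NOMINATED_BY", "Joe Biden"),
]

# Content-centric graph: document chunks as nodes, with explicit links
# (hyperlinks, shared keywords, cross-references) as edges.
content_graph = {
    "nodes": {
        "chunk-1": "The president nominated Ketanji Brown Jackson to the Supreme Court...",
        "chunk-2": "Judge Jackson previously served on the D.C. Circuit Court of Appeals...",
    },
    "edges": [
        ("chunk-1", "chunk-2"),  # chunk-1 links or refers to chunk-2
    ],
}
```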
Graph RAG examples
You don’t need a lot of specialized knowledge to get started with graph RAG. There are two tools that simplify adding this technique to your GenAI data stack: LangChain and a vector database.
LangChain is a framework for quickly developing GenAI apps. It provides a series of services that eliminate much of the undifferentiated heavy lifting in standing up apps that incorporate LLMs.
For our purposes here today, its most helpful feature is the library of document transformers it provides. That includes transformers for converting documents into a hybrid graph/vector format.
A vector database can represent both vectors and simple graph structures. While you could store graphs in a dedicated graph database, that would mean integrating another system into your data stack and mastering a new technology. Fortunately, storing a graph in a vector database provides enough fidelity and accuracy for GenAI applications.
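As an illustration of the idea (this is not the exact schema LangChain’s graph vector stores use internally), a graph edge can ride along with a vector-indexed document as metadata; the field names and URLs below are made up for the example:

```python
from langchain_core.documents import Document

# Each chunk is a normal vector-store document; the graph structure is
# carried as metadata describing which other chunks it links to.
doc = Document(
    page_content="Astra DB is built on Apache Cassandra...",
    metadata={
        "source": "https://example.com/astra-db",
        "outgoing_links": ["https://example.com/cassandra"],  # illustrative field name
    },
)
```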
Let’s dive into some code and see how this works in practice.
Graph RAG example 1: Straight Python implementation
Before digging into a LangChain-based implementation, let’s look at how we might implement graph RAG if we were building everything ourselves. The example-graphrag GitHub project, based on the original findings on the technique published by Microsoft Research, provides a solid working example in Python.
The heart of the solution, app.py, creates an entity graph. It’s divided into several sections:
- Read and chunk the source documents
- Create entities
- Summarize the entities and their relationships in a structured format
- Create communities of graphs of connected entities
- Summarize each community
- Combine the summaries into a single global answer
The solution relies heavily upon LLMs. For example, in the second step, it uses GPT-4o to identify entities and relationships in text:
```python
# 2. Text Chunks → Element Instances
def extract_elements_from_chunks(chunks):
    elements = []
    for index, chunk in enumerate(chunks):
        print(f"Chunk index {index} of {len(chunks)}:")
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Extract entities and relationships from the following text."},
                {"role": "user", "content": chunk}
            ]
        )
        print(response.choices[0].message.content)
        entities_and_relations = response.choices[0].message.content
        elements.append(entities_and_relations)
    return elements
```
Another thing you’ll notice about the solution is that it’s over 200 lines of code. That’s not a huge lift, but it’s still more code that you need to own and maintain than the alternative we’ll see shortly. This code is also just an example; it would require further tweaking and fine-tuning to build an accurate entity graph ontology. Finally, it runs completely in memory, with no storage logic.
Graph RAG example 2: Some simple LangChain code
Now, let’s see how this might look using LangChain to generate a content-centric graph that we store in a vector database.
Let’s say we want to generate a graph from HTML documents. You can do this in LangChain and store the graph with about half as much code. This solution loads a list of URLs that describe movies and builds them into a graph vector store.
The core of the solution consists of around 10 lines of code in the main() method of load_data.py:
```python
# Load and process documents
loader = AsyncHtmlLoader(urls)
documents = loader.load()

# Extract links as graph edges
transformer = LinkExtractorTransformer([
    HtmlLinkExtractor().as_document_extractor(),
    # KeybertLinkExtractor(),
    # ...
])
documents = transformer.transform_documents(documents)

# Extract page content / "clean" documents
bs4_transformer = BeautifulSoupTransformer()
documents = bs4_transformer.transform_documents(documents)

# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,
    chunk_overlap=64,
)
documents = text_splitter.split_documents(documents)

# Add documents and their metadata to the graph vector store
store.add_documents(documents)
```
The solution takes advantage of a number of built-in LangChain components, such as the BeautifulSoupTransformer component to parse HTML documents and the CassandraGraphVectorStore component to store the results in Apache Cassandra®. (You can also use Astra DB, our highly scalable vector database, via the AstraDBVectorStore class.)
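The snippet above assumes that urls and store have already been defined earlier in main(). As a rough sketch of what that setup might look like (the URL list is a placeholder, and your connection details will differ):

```python
import cassio
from langchain_openai import OpenAIEmbeddings
from langchain_community.graph_vectorstores.cassandra import CassandraGraphVectorStore

# Connect to Astra DB / Cassandra using credentials from the environment.
cassio.init(auto=True)

# The graph vector store that the documents are added to.
store = CassandraGraphVectorStore(OpenAIEmbeddings())

# Placeholder list of movie-related pages to crawl.
urls = [
    "https://en.wikipedia.org/wiki/Casablanca_(film)",
    "https://en.wikipedia.org/wiki/The_Godfather",
]
```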
Graph RAG example 3: Traversing documents
Once built, you can search the graph in one of two ways: a similarity search (a standard vector search) or a graph traversal. Running both at the start of your project can be useful for seeing how the results differ between the two search methods.
After opening a connection to either Cassandra or Astra DB, you can use the similarity_search() method defined by LangChain to run a normal vector search:
```python
docs = store.similarity_search(
    "What did the president say about Ketanji Brown Jackson?"
)
```
By contrast, the traversal_search() method will traverse the graph to find linked documents:
```python
docs = list(
    store.traversal_search("What did the president say about Ketanji Brown Jackson?")
)
```
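To compare the two retrieval modes side by side, you might print what each one returns. Here’s a small sketch, assuming the same store as above; the truncation and formatting choices are just for readability:

```python
question = "What did the president say about Ketanji Brown Jackson?"

similarity_docs = store.similarity_search(question)
traversal_docs = list(store.traversal_search(question))

print(f"Similarity search returned {len(similarity_docs)} documents:")
for doc in similarity_docs:
    print("-", doc.page_content[:80].replace("\n", " "))

print(f"Traversal search returned {len(traversal_docs)} documents:")
for doc in traversal_docs:
    print("-", doc.page_content[:80].replace("\n", " "))
```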
Graph RAG example 4: Splitting a PDF
What if your documents are in PDF format? Again, LangChain makes this relatively easy. We can use the PyPDFLoader and RecursiveCharacterTextSplitter classes to read and chunk PDFs, and then use LangChain’s KeybertLinkExtractor class to link chunks based on their extracted keywords:
```python
from langchain_openai import OpenAIEmbeddings
import cassio
from langchain_community.graph_vectorstores.cassandra import (
    CassandraGraphVectorStore,
)
from langchain_community.graph_vectorstores.extractors import (
    LinkExtractorTransformer,
    KeybertLinkExtractor,
)
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Initialize AstraDB / Cassandra connections.
cassio.init(auto=True)

# Create a GraphVectorStore, combining Vector nodes and Graph edges.
knowledge_store = CassandraGraphVectorStore(OpenAIEmbeddings())

# Load and split content as normal.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,
    chunk_overlap=64,
    length_function=len,
    is_separator_regex=False,
)
loader = PyPDFLoader("Tourbook.pdf")
pages = loader.load_and_split(text_splitter)

# Create a document transformer extracting keywords.
transformer = LinkExtractorTransformer([
    KeybertLinkExtractor(extract_keywords_kwargs={
        "stop_words": "english",
    }),
])

# Apply the document transformer.
pages = transformer.transform_documents(pages)

# Store the pages in the knowledge store.
knowledge_store.add_documents(pages)
```
This code is highly performant because, instead of resolving links by source, it links chunks by shared keyword tags. This means we can delay traversing edges until query time.
In other words, using LangChain and Astra DB or Cassandra, you can get up and running with a highly performant PDF graph RAG solution in around 20 lines of code.
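Once the PDF has been loaded, querying the store works the same way as in the previous example. Here’s a quick sketch; the question text is just a placeholder for something your PDF actually covers:

```python
# Standard vector search over the PDF chunks.
docs = knowledge_store.similarity_search("What cities does the tour visit?")

# Graph traversal that also follows keyword-based links between chunks.
linked_docs = list(
    knowledge_store.traversal_search("What cities does the tour visit?")
)

for doc in linked_docs:
    print(doc.page_content[:100])
```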
Conclusion
Graph RAG is an exciting new way to increase the accuracy of your GenAI apps. With LangChain and a vector database like Astra DB as part of your data stack, you don’t need any specialized knowledge or skills to incorporate it into your apps immediately.
New to LangChain? Check out our tutorial on building an agent with LangChain to see how this useful toolkit simplifies building AI applications.