How to Build a Knowledge Graph for AI
Retrieval-augmented generation (RAG) has proven effective at yielding better responses from large language models (LLMs) like OpenAI’s GPT models and Anthropic’s Claude. However, recent experience shows you may not be getting the best responses by just using vector embeddings. It turns out a little structure can go a long way.
In this guide, we’ll dive into how knowledge graphs can yield better results - and how to build one without adding additional infrastructure lift.
Why build a knowledge graph for AI?
RAG is a technique in which your GenAI apps provide supplementary context to an LLM query. To implement RAG, you store relevant information - e.g., product manuals or support case logs for a product support chatbot - in a database. You then search for relevant documents based on the user’s question and include them in your prompt to the LLM.
The typical RAG implementation stores data as vector embeddings and uses similarity search to find material that sits close to the query in n-dimensional space. The result is that RAG-enabled systems can return more accurate, fresher results at lower cost than alternatives such as fine-tuning the LLM.
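To make that flow concrete, here’s a minimal sketch of the typical RAG loop, assuming you already have a populated LangChain vector store and a chat model (the function and prompt wording below are placeholders for illustration):

from langchain_core.language_models import BaseChatModel
from langchain_core.vectorstores import VectorStore


def answer_with_rag(question: str, vector_store: VectorStore, llm: BaseChatModel) -> str:
    # 1. Find the documents closest to the question in embedding space.
    docs = vector_store.similarity_search(question, k=4)

    # 2. Stuff the retrieved text into the prompt as supplementary context.
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # 3. Send the augmented prompt to the LLM.
    return llm.invoke(prompt).content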
Unfortunately, in some use cases, traditional RAG struggles to find the most relevant answers to a question. For large collections of documents with frequent cross-references, and large knowledge bases with intensive hyperlinking, RAG doesn’t go deep enough into the document set to produce accurate answers.
In these cases, even with RAG, an LLM may provide inaccurate information - or even hallucinate a completely fictitious answer.
This is where a knowledge graph can help. A knowledge graph represents real-world entities (objects, events, etc.) and the relationships between them. Objects are represented as nodes, and relationships as edges connecting the nodes. (Example: Person A is related to Person B via a “direct report” relationship in an org chart.)
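As a tiny, library-free illustration of that structure in Python, the org chart example could be represented like this (the names and titles are made up):

nodes = {
    "person_a": {"name": "Person A", "title": "Engineer"},
    "person_b": {"name": "Person B", "title": "Manager"},
}
edges = [
    # (source, relationship, target): Person A reports directly to Person B.
    ("person_a", "direct_report", "person_b"),
]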
Benefits of a knowledge graph
For deep knowledge sets - legal documents, technical manuals, research papers, and highly interconnected websites - knowledge graphs can supplement a vector database by connecting chunks of text that might not be close to one another in vector space. Knowledge graphs take advantage of the clear and direct relationships drawn by links and references to fill in the details of a response to a GenAI query.
Knowledge graphs also have several other benefits that can increase response accuracy. They can:
- Extract many facts from a single source document
- Treat matching nodes and edges from different source documents as the same rather than distinct, providing additional confidence in chunks cited by multiple sources
- Traverse a graph to find items multiple steps away
Building a knowledge graph with a vector database
The typical way to implement a knowledge graph is by using a graph database for storage and retrieval. A graph database is a purpose-built data store that stores nodes, edges, and their properties. It’s typically used to build recommendation engines, perform route optimization, or implement advanced fraud detection.
I can hear the collective groan from the audience: “Another database?!” If you don’t already run a graph database, then yeah - this means implementing another scalable storage service or adding a new serverless solution to your GenAI app stack.
The good news is that a graph database isn’t strictly necessary for graph RAG. Typically, graph RAG doesn’t require the same kind of deep, multi-hop traversal as the use cases above. Traversing the immediate neighborhood of a node - a hop or two out - gets us the information we need for more accurate answers.
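As a rough sketch of what that shallow traversal looks like, assuming edges are stored as (source, relationship, target) triples:

def neighborhood(start: str, edges: list[tuple[str, str, str]], depth: int = 2) -> set[str]:
    # Collect every node reachable from `start` within `depth` hops, in either direction.
    seen = {start}
    frontier = {start}
    for _ in range(depth):
        frontier = {
            dst if src in frontier else src
            for src, _rel, dst in edges
            if src in frontier or dst in frontier
        } - seen
        seen |= frontier
    return seen

Capping the depth like this keeps retrieval fast while still pulling in directly related chunks that pure similarity search might miss.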
Instead of a graph DB, you can store your knowledge graphs in a vector representation using a vector database like Astra DB. You can use Astra DB to support both RAG and knowledge graph queries, reducing the number of components you need to maintain in your infra. Using a serverless solution like Astra DB also means you don’t need to worry about hosting or scalability.
How to build a knowledge graph with a vector database
To get started with building a knowledge graph using a vector database, you’ll need to take three key steps:
- Convert knowledge to a graph format
- Extract entities from the question
- Retrieve the subgraph for the question to supply in your LLM query
Convert knowledge to a graph format
The first step is converting the documents and assets you have into a graph structure. A common format for this is the Resource Description Framework (RDF). Libraries such as RDFLib provide parsers and serializers for RDF storage and file formats, but offer little help with extracting knowledge from unstructured documents.
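For reference, a minimal RDFLib sketch looks like this; the namespace and the triples are made up for illustration:

from rdflib import Graph, Namespace, RDF

EX = Namespace("http://example.org/")

g = Graph()
# One fact ("triple") per statement: Person A is a Person who reports directly to Person B.
g.add((EX.person_a, RDF.type, EX.Person))
g.add((EX.person_a, EX.direct_report, EX.person_b))

# RDFLib serializes the graph to standard RDF formats such as Turtle.
print(g.serialize(format="turtle"))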
The easiest way to parse a doc is to use LangChain to transform a document into a knowledge graph - i.e., use an LLM to parse and create structure from the unstructured text. There are two ways to build your knowledge graph using this approach:
- Entity-centric: The traditional knowledge graph approach that represents entities and the relationships between them; or
- Content-centric: A different approach where the nodes are chunks of text in the original document and edges are the relationships (hyperlinks, references) between them
As we’ve argued before, content-centric graphs are usually better for RAG. Entity knowledge graphs can take a long time to build and often require human experts to get right.
By contrast, content-centric graphs take advantage of an LLM’s greatest feature - processing large amounts of data automatically. They’re easier to build automatically and scale better in production.
You can easily build a hybrid content-centric graph and vector store using LangChain’s GraphVectorStore. The following code shows how to use GraphVectorStore to parse a new document:
import json

from langchain_core.documents import Document
from langchain_core.graph_vectorstores.links import METADATA_LINKS_KEY, Link


def parse_document(line: str) -> Document:
    """Parse one JSON line into a Document whose metadata carries graph links."""
    para = json.loads(line)
    id = para["id"]

    # Outgoing links for every paragraph this one mentions.
    links = {
        Link.outgoing(kind="href", tag=ref_id)
        for m in para["mentions"]
        if m["ref_ids"] is not None
        for ref_id in m["ref_ids"]
    }
    # Incoming link so paragraphs that reference this one can connect to it.
    links.add(Link.incoming(kind="href", tag=id))

    return Document(
        id=id,
        page_content=" ".join(para["sentences"]),
        metadata={
            "content_id": para["id"],
            METADATA_LINKS_KEY: list(links),
        },
    )
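From there, you load the parsed documents into a graph vector store backed by Astra DB. A minimal sketch, assuming Astra DB credentials in your environment, OpenAI embeddings, and a hypothetical paragraphs.jsonl file with one JSON paragraph per line; LangChain’s CassandraGraphVectorStore is one GraphVectorStore implementation that works with Astra DB via cassio:

import cassio
from langchain_community.graph_vectorstores import CassandraGraphVectorStore
from langchain_openai import OpenAIEmbeddings

# Reads Astra DB credentials (token, database ID) from environment variables.
cassio.init(auto=True)

store = CassandraGraphVectorStore(embedding=OpenAIEmbeddings())

# Each parsed Document carries its links, so the store indexes both the
# embeddings and the graph edges between chunks.
with open("paragraphs.jsonl") as f:
    store.add_documents([parse_document(line) for line in f])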
(If you want to experiment with an entity-centric knowledge graph to contrast and compare the two approaches, you can use LangChain’s LLMGraphTransformer.)
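A rough sketch of that entity-centric approach, with the model choice and the sample sentence as placeholders:

from langchain_core.documents import Document
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0)
transformer = LLMGraphTransformer(llm=llm)

docs = [Document(page_content="Person A reports to Person B in the sales organization.")]
# The LLM extracts entities (nodes) and relationships (edges) from each document.
graph_documents = transformer.convert_to_graph_documents(docs)
print(graph_documents[0].nodes, graph_documents[0].relationships)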
Extract entities from the question
Next, we need to extract the entities from a user’s question so we know where to start traversing our graph. For this, again, we can leverage an LLM to extract the entities for us. The following LangChain Python code returns a list of entities as a LangChain Runnable that we can subsequently use in a query to Astra DB:
from typing import List

from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from pydantic import BaseModel, Field

QUERY_ENTITY_EXTRACT_PROMPT = (
    "A question is provided below. Given the question, extract up to 5 "
    "entity names and types from the text. Focus on extracting the key entities "
    "that we can use to best lookup answers to the question. Avoid stopwords.\n"
    "---------------------\n"
    "{question}\n"
    "---------------------\n"
    "{format_instructions}\n"
)


def extract_entities(llm):
    class SimpleNode(BaseModel):
        """Represents a node in a graph with associated properties."""

        id: str = Field(description="Name or human-readable unique identifier.")
        # optional_enum_field and node_types are helpers (defined elsewhere) that
        # constrain the type to a known set of labels; a plain Field also works.
        type: str = optional_enum_field(node_types, description="The type or label of the node.")

    class SimpleNodeList(BaseModel):
        """Represents a list of simple nodes."""

        nodes: List[SimpleNode]

    output_parser = JsonOutputParser(pydantic_object=SimpleNodeList)

    return (
        RunnablePassthrough.assign(
            format_instructions=lambda _: output_parser.get_format_instructions(),
        )
        | ChatPromptTemplate.from_messages([QUERY_ENTITY_EXTRACT_PROMPT])
        | llm
        | output_parser
        | RunnableLambda(
            lambda node_list: [(n["id"], n["type"]) for n in node_list["nodes"]]
        )
    )
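Invoking the resulting runnable is straightforward; the model and question below are placeholders:

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0)
entity_extractor = extract_entities(llm)

# Returns a list of (id, type) tuples extracted from the question.
entities = entity_extractor.invoke({"question": "How do I reset the Model X router to factory settings?"})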
Retrieve the subgraph for the question and supply to LLM
Finally, you can retrieve the subgraph using the entities extracted by the LLM. There are a few different methods for accomplishing this with content-centric knowledge graphs. The most accurate is a maximum marginal relevance (MMR) query, which uses a combination of vector search and graph traversal to retrieve a specific number of documents.
Since we stored our data as hybrid vector and graph data, we can use the ragstack-ai-knowledge-store library to accomplish this:
retriever = knowledge_store.as_retriever(
    search_type="mmr_traversal",
    search_kwargs={
        "k": 4,
        "fetch_k": 10,
        "depth": 2,
    },
)
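The retriever then drops into a chain like any other LangChain retriever; a minimal sketch with a placeholder question:

# Fetch 10 candidate chunks, traverse their links up to 2 hops, and keep the best 4.
docs = retriever.invoke("How do I reset the Model X router to factory settings?")

# Stuff the retrieved chunks into the LLM prompt as context, just as with standard RAG.
context = "\n\n".join(doc.page_content for doc in docs)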
By utilizing the connections between docs, this technique produces much more detailed responses than using vector embeddings alone - and all without adding an additional piece of infrastructure to your app stack.
Final thoughts on implementing knowledge graphs for AI
Converting knowledge-dense, highly connected documents into a knowledge graph can yield better RAG results than using vector embeddings alone. Even better, leveraging a vector database like Astra DB for knowledge graph storage and retrieval means you can improve LLM response quality with no additional architectural lift.
Want to see it in action for yourself? Sign up for a free DataStax account and walk through our full knowledge graph example in Google Colab.