Technology | January 30, 2025

How to Store a Knowledge Graph in a Database

Retrieval-augmented generation (RAG) has proven a reliable method for providing additional context to generative AI apps. However, in some circumstances, implementing RAG with a knowledge graph can yield more accurate results than a plain vector search. 

That raises the question of how to store and effectively query your knowledge graph. In this article, we’ll look at when to create a knowledge graph, how to build one from source documents, and the best way to store one while adding minimal additional lift to your GenAI app infrastructure.

When to use a knowledge graph

RAG is typically implemented using a vector database, where information is stored as a set of mathematical vector embeddings. Before sending a prompt to a large language model (LLM), a GenAI app queries a vector database to find additional, relevant information to add as context to the prompt.
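
To make that flow concrete, here is a minimal sketch of the retrieve-then-augment loop. Every name in it (vector_store, embed, llm, similarity_search, generate) is a placeholder for whatever vector database client, embedding function, and model client you actually use.

    # Minimal retrieve-then-augment sketch; all objects here are placeholders
    # for your real vector database client, embedding function, and LLM client.
    def answer_with_rag(question: str, vector_store, embed, llm) -> str:
        # Embed the question and fetch the most similar stored chunks.
        query_vector = embed(question)
        chunks = vector_store.similarity_search(query_vector, top_k=4)

        # Prepend the retrieved chunks to the prompt as extra context.
        context = "\n\n".join(chunk.text for chunk in chunks)
        prompt = (
            "Answer the question using the context below.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {question}"
        )
        return llm.generate(prompt)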

Even basic RAG works better than no RAG at all: it reduces LLM hallucinations and provides the LLM, whose knowledge isn't domain-specific and may be several years out of date, with relevant, up-to-date data related to the prompt.

However, for use cases where the source information consists of heavily interlinked documents, you can often get more effective results by formatting the information into a knowledge graph. 

A knowledge graph represents information as a set of nodes and the relationships between those nodes. For example, in the statement “John Doe owns a 2022 Buick LeSabre,” “John Doe” and “Buick LeSabre” would be nodes, and “owns” would represent the relationship between John Doe and the car. 
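
As an illustration only, that statement boils down to two nodes and one labeled relationship between them:

    # Toy, in-memory view of the example statement: two nodes, one relationship.
    nodes = ["John Doe", "2022 Buick LeSabre"]
    relationships = [("John Doe", "OWNS", "2022 Buick LeSabre")]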

When your source data consists of assets like technical documentation, research publications, or highly interconnected websites, a knowledge graph returns better results than a simple vector search. That’s because a knowledge graph search can traverse links between nodes, finding semantically relevant results two or more steps away from the first node.

In some cases, this approach, often called GraphRAG, can provide answers that regular RAG can't. One example cited by Microsoft shows a news query that requires connecting the dots across multiple source documents to produce an answer. Traditional RAG can't answer such questions, but GraphRAG can, by traversing the nodes that connect the relevant documents. 

How to build a knowledge graph

There are three steps to building and storing a knowledge graph for RAG:  

  1. Extract the original data from source documents
  2. Represent the data in a standard graph format
  3. Store the data for search and retrieval

There are a couple of different models for building a graph. One standard, the Resource Description Framework (RDF), defines graphs as triples consisting of a subject (the first node), an object (the second node), and a predicate (the relationship) between them. RDF also supports defining semantic namespaces, so you can develop distinct ontologies and avoid naming conflicts. 
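
As a sketch, here is the earlier example statement expressed as RDF triples with the rdflib library, using a made-up http://example.org/ namespace:

    # The example statement as RDF triples, built with rdflib.
    # The http://example.org/ namespace is purely illustrative.
    from rdflib import RDF, Graph, Literal, Namespace

    EX = Namespace("http://example.org/")

    g = Graph()
    g.add((EX.JohnDoe, RDF.type, EX.Person))
    g.add((EX.BuickLeSabre, RDF.type, EX.Car))
    g.add((EX.BuickLeSabre, EX.modelYear, Literal(2022)))
    g.add((EX.JohnDoe, EX.owns, EX.BuickLeSabre))  # subject, predicate, object

    print(g.serialize(format="turtle"))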

Another option is to represent graphs as a property graph. In a property graph, information is organized as nodes, relationships, and properties. Nodes and relationships can have metadata, or properties, attached to them, adding additional context for your queries. 
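
For contrast, here is an illustrative (not production) property-graph view of the same statement, with properties attached to both the nodes and the relationship:

    # The same statement as a property graph: the nodes and the relationship
    # each carry their own properties.
    nodes = {
        "n1": {"label": "Person", "properties": {"name": "John Doe"}},
        "n2": {"label": "Car", "properties": {"make": "Buick", "model": "LeSabre", "year": 2022}},
    }

    relationships = [
        {
            "start": "n1",
            "type": "OWNS",
            "end": "n2",
            # Relationship metadata adds context for queries, e.g. when ownership began.
            "properties": {"since": 2022},
        }
    ]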

How to store a knowledge graph

Which model should you use? To answer that question, let’s look at how you might store each type of graph. There are three possible storage options: 

  • RDF triplestore
  • Graph database
  • Vector database

Let’s look at each one in turn. 

RDF triplestore 

Storing RDF-formatted graphs requires using an RDF triplestore, a database specifically for storing the standard’s highly atomic format. Triplestores can be implemented in their own purpose-built databases or on top of existing SQL or NoSQL solutions. Data is retrieved using a purpose-built query language, the most common being W3C’s SPARQL.
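
Continuing the rdflib sketch from earlier, a SPARQL query over those triples looks like this; a production triplestore exposes the same query language over far larger graphs:

    # Ask which cars John Doe owns, using the small rdflib graph built above.
    results = g.query(
        """
        PREFIX ex: <http://example.org/>
        SELECT ?car WHERE {
            ex:JohnDoe ex:owns ?car .
            ?car a ex:Car .
        }
        """
    )
    for row in results:
        print(row.car)  # http://example.org/BuickLeSabre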

One benefit of an RDF triplestore is that it can handle complexity at scale. (It was, after all, designed to represent the entire World Wide Web.) Triplestores also tend to be less costly to implement and maintain than, for example, a relational database. They also perform well running federated queries across distributed data sources. 

Some downsides of the RDF triplestore are: 

  • It requires building an ontology, which takes considerable time and the manual assistance of experts
  • Ontologies must be fully fleshed out before you can ship anything, which prolongs development
  • RDF triplestore graphs can become so complex that queries against them aren’t guaranteed to complete

Graph database

By contrast, the default storage option for property graphs is a graph database. Graph databases are a type of NoSQL database that treats both nodes and relationships as first-class citizens, each of which can carry any number of properties. 

Graph databases may store graphs internally as tables, key-value pairs, or documents. Neo4j and Amazon Neptune are two popular graph databases.

Graph databases have several advantages over RDF triplestores. Whereas RDF ontologies must be fully developed prior to deployment, graph databases are flexible. You can change the schema of a graph database by inserting new data without breaking anything. You can also build out graph databases for RAG automatically by leveraging tools in LLM frameworks like LangChain to convert source information into graph format. 
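
As a sketch of that automated extraction, LangChain's experimental graph transformers can turn raw documents into graph documents. The module paths and class names below reflect the library at the time of writing, so check the current docs before relying on them:

    # Extract a graph from raw text with an LLM via LangChain's (experimental)
    # graph transformers; paths and names may change between releases.
    from langchain_core.documents import Document
    from langchain_experimental.graph_transformers import LLMGraphTransformer
    from langchain_openai import ChatOpenAI  # any supported chat model works

    llm = ChatOpenAI(model="gpt-4o", temperature=0)
    transformer = LLMGraphTransformer(llm=llm)

    docs = [Document(page_content="John Doe owns a 2022 Buick LeSabre.")]

    # The LLM identifies nodes and relationships in the text.
    graph_documents = transformer.convert_to_graph_documents(docs)
    print(graph_documents[0].nodes)
    print(graph_documents[0].relationships)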

The downside of using a graph database is that it means adding yet another component to your architecture. It also requires your developers to master another database format and the accompanying tools. This adds additional lift to your project and can slow down your GraphRAG implementation. 

Vector database 

The good news is that, when it comes to RAG, you don't need the full power of a graph database. Graph databases shine by letting you query a graph to arbitrary depth, but for RAG you usually only need the results immediately surrounding a single node.

This means that, instead of standing up a graph database just to implement GraphRAG, you can leverage your existing vector database, such as Astra DB, to store your knowledge graphs as vector embeddings. Components such as LangChain's GraphVectorStore let you parse your sources into a hybrid vector/graph store and save it to a vector database. 
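
Here is a rough sketch of that approach using LangChain's graph vector store components on top of Cassandra/Astra DB. The exact module paths, class names, and retriever options have shifted between releases (and connection setup via cassio is omitted), so treat this as an outline rather than copy-paste code:

    # Store documents plus the links between them in a single vector store,
    # then retrieve by traversing those links outward from the best matches.
    # Module paths and options are approximate; check the current LangChain docs.
    from langchain_community.graph_vectorstores import CassandraGraphVectorStore
    from langchain_community.graph_vectorstores.extractors import (
        HtmlLinkExtractor,
        LinkExtractorTransformer,
    )
    from langchain_core.documents import Document
    from langchain_openai import OpenAIEmbeddings

    # Annotate documents with the hyperlinks between them so the graph edges
    # are stored alongside the embeddings.
    transformer = LinkExtractorTransformer([HtmlLinkExtractor().as_document_extractor()])
    docs = transformer.transform_documents([Document(page_content="<html>...</html>")])

    # Assumes the Cassandra/Astra DB connection has already been initialized.
    store = CassandraGraphVectorStore.from_documents(docs, embedding=OpenAIEmbeddings())

    # Traversal-style retrieval pulls in documents a hop or two away from the
    # closest vector matches (option names vary by version).
    retriever = store.as_retriever(search_type="traversal", search_kwargs={"depth": 1})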

The major advantage of this approach is you don’t have to spend time getting a graph database up and running or learning a new query language. You can also build and store property graphs easily using popular LLM frameworks such as LangChain and LlamaIndex.

What’s next

Knowledge graphs can greatly improve RAG for use cases involving heavily interlinked sources. By using your existing vector database and tools like LangChain to do the heavy lifting of parsing your sources, you can support GraphRAG without significantly changing your existing GenAI app stack.

This article covered the basics of GraphRAG storage. To learn more about creating knowledge graphs for RAG that return the most relevant results, read more on how to build content-centric knowledge graphs using LangChain and Astra DB.

Need a highly scalable vector database that requires little heavy lifting? Astra DB is a serverless, low-latency vector database built on top of Apache Cassandra® that enables you to move your GenAI apps quickly from experimentation to production. Try Astra DB and discover a faster way to build and deploy AI apps.
