Technology | February 5, 2025

Knowledge Graphs: Do You Need One?

You might have heard of knowledge graphs in connection with generative AI apps. Does your GenAI app need one? And are they worth the additional overhead?

In this article, we’ll look at what knowledge graphs do, which use cases call for one, and how to add them to your existing GenAI apps with little additional architectural overhead.

What is a knowledge graph?

One way to understand knowledge graphs is to look at how they differ from relational and vector data structures:

  • In a relational data structure, data is represented as tables. Each table consists of a number of fields (columns) defining the data’s attributes, with each new entry represented as a row of data. Connections between records are signified using foreign keys that create relationships between tables. 
  • In a vector data structure or vector embedding, data is converted to a numerical representation in n-dimensional space, with each dimension representing a property of the original data. Records are related to one another if they lie in a similar direction in n-dimensional space (see the short sketch after this list). 
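
To make the vector idea concrete, here’s a minimal sketch of measuring direction with cosine similarity. The three-dimensional vectors and the document labels are toy values invented for illustration; real embeddings have hundreds or thousands of dimensions.

```python
# Toy illustration of vector similarity: records are "related" when their
# embedding vectors point in a similar direction (high cosine similarity).
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

doc_a = np.array([0.9, 0.1, 0.3])  # pretend embedding of "2022 Toyota Highlander"
doc_b = np.array([0.8, 0.2, 0.4])  # pretend embedding of "Toyota SUV lineup"
doc_c = np.array([0.1, 0.9, 0.1])  # pretend embedding of an unrelated document

print(cosine_similarity(doc_a, doc_b))  # high score: related records
print(cosine_similarity(doc_a, doc_c))  # low score: unrelated records
```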

By contrast, a knowledge graph defines criteria for identifying data items as entities, which it represents as nodes. The connections between these nodes, or edges, signify their relationships. For example, a customer might be represented as “Jack Smith,” a product as “2022 Toyota Highlander,” and the relationship between Jack and the car represented as “Owns.”

On top of nodes and edges, a knowledge graph has a set of organizing principles. This can be a schema that denotes the canonical relationships between entities or another method of breaking data into smaller components (e.g., breaking a document down into paragraphs). 
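
As a minimal sketch of these ideas, the snippet below models the Jack Smith example as plain Python data: nodes, edges, and a small schema acting as the organizing principle. The entity and relationship names are illustrative only.

```python
# Nodes (entities), edges (relationships), and a schema (organizing principle)
# expressed as plain Python data structures.

nodes = {
    "jack_smith": {"type": "Customer", "name": "Jack Smith"},
    "highlander_2022": {"type": "Product", "name": "2022 Toyota Highlander"},
}

# Each edge connects two nodes via a named relationship.
edges = [("jack_smith", "OWNS", "highlander_2022")]

# Organizing principle: which relationships are allowed between which entity types.
schema = {("Customer", "OWNS", "Product")}

for source, relation, target in edges:
    kinds = (nodes[source]["type"], relation, nodes[target]["type"])
    assert kinds in schema, f"Relationship not allowed by schema: {kinds}"
    print(f"{nodes[source]['name']} -[{relation}]-> {nodes[target]['name']}")
```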

Benefits of a knowledge graph

Compared with relational or vector data, a knowledge graph offers several advantages. For example, a knowledge graph can extract multiple facts from a single source, which makes data retrieval more efficient: it restricts the data we read to the portion of the source that is relevant to the query.

Like vector data (and unlike relational data), a knowledge graph can represent structured, unstructured, and semi-structured data. Unlike vector representations, however, knowledge graphs can correlate the same facts and relationships from different sources with the same nodes and edges in the graph. This means we can bias data retrieval towards results that appear in multiple sources, improving the quality of our search results. 

A knowledge graph can also retrieve knowledge several steps away from the original entities in question in a single query. In a vector search, you might require multiple rounds of queries to achieve the same result. 
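
To illustrate the multi-hop idea, here’s a minimal sketch of a breadth-first walk over a toy adjacency list. A graph database would run the equivalent of this traversal server-side as a single query; the entities and relationships below are invented for the example.

```python
# Collect facts up to `max_hops` relationships away from a starting entity.
from collections import deque

graph = {
    "jack_smith": [("OWNS", "highlander_2022")],
    "highlander_2022": [("MADE_BY", "toyota"), ("SUBJECT_OF", "recall_notice")],
    "toyota": [("HEADQUARTERED_IN", "toyota_city")],
}

def facts_within(start: str, max_hops: int):
    """Breadth-first traversal returning (source, relation, target, hops) tuples."""
    seen = {start}
    queue = deque([(start, 0)])
    facts = []
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue
        for relation, target in graph.get(node, []):
            facts.append((node, relation, target, hops + 1))
            if target not in seen:
                seen.add(target)
                queue.append((target, hops + 1))
    return facts

for fact in facts_within("jack_smith", max_hops=2):
    print(fact)  # includes two-hop facts such as the manufacturer and recall notice
```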

How to implement knowledge graphs 

There are two primary implementations of knowledge graphs: 

Triple store - Used primarily to represent graphs that conform to the Resource Description Framework (RDF). Each fact is stored as a triple of three distinct elements (two nodes and the relationship between them) in a three-item array. 

Designed to model the semantic web, triple stores scale well, but they can become so complex that queries against them aren’t guaranteed to complete. They also don’t model many-to-many relationships well.
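
As a quick illustration of the triple model, the sketch below uses the rdflib Python library (an illustrative choice, not one prescribed here; it assumes `pip install rdflib`) to store and query a couple of subject-predicate-object triples.

```python
# Minimal RDF triple store sketch with rdflib: each fact is one
# (subject, predicate, object) triple.
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")  # toy namespace for the example
g = Graph()

g.add((EX.JackSmith, EX.owns, EX.ToyotaHighlander2022))
g.add((EX.ToyotaHighlander2022, EX.madeBy, EX.Toyota))

# SPARQL query: what does Jack Smith own?
results = g.query(
    "SELECT ?thing WHERE { <http://example.org/JackSmith> <http://example.org/owns> ?thing }"
)
for row in results:
    print(row.thing)  # -> http://example.org/ToyotaHighlander2022
```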

Property graph - A property graph stores the graph in a physical database that matches the graph model: a set of nodes, with each relationship represented as a named, directed connection between them. Property graphs also enable storing arbitrary metadata on both nodes and connections. E.g., a Person node could have properties indicating someone’s date of birth, address, etc. 

In general - and especially for GenAI applications - property graphs are easier to use and faster to query than RDF triple stores. Property graphs can also be dynamic and evolve over time, whereas triple stores require you to define your entire data structure (ontology) up front. 
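
For contrast, here’s a minimal in-memory stand-in for a property graph using the networkx library (again an illustrative choice; a production system would use a property graph database). Note how both the nodes and the connection carry arbitrary properties; the property values are made up.

```python
import networkx as nx

G = nx.DiGraph()

# Nodes with arbitrary key/value properties.
G.add_node("jack_smith", label="Person", name="Jack Smith", date_of_birth="1985-06-01")
G.add_node("highlander_2022", label="Vehicle", model="2022 Toyota Highlander")

# A named, directed connection that also carries properties of its own.
G.add_edge("jack_smith", "highlander_2022", relationship="OWNS", since=2022)

for source, target, attrs in G.edges(data=True):
    print(source, f"-[{attrs['relationship']}]->", target, attrs)
```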

Use cases for a knowledge graph 

The primary use case for knowledge graphs that concerns us today is GenAI - particularly as a method for implementing retrieval-augmented generation (RAG).

Most GenAI apps make queries against a general-purpose LLM. To give the LLM up-to-date information relevant to a user’s query, app developers query their own data sources - product catalogs, customer support chats, documentation, code bases, etc. - and pass the results to the LLM as context. 

RAG is typically implemented using a vector database, with vector embeddings used to calculate similarities between units of data. A knowledge graph, however, can provide more relevant context for some LLM queries by establishing connections between sources, which in some cases lets it retrieve facts that a vector search can’t. 
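
Whichever retrieval method you use, the final step of RAG looks the same: the retrieved context is stitched into the prompt before it goes to the LLM. The sketch below hard-codes two chunks that a graph traversal might have returned so it stays self-contained; the question and chunk text are invented.

```python
def build_rag_prompt(question: str, context_chunks: list[str]) -> str:
    """Combine retrieved context with the user's question into a single prompt."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Chunks a graph traversal might return for this question (hard-coded for the sketch).
chunks = [
    "Jack Smith owns a 2022 Toyota Highlander.",
    "The 2022 Toyota Highlander has an open recall notice on file.",
]

print(build_rag_prompt("Does Jack Smith own any vehicles under recall?", chunks))
# The resulting string is what you would send to your LLM of choice.
```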

Knowledge graphs are also used in other traditional AI and machine learning applications. A good example is search engines. Google’s knowledge graph, for example, is what enables it to answer questions such as “Where were the 2016 Summer Olympics held?” that require making connections between real-world entities. Other examples of knowledge graph applications include real-time fraud detection and product recommendation engines. 

Do you need a knowledge graph?

So do you need a knowledge graph? That question comes down to the nature of your source documents. 

There are two types of knowledge graphs: entity-centric and content-centric. Whereas entity-centric knowledge graphs find and extract real-world entities from source documents, content-centric graphs break documents into chunks (paragraphs, sentences) and link them together based on keywords or explicit cross-references. 

Entity-centric knowledge graphs are generally hard to build well. They require time, manual tooling, and the input of domain experts. By contrast, content-centric knowledge graphs can be built automatically, without domain-expert input. Content-centric graphs also generally provide more relevant results than entity-centric graphs. 
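
Here’s a deliberately naive sketch of building a content-centric graph: split documents into paragraph chunks, then link chunks that share keywords. Real implementations use proper keyword or entity extraction and explicit cross-references; the documents and the capitalized-word heuristic below are purely illustrative.

```python
from itertools import combinations

documents = {
    "warranty_policy": (
        "The Highlander warranty covers powertrain repairs.\n\n"
        "Safety issues are handled under the recall policy instead."
    ),
    "recall_policy": "Recall repairs for the Highlander are free of charge.",
}

# 1. Break each document into paragraph chunks (the graph's nodes).
chunks = {
    f"{doc_id}#{i}": paragraph
    for doc_id, text in documents.items()
    for i, paragraph in enumerate(text.split("\n\n"))
}

# 2. Naive "keyword" extraction: capitalized words, punctuation stripped.
def keywords(text: str) -> set[str]:
    return {word.strip(".,") for word in text.split() if word[0].isupper()}

# 3. Link chunks (the graph's edges) wherever they share a keyword.
edges = [
    (a, b, keywords(chunks[a]) & keywords(chunks[b]))
    for a, b in combinations(chunks, 2)
    if keywords(chunks[a]) & keywords(chunks[b])
]

for a, b, shared in edges:
    print(f"{a} <-> {b} via {shared}")
```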

You’re likely to see better results with RAG using a knowledge graph traversal than a vector search if your original sources are things such as: 

  • Collections of frequently cross-referenced documents

  • Large docs with multiple sections, glossaries, and citations

  • Large websites, like Wikis, in which every paragraph has multiple HTML links to other documents in the collection 

In other words, if your sources are legal documents, technical docs, research and academic publications, or highly interconnected websites, your use case will likely benefit from knowledge graph RAG - so-called “GraphRAG” - over a plain vector search.

How to get started with knowledge graph RAG 

The most common way of implementing a knowledge graph is by storing your data in a purpose-built knowledge graph database. However, unless you’re already using knowledge graphs, this means installing, configuring, and maintaining yet another scalable and highly resilient component in your data stack.

Fortunately, you can also store knowledge graphs in a vector database. This doesn’t create as rich a structure as a dedicated knowledge graph database. For RAG, however, we only need to traverse nodes two to three steps away from the starting point to find the most relevant results. This means we can implement GraphRAG without inflating our data stack.
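
As a rough sketch of the idea, the snippet below stores each chunk in a plain dictionary with a toy embedding and the IDs of the chunks it links to, then retrieves by vector similarity followed by a shallow link expansion. A real vector database would hold the links as document metadata, but the traversal logic is essentially the same.

```python
import numpy as np

# Toy "vector store": each chunk has an embedding plus outgoing link IDs as metadata.
store = {
    "chunk_a": {"vector": np.array([0.9, 0.1, 0.1]), "links": ["chunk_b"]},
    "chunk_b": {"vector": np.array([0.2, 0.8, 0.1]), "links": ["chunk_c"]},
    "chunk_c": {"vector": np.array([0.1, 0.1, 0.9]), "links": []},
}

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def graph_retrieve(query_vector: np.ndarray, depth: int = 2, k: int = 1) -> set[str]:
    # 1. Vector search: find the k most similar seed chunks.
    seeds = sorted(store, key=lambda c: similarity(store[c]["vector"], query_vector),
                   reverse=True)[:k]
    # 2. Graph step: follow stored links up to `depth` hops from the seeds.
    results, frontier = set(seeds), list(seeds)
    for _ in range(depth):
        frontier = [t for c in frontier for t in store[c]["links"] if t not in results]
        results.update(frontier)
    return results

print(graph_retrieve(np.array([1.0, 0.0, 0.0])))  # seed chunk_a plus its linked neighbors
```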

That still leaves the issue of how to ingest and store documents as knowledge graphs. To make this part of the process simple, DataStax has contributed a class, GraphVectorStore, to the LangChain framework. By using LangChain in your GenAI apps, you can launch quick experiments with GraphRAG and see how it performs relative to a vector search implementation for your use case.
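
As a starting point, a GraphRAG experiment with GraphVectorStore might look roughly like the sketch below. Treat it as an outline rather than a definitive recipe: the module paths, class names, constructor arguments, and method signatures shown here (CassandraGraphVectorStore, cassio.init, traversal_search, the depth parameter) have shifted between langchain-community releases, so check the current LangChain and DataStax documentation for the exact API.

```python
# Hedged outline of a GraphRAG experiment with LangChain's GraphVectorStore.
# Assumption: Astra DB / Cassandra credentials are supplied via environment
# variables that cassio reads, and langchain-community, langchain-openai, and
# cassio are installed. Exact imports and signatures may differ in your version.
import cassio
from langchain_community.graph_vectorstores import CassandraGraphVectorStore
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

cassio.init(auto=True)  # connect using credentials from the environment

store = CassandraGraphVectorStore(embedding=OpenAIEmbeddings())

# Documents are embedded as usual; links added by LangChain's link extractors
# (e.g., HTML hyperlink extraction) become the edges of the graph.
store.add_documents([
    Document(page_content="The 2022 Highlander warranty covers powertrain repairs."),
    Document(page_content="Recall repairs for the 2022 Highlander are free of charge."),
])

# Traversal search: seed with the best vector matches, then follow graph links
# a few hops out to pull in connected context for the LLM.
for doc in store.traversal_search("How are recall repairs handled?", depth=2):
    print(doc.page_content)
```

From there, you can run the same query with a plain similarity search on the same data and compare how relevant the retrieved context is for your own documents.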

Conclusion

For use cases involving highly interconnected documents, knowledge graphs can provide more relevant LLM results using RAG compared to vector search. By using a vector database to store knowledge graph structures, you can start experimenting with GraphRAG today without making any changes to your data stack.

Need a fast, fully managed, and highly scalable database to power your GenAI apps? DataStax Astra DB is a low-latency, serverless vector database for performing context-sensitive searches over petabytes of data. Try it today for free.
