What is GraphRAG?


One of the most exciting applications of Generative AI and LLMs is Retrieval-Augmented Generation (RAG), which lets you interact with external documents like PDFs, text files, and YouTube videos.


This post covers the following topics:

What are RAG and knowledge graphs?
Issues with baseline RAG
How GraphRAG works
Advantages of GraphRAG over naive RAG

A recent improvement on naive RAG is GraphRAG, which uses knowledge graphs rather than vector databases to find relevant information in external documents when a user submits a query. This post covers GraphRAG and its advantages over baseline RAG.


But before we get to GraphRAG, you need to know two major concepts: RAG and knowledge graphs.

How RAG works

RAG takes a user’s query and:

1. Searches a vector database (prepared using external documents) for relevant information using vector similarity.
2. Selects the top relevant documents.
3. Extracts useful content.
4. Combines this content with an LLM to generate an answer.
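The steps above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the "embedding" is a toy bag-of-words vector rather than a real embedding model, the "vector database" is a plain list, and the names `embed`, `retrieve`, and `build_prompt` are made up for this example. The final LLM call is left out; the sketch stops at the prompt that would be sent to it.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy 'embedding': word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, documents, top_k=2):
    """Steps 1-2: rank documents by similarity to the query, keep the top k."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(query, documents, top_k=2):
    """Steps 3-4: put the retrieved text into the prompt an LLM would answer from."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, top_k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "Our team has recently developed a new AI model for natural language processing.",
    "The cafeteria menu changes every Monday.",
    "In the past quarter, we made significant progress in AI-based image recognition.",
]
print(build_prompt("What AI model did the team develop?", docs))
```

Note that ranking depends purely on word overlap here, which is a cruder version of the text-similarity behaviour that real vector search exhibits.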

I’ve already explained RAG in some detail elsewhere, so I’ll skip the specifics here.

What is a Knowledge Graph?

A knowledge graph is a structured representation of information, capturing entities, their attributes, and relationships. It models complex data and highlights connections within a domain. Some key components of a knowledge graph are:

Entities: The fundamental units of a knowledge graph, representing real-world objects, concepts, or things (e.g., “Albert Einstein,” “Theory of Relativity,” “University”).
Attributes: Properties or characteristics of entities (e.g., “Albert Einstein” has an attribute “birthdate” with the value “March 14, 1879”).
Relationships: Connections between entities that describe how they are related to each other (e.g., “Albert Einstein” is related to “Theory of Relativity” by the relationship “developed”).
Nodes and Edges: In a graphical representation, entities are nodes, and relationships are edges connecting these nodes.

Consider a simple knowledge graph about scientific discoveries:

Entities: “Albert Einstein,” “Theory of Relativity,” “Speed of Light,” “Photoelectric Effect”
Attributes: “Albert Einstein” (birthdate: “March 14, 1879”), “Theory of Relativity” (published: “1905”)
Relationships:

“Albert Einstein” developed “Theory of Relativity”
“Theory of Relativity” relates to the “Speed of Light”
“Albert Einstein” proposed the “Photoelectric Effect”

It might look something like this
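The same toy graph can also be written down as data. Below is a minimal sketch, assuming a hand-rolled `KnowledgeGraph` class (a hypothetical name, not taken from any library) that stores attributes in a dictionary and relationships as (subject, relationship, object) triples. Real systems typically use a graph database or a graph library instead.

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeGraph:
    attributes: dict = field(default_factory=dict)  # entity -> {attribute: value}
    edges: list = field(default_factory=list)       # (subject, relationship, object)

    def add_entity(self, name, **attrs):
        """Register a node and merge in any attributes given for it."""
        self.attributes.setdefault(name, {}).update(attrs)

    def add_relationship(self, subject, relation, obj):
        """Add a directed edge, creating both endpoint nodes if needed."""
        for name in (subject, obj):
            self.add_entity(name)
        self.edges.append((subject, relation, obj))

    def neighbors(self, name):
        """Return (relationship, other entity) pairs touching `name`, in either direction."""
        outgoing = [(r, o) for s, r, o in self.edges if s == name]
        incoming = [(r, s) for s, r, o in self.edges if o == name]
        return outgoing + incoming

kg = KnowledgeGraph()
kg.add_entity("Albert Einstein", birthdate="March 14, 1879")
kg.add_entity("Theory of Relativity", published="1905")
kg.add_relationship("Albert Einstein", "developed", "Theory of Relativity")
kg.add_relationship("Theory of Relativity", "relates to", "Speed of Light")
kg.add_relationship("Albert Einstein", "proposed", "Photoelectric Effect")

print(kg.neighbors("Theory of Relativity"))
```

Asking for the neighbors of “Theory of Relativity” returns both the “Speed of Light” edge and the edge back to “Albert Einstein”. Following edges like this is how a graph links facts that never appear in the same sentence.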

Coming back to GraphRAG

The basic RAG implementation has a serious issue. Consider this example:

Suppose a company has a large collection of internal documents, including research papers, technical reports, emails, and meeting notes. The goal is to answer the question: “What are the recent advancements in our AI research department?”

Retrieval:

Searches the document collection for terms like “recent advancements” and “AI research department.”
Retrieves the top documents based on vector similarity (e.g., documents containing similar phrases).

Response:

Lists several documents or passages that mention advancements in AI research.
Struggles to connect insights across different documents, often presenting isolated pieces of information without synthesis.

The final output may retrieve these sentences:

Document 1: “Our team has recently developed a new AI model for natural language processing.”
Document 2: “In the past quarter, we made significant progress in AI-based image recognition.”
Document 3: “AI Advancements are at a rapid pace”

As you may have noticed, this approach falls short in two areas:

Connecting the dots: It might not link related advancements that are spread across documents but never explicitly connected. It is also driven heavily by text similarity, so even filler text can be retrieved (the third output above).
Holistic understanding: It might miss overarching trends or themes because it focuses on similar phrases rather than understanding the context.

Here comes GraphRAG

GraphRAG

GraphRAG, as mentioned earlier, uses knowledge graphs instead of vector databases for information retrieval, so its output tends to be more complete and meaningful than that of baseline RAG.

How GraphRAG works

GraphRAG uses an LLM to automatically extract a rich knowledge graph from a collection of text documents. The knowledge graph captures the semantic structure of the data, and the system detects “communities” of densely connected nodes in it at different levels of granularity. Each community is then summarized, and these community summaries provide an overview of the dataset, allowing the system to answer global queries that would be difficult for naive RAG approaches. When answering a user’s question, GraphRAG retrieves the most relevant information from the knowledge graph and uses it to condition the LLM’s response, improving accuracy and reducing hallucinations.
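To make the indexing idea concrete, here is a rough sketch of the "group the graph into communities, then summarize each one" step. Everything in it is an assumption made for illustration: the triples are invented and hard-coded rather than extracted by an LLM, communities are plain connected components (a crude stand-in, not the community detection GraphRAG itself uses), and `summarize` only concatenates facts where a real system would call an LLM.

```python
from collections import defaultdict

def build_adjacency(triples):
    """Undirected adjacency sets built from (subject, relationship, object) triples."""
    adj = defaultdict(set)
    for s, _rel, o in triples:
        adj[s].add(o)
        adj[o].add(s)
    return adj

def communities(triples):
    """Group entities into connected components (stand-in for community detection)."""
    adj = build_adjacency(triples)
    seen, groups = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(adj[node] - group)
        seen |= group
        groups.append(group)
    return groups

def summarize(group, triples):
    """Placeholder for the LLM summary: list the facts whose subject is in the community."""
    return "; ".join(f"{s} {r} {o}" for s, r, o in triples if s in group)

# Invented example triples; in GraphRAG these would be extracted from documents by an LLM.
triples = [
    ("NLP team", "developed", "language model"),
    ("language model", "powers", "support chatbot"),
    ("Vision team", "improved", "image recognition"),
]
community_summaries = [summarize(g, triples) for g in communities(triples)]
for summary in community_summaries:
    print(summary)
```

Each summary covers one cluster of related entities, so a question about overall themes can be answered from the summaries instead of from individual text chunks.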

Some major advantages of using GraphRAG over baseline RAG are:

Uses knowledge graphs to give more complete and varied responses compared to basic RAG.
Generates responses that are better connected to the original data, and can show where the information comes from.
Provides overviews of the dataset at different levels, so users can understand the overall context without needing specific questions.
Can be more efficient than summarizing the full text, while still generating high-quality responses.

With this, I will wrap up this post. I will cover how GraphRAG can be implemented in my next post!

In the meantime, you can explore Microsoft’s repository for the GraphRAG implementation.


Editor’s note, Apr. 2, 2024 – Figure 1 was updated to clarify the origin of each source.

Perhaps the greatest challenge – and opportunity – of LLMs is extending their powerful capabilities to solve problems beyond the data on which they have been trained, and to achieve comparable results with data the LLM has never seen. This opens new possibilities in data investigation, such as identifying themes and semantic concepts with context and grounding on datasets. In this post, we introduce GraphRAG, created by Microsoft Research, as a significant advance in enhancing the capability of LLMs.

Retrieval-Augmented Generation (RAG) is a technique to search for information based on a user query and provide the results as reference for an AI answer to be generated. This technique is an important part of most LLM-based tools and the majority of RAG approaches use vector similarity as the search technique. GraphRAG uses LLM-generated knowledge graphs to provide substantial improvements in question-and-answer performance when conducting document analysis of complex information. This builds upon our recent research, which points to the power of prompt augmentation when performing discovery on private datasets. Here, we define private dataset as data that the LLM is not trained on and has never seen before, such as an enterprise’s proprietary research, business documents, or communications. Baseline RAG[1] was created to help solve this problem, but we observe situations where baseline RAG performs very poorly. For example:

Baseline RAG struggles to connect the dots. This happens when answering a question requires traversing disparate pieces of information through their shared attributes in order to provide new synthesized insights.

Baseline RAG performs poorly when being asked to holistically understand summarized semantic concepts over large data collections or even singular large documents.

To address this, the tech community is working to develop methods that extend and enhance RAG (e.g., LlamaIndex). Microsoft Research’s new approach, GraphRAG, uses the LLM to create a knowledge graph based on the private dataset. This graph is then used alongside graph machine learning to perform prompt augmentation at query time. GraphRAG shows substantial improvement in answering the two classes of questions described above, demonstrating intelligence or mastery that outperforms other approaches previously applied to private datasets.

Applying RAG to private datasets

To demonstrate the effectiveness of GraphRAG, let’s start with an investigation using the Violent Incident Information from News Articles (VIINA) dataset. This dataset was chosen due to its complexity and the presence of differing opinions and partial information. It is a messy real-world test case that was recent enough not to be included in the LLM base model’s training.

For this research, we use thousands of news articles from both Russian and Ukrainian news sources for the month of June 2023, translated into English, to create a private dataset on which we will perform our LLM-based retrieval. The dataset is far too large to fit into an LLM context window, thus demanding a RAG approach.

We start with an exploratory query, which we pose to both a baseline RAG system and to our new approach, GraphRAG:

Query: “What is Novorossiya?”

Baseline RAG:

The term “Novorossiya” is not mentioned in the provided context. However, to provide some information, Novorossiya, or New Russia, is a historical term referring to a region in the south of Ukraine that was conquered by the Russian Empire in the 18th century. In recent years, the term has been revived by Russian-backed separatists in Eastern Ukraine to refer to the territories they claim, but it is not officially recognized as a separate entity by international bodies.

GraphRAG:

Novorossiya, also known as New Russia, is a term with historical and contemporary significance. Historically, it referred to a region in the southern part of the Russian Empire that is now part of Ukraine. In the contemporary context, it has been used by pro-Russian separatists to refer to certain areas of Ukraine [Data: Entities (6494, 912, 6493)].

The term Novorossiya has been used in the context of a political movement in Ukraine, particularly by Russian nationalists. This movement has been associated with efforts to destabilize Ukraine and create a so-called “Novorossiya” [Data: Entities (912); Relationships (15211, 15206)]. The term has also been used to denote a separatist state in Ukraine, known as the Federal Republic of Novorossiya (FRN) [Data: Entities (7690)].

In these results, we can see both systems perform well – highlighting a class of query on which baseline RAG performs well. Let’s try a query that requires connecting the dots:

Query: “What has Novorossiya done?”

Baseline RAG:

The text does not provide specific information on what Novorossiya has done.

GraphRAG:

Novorossiya, a political movement in Ukraine, has been involved in a series of destructive activities, particularly targeting various entities in Ukraine [Entities (6494, 912)]. The movement has been linked to plans to destroy properties of several Ukrainian entities, including Rosen, the Odessa Canning Factory, the Odessa Regional Radio Television Transmission Center, and the National Television Company of Ukraine [Relationships (15207, 15208, 15209, 15210)]. In addition to these, Novorossiya has also been associated with plans to destroy properties of Roshen, the Odessa Conservatory Factory, the Odesa Regional Radio Television Transmission Center, and the National Telecommunications Company of Ukraine [Relationships (8329, 8330, 8331, 8332)]. Furthermore, Novorossiya has been implicated in plans to blow up ATMs of PrivatBank, the largest state bank in Ukraine [Relationships (8335, 8335)].

The Office of the General Prosecutor in Ukraine has reported on the creation of Novorossiya, indicating the government’s awareness and potential concern over the activities of this movement [Relationships (8264, 8264)]. The destructive activities of Novorossiya have had significant impacts on the targeted entities and have likely contributed to the ongoing tensions and conflicts in the region.

In summary, Novorossiya has been involved in a series of destructive activities targeting various entities in Ukraine. These activities have had significant impacts on the targeted entities and have likely contributed to the ongoing tensions and conflicts in the region. The Ukrainian government, through the Office of the General Prosecutor, has acknowledged the existence and activities of Novorossiya, indicating a level of concern over the movement’s actions.

Baseline RAG fails to answer this question. Looking at the source documents inserted into the context window (Figure 1), none of the text segments discuss Novorossiya, resulting in this failure.

Figure 1: Baseline RAG retrieved context [2]

In comparison, the GraphRAG approach discovered an entity in the query, Novorossiya. This allows the LLM to ground itself in the graph and results in a superior answer that contains provenance through links to the original supporting text. For example, Figure 2 below shows the exact content the LLM used for the LLM-generated statement, “Novorossiya has been implicated in plans to blow up ATMs.” We see the snippet from the raw source documents (after English translation) that the LLM used to support the assertion that a specific bank was a target for Novorossiya via the relationship that exists between the two entities in the graph.

Figure 2: GraphRAG provenance

By using the LLM-generated knowledge graph, GraphRAG vastly improves the “retrieval” portion of RAG, populating the context window with higher relevance content, resulting in better answers and capturing evidence provenance.

Being able to trust and verify LLM-generated results is always important. We care that the results are factually correct, coherent, and accurately represent content found in the source material. GraphRAG provides the provenance, or source grounding information, as it generates each response. It demonstrates that an answer is grounded in the dataset. Having the cited source for each assertion readily available also enables a human user to quickly and accurately audit the LLM’s output directly against the original source material.
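The provenance idea can be illustrated with a small sketch. It rests on several assumptions: relationships are stored as plain dictionaries, the ids (101, 102) and the placeholder source snippets are invented, and entity matching is a naive substring check rather than whatever GraphRAG does internally. The point is only that each retrieved statement carries the id of the graph record it came from, along with the source text behind that record.

```python
def find_entities(query, entity_names):
    """Naive entity linking: case-insensitive substring match (an assumption)."""
    q = query.lower()
    return {name for name in entity_names if name.lower() in q}

def retrieve_with_provenance(query, relationships):
    """Return statements about entities named in the query, each with a citation."""
    names = {r["source"] for r in relationships} | {r["target"] for r in relationships}
    hits = find_entities(query, names)
    context = []
    for r in relationships:
        if r["source"] in hits or r["target"] in hits:
            context.append({
                "statement": f'{r["source"]} {r["description"]} {r["target"]}',
                "citation": f'[Data: Relationships ({r["id"]})]',
                "source_text": r["source_text"],  # lets a human audit the claim
            })
    return context

# Hypothetical records; ids and snippets are placeholders, not real dataset values.
relationships = [
    {"id": 101, "source": "Novorossiya", "target": "PrivatBank",
     "description": "has been implicated in plans to blow up ATMs of",
     "source_text": "<translated snippet from the source article>"},
    {"id": 102, "source": "Entity A", "target": "Entity B",
     "description": "is linked to",
     "source_text": "<some other snippet>"},
]

for item in retrieve_with_provenance("What has Novorossiya done?", relationships):
    print(item["statement"], item["citation"])
```

Because the match is made on the entity rather than on the wording of the query, the record is found even though the query shares no other words with the source text.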

However, this isn’t all that’s possible using GraphRAG.

Whole dataset reasoning

Baseline RAG struggles with queries that require aggregation of information across the dataset to compose an answer. Queries such as “What are the top 5 themes in the data?” perform terribly because baseline RAG relies on a vector search of semantically similar text content within the dataset. There is nothing in the query to direct it to the correct information.

However, with GraphRAG we can answer such questions, because the structure of the LLM-generated knowledge graph tells us about the structure (and thus themes) of the dataset as a whole. This allows the private dataset to be organized into meaningful semantic clusters that are pre-summarized. The LLM uses these clusters to summarize these themes when responding to a user query.
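A global query of this kind can be sketched as follows. This is a simplified illustration, not the actual GraphRAG pipeline: the reports, their ids, and the `entity_count` field are invented, and ranking communities by size is an assumption made for this example. The sketch stops at building the prompt that an LLM would then summarize into themes.

```python
def select_reports(reports, top_k=5):
    """Pick the community reports covering the most entities (size ranking is an assumption)."""
    return sorted(reports, key=lambda r: r["entity_count"], reverse=True)[:top_k]

def build_global_prompt(query, reports, top_k=5):
    """Assemble pre-computed community summaries, with report ids, into one prompt."""
    chosen = select_reports(reports, top_k)
    lines = [f'{r["summary"]} [Data: Reports ({r["id"]})]' for r in chosen]
    return "Community summaries:\n" + "\n".join(lines) + f"\n\nQuestion: {query}"

# Placeholder reports; a real index would hold LLM-written summaries per community.
reports = [
    {"id": 1, "entity_count": 40, "summary": "Conflict and military activity."},
    {"id": 2, "entity_count": 5, "summary": "A local sports event."},
    {"id": 3, "entity_count": 25, "summary": "Political and governmental entities."},
]
print(build_global_prompt("What are the top 5 themes in the data?", reports, top_k=2))
```

Nothing in the query text has to match the documents here. The context is chosen from the structure of the graph, which is why a question about themes becomes answerable at all.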

We illustrate whole-dataset reasoning abilities by posing the following question to the two systems:

Query: “What are the top 5 themes in the data?”

Baseline RAG:

1. Improving the quality of life in cities and urban development.
2. The meeting between Vladimir Putin and the President of the United Arab Emirates, Mohammed bin Zayed Al Nahyan.
3. The National Rating of the Investment Climate in the Regions of Russia.
4. The state of the Russian economy and the growth of domestic brands.
5. The discussion on the meaning of life and the role of science in understanding it.

GraphRAG:

The data presents a wide range of themes, but the top five most prevalent themes can be identified as follows:

1. Conflict and Military Activity: A significant portion of the data revolves around the ongoing conflict in Ukraine, with numerous reports detailing military activities, occupations, and attacks by various entities. This theme includes the actions of the Ukrainian military, Russian forces, and various other military units and entities [Data: Reports (513, 241, 151, 493, 343, +more)].
2. Political and Governmental Entities: Many reports focus on political figures and governmental entities, highlighting their roles, relationships, and actions in the context of the current situation. This includes entities like the Kremlin, the President of Ukraine, the US State Department, and various regional administrations [Data: Reports (98, 493, 265, 129, 196, +more)].
3. Infrastructure and Environmental Concerns: Several reports highlight potential environmental impacts and infrastructure threats. These include potential environmental damage due to industrial activities, flooding crises, and threats to infrastructure due to military conflicts [Data: Reports (308, 154, 201, 348, 216, +more)].
4. Community Analysis and Threat Assessment: Several reports provide detailed analyses of specific communities, often centered around a particular location or entity. These analyses typically discuss the key entities within the community, their relationships, and any associated threats. Many reports also include assessments of potential threats associated with the entities or situations they discuss [Data: Reports (323, 596, 518, 583, 179, +more)].
5. Health, Humanitarian Concerns, and Emergency Situations: Some reports focus on health and humanitarian issues, such as the Shereshevsky-Turner Syndrome community and the potential risk of intestinal diseases in the Odessa region. The data also covers emergency situations, including natural disasters, accidents, and attacks [Data: Reports (204, 149, 372, 154, 532, +more)].
