GraphRAG evaluation: how do you know it works?

A GraphRAG system should not be judged only by whether the final answer sounds plausible. Evaluation needs to inspect retrieval quality, graph traversal, citations, permissions, entity coverage, and the path that connected the user's question to the evidence.

Plain RAG evaluation often starts with answer quality and source citation. Those still matter, but GraphRAG adds another moving part: the graph can change which evidence is even considered. If the graph expands through the wrong node, merges two different services, ignores a useful alias, or follows a forbidden edge, the generated answer may look polished while the retrieval path is broken.

The evaluation goal is not only "did the model answer correctly?" It is also "did the system find the right entities, walk the right relationships, respect access boundaries, assemble grounded context, and explain why those facts were used?"

GraphRAG evaluation loop

flowchart LR
    Query["Evaluation query"] --> Expected["Expected answer and evidence"]
    Query --> Retrieve["Run GraphRAG retrieval"]
    Retrieve --> Trace["Capture trace: entities, aliases, paths, ACLs"]
    Trace --> Context["Selected context and citations"]
    Context --> Answer["Generated answer"]
    Expected --> Score["Score answer, evidence, path, permissions"]
    Answer --> Score
    Score --> Regression{"Regression found?"}
    Regression -- "Yes" --> Fix["Fix extraction, taxonomy, ranking, or policy"]
    Regression -- "No" --> Monitor["Monitor production traces"]
    Fix --> Retrieve

RAG evaluation

Measures whether retrieved chunks answer the question, whether citations are useful, and whether the final response stays grounded in the selected text.

GraphRAG evaluation

Measures all of that plus entity linking, alias resolution, traversal paths, relationship usefulness, permission filtering, and graph freshness.

Start with answer correctness, but do not stop there

Answer correctness is still the visible outcome. Users care whether the answer is accurate, complete, and useful. A good evaluation set should include questions with known answers, expected evidence, acceptable answer variants, and explicit failure cases where the system should say it does not know.

The risk is that answer scoring alone hides retrieval defects. A model can produce a correct answer from memorized knowledge, from a lucky chunk, or from a partial path that will fail on the next similar question. GraphRAG needs lower-level measurements that inspect how the answer was built.

Retrieval set quality

Did the system retrieve the documents, chunks, entities, and relationships that a human would expect for this question?

Graph path quality

Were the traversed nodes and edges relevant, authorized, fresh, and useful for expanding beyond semantic similarity?

Answer grounding

Does every important claim map back to allowed evidence, and do citations point to sources that actually support the answer?

Evaluate the retrieval trace

A ContextGraph answer should produce a trace: seed entities, aliases resolved, graph paths considered, permission decisions, candidate evidence, reranking scores, selected context, and final citations. That trace is the best place to see whether GraphRAG is helping or merely adding complexity.

Trace evaluation asks concrete questions. Did the query mention CDAI and resolve it to Content Discovery AI? Did the traversal include the owning team and the most recent incident? Did it avoid a stale design document? Did it exclude customer-restricted evidence for a user who cannot see it?
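The questions above can be turned into mechanical checks once the trace is available as structured data. The following is a minimal sketch, not a real ContextGraph API: the `RetrievalTrace` fields, the `check_trace` helper, and the source identifiers such as `design-doc-old` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RetrievalTrace:
    """Hypothetical trace shape; field names are illustrative only."""
    seed_entities: list[str] = field(default_factory=list)
    aliases_resolved: dict[str, str] = field(default_factory=dict)
    paths: list[list[str]] = field(default_factory=list)
    denied_sources: set[str] = field(default_factory=set)
    selected_context: list[str] = field(default_factory=list)
    citations: list[str] = field(default_factory=list)


def check_trace(trace, expected_aliases, required_nodes, stale_sources):
    """Return a list of human-readable problems found in the trace."""
    problems = []
    # Did each alias resolve to the expected canonical entity?
    for alias, canonical in expected_aliases.items():
        if trace.aliases_resolved.get(alias) != canonical:
            problems.append(f"alias {alias!r} did not resolve to {canonical!r}")
    # Did the traversal reach every node a reviewer expects?
    visited = {node for path in trace.paths for node in path}
    for node in sorted(required_nodes - visited):
        problems.append(f"traversal never reached {node!r}")
    # Did stale or restricted sources end up in context or citations?
    used = set(trace.selected_context) | set(trace.citations)
    for source in sorted(used & stale_sources):
        problems.append(f"stale source used: {source!r}")
    for source in sorted(used & trace.denied_sources):
        problems.append(f"restricted source leaked: {source!r}")
    return problems


trace = RetrievalTrace(
    seed_entities=["CDAI"],
    aliases_resolved={"CDAI": "Content Discovery AI"},
    paths=[["Content Discovery AI", "Search Platform"]],
    denied_sources={"customer-notes-restricted"},
    selected_context=["incident-report", "design-doc-old"],
    citations=["incident-report"],
)
problems = check_trace(
    trace,
    expected_aliases={"CDAI": "Content Discovery AI"},
    required_nodes={"Search Platform", "May rollout incident"},
    stale_sources={"design-doc-old"},
)
for problem in problems:
    print(problem)
```

In this made-up trace the alias resolves correctly, but the traversal never reaches the incident node and a stale design document is selected, so two problems are reported.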

Useful metrics

A practical evaluation suite mixes automatic metrics, curated review, and regression tests. The exact scoring depends on the product, but the categories are stable enough to start from.

Evidence recall

The expected source chunks, documents, nodes, and edges appear in the candidate set before final context selection.

Evidence precision

Retrieved context is relevant enough that the model is not forced to sift through noisy neighbors or graph hubs.

Path relevance

Traversed relationships are meaningful for the question, not just connected in a mechanically valid way.

Citation support

Each important claim is backed by cited evidence that directly supports it, rather than a nearby document that only mentions similar terms.

Permission safety

Unauthorized nodes, edges, aliases, documents, and explanations cannot affect retrieval, ranking, final context, or user-visible traces.

Freshness

The selected evidence reflects the current graph state and does not prefer stale relationships over newer source material.
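Several of these categories reduce to set comparisons once expected and observed evidence share identifiers. The sketch below assumes evidence items (chunks, nodes, edges) are identified by plain strings; the function names and the sample identifiers are illustrative, and real scoring would likely weight items or use reviewer labels rather than exact matches.

```python
def evidence_recall(expected: set[str], candidates: set[str]) -> float:
    """Share of expected evidence items present in the candidate set."""
    if not expected:
        return 1.0
    return len(expected & candidates) / len(expected)


def evidence_precision(relevant: set[str], selected: set[str]) -> float:
    """Share of selected context items that are marked relevant."""
    if not selected:
        return 0.0
    return len(relevant & selected) / len(selected)


def permission_violations(allowed: set[str], touched: set[str]) -> set[str]:
    """Items that took part in retrieval but are not visible to the user."""
    return touched - allowed


expected = {"chunk-1", "chunk-2", "node-owner", "edge-owns"}
candidates = {"chunk-1", "node-owner", "edge-owns", "chunk-9"}
selected = {"chunk-1", "chunk-9"}

recall = evidence_recall(expected, candidates)      # 3 of 4 expected items found
precision = evidence_precision(expected, selected)  # 1 of 2 selected items relevant
leaks = permission_violations({"chunk-1", "node-owner", "edge-owns"}, candidates)
```

Path relevance and citation support are harder to automate this way, because they depend on whether a relationship or source actually supports the claim, not only on whether it appears.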

Build an evaluation set that exercises the graph

Do not evaluate GraphRAG only with simple lookup questions. If a question can be answered by one chunk found through vector search, it does not prove much about the graph. Include questions that require aliases, relationships, ownership, temporal order, permissions, and multi-hop evidence.

The strongest examples are usually drawn from real user behavior: incidents where support had to connect a service to an owner, product questions where a codename had to resolve to a launch name, compliance questions where access boundaries mattered, or architecture questions where dependencies changed over time.

Sample evaluation case

Question: "Why did CDAI stop returning recommendations for the EU tenant after the May rollout, and who owns the fix?"

alias: CDAI -> Content Discovery AI
service: Content Discovery AI / expected
incident: May rollout recommendation outage / expected
tenant: EU tenant / permission checked
owner: Search Platform / expected
blocked: restricted customer notes / excluded
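One way to make a case like this executable is to store it as data and score a captured trace against it. The layout below is an assumption, not a prescribed schema: the dictionary keys, the `score_case` helper, and the shape of the trace dictionary are hypothetical, and only the question, alias, and expected values come from the case above.

```python
CASE = {
    "question": (
        "Why did CDAI stop returning recommendations for the EU tenant "
        "after the May rollout, and who owns the fix?"
    ),
    "alias": {"CDAI": "Content Discovery AI"},
    "expected": {
        "service": "Content Discovery AI",
        "incident": "May rollout recommendation outage",
        "owner": "Search Platform",
    },
    "permission_checked": {"tenant": "EU tenant"},
    "excluded": {"blocked": "restricted customer notes"},
}


def score_case(case, trace):
    """Map each field of the case to True (pass) or False (fail)."""
    results = {}
    for alias, canonical in case["alias"].items():
        results[f"alias:{alias}"] = trace["aliases"].get(alias) == canonical
    for name, value in case["expected"].items():
        results[name] = value in trace["entities"]
    for name, value in case["permission_checked"].items():
        results[name] = value in trace["permission_checks"]
    # Excluded items pass only when they are absent from the evidence.
    for name, value in case["excluded"].items():
        results[name] = value not in trace["evidence"]
    return results


# Made-up trace in which the owning team was never reached.
trace = {
    "aliases": {"CDAI": "Content Discovery AI"},
    "entities": ["Content Discovery AI", "May rollout recommendation outage"],
    "permission_checks": ["EU tenant"],
    "evidence": ["incident postmortem"],
}
results = score_case(CASE, trace)
failed = [name for name, ok in results.items() if not ok]
print(failed)
```

Scoring each field separately keeps a failure specific: here the answer could still sound plausible, but the report shows that the owner was never retrieved.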

Test failure modes directly

A mature evaluation suite should contain questions designed to make unsafe or sloppy systems fail. Ask about ambiguous acronyms that should not merge. Ask about public services connected by private edges. Ask about stale decisions that were superseded. Ask about over-connected platform nodes that attract irrelevant evidence. Ask about missing aliases where the answer should be incomplete until taxonomy is fixed.

These cases are valuable because they make hidden quality problems visible before users find them. They also give engineers a clear way to judge whether a change to extraction, canonicalization, graph storage, or ranking improved the system or only moved the problem around.

Run retrieval regression tests

Track which entities, paths, chunks, and citations appear for important queries before and after graph or model changes.

Review negative cases

Keep tests where the correct behavior is to refuse, ask for more scope, or avoid a tempting but unauthorized path.

Track drift over time

Watch for stale edges, changing aliases, new teams, renamed services, and source documents whose permissions or truth value changed.
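A retrieval regression test can be as simple as diffing the trace for a pinned query before and after a change. This sketch assumes each trace has already been reduced to named sets of identifiers; `diff_traces` and the category names are hypothetical, and a real suite would also need to decide which differences count as regressions and which are acceptable improvements.

```python
def diff_traces(before: dict[str, set[str]], after: dict[str, set[str]]) -> dict:
    """Report items that disappeared or appeared, per trace category."""
    report = {}
    for category in sorted(set(before) | set(after)):
        old = before.get(category, set())
        new = after.get(category, set())
        lost, gained = old - new, new - old
        if lost or gained:
            report[category] = {"lost": sorted(lost), "gained": sorted(gained)}
    return report


# Made-up traces for the same query, before and after a graph change.
baseline = {
    "entities": {"Content Discovery AI", "Search Platform"},
    "citations": {"incident-report"},
}
candidate = {
    "entities": {"Content Discovery AI"},
    "citations": {"incident-report", "design-doc-old"},
}
report = diff_traces(baseline, candidate)
print(report)
```

In this example the owning team drops out of the entity set and an older design document appears among the citations, both of which are worth a reviewer's attention even if the final answer text did not change.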

Human review still matters

Automated scoring helps catch regressions quickly, but GraphRAG quality also depends on human judgment. Subject-matter reviewers can tell whether a path is meaningful, whether a relationship is too weak to justify expansion, whether a citation is convincing, and whether an answer is technically correct but operationally misleading.

The best review UI shows the answer beside the graph trace. Reviewers should be able to mark missing entities, bad aliases, irrelevant edges, weak citations, stale facts, and permission concerns without reading raw logs.

What to monitor in production

Production evaluation should continue after launch. Log retrieval traces, graph versions, selected evidence, permission filters, citation coverage, user feedback, and answer abstentions. When users downvote an answer, the system should preserve enough retrieval detail to reproduce the failure.

Monitoring should also surface graph-level health: duplicate canonical nodes, orphan entities, high-degree hubs, stale relationships, extraction confidence drops, and sources that are frequently retrieved but rarely cited.
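Some of these graph-level signals can be computed directly from a node and edge export. The sketch below covers only duplicate canonical nodes, orphan entities, and high-degree hubs, and it makes simplifying assumptions: duplicates are detected by case-insensitive name match, edges are undirected for degree counting, and the `hub_degree` threshold, node IDs, and node names are arbitrary illustrative values.

```python
from collections import Counter, defaultdict


def graph_health(
    nodes: dict[str, str],
    edges: list[tuple[str, str]],
    hub_degree: int = 3,
) -> dict:
    """nodes maps node id -> canonical name; edges are (source id, target id)."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1

    # Group node ids by normalized name to spot likely duplicates.
    by_name = defaultdict(list)
    for node_id, name in nodes.items():
        by_name[name.strip().lower()].append(node_id)

    return {
        "duplicates": sorted(sorted(ids) for ids in by_name.values() if len(ids) > 1),
        "orphans": sorted(n for n in nodes if degree[n] == 0),
        "hubs": sorted(n for n in nodes if degree[n] >= hub_degree),
    }


nodes = {
    "n1": "Content Discovery AI",
    "n2": "content discovery ai",
    "n3": "Search Platform",
    "n4": "Legacy Indexer",
    "n5": "Platform",
}
edges = [("n1", "n3"), ("n1", "n5"), ("n2", "n5"), ("n3", "n5")]
health = graph_health(nodes, edges)
print(health)
```

Tracking these counts per graph version makes it easier to tell whether an extraction or canonicalization change introduced new duplicates or hubs.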