GraphRAG Build Log

Taxonomy keeps a context graph from splitting the truth

A context graph only improves retrieval when every important thing has a stable identity. If the same service, concept, product, or team appears under two names, the graph starts storing partial truth in separate places.

Entity extraction finds mentions. Taxonomy and canonicalization decide what those mentions mean. That second step is where a GraphRAG system learns that CDAI and Content Discovery AI may refer to the same microservice or concept, while the same acronym in a different tenant might not. Without that identity layer, retrieval becomes fragmented: one query follows the acronym node, another follows the expanded-name node, and neither sees the complete evidence trail.

This matters because graph retrieval depends on connected evidence. Ownership, incidents, documents, deployments, dependencies, customers, and decisions all attach to nodes. If the graph stores duplicate nodes for the same thing, each node gets a partial neighborhood. The answer may look grounded, but it is grounded in only part of the system's memory.

Duplicate nodes

CDAI owns the rollout note. Content Discovery AI owns the incident report. Retrieval sees two weak neighborhoods instead of one strong source of truth.

Canonical node

Content Discovery AI is the canonical service. CDAI is an alias with provenance, confidence, and tenant scope attached.

Taxonomy is the contract for graph identity

A taxonomy defines which kinds of things deserve graph nodes and what rules govern their identity. It should answer practical questions: is this mention a service, a project, a feature, or a broad concept? Can two names be merged automatically? Does an acronym mean the same thing everywhere? Which source is authoritative when names disagree?

The taxonomy does not need to be huge. It needs to be specific enough to protect retrieval behavior. For ContextGraph, useful starting types are often service, product, feature, team, document, incident, customer, concept, repository, and dependency. Each type can have different merge rules because identity behaves differently across them.
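One way to make per-type merge rules concrete is a small policy table. The sketch below is illustrative only: the `MergePolicy` fields, the threshold values, and the authoritative source names are assumptions, not part of ContextGraph.

```python
from dataclasses import dataclass
from enum import Enum


class EntityType(Enum):
    SERVICE = "service"
    TEAM = "team"
    CUSTOMER = "customer"
    CONCEPT = "concept"


@dataclass(frozen=True)
class MergePolicy:
    auto_merge: bool            # may a rule merge without human review?
    min_confidence: float       # confidence a candidate must reach
    authoritative_source: str   # hypothetical system that wins when names disagree


# Placeholder values; real thresholds would come from evaluation.
MERGE_POLICIES = {
    EntityType.SERVICE: MergePolicy(True, 0.90, "service-catalog"),
    EntityType.TEAM: MergePolicy(True, 0.95, "org-directory"),
    EntityType.CUSTOMER: MergePolicy(False, 0.99, "crm"),
    EntityType.CONCEPT: MergePolicy(False, 0.80, "glossary"),
}


def can_auto_merge(entity_type: EntityType, confidence: float) -> bool:
    """Return True only when the type allows automatic merges and the bar is met."""
    policy = MERGE_POLICIES[entity_type]
    return policy.auto_merge and confidence >= policy.min_confidence
```

Keeping the policy per type means a customer merge can always go to review while a casing variant of a service name merges automatically.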

Detect candidate aliases

Collect acronyms, expanded names, casing variants, product codenames, renamed services, and repeated co-mentions in source text.

Compare context

Check owners, repositories, documents, incidents, upstream systems, downstream users, and tenant boundaries before proposing a merge.

Store the decision

Keep a canonical name, alias list, confidence, source evidence, reviewer history, and a way to split nodes when a merge turns out to be wrong.
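The "way to split nodes" part is easiest to get right when every edge remembers which original node it came from. The following is a minimal sketch under that assumption; `CanonicalNode`, `merge`, and `split` are hypothetical names, and a real graph store would persist this differently.

```python
from dataclasses import dataclass, field


@dataclass
class CanonicalNode:
    name: str
    # Each edge is (origin_node, relation, target); origin makes a merge reversible.
    edges: list[tuple[str, str, str]] = field(default_factory=list)
    aliases: set[str] = field(default_factory=set)
    history: list[str] = field(default_factory=list)


def merge(canonical: CanonicalNode, duplicate: CanonicalNode, reviewer: str) -> None:
    """Fold the duplicate into the canonical node and record who approved it."""
    canonical.edges.extend(duplicate.edges)
    canonical.aliases.add(duplicate.name)
    canonical.aliases |= duplicate.aliases
    canonical.history.append(f"merged {duplicate.name} (approved by {reviewer})")


def split(canonical: CanonicalNode, alias: str) -> CanonicalNode:
    """Undo a merge by moving the alias's original edges back to their own node."""
    restored = CanonicalNode(alias, [e for e in canonical.edges if e[0] == alias])
    canonical.edges = [e for e in canonical.edges if e[0] != alias]
    canonical.aliases.discard(alias)
    canonical.history.append(f"split {alias}")
    return restored


service = CanonicalNode(
    "Content Discovery AI",
    [("Content Discovery AI", "HAS_INCIDENT", "incident report")],
)
acronym_node = CanonicalNode("CDAI", [("CDAI", "HAS_DOC", "rollout note")])
merge(service, acronym_node, reviewer="reviewer-1")
```

Calling `split(service, "CDAI")` later returns a node holding only the edges that originally belonged to CDAI, and the history list keeps both decisions.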

The CDAI example

Imagine a set of engineering documents where one team writes "CDAI", another writes "Content Discovery AI", and an incident template uses "content-discovery-ai" as the repository slug. A plain extraction pass may produce three entities. A context graph should treat those as candidates for one canonical node, then verify the merge with surrounding evidence.

Good evidence might include the same owning team, links to the same repository, shared deployment names, repeated mentions in the same documents, matching incident IDs, and dependency edges pointing to the same upstream services. Weak evidence might be acronym similarity alone. Dangerous evidence might be a shared acronym used by two unrelated teams.
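The three tiers of evidence can be expressed as a small classifier. This is a sketch, not a tuned scorer: the signal names and the `min_strong` default of two are assumptions chosen to mirror the paragraph above.

```python
STRONG_SIGNALS = {
    "same_owner",
    "same_repository",
    "same_deployment",
    "co_mentioned",
    "matching_incident_ids",
    "same_upstream_dependencies",
}
WEAK_SIGNALS = {"acronym_similarity"}
BLOCKING_SIGNALS = {"acronym_used_by_unrelated_teams"}


def assess_merge(signals: set[str], min_strong: int = 2) -> str:
    """Classify a merge candidate as 'block', 'merge', 'review', or 'separate'."""
    if signals & BLOCKING_SIGNALS:
        # Dangerous evidence overrides everything else.
        return "block"
    strong = len(signals & STRONG_SIGNALS)
    if strong >= min_strong:
        return "merge"
    if strong or signals & WEAK_SIGNALS:
        # Some evidence, but not enough to merge without a second look.
        return "review"
    return "separate"
```

Acronym similarity on its own only ever reaches "review", which matches the idea that it is weak evidence.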

Alias resolution flow

flowchart LR
    Mention["Mention: CDAI"] --> Candidate["Candidate entity"]
    Candidate --> TypeCheck{"Type matches existing service?"}
    TypeCheck -- "No" --> Separate["Keep separate node"]
    TypeCheck -- "Yes" --> AliasCheck{"Alias or acronym match?"}
    AliasCheck -- "No" --> Review["Human or model-assisted review"]
    AliasCheck -- "Yes" --> Evidence["Compare owner, repo, runbook, incidents"]
    Evidence --> Decision{"Enough scoped evidence?"}
    Decision -- "Yes" --> Merge["Merge into Content Discovery AI"]
    Decision -- "No" --> Separate
    Merge --> Record["Store alias, provenance, confidence, scope"]
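The flow above translates fairly directly into a function. The sketch below assumes hypothetical `Candidate` and `Service` records, treats "alias or acronym match" as membership in a known-names set, and counts matching context fields with an assumed threshold of two as "enough scoped evidence".

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    mention: str
    entity_type: str
    context: dict[str, str]  # e.g. owner, repo, runbook


@dataclass
class Service:
    canonical: str
    entity_type: str
    aliases: set[str]
    context: dict[str, str]


def resolve(candidate: Candidate, existing: Service, min_matches: int = 2) -> str:
    """Follow the flowchart branches and return 'separate', 'review', or 'merge'."""
    if candidate.entity_type != existing.entity_type:
        return "separate"
    known_names = {existing.canonical.lower()} | {a.lower() for a in existing.aliases}
    if candidate.mention.lower() not in known_names:
        return "review"
    matches = sum(
        1 for key, value in candidate.context.items()
        if existing.context.get(key) == value
    )
    return "merge" if matches >= min_matches else "separate"


service = Service(
    canonical="Content Discovery AI",
    entity_type="service",
    aliases={"CDAI"},
    context={"owner": "Search Platform", "repo": "content-discovery-ai"},
)
mention = Candidate(
    "CDAI", "service",
    {"owner": "Search Platform", "repo": "content-discovery-ai"},
)
decision = resolve(mention, service)
```

A "merge" result would still need the final step of the flow: storing the alias with its provenance, confidence, and scope.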

Sample merge record

The Search Platform team migrated recommendations traffic from CDAI to the content-discovery-ai deployment after the Content Discovery AI latency incident. The CDAI dashboard, content-discovery-ai repository, and Content Discovery AI runbook all list Search Platform as owner.

canonical: Content Discovery AI
alias: CDAI
alias: content-discovery-ai
type: Service
owner: Search Platform
evidence: dashboard + repo + runbook

Why duplicate nodes damage retrieval

Duplicate nodes reduce recall because each name has only part of the evidence. A query about "CDAI incidents" may miss runbooks filed under "Content Discovery AI". A query about "Content Discovery AI owners" may miss dashboard notes that only use the acronym. The graph still returns connected evidence, but each connection hangs off one half of a split identity, so no single traversal reaches all of it.
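A toy example makes the recall loss visible. The graph here is just a dictionary from node name to attached evidence, which is a simplification for illustration; the function names are hypothetical.

```python
# Evidence attached to each node when the two names are stored separately.
split_graph = {
    "CDAI": {"rollout note", "dashboard notes"},
    "Content Discovery AI": {"incident report", "runbook"},
}

# Alias table produced by canonicalization.
alias_to_canonical = {
    "CDAI": "Content Discovery AI",
    "Content Discovery AI": "Content Discovery AI",
}


def canonicalize(graph: dict[str, set[str]], aliases: dict[str, str]) -> dict[str, set[str]]:
    """Build a new graph where each alias's evidence is folded into its canonical node."""
    merged: dict[str, set[str]] = {}
    for name, evidence in graph.items():
        merged.setdefault(aliases.get(name, name), set()).update(evidence)
    return merged


def retrieve(graph: dict[str, set[str]], name: str, aliases=None) -> set[str]:
    """Look up evidence for a name, resolving it through the alias table if given."""
    key = (aliases or {}).get(name, name)
    return graph.get(key, set())


before = retrieve(split_graph, "CDAI")
merged_graph = canonicalize(split_graph, alias_to_canonical)
after = retrieve(merged_graph, "CDAI", alias_to_canonical)
```

Here `before` holds two pieces of evidence and `after` holds all four, because the query for the acronym now lands on the canonical node.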

Duplicates also make explanations less trustworthy. GraphRAG systems often need to show why a document was retrieved. If a response says it followed the CDAI node but ignores the Content Discovery AI node, the explanation exposes the underlying identity problem. Users learn that the graph is organized by spelling, not meaning.

Why bad merges are just as harmful

The opposite failure is over-merging. If CDAI means one service in Search Platform and another concept in a customer-success workspace, merging both into one node will pollute retrieval. The graph may connect unrelated incidents, teams, and documents, so the model receives context that looks well connected but is wrong.

That is why every merge needs scope. Alias rules should know where they apply: globally, inside a tenant, inside a repository, inside a product line, or only after a reviewer approves the candidate. Scope turns canonicalization from a brittle dictionary into an auditable identity system.
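Scoped alias rules can be sketched as a lookup keyed by scope and alias, tried from the most specific scope outward. The scope strings and the placeholder name for the customer-success concept below are assumptions made for the example.

```python
from typing import Optional

# (scope, lowercased alias) -> canonical node. All entries are illustrative.
ALIAS_RULES = {
    ("tenant:search-platform", "cdai"): "Content Discovery AI",
    ("tenant:customer-success", "cdai"): "CDAI (customer-success concept)",
    ("global", "content-discovery-ai"): "Content Discovery AI",
}


def resolve_alias(alias: str, scopes: list[str]) -> Optional[str]:
    """Try the caller's scopes in order, then 'global'; None means no rule applies."""
    key = alias.lower()
    for scope in [*scopes, "global"]:
        canonical = ALIAS_RULES.get((scope, key))
        if canonical is not None:
            return canonical
    return None
```

With these rules, "CDAI" resolves to different nodes in the two tenants and to nothing at all without a tenant, while the repository slug resolves everywhere.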

Strong merge signals

Same owner, same repository, same source-system ID, same deployment, same runbook, reciprocal links, repeated co-mentions, and matching dependency neighborhoods.

Merge safeguards

Tenant scope, entity type checks, confidence thresholds, human review for high-impact nodes, source provenance, merge history, and reversible splits.

A practical canonicalization loop

Start with deterministic rules for obvious aliases: casing, punctuation, repository slugs, known acronyms, and names from authoritative systems. Then add model-assisted candidate discovery for messier language. The model can propose that two names might refer to the same thing, but the graph layer should require evidence before the merge affects retrieval.
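The deterministic layer can be quite small. The sketch below covers casing, punctuation, slugs, and simple acronyms; it assumes ASCII names, and the function names are hypothetical. It only proposes candidates, so evidence checks still decide the merge.

```python
import re


def normalize_key(name: str) -> str:
    """Lowercase and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def acronym(name: str) -> str:
    """Take the initial of each word; tokens already in all caps (like 'AI') stay whole."""
    tokens = re.split(r"[\s\-_]+", name.strip())
    return "".join(t if t.isupper() else t[0].upper() for t in tokens if t)


def is_candidate_alias(a: str, b: str) -> bool:
    """True when two names match after normalization or one is the other's acronym."""
    key_a, key_b = normalize_key(a), normalize_key(b)
    if key_a == key_b:
        return True
    return key_a == normalize_key(acronym(b)) or key_b == normalize_key(acronym(a))
```

One limitation worth noting: "CDAI" and "content-discovery-ai" do not match each other directly, because the slug has lost the casing that marks "AI" as a unit. Both match "Content Discovery AI", so they still end up as candidates for the same canonical node.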

Store aliases as first-class facts. An alias should have provenance, not just a string in an array. The system should know where "CDAI" came from, when it was last observed, which canonical node it maps to, who approved it if approval was needed, and what confidence or rule created the mapping.
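As a rough sketch, an alias stored as a fact rather than a string might look like the record below. The field names follow the list in the paragraph above; the staleness check and all sample values are assumptions added for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class AliasFact:
    alias: str
    canonical_id: str
    source: str                        # where the alias was observed
    last_observed: date
    created_by: str                    # rule name or model that produced the mapping
    confidence: float
    approved_by: Optional[str] = None  # set only when approval was required

    def is_stale(self, today: date, max_age_days: int = 365) -> bool:
        """Flag aliases that have not been seen in source text for a while."""
        return (today - self.last_observed).days > max_age_days


fact = AliasFact(
    alias="CDAI",
    canonical_id="service:content-discovery-ai",  # hypothetical ID format
    source="dashboard",
    last_observed=date(2024, 1, 15),              # made-up date
    created_by="rule:acronym-match",
    confidence=0.92,
)
```

Because the record is frozen, updating an alias means writing a new fact, which keeps the history of how the mapping changed.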

Finally, evaluate canonicalization with retrieval tasks, not only merge accuracy. The important question is whether users get better connected evidence. If alias handling increases recall without introducing unrelated context, the taxonomy is doing its job.
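A retrieval-level check can be as simple as two numbers per query: how much of the relevant evidence came back, and how much of what came back was unrelated. The sketch below assumes a hand-labeled set of relevant evidence per query; "contamination" is a name chosen here, not a standard metric.

```python
def evaluate(retrieved: set[str], relevant: set[str]) -> dict[str, float]:
    """Score one query: recall over relevant items, contamination over retrieved items."""
    hits = retrieved & relevant
    recall = len(hits) / len(relevant) if relevant else 1.0
    contamination = len(retrieved - relevant) / len(retrieved) if retrieved else 0.0
    return {"recall": recall, "contamination": contamination}


relevant = {"rollout note", "dashboard notes", "incident report", "runbook"}

# Duplicate nodes: only the acronym's neighborhood is retrieved.
split_result = evaluate({"rollout note", "dashboard notes"}, relevant)

# Correct merge: the full neighborhood is retrieved.
merged_result = evaluate(set(relevant), relevant)

# Over-merge: an unrelated document (hypothetical) leaks in from another tenant.
over_merged_result = evaluate(relevant | {"customer-success playbook"}, relevant)
```

A good canonicalization change moves recall up while contamination stays flat; an over-merge shows up as contamination rising even though recall looks perfect.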