GraphRAG Build Log

Entity extraction is the first hard requirement for GraphRAG

A graph cannot improve retrieval until the system knows what durable things exist in the text. Entity extraction turns raw passages into typed nodes that can be linked, normalized, permissioned, updated, and retrieved across sources.

Plain RAG can survive with chunks alone. GraphRAG cannot. The graph needs stable anchors: products, people, teams, projects, documents, systems, incidents, decisions, concepts, and events. Without extracted entities, the graph is just metadata around text. With entities, retrieval can move from "find similar passages" to "find the related evidence around this thing."

Entity extraction is mandatory because it creates the identity layer. It tells the pipeline that "Graph RAG", "GraphRAG", and "graph retrieval augmented generation" may refer to the same concept; that "Atlas" may be a product in one tenant and a project codename in another; and that a document mention should be connected back to source provenance instead of floating as an isolated chunk.

Recognize candidates

Detect mentions in source text and assign the best initial type.

Normalize identity

Map aliases, casing, abbreviations, and near-duplicates to canonical entities.

Link evidence

Keep spans, confidence, source IDs, timestamps, permissions, and extraction model versions.
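
The three steps above can be sketched as one evidence-carrying record plus a tenant-scoped alias lookup. This is a minimal illustration, not a reference implementation: `Mention`, `normalize_key`, `ALIASES`, `canonicalize`, the tenant IDs, and the entity IDs are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Mention:
    """One extracted candidate plus the evidence needed to audit it later."""
    text: str
    entity_type: str
    start: int              # character offsets into the source chunk
    end: int
    confidence: float
    source_id: str
    tenant_id: str
    extracted_at: str       # ISO 8601 timestamp
    model_version: str
    allowed_groups: FrozenSet[str] = frozenset()  # permissions copied from the source


def normalize_key(surface: str) -> str:
    """Fold casing, spacing, and punctuation so near-duplicate spellings collide."""
    return "".join(ch for ch in surface.lower() if ch.isalnum())


# Hypothetical alias table keyed by (tenant, normalized surface form).
# Scoping by tenant keeps "Atlas" the product apart from "Atlas" the project.
ALIASES = {
    ("tenant-a", normalize_key("GraphRAG")): "concept:graphrag",
    ("tenant-a", normalize_key("graph retrieval augmented generation")): "concept:graphrag",
    ("tenant-a", normalize_key("Atlas")): "product:atlas",
    ("tenant-b", normalize_key("Atlas")): "project:atlas",
}


def canonicalize(mention: Mention) -> Optional[str]:
    """Return the canonical entity ID, or None when the mention is still unknown."""
    return ALIASES.get((mention.tenant_id, normalize_key(mention.text)))


mention = Mention(
    text="Graph RAG", entity_type="concept", start=0, end=9, confidence=0.91,
    source_id="doc-1", tenant_id="tenant-a",
    extracted_at="2024-05-01T00:00:00Z", model_version="extractor-v1",
)
print(canonicalize(mention))
```

Returning `None` for unknown mentions, rather than creating a node on the spot, leaves room for a review step before a new canonical entity is added.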

Entity extraction to graph update

flowchart LR
    Raw["Raw document"] --> Chunk["Chunk and preserve spans"]
    Chunk --> Extract["Entity extraction"]
    Extract --> Type["Type assignment"]
    Type --> Canonical["Canonicalization and alias checks"]
    Canonical --> Evidence["Evidence links and permissions"]
    Evidence --> Relations["Relation extraction"]
    Relations --> Graph["Graph update"]
    Graph --> Eval["Retrieval evaluation"]

Why GLiNER-style extraction is a good fit

A model like GLiNER is useful because entity extraction in GraphRAG is rarely a fixed public NER problem. Person, organization, and location are usually not enough. You need domain-specific labels: feature, service, customer, incident, dependency, decision, repository, experiment, regulation, or whatever else your knowledge graph needs to retrieve against.

The practical advantage is label flexibility. Instead of training a bespoke NER model for every taxonomy change, you can provide the entity types you care about and use the model to detect candidates across messy internal text. That makes it a strong first extraction layer before rule checks, canonicalization, confidence thresholds, and human review for sensitive merges.

The right pattern is not to blindly trust the model. Use it to produce typed candidates with spans and confidence, then validate them against dictionaries, existing graph nodes, tenant boundaries, and relation constraints. GraphRAG needs extraction that is useful and auditable, not just extraction that is impressive.
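
A validation pass of this kind might look like the sketch below. It assumes each candidate arrives as a dict with `text`, `label`, `score`, `start`, and `end` keys. That shape is an assumption modeled on what span-based extractors commonly return, so check it against your extractor's actual output. `ALLOWED_TYPES`, `MIN_SCORE`, `KNOWN`, and `validate` are hypothetical names, and the sketch covers only type, confidence, span, and tenant-scoped dictionary checks, not relation constraints.

```python
from typing import Dict, List, Set, Tuple

ALLOWED_TYPES: Set[str] = {"product", "person", "team", "system", "project", "document"}
MIN_SCORE = 0.5  # assumed threshold; tune it on your own samples

# Hypothetical per-tenant dictionary: (tenant, lowercased name) -> expected type.
KNOWN: Dict[Tuple[str, str], str] = {
    ("tenant-a", "atlas search"): "product",
    ("tenant-a", "mira chen"): "person",
}


def validate(candidates: List[dict], text: str, tenant_id: str):
    """Split candidates into accepted and rejected, recording why each was rejected."""
    accepted, rejected = [], []
    for cand in candidates:
        reason = None
        if cand["label"] not in ALLOWED_TYPES:
            reason = "unknown type"
        elif cand["score"] < MIN_SCORE:
            reason = "low confidence"
        elif text[cand["start"]:cand["end"]] != cand["text"]:
            reason = "span mismatch"  # offsets must point at the quoted text
        else:
            known_type = KNOWN.get((tenant_id, cand["text"].lower()))
            if known_type is not None and known_type != cand["label"]:
                reason = "type conflicts with dictionary"
        if reason is None:
            accepted.append(cand)
        else:
            rejected.append({**cand, "reason": reason})
    return accepted, rejected


text = "Mira Chen linked the issue to Atlas Search."


def make(surface: str, label: str, score: float) -> dict:
    """Build a hand-written candidate with offsets computed from the text."""
    start = text.index(surface)
    return {"text": surface, "label": label, "score": score,
            "start": start, "end": start + len(surface)}


candidates = [
    make("Mira Chen", "person", 0.93),
    make("Atlas Search", "team", 0.81),   # conflicts with the tenant dictionary
    make("issue", "system", 0.22),        # below the confidence threshold
]
accepted, rejected = validate(candidates, text, "tenant-a")
print([c["text"] for c in accepted])
print([(c["text"], c["reason"]) for c in rejected])
```

Keeping the rejection reason alongside each candidate is what makes the step auditable: reviewers can see why a candidate was dropped, not only that it was.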

What good sample data looks like

Your sample set should look like the documents users will actually query. Do not evaluate entity extraction only on clean paragraphs. Include source types with different writing styles: documentation, tickets, Slack-style notes, meeting summaries, design docs, incident reports, changelogs, support cases, and query logs when you have them.

A good sample should contain repeated mentions of the same entity under different names, ambiguous names that should not be merged, nested concepts, acronyms, product codenames, dates, ownership references, and low-context fragments. You want to test whether the extractor can find useful candidates and whether the rest of the pipeline can reject bad ones.

Sample paragraph

During the May rollout, Atlas Search started routing enterprise queries through the GraphRAG evaluator after the EU Support team reported missing citations in tenant-restricted design docs. Mira Chen linked the issue to the context merger introduced in cg-doc-8432, while the Phoenix migration remained blocked by the legacy ACL sync job.

Atlas Search: Product
GraphRAG evaluator: System
EU Support: Team
tenant-restricted design docs: DocumentSet
Mira Chen: Person
cg-doc-8432: Document
Phoenix migration: Project
legacy ACL sync job: System
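
The sample above can double as a small gold set for checking an extractor. The sketch below stores the paragraph and its labels, derives character spans so the annotations stay tied to the text, and scores a prediction set by exact (surface, type) match. `GOLD`, `locate_spans`, and `score` are hypothetical names, and `PREDICTED` is hand-written illustrative data, not the output of any real model.

```python
from typing import Dict, Set, Tuple

PARAGRAPH = (
    "During the May rollout, Atlas Search started routing enterprise queries "
    "through the GraphRAG evaluator after the EU Support team reported missing "
    "citations in tenant-restricted design docs. Mira Chen linked the issue to "
    "the context merger introduced in cg-doc-8432, while the Phoenix migration "
    "remained blocked by the legacy ACL sync job."
)

# Gold annotations as (surface form, entity type), copied from the sample.
GOLD: Set[Tuple[str, str]] = {
    ("Atlas Search", "Product"),
    ("GraphRAG evaluator", "System"),
    ("EU Support", "Team"),
    ("tenant-restricted design docs", "DocumentSet"),
    ("Mira Chen", "Person"),
    ("cg-doc-8432", "Document"),
    ("Phoenix migration", "Project"),
    ("legacy ACL sync job", "System"),
}


def locate_spans(text: str, gold: Set[Tuple[str, str]]) -> Dict[str, Tuple[int, int]]:
    """Find the first character span of every gold surface form; fail loudly if one is missing."""
    spans = {}
    for surface, _entity_type in sorted(gold):
        start = text.find(surface)
        if start == -1:
            raise ValueError(f"gold entity not found in text: {surface!r}")
        spans[surface] = (start, start + len(surface))
    return spans


def score(predicted: Set[Tuple[str, str]], gold: Set[Tuple[str, str]]) -> Tuple[float, float]:
    """Exact-match precision and recall over (surface, type) pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hand-written predictions for illustration only (assumed, not real model output).
PREDICTED = {
    ("Atlas Search", "Product"),
    ("Mira Chen", "Person"),
    ("Phoenix migration", "Project"),
    ("May", "Event"),
}

spans = locate_spans(PARAGRAPH, GOLD)
precision, recall = score(PREDICTED, GOLD)
print(spans["Atlas Search"], precision, recall)
```

Exact matching is deliberately strict. A partial-overlap or type-agnostic score may be more forgiving, but the strict version makes boundary and typing errors visible early.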

How many entity types are a good base?

Start smaller than you think. A practical base is usually eight to twelve entity types. With fewer, the graph becomes too vague to guide retrieval. With more, the extractor, reviewers, and downstream retrieval logic often become inconsistent before you have enough examples to justify the complexity.

The first taxonomy should cover the entities that change retrieval behavior. If an entity type does not affect expansion, filtering, permissions, ranking, or explanation, it probably does not need to be a first-class graph node yet.
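
One way to apply that test is to record, for each proposed type, which retrieval behaviors it changes, and promote only the types that change at least one. The sketch below is illustrative: `EntityType`, `first_class_types`, and the behavior assignments are assumptions made for this example, not a recommended taxonomy.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

# The five behaviors named in the paragraph above.
RETRIEVAL_BEHAVIORS = frozenset(
    {"expansion", "filtering", "permissions", "ranking", "explanation"}
)


@dataclass(frozen=True)
class EntityType:
    name: str
    affects: FrozenSet[str]  # retrieval behaviors this type changes

    def __post_init__(self) -> None:
        unknown = self.affects - RETRIEVAL_BEHAVIORS
        if unknown:
            raise ValueError(f"unknown behaviors for {self.name}: {sorted(unknown)}")


def first_class_types(candidates: List[EntityType]) -> List[str]:
    """Keep only the types that change at least one retrieval behavior."""
    return [t.name for t in candidates if t.affects]


# Hypothetical assignments; yours depend on how retrieval actually uses each type.
proposed = [
    EntityType("team", frozenset({"permissions", "filtering"})),
    EntityType("product", frozenset({"expansion", "ranking"})),
    EntityType("incident", frozenset({"expansion", "explanation"})),
    EntityType("location", frozenset()),  # geography does not matter here yet
]

print(first_class_types(proposed))
```

A type dropped by this check can still be stored as an attribute on another node and promoted later, once it starts to drive retrieval decisions.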

Good starting types

Person, organization, team, product, project, system, document, concept, event, incident, customer, and location when geography matters.

Domain-specific additions

Feature, repository, API, policy, metric, experiment, regulation, vendor, dataset, or permission group when those objects drive retrieval decisions.