Reference Resolution API

Status: Generated from current Python docstrings and type hints.

Reference extraction, resolution, loading, and collision checks.

gaia.engine.lang.refs

Gaia reference extraction, resolution, and loading.

Public API

extract(text) -> ExtractionResult
resolve(key, label_table, references) -> RefKind
check_collisions(label_table, references) -> None
validate_groups(groups, markers, label_table, references) -> None
load_references(path) -> dict[str, dict]
RefKind, RefMarker, BracketGroup, ExtractionResult, ReferenceError

ReferenceError

ReferenceError(message: str, *, location: str | None = None)

Bases: Exception

Base error for reference handling.

Raised by the extractor, resolver, and loader for any structural or semantic failure. The compile step turns these into hard errors.

Create a reference error with an optional location prefix.

Source code in gaia/engine/lang/refs/errors.py

def __init__(self, message: str, *, location: str | None = None) -> None:
    """Create a reference error with an optional location prefix."""
    self.location = location
    if location:
        super().__init__(f"{location}: {message}")
    else:
        super().__init__(message)
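
The effect of the optional `location` prefix can be sketched as follows. The class body is copied from the source above so the snippet runs on its own; the file name and message text are illustrative values, not output from the real loader.

```python
from __future__ import annotations


class ReferenceError(Exception):
    """Stand-in copied from gaia/engine/lang/refs/errors.py as shown above."""

    def __init__(self, message: str, *, location: str | None = None) -> None:
        self.location = location
        if location:
            super().__init__(f"{location}: {message}")
        else:
            super().__init__(message)


# Illustrative values only.
with_location = ReferenceError("invalid JSON", location="references.json")
without_location = ReferenceError("invalid JSON")

print(str(with_location))      # references.json: invalid JSON
print(str(without_location))   # invalid JSON
print(with_location.location)  # references.json
```

Python also ships a builtin `ReferenceError` (raised by weak reference proxies), so code that catches this error probably wants to import the class explicitly from the package rather than rely on the bare name.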

BracketGroup `dataclass`

BracketGroup(raw: str, start: int, end: int, marker_indices: tuple[int, ...])

A complete [...] Pandoc citation group containing one or more refs.

Attributes:

Name	Type	Description
`raw`	`str`	Full bracket group text including the brackets (for error messages).
`start`	`int`	Character offset in source text (position of `[`).
`end`	`int`	Character offset (exclusive; position after `]`).
`marker_indices`	`tuple[int, ...]`	Indices into `ExtractionResult.markers` of all markers that belong to this group.

ExtractionResult `dataclass`

ExtractionResult(markers: tuple[RefMarker, ...], groups: tuple[BracketGroup, ...])

Result of scanning a piece of text for reference markers.

Attributes:

Name	Type	Description
`markers`	`tuple[RefMarker, ...]`	All extracted markers, in source order. Includes both bracketed (strict) and bare (opportunistic) markers.
`groups`	`tuple[BracketGroup, ...]`	All bracket groups detected, in source order.

RefMarker `dataclass`

RefMarker(key: str, start: int, end: int, strict: bool, group_index: int | None = None)

A single @key reference marker extracted from text.

Attributes:

Name	Type	Description
`key`	`str`	The identifier after `@`.
`start`	`int`	Character offset in source text.
`end`	`int`	Character offset (exclusive).
`strict`	`bool`	True if inside a `[...]` group, False if bare.
`group_index`	`int \| None`	Index into the parent `ExtractionResult.groups` list if inside a bracket group, otherwise None.
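
The cross-references between markers and groups described in the tables above can be illustrated with hand-built values. This is a sketch only: the two dataclasses are simplified stand-ins that mirror the documented signatures, and the instances are constructed manually rather than produced by `extract`.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class RefMarker:  # stand-in mirroring the documented fields
    key: str
    start: int
    end: int
    strict: bool
    group_index: int | None = None


@dataclass
class BracketGroup:  # stand-in mirroring the documented fields
    raw: str
    start: int
    end: int
    marker_indices: tuple[int, ...]


text = "See [@smith2020; @jones2019] for details."


def strict_marker(key: str, group_index: int) -> RefMarker:
    # For bracketed markers, start points at `@` and end is exclusive.
    at = text.index("@" + key)
    return RefMarker(key, at, at + 1 + len(key), strict=True, group_index=group_index)


markers = (strict_marker("smith2020", 0), strict_marker("jones2019", 0))
open_pos = text.index("[")
close_pos = text.index("]") + 1  # exclusive end, one past `]`
groups = (BracketGroup(text[open_pos:close_pos], open_pos, close_pos, (0, 1)),)

for g_idx, group in enumerate(groups):
    print(text[group.start:group.end])  # [@smith2020; @jones2019]
    for m_idx in group.marker_indices:
        marker = markers[m_idx]
        # Each member points back at its group; its span slices out "@key".
        print(marker.group_index == g_idx, text[marker.start:marker.end])
```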

extract

extract(text: str) -> ExtractionResult

Scan text for reference markers, returning bracket groups and bare markers.

Parameters:

Name	Type	Description	Default
`text`	`str`	The source text to scan for markers.	required

Returns:

Type	Description
`ExtractionResult`	An ExtractionResult containing `markers` (all extracted markers in source order, both bare and strict) and `groups` (all bracket groups containing at least one marker).

Group membership is tracked by marker identity during Pass 1, then converted to final list indices AFTER the markers list is sorted. This makes the group indices robust to the sort step even when bare markers appear before or between bracket groups.

Source code in gaia/engine/lang/refs/extractor.py

def extract(text: str) -> ExtractionResult:
    """Scan text for reference markers, returning bracket groups and bare markers.

    Args:
        text: The source text to scan for markers.

    Returns:
        ExtractionResult containing:
        - markers: All extracted markers in source order (both bare and strict)
        - groups: All bracket groups containing at least one marker

    Group membership is tracked by marker identity during Pass 1, then
    converted to final list indices AFTER the markers list is sorted. This
    makes the group indices robust to the sort step even when bare markers
    appear before or between bracket groups.
    """
    if not text:
        return ExtractionResult(markers=(), groups=())

    markers: list[RefMarker] = []
    # During Pass 1, groups are recorded with the actual RefMarker objects
    # that belong to them (not list indices), so we can rebuild indices
    # after sort. Each tuple: (raw_text, start, end, group_markers).
    group_records: list[tuple[str, int, int, list[RefMarker]]] = []
    # Character positions covered by bracket groups, so the bare scanner can
    # skip them.
    bracket_spans: list[tuple[int, int]] = []

    # Pass 1: bracket groups
    for group_match in _BRACKET_GROUP_RE.finditer(text):
        group_start = group_match.start()
        group_end = group_match.end()
        body = group_match.group(1)
        body_offset = group_match.start(1)

        group_markers: list[RefMarker] = []
        group_index = len(group_records)
        for key_match in _INNER_KEY_RE.finditer(body):
            marker = RefMarker(
                key=key_match.group(1),
                start=body_offset + key_match.start(1) - 1,  # include `@`
                end=body_offset + key_match.end(1),
                strict=True,
                group_index=group_index,
            )
            markers.append(marker)
            group_markers.append(marker)

        if not group_markers:
            continue

        group_records.append((text[group_start:group_end], group_start, group_end, group_markers))
        bracket_spans.append((group_start, group_end))

    def _inside_bracket(pos: int) -> bool:
        return any(start <= pos < end for start, end in bracket_spans)

    # Pass 2: bare markers (not inside bracket groups)
    for match in _BARE_AT_RE.finditer(text):
        if _inside_bracket(match.start()):
            continue
        markers.append(
            RefMarker(
                key=match.group(1),
                start=match.start(),
                end=match.end(),
                strict=False,
                group_index=None,
            )
        )

    # Sort markers by source position AFTER all Pass 1 and Pass 2 markers
    # are collected. Group membership still points at the original RefMarker
    # objects, so we rebuild marker_indices by identity lookup against the
    # sorted list.
    markers.sort(key=lambda m: m.start)
    new_index_of: dict[int, int] = {id(m): i for i, m in enumerate(markers)}

    groups = tuple(
        BracketGroup(
            raw=raw,
            start=start,
            end=end,
            marker_indices=tuple(new_index_of[id(m)] for m in group_markers),
        )
        for raw, start, end, group_markers in group_records
    )

    return ExtractionResult(markers=tuple(markers), groups=groups)
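
The identity-based index rebuild at the end of `extract` can be shown in isolation. This is a minimal sketch: `Marker` is a hypothetical stand-in with only the fields the technique needs, and the keys and offsets are invented for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Marker:
    """Hypothetical stand-in for RefMarker with only the fields used here."""

    key: str
    start: int


# Pass 1 collects the bracketed marker first, although it sits later in the text.
bracketed = Marker(key="smith2020", start=20)
# Pass 2 then collects a bare marker that appears earlier in the text.
bare = Marker(key="lemma_a", start=4)

markers = [bracketed, bare]  # collection order: Pass 1, then Pass 2
group_members = [bracketed]  # the group remembers objects, not positions

# Sort by source position, then map each object's identity to its new index.
markers.sort(key=lambda m: m.start)
new_index_of = {id(m): i for i, m in enumerate(markers)}
marker_indices = tuple(new_index_of[id(m)] for m in group_members)

print([m.key for m in markers])  # ['lemma_a', 'smith2020']
print(marker_indices)            # (1,)
```

Had the group stored the index assigned during Pass 1 (here `0`), it would point at the bare marker after the sort. Keying the lookup on `id()` also works whether or not the marker dataclass is hashable.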

load_references

load_references(path: Path) -> dict[str, dict[str, Any]]

Load and validate a references.json file.

Parameters:

Name	Type	Description	Default
`path`	`Path`	Path to references.json. If missing, returns an empty dict.	required

Returns:

Type	Description
`dict[str, dict[str, Any]]`	dict mapping citation key → CSL-JSON entry.

Raises:

Type	Description
`ReferenceError`	on invalid JSON, wrong top-level type, or any entry that fails schema validation.

Source code in gaia/engine/lang/refs/loader.py

def load_references(path: Path) -> dict[str, dict[str, Any]]:
    """Load and validate a references.json file.

    Args:
        path: Path to references.json. If missing, returns an empty dict.

    Returns:
        dict mapping citation key → CSL-JSON entry.

    Raises:
        ReferenceError: on invalid JSON, wrong top-level type, or any entry
            that fails schema validation.
    """
    if not path.exists():
        return {}

    try:
        raw = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as e:
        raise ReferenceError(
            f"invalid JSON in references file: {e.msg} (line {e.lineno}, col {e.colno})",
            location=str(path),
        ) from e

    if not isinstance(raw, dict):
        raise ReferenceError(
            f"references.json must be a JSON object keyed by citation key, "
            f"got {type(raw).__name__}",
            location=str(path),
        )

    for key, entry in raw.items():
        _validate_entry(key, entry, location=str(path))

    return raw
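
The loader's outcomes (missing file, valid object, and the two error paths) can be exercised with a simplified copy. This is a sketch under stated assumptions: `load_references_sketch` reproduces the logic shown above but omits the per-entry `_validate_entry` step, whose schema rules are not shown here, and `ReferenceError` is a local stand-in. The file contents are illustrative.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path
from typing import Any


class ReferenceError(Exception):  # stand-in for the class documented above
    def __init__(self, message: str, *, location: str | None = None) -> None:
        self.location = location
        super().__init__(f"{location}: {message}" if location else message)


def load_references_sketch(path: Path) -> dict[str, dict[str, Any]]:
    """Simplified copy of load_references; per-entry validation is omitted."""
    if not path.exists():
        return {}
    try:
        raw = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as e:
        raise ReferenceError(
            f"invalid JSON in references file: {e.msg} (line {e.lineno}, col {e.colno})",
            location=str(path),
        ) from e
    if not isinstance(raw, dict):
        raise ReferenceError(
            f"references.json must be a JSON object keyed by citation key, "
            f"got {type(raw).__name__}",
            location=str(path),
        )
    return raw


def error_of(path: Path) -> str | None:
    """Return the error message raised for path, or None if loading succeeds."""
    try:
        load_references_sketch(path)
    except ReferenceError as exc:
        return str(exc)
    return None


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    missing = load_references_sketch(root / "absent.json")

    good = root / "references.json"
    good.write_text('{"smith2020": {"type": "article-journal"}}', encoding="utf-8")
    loaded = load_references_sketch(good)

    as_list = root / "list.json"
    as_list.write_text("[]", encoding="utf-8")
    list_error = error_of(as_list)

    broken = root / "broken.json"
    broken.write_text("{not json", encoding="utf-8")
    broken_error = error_of(broken)

print(missing)         # {}
print(sorted(loaded))  # ['smith2020']
print(list_error)      # message ends with "got list"
print(broken_error)    # message includes the parser's line and column
```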

check_collisions

check_collisions(label_table: dict[str, Any], references: dict[str, Any]) -> None

Fail fast if any key appears in both the label table and references.

Raises:

Type	Description
`ReferenceError`	If one or more keys appear in both tables; the message lists all colliding keys.

Per spec §3.5, collision is a compile error — never a warning — to prevent silent semantic drift when a bibliography is imported.

Source code in gaia/engine/lang/refs/resolver.py

def check_collisions(
    label_table: dict[str, Any],
    references: dict[str, Any],
) -> None:
    """Fail fast if any key appears in both the label table and references.

    Raises:
        ReferenceError: listing all colliding keys.

    Per spec §3.5, collision is a compile error — never a warning — to
    prevent silent semantic drift when a bibliography is imported.
    """
    collisions = sorted(set(label_table) & set(references))
    if collisions:
        quoted = ", ".join(f"'{k}'" for k in collisions)
        raise ReferenceError(
            f"ambiguous reference key(s) {quoted}: "
            f"same identifier exists as both a knowledge label and a "
            f"citation key. rename one side to disambiguate."
        )
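
The collision test itself is a sorted set intersection over the two key sets. A minimal sketch with invented keys (the table values are placeholders, since only the keys matter):

```python
# Illustrative tables; only the keys participate in the check.
label_table = {"zeta": object(), "alpha": object(), "lemma_a": object()}
references = {"alpha": {}, "zeta": {}, "jones2019": {}}

# Same expressions as in check_collisions above.
collisions = sorted(set(label_table) & set(references))
quoted = ", ".join(f"'{k}'" for k in collisions)

print(collisions)  # ['alpha', 'zeta']
print(quoted)      # 'alpha', 'zeta'
```

Sorting makes the error message deterministic regardless of dictionary insertion order, and reporting every colliding key at once avoids a fix-one-rerun loop.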

resolve

resolve(key: str, label_table: dict[str, Any], references: dict[str, Any]) -> RefKind

Resolve a single reference key.

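Returns `"citation"` if the key is in `references`, `"knowledge"` if it is in `label_table`, and `"unknown"` otherwise. Must only be called after `check_collisions` has passed, which guarantees that a key is in at most one table.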

Source code in gaia/engine/lang/refs/resolver.py

def resolve(
    key: str,
    label_table: dict[str, Any],
    references: dict[str, Any],
) -> RefKind:
    """Resolve a single reference key.

    Must only be called after `check_collisions` has passed, which guarantees
    a key is in at most one table.
    """
    if key in references:
        return "citation"
    if key in label_table:
        return "knowledge"
    return "unknown"
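
The lookup order matters if the precondition is violated. The sketch below copies `resolve` from the source above so it runs standalone; `RefKindSketch` is an assumed shape for `RefKind` (inferred from the three returned strings, not confirmed by this page), and the keys are invented.

```python
from __future__ import annotations

from typing import Any, Literal

# Assumption: RefKind is a Literal of the three strings that resolve returns.
RefKindSketch = Literal["citation", "knowledge", "unknown"]


def resolve(
    key: str,
    label_table: dict[str, Any],
    references: dict[str, Any],
) -> RefKindSketch:
    """Copy of the resolver shown above."""
    if key in references:
        return "citation"
    if key in label_table:
        return "knowledge"
    return "unknown"


label_table = {"lemma_a": object(), "both": object()}
references = {"smith2020": {}, "both": {}}

print(resolve("smith2020", label_table, references))  # citation
print(resolve("lemma_a", label_table, references))    # knowledge
print(resolve("missing", label_table, references))    # unknown

# "both" collides. Without check_collisions, references silently win.
print(resolve("both", label_table, references))       # citation
```

The last call shows why `check_collisions` has to run first: a colliding key would otherwise resolve as a citation without any diagnostic.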

validate_groups

validate_groups(groups: Iterable[BracketGroup], markers: tuple[RefMarker, ...] | list[RefMarker], label_table: dict[str, Any], references: dict[str, Any]) -> None

Enforce the homogeneous-group rule (spec §3.2).

A [...] group must contain only knowledge refs or only citations. Mixing them in one group is a compile error because the rendering pipeline cannot faithfully process mixed Pandoc groups through citeproc-py.

Unknown keys are NOT flagged here — they have their own disposition (strict → error at marker level; opportunistic → literal). This function only fires on the specific knowledge+citation mix.

Raises:

Type	Description
`ReferenceError`	on the first mixed group encountered, listing which keys are knowledge and which are citations.

Source code in gaia/engine/lang/refs/resolver.py

def validate_groups(
    groups: Iterable[BracketGroup],
    markers: tuple[RefMarker, ...] | list[RefMarker],
    label_table: dict[str, Any],
    references: dict[str, Any],
) -> None:
    """Enforce the homogeneous-group rule (spec §3.2).

    A `[...]` group must contain only knowledge refs or only citations.
    Mixing them in one group is a compile error because the rendering
    pipeline cannot faithfully process mixed Pandoc groups through
    citeproc-py.

    Unknown keys are NOT flagged here — they have their own disposition
    (strict → error at marker level; opportunistic → literal). This function
    only fires on the specific knowledge+citation mix.

    Raises:
        ReferenceError: on the first mixed group encountered, listing which
            keys are knowledge and which are citations.
    """
    markers_seq = tuple(markers)
    for group in groups:
        knowledge_keys: list[str] = []
        citation_keys: list[str] = []
        for idx in group.marker_indices:
            marker = markers_seq[idx]
            kind = resolve(marker.key, label_table, references)
            if kind == "knowledge":
                knowledge_keys.append(marker.key)
            elif kind == "citation":
                citation_keys.append(marker.key)
        if knowledge_keys and citation_keys:
            raise ReferenceError(
                f"mixed-type reference group {group.raw!r}: contains both "
                f"knowledge refs ({', '.join(knowledge_keys)}) and "
                f"citations ({', '.join(citation_keys)}). "
                f"split into separate bracketed groups — one for knowledge "
                f"refs and one for citations."
            )
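
The homogeneous-group rule reduces to partitioning a group's keys and checking whether both partitions are non-empty. The helper below is hypothetical (it is not part of the module) and assumes `check_collisions` has already passed; the keys are invented.

```python
from __future__ import annotations

from typing import Any


def classify_group(
    keys: list[str],
    label_table: dict[str, Any],
    references: dict[str, Any],
) -> tuple[list[str], list[str]]:
    """Hypothetical helper: split keys the way validate_groups does.

    Unknown keys land in neither list, so they never trigger the mix error.
    """
    citations = [k for k in keys if k in references]
    knowledge = [k for k in keys if k not in references and k in label_table]
    return knowledge, citations


def is_mixed(keys: list[str], label_table: dict[str, Any], references: dict[str, Any]) -> bool:
    knowledge, citations = classify_group(keys, label_table, references)
    return bool(knowledge) and bool(citations)


label_table = {"lemma_a": object(), "def_b": object()}
references = {"smith2020": {}}

print(is_mixed(["lemma_a", "def_b"], label_table, references))       # False
print(is_mixed(["lemma_a", "smith2020"], label_table, references))   # True
print(is_mixed(["smith2020", "typo_key"], label_table, references))  # False
```

The third case shows that an unknown key alongside a citation does not count as a mix; per the description above, unknown keys are handled separately at the marker level.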
