Reference Resolution API

Status: Generated from current Python docstrings and type hints.

Reference extraction, resolution, loading, and collision checks.

gaia.engine.lang.refs

Gaia reference extraction, resolution, and loading.

Public API
  • extract(text) -> ExtractionResult
  • resolve(key, label_table, references) -> RefKind
  • check_collisions(label_table, references) -> None
  • validate_groups(groups, markers, label_table, references) -> None
  • load_references(path) -> dict[str, dict]
  • RefKind, RefMarker, BracketGroup, ExtractionResult, ReferenceError

ReferenceError

ReferenceError(message: str, *, location: str | None = None)

Bases: Exception

Base error for reference handling.

Raised by the extractor, resolver, and loader for any structural or semantic failure. The compile step turns these into hard errors.

Create a reference error with an optional location prefix.

Source code in gaia/engine/lang/refs/errors.py
def __init__(self, message: str, *, location: str | None = None) -> None:
    """Create a reference error with an optional location prefix."""
    self.location = location
    if location:
        super().__init__(f"{location}: {message}")
    else:
        super().__init__(message)
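
To illustrate the location prefix, the sketch below copies the constructor shown above into a local stand-in class, so it runs without the package installed. The file name `refs.json` and the message text are made up for the example.

```python
from __future__ import annotations


class ReferenceError(Exception):  # local stand-in mirroring the constructor above
    def __init__(self, message: str, *, location: str | None = None) -> None:
        self.location = location
        if location:
            super().__init__(f"{location}: {message}")
        else:
            super().__init__(message)


with_location = ReferenceError("invalid JSON", location="refs.json")
without_location = ReferenceError("invalid JSON")

print(str(with_location))     # refs.json: invalid JSON
print(str(without_location))  # invalid JSON
```

Because the check is `if location:`, an empty string behaves like `None` and produces no prefix.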

BracketGroup dataclass

BracketGroup(raw: str, start: int, end: int, marker_indices: tuple[int, ...])

A complete [...] Pandoc citation group containing one or more refs.

Attributes:

Name Type Description
raw str

Full bracket group text including the brackets (for error messages).

start int

Character offset in source text (position of [).

end int

Character offset (exclusive; position after ]).

marker_indices tuple[int, ...]

Indices into ExtractionResult.markers of all markers that belong to this group.

ExtractionResult dataclass

ExtractionResult(markers: tuple[RefMarker, ...], groups: tuple[BracketGroup, ...])

Result of scanning a piece of text for reference markers.

Attributes:

Name Type Description
markers tuple[RefMarker, ...]

All extracted markers, in source order. Includes both bracketed (strict) and bare (opportunistic) markers.

groups tuple[BracketGroup, ...]

All bracket groups detected, in source order.

RefMarker dataclass

RefMarker(key: str, start: int, end: int, strict: bool, group_index: int | None = None)

A single @key reference marker extracted from text.

Attributes:

Name Type Description
key str

The identifier after @.

start int

Character offset in source text.

end int

Character offset (exclusive).

strict bool

True if inside a [...] group, False if bare.

group_index int | None

Index into the parent ExtractionResult.groups list if inside a bracket group, otherwise None.
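
The relationship between the offsets and indices documented above can be sketched with hand-built values. The dataclasses here are local stand-ins that only reproduce the documented fields, and the marker and group values are what one would expect for the sample text given the field descriptions, not captured output from `extract`.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class RefMarker:  # stand-in with the documented fields only
    key: str
    start: int
    end: int
    strict: bool
    group_index: int | None = None


@dataclass
class BracketGroup:  # stand-in with the documented fields only
    raw: str
    start: int
    end: int
    marker_indices: tuple[int, ...]


text = "See [@a; @b]."
markers = (
    RefMarker(key="a", start=5, end=7, strict=True, group_index=0),
    RefMarker(key="b", start=9, end=11, strict=True, group_index=0),
)
groups = (BracketGroup(raw="[@a; @b]", start=4, end=12, marker_indices=(0, 1)),)

group = groups[0]
# start/end are half-open offsets, so slicing the source recovers the raw text.
print(text[group.start:group.end] == group.raw)
# Each marker span covers the "@" plus the key.
print([text[m.start:m.end] for m in markers])
# marker_indices and group_index point at each other.
print(all(markers[i].group_index == 0 for i in group.marker_indices))
```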

extract

extract(text: str) -> ExtractionResult

Scan text for reference markers, returning bracket groups and bare markers.

Parameters:

Name Type Description Default
text str

The source text to scan for markers.

required

Returns:

Type Description
ExtractionResult

ExtractionResult containing:
  • markers: All extracted markers in source order (both bare and strict)
  • groups: All bracket groups containing at least one marker

Group membership is tracked by marker identity during Pass 1, then converted to final list indices AFTER the markers list is sorted. This makes the group indices robust to the sort step even when bare markers appear before or between bracket groups.

Source code in gaia/engine/lang/refs/extractor.py
def extract(text: str) -> ExtractionResult:
    """Scan text for reference markers, returning bracket groups and bare markers.

    Args:
        text: The source text to scan for markers.

    Returns:
        ExtractionResult containing:
        - markers: All extracted markers in source order (both bare and strict)
        - groups: All bracket groups containing at least one marker

    Group membership is tracked by marker identity during Pass 1, then
    converted to final list indices AFTER the markers list is sorted. This
    makes the group indices robust to the sort step even when bare markers
    appear before or between bracket groups.
    """
    if not text:
        return ExtractionResult(markers=(), groups=())

    markers: list[RefMarker] = []
    # During Pass 1, groups are recorded with the actual RefMarker objects
    # that belong to them (not list indices), so we can rebuild indices
    # after sort. Each tuple: (raw_text, start, end, group_markers).
    group_records: list[tuple[str, int, int, list[RefMarker]]] = []
    # Character positions covered by bracket groups, so the bare scanner can
    # skip them.
    bracket_spans: list[tuple[int, int]] = []

    # Pass 1: bracket groups
    for group_match in _BRACKET_GROUP_RE.finditer(text):
        group_start = group_match.start()
        group_end = group_match.end()
        body = group_match.group(1)
        body_offset = group_match.start(1)

        group_markers: list[RefMarker] = []
        group_index = len(group_records)
        for key_match in _INNER_KEY_RE.finditer(body):
            marker = RefMarker(
                key=key_match.group(1),
                start=body_offset + key_match.start(1) - 1,  # include `@`
                end=body_offset + key_match.end(1),
                strict=True,
                group_index=group_index,
            )
            markers.append(marker)
            group_markers.append(marker)

        if not group_markers:
            continue

        group_records.append((text[group_start:group_end], group_start, group_end, group_markers))
        bracket_spans.append((group_start, group_end))

    def _inside_bracket(pos: int) -> bool:
        return any(start <= pos < end for start, end in bracket_spans)

    # Pass 2: bare markers (not inside bracket groups)
    for match in _BARE_AT_RE.finditer(text):
        if _inside_bracket(match.start()):
            continue
        markers.append(
            RefMarker(
                key=match.group(1),
                start=match.start(),
                end=match.end(),
                strict=False,
                group_index=None,
            )
        )

    # Sort markers by source position AFTER all Pass 1 and Pass 2 markers
    # are collected. Group membership still points at the original RefMarker
    # objects, so we rebuild marker_indices by identity lookup against the
    # sorted list.
    markers.sort(key=lambda m: m.start)
    new_index_of: dict[int, int] = {id(m): i for i, m in enumerate(markers)}

    groups = tuple(
        BracketGroup(
            raw=raw,
            start=start,
            end=end,
            marker_indices=tuple(new_index_of[id(m)] for m in group_markers),
        )
        for raw, start, end, group_markers in group_records
    )

    return ExtractionResult(markers=tuple(markers), groups=groups)
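
The identity-based reindexing step can be seen in isolation with a reduced sketch. `Marker` is a hypothetical two-field stand-in for `RefMarker`, and the positions are worked out by hand for the imagined text `"@x and [@a; @b]"`; no regular expressions from the library are reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Marker:  # hypothetical, reduced stand-in for RefMarker
    key: str
    start: int


# Imagined source text: "@x and [@a; @b]"
a, b = Marker("a", 8), Marker("b", 12)

markers = [a, b]                # Pass 1: bracketed markers are collected first
group_members = [a, b]          # the group remembers objects, not list positions
markers.append(Marker("x", 0))  # Pass 2: a bare marker that precedes the group

# Sorting moves the bare marker to the front, shifting the others by one.
markers.sort(key=lambda m: m.start)

# Rebuild indices by object identity against the sorted list.
new_index_of = {id(m): i for i, m in enumerate(markers)}
marker_indices = tuple(new_index_of[id(m)] for m in group_members)

print([m.key for m in markers])  # ['x', 'a', 'b']
print(marker_indices)            # (1, 2)
```

Had the group stored the pre-sort positions `(0, 1)`, they would point at `x` and `a` after the sort, which is the failure the identity lookup avoids.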

load_references

load_references(path: Path) -> dict[str, dict[str, Any]]

Load and validate a references.json file.

Parameters:

Name Type Description Default
path Path

Path to references.json. If missing, returns an empty dict.

required

Returns:

Type Description
dict[str, dict[str, Any]]

dict mapping citation key → CSL-JSON entry.

Raises:

Type Description
ReferenceError

on invalid JSON, wrong top-level type, or any entry that fails schema validation.

Source code in gaia/engine/lang/refs/loader.py
def load_references(path: Path) -> dict[str, dict[str, Any]]:
    """Load and validate a references.json file.

    Args:
        path: Path to references.json. If missing, returns an empty dict.

    Returns:
        dict mapping citation key → CSL-JSON entry.

    Raises:
        ReferenceError: on invalid JSON, wrong top-level type, or any entry
            that fails schema validation.
    """
    if not path.exists():
        return {}

    try:
        raw = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as e:
        raise ReferenceError(
            f"invalid JSON in references file: {e.msg} (line {e.lineno}, col {e.colno})",
            location=str(path),
        ) from e

    if not isinstance(raw, dict):
        raise ReferenceError(
            f"references.json must be a JSON object keyed by citation key, "
            f"got {type(raw).__name__}",
            location=str(path),
        )

    for key, entry in raw.items():
        _validate_entry(key, entry, location=str(path))

    return raw
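
As a rough illustration of the file shape the loader expects, the sketch below writes a small JSON object keyed by citation key and repeats the two top-level checks shown above (file existence and `dict` type) using only the standard library. The key `example2024` and its fields are placeholders; which fields `_validate_entry` actually requires is not shown in this page, so the entry contents are an assumption.

```python
import json
import tempfile
from pathlib import Path

# Placeholder entry: a JSON object keyed by citation key.
sample = {
    "example2024": {
        "type": "article-journal",
        "title": "An Example Title",
    }
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "references.json"
    path.write_text(json.dumps(sample), encoding="utf-8")

    # Same read-and-parse step as the loader.
    raw = json.loads(path.read_text(encoding="utf-8"))

    # A missing file is the case where the loader returns {}.
    missing_exists = (Path(tmp) / "absent.json").exists()

top_level_ok = isinstance(raw, dict)
# A top-level JSON array is the case the loader rejects.
array_is_object = isinstance(json.loads("[]"), dict)

print(top_level_ok, missing_exists, array_is_object)  # True False False
```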

check_collisions

check_collisions(label_table: dict[str, Any], references: dict[str, Any]) -> None

Fail fast if any key appears in both the label table and references.

Raises:

Type Description
ReferenceError

listing all colliding keys.

Per spec §3.5, collision is a compile error — never a warning — to prevent silent semantic drift when a bibliography is imported.

Source code in gaia/engine/lang/refs/resolver.py
def check_collisions(
    label_table: dict[str, Any],
    references: dict[str, Any],
) -> None:
    """Fail fast if any key appears in both the label table and references.

    Raises:
        ReferenceError: listing all colliding keys.

    Per spec §3.5, collision is a compile error — never a warning — to
    prevent silent semantic drift when a bibliography is imported.
    """
    collisions = sorted(set(label_table) & set(references))
    if collisions:
        quoted = ", ".join(f"'{k}'" for k in collisions)
        raise ReferenceError(
            f"ambiguous reference key(s) {quoted}: "
            f"same identifier exists as both a knowledge label and a "
            f"citation key. rename one side to disambiguate."
        )
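
The collision test is a plain set intersection over the two tables' keys. A minimal sketch with made-up keys (`lemma_a`, `smith2020`, `jones2019` are hypothetical) shows what ends up in the error message:

```python
# Hypothetical tables; only the keys matter for the collision check.
label_table = {"lemma_a": object(), "smith2020": object()}
references = {"smith2020": {}, "jones2019": {}}

# Same expression as in check_collisions: keys present in both tables, sorted.
collisions = sorted(set(label_table) & set(references))
quoted = ", ".join(f"'{k}'" for k in collisions)

print(collisions)  # ['smith2020']
print(quoted)      # 'smith2020'
```

Sorting the intersection keeps the reported key order stable between runs, since set iteration order is not guaranteed.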

resolve

resolve(key: str, label_table: dict[str, Any], references: dict[str, Any]) -> RefKind

Resolve a single reference key.

Must only be called after check_collisions has passed, which guarantees a key is in at most one table.

Source code in gaia/engine/lang/refs/resolver.py
def resolve(
    key: str,
    label_table: dict[str, Any],
    references: dict[str, Any],
) -> RefKind:
    """Resolve a single reference key.

    Must only be called after `check_collisions` has passed, which guarantees
    a key is in at most one table.
    """
    if key in references:
        return "citation"
    if key in label_table:
        return "knowledge"
    return "unknown"
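
The lookup order matters if the precondition is ignored. The sketch below uses a local copy of the function body above with hypothetical keys to show the three outcomes, plus what happens for a key that is in both tables.

```python
def resolve(key, label_table, references):  # local copy of the logic above
    if key in references:
        return "citation"
    if key in label_table:
        return "knowledge"
    return "unknown"


label_table = {"lemma_a": None}   # hypothetical knowledge label
references = {"smith2020": {}}    # hypothetical citation key

print(resolve("smith2020", label_table, references))  # citation
print(resolve("lemma_a", label_table, references))    # knowledge
print(resolve("nope", label_table, references))       # unknown

# Without a prior collision check, a duplicated key silently resolves as a
# citation because references is consulted first.
both = resolve("dup", {"dup": None}, {"dup": {}})
print(both)  # citation
```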

validate_groups

validate_groups(groups: Iterable[BracketGroup], markers: tuple[RefMarker, ...] | list[RefMarker], label_table: dict[str, Any], references: dict[str, Any]) -> None

Enforce the homogeneous-group rule (spec §3.2).

A [...] group must contain only knowledge refs or only citations. Mixing them in one group is a compile error because the rendering pipeline cannot faithfully process mixed Pandoc groups through citeproc-py.

Unknown keys are NOT flagged here — they have their own disposition (strict → error at marker level; opportunistic → literal). This function only fires on the specific knowledge+citation mix.

Raises:

Type Description
ReferenceError

on the first mixed group encountered, listing which keys are knowledge and which are citations.

Source code in gaia/engine/lang/refs/resolver.py
def validate_groups(
    groups: Iterable[BracketGroup],
    markers: tuple[RefMarker, ...] | list[RefMarker],
    label_table: dict[str, Any],
    references: dict[str, Any],
) -> None:
    """Enforce the homogeneous-group rule (spec §3.2).

    A `[...]` group must contain only knowledge refs or only citations.
    Mixing them in one group is a compile error because the rendering
    pipeline cannot faithfully process mixed Pandoc groups through
    citeproc-py.

    Unknown keys are NOT flagged here — they have their own disposition
    (strict → error at marker level; opportunistic → literal). This function
    only fires on the specific knowledge+citation mix.

    Raises:
        ReferenceError: on the first mixed group encountered, listing which
            keys are knowledge and which are citations.
    """
    markers_seq = tuple(markers)
    for group in groups:
        knowledge_keys: list[str] = []
        citation_keys: list[str] = []
        for idx in group.marker_indices:
            marker = markers_seq[idx]
            kind = resolve(marker.key, label_table, references)
            if kind == "knowledge":
                knowledge_keys.append(marker.key)
            elif kind == "citation":
                citation_keys.append(marker.key)
        if knowledge_keys and citation_keys:
            raise ReferenceError(
                f"mixed-type reference group {group.raw!r}: contains both "
                f"knowledge refs ({', '.join(knowledge_keys)}) and "
                f"citations ({', '.join(citation_keys)}). "
                f"split into separate bracketed groups — one for knowledge "
                f"refs and one for citations."
            )
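
The homogeneous-group rule can be sketched without the library types by treating each group as a tuple of keys. The helper names `classify` and `mixed_groups` and all keys are hypothetical, and unlike the function above, this sketch collects every mixed group instead of raising on the first one.

```python
def classify(key, label_table, references):
    """Same lookup order as resolve: references first, then labels."""
    if key in references:
        return "citation"
    if key in label_table:
        return "knowledge"
    return "unknown"


def mixed_groups(groups, label_table, references):
    """Return the key groups that contain both knowledge refs and citations."""
    mixed = []
    for keys in groups:
        kinds = {classify(k, label_table, references) for k in keys}
        if {"knowledge", "citation"} <= kinds:
            mixed.append(keys)
    return mixed


label_table = {"lemma_a": None}
references = {"smith2020": {}}
groups = [
    ("lemma_a", "typo_key"),   # knowledge + unknown: not flagged by this rule
    ("smith2020",),            # citations only
    ("lemma_a", "smith2020"),  # knowledge + citation: mixed
]

print(mixed_groups(groups, label_table, references))
# [('lemma_a', 'smith2020')]
```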