1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -4,6 +4,7 @@ Full release notes with details on each version: [GitHub Releases](https://githu

## Unreleased

- Feat: `graphify update`/`watch` now warns when an incremental merge collapses a god-node (#1652, thanks @Ns2384-star). An incremental `--update` REPLACES each re-extracted file's prior nodes/edges; if a re-extraction emits a *different* id for an entity that already exists as a well-connected hub, the old hub and all its edges are dropped (in-file edges go with the replaced contribution, cross-file edges are discarded by build since their endpoint id no longer has a node) — the hub silently collapses from many edges to ~0 while the node count may even rise, so the count-based shrink guard never fires (#1651). `build_merge` now snapshots pre-merge hub degrees (nodes with degree ≥ 20, counted with the same directedness as the merge and only across edges whose endpoints both exist) and prints a stderr WARNING — not a hard error, since a genuine large refactor can legitimately shed a hub's edges — listing each hub that vanished or lost more than half its degree. The alert is suppressed for a hub whose `source_file` was intentionally pruned this run (a requested deletion, not corruption), mirroring the shrink guard's prune exemption, so `--update` deletions don't cause alarm fatigue on the exact signal meant to catch a real id-drift collapse.
- Fix: a malformed semantic chunk no longer crashes `extract` and discards every successful chunk (#1631, thanks @ssazy). When an LLM returned a well-formed object whose `edges` (or `nodes`/`hyperedges`) array carried a stray non-dict entry — a nested list where an edge object belongs — the AST+semantic merge and the semantic-cache write both called `.get()` per entry and raised `AttributeError: 'list' object has no attribute 'get'`. On a 34-chunk run where 33 succeeded, that meant no `graph.json` was written and the cache write failed too, so a re-run re-extracted everything. `_parse_llm_json` now sanitizes each fragment at the single parse chokepoint (keeping only dict entries and coercing a non-list value to `[]`), so the cache writer, the adaptive-retry merge, and the CLI merge are all protected in one place.
- Fix: an unresolved bare npm import no longer aliases onto an unrelated same-named local file (#1638, thanks @EveX1). `import colors from "tailwindcss/colors"` in a `.tsx` file emitted an `imports_from` edge to the bare id `colors`, and build.py's pre-migration alias index (which registers every local file's bare stem) then remapped it onto an unrelated `backend/utils/colors.py` — a confident (`EXTRACTED`) cross-language phantom edge, and one per `.tsx` file sharing the import. In a real monorepo eight unrelated `.tsx` files all landed on a single Python module. Common package subpaths (`colors`, `utils`, `types`, `config`, `client`) collide this way constantly. The external-import fallback now namespaces its target with the `ref` prefix (the same J-4 convention used for tsconfig `extends`/`$ref` externals), so it can never collapse to a local file/symbol id; the ref-namespaced target has no node, so build drops it as an external reference — the correct outcome for a third-party import.
- Fix: `graph.json` node/edge ordering is now stable run-to-run for document/semantic corpora (#1632, thanks @umeshpsatwe). With a parallel LLM backend, `extract_corpus_parallel` merged chunk results in completion order, so which network call happened to return first reordered the nodes and edges even when the model returned identical content — churning `graph.json` between otherwise-identical runs. Chunks are now merged in deterministic submission order after the pool drains (matching the serial path); the progress callback still fires in completion order so long local runs aren't silent. Note: the semantic content the LLM extracts is itself nondeterministic run-to-run — this fix removes the pipeline's own ordering churn, not the model's variance.
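
The submission-order merge described in the #1632 entry can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not graphify's actual `extract_corpus_parallel`: `extract_chunk`, `extract_parallel`, and the `on_progress` callback are hypothetical names, and a random sleep stands in for the LLM call.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def extract_chunk(index: int) -> dict:
    """Hypothetical stand-in for one LLM extraction call with variable latency."""
    time.sleep(random.uniform(0, 0.02))
    return {"nodes": [{"id": f"n{index}"}], "edges": []}


def extract_parallel(chunk_count: int, on_progress=None) -> list:
    """Run chunks concurrently, then merge their nodes in submission order."""
    results = {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(extract_chunk, i): i for i in range(chunk_count)}
        # Progress is reported as calls finish, so a long run is not silent.
        for fut in as_completed(futures):
            idx = futures[fut]
            results[idx] = fut.result()
            if on_progress is not None:
                on_progress(idx)
    # The pool has drained: merge by submission index, not by completion time.
    merged = []
    for i in range(chunk_count):
        merged.extend(results[i]["nodes"])
    return merged


completed = []  # completion order; may differ from run to run
merged_nodes = extract_parallel(8, completed.append)
ids = [n["id"] for n in merged_nodes]
print(ids)
```

Keying the results by submission index keeps the merged output independent of which call returns first, while the callback still observes the real completion order.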
137 changes: 137 additions & 0 deletions graphify/build.py
@@ -732,6 +732,111 @@ def deduplicate_by_label(nodes: list[dict], edges: list[dict]) -> tuple[list[dic
    return deduped_nodes, deduped_edges


# Incremental --update (build_merge, below) REPLACES each re-extracted file's
# prior nodes/edges. If a re-extraction emits a DIFFERENT id for an entity that
# already exists as a well-connected hub, the old hub and its cross-file edges
# are dropped and a fresh node is created with only this file's edges — the hub
# silently collapses (e.g. 174 edges -> ~0) while total node count may even rise,
# so the count-based shrink guard never catches it (#1651/#1652b). We snapshot
# pre-merge hub degrees and WARN (not raise — a real refactor can legitimately
# shed edges) when a former hub vanishes or loses most of its degree.
HUB_DEGREE_MIN = 20
DEGREE_DROP_FRAC = 0.5
_HUB_DROP_REPORT_LIMIT = 10


def _hub_degrees(
    nodes: list[dict], edges: list[dict], threshold: int = HUB_DEGREE_MIN,
    *, directed: bool = False,
) -> dict[str, tuple[str, int]]:
    """Map id -> (label, degree) for every node whose degree >= threshold.

    Mirrors G.degree() on the post-merge graph so the before/after comparison is
    like-for-like. Two subtleties:

    - Directedness matches the merge (`directed`): a DiGraph counts a bidirectional
      pair as degree 2 where an undirected Graph collapses it to 1, so snapshotting
      with the wrong type would skew the ratio for build_merge(directed=True).
    - An edge is counted only when BOTH endpoints are in the node set. build_from_json
      DROPS an edge whose endpoint id has no node rather than orphaning it onto an
      auto-created bare node, so counting dangling stored edges here (as add_edge's
      implicit node creation would) overstates the pre-merge degree.
    """
    g: nx.Graph = nx.DiGraph() if directed else nx.Graph()
    node_ids: set = set()
    for n in nodes:
        nid = n.get("id")
        if nid is not None:
            g.add_node(nid, label=n.get("label", nid))
            node_ids.add(nid)
    for e in edges:
        s, t = e.get("source"), e.get("target")
        if s in node_ids and t in node_ids:
            g.add_edge(s, t)
    return {
        nid: (g.nodes[nid].get("label", nid), deg)
        for nid, deg in g.degree()
        if deg >= threshold
    }


def _hub_degree_drops(
    pre_hubs: dict[str, tuple[str, int]],
    G: nx.Graph,
    drop_frac: float = DEGREE_DROP_FRAC,
) -> list[tuple[str, int, int]]:
    """Pre-merge hubs that vanished or lost more than drop_frac of their degree.

    Returns [(label, before, after)] sorted worst-drop first.
    """
    post_deg = dict(G.degree())
    drops: list[tuple[str, int, int]] = []
    for nid, (label, before) in pre_hubs.items():
        if before <= 0:
            continue
        after = post_deg.get(nid, 0)
        if (before - after) / before > drop_frac:
            drops.append((label, before, after))
    drops.sort(key=lambda d: d[1] - d[2], reverse=True)
    return drops


def _warn_hub_degree_drops(
    pre_hubs: dict[str, tuple[str, int]],
    G: nx.Graph,
    exclude: set[str] | None = None,
) -> None:
    """Emit a stderr WARNING for any hub node that collapsed during a merge.

    Hubs in `exclude` are skipped: their source_file was intentionally pruned this
    run, so their collapse is an operator-requested deletion, not the #1651 id-drift
    corruption the alert exists to catch. Warning on them would be alarm fatigue on
    the exact signal that matters (mirrors the shrink guard's `not prune_sources`
    exemption). A hub that collapsed WITHOUT its file being pruned still warns.
    """
    if exclude:
        pre_hubs = {nid: v for nid, v in pre_hubs.items() if nid not in exclude}
    drops = _hub_degree_drops(pre_hubs, G)
    if not drops:
        return
    print(
        f"[graphify] WARNING: {len(drops)} hub node(s) lost >"
        f"{int(DEGREE_DROP_FRAC * 100)}% of their edges after this merge — "
        "possible silent data loss from a re-extraction changing an entity's id "
        "(#1651).",
        file=sys.stderr,
    )
    for label, before, after in drops[:_HUB_DROP_REPORT_LIMIT]:
        suffix = " (node dropped entirely)" if after == 0 else ""
        print(
            f"[graphify] - '{label}': {before} -> {after} edges{suffix}",
            file=sys.stderr,
        )
    extra = len(drops) - _HUB_DROP_REPORT_LIMIT
    if extra > 0:
        print(f"[graphify] ... and {extra} more hub node(s).", file=sys.stderr)


def build_merge(
    new_chunks: list[dict],
    graph_path: str | Path | None = None,
@@ -780,6 +885,23 @@ def build_merge(
existing_hyperedges = []
had_graph = False

    # Snapshot the pre-merge graph's hub-node degrees so we can warn if this merge
    # collapses one (#1651/#1652b). Captured from the graph AS LOADED, before the
    # replace-per-source filter below drops the re-extracted files' contribution.
    pre_merge_hubs = (
        _hub_degrees(existing_nodes, existing_edges, directed=directed)
        if had_graph else {}
    )
    # Remember each hub's source_file from the graph AS LOADED, so the degree-drop
    # alert below can exempt a hub whose collapse is explained by an intentional
    # prune — using the same source_file-in-prune_set rule the node prune uses.
    pre_merge_hub_sources: dict[str, str | None] = {}
    if pre_merge_hubs:
        for _n in existing_nodes:
            _nid = _n.get("id")
            if _nid in pre_merge_hubs and _nid not in pre_merge_hub_sources:
                pre_merge_hub_sources[_nid] = _n.get("source_file")

    # Effective root for relativizing absolute source_file / prune paths back to the
    # stored relative source_file keys. When the caller passes root we use it;
    # otherwise fall back to the graph's recorded scan root, so absolute
@@ -906,6 +1028,21 @@ def _kept(item: dict) -> bool:
f"Pass prune_sources explicitly if you intend to remove nodes."
)

    # Degree-drop alert (#1652b): warn when a former hub collapsed. Unlike the
    # count-based shrink guard above, this runs on every path — including the
    # normal dedup=True --update — and warns rather than raises, since a genuine
    # large refactor can legitimately shed a hub's edges.
    if pre_merge_hubs:
        # Exempt hubs whose source_file was pruned this run: their collapse is a
        # deletion the operator asked for (the normal way --update drops a removed
        # file), not the #1651 id-drift corruption the alert exists to catch. A hub
        # that collapsed WITHOUT its file being pruned still warns.
        pruned_hub_ids = {
            nid for nid, sf in pre_merge_hub_sources.items()
            if sf in prune_set or _norm_source_file(sf, _eff_root) in prune_set
        }
        _warn_hub_degree_drops(pre_merge_hubs, G, exclude=pruned_hub_ids)

    return G
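
The id-drift collapse that this alert targets can be reproduced with a small dependency-free sketch. It is an illustration only, not the code in this diff: `hub_degrees` and `degree_drops` are hypothetical simplified helpers, the graph is assumed undirected, self-loops are ignored (unlike networkx, which counts a self-loop as degree 2), and the node ids are invented for the example.

```python
from collections import defaultdict


def hub_degrees(nodes, edges, threshold=20):
    """Undirected degree per node id, counting only edges whose endpoints both exist."""
    node_ids = {n["id"] for n in nodes if n.get("id") is not None}
    neighbors = defaultdict(set)
    for e in edges:
        s, t = e.get("source"), e.get("target")
        # Assumption: dangling edges are dropped and self-loops are skipped.
        if s in node_ids and t in node_ids and s != t:
            neighbors[s].add(t)
            neighbors[t].add(s)
    return {nid: len(nb) for nid, nb in neighbors.items() if len(nb) >= threshold}


def degree_drops(pre_hubs, post_degrees, drop_frac=0.5):
    """Former hubs that lost more than drop_frac of their degree, worst first."""
    drops = []
    for nid, before in pre_hubs.items():
        after = post_degrees.get(nid, 0)  # a vanished node counts as degree 0
        if (before - after) / before > drop_frac:
            drops.append((nid, before, after))
    drops.sort(key=lambda d: d[1] - d[2], reverse=True)
    return drops


# Before the update: "auth" is a hub referenced by 25 callers.
callers = [{"id": f"caller_{i}"} for i in range(25)]
before_nodes = [{"id": "auth"}] + callers
before_edges = [{"source": c["id"], "target": "auth"} for c in callers]
pre_hubs = hub_degrees(before_nodes, before_edges)

# After the update: the re-extraction emitted "auth_service" plus a helper node.
# The callers' edges still target "auth", which no longer has a node.
after_nodes = [{"id": "auth_service"}, {"id": "auth_helper"}] + callers
post_degrees = hub_degrees(after_nodes, before_edges, threshold=0)

drops = degree_drops(pre_hubs, post_degrees)
print(len(before_nodes), "->", len(after_nodes), "nodes;", drops)
```

The node count rises from 26 to 27, so a count-based shrink check sees nothing wrong, while the degree comparison reports the hub going from 25 edges to 0.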

