Add MinerU artifact source tracking and hybrid wiki retrieval by tulongshaonian771 · Pull Request #1 · KnowledgeXLab/Little_Heta

tulongshaonian771 · 2026-06-02T06:02:33Z

Summary

This PR improves Little Heta's document ingestion and retrieval pipeline.

Changes include:

- Track MinerU parsed artifacts, including images and source annotations.
- Preserve parsed image assets under the local KB raw workspace.
- Improve PDF/Office parsing paths with local and cloud MinerU support.
- Add custom provider profiles for chat, multimodal, embedding, and audio adapters.
- Add hybrid wiki retrieval with vector + FTS5 keyword search.
- Add FTS backfill logic for existing vector indexes.
- Use English labels for insert result output.
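As a rough illustration of the hybrid retrieval and FTS backfill items above, the sketch below pairs an SQLite FTS5 keyword search with a stand-in vector ranking and merges the two with reciprocal rank fusion. The table name `wiki_fts`, the column names, the helper functions, and the choice of reciprocal rank fusion are all assumptions made for this example; the PR description does not say how Little Heta combines the two result lists.

```python
import sqlite3


def backfill_fts(conn, pages):
    """Index pages that exist in the vector store but are missing from FTS.

    `pages` maps page_id -> text (hypothetical shape for this sketch).
    Returns the number of rows added.
    """
    existing = {row[0] for row in conn.execute("SELECT page_id FROM wiki_fts")}
    missing = [(pid, text) for pid, text in pages.items() if pid not in existing]
    conn.executemany("INSERT INTO wiki_fts(page_id, body) VALUES (?, ?)", missing)
    return len(missing)


def keyword_search(conn, query, limit=10):
    """Return page ids ordered by FTS5 bm25 score (lower is a better match)."""
    rows = conn.execute(
        "SELECT page_id FROM wiki_fts WHERE wiki_fts MATCH ? "
        "ORDER BY bm25(wiki_fts) LIMIT ?",
        (query, limit),
    ).fetchall()
    return [row[0] for row in rows]


def rrf_fuse(rankings, k=60):
    """Merge ranked id lists with reciprocal rank fusion (an assumed strategy)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE wiki_fts USING fts5(page_id UNINDEXED, body)")

pages = {
    "p1": "MinerU parses PDF files into markdown",
    "p2": "FTS5 keyword search in SQLite",
    "p3": "vector embeddings for retrieval",
}
added = backfill_fts(conn, pages)

keyword_hits = keyword_search(conn, "keyword")
vector_hits = ["p3", "p2"]  # stand-in for a real vector index result
fused = rrf_fuse([vector_hits, keyword_hits])
print(added, fused)
```

A page that appears in both lists accumulates score from each, which is why it rises above a page found by only one retriever. Raw user input may contain FTS5 query syntax characters, so a real implementation would likely need to escape or sanitize the query before passing it to `MATCH`.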

Validation

Tested the following flows:

- `heta init` with custom provider
- local MinerU endpoint: http://127.0.0.1:8000
- `heta insert` for PDF, PPTX, XLSX, PNG
- `heta insert` for MP4 with custom audio adapter configured
- `heta query` with Chinese + code-style queries
- related unit tests: 50 passed

Notes

Custom audio/video parsing works when `audio_api_key`, `audio_model`, and `audio_base_url` are configured manually. The interactive `heta init` flow does not yet prompt for custom audio adapter settings.
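Since these three settings have to be filled in by hand, a small pre-flight check can report which ones are still missing. The sketch below is only an illustration: it assumes the settings are available as a plain dictionary, and the helper name `missing_audio_settings` is hypothetical. The PR does not say where or in what format Little Heta stores them.

```python
# The three keys named in the note above.
REQUIRED_AUDIO_KEYS = ("audio_api_key", "audio_model", "audio_base_url")


def missing_audio_settings(config):
    """Return the required audio keys that are absent or blank in `config`."""
    return [
        key
        for key in REQUIRED_AUDIO_KEYS
        if not str(config.get(key) or "").strip()
    ]


# Example: only the API key has been filled in so far.
config = {"audio_api_key": "sk-example", "audio_model": ""}
print(missing_audio_settings(config))
```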

- add heta recall command with L0/L1/L2 multi-layer retrieval and LLM ranker
- add clean_memory to wipe all memory tables while preserving schema
- redesign L2 conflict detection: embedding similarity + LLM judge, no threshold
- fix same-session conflict: exclude current session from conflict candidates
- fix variable-precision time: store when_text + when_resolved + when_precision
- enforce input-language consistency in all extraction prompts
- fix object_type list coercion from LLM output
- add tests: test_clean_memory (9 cases), test_memory_ingest (14 cases)
- add seed_memories.sh for QA test data seeding and eval_qa.py for evaluation
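The variable-precision time fix stores three fields per timestamp. One way to picture this is a record that keeps the original wording, a resolved date where one can be derived, and a label for how precise that date is. The field names come from the commit message; the precision labels, the `WhenRecord` class, and the parsing rules below are assumptions for illustration only.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class WhenRecord:
    when_text: str                 # the time expression exactly as written
    when_resolved: Optional[date]  # earliest date consistent with the text
    when_precision: str            # assumed labels: day, month, year, unknown


def resolve_when(when_text):
    """Resolve ISO-like strings; anything else is kept as text only."""
    text = when_text.strip()
    patterns = (
        (r"(\d{4})-(\d{2})-(\d{2})", "day"),
        (r"(\d{4})-(\d{2})", "month"),
        (r"(\d{4})", "year"),
    )
    for pattern, precision in patterns:
        match = re.fullmatch(pattern, text)
        if match:
            # Pad missing month/day with 1 so coarse dates still resolve.
            parts = [int(group) for group in match.groups()] + [1, 1]
            try:
                return WhenRecord(when_text, date(*parts[:3]), precision)
            except ValueError:
                break  # e.g. month 13: fall through to "unknown"
    return WhenRecord(when_text, None, "unknown")


for sample in ("2026-05-15", "2026-05", "2026", "last spring"):
    print(resolve_when(sample))
```

Keeping `when_text` alongside the resolved value means a vague expression is never lost or silently upgraded to a false day-level date.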

Introduces `heta ask`: an outer agent that decides between two tools — recall_memory (fast multi-layer memory retrieval) and query_kb (deep wiki sub-agent) — and synthesises a final answer. Memory layers include raw turns (FTS5), episodes, atomic facts, and a new kb_insight layer that caches distilled knowledge points from KB pages. Adds `heta mem-clean` to wipe memory data while preserving schema.
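The control flow described above (pick one of two tools, then synthesise an answer) can be sketched as a small dispatcher. Only the tool names `recall_memory` and `query_kb` come from the commit message. The tool bodies are stubs, `choose_tool` and `synthesise` stand in for LLM calls, and the fallback to the fast memory path on an unknown tool name is an assumption, not documented behaviour.

```python
def recall_memory(question):
    """Stub for the fast multi-layer memory retrieval tool."""
    return f"[memory] {question}"


def query_kb(question):
    """Stub for the deep wiki sub-agent tool."""
    return f"[kb] {question}"


TOOLS = {"recall_memory": recall_memory, "query_kb": query_kb}


def ask(question, choose_tool, synthesise):
    """Pick a tool, gather evidence, and produce the final answer.

    `choose_tool(question, tool_names)` and `synthesise(question, evidence)`
    are injected callables standing in for the LLM in this sketch.
    """
    tool_name = choose_tool(question, list(TOOLS))
    if tool_name not in TOOLS:
        tool_name = "recall_memory"  # assumed fallback for an invalid choice
    evidence = TOOLS[tool_name](question)
    return synthesise(question, evidence)


# Toy policies in place of real model calls.
def toy_choice(question, tool_names):
    return "query_kb" if "docs" in question else "recall_memory"


def toy_synthesis(question, evidence):
    return f"Answer based on {evidence}"


print(ask("what do the docs say about FTS5?", toy_choice, toy_synthesis))
print(ask("what did I say yesterday?", toy_choice, toy_synthesis))
```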

Before storing each new insight, embed it, retrieve the top-5 most similar existing insights, and ask the LLM to judge whether the new insight is already fully covered. Insights below the 0.7 similarity threshold bypass the LLM call entirely. On failure the check defaults to storing, so no insight is silently dropped.
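The deduplication check above can be sketched roughly as follows. The top-5 limit, the 0.7 threshold, and the store-on-failure default come from the commit message. The function names, the in-memory list of `(text, vector)` pairs, and the injected `embed` and `llm_says_covered` callables are assumptions for this example. It also reads the threshold rule as "if no existing insight reaches 0.7 similarity, store without calling the LLM", which is one plausible interpretation.

```python
import math

SIMILARITY_THRESHOLD = 0.7
TOP_K = 5


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def should_store(new_text, existing, embed, llm_says_covered):
    """Decide whether a new insight should be written.

    `existing` is a list of (text, vector) pairs. `embed(text)` returns a
    vector and `llm_says_covered(new_text, candidates)` returns True when the
    new insight is already fully covered. Both are hypothetical callables.
    """
    try:
        vector = embed(new_text)
        scored = sorted(
            ((cosine(vector, vec), text) for text, vec in existing),
            reverse=True,
        )[:TOP_K]
        candidates = [text for score, text in scored if score >= SIMILARITY_THRESHOLD]
        if not candidates:
            return True  # nothing similar enough: skip the LLM call
        return not llm_says_covered(new_text, candidates)
    except Exception:
        return True  # on any failure, default to storing the insight


existing = [("FTS5 powers keyword search", [1.0, 0.0, 0.0])]
print(should_store("keyword search uses FTS5", existing,
                   embed=lambda text: [0.9, 0.1, 0.0],
                   llm_says_covered=lambda new, cands: True))
```

Defaulting to `True` in the `except` branch mirrors the stated goal that no insight is silently dropped: a failed embedding or judge call costs a possible duplicate rather than lost knowledge.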


Centralize the color palette in branding.py as the single source of truth, add a coral-red error color, and theme the typer --help screen. Recolor the four off-palette commands (ask, recall, remember, mem-clean) and switch the rest to import the shared colors.
- ask: boxed result card with title, Answer label, and in-box sources
- recall: add --debug flag; non-debug shows the answer plus friendly sources while debug shows ranking/reason/scored evidence; fix KeyError on the kb_insight layer and show a friendly note when evidence is thin
- smart_query: the outer agent now identifies as Little Heta

Drop internal jargon (outer agent, KB, vector database, "Schema is preserved", read-only) from the command summaries shown by `heta --help` so first-time users can tell what each command does.

wangwenwu and others added 30 commits May 15, 2026 14:26

feat: add memory ingest pipeline and remember command

9afa246

fix: harden pdf insert pipeline

c9d61b2

feat: process insert sources sequentially

bc9f5cf

fix: deduplicate wiki vector sync changes

455b104

feat: add insert merge progress bar

acc27d5

fix: start insert progress at one percent

0bc9376

feat: add insert planning toggle

e1ca69f

feat: support image insert captions

1bacb30

feat: support audio insert transcripts

7908935

fix: tighten query source attribution

0e93963

fix: validate query used sources

55ed99a

Revert "feat: deduplicate kb insights via semantic search before writ…

2afb9b1

…e-back" This reverts commit 266caa8.

feat: merge feature/kb-insight-dedup into main

e5068b3

feat: support code file insert

13bb2d9

fix: preserve code raw source links

8904673

fix: preserve code symbol signatures

fe81e0d

fix: retry invalid query json answers

61bda57

feat: merge feature/invalidate-kb-memory into main

b848a32

feat: merge feature/mem-show-insights into main

dc26256

feat: support html insert parsing

06860bf

fix: clean html wiki page extraction

e3ab278

fix: improve html article extraction

2e4d463

fix: improve html documentation extraction

0cff8da

feat: merge feature/insight-distill-then-answer into main

56e3b11

feat: support mineru office documents

cf1682e

docs: reword CLI command help in plain language

89cdc87


tulongshaonian771 and others added 12 commits May 18, 2026 15:03

chore: prepare pypi release

da20fc5

docs: update workspace paths and mineru links

aef9d3b

docs: invite stars and issues

b7fe002

docs: clarify wiki foundation wording

ff3e16d

docs: update memory speed benchmark

5f9320c

docs: use static python version badge

7d41934

docs: clarify KB/memory separation and four memory types in README

fab2440

docs: use static pypi badge

0acf38c

chore: remove uv lockfile

cde57c8

feat: track mineru image sources

6a40580

Add hybrid wiki retrieval and provider clients

6b21956

chore: use english insert result labels

8a398f3

tulongshaonian771 wants to merge 42 commits into main from feature/mineru-image-source