Add MinerU artifact source tracking and hybrid wiki retrieval #1

Open: tulongshaonian771 wants to merge 42 commits into main from feature/mineru-image-source

@tulongshaonian771
Collaborator

Summary

This PR improves Little Heta's document ingestion and retrieval pipeline.

Changes include:

  • Track MinerU parsed artifacts, including images and source annotations.
  • Preserve parsed image assets under the local KB raw workspace.
  • Improve PDF/Office parsing paths with local and cloud MinerU support.
  • Add custom provider profiles for chat, multimodal, embedding, and audio adapters.
  • Add hybrid wiki retrieval with vector + FTS5 keyword search.
  • Add FTS backfill logic for existing vector indexes.
  • Use English labels for insert result output.
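
A minimal sketch of how vector and FTS5 keyword results might be combined, and how an FTS index could be backfilled from an existing vector index. This is not the PR's code: the table name `pages_fts`, the in-memory `pages` mapping, the function names, and the use of reciprocal rank fusion (RRF) are all assumptions chosen for illustration. It also assumes the local SQLite build includes the FTS5 extension.

```python
import math
import sqlite3


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per id."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def open_index():
    """Create a hypothetical FTS5 table (requires an FTS5-enabled SQLite)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE pages_fts USING fts5(page_id UNINDEXED, body)")
    return conn


def backfill_fts(conn, pages):
    """Add every page that has a vector but no FTS row yet.

    `pages` maps page_id -> (text, embedding); it stands in for the
    existing vector index.
    """
    indexed = {row[0] for row in conn.execute("SELECT page_id FROM pages_fts")}
    missing = [(pid, text) for pid, (text, _vec) in pages.items() if pid not in indexed]
    conn.executemany("INSERT INTO pages_fts (page_id, body) VALUES (?, ?)", missing)
    return len(missing)


def hybrid_search(conn, pages, query_text, query_vec, top_k=3):
    """Rank by vector similarity and by FTS5 keyword match, then fuse."""
    by_vector = sorted(pages, key=lambda pid: -cosine(pages[pid][1], query_vec))[:top_k]
    by_keyword = [
        row[0]
        for row in conn.execute(
            "SELECT page_id FROM pages_fts WHERE pages_fts MATCH ? ORDER BY rank LIMIT ?",
            (query_text, top_k),
        )
    ]
    return rrf_fuse([by_vector, by_keyword])[:top_k]
```

RRF is used here only because it needs no score normalization between the two rankers; the PR does not say how the two result sets are merged. Note that raw user text passed to `MATCH` is parsed as FTS5 query syntax, so code-style queries with punctuation would likely need quoting or escaping first.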

Validation

Tested the following flows:

  • heta init with custom provider
  • local MinerU endpoint: http://127.0.0.1:8000
  • heta insert for PDF, PPTX, XLSX, PNG
  • heta insert for MP4 with custom audio adapter configured
  • heta query with Chinese + code-style queries
  • related unit tests: 50 passed

Notes

Custom audio/video parsing works when audio_api_key, audio_model, and audio_base_url are configured manually. The
interactive heta init flow does not yet prompt for custom audio adapter settings.
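
Since the three audio settings have to be filled in by hand, a small readiness check can make a missing value obvious before an MP4 insert is attempted. The sketch below is illustrative only: the field names come from the note above, but the `AudioAdapterConfig` class and its methods are hypothetical, and where Little Heta actually stores these values is not stated in this PR.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AudioAdapterConfig:
    """Hypothetical holder for the three manually configured audio settings."""

    audio_api_key: Optional[str] = None
    audio_model: Optional[str] = None
    audio_base_url: Optional[str] = None

    def missing_fields(self) -> List[str]:
        # A field counts as missing when it is None or an empty string.
        return [name for name, value in vars(self).items() if not value]

    def is_ready(self) -> bool:
        return not self.missing_fields()
```

For example, a config with only `audio_api_key` set would report `audio_model` and `audio_base_url` as missing, which is the situation a user lands in after running the interactive `heta init` without editing the config afterwards.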

wangwenwu and others added 30 commits May 15, 2026 14:26
- add heta recall command with L0/L1/L2 multi-layer retrieval and LLM ranker
- add clean_memory to wipe all memory tables while preserving schema
- redesign L2 conflict detection: embedding similarity + LLM judge, no threshold
- fix same-session conflict: exclude current session from conflict candidates
- fix variable-precision time: store when_text + when_resolved + when_precision
- enforce input-language consistency in all extraction prompts
- fix object_type list coercion from LLM output
- add tests: test_clean_memory (9 cases), test_memory_ingest (14 cases)
- add seed_memories.sh for QA test data seeding and eval_qa.py for evaluation
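
The conflict-detection items above (embedding similarity plus an LLM judge, no threshold, current session excluded) can be sketched roughly as follows. This is an assumption-laden illustration, not the commit's code: the dict keys, the `find_conflicts` name, the `judge` callable standing in for the LLM, and the `top_k` cap are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def find_conflicts(new_fact, existing, judge, top_k=5):
    """Return ids of stored facts that the judge says conflict with new_fact.

    Facts are dicts with 'id', 'session_id', 'text', and 'embedding'.
    `judge(new_text, old_text)` stands in for the LLM call and returns a bool.
    """
    # Facts from the session being ingested are never conflict candidates.
    candidates = [f for f in existing if f["session_id"] != new_fact["session_id"]]
    # Similarity only orders the candidates; there is no cut-off score.
    candidates.sort(
        key=lambda f: cosine(f["embedding"], new_fact["embedding"]),
        reverse=True,
    )
    return [f["id"] for f in candidates[:top_k] if judge(new_fact["text"], f["text"])]
```

Leaving the accept/reject decision entirely to the judge is what "no threshold" is taken to mean here; the cap on how many candidates reach the judge is an added assumption to bound the number of LLM calls.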
Introduces `heta ask`: an outer agent that decides between two tools —
recall_memory (fast multi-layer memory retrieval) and query_kb (deep wiki
sub-agent) — and synthesises a final answer. Memory layers include raw
turns (FTS5), episodes, atomic facts, and a new kb_insight layer that
caches distilled knowledge points from KB pages. Adds `heta mem-clean`
to wipe memory data while preserving schema.
Before storing each new insight, embed it, retrieve the top-5 most
similar existing insights, and ask the LLM to judge whether the new
insight is already fully covered. Insights below the 0.7 similarity
threshold bypass the LLM call entirely. On failure the check defaults
to storing, so no insight is silently dropped.
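
The deduplication flow described above could look roughly like the sketch below. The `should_store` name, the `embed` and `judge_covered` callables, and the `(text, vector)` storage shape are hypothetical. The threshold is read here as: existing insights scoring below 0.7 are dropped from the candidate set, and if none remain the LLM is not called; the commit message may intend a slightly different rule.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def should_store(new_text, existing, embed, judge_covered, top_k=5, threshold=0.7):
    """Decide whether a new insight should be stored.

    existing: list of (text, vector) pairs already stored.
    embed(text) -> vector; judge_covered(new_text, similar_texts) -> bool.
    Both callables stand in for the real embedding and LLM calls.
    """
    try:
        new_vec = embed(new_text)
        scored = sorted(
            ((cosine(new_vec, vec), text) for text, vec in existing),
            reverse=True,
        )[:top_k]
        similar = [text for score, text in scored if score >= threshold]
        if not similar:
            # Nothing is close enough, so the LLM call is skipped.
            return True
        return not judge_covered(new_text, similar)
    except Exception:
        # Fail open: on any error the insight is stored rather than dropped.
        return True
```

Catching a broad `Exception` is deliberate in this sketch so that embedding or LLM failures fall through to storing, matching the "defaults to storing" behaviour described above.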
Centralize the color palette in branding.py as the single source of
truth, add a coral-red error color, and theme the typer --help screen.
Recolor the four off-palette commands (ask, recall, remember, mem-clean)
and switch the rest to import the shared colors.

- ask: boxed result card with title, Answer label, and in-box sources
- recall: add --debug flag; non-debug shows the answer plus friendly
  sources while debug shows ranking/reason/scored evidence; fix KeyError
  on the kb_insight layer and show a friendly note when evidence is thin
- smart_query: the outer agent now identifies as Little Heta
Drop internal jargon (outer agent, KB, vector database, "Schema is
preserved", read-only) from the command summaries shown by `heta --help`
so first-time users can tell what each command does.