Add MinerU artifact source tracking and hybrid wiki retrieval #1
Open
tulongshaonian771 wants to merge 42 commits into
- add heta recall command with L0/L1/L2 multi-layer retrieval and LLM ranker
- add clean_memory to wipe all memory tables while preserving schema
- redesign L2 conflict detection: embedding similarity + LLM judge, no threshold
- fix same-session conflict: exclude current session from conflict candidates
- fix variable-precision time: store when_text + when_resolved + when_precision
- enforce input-language consistency in all extraction prompts
- fix object_type list coercion from LLM output
- add tests: test_clean_memory (9 cases), test_memory_ingest (14 cases)
- add seed_memories.sh for QA test data seeding and eval_qa.py for evaluation
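To illustrate the variable-precision time item, here is a minimal sketch of what storing `when_text`, `when_resolved`, and `when_precision` together could look like. Only the three field names come from the commit message; the `When` dataclass, the `resolve_when` helper, the precision labels, and the restriction to ISO-style strings are assumptions made for illustration and are not taken from the PR's code.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class When:
    """Hypothetical record; only the field names come from the commit message."""
    when_text: str                  # the original wording, kept verbatim
    when_resolved: Optional[date]   # best-effort anchor date, or None
    when_precision: str             # assumed labels: "day", "month", "year", "unknown"


# Assumption: only ISO-like inputs (YYYY, YYYY-MM, YYYY-MM-DD) are resolved here.
_ISO_PARTIAL = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?")


def resolve_when(text: str) -> When:
    """Keep the raw text and record how precise the resolved date really is."""
    match = _ISO_PARTIAL.fullmatch(text.strip())
    if match is None:
        return When(text, None, "unknown")
    year, month, day = match.groups()
    precision = "day" if day else "month" if month else "year"
    try:
        # Missing parts default to 1; when_precision says which parts are real.
        resolved = date(int(year), int(month or 1), int(day or 1))
    except ValueError:
        # For example month 13: keep the text, but do not invent a date.
        return When(text, None, "unknown")
    return When(text, resolved, precision)
```

Keeping the precision next to the resolved date lets a reader of the record distinguish "March 2024" from "1 March 2024" even though both resolve to the same anchor date.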
Introduces `heta ask`: an outer agent that decides between two tools — recall_memory (fast multi-layer memory retrieval) and query_kb (deep wiki sub-agent) — and synthesises a final answer. Memory layers include raw turns (FTS5), episodes, atomic facts, and a new kb_insight layer that caches distilled knowledge points from KB pages. Adds `heta mem-clean` to wipe memory data while preserving schema.
Before storing each new insight, embed it, retrieve the top-5 most similar existing insights, and ask the LLM to judge whether the new insight is already fully covered. Insights below the 0.7 similarity threshold bypass the LLM call entirely. On failure the check defaults to storing, so no insight is silently dropped.
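The duplicate check described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `should_store`, `cosine`, and the `judge` callback are hypothetical names, embeddings are passed in as plain float sequences, and the 0.7 threshold is read as a filter on the top-5 candidates, so the LLM judge is skipped when none of them reaches it.

```python
import math
from typing import Callable, Sequence, Tuple

SIMILARITY_THRESHOLD = 0.7  # from the commit message
TOP_K = 5                   # from the commit message


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def should_store(
    new_text: str,
    new_vec: Sequence[float],
    existing: Sequence[Tuple[str, Sequence[float]]],
    judge: Callable[[str, list], bool],
) -> bool:
    """Return True if the new insight should be stored.

    `judge(new_text, candidates)` stands in for the LLM call and is assumed
    to return True when the new insight is already fully covered.
    """
    try:
        # Rank existing insights by similarity and keep the top-K.
        scored = sorted(
            ((cosine(new_vec, vec), text) for text, vec in existing),
            reverse=True,
        )[:TOP_K]
        candidates = [text for score, text in scored if score >= SIMILARITY_THRESHOLD]
        if not candidates:
            return True  # nothing similar enough: skip the LLM call
        return not judge(new_text, candidates)
    except Exception:
        # Fail open: on any error, store the insight rather than drop it.
        return True
```

Failing open trades a few possible duplicates for the guarantee that an embedding or LLM outage never discards a new insight.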
…e-back" This reverts commit 266caa8.
Centralize the color palette in branding.py as the single source of truth, add a coral-red error color, and theme the typer --help screen. Recolor the four off-palette commands (ask, recall, remember, mem-clean) and switch the rest to import the shared colors.
- ask: boxed result card with title, Answer label, and in-box sources
- recall: add --debug flag; non-debug shows the answer plus friendly sources while debug shows ranking/reason/scored evidence; fix KeyError on the kb_insight layer and show a friendly note when evidence is thin
- smart_query: the outer agent now identifies as Little Heta
Drop internal jargon (outer agent, KB, vector database, "Schema is preserved", read-only) from the command summaries shown by `heta --help` so first-time users can tell what each command does.
Summary
This PR improves Little Heta's document ingestion and retrieval pipeline.
Changes include:
Validation
Tested the following flows:
- `heta init` with custom provider `http://127.0.0.1:8000`
- `heta insert` for PDF, PPTX, XLSX, PNG
- `heta insert` for MP4 with custom audio adapter configured
- `heta query` with Chinese + code-style queries

50 passed

Notes
Custom audio/video parsing works when `audio_api_key`, `audio_model`, and `audio_base_url` are configured manually. The interactive `heta init` flow does not yet prompt for custom audio adapter settings.