feat: use ASR segments for CJK NLP splitting#574
Open
sld272 wants to merge 1 commit into
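Summary
Adds ASR segment-level text as the preferred input for NLP sentence splitting, while keeping the word/char-level timestamp data for final subtitle timeline alignment. Also fixes the spaCy language selection logic and adds a fallback for older tasks.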
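Background
For Japanese and Chinese, the `words` produced by WhisperX alignment may be character-level. The old pipeline concatenated these character-level rows into a single very long string and passed it to spaCy, which caused two problems. Final timestamp alignment, however, still needs the character-level timestamps, so this change separates the two responsibilities.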
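The separation described above can be sketched as follows. This is only an illustration, not the code in this PR: `align_sentences` and the `(text, start, end)` row shape are hypothetical, and the sketch assumes the sentences, once concatenated, match the character rows exactly (no dropped or inserted characters).

```python
from typing import List, Tuple

CharRow = Tuple[str, float, float]  # hypothetical shape: (character, start, end)


def align_sentences(sentences: List[str], chars: List[CharRow]) -> List[Tuple[str, float, float]]:
    """Map already-split sentences back onto character-level timestamps.

    Assumption: "".join(sentences) equals the concatenation of the
    characters in `chars`, so each sentence covers a contiguous slice.
    """
    aligned = []
    pos = 0
    for sentence in sentences:
        span = chars[pos:pos + len(sentence)]
        if len(span) != len(sentence) or "".join(c[0] for c in span) != sentence:
            raise ValueError(f"sentence does not match character rows: {sentence!r}")
        # Sentence start is the first character's start, end is the last character's end.
        aligned.append((sentence, span[0][1], span[-1][2]))
        pos += len(sentence)
    return aligned


chars = [("今", 0.0, 0.2), ("日", 0.2, 0.4), ("は", 0.4, 0.5), ("晴", 0.6, 0.8), ("れ", 0.8, 1.0)]
result = align_sentences(["今日は", "晴れ"], chars)
```

The point of the sketch is that the sentence boundaries can come from any text source (here, segment-level text), while the timing still comes from the character rows.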
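Changes
- Adds `_2_ASR_SEGMENTS = "output/log/asr_segments.xlsx"`.
- `save_segments()` saves the ASR segment-level text.
- `save_segments()`.
- spaCy language selection: `whisper.language` and `whisper.detected_language`.
- When `whisper.language` is set to anything other than `auto`, it takes priority; `detected_language` is used only for `auto`.
- `split_by_mark.py`: uses `asr_segments.xlsx` as the preferred input; for older tasks, falls back to rebuilding the text from `cleaned_chunks.xlsx`.
Verification
- `py_compile` check.
- `split_by_nlp.txt`.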
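The language-selection rule and the input fallback can be sketched roughly like this. The function names `pick_spacy_language` and `pick_nlp_source` are hypothetical and are not taken from the PR; the sketch only decides which value or file to use and does not read the spreadsheets.

```python
from pathlib import Path


def pick_spacy_language(configured: str, detected: str) -> str:
    """Return the configured language unless it is "auto".

    Only the "auto" setting defers to the language detected by ASR.
    """
    return detected if configured == "auto" else configured


def pick_nlp_source(segments_path, chunks_path) -> Path:
    """Prefer the ASR segments file; fall back to the cleaned chunks file.

    Older tasks were created before the segments file existed, so its
    absence is treated as the signal to rebuild text from the chunks.
    """
    segments_path = Path(segments_path)
    return segments_path if segments_path.exists() else Path(chunks_path)


lang_explicit = pick_spacy_language("ja", "en")
lang_auto = pick_spacy_language("auto", "zh")
```

Checking for the file's existence keeps older task directories working without a migration step, at the cost of silently using the lower-quality input when the segments file is missing.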