feat: add Chinese variant conversion via OpenCC #52

Open
fzlins wants to merge 2 commits into master from feature/metadata-language

fzlins (Owner) commented May 26, 2026

Summary

Add automatic Chinese variant conversion (Simplified ↔ Traditional) applied after extraction, before saving metadata.json and import.csv.

Changes

  • requirements.txt — Add opencc-python-reimplemented>=0.1.7 (pure Python, no C++ build required)
  • config.ini — Add chinese_convert option
  • tmdb-import/util/chinese_convert.py (new) — Conversion module with lazy-singleton OpenCC converter and punctuation tables
  • tmdb-import/extractor.py — Hook conversion into pipeline; extract_from_url now returns metadata
  • README.md / docs/README.zh-CN.md — Document new config option

Configuration

# Leave empty to disable. Options: zh-CN, zh-TW, zh-HK
chinese_convert =
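
A minimal sketch of how the option could be read and validated. The section name (`DEFAULT`), the helper name `read_chinese_convert`, and the choice to raise on unknown values are assumptions for illustration; the PR does not show how config.ini is parsed.

```python
import configparser
from typing import Optional

# Targets listed in the config comment above.
SUPPORTED_TARGETS = {"zh-CN", "zh-TW", "zh-HK"}


def read_chinese_convert(ini_text: str, section: str = "DEFAULT") -> Optional[str]:
    """Return the configured target, or None when the option is empty or missing.

    The section name is hypothetical; adjust it to the real config.ini layout.
    """
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    value = parser.get(section, "chinese_convert", fallback="").strip()
    if not value:
        return None  # empty value disables conversion
    if value not in SUPPORTED_TARGETS:
        raise ValueError(f"unsupported chinese_convert value: {value!r}")
    return value
```

Returning `None` for an empty value lets callers treat "disabled" and "not configured" the same way.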

How it works

  1. After any extractor returns a Metadata object, checks metadata.language
  2. If language starts with zh (covers zh-CN, zh-TW, zh-HK, zh-SG, etc.) and chinese_convert is set, runs conversion in-place
  3. Both metadata.json and import.csv are saved with the converted text
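
The steps above could be sketched roughly as follows. The `Metadata` and `Episode` fields, and the function names, are hypothetical stand-ins rather than the project's actual classes; `convert` is any callable that maps a string to its converted form.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Episode:  # hypothetical shape, for illustration only
    name: str = ""
    overview: str = ""


@dataclass
class Metadata:  # hypothetical shape, for illustration only
    language: str = ""
    title: str = ""
    overview: str = ""
    episodes: List[Episode] = field(default_factory=list)


def should_convert(language: Optional[str], target: Optional[str]) -> bool:
    """Convert only when a target is configured and the language starts with 'zh'."""
    return bool(target) and bool(language) and language.lower().startswith("zh")


def convert_metadata_in_place(
    metadata: Metadata, target: Optional[str], convert: Callable[[str], str]
) -> Metadata:
    """Apply `convert` to every text field, mutating `metadata` before it is saved."""
    if not should_convert(metadata.language, target):
        return metadata
    metadata.title = convert(metadata.title)
    metadata.overview = convert(metadata.overview)
    for episode in metadata.episodes:
        episode.name = convert(episode.name)
        episode.overview = convert(episode.overview)
    return metadata
```

Because the object is mutated before either file is written, metadata.json and import.csv see the same converted text.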

OpenCC config mapping:

| Target | OpenCC config | Description |
|--------|---------------|-------------|
| zh-CN | t2s | Traditional → Simplified |
| zh-TW | s2twp | Simplified → Taiwan Traditional (with phrase conversion) |
| zh-HK | s2hk | Simplified → HK Traditional |
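
One way the mapping and the lazily created converter might fit together. This is a sketch, not the code in chinese_convert.py: the names `OPENCC_CONFIGS`, `opencc_config_for`, and `get_converter` are assumptions, and `lru_cache` keeps one converter per target rather than a strict single instance.

```python
from functools import lru_cache

# Target language tag -> OpenCC config name, as listed in the table above.
OPENCC_CONFIGS = {
    "zh-CN": "t2s",
    "zh-TW": "s2twp",
    "zh-HK": "s2hk",
}


def opencc_config_for(target: str) -> str:
    """Look up the OpenCC config name for a configured target."""
    try:
        return OPENCC_CONFIGS[target]
    except KeyError:
        raise ValueError(f"unsupported target: {target!r}") from None


@lru_cache(maxsize=None)
def get_converter(target: str):
    """Build the converter on first use and reuse it afterwards."""
    # Imported here so opencc is only required when conversion is enabled.
    from opencc import OpenCC

    return OpenCC(opencc_config_for(target))
```

With opencc-python-reimplemented installed, `get_converter("zh-CN").convert(text)` would return the converted string.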

Punctuation conversion (handled separately since OpenCC doesn't convert punctuation by default):

| Target | Conversion |
|--------|------------|
| zh-CN | 「」『』 → ""'' |
| zh-TW / zh-HK | ""'' → 「」『』 |
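
A possible shape for the punctuation tables, using `str.translate`. Two assumptions are made here that the PR text does not confirm: the quotation marks on the Simplified side are the curly forms (“ ” ‘ ’), and 「」 pairs with the double quotes while 『』 pairs with the single quotes.

```python
# Corner brackets -> curly quotes (assumed pairing, see the note above).
TO_SIMPLIFIED = str.maketrans({"「": "“", "」": "”", "『": "‘", "』": "’"})
# Curly quotes -> corner brackets (the inverse table).
TO_TRADITIONAL = str.maketrans({"“": "「", "”": "」", "‘": "『", "’": "』"})


def convert_punctuation(text: str, target: str) -> str:
    """Swap quotation marks to match the target variant; other text is untouched."""
    if target == "zh-CN":
        return text.translate(TO_SIMPLIFIED)
    if target in ("zh-TW", "zh-HK"):
        return text.translate(TO_TRADITIONAL)
    return text  # unknown target: leave punctuation as-is
```

A translation table handles each character independently, so it can run before or after the OpenCC pass without affecting the result.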
