Description
mammoth.convert_to_html (and any other entry point that exercises body_xml.read_fld_char) crashes with IndexError: pop from empty list when it encounters a <w:fldChar w:fldCharType="end"/> (or "separate") that has no matching prior "begin" element.
The root cause is in mammoth/docx/body_xml.py:
def read_fld_char(element):
    fld_char_type = element.attributes.get("w:fldCharType")
    if fld_char_type == "begin":
        complex_field_stack.append(...)
        ...
    elif fld_char_type == "end":
        complex_field = complex_field_stack.pop()  # <-- line 206
        ...
    elif fld_char_type == "separate":
        complex_field_separate = complex_field_stack.pop()  # <-- line 214
        ...
Both .pop() calls assume the stack is non-empty, which holds for well-formed documents but is not guaranteed for arbitrary input. Any document in which an end (or separate) fldChar appears while no field is open, for example one produced by a buggy DOCX generator, hand-edited, recovered from a partially corrupted file, or carved out of a larger document, leaks IndexError to the caller.
This is similar in shape to #158 ('w:ilvl' when parsing malformed docx numbering), which you accepted and fixed by hardening the malformed-input path.
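The failure mode described above can be illustrated without mammoth at all. The sketch below is a toy model, not mammoth's actual code: walk_field_chars and its guard parameter are hypothetical names, and the stack holds plain strings rather than complex-field objects. It only shows why an unmatched marker reaches .pop() on an empty list, and how an emptiness check avoids that.

```python
def walk_field_chars(fld_char_types, guard=False):
    """Toy model of complex-field tracking (hypothetical, not mammoth's code).

    Returns the number of unmatched "separate"/"end" markers that were skipped.
    With guard=False, an unmatched marker raises IndexError instead.
    """
    stack = []
    skipped = 0
    for fld_char_type in fld_char_types:
        if fld_char_type == "begin":
            stack.append("begin")
        elif fld_char_type in ("separate", "end"):
            if guard and not stack:
                # Unmatched marker: ignore it rather than popping.
                skipped += 1
                continue
            stack.pop()  # raises IndexError when the stack is empty
            if fld_char_type == "separate":
                # The field stays open until its "end" marker arrives.
                stack.append("separate")
    return skipped
```

Calling walk_field_chars(["end"]) raises IndexError, while walk_field_chars(["end"], guard=True) skips the stray marker and returns normally.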
Reproduction
Minimal standalone repro (no template needed — builds the .docx in memory):
import io
import zipfile

import mammoth

# Document body whose only fldChar is an unmatched "end".
DOCUMENT_XML = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:r><w:fldChar w:fldCharType="end"/></w:r>
      <w:r><w:t>Hello</w:t></w:r>
    </w:p>
  </w:body>
</w:document>"""

CONTENT_TYPES_XML = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="xml" ContentType="application/xml"/>
  <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>"""

PACKAGE_RELS = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>
</Relationships>"""

# Assemble the three parts into an in-memory .docx (a zip archive).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
    z.writestr("[Content_Types].xml", CONTENT_TYPES_XML)
    z.writestr("_rels/.rels", PACKAGE_RELS)
    z.writestr("word/document.xml", DOCUMENT_XML)
buf.seek(0)

mammoth.convert_to_html(buf)  # raises IndexError: pop from empty list
Traceback (HEAD as of 2026-05-24, commit on master):
  File ".../mammoth/docx/body_xml.py", line 206, in read_fld_char
    complex_field = complex_field_stack.pop()
IndexError: pop from empty list
Switching the fldCharType in the repro to "separate" hits the matching crash on line 214.
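To exercise both variants without editing the XML by hand, the repro can be parametrized on the fldCharType value. This is only a sketch: build_docx, crashes_with_index_error, and DOCUMENT_TEMPLATE are hypothetical helper names introduced here, and the package parts are the same ones used in the repro above. mammoth is imported lazily so the builder itself depends only on the standard library.

```python
import io
import zipfile

# Same document as the repro, with the fldCharType left as a placeholder.
DOCUMENT_TEMPLATE = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:r><w:fldChar w:fldCharType="{fld_char_type}"/></w:r>
      <w:r><w:t>Hello</w:t></w:r>
    </w:p>
  </w:body>
</w:document>"""

CONTENT_TYPES_XML = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="xml" ContentType="application/xml"/>
  <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>"""

PACKAGE_RELS = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>
</Relationships>"""


def build_docx(fld_char_type):
    """Return an in-memory .docx whose only fldChar has the given type."""
    document_xml = DOCUMENT_TEMPLATE.format(fld_char_type=fld_char_type)
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
        z.writestr("[Content_Types].xml", CONTENT_TYPES_XML)
        z.writestr("_rels/.rels", PACKAGE_RELS)
        z.writestr("word/document.xml", document_xml.encode("utf-8"))
    buf.seek(0)
    return buf


def crashes_with_index_error(fld_char_type):
    """Convert the generated document; report whether IndexError escapes."""
    import mammoth  # lazy import: build_docx works without mammoth installed

    try:
        mammoth.convert_to_html(build_docx(fld_char_type))
    except IndexError:
        return True
    return False
```

Calling crashes_with_index_error("end") and crashes_with_index_error("separate") covers the two .pop() sites in one run, and the same helper could back a regression test once a fix lands.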
Suggested fix
Guard both pops:
    elif fld_char_type == "end":
        if not complex_field_stack:
            return _empty_result
        complex_field = complex_field_stack.pop()
        ...
    elif fld_char_type == "separate":
        if not complex_field_stack:
            return _empty_result
        complex_field_separate = complex_field_stack.pop()
        ...
Happy to send a PR if useful.
Context
Found by tailtest, an adversarial test generator I'm building. Filing on behalf of the run; the issue is reproduced and confirmed independently against current master.