MolParser is a toolkit for working with E-SMILES (extended SMILES) in OCSR and Markush workflows. The notation follows the formulation introduced in the MolParser paper.

| Path | Role |
|---|---|
| `utils/` | MolParser utils (substitute known abbreviations, and render E-SMILES to structure) |
| `skills/molparser-extended-smiles/` | E-SMILES skills (concise rules and examples for LLM / OCSR agents) |

Install the dependencies with `pip install -r requirements.txt`. Run the examples below from the repository root so that `from utils import ...` resolves correctly.

E-SMILES combines a base SMILES with an optional extension:
SMILES<sep>EXTENSION
Common extension records:
- `<a>0:R[1]</a>`: atom-indexed substituent or Markush placeholder
- `<r>0:R[1]</r>`: ring-indexed substituent (regio-uncertain attachment)
- `<c>9:B</c>`: abstract-ring or superatom placeholder
- `<a>0:<dum></a>`: explicit dummy attachment point
- `|Sg:n|`: structural repeating unit (SRU) marker

Full specification: [skills/molparser-extended-smiles/extended-smiles-spec.md](skills/molparser-extended-smiles/extended-smiles-spec.md)
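
As a rough illustration of the `SMILES<sep>EXTENSION` layout, the sketch below splits an E-SMILES string into its base SMILES and its `<a>`, `<r>`, and `<c>` records using only the standard library. `split_esmiles` and `ExtensionRecord` are hypothetical names and are not part of `utils`. The regular expression is a simplification: it ignores the `|Sg:n|` marker and is no substitute for the full specification or the repository's validator.

```python
import re
from typing import List, NamedTuple, Tuple


class ExtensionRecord(NamedTuple):
    """One extension record (hypothetical container, for illustration only)."""
    kind: str   # "a" (atom), "r" (ring), or "c" (abstract ring / superatom)
    index: int  # the integer written before the colon
    label: str  # everything between the colon and the closing tag


# Matches <a>IDX:LABEL</a>, <r>IDX:LABEL</r>, and <c>IDX:LABEL</c>.
_RECORD_RE = re.compile(r"<([arc])>(\d+):(.*?)</\1>")


def split_esmiles(esmiles: str) -> Tuple[str, List[ExtensionRecord]]:
    """Split 'SMILES<sep>EXTENSION' into the base SMILES and its records."""
    base, _, extension = esmiles.partition("<sep>")
    records = [
        ExtensionRecord(kind, int(index), label)
        for kind, index, label in _RECORD_RE.findall(extension)
    ]
    return base, records


base, records = split_esmiles("*c1ccccc1<sep><a>0:CF3</a>")
print(base)     # *c1ccccc1
print(records)  # [ExtensionRecord(kind='a', index=0, label='CF3')]
```

A string without `<sep>` comes back as the base SMILES with an empty record list, which matches the description of the extension as optional.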
Normalize an E-SMILES string and substitute known abbreviations. In this example, CF3 is read from utils/abbrevs_example.csv, attached to the dummy atom *, and folded into ordinary SMILES. No unresolved Markush group remains, so markush is False.
from utils import postprocess_caption
raw = "*c1ccccc1<sep><a>0:CF3</a>"
result = postprocess_caption(raw)
# caption: original input string
# smi: normalized RDKit SMILES after substituting known abbreviations
# esmi: normalized E-SMILES after substitution and index repair
# cxsmiles: CXSMILES generated from the normalized E-SMILES
# markush: True if unresolved Markush labels remain
# sru: True if a structural repeating unit marker was detected
# groups: unresolved E-SMILES extension records kept after normalization
for key in ("caption", "smi", "esmi", "cxsmiles", "markush", "sru", "groups"):
    print(f"{key}: {result[key]}")

Expected output:

caption: *c1ccccc1<sep><a>0:CF3</a>
smi: FC(F)(F)c1ccccc1
esmi: FC(F)(F)c1ccccc1<sep>
cxsmiles: FC(F)(F)c1ccccc1
markush: False
sru: False
groups:
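
One possible way to act on these fields is to route each result by its `markush` and `sru` flags. The sketch below is only an illustration: `triage` and its three category names are hypothetical, and checking `sru` before `markush` is an arbitrary choice. It uses a stand-in dict with the documented keys so that it runs without the repository.

```python
from typing import Any, Mapping


def triage(result: Mapping[str, Any]) -> str:
    """Classify a result dict by its documented `markush` and `sru` flags."""
    if result["sru"]:
        return "repeating-unit"  # an SRU marker was detected
    if result["markush"]:
        return "markush"         # unresolved Markush labels remain
    return "resolved"            # plain structure; `smi` can be used directly


# Stand-in for the result shown above; in practice pass postprocess_caption(raw).
example = {"smi": "FC(F)(F)c1ccccc1", "markush": False, "sru": False}
print(triage(example))  # resolved
```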
Render the E-SMILES as SVG and save it locally:
from pathlib import Path
from utils import draw
svg_text = draw("*C(O)c1cc(C(=O)N(*)*)cc(-c2*ccc*2)c1<sep><a>0:CF3</a><a>9:R[3]</a><a>10:R[2]</a><a>14:X</a><a>18:Y</a><r>1:R[1]?1-3</r>", output_format="svg")
svg_path = Path("molecule.svg")
svg_path.write_text(svg_text, encoding="utf-8")

`molecule.svg` is a local render artifact.
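
Before saving or converting a render, it can help to confirm that the string is well-formed XML with an `svg` root. The helper below is a minimal sketch using only the standard library. `looks_like_svg` is a hypothetical name, and it assumes the rendered text is a standalone SVG document, which the source does not state explicitly.

```python
import xml.etree.ElementTree as ET


def looks_like_svg(text: str) -> bool:
    """Return True if `text` parses as XML and its root element is <svg>."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return False
    # The root tag is usually namespaced: "{http://www.w3.org/2000/svg}svg".
    return root.tag == "svg" or root.tag.endswith("}svg")


sample = '<svg xmlns="http://www.w3.org/2000/svg" width="10" height="10"></svg>'
print(looks_like_svg(sample))     # True
print(looks_like_svg("not xml"))  # False
```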
To obtain a PNG from that SVG (requires cairosvg from requirements.txt):
from pathlib import Path
import cairosvg

svg_path = Path("molecule.svg")  # written by the SVG example above
png_path = Path("molecule.png")
cairosvg.svg2png(url=str(svg_path), write_to=str(png_path))

Load these files for the agent:

- `skills/molparser-extended-smiles/SKILL.md`
- `skills/molparser-extended-smiles/extended-smiles-spec.md`
- `skills/molparser-extended-smiles/figure-index.md`

1. Base SMILES
2. E-SMILES in SMILES<sep>EXTENSION format
3. Markush status
4. Unsupported or ambiguous chemistry

python skills/molparser-extended-smiles/validate_esmiles.py "<your_esmiles>"

Then normalize and render with `postprocess_caption` and `draw`.

- Uni-Parser — agent-oriented scientific document parsing with the latest MolParser. Demo
- MolParser — end-to-end molecular recognition. Demo
- MolDetv2 weights — lightweight molecule detector. Demo