glossa-lab

Author: Tristen Pierson, BitConcepts Research
ORCID: 0009-0003-7269-956X

Agentic computational linguistics research platform for statistical analysis, decipherment, and hypothesis testing of ancient and unknown writing systems — with a primary focus on the Indus Script.

Decipherment Status (v4 preprint): 161 H+M candidate readings (75 HIGH + 86 MEDIUM) covering 90.96% of Holdat IVS tokens · 59% agreement with Parpola (1994) · Fish-sign isolation test: 0/140 isolated across all 9 sites and Gulf catalog · M267 reclassified as genitive particle · 3-slot positional grammar z=10.3 (0/2000 permutations) · Independent replication: Nair 2026 (arXiv:2604.17828)

Preprint (v4): Pierson, T.K. (2026). A Falsifiable Computational Decipherment Hypothesis for the Indus Valley Script: 161 Candidate Proto-Dravidian Anchors and a Three-Slot Positional Grammar. Zenodo. DOI: 10.5281/zenodo.20414696

Built and maintained by BitConcepts LLC

Overview

Glossa Lab is a production research tool combining a Python backend, React frontend, and Windows/Linux/macOS service support. It provides an end-to-end environment for:

Corpus management — upload, register, inspect, and sanitise sign-sequence corpora
Statistical analysis — entropy, Zipf, positional profiles (T/I/M), writing-system classification
Decipherment experiments — SA-based sign-to-phoneme hypothesis generation, benchmarks vs known scripts
Experiment Builder — composable graph experiments using atomic nodes (no coding required); new Evidence Graph category with 7 nodes for comparative literature analysis
Study Builder — multi-experiment research workflows as visual graphs
Glossa AI — embedded research assistant that runs analyses, proposes hypotheses, and navigates the tool
Discovery engine — continuous literature discovery across arXiv, EuropePMC, CrossRef, DOAJ and more
Evidence Graph — per-project literature library, automated paper sweep (configurable via sweep.yaml), claim extraction, cross-hypothesis falsification matrix, and hidden hypothesis generation
AI Provider Registry — unified management of cloud (OpenAI, Anthropic, Mistral, Google…), local (Ollama), and self-hosted (vLLM) AI backends with model scoring and smart assignment
Reports & Data — PDF, Markdown, JSON, CSV export of all results
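The statistical measures listed above (entropy, Zipf, positional profiles) can be illustrated with a short standalone sketch. This is not Glossa Lab's own implementation: the function names are hypothetical, the toy corpus is invented, and reading "T/I/M" as terminal/initial/medial is an assumption.

```python
import math
from collections import Counter, defaultdict


def shannon_entropy(tokens):
    """Unigram Shannon entropy in bits per sign."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank)."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct signs to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def positional_profile(sequences):
    """Count per sign how often it is initial (I), medial (M), or terminal (T).

    Assumption: index 0 is 'initial'; a single-sign sequence counts as initial.
    """
    profile = defaultdict(lambda: {"T": 0, "I": 0, "M": 0})
    for seq in sequences:
        last = len(seq) - 1
        for index, sign in enumerate(seq):
            if index == 0:
                profile[sign]["I"] += 1
            elif index == last:
                profile[sign]["T"] += 1
            else:
                profile[sign]["M"] += 1
    return dict(profile)


if __name__ == "__main__":
    # Invented toy corpus; sign labels are placeholders, not real sign IDs.
    corpus = [["s1", "s2", "s3"], ["s1", "s3"], ["s1", "s2", "s2", "s3"]]
    flat = [sign for seq in corpus for sign in seq]
    print("entropy (bits):", round(shannon_entropy(flat), 3))
    print("Zipf slope:", round(zipf_slope(flat), 3))
    print("profile:", positional_profile(corpus))
```

A slope near -1 on the log-log rank-frequency fit is the classic Zipfian signature; on a corpus this small the fit is only illustrative.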

System architecture

[ Tray ] ─────┐
              │
[ Frontend ] ─┼──→ [ Backend Service (FastAPI) ] ──→ [ Pipelines / Jobs / Models ]
              │              │
[ CLI / Dev ] ┘         [ SQLite DB ]
                              │
                    [ Provider Registry ] ──→ [ Cloud / Ollama / vLLM ]

Key principles

The backend is the source of truth
The tray and frontend are interfaces, not runtime owners
All communication occurs through explicit REST APIs
Service lifecycle is deterministic and observable — every background process logs START/COMPLETE

Components

Backend (Python / FastAPI)

REST API + background job engine
SQLite database (providers, model scores, discovery items, experiments, studies)
AI provider registry with test/probe on startup and on-demand
HuggingFace Open LLM Leaderboard sync (nightly) + static fallback scores
Discovery engine with 10+ fetchers (arXiv, EuropePMC, CrossRef, PubMed, DOAJ…)
RAG index for research context injection
Ollama auto-detection and lifecycle management

Frontend (React / TypeScript / Vite)

The built artefact (frontend/dist/) is committed to the repository, so deploying to the server requires only git pull; Node.js is not needed on the deployment target.

Key panels:

Provider Registry — add/test/manage AI providers; badges: 🦙 Ollama · ☁️ Cloud · ⚡ vLLM/Custom · 🤗 HuggingFace
Model Assignments — assign primary/fallback models per bucket (Reasoning / Conversational / Long-form / Global) with draft/apply workflow, scores, filter, and swap
Experiment Builder — visual DAG editor with Evidence Graph palette category (7 nodes)
Study Builder — multi-experiment research workflows (accessible via Projects)
Discovery View — literature feed with → Evidence import action for Indus/Harappan items
Evidence Graph — three-tab workspace: Library (PDF upload, URL import), Claims (filterable), Sweep (configurable sweep + candidate import)
Foundation Check — research integrity dashboard (17 checks; must be PASS before external communication)
Bottom Panel — structured Logs (JSON → human-readable), Jobs, Terminal

Tray (Windows/macOS)

Local control surface: start, stop, or restart the backend, open the UI, and view quick status.

Indus Script Decipherment

161 H+M candidate readings (75 HIGH + 86 MEDIUM) covering 90.96% of the Holdat IVS corpus — a falsifiable computational decipherment hypothesis for the Indus Script (~2600–1900 BCE).

Metric	Value
H+M candidate readings	161 (75 HIGH + 86 MEDIUM)
Token coverage (H+M)	90.96% (6,363/7,002 Holdat tokens)
Seal coverage	69.8% (1,165/1,670 seals fully covered by H+M)
Parpola agreement	59% (44/75 HIGH readings in Parpola 1994)
Positional grammar	z=10.3; 0/2000 permutations exceeded observed
Fish-sign isolation	0/140 isolated (0/113 corpus + 0/27 Gulf)
External replication	Nair 2026 (arXiv:2604.17828) on ICIT corpus
Grammar accuracy	93.2% sign-level at 161 H+M (Phase-170)
Preprint DOI	10.5281/zenodo.20414696
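The two coverage rows in the table measure different things: token coverage counts individual sign occurrences that have a candidate reading, while seal coverage counts inscriptions in which every sign has one. A minimal sketch of that distinction follows, assuming a corpus represented as lists of sign labels. The function name, data layout, and toy data are hypothetical and not taken from the project's scripts.

```python
def coverage(seals, anchored_signs):
    """Return (token_coverage, seal_coverage) as fractions in [0, 1].

    token_coverage: share of all sign tokens whose sign has a reading.
    seal_coverage:  share of seals in which every sign has a reading.
    """
    anchored = set(anchored_signs)
    total_tokens = sum(len(seal) for seal in seals)
    if total_tokens == 0 or not seals:
        raise ValueError("corpus must contain at least one token")
    covered_tokens = sum(1 for seal in seals for sign in seal if sign in anchored)
    fully_covered = sum(
        1 for seal in seals if seal and all(sign in anchored for sign in seal)
    )
    return covered_tokens / total_tokens, fully_covered / len(seals)


if __name__ == "__main__":
    # Invented toy corpus; labels are placeholders, not real sign IDs.
    toy_seals = [["s1", "s2"], ["s1", "s3"], ["s2"]]
    token_cov, seal_cov = coverage(toy_seals, {"s1", "s2"})
    print(f"token coverage: {token_cov:.1%}, seal coverage: {seal_cov:.1%}")
```

Seal coverage is always the stricter figure, because a single unread sign removes the whole inscription from the numerator.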

Key files

backend/reports/
├── INDUS_FINAL_ANCHORS.json                        ← anchor table with all readings
glossa-corpus/indus/
├── pierson_2026_indus_decipherment.tex              ← preprint source (LaTeX)
└── pierson_2026_indus_decipherment_preprint_v4.pdf  ← preprint PDF (CC BY 4.0)
research/indus/
└── phase_reports/                                   ← all phase analysis reports

Repository structure

glossa-lab/
├─ LICENSE              ← MIT (source code)
├─ AGENTS.md            ← agent operating rules (read first, every session)
├─ LEDGER.md            ← session ledger (sole continuity authority)
├─ README.md
├─ CITATIONS.md         ← citation registry for all research data
├─ setup-os.cmd / setup-os.sh  ← install/start/stop/restart
├─ shell.cmd / shell.sh        ← tool wrapper (pytest, ruff, python)
├─ .github/
│  └─ workflows/ci.yml  ← GitHub Actions CI
├─ backend/             ← Python FastAPI application
│  ├─ glossa_lab/       ← app modules (api/, experiments/, discovery/, ...)
│  ├─ glossa_mcp/       ← MCP server (Warp/Oz agent integration, 27 tools)
│  ├─ scripts/          ← all research and utility scripts
│  └─ tests/
├─ frontend/            ← React / TypeScript / Vite
│  ├─ src/
│  └─ dist/             ← built artefact (committed for server deploy)
├─ tray/                ← system tray app
├─ services/            ← systemd / launchd / Windows service definitions
├─ docs/
│  ├─ images/           ← diagrams and sign images
│  ├─ governance/       ← governance docs
│  ├─ research/         ← decipherment research docs
│  ├─ USER_GUIDE.md
│  ├─ architecture.md
│  └─ REQUIREMENTS.md
├─ data/                ← canonical corpus and reference data
│  ├─ crosswalks/       ← sign crosswalk CSVs (M-number ↔ Parpola, ICIT/Fuls)
│  ├─ raw/              ← raw source corpora
│  ├─ normalized/       ← cleaned / extracted corpus files
│  └─ import/           ← staged import artifacts
├─ outputs/             ← generated computational artifacts
│  └─ analysis/         ← summary JSON analysis files
├─ reports/             ← human-readable research reports (PDF, Markdown)
├─ research/            ← public preprint outputs
│  └─ indus/            ← preprint PDF, anchor table, phase reports (CC BY 4.0)
├─ scripts/             ← project-wide utility scripts
├─ glossa-corpus/       ← internal corpus store
├─ glossa-indus/        ← Evidence Graph data store
│  ├─ config/sweep.yaml
│  ├─ literature/ · claims/ · hypotheses/ · raw/
│  └─ scripts/
└─ corpora/             ← external corpus downloads (gitignored, ~3 GB)

Quick start

Windows

# First-time install (registers autostart, installs deps)
setup-os.cmd install

# Start backend + tray
setup-os.cmd start

# Verify
curl.exe -sf http://localhost:8001/api/v1/health

Linux (systemd)

cd backend && python3 -m venv venv && venv/bin/pip install -e .
sudo systemctl start glossa-lab
curl -sf http://localhost:8001/api/v1/health

Open http://localhost:8001 in your browser.
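The curl health checks above can also be scripted, for example to wait until the backend is ready after a start. The sketch below polls the same health URL using only the standard library. The function names, retry count, and delay are assumptions, and because the response body of the endpoint is not documented here, only the HTTP status is inspected.

```python
import time
import urllib.request

HEALTH_URL = "http://localhost:8001/api/v1/health"


def _http_status(url):
    """Return the HTTP status code for a GET request to url."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.status


def wait_for_backend(url=HEALTH_URL, attempts=10, delay=1.0, fetch=None):
    """Return True once url answers with HTTP 200, False after all attempts fail.

    fetch is injectable so the loop can be exercised without a live server.
    """
    fetch = fetch or _http_status
    for attempt in range(attempts):
        try:
            if fetch(url) == 200:
                return True
        except OSError:
            # URLError and HTTPError both derive from OSError:
            # connection refused, timeout, or a non-2xx response.
            pass
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


if __name__ == "__main__":
    print("backend healthy" if wait_for_backend() else "backend not reachable")
```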

Development workflow

All non-trivial work follows the proposal-first cycle in AGENTS.md. Frontend changes require a rebuild before they are visible:

cd frontend && npm run build
# Verify the served bundle (PowerShell):
curl.exe -sf http://localhost:8001/ | Select-String 'index-[A-Za-z0-9]+\.js'

MCP server (Warp / Oz)

Glossa Lab ships a FastMCP server that exposes 27 backend operations as MCP tools, allowing Warp's Oz agent to query and control the system directly — no manual API calls required.

What it covers

Category	Tools
Status	`get_status`, `get_system_metrics`
Jobs	`list_jobs`, `get_job`, `create_job`, `cancel_job`, `get_job_results`
Experiments	`list_experiments`, `get_experiment`, `run_experiment`
Research loop	`start_research_loop`, `get_research_loop_status`, `stop_research_loop`, `get_research_loop_results`, `get_anchor_staging`
Foundation check	`run_foundation_check`
Discovery	`list_discovery_items`, `get_discovery_stats`, `trigger_discovery_fetch`, `update_discovery_item_status`
Dashboard	`get_latest_insight`, `get_dashboard_highlights`
Anchor sets	`list_anchor_sets`, `get_anchor_set`, `create_anchor_set`
Reports	`list_reports`, `get_report`

Setup

Start the backend (setup-os.cmd start or uvicorn glossa_lab.main:create_app --factory --port 8001).
In Warp, open Settings → Agents → MCP Servers and add a new server with:

{
  "glossa-lab": {
    "command": "C:/Users/trist/Development/BitConcepts/glossa-lab/backend/venv/Scripts/python.exe",
    "args": ["C:/Users/trist/Development/BitConcepts/glossa-lab/backend/glossa_mcp/server.py"]
  }
}

Adjust both paths to match your install location. The server defaults to http://127.0.0.1:8001; override with the GLOSSA_BASE_URL environment variable if needed.
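The default-plus-override behaviour described above is a common pattern. One way it could be written is sketched below. How server.py actually reads GLOSSA_BASE_URL is not shown in this README, so the helper names and the trailing-slash handling are assumptions.

```python
import os

DEFAULT_BASE_URL = "http://127.0.0.1:8001"


def resolve_base_url(environ=None):
    """Return GLOSSA_BASE_URL if set and non-empty, else the default."""
    environ = os.environ if environ is None else environ
    value = environ.get("GLOSSA_BASE_URL", "").strip()
    # Strip a trailing slash so joined paths never contain '//'.
    return (value or DEFAULT_BASE_URL).rstrip("/")


def api_url(path, environ=None):
    """Join an API path onto the resolved base URL."""
    return f"{resolve_base_url(environ)}/{path.lstrip('/')}"


if __name__ == "__main__":
    print(api_url("/api/v1/health"))
```

Passing the environment as a parameter keeps the helper easy to test without mutating os.environ.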

Source

backend/glossa_mcp/
├── __init__.py
└── server.py   ← FastMCP server (edit here to add tools)

Project discipline

This project follows strict research governance enforced by both convention and tooling:

Append-only ledger — Every session's work is recorded in LEDGER.md. No ledger entry = work not done.
Data provenance — Every data file must have a citation traceable to CITATIONS.md. No uncited data in the pipeline.
Graph-first experiments — All research phases are registered as navigable experiment graph nodes (see backend/glossa_lab/experiment_graph*.py). No ad-hoc scripts without graph registration.
Foundation checks — backend/scripts/foundation_check.py must pass before any external communication or publication. This guards against regressions in anchor data, grammar metrics, and sign accounting.
Public/private boundary — Private correspondence lives in .correspondence/ (gitignored). No third-party emails or private contact details in tracked files.
AI disclosure — All AI-assisted work is disclosed in publications and the ledger. Statistical tests are designed and interpreted by the author; AI tooling is used for scripting, data management, and literature search.

Full governance rules: docs/governance/

Documentation

File	Purpose
`AGENTS.md`	Agent operating rules — read first every session
`LEDGER.md`	Append-only session ledger — the sole continuity authority
`CITATIONS.md`	Research data citation registry
`docs/governance/`	Hard rules, session protocol, roles, verification
`docs/USER_GUIDE.md`	Full user guide (all panels)
`docs/architecture.md`	System architecture
`docs/REQUIREMENTS.md`	Formal requirements (R1–R16)
`docs/TESTS.md`	Test specification
`docs/research/`	Decipherment research documents
`research/indus/`	Public outputs — preprint PDF, anchor table, phase reports (CC BY 4.0)
`backend/glossa_mcp/server.py`	MCP server — 27 tools for Warp/Oz agent integration

Current research status (June 2026 — Preprint v4)

161 H+M candidate readings — 75 HIGH + 86 MEDIUM confidence (4 PROVISIONAL_MEDIUM flagged)
90.96% token coverage of the 7,002-token Holdat corpus; 69.8% of seals fully covered
59% Parpola agreement: 44/75 HIGH readings appear in Parpola (1994)
Fish-sign isolation test: 0/140 isolated across all 9 sites and Gulf deposit catalog
M267 reclassified: genitive particle (iN/in), not fish sign
Three-slot grammar (CLASSIFIER–TITLE–SUFFIX): z=10.3, 93.2% sign-level accuracy
External replication: Nair 2026 (arXiv:2604.17828) confirms non-random structure on ICIT corpus
4 provisional sibilant readings (M330, M165, M202, M198) added in Phase-163/166
Preprint v4 available at glossa-corpus/indus/pierson_2026_indus_decipherment_preprint_v4.pdf
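The grammar figures above (a z-score, and a count of permutations that exceeded the observed value) are the usual outputs of a permutation test. The sketch below shows the general shape of such a test on invented data. It is not the preprint's procedure: the ordering statistic, the within-inscription shuffle, and all names are assumptions chosen for illustration.

```python
import random
import statistics


def slot_agreement(sequences, slot_of):
    """Fraction of adjacent sign pairs whose slot indices are non-decreasing."""
    ordered = total = 0
    for seq in sequences:
        slots = [slot_of[sign] for sign in seq if sign in slot_of]
        for left, right in zip(slots, slots[1:]):
            total += 1
            ordered += left <= right
    return ordered / total if total else 0.0


def permutation_test(sequences, slot_of, n_permutations=2000, seed=0):
    """Return (observed, z, exceed_count) against a within-sequence shuffle null."""
    rng = random.Random(seed)
    observed = slot_agreement(sequences, slot_of)
    null = []
    for _ in range(n_permutations):
        # Shuffle sign order inside each inscription; sign counts are preserved.
        shuffled = [rng.sample(seq, len(seq)) for seq in sequences]
        null.append(slot_agreement(shuffled, slot_of))
    mean = statistics.fmean(null)
    spread = statistics.pstdev(null)
    z = (observed - mean) / spread if spread > 0 else float("nan")
    exceed = sum(1 for value in null if value >= observed)
    return observed, z, exceed


if __name__ == "__main__":
    # Invented toy data: slot 0 = CLASSIFIER, 1 = TITLE, 2 = SUFFIX.
    slots = {"c1": 0, "c2": 0, "t1": 1, "t2": 1, "x1": 2, "x2": 2}
    toy = [["c1", "t1", "x1"], ["c2", "t2", "x2"]] * 10
    observed, z, exceed = permutation_test(toy, slots)
    print(f"observed={observed:.3f} z={z:.1f} exceed={exceed}/2000")
```

Reporting both numbers is useful because the z-score assumes a roughly normal null distribution, whereas the exceedance count does not.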

Status

Preprint v4 published (Zenodo DOI: 10.5281/zenodo.20414696). Seeking peer review. Backend and frontend operational at http://localhost:8001.
