VerbaGuard

Framework-independent PHP library for language-aware text moderation.

VerbaGuard normalizes text, matches dictionary terms with deterministic exact matching, and returns explainable results --- with zero runtime Composer dependencies.

For project principles and long-term boundaries, see FOUNDATION.md.

What is VerbaGuard?

VerbaGuard is a small PHP moderation engine that detects profanity and insults in user-generated text. It is designed for applications that need:

predictable, testable behavior
language-specific normalization (Turkish out of the box)
explainable matches with byte-accurate spans for masking
no framework lock-in

It is a reference implementation, not a hosted moderation service. You bring your own dictionaries and integrate via a simple PHP API.

Features

Zero runtime dependencies --- only PHP 8.2+ and ext-mbstring
Language profiles --- plug in dictionaries and normalizers per language
Turkish profile --- built-in curated Turkish dictionary (20 entries in v0.4) and Turkish character normalization
Obfuscation resistance --- leetspeak, repeated letters, and separator tricks (s.i.k.t.i.r)
Exact matching --- no substring false positives from embedded terms
Explainable output --- ProfanityMatch objects with term, category, severity, and byte span
Masking --- preserve non-matching UTF-8 text around detected spans

Installation

composer require verbaguard/verbaguard

Requirements

Requirement Notes

PHP ^8.2 Required

ext-mbstring Required

ext-intl Optional; enables Unicode NFC normalization

Quick Start

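// Composer autoloader; adjust the path to your project layout.
require __DIR__ . '/vendor/autoload.php';

use VerbaGuard\VerbaGuard;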

$guard = VerbaGuard::turkish();

$guard->contains('hello');           // false
$guard->score('amk');                // 25
$guard->mask('bu bir amk test');     // bu bir *** test
$result = $guard->analyze('amk');     // AnalysisResult

Inspect a result

$result = $guard->analyze('SİKTİR');

$result->hasProfanity();  // true
$result->score();          // 50
$result->severity();       // high
$result->matches();        // ProfanityMatch[]
$result->masked('*');      // ******

Obfuscation examples

$guard->contains('s.i.k.t.i.r');  // true  — separator spelled-chain
$guard->contains('4mk');           // true  — leetspeak token
$guard->contains('malzeme');       // false — no embedded substring match
$guard->contains('n o r m a l');   // false — full chain is "normal"

Public API

`VerbaGuard`

Method Description

turkish(): self Guard with Turkish profile

forLanguages(array $profiles): self Guard with custom profiles

analyze(string $text): AnalysisResult Full analysis

contains(string $text): bool Whether any match exists

mask(string $text, string $mask = '*'): string Mask matched spans

score(string $text): int Aggregate severity score

`AnalysisResult`

Method Description

hasProfanity(): bool Any matches found

score(): int Sum of match severity weights

severity(): string Highest match severity

matches(): array List of ProfanityMatch

masked(string $mask = '*'): string Masked original text
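To make the score() and severity() rows concrete, here is a minimal sketch of summing per-match weights and picking the highest level. The helper name aggregateSeverity and the weight values are assumptions for illustration only; this README does not list the real weights, and the library's scorer may work differently.

```php
<?php
// Hypothetical helper, not part of VerbaGuard.
// $weights holds assumed values; the real severity weights are not documented here.
function aggregateSeverity(
    array $severities,
    array $weights = ['low' => 10, 'medium' => 25, 'high' => 50]
): array {
    $score = 0;
    $highest = null;

    foreach ($severities as $severity) {
        $score += $weights[$severity];               // score is the sum of weights
        if ($highest === null || $weights[$severity] > $weights[$highest]) {
            $highest = $severity;                    // severity is the heaviest level seen
        }
    }

    return ['score' => $score, 'severity' => $highest];
}

print_r(aggregateSeverity(['medium', 'high'])); // score 75, severity "high" under the assumed weights
```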

`ProfanityMatch`

Method Description

original(): string Matched substring

normalized(): string Normalized lookup form

term(): string Dictionary canonical term

language(): string Profile code

category(): string e.g. profanity, insult

severity(): string low, medium, high

start(): int Byte offset in original UTF-8

length(): int Byte length in original UTF-8
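Because start() and length() are byte values, custom masking should slice with byte-based functions such as substr() rather than mb_substr(). The sketch below shows one way to do that; maskSpans is a hypothetical helper, the span is hand-written rather than produced by the library, and masking one character per matched character is an assumption based on the Quick Start output. It also assumes spans do not overlap.

```php
<?php
// Hypothetical helper, not part of VerbaGuard.
// Each span is [start, length] in bytes of the original UTF-8 text.
function maskSpans(string $text, array $spans, string $mask = '*'): string
{
    // Work right to left so earlier offsets stay valid after each replacement.
    usort($spans, static fn (array $a, array $b): int => $b[0] <=> $a[0]);

    foreach ($spans as [$start, $length]) {
        $matched = substr($text, $start, $length);                       // byte-based slice
        $replacement = str_repeat($mask, mb_strlen($matched, 'UTF-8'));  // one mask per character
        $text = substr_replace($text, $replacement, $start, $length);
    }

    return $text;
}

// 'bu ' is 3 bytes; 'kötü' is 4 characters but 6 bytes in UTF-8.
echo maskSpans('bu kötü söz', [[3, 6]]), PHP_EOL; // bu **** söz
```

The text outside the span, including the multi-byte ö in söz, passes through untouched.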

Supported Public API

Public (stable)

Integrate through these symbols only:

VerbaGuard\VerbaGuard
VerbaGuard\AnalysisResult
VerbaGuard\ProfanityMatch
VerbaGuard\Severity
VerbaGuard\Contracts\LanguageProfile
VerbaGuard\Dictionary\Dictionary
VerbaGuard\Dictionary\Entry
VerbaGuard\Language\TurkishProfile
VerbaGuard\Normalizer\Normalizer (extension interface)

Internal (unsupported)

Not part of the public contract. May change without notice:

VerbaGuard\Pipeline\Pipeline
VerbaGuard\Pipeline\Matcher
VerbaGuard\Pipeline\TextSegments
VerbaGuard\Pipeline\Scorer
VerbaGuard\Pipeline\NormalizationPipeline
Concrete normalizers (UnicodeNormalizer, TurkishNormalizer, LeetspeakNormalizer, etc.)

See docs/specification.md for full behavioral details.

TurkishProfile

TurkishProfile is the built-in language profile for Turkish text moderation.

use VerbaGuard\VerbaGuard;

$guard = VerbaGuard::turkish();

It provides:

Dictionary --- curated Turkish dictionary in data/tr.php (20 entries in v0.4; intended as a reference implementation, not a production-complete lexicon)
Normalization --- Turkish lowercase and ASCII folding (ş→s, ı→i, ğ→g, etc.)

Use VerbaGuard::forLanguages() with a custom profile when you need a production dictionary.
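The Turkish folding step listed above can be pictured with a short sketch. This is not the library's TurkishNormalizer: foldTurkish is a hypothetical function, and only the pairs ş→s, ı→i, and ğ→g come from this README; the remaining pairs are assumptions.

```php
<?php
// Hypothetical sketch of Turkish lowercase plus ASCII folding.
function foldTurkish(string $text): string
{
    $map = [
        'İ' => 'i', 'I' => 'i', 'ı' => 'i',
        'Ş' => 's', 'ş' => 's',
        'Ğ' => 'g', 'ğ' => 'g',
        'Ç' => 'c', 'ç' => 'c',   // assumed pair
        'Ö' => 'o', 'ö' => 'o',   // assumed pair
        'Ü' => 'u', 'ü' => 'u',   // assumed pair
    ];

    // Fold the Turkish letters first: a generic lowercase of 'İ' can yield
    // "i" plus a combining dot instead of a plain "i".
    return mb_strtolower(strtr($text, $map), 'UTF-8');
}

echo foldTurkish('IŞIK'), PHP_EOL;  // isik
echo foldTurkish('Çağrı'), PHP_EOL; // cagri
```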

Language Profiles

A language profile bundles a dictionary and profile-specific normalizers.

v0.2 dictionary authoring: write only term, category, and severity in dictionary rows. Do not author normalized --- it is derived at build time via Dictionary::fromRows() and a normalizeKey callable that must match the profile's runtime normalization chain.

use VerbaGuard\Contracts\LanguageProfile;
use VerbaGuard\Dictionary\Dictionary;
use VerbaGuard\Normalizer\Normalizer;
use VerbaGuard\Pipeline\NormalizationPipeline;
use VerbaGuard\VerbaGuard;

final class ExampleProfile implements LanguageProfile
{
    public function code(): string
    {
        return 'ex';
    }

    public function dictionary(): Dictionary
    {
        $rows = [
            [
                'term' => 'badword',
                'category' => 'profanity',
                'severity' => 'medium',
            ],
        ];

        $normalization = new NormalizationPipeline($this->normalizers());

        return Dictionary::fromRows(
            $rows,
            static fn (string $term): string => $normalization->normalize($term),
        );
    }

    public function normalizers(): array
    {
        return [
            new class implements Normalizer {
                public function normalize(string $text): string
                {
                    return mb_strtolower($text, 'UTF-8');
                }
            },
        ];
    }
}

$guard = VerbaGuard::forLanguages([new ExampleProfile()]);

Multiple profiles can be passed; matches from all profiles are merged and deduplicated.

Normalization Pipeline

Global stages run in fixed order for every token and spelled chain:

Unicode NFC (when ext-intl available)
  → Language-specific normalizers (from profile)
  → Leetspeak map (4→a, @→a, 1→i, …)
  → Repeated letter collapse (aaa → a)

Normalization prepares text for exact dictionary lookup. It does not perform fuzzy or substring matching.
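The stage order above can be sketched as a single function. This is an illustration under stated assumptions, not the library's NormalizationPipeline: normalizeForLookup is a hypothetical name, plain lowercasing stands in for the profile's normalizers, and the leetspeak map holds only the three pairs named above.

```php
<?php
// Hypothetical sketch of the four stages in their documented order.
function normalizeForLookup(string $text): string
{
    // 1. Unicode NFC, only when ext-intl is loaded.
    if (class_exists(\Normalizer::class)) {
        $nfc = \Normalizer::normalize($text, \Normalizer::FORM_C);
        if (is_string($nfc)) {
            $text = $nfc;
        }
    }

    // 2. Language-specific step (stand-in: plain lowercase).
    $text = mb_strtolower($text, 'UTF-8');

    // 3. Leetspeak map, limited to the pairs listed in this README.
    $text = strtr($text, ['4' => 'a', '@' => 'a', '1' => 'i']);

    // 4. Collapse runs of the same letter: "aaa" becomes "a".
    return preg_replace('/(\p{L})\1+/u', '$1', $text) ?? $text;
}

echo normalizeForLookup('B4DWOOORD'), PHP_EOL; // badword
echo normalizeForLookup('hello'), PHP_EOL;     // helo
```

The second call shows why dictionary keys have to pass through the same chain: legitimate double letters collapse too, so a term stored un-normalized would never be found.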

Matcher Overview

The matcher (v2.2, frozen) uses two deterministic paths:

1. Exact token matching

Letter/digit runs are tokenized, normalized, and looked up with exact equality. malzeme does not match mal.
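As a rough illustration of this path, the sketch below tokenizes letter/digit runs, records byte offsets, and looks each token up by equality. findExactTokens and the one-entry dictionary are hypothetical, and lowercasing stands in for the full normalization chain.

```php
<?php
// Hypothetical sketch of exact token matching, not the library's matcher.
function findExactTokens(string $text, array $dictionary): array
{
    $matches = [];

    // With PREG_OFFSET_CAPTURE, offsets are reported in bytes even under the /u flag.
    preg_match_all('/[\p{L}\p{N}]+/u', $text, $found, PREG_OFFSET_CAPTURE);

    foreach ($found[0] as [$token, $offset]) {
        $key = mb_strtolower($token, 'UTF-8');
        if (isset($dictionary[$key])) {   // whole-token equality, never a substring search
            $matches[] = ['term' => $key, 'start' => $offset, 'length' => strlen($token)];
        }
    }

    return $matches;
}

$dictionary = ['mal' => true];
var_dump(findExactTokens('malzeme', $dictionary) === []); // bool(true): no embedded match
print_r(findExactTokens('şu MAL', $dictionary));          // one match: term "mal", start 4, length 3
```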

2. Separator spelled-chain matching

Single-letter runs separated by punctuation or spaces form a chain. The full concatenated chain is normalized and matched exactly. s i k t i r matches siktir; n o r m a l does not match mal.
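A minimal sketch of the chain rule follows. It is not the library's matcher: findSpelledChains is a hypothetical function, separators are assumed to be Unicode punctuation or ASCII whitespace, and lowercasing stands in for the full normalization chain.

```php
<?php
// Hypothetical sketch of separator spelled-chain matching.
function findSpelledChains(string $text, array $dictionary): array
{
    $matches = [];

    // A single letter, then one or more (separator run + single letter),
    // with no letter or digit touching either end of the chain.
    $pattern = '/(?<![\p{L}\p{N}])\p{L}(?:[\p{P}\s]+\p{L})+(?![\p{L}\p{N}])/u';
    preg_match_all($pattern, $text, $found, PREG_OFFSET_CAPTURE);

    foreach ($found[0] as [$chain, $offset]) {
        // Drop separators and compare the whole chain, so "n o r m a l" is looked up as "normal".
        $joined = preg_replace('/[\p{P}\s]+/u', '', $chain) ?? '';
        $key = mb_strtolower($joined, 'UTF-8');
        if (isset($dictionary[$key])) {
            $matches[] = ['term' => $key, 'start' => $offset, 'length' => strlen($chain)];
        }
    }

    return $matches;
}

$dictionary = ['badword' => true, 'mal' => true];
print_r(findSpelledChains('b.a.d.w.o.r.d', $dictionary));            // one match: start 0, length 13
var_dump(findSpelledChains('n o r m a l', $dictionary) === []);      // bool(true)
```

The reported span covers the separators as well, so masking replaces the whole spelled-out sequence.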

Policy: false positives are worse than false negatives. Matches use byte-accurate spans --- no approximate offset mapping.

Quality

As of v0.4.0, VerbaGuard includes:

20 curated Turkish dictionary entries
210 Turkish corpus cases
- 87 clean
- 55 profane
- 56 obfuscated
- 12 edge
0 false positives
0 false negatives
100% detection coverage across the curated corpus

The corpus and expansion workflow are documented in:

docs/dictionary-expansion-policy.md
docs/batch1-tr-lexicon-research.md

Philosophy

VerbaGuard prioritizes correctness and explainability over aggressive recall.

Small, readable codebase over clever abstractions
Deterministic behavior over heuristic tuning
Framework independence over ecosystem coupling
Curated dictionaries over megadictionary bundles
Open, inspectable matches over black-box scores

See FOUNDATION.md for the full principles document.

Performance Goals

O(n) passes over input text
No runtime Composer dependencies
Minimal allocations in the hot path
No reflection in matching
Suitable for per-request moderation in web applications

Benchmarks are not published yet; performance work follows correctness and API stability.

Known Limitations

Reference dictionary --- data/tr.php contains a small curated Turkish lexicon (20 entries in v0.4). It is intentionally conservative and should be expanded for production deployments.
Short terms --- very short entries like mal or aq match whenever they appear as standalone tokens, including in otherwise innocent sentences (mal also means "goods" in Turkish).
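ext-intl optional --- without it, Unicode NFC normalization is skipped, so input that arrives in decomposed form (a base letter followed by a combining mark) may not match.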
No NLP context --- homonyms, sarcasm, and intent are not evaluated.

Contributing

Contributions are welcome. Please read:

CONTRIBUTING.md --- workflow and commit standards
FOUNDATION.md --- project principles
docs/specification.md --- behavioral contract

Run tests before opening a pull request:

composer install
composer test

Offensive Language Notice

This repository contains a minimal seed dictionary with explicit profanity and insults for automated testing only. The words are intentionally offensive and are included solely to verify detection, scoring, and masking behavior. Do not use them outside test contexts.

License

MIT. See LICENSE.
