The Entity Resolver is the core component of the Basic ERE (Entity Resolution Engine) service. It is responsible for identifying and linking entities across different data sources, ensuring consistency and accuracy in entity identification and consolidation.
This page provides configuration details and guidelines for the Entity Resolver component.
Configuration files control the entity resolution algorithm behavior, including similarity matching, blocking rules, thresholds, and statistical priors. This directory contains two primary configuration files.
resolver.yaml is the main configuration file for the entity resolver. It defines:
- Entity fields to extract from RDF (legal name, address components)
- Database settings (DuckDB type and location)
- Clustering thresholds (when a mention joins an existing cluster)
- Output limits (max candidate clusters per resolution)
- Blocking rules (which mention pairs to compare)
- Statistical models (Splink comparisons, cold-start priors, EM training)
rdf_mapping.yaml maps RDF entity types to extraction rules. It defines:
- Supported entity types (e.g., ORGANISATION)
- RDF type discriminator (which RDF class identifies an organization)
- Field property paths (how to traverse RDF to extract attributes)
The entity fields defined in resolver.yaml must match the field names in rdf_mapping.yaml.
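A start-up check along the following lines could catch a mismatch between the two files early. This is only a minimal sketch: the dictionary layout (an `entity_fields` list, and a per-entity-type `fields` mapping) is an assumption about how the two files parse, the property paths are placeholders, and `check_field_consistency` is a hypothetical helper rather than part of the ERE code base.

```python
def check_field_consistency(resolver_cfg: dict, rdf_mapping: dict) -> dict:
    """Return, per entity type, the resolver fields missing from the RDF mapping.

    Assumed (hypothetical) shapes of the parsed YAML:
      resolver_cfg = {"entity_fields": ["legal_name", ...]}
      rdf_mapping  = {"ORGANISATION": {"fields": {"legal_name": "<path>", ...}}}
    """
    wanted = list(resolver_cfg.get("entity_fields", []))
    missing = {}
    for entity_type, spec in rdf_mapping.items():
        mapped = set(spec.get("fields", {}))
        absent = [field for field in wanted if field not in mapped]
        if absent:
            missing[entity_type] = absent
    return missing


resolver_cfg = {"entity_fields": ["legal_name", "country_code", "post_code"]}
rdf_mapping = {
    "ORGANISATION": {
        "fields": {
            "legal_name": "ex:legalName",            # placeholder property path
            "country_code": "ex:address/ex:country",  # placeholder property path
        }
    }
}

problems = check_field_consistency(resolver_cfg, rdf_mapping)
print(problems)  # lists post_code as unmapped for ORGANISATION
```

An empty result means every configured entity field has an extraction rule for every entity type.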
| Parameter | Type | Default | Purpose |
|---|---|---|---|
| entity_fields | list | [legal_name, country_code, nuts_code, post_code, post_name, thoroughfare] | RDF attributes to extract and use for similarity computation |
| duckdb.type | string | persistent | Database mode: persistent (file-based) or in-memory (test/ephemeral) |
| duckdb.path | string | data/app.duckdb | Database file location (only for persistent mode; overridden by DUCKDB_PATH env var) |
| cache_strategy | string | tf_incremental | Search space caching: tf_incremental (term-frequency incremental) |
| threshold | float (0.0–1.0) | 0.20 | Minimum match probability for cluster assignment. Lower = more sensitive (higher recall, lower precision); higher = more selective |
| top_n | int | 100 | Maximum cluster candidates returned per mention (pruning limit) |
| match_weight_threshold | float | -10 | Pre-filter on Splink match weight; -10 captures below-threshold links needed for full candidate output |
| auto_train_threshold | int | 50 | Mention count at which to trigger background EM training (0 = disabled) |
| probability_two_random_records_match | float (0.0–1.0) | 0.003 | Fellegi-Sunter prior λ: baseline probability any two records match (affects all m/u probability ratios) |
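To make the interplay of threshold and top_n concrete, here is a rough sketch of how a scored list of candidate clusters might be pruned and how the assignment decision might be taken. The function name `select_candidates`, the `(cluster_id, probability)` input shape, and the use of `>=` for the threshold comparison are assumptions for illustration; the actual resolver logic may differ.

```python
def select_candidates(scored, threshold=0.20, top_n=100):
    """Rank candidate clusters and decide on an assignment.

    scored: iterable of (cluster_id, match_probability) pairs (assumed shape).
    Returns (assigned_cluster_id_or_None, ranked_candidates).
    """
    # Highest match probability first, then keep at most top_n candidates.
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    candidates = ranked[:top_n]

    # Assign only if the best candidate reaches the threshold;
    # None stands for "no existing cluster is a good enough match".
    assigned = None
    if candidates and candidates[0][1] >= threshold:
        assigned = candidates[0][0]
    return assigned, candidates


scored = [("c1", 0.05), ("c2", 0.42), ("c3", 0.18)]
assigned, candidates = select_candidates(scored, threshold=0.20, top_n=2)
print(assigned)    # c2
print(candidates)  # [('c2', 0.42), ('c3', 0.18)]
```

Note that candidates below the threshold still appear in the returned list, which is consistent with match_weight_threshold being kept low enough to retain below-threshold links.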
The splink.comparisons section defines similarity functions and their thresholds. Each comparison produces gamma levels (discrete similarity buckets).
| Field | Type | Thresholds | Purpose |
|---|---|---|---|
| legal_name | jaro_winkler | [0.9, 0.8] | Primary identifier; primary signal for match determination |
| country_code | exact_match | — | Blocking rule only (not used in comparison); preserves EM flexibility |
| nuts_code | exact_match | — | EU regional code; exact match or missing data |
| post_code | jaro_winkler | [0.95, 0.85] | Postal/ZIP code; typo-tolerant with high thresholds |
| post_name | jaro_winkler | [0.90, 0.80] | City name; captures spelling variants and abbreviations |
| thoroughfare | jaro_winkler | [0.95, 0.85] | Street address; highly specific, typos uncommon |
Interpretation: Each threshold defines a gamma level. For example, legal_name with thresholds [0.9, 0.8] produces three levels:
- Gamma 2: JW ≥ 0.9 (highest match)
- Gamma 1: 0.8 ≤ JW < 0.9 (medium match)
- Gamma 0: JW < 0.8 (no match)
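The bucketing above can be sketched in a few lines of Python. The sketch takes an already computed Jaro-Winkler score as input (Splink computes the similarity itself inside the database backend), and `gamma_level` is a hypothetical helper name used only for illustration.

```python
def gamma_level(score: float, thresholds: list[float]) -> int:
    """Map a similarity score to a gamma level.

    thresholds must be in descending order, e.g. [0.9, 0.8].
    The highest bucket gets level len(thresholds); scores below
    every threshold get level 0.
    """
    for index, cutoff in enumerate(thresholds):
        if score >= cutoff:
            return len(thresholds) - index
    return 0


legal_name_thresholds = [0.9, 0.8]
levels = [gamma_level(s, legal_name_thresholds) for s in (0.93, 0.85, 0.50)]
print(levels)  # [2, 1, 0]
```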
Pairs are compared only if at least one blocking rule matches. This reduces computation significantly.
Current configuration:
```yaml
blocking_rules:
  - country_code                # Primary: same country (strict)
  - [country_code, nuts_code]   # Secondary: same country AND same EU region
```
Semantics:
- Rule 1: Pair must have matching country codes
- Rule 2: Pair must have matching country codes AND matching NUTS codes (if both present)
- At least one rule must fire for comparison to occur
Effect: Drastically reduces comparison volume while still comparing every pair of mentions within the same country.
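The "at least one rule must fire" semantics can be illustrated with a small in-memory sketch. It is not how Splink evaluates blocking rules (Splink generates SQL joins); the record layout, the helper names, and the choice to treat a missing value as never matching (mirroring SQL equality on NULL) are assumptions.

```python
from itertools import combinations

# Each rule is a tuple of fields that must all be equal for the rule to fire.
BLOCKING_RULES = [("country_code",), ("country_code", "nuts_code")]


def rule_fires(a: dict, b: dict, rule: tuple) -> bool:
    """A rule fires when every field is present on both records and equal."""
    return all(a.get(f) is not None and a.get(f) == b.get(f) for f in rule)


def candidate_pairs(records, rules=BLOCKING_RULES):
    """Yield id pairs for which at least one blocking rule fires."""
    for a, b in combinations(records, 2):
        if any(rule_fires(a, b, rule) for rule in rules):
            yield a["id"], b["id"]


records = [
    {"id": 1, "country_code": "DE", "nuts_code": "DE21"},
    {"id": 2, "country_code": "DE", "nuts_code": None},
    {"id": 3, "country_code": "FR", "nuts_code": "FR10"},
    {"id": 4, "country_code": "DE", "nuts_code": "DE21"},
]

pairs = list(candidate_pairs(records))
print(pairs)  # [(1, 2), (1, 4), (2, 4)]
```

Here only 3 of the 6 possible pairs are compared. In this sketch, every pair admitted by the second rule is also admitted by the first, so the first rule alone determines which pairs are compared.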
Before EM training, Splink uses cold-start m/u probabilities. These are empirically tuned defaults:
- m_probability: Likelihood of observing this gamma level given the records match
- u_probability: Likelihood of observing this gamma level given the records do not match (in practice, a random pair)
Example (legal_name):
```yaml
m_probabilities: [0.9, 0.6, 0.025, 0.005]            # Gamma levels 2, 1, 0, (implicit no-match)
u_probabilities: [0.00001, 0.0004, 0.004, 0.99559]
```
- If records match, high JW (gamma 2) is very likely (m=0.9)
- If records are random, high JW is very unlikely (u=0.00001)
- Likelihood ratio: 0.9 / 0.00001 = 90,000× evidence for match
Note: Once EM training completes (at auto_train_threshold), these are replaced by empirically learned parameters.
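The arithmetic in the legal_name example can be reproduced with a short worked sketch. It computes the likelihood ratio (Bayes factor), the corresponding match weight (log base 2 of the Bayes factor, which is how Splink defines match weight), and the posterior match probability when this is the only evidence and the prior is the default probability_two_random_records_match of 0.003. The function names are hypothetical, and combining evidence by multiplying Bayes factors assumes the Fellegi-Sunter conditional independence between comparisons.

```python
import math


def bayes_factor(m: float, u: float) -> float:
    """Likelihood ratio m/u for one observed gamma level."""
    return m / u


def match_weight(m: float, u: float) -> float:
    """Match weight in bits: log2 of the Bayes factor."""
    return math.log2(m / u)


def posterior(prior: float, factors: list[float]) -> float:
    """Update prior odds with a list of Bayes factors and return a probability."""
    odds = prior / (1.0 - prior)
    for factor in factors:
        odds *= factor
    return odds / (1.0 + odds)


k = bayes_factor(0.9, 0.00001)   # about 90,000
w = match_weight(0.9, 0.00001)   # about 16.5 bits
p = posterior(0.003, [k])        # about 0.996

print(f"bayes factor ~ {k:.0f}, weight ~ {w:.2f} bits, posterior ~ {p:.4f}")
```

Even with a prior as low as 0.003, a single gamma 2 agreement on legal_name is enough to push the posterior well above the default threshold of 0.20, which is why this field dominates the match decision under the cold-start values.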
- Increase recall (catch more matches): Lower threshold or lower m_probabilities for weak signals
- Increase precision (reduce false positives): Raise threshold or raise u_probabilities
Fields with higher m/u ratios have stronger influence. To emphasize address over name:
- Increase m_probabilities for address fields (post_code, thoroughfare)
- Decrease m_probabilities for weaker signals (post_name)
- Tighter blocking (fewer comparisons): Add more rules, require more fields to match
- Looser blocking (more comparisons, slower): Fewer rules, match-any semantics
Example: To enable cross-country matches within the same parent organization:
```yaml
blocking_rules:
  - legal_name  # Only compare if legal names are very similar
```
Once auto_train_threshold is reached, background EM estimation updates the m/u parameters based on your data. This is more accurate than the cold-start values but requires representative data (ideally ~500+ mentions).
To disable: Set auto_train_threshold: 0
- Splink documentation: "Splink: Entity resolution at scale"
- Fellegi-Sunter model: The Fellegi-Sunter model in Splink
- ERE algorithm: See ../../docs/algorithm.md for a detailed explanation of the online greedy clustering approach.
Too many false positives (precision too low):
- Increase threshold (e.g., 0.20 → 0.35)
- Increase u_probabilities for weak signals (agreement on them then contributes less evidence)
- Tighten blocking rules (fewer comparisons, more selective)
Missing matches (recall too low):
- Decrease threshold (e.g., 0.20 → 0.10)
- Decrease m_probabilities for strong signals (more generous)
- Loosen blocking rules (more comparisons, more chances to match)
Slow performance:
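- Reduce entity_fields (fewer attributes to compare)
- Tighten blocking rules (fewer pairs to evaluate)
- Increase match_weight_threshold (a less negative value than -10) to filter out more low-confidence pairs before storing, at the cost of fewer below-threshold links in the candidate output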
Training not converging:
- Increase auto_train_threshold (collect more data before training)
- Manually inspect cluster quality to ensure ground truth is reasonable
- Consider adjusting cold-start priors as a fallback