Supplementary data repository for affiliation matching repo
The repository is organized as follows:
- `system_1/` - Contains results from System 1 (improved system with better recall)
  - `system_1.csv` - ROR ID predictions with improved performance
- `system_2/` - Contains baseline system results and code
  - `test_results.csv` - Baseline ROR harvester results
  - `ror_harvester.py` - Script to fetch ROR IDs from affiliation strings
- `system_3/` - Contains zbMATH-OpenAlex integration system results and code
  - `result.csv` - System 3 ROR harvester results
  - `zbmath_openalex_harvester.py` - Script that queries zbMATH for DOIs and OpenAlex for ROR IDs
  - `run.log` - Detailed log file from the harvesting process
  - `README.md` - Documentation for the zbMATH-OpenAlex harvester
- `helpers/` - Helper scripts for evaluation and data processing
  - `calculate_ir_metrics.py` - Calculate precision, recall, and F1-score
  - `filter_csv.py` - Clean CSV files by removing duplicates and empty ROR IDs
- `tests/` - Unit tests for all components
  - `test_ror_harvester.py` - Tests for ROR harvester
  - `test_calculate_ir_metrics.py` - Tests for metrics calculation
  - `test_filter_csv.py` - Tests for CSV filtering
  - `README.md` - Testing documentation
- Root directory - Test data and documentation
  - `testset.csv` - Test set with affiliation strings
  - `truth_table.csv` - Ground truth ROR IDs for evaluation
  - `sample_input.csv` - Sample input file for testing
This repository contains tools and results for matching affiliation strings to ROR (Research Organization Registry) IDs. The baseline system queries the ROR API for each affiliation string and extracts ROR IDs from the results.
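The query-and-extract step can be sketched as follows. This is a minimal illustration, not the code in `system_2/ror_harvester.py`: it assumes the ROR API's `affiliation` query parameter and keeps only matches the API flags as `chosen`. It uses `urllib` from the standard library to stay dependency-free, whereas the repository lists `requests` as its dependency. The names `extract_ror_ids` and `fetch_ror_ids` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

ROR_API_URL = "https://api.ror.org/organizations"  # public ROR REST endpoint


def extract_ror_ids(response, chosen_only=True):
    """Pull bare ROR IDs out of an affiliation-matching response dict."""
    ror_ids = []
    for item in response.get("items", []):
        # Assumption: keep only the match the API marks as "chosen"
        if chosen_only and not item.get("chosen", False):
            continue
        full_id = item.get("organization", {}).get("id", "")
        if full_id:
            # "https://ror.org/052gg0110" -> "052gg0110"
            ror_ids.append(full_id.rstrip("/").rsplit("/", 1)[-1])
    return ror_ids


def fetch_ror_ids(affiliation, timeout=10):
    """Query the ROR API for one affiliation string (needs network access)."""
    query = urllib.parse.urlencode({"affiliation": affiliation})
    with urllib.request.urlopen(f"{ROR_API_URL}?{query}", timeout=timeout) as resp:
        return extract_ror_ids(json.load(resp))
```

Separating the parsing from the network call keeps the extraction logic testable against a saved response, without contacting the API.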
- Python 3.6 or higher
- `requests` library (install with: `pip install -r requirements.txt`)
System 1 shows improved recall compared to the baseline:
============================================================
INFORMATION RETRIEVAL METRICS SUMMARY
============================================================
Micro-averaged Metrics (aggregated over all ROR IDs):
Precision: 0.8710
Recall: 0.8826
F1-Score: 0.8767
Macro-averaged Metrics (averaged per paper):
Precision: 0.8879
Recall: 0.8921
F1-Score: 0.8845
Counts:
True Positives: 1721
False Positives: 255
False Negatives: 229
Total Papers: 1008
============================================================
The baseline ROR harvester approach:
============================================================
INFORMATION RETRIEVAL METRICS SUMMARY
============================================================
Micro-averaged Metrics (aggregated over all ROR IDs):
Precision: 0.9328
Recall: 0.7831
F1-Score: 0.8514
Macro-averaged Metrics (averaged per paper):
Precision: 0.9396
Recall: 0.8015
F1-Score: 0.8247
Counts:
True Positives: 1527
False Positives: 110
False Negatives: 423
Total Papers: 1008
============================================================
System 1 achieves better recall (0.8826 vs 0.7831 micro-averaged) at the cost of lower precision (0.8710 vs 0.9328 micro-averaged), resulting in a better overall F1-score (0.8767 vs 0.8514 micro-averaged).
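The micro-averaged figures in this comparison follow directly from the reported counts. A short sketch that recomputes them (the helper name `micro_prf` is illustrative and not part of the repository):

```python
def micro_prf(tp, fp, fn):
    """Micro-averaged precision, recall and F1 from aggregated counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)  # equivalent to the harmonic mean of P and R
    return precision, recall, f1


reported_counts = {
    "System 1": (1721, 255, 229),
    "Baseline": (1527, 110, 423),
}
for name, (tp, fp, fn) in reported_counts.items():
    p, r, f1 = micro_prf(tp, fp, fn)
    print(f"{name}: P={p:.4f} R={r:.4f} F1={f1:.4f}")
# System 1: P=0.8710 R=0.8826 F1=0.8767
# Baseline: P=0.9328 R=0.7831 F1=0.8514
```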
System 3 uses a different approach by integrating zbMATH and OpenAlex APIs:
============================================================
INFORMATION RETRIEVAL METRICS SUMMARY
============================================================
Micro-averaged Metrics (aggregated over all ROR IDs):
Precision: 0.8225
Recall: 0.8462
F1-Score: 0.8342
Macro-averaged Metrics (averaged per paper):
Precision: 0.8793
Recall: 0.8539
F1-Score: 0.8181
Counts:
True Positives: 1650
False Positives: 356
False Negatives: 300
Total Papers: 1008
============================================================
System 3 reaches a micro-averaged precision of 0.8225 and recall of 0.8462, for an F1-score of 0.8342. Its recall falls between the baseline System 2 (0.7831) and System 1 (0.8826), but its precision and F1-score are the lowest of the three systems.
During the harvesting process, System 3 encountered 65 papers (6.44% of the 1009 unique papers) where the zbMATH API returned 404 errors, indicating that the article metadata was not available.
Adjusted Performance (Excluding Missed Papers):
When excluding the 65 papers that couldn't be processed, System 3 achieves:
Micro-averaged Metrics (943 papers):
Precision: 0.8225
Recall: 0.9011 (improved from 0.8462)
F1-Score: 0.8600 (improved from 0.8342)
Macro-averaged Metrics:
Precision: 0.8710
Recall: 0.9117 (improved from 0.8539)
F1-Score: 0.8735 (improved from 0.8181)
Key Findings:
- When excluding missed papers, System 3 achieves higher recall than System 1 (0.9011 vs 0.8826 micro-averaged)
- System 1 maintains higher precision (0.8710 vs 0.8225) and better overall F1-score (0.8767 vs 0.8600)
- The 65 missed papers contained 119 ground truth ROR IDs that System 3 couldn't retrieve
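The adjusted micro-averaged figures can be reproduced from the counts reported above. The sketch below assumes the 65 missed papers produced no predictions at all, so excluding them leaves TP and FP unchanged and only removes their 119 ground truth ROR IDs from the false negatives. It is not the code in `system_3/calculate_adjusted_metrics.py`.

```python
tp, fp, fn = 1650, 356, 300   # System 3 counts over the full test set
missed_truth_ids = 119        # ground truth ROR IDs in the 65 missed papers

# Assumption: missed papers had no predictions, so only FN shrinks
adjusted_fn = fn - missed_truth_ids

precision = tp / (tp + fp)
adjusted_recall = tp / (tp + adjusted_fn)
adjusted_f1 = 2 * tp / (2 * tp + fp + adjusted_fn)

print(f"P={precision:.4f} R={adjusted_recall:.4f} F1={adjusted_f1:.4f}")
# P=0.8225 R=0.9011 F1=0.8600
```

The result matches the adjusted micro-averaged values listed above, which is consistent with that assumption.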
Tools for Analysis:
To analyze missed papers and calculate adjusted metrics:
# Extract missed papers from run log
python3 system_3/extract_missed_papers.py system_3/run.log system_3/missed_papers.csv
# Calculate adjusted metrics excluding missed papers
python3 system_3/calculate_adjusted_metrics.py testset.csv truth_table.csv system_3/result.csv system_3/missed_papers.csv

For detailed analysis, see system_3/ANALYSIS.md.
The project includes comprehensive unit tests. See tests/README.md for detailed testing documentation.
Quick start:
# Run all tests
python3 -m unittest discover -s tests -p 'test_*.py'

The filter_csv.py script cleans up CSV files by removing duplicate rows and rows with empty ROR IDs:
python helpers/filter_csv.py <input_csv> <output_csv>

This script:
- Removes duplicate rows: Identical rows (same `an` and `ror_id`) are removed
- Removes rows with empty ROR IDs: Rows that have an `an` value but an empty `ror_id` are removed
Example Usage:
python helpers/filter_csv.py results.csv filtered_results.csv

Example: Given an input file with duplicates and empty ROR IDs:
an,ror_id
1,052gg0110
1,052gg0110
2,
3,042nb2s44

The filter will produce:
an,ror_id
1,052gg0110
3,042nb2s44

The duplicate row for an=1 is removed, and the row an=2 with empty ror_id is filtered out.
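The filtering rule can be sketched in a few lines. This is an illustration of the behaviour described above, not the contents of `helpers/filter_csv.py`. The function name `filter_rows` is hypothetical, and keeping the first occurrence of each duplicate in input order is an assumption.

```python
import csv
import io


def filter_rows(rows):
    """Drop rows with an empty ror_id and exact (an, ror_id) duplicates."""
    seen = set()
    kept = []
    for row in rows:
        key = ((row.get("an") or "").strip(), (row.get("ror_id") or "").strip())
        if not key[1] or key in seen:
            continue  # empty ROR ID, or a pair that was already kept
        seen.add(key)
        kept.append({"an": key[0], "ror_id": key[1]})
    return kept


sample = "an,ror_id\n1,052gg0110\n1,052gg0110\n2,\n3,042nb2s44\n"
filtered = filter_rows(csv.DictReader(io.StringIO(sample)))
for row in filtered:
    print(f"{row['an']},{row['ror_id']}")
# 1,052gg0110
# 3,042nb2s44
```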
The calculate_ir_metrics.py script compares predicted ROR IDs against ground truth and calculates standard Information Retrieval metrics.
Important Note on Terminology:
- Each row in the CSV files has an `an` (author number) identifier representing a paper/publication
- Each paper may have multiple affiliations (affiliation strings)
- The ROR harvester processes all affiliations for a paper and returns a set of ROR IDs
- Metrics are calculated per paper (per `an`), comparing all predicted ROR IDs for that paper against the ground truth
python helpers/calculate_ir_metrics.py <testset.csv> <truth_table.csv> <test_results.csv> [detailed_output.csv]

Arguments:
- `testset.csv` - Test set with paper IDs (CSV with 'an' column); defines the papers to evaluate
- `truth_table.csv` - Ground truth ROR IDs (CSV with 'an' and 'ror_id' columns)
- `test_results.csv` - Predicted ROR IDs (CSV with 'an' and 'ror_id' columns)
- `detailed_output.csv` - Optional: Save per-paper metrics to CSV file
Example:
python helpers/calculate_ir_metrics.py testset.csv truth_table.csv system_1/system_1.csv detailed_metrics.csv

The script calculates both micro-averaged and macro-averaged metrics:
Micro-averaged metrics (aggregated over all ROR IDs):
- Aggregates True Positives, False Positives, and False Negatives across all papers
- Then calculates precision, recall, and F1-score from the aggregated counts
- Gives more weight to papers with many ROR IDs
- Precision: What fraction of all predicted ROR IDs (across all papers) are correct?
- Recall: What fraction of all true ROR IDs (across all papers) were found?
- F1-Score: Harmonic mean of micro-precision and micro-recall
Macro-averaged metrics (averaged per paper):
- Calculates precision, recall, and F1-score for each paper individually
- Then averages these metrics across all papers
- Treats each paper equally regardless of how many ROR IDs it has
- Precision: Average precision across all papers
- Recall: Average recall across all papers
- F1-Score: Average F1-score across all papers
Counts:
- True Positives (TP): Correct ROR ID predictions
- False Positives (FP): Incorrect ROR ID predictions
- False Negatives (FN): Missed ROR IDs from ground truth
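The difference between the two averaging schemes is easiest to see on a toy input. The sketch below is illustrative and does not reproduce `calculate_ir_metrics.py`. The names `prf` and `micro_and_macro` are hypothetical, and the fallback values for empty denominators are assumptions.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from counts; empty denominators fall back to 1.0."""
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def micro_and_macro(papers):
    """papers: list of (truth_set, predicted_set) pairs, one per paper."""
    counts = [(len(t & p), len(p - t), len(t - p)) for t, p in papers]
    # Micro: sum TP/FP/FN over all papers, then compute the metrics once
    micro = prf(*(sum(c[i] for c in counts) for i in range(3)))
    # Macro: compute the metrics per paper, then average them
    per_paper = [prf(*c) for c in counts]
    macro = tuple(sum(m[i] for m in per_paper) / len(per_paper) for i in range(3))
    return micro, macro


papers = [
    ({"a", "b", "c", "d"}, {"a", "b", "c", "d"}),  # four IDs, all correct
    ({"e"}, {"f"}),                                # one ID, wrong
]
micro, macro = micro_and_macro(papers)
print("micro P/R/F1:", " ".join(f"{v:.2f}" for v in micro))  # 0.80 0.80 0.80
print("macro P/R/F1:", " ".join(f"{v:.2f}" for v in macro))  # 0.50 0.50 0.50
```

The paper with four ROR IDs dominates the micro average, while the macro average weights both papers equally.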
Important: The evaluation includes all papers from the test set, ensuring consistent evaluation across the entire dataset. The script now requires testset.csv as input to define the exact set of papers to evaluate.
How papers with missing ground truth or missing predictions are handled in the metrics calculation:
- Papers with predictions but no ground truth:
  - Ground truth: empty set `{}`
  - Predictions: whatever ROR IDs were predicted, e.g., `{ror1, ror2}`
  - Metrics calculation:
    - TP = 0 (no true positives possible without ground truth)
    - FP = number of predicted ROR IDs (all predictions are false positives)
    - FN = 0 (no ground truth to miss)
    - Precision = 0.0 (no predictions are correct)
    - Recall = 1.0 (standard IR convention: nothing to miss when ground truth is empty)
- Papers with ground truth but no predictions:
  - Ground truth: e.g., `{ror1, ror2}`
  - Predictions: empty set `{}`
  - Metrics calculation:
    - TP = 0 (nothing predicted)
    - FP = 0 (no false predictions)
    - FN = number of ground truth ROR IDs (all missed)
    - Precision = 1.0 (standard IR convention: no false positives when predictions are empty)
    - Recall = 0.0 (nothing found)
- Papers with no ground truth and no predictions (perfect match):
  - Ground truth: empty set `{}`
  - Predictions: empty set `{}`
  - Metrics calculation:
    - TP = 0, FP = 0, FN = 0 (perfect agreement)
    - Precision = 1.0 (no false positives)
    - Recall = 1.0 (nothing to miss)
  - Interpretation: A system that correctly predicts nothing when there's nothing to predict is performing perfectly
Impact on overall metrics:
- Micro-averaging: Papers with no ground truth contribute to the total false positives count, which lowers the overall precision. This is intentional - the system is penalized for predicting ROR IDs when there should be none.
- Macro-averaging: With standard IR conventions:
  - Papers with predictions but no ground truth contribute P=0.0, R=1.0 (nothing to miss)
  - Papers with ground truth but no predictions contribute P=1.0 (no false positives), R=0.0
Example from dataset:
Paper 7199025:
Ground truth: {} (0 affiliations)
Predicted: {03kw9gc02} (1 ROR ID)
Result: TP=0, FP=1, FN=0, Precision=0.0, Recall=1.0 (nothing to miss)
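The three edge cases can be expressed as a small per-paper function. This sketch follows the conventions stated above for precision and recall. The F1 value used when precision and recall are both zero is an assumption, since the text does not specify it, and `per_paper_metrics` is a hypothetical name.

```python
def per_paper_metrics(truth, predicted):
    """Per-paper counts and metrics, with the empty-set conventions applied."""
    tp = len(truth & predicted)
    fp = len(predicted - truth)
    fn = len(truth - predicted)
    # Empty prediction set: no false positives are possible, so precision is 1.0
    precision = tp / (tp + fp) if predicted else 1.0
    # Empty ground truth: there is nothing to miss, so recall is 1.0
    recall = tp / (tp + fn) if truth else 1.0
    # Assumption: F1 is 0.0 when precision and recall are both 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Paper 7199025 from the example: no ground truth, one predicted ROR ID
print(per_paper_metrics(set(), {"03kw9gc02"}))
# Ground truth present, nothing predicted
print(per_paper_metrics({"ror1", "ror2"}, set()))
# Nothing expected, nothing predicted
print(per_paper_metrics(set(), set()))
```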
Key changes:
- The script now requires `testset.csv` to define the papers to evaluate
- All 1008 distinct papers from the test set are evaluated
- Papers only in `truth_table.csv` but not in `testset.csv` are excluded from evaluation
- This ensures consistent micro-averaging across the complete test set
This approach ensures that:
- The system is evaluated on all papers in the test set (1008 papers)
- Predicting ROR IDs for papers with no affiliations is counted as an error
- The evaluation reflects real-world performance where some papers may have no affiliations
- Micro-averaging considers the complete test set for accurate metrics
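One way to restrict the evaluation to the test set is to iterate over the `an` values in `testset.csv` and look up truth and predictions for each, defaulting to an empty set. The sketch below works on CSV text held in strings so that it is self-contained; the function names are hypothetical and the actual script may be organised differently.

```python
import csv
import io
from collections import defaultdict


def group_by_paper(csv_text):
    """Map each 'an' to the set of non-empty ROR IDs listed for it."""
    grouped = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        ror_id = (row.get("ror_id") or "").strip()
        if ror_id:
            grouped[row["an"].strip()].add(ror_id)
    return grouped


def align_to_testset(testset_text, truth_text, results_text):
    """Return (an, truth_set, predicted_set) for each distinct test-set paper."""
    truth = group_by_paper(truth_text)
    predicted = group_by_paper(results_text)
    papers, seen = [], set()
    for row in csv.DictReader(io.StringIO(testset_text)):
        an = row["an"].strip()
        if an in seen:
            continue  # evaluate each distinct paper once
        seen.add(an)
        papers.append((an, truth.get(an, set()), predicted.get(an, set())))
    return papers


testset = "an\n1\n2\n3\n"
truth = "an,ror_id\n1,052gg0110\n9,042nb2s44\n"      # paper 9 is not in the test set
results = "an,ror_id\n1,052gg0110\n2,03kw9gc02\n"
for paper in align_to_testset(testset, truth, results):
    print(paper)
```

Paper 9 appears only in the truth table and is dropped. Papers 2 and 3 are kept with empty ground truth, so the prediction for paper 2 counts as a false positive.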
The script prints a summary to stdout. Example output format (using baseline test_results.csv):
============================================================
INFORMATION RETRIEVAL METRICS SUMMARY
============================================================
Micro-averaged Metrics (aggregated over all ROR IDs):
Precision: 0.9328
Recall: 0.7823
F1-Score: 0.8509
Macro-averaged Metrics (averaged per paper):
Precision: 0.9393
Recall: 0.7985
F1-Score: 0.8218
Counts:
True Positives: 1527
False Positives: 110
False Negatives: 423
Total Papers: 1008
============================================================
If a detailed output file is specified, per-paper metrics are saved as CSV with columns:
- `an`: Paper identifier
- `precision`, `recall`, `f1`: Per-paper metrics
- `tp`, `fp`, `fn`: Counts for this paper
- `true_count`, `pred_count`: Number of ROR IDs in truth and prediction