A study of Named Entity Recognition on Twitter data (TweeBank_NER), examining whether model errors and uncertainty align with places where human annotators disagree.
Install dependencies with:
    pip install torch transformers seqeval scikit-learn numpy

All data files are in the data/ folder:

- train.jsonl, dev.jsonl, test.jsonl: official TweeBank_NER splits
- Hjalte_500.jsonl, mar_500.jsonl, val_500.jsonl: manual annotations of the first 500 test tweets by each group member
- test_pred_base.jsonl: baseline BERT predictions on the test set
- bertweet_pred.jsonl: BERTweet predictions on the first 500 test tweets

All scripts should be run from the code/ directory.
    cd code

Fine-tunes bert-base-cased on the TweeBank training data and saves predictions on the test set to test_pred_base.jsonl:

    python3 train-base.py

Runs the pretrained bertweet-tb2-ner model on the test set and saves predictions to bertweet_pred_test.jsonl:

    python3 bertweet.py

Computes pairwise F1 between the three annotators for each entity type (PER, LOC, ORG, MISC):

    python3 IAA.py

Computes span-level F1 (strict, unlabeled, and loose) for both models and both human annotators against the gold standard:

    python3 F1.py

Open and run data/ann_comparison.ipynb in Jupyter to reproduce the disagreement analysis and confusion matrix figures:

    jupyter notebook data/ann_comparison.ipynb
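
For readers unfamiliar with the three span-level F1 variants mentioned for F1.py, the sketch below shows one way they could be computed. It is not the code in F1.py. It assumes BIO-style tags such as B-PER and I-PER, and the matching rules are assumptions: "strict" requires identical boundaries and label, "unlabeled" requires identical boundaries only, and "loose" requires the same label and at least one shared token. The names bio_to_spans and span_f1 are hypothetical.

```python
from typing import List, Set, Tuple

# (sentence index, start token, end token exclusive, entity label)
Span = Tuple[int, int, int, str]


def bio_to_spans(sentences: List[List[str]]) -> Set[Span]:
    """Collect entity spans from BIO tag sequences (assumed tag format)."""
    spans: Set[Span] = set()
    for sent_id, tags in enumerate(sentences):
        start, label = None, None
        # A trailing "O" sentinel flushes a span that ends the sentence.
        for i, tag in enumerate(list(tags) + ["O"]):
            if tag.startswith("I-") and tag[2:] == label:
                continue  # still inside the current span
            if label is not None:
                spans.add((sent_id, start, i, label))
            start, label = None, None
            # A stray I- tag without a matching B- opens a new span.
            if tag[:2] in ("B-", "I-"):
                start, label = i, tag[2:]
    return spans


def span_f1(gold: Set[Span], pred: Set[Span], mode: str = "strict") -> float:
    """Span-level F1 under the assumed strict / unlabeled / loose rules."""
    if mode == "loose":
        def overlaps(a: Span, b: Span) -> bool:
            # Same sentence, same label, at least one shared token.
            return a[0] == b[0] and a[3] == b[3] and a[1] < b[2] and b[1] < a[2]

        hits_pred = sum(any(overlaps(p, g) for g in gold) for p in pred)
        hits_gold = sum(any(overlaps(g, p) for p in pred) for g in gold)
        n_pred, n_gold = len(pred), len(gold)
    elif mode in ("strict", "unlabeled"):
        drop_label = mode == "unlabeled"
        g = {s[:3] if drop_label else s for s in gold}
        p = {s[:3] if drop_label else s for s in pred}
        hits_pred = hits_gold = len(g & p)
        n_pred, n_gold = len(p), len(g)
    else:
        raise ValueError(f"unknown mode: {mode}")

    precision = hits_pred / n_pred if n_pred else 0.0
    recall = hits_gold / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example (made-up tags, not taken from the dataset).
gold_spans = bio_to_spans([["B-PER", "I-PER", "O", "B-LOC", "O"]])
pred_spans = bio_to_spans([["B-PER", "O", "O", "B-ORG", "O"]])
scores = {m: span_f1(gold_spans, pred_spans, m) for m in ("strict", "unlabeled", "loose")}
```

Under the same assumptions, pairwise agreement between two annotators could be estimated by passing one annotator's spans in place of the gold spans, and a per-type score by keeping only spans with a given label before calling span_f1.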