Step 1: Download the datasets (News and Movies) from https://drive.google.com/file/d/1vfqzgDFMZyn1mHlzFx-t1_KuAiNgC64f/view.
1.1 Run AutoPhrase (source: https://github.com/shangjingbo1226/AutoPhrase) to extract high-quality phrases from both datasets, then use its segmentation model to parse the same corpus; the recommended segmentation parameters are HIGHLIGHT_MULTI=0.7 and HIGHLIGHT_SINGLE=1.0. Convert all upper-case letters to lower case, and join the words of each extracted phrase with "_" so that the phrase is treated as a single token and represents the background context precisely.
Code for text processing
### (source: https://github.com/shangjingbo1226/AutoPhrase)
import argparse
import os
import re

from tqdm import tqdm
from nltk import word_tokenize

def phrase_process():
    # Parse AutoPhrase's segmentation output: each phrase is wrapped in
    # <phrase_Q=...>...</phrase> tags; join the phrase's words with "_".
    f = open(os.path.join('AutoPhrase', 'model_News', 'DBLP', args.in_file))
    g = open(args.out_file, 'w')
    for line in tqdm(f):
        doc = ''
        temp = re.split(r'<phrase_Q=\d\.\d+>', line)
        for seg in temp:
            temp2 = seg.split('</phrase>')
            if len(temp2) > 1:
                doc += '_'.join(temp2[0].split(' ')) + temp2[1]
            else:
                doc += temp2[0]
        g.write(doc.strip() + '\n')
    print("Phrase segmented corpus written to {}".format(args.out_file))

def preprocess():
    # Tokenize the corpus and lower-case every word.
    f = open(os.path.join('AutoPhrase', 'model_Movies', 'DBLP', 'text.txt'))
    docs = f.readlines()
    f_out = open(args.out_file, 'w')
    for doc in tqdm(docs):
        f_out.write(' '.join([w.lower() for w in word_tokenize(doc.strip())]) + '\n')

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='main', formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument('--mode', type=int, default=0)
    parser.add_argument('--dataset', default='News')
    parser.add_argument('--in_file', default='segmentation.txt')
    parser.add_argument('--out_file', default='./AutoPhrase/model_News/DBLP/text.txt')
    args, _ = parser.parse_known_args()
    # --mode selects the step: 0 = phrase segmentation, 1 = tokenize/lower-case
    # (both write to --out_file, so they should not run back to back).
    if args.mode == 0:
        phrase_process()
    else:
        preprocess()
###
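To see what the phrase-joining step does, the same regex logic can be applied to a single hand-made line in the AutoPhrase output format (the quality score and sentence below are made up for illustration):

```python
import re

def join_phrases(line):
    # Same logic as phrase_process(): split on opening phrase tags,
    # then join each tagged phrase's words with "_".
    doc = ''
    for seg in re.split(r'<phrase_Q=\d\.\d+>', line):
        parts = seg.split('</phrase>')
        if len(parts) > 1:
            doc += '_'.join(parts[0].split(' ')) + parts[1]
        else:
            doc += parts[0]
    return doc.strip()

line = 'the <phrase_Q=0.95>new york times</phrase> published a story'
print(join_phrases(line))  # the new_york_times published a story
```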
Dev set: the first 100 documents of each dataset, together with the newly added prompt.
(Movies: https://drive.google.com/file/d/1--fTTd_qlAE0R-ZpFWRUk9cOxTCWBc_W/view?usp=drive_link)
(News: https://drive.google.com/file/d/1-1Zr00xO3p6Hxav2C7NNUBOBv9jQnvIV/view?usp=drive_link)
Step 2: Setting Up the Environment
Install the necessary packages:
!pip install transformers
Step 3: Initializing the Zero-Shot Classifier
Load the MNLI-trained BART model from transformers to generate pseudo labels:
from transformers import pipeline
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
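For each document, the zero-shot pipeline returns a dict with "sequence", "labels", and "scores" (sorted by descending score), and the pseudo label is the top-scoring candidate. Below is a minimal sketch of that selection step using a hard-coded example result, so it runs without downloading the MNLI checkpoint; the candidate labels and scores are illustrative, not from the actual run:

```python
# In the real run: result = classifier(doc, candidate_labels)
candidate_labels = ["politics", "sports", "business", "technology"]

# Hard-coded example of the pipeline's output format.
result = {
    "sequence": "the federal reserve raised interest rates again",
    "labels": ["business", "politics", "technology", "sports"],
    "scores": [0.71, 0.18, 0.08, 0.03],
}

# "labels" is sorted by score, so the pseudo label is the first entry.
pseudo_label = result["labels"][0]
print(pseudo_label)  # business
```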
Step 4: Generating Pseudo Labels
Run the pseudo_labels.ipynb notebook to produce pseudo labels for the dataset. (It takes roughly 3 hours to generate pseudo labels for 2,000 documents, so the data was processed batch-wise and the results concatenated locally.)
This produced 2,999 pseudo labels for Movies and 6,000 for News.
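The batch-wise processing can be sketched as below; `classify_batch` is a hypothetical stand-in for the actual pipeline call in pseudo_labels.ipynb:

```python
def classify_batch(batch):
    # Hypothetical stand-in for the zero-shot pipeline call;
    # here it just tags each document with a dummy label.
    return ["label_%d" % (len(doc) % 3) for doc in batch]

docs = ["document %d" % i for i in range(25)]
batch_size = 10

all_labels = []
for start in range(0, len(docs), batch_size):
    batch = docs[start:start + batch_size]
    # Classify one batch at a time and concatenate the results,
    # so a long run can be resumed if it is interrupted.
    all_labels.extend(classify_batch(batch))

assert len(all_labels) == len(docs)
```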
Step 5: Embedding and Model Training
Use the CatE embeddings (Movies: https://drive.google.com/file/d/1-02FptOEHnrXzFon0GNlTlbivgR526Xa/view?usp=sharing, News: https://drive.google.com/file/d/10TmQpT8KpUznrWD0Y1A0DSzfUL78OUff/view?usp=sharing), generated with https://github.com/yumeng5/CatE, to convert words into 100-dimensional vectors, then train a GradientBoostingClassifier on them.
See TX_file.ipynb for the step-by-step implementation. Some scikit-learn packages used by the GitHub code above are outdated; this Stack Overflow post helps with debugging the resulting errors: https://stackoverflow.com/questions/72572969/problems-using-spherecluster-package-for-spherical-k-mean-clustering
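A minimal sketch of the embedding-and-training step, with a toy random vocabulary standing in for the real CatE vectors (all data here is synthetic, and averaging word vectors into a document vector is one common choice, not necessarily the exact one in TX_file.ipynb):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Toy stand-in for the CatE vocabulary: word -> 100-dimensional vector.
vocab = {w: rng.normal(size=100) for w in ["good", "bad", "movie", "news"]}

def doc_vector(doc):
    # Embed a document by averaging the vectors of its in-vocabulary words.
    vecs = [vocab[w] for w in doc.split() if w in vocab]
    return np.mean(vecs, axis=0) if vecs else np.zeros(100)

docs = ["good movie", "bad movie", "good news", "bad news"] * 5
labels = [1, 0, 1, 0] * 5

X = np.stack([doc_vector(d) for d in docs])
clf = GradientBoostingClassifier(n_estimators=20, random_state=0)
clf.fit(X, labels)
print(clf.score(X, labels))  # training accuracy on the toy data
```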
Step 6: Check for data imbalance and perform hyperparameter tuning. Once the model is finalized, predict on the test set.
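Step 6 can be sketched on synthetic data as follows; the class counts, grid values, and array shapes are illustrative only:

```python
from collections import Counter

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = [0] * 45 + [1] * 15  # deliberately imbalanced toy labels

# 6a: check for class imbalance in the pseudo labels.
counts = Counter(y)
print(dict(counts))  # {0: 45, 1: 15}

# 6b: hyperparameter tuning via cross-validated grid search.
grid = {"n_estimators": [10, 20], "max_depth": [2, 3]}
search = GridSearchCV(GradientBoostingClassifier(random_state=0), grid, cv=3)
search.fit(X, y)
print(search.best_params_)

# 6c: predict on the (held-out) test set with the tuned model.
X_test = rng.normal(size=(5, 5))
preds = search.predict(X_test)
```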