GitHub - LokeshSreenathJ/Text-Classification-using-Category-Label-Embedding-and-Weakly-Supervised-Learning: Building a model that utilizes Category guided Embeddings and Weakly Supervised Learning.

Repository files:
Category_Embeddings_Movies.bin
Category_Embeddings__News.bin
Text_Classification_CatE.ipynb
main.py
pseudo_labels.py
readme.txt
report.pdf
requirements.txt
test.sh
tx_file.py

Step 1: Download the datasets (News and Movies) from https://drive.google.com/file/d/1vfqzgDFMZyn1mHlzFx-t1_KuAiNgC64f/view.
	1.1 Use AutoPhrase (source: https://github.com/shangjingbo1226/AutoPhrase) to extract high-quality phrases from both datasets, then use its segmentation model to parse the same corpus. The recommended segmentation parameters are HIGHLIGHT_MULTI=0.7 and HIGHLIGHT_SINGLE=1.0. Convert all upper-case letters to lower case, and join the words of each phrase extracted by AutoPhrase with "_" so that the phrase is treated as a single word and represents the background context precisely.
Code for text processing			
###  (source: https://github.com/shangjingbo1226/AutoPhrase)
import argparse
import os
import re

from nltk import word_tokenize  # needs NLTK's tokenizer data to be downloaded
from tqdm import tqdm


def phrase_process():
	"""Join every AutoPhrase-tagged phrase into one underscore-separated token."""
	in_path = os.path.join('AutoPhrase', 'model_News', 'DBLP', 'segmentation.txt')
	with open(in_path) as f, open(args.out_file, 'w') as g:
		for line in tqdm(f):
			doc = ''
			# Split on the opening tags; a piece that contains '</phrase>'
			# starts with the phrase text, followed by ordinary text.
			temp = re.split(r'<phrase_Q=\d\.\d+>', line)
			for seg in temp:
				temp2 = seg.split('</phrase>')
				if len(temp2) > 1:
					doc += "_".join(temp2[0].split(" ")) + temp2[1]
				else:
					doc += temp2[0]
			g.write(doc.strip() + '\n')
	print("Phrase segmented corpus written to {}".format(args.out_file))


def preprocess():
	"""Tokenize every document and convert it to lower case."""
	with open(os.path.join('AutoPhrase', 'model_Movies', 'DBLP', 'text.txt')) as f:
		docs = f.readlines()
	with open(args.out_file, 'w') as f_out:
		for doc in tqdm(docs):
			f_out.write(' '.join(w.lower() for w in word_tokenize(doc.strip())) + '\n')


if __name__ == "__main__":
	parser = argparse.ArgumentParser(description='main', formatter_class=argparse.ArgumentDefaultsHelpFormatter)
	parser.add_argument('--mode', type=int)
	parser.add_argument('--dataset', default="News")
	parser.add_argument('--in_file', default='segmentation.txt')
	parser.add_argument('--out_file', default='./AutoPhrase/model_News/DBLP/text.txt')
	args, _ = parser.parse_known_args()

	# NOTE: the input paths are hard-coded and both steps write to args.out_file,
	# so the second call overwrites the output of the first; adjust the paths
	# or --out_file for each dataset and step.
	phrase_process()
	preprocess()
###
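
To make the effect of the phrase-joining step concrete, here is a minimal sketch that applies the same splitting logic as phrase_process() to a single line. The sample sentence and its quality scores are made up for illustration and do not come from the datasets; join_phrases is a hypothetical helper name.

```python
import re

# Same opening-tag pattern as in phrase_process().
PHRASE_OPEN = re.compile(r'<phrase_Q=\d\.\d+>')


def join_phrases(line):
    """Replace each tagged phrase with its words joined by underscores."""
    doc = ''
    for seg in PHRASE_OPEN.split(line):
        parts = seg.split('</phrase>')
        if len(parts) > 1:
            # parts[0] is the phrase text, parts[1] is the text that follows it.
            doc += '_'.join(parts[0].split(' ')) + parts[1]
        else:
            doc += parts[0]
    return doc.strip()


# Hypothetical segmented line, written by hand for this example.
sample = ("the <phrase_Q=0.912>stock market</phrase> fell on "
          "<phrase_Q=1.000>wall street</phrase> today")
print(join_phrases(sample))  # the stock_market fell on wall_street today
```

Lines without any phrase tags pass through unchanged apart from the stripped surrounding whitespace.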

Dev set: the first 100 documents of each dataset, along with the newly added prompt.
(Movies: https://drive.google.com/file/d/1--fTTd_qlAE0R-ZpFWRUk9cOxTCWBc_W/view?usp=drive_link)
(News: https://drive.google.com/file/d/1-1Zr00xO3p6Hxav2C7NNUBOBv9jQnvIV/view?usp=drive_link)


Step 2: Setting Up the Environment
Install the necessary packages:
!pip install transformers


Step 3: Initializing the Zero-Shot Classifier
Use a model fine-tuned on MultiNLI (facebook/bart-large-mnli), loaded through the transformers pipeline, to generate pseudo labels:

from transformers import pipeline
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
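
As a rough sketch of how a pseudo label can be read from the classifier, the zero-shot pipeline returns a dict with the keys "sequence", "labels" and "scores" for each input text. The example below uses a hand-written result in that shape so that it runs without downloading the model; the candidate labels, the scores and the helper name top_label are assumptions for illustration, not the categories used in this project.

```python
def top_label(result):
    """Return the (label, score) pair with the highest score in a zero-shot result."""
    best = max(range(len(result["scores"])), key=lambda i: result["scores"][i])
    return result["labels"][best], result["scores"][best]


# Hand-written stand-in for the dict returned by
# classifier(text, candidate_labels=[...]); the values are made up.
example_result = {
    "sequence": "the team won the championship game",
    "labels": ["sports", "politics", "business"],
    "scores": [0.91, 0.05, 0.04],
}

label, score = top_label(example_result)
print(label, score)  # sports 0.91
```

With the real model, the same helper would be applied to the output of classifier(text, candidate_labels=labels), where labels is the list of category names for the dataset.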


Step 4: Generating Pseudo Labels
Run the pseudo_labels script (pseudo_labels.py in the repository listing) to produce the pseudo labels for the dataset. Labelling 2000 documents takes approximately 3 hours, so the documents were processed in batches and the results concatenated locally.
Generated 2999 pseudo labels for Movies and 6000 pseudo labels for News.
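
The batch-wise processing described above can be sketched as follows. This is a minimal illustration and not the code in the pseudo_labels script: label_fn stands for any function that maps a list of documents to a list of labels (for example a wrapper around the zero-shot classifier), and fake_label_fn is a hypothetical stand-in so that the example runs quickly.

```python
def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def label_in_batches(docs, label_fn, batch_size):
    """Label the documents batch by batch and concatenate the results in order."""
    all_labels = []
    for batch in batches(docs, batch_size):
        all_labels.extend(label_fn(batch))
    return all_labels


def fake_label_fn(batch):
    """Hypothetical labeller: stands in for the slow zero-shot classifier."""
    return ["long" if len(doc.split()) > 3 else "short" for doc in batch]


docs = ["a b", "a b c d e", "x", "one two three four", "y z"]
labels = label_in_batches(docs, fake_label_fn, batch_size=2)
print(labels)  # ['short', 'long', 'short', 'long', 'short']
```

Saving each batch's labels to disk before starting the next one would let an interrupted run resume without relabelling earlier batches.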


Step 5: Embedding and Model Training
The CatE embeddings (Movies: https://drive.google.com/file/d/1-02FptOEHnrXzFon0GNlTlbivgR526Xa/view?usp=sharing, News: https://drive.google.com/file/d/10TmQpT8KpUznrWD0Y1A0DSzfUL78OUff/view?usp=sharing), generated with https://github.com/yumeng5/CatE, are used to convert the words into 100-dimensional vectors. A GradientBoostingClassifier model is then trained.
See TX_file.ipynb (tx_file.py in the repository listing) for the step-by-step implementation. When running the GitHub code referenced above, some of the scikit-learn dependencies are outdated; this Stack Overflow post helps with debugging the resulting errors: https://stackoverflow.com/questions/72572969/problems-using-spherecluster-package-for-spherical-k-mean-clustering
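
A minimal sketch of this step, under stated assumptions: the source does not say how word vectors are combined into a document vector, so averaging is assumed here, and the toy 4-dimensional embeddings, documents and labels are made up (the CatE vectors in this project are 100-dimensional). doc_vector is a hypothetical helper name.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier


def doc_vector(tokens, embeddings, dim):
    """Average the vectors of the known tokens; return zeros if none is known."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)


# Toy embeddings, written by hand for illustration only.
dim = 4
embeddings = {
    "goal": np.array([1.0, 0.0, 0.0, 0.0]),
    "match": np.array([0.9, 0.1, 0.0, 0.0]),
    "vote": np.array([0.0, 0.0, 1.0, 0.0]),
    "senate": np.array([0.0, 0.1, 0.9, 0.0]),
}
docs = [
    ["goal", "match"],
    ["match", "goal", "goal"],
    ["vote", "senate"],
    ["senate", "vote", "unknown_word"],  # out-of-vocabulary tokens are skipped
]
pseudo_labels = ["sports", "sports", "politics", "politics"]

# One row per document, one column per embedding dimension.
X = np.vstack([doc_vector(d, embeddings, dim) for d in docs])
clf = GradientBoostingClassifier(n_estimators=20, random_state=0)
clf.fit(X, pseudo_labels)
print(X.shape)  # (4, 4)
```

In the real pipeline the embeddings dict would be filled from the CatE output files and the labels would be the pseudo labels from Step 4.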

Step 6: Check for data imbalance and tune the hyperparameters. Once the model is finalized, generate predictions on the test set.
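
One simple way to check for imbalance is to look at the share of each class among the pseudo labels. The sketch below also computes "balanced" per-sample weights, n_samples / (n_classes * class_count), as one possible way to compensate; the source does not state which remedy was used, and the label list is made up for illustration.

```python
from collections import Counter


def class_distribution(labels):
    """Return the fraction of samples that belongs to each class."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def balanced_sample_weights(labels):
    """Weight each sample by n_samples / (n_classes * count of its class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return [n_samples / (n_classes * counts[label]) for label in labels]


# Made-up pseudo labels: 6 of class "a", 3 of class "b", 1 of class "c".
labels = ["a"] * 6 + ["b"] * 3 + ["c"] * 1
print(class_distribution(labels))  # {'a': 0.6, 'b': 0.3, 'c': 0.1}
weights = balanced_sample_weights(labels)
```

Such weights could be passed as sample_weight to GradientBoostingClassifier.fit, and hyperparameters such as n_estimators, learning_rate and max_depth could be searched with scikit-learn's GridSearchCV.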