Course Project: News Stance Detection (Qiang Zhang, Bill Lampos, February 23, 2018)

Question

Course Project: News Stance Detection
Qiang Zhang, Bill Lampos
February 23, 2018

1 Task Definition

In the context of news, a claim is made in a news headline, as well as in the piece of text in the article body. Quite often, the headline of a news article is written to attract readers, even though the body of the article may be about a different subject or may make a different claim than the headline.

Stance detection involves estimating the relative perspective (or stance) of two pieces of text, i.e. whether the two pieces agree, disagree, discuss, or are unrelated to one another. Your task in this project is to estimate the stance of a body text from a news article relative to a headline. The goal in stance detection is to detect whether the headline and the body of an article have the same claim. The stance can be categorized as one of four labels: “agree”, “disagree”, “discuss” and “unrelated”. The formal definitions of the four stances are:

• “agree” – the body text agrees with the headline;
• “disagree” – the body text disagrees with the headline;
• “discuss” – the body text discusses the same claim as the headline, but does not take a position;
• “unrelated” – the body text discusses a different claim from the one in the headline.

2 Dataset

We will be using the publicly available FNC-1 dataset [1]. This dataset is divided into a training set and a testing set. The ratio of training data to testing data is about 2:1. Every data sample is a pair of a headline and a body. There are 49972 pairs in the training set, with 49972 unique headlines and 1683 unique bodies. This means that an article body can be seen in more than one pair.

“unrelated” data takes the majority (over 70%) in both sets, while the percentage of “disagree” is less than 3%. The percentages of “agree” and “discuss” are less than 20% and 10%, respectively. Severe class imbalance exists in the FNC-1 dataset.

FNC-1 provides an official baseline [2] that may be helpful for reading the files and for splitting the train dataset into a training subset and a validation subset.

[1] https://github.com/FakeNewsChallenge/fnc-1
[2] https://github.com/FakeNewsChallenge/fnc-1-baseline

3 Involved Subtasks

The course project involves several subtasks that are required to be solved. This is a research-oriented project, so you are expected to be creative, and coming up with your own solutions is strongly encouraged for any part of the project.

• Split the training set into a training subset and a validation subset with a size ratio of about 9:1. The training subset and the validation subset should have similar ratios of the four classes. Statistics of the ratios should be presented.
• Extract vector representations of the headlines and bodies in all the datasets, and compute the cosine similarity between these two vectors. You can use representations based on bag-of-words or other methods like Word2Vec for vector-based representations. You are encouraged to explore alternative representations as well.
• Establish language-model-based representations of the headlines and the article bodies in all the datasets and calculate the KL-divergence for each pair of headline and article body. Feel free to explore different smoothing techniques for language-model-based representations.
• Propose and implement alternative features/distances that might be helpful for the stance detection task. Describe the meaning of each feature and its extraction process.
• Choose two kinds of representative distances/features that you think may be most important for stance detection and plot the distance distribution for the four stances. Comment on why you think these are the important features and try to validate their importance using the data.
• Using the features that you have created, implement a linear regression and a logistic regression model using gradient descent for stance classification. The implementations of these learning algorithms should be your own.
• Analyse the performance of your models using the test set. Describe the evaluation metric you use and explain why you think it is suited to this task. Feel free to use alternative metrics that you think may fit. Compare and contrast the performance of the two models you have implemented. Analyse the effect of the learning rate on both models.
• Explore which features are the most important for the stance detection task by analysing their importance for the machine learning models you have built.
• Do a literature review regarding the stance detection task; briefly summarize and compare the features and models that have been proposed for this task.
• Propose ways to improve the machine learning models you have implemented. You can either propose new machine learning models, new ways of sampling/using the training data, or propose new features. You are allowed to use existing libraries/packages for this part.

4 What to submit

You are expected to submit all the code you have written, together with a written report of up to 5 pages. Your report should describe the work you have done for each of the aforementioned steps. Unless otherwise stated above, all the code should be your own and you are not allowed to reuse any code that is available online.
You are allowed to use either Python or Java as the programming language.

5 Deadline

The deadline for submitting your project is midnight on April 6th.

Machine Learning for Data Mining and Information Retrieval
Association Rule Mining and Machine Learning
Emine Yilmaz
Some slides courtesy Andrew Ng@Stanford, Bing Liu@UIC

Identifying Relationships Between Items: Association rule mining
• Proposed by Agrawal et al. in 1993.
• It is an important data mining model studied extensively by the database and data mining community.
• Assume all data are categorical.
• Initially used for Market Basket Analysis to find how items purchased by customers are related.
Bread → Milk [sup = 5%, conf = 100%]

Transaction data: supermarket data
• Market basket transactions:
t1: {bread, cheese, milk}
t2: {apple, eggs, salt, yogurt}
… …
tn: {biscuit, eggs, milk}
• Concepts:
• An item i: an item/article in a basket
• I = {i1, i2, …, im}: the set of all items sold in the store
• A transaction t, with t ⊆ I: items purchased in a basket
• A transactional dataset: a set of transactions T = {t1, t2, …, tn}

Transaction data: a set of documents
• A text document data set. Each document is treated as a “bag” of keywords.
doc1: Student, Teach, School
doc2: Student, School
doc3: Teach, School, City, Game
doc4: Baseball, Basketball
doc5: Basketball, Player, Spectator
doc6: Baseball, Coach, Game, Team
doc7: Basketball, Team, City, Game

The model: rules
• A transaction t contains X, a set of items (itemset) in I, if X ⊆ t.
• An association rule is an implication of the form:
X → Y, where X, Y ⊂ I, and X ∩ Y = ∅
• An itemset is a set of items.
• E.g., X = {milk, bread, cereal} is an itemset.
• A k-itemset is an itemset with k items.
• E.g., {milk, bread, cereal} is a 3-itemset.

Rule strength measures
• Support: The rule holds with support sup in T (the transaction data set) if sup% of transactions contain X ∪ Y.
• sup = Pr(X  Y)• Confidence: The rule X->Y holds in T with confidence conf if conf% of transactions that contain X also contain Y.• conf = Pr(Y | X)• An association rule is a pattern that states when X occurs, Y occurs with certain probability. 7Support and Confidence• Support count: The support count of an itemset X, denoted by X.count, in a data set T is the number of transactions in T that contain X. Assume T has n transactions. • Then, ncountYXsupport).  ( countXcountYXconfidence.).  ( 8Goal and key features• Goal: Find all rules that satisfy the user-specified minimum support(minsup) and minimum confidence (minconf).• Key Features• Completeness: find all rules.• Mining with data on hard disk (not in memory)9An example• Transaction data• Assume:minsup = 30%minconf = 80%• An example frequent itemset:{Chicken, Clothes, Milk}    [sup = 3/7]• Association rules from the itemset:Clothes  Milk, Chicken [sup = 3/7, conf = 3/3]… …Clothes, Chicken Milk, [sup = 3/7, conf = 3/3]t1: Beef, Chicken, Milkt2: Beef, Cheeset3: Cheese, Bootst4: Beef, Chicken, Cheeset5: Beef, Chicken, Clothes, Cheese, Milkt6: Chicken, Clothes, Milkt7: Chicken, Milk, Clothes10Many mining algorithms• There are a large number of them!!• They use different strategies and data structures. • Their resulting sets of rules are all the same. • Given a transaction data set T, and a minimum support and a minimum confident, the set of association rules existing in T is uniquely determined.• Any algorithm should find the same set of rules although their computational efficiencies and memory requirements may be different. • We study only one: the Apriori Algorithm11The Apriori algorithm• Probably the best known algorithm• Two steps:• Find all itemsets that have minimum support (frequent itemsets, also called large itemsets).• Use frequent itemsets to generate rules. 
• E.g., a frequent itemset
{Chicken, Clothes, Milk} [sup = 3/7]
and one rule from the frequent itemset
Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]

Step 1: Mining all frequent itemsets
• A frequent itemset is an itemset whose support is ≥ minsup.
• Key idea: the apriori property (downward closure property): any subset of a frequent itemset is also a frequent itemset.
[Figure: itemset lattice over the items A, B, C, D, showing the 3-itemsets ABC, ABD, ACD, BCD, the 2-itemsets AB, AC, AD, BC, BD, CD, and the 1-itemsets A, B, C, D]

The Algorithm
• Iterative algorithm (also called level-wise search): find all 1-item frequent itemsets; then all 2-item frequent itemsets, and so on.
• In each iteration k, only
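
The support and confidence definitions from the slides, together with the level-wise search idea, can be checked on the seven example transactions. The following is a minimal sketch and not the slides' own pseudocode: the function names (`support_count`, `confidence`, `frequent_itemsets`) are made up for illustration, and candidate generation uses a simple join-then-prune step based on the downward closure property.

```python
from itertools import combinations

# The seven example transactions from the slides.
transactions = [
    {"Beef", "Chicken", "Milk"},
    {"Beef", "Cheese"},
    {"Cheese", "Boots"},
    {"Beef", "Chicken", "Cheese"},
    {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},
    {"Chicken", "Clothes", "Milk"},
    {"Chicken", "Milk", "Clothes"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(x, y, transactions):
    """support(X -> Y) = (X u Y).count / n"""
    return support_count(set(x) | set(y), transactions) / len(transactions)


def confidence(x, y, transactions):
    """confidence(X -> Y) = (X u Y).count / X.count (assumes X occurs at least once)."""
    return support_count(set(x) | set(y), transactions) / support_count(x, transactions)


def frequent_itemsets(transactions, minsup):
    """Level-wise search: frequent 1-itemsets first, then 2-itemsets, and so on."""
    threshold = minsup * len(transactions)
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([i]) for i in items
             if support_count([i], transactions) >= threshold]
    frequent = {}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = support_count(itemset, transactions)
        known = set(level)
        # Join: merge pairs of frequent k-itemsets into (k+1)-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune: downward closure says every k-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in known for sub in combinations(c, k))}
        level = [c for c in candidates
                 if support_count(c, transactions) >= threshold]
        k += 1
    return frequent


if __name__ == "__main__":
    print(support({"Clothes"}, {"Milk", "Chicken"}, transactions))
    print(confidence({"Clothes"}, {"Milk", "Chicken"}, transactions))
    for itemset, count in frequent_itemsets(transactions, 0.30).items():
        print(sorted(itemset), count)
```

With minsup = 30%, an itemset needs a support count of at least 3 out of 7 here, and {Chicken, Clothes, Milk} is the only frequent 3-itemset, in line with the example on the slides.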

Saurabh · Accepted Answer

Solution/baseline.py
import numpy as np
from nltk import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.neighbors import KNeighborsClassifier
from scipy.sparse import hstack
from utils.dataset import DataSet
from utils.generate_test_splits import split
from utils.score import report_score
dataset = DataSet()
data_splits = split(dataset)
training_data = data_splits['training']
dev_data = data_splits['dev']
test_data = data_splits['test']
LABELS = ['agree', 'disagree', 'discuss', 'unrelated']
class Preprocessor(object):
    def __init__(self):
        self.tokenizer = RegexpTokenizer(r'\w+')
        self.stopwords_eng = stopwords.words('english')
        self.lemmatizer = WordNetLemmatizer()
        
    def __call__(self, doc):
        return [self.lemmatizer.lemmatize(t) for t in self.tokenizer.tokenize(doc)]
        
    def process(self, text):
        tokens = self.tokenizer.tokenize(text.lower())
        tokens_processed = []
        for t in tokens:
            if t in self.stopwords_eng: continue
            tokens_processed.append(self.lemmatizer.lemmatize(t))
        return tokens_processed
        
class Document(object):
    def __init__(self, data):
        self.stances = []
        self.headlines = []
        self.body_texts = []
        self.size = 0
        for dict_item in data:
            label_index = LABELS.index(dict_item['Stance'])
            headline = dict_item['Headline']
            body = dataset.articles[dict_item['Body ID']]
            self.stances.append(label_index)
            self.headlines.append(headline)
            self.body_texts.append(body)
        self.size = len(self.stances)
        self.stances = np.asarray(self.stances)
        
    def get_full_text(self):
        full_texts = []
        for i in range(self.size):
            text = '\n'.join((self.headlines[i], self.body_texts[i]))
            full_texts.append(text)
        return full_texts
if __name__ == '__main__':
    #preprocessor = Preprocessor()
    training_doc = Document(training_data)
    test_doc = Document(test_data)
    
    vectorizer = CountVectorizer(ngram_range=(1,2), min_df=2, 
                                 stop_words='english')
    train_headline = vectorizer.fit_transform(training_doc.headlines)
    test_headline = vectorizer.transform(test_doc.headlines)
    train_body = vectorizer.fit_transform(training_doc.body_texts)
    test_body = vectorizer.transform(test_doc.body_texts)
    
    ch2 = SelectKBest(chi2, k=1000)
    ch2.fit(train_headline, training_doc.stances)
    train_headline = ch2.transform(train_headline)
    test_headline = ch2.transform(test_headline)
    ch2.fit(train_body, training_doc.stances)
    train_body = ch2.transform(train_body)
    test_body = ch2.transform(test_body)
    
    train_features = hstack((train_headline, train_body))
    test_features = hstack((test_headline, test_body))
    
    classifier = KNeighborsClassifier(n_neighbors=5)
    classifier.fit(train_features, training_doc.stances)
    
    prediction = classifier.predict(test_features)
    
    actual_label = [LABELS[x] for x in test_doc.stances]
    predicted_label = [LABELS[x] for x in prediction]
    report_score(actual_label, predicted_label)
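
The baseline above passes raw n-gram counts to the classifier, whereas the assignment also asks for a cosine similarity and a KL-divergence feature per headline/body pair. The following is a minimal, library-free sketch of those two features. It is an assumption-laden illustration rather than part of the submitted solution: the helper names (`tokenize`, `bow_cosine`, `smoothed_unigram_lm`, `kl_divergence`, `pair_features`) are hypothetical, the representation is plain bag-of-words term counts, and the language models are unigram models with add-alpha (Laplace) smoothing over the joint vocabulary of the pair.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-word characters (a deliberately simple tokenizer)."""
    return re.findall(r"\w+", text.lower())


def bow_cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction; treat it as dissimilar
    return dot / (norm_a * norm_b)


def smoothed_unigram_lm(counts, vocab, alpha=1.0):
    """Unigram language model with add-alpha smoothing over `vocab`."""
    total = sum(counts.values()) + alpha * len(vocab)
    return {t: (counts[t] + alpha) / total for t in vocab}


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same vocabulary."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p)


def pair_features(headline, body):
    """Return (cosine similarity, KL(headline || body)) for one pair."""
    h_counts = Counter(tokenize(headline))
    b_counts = Counter(tokenize(body))
    vocab = set(h_counts) | set(b_counts)
    p = smoothed_unigram_lm(h_counts, vocab)
    q = smoothed_unigram_lm(b_counts, vocab)
    return bow_cosine(h_counts, b_counts), kl_divergence(p, q)


if __name__ == "__main__":
    print(pair_features("Police find mass graves", "Police say they found mass graves"))
    print(pair_features("Police find mass graves", "Apple releases a new watch"))
```

Smoothing matters here because a headline is much shorter than a body: without it, any body word missing from the headline (or the reverse) would give a zero probability and an undefined divergence. KL divergence is asymmetric, so KL(body || headline) would be a separate feature.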
Solution/fnc-1.py
from __future__ import print_function
import numpy as np
from gensim.models import KeyedVectors
#from keras.preprocessing import sequence
#from keras.models import Sequential
#from keras.layers import Dense, Dropout, Embedding, LSTM, Bidirectional
#from keras.datasets import imdb
from nltk import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier
#from sklearn.preprocessing import StandardScaler
#from sklearn.neural_network import MLPClassifier
from scipy.sparse import hstack, csr_matrix
from utils.dataset import DataSet
from utils.generate_test_splits import split
from utils.
