Course Project: News Stance Detection
Qiang Zhang, Bill Lampos
February 23, 2018
1 Task Definition
In the context of news, a claim is made in a news headline, as well as in the piece of text in an article
body. Quite often, the headline of a news article is created so that it is attractive to the readers,
even though the body of the article may be about a different subject/may have another claim than
the headline.
Stance Detection involves estimating the relative perspective (or stance) of two pieces of text,
i.e. whether the two pieces agree, disagree, discuss, or are unrelated to one another. Your task
in this project is to estimate the stance of a body text from a news article relative to a headline.
The goal in stance detection is to detect whether the headline and the body of an article
have the same claim. The stance can be categorized as one of the four labels: “agree”, “disagree”,
“discuss” and “unrelated”. The formal definitions of the four stances are as follows:
• “agree” – the body text agrees with the headline;
• “disagree” – the body text disagrees with the headline;
• “discuss” – the body text discusses the same claim as the headline, but does not take a
position;
• “unrelated” – the body text discusses a different claim from the one in the headline.
2 Dataset
We will be using the publicly available FNC-1 dataset 1. This dataset is divided into a training
set and a testing set. The ratio of training data over testing data is about 2:1. Every data sample
is a pair of a headline and a body. There are 49972 pairs in the training set, with 49972 unique
headlines and 1683 unique bodies. This means that an article body can be seen in more than one
pair.
“Unrelated” pairs form the majority (over 70%) in both sets, while the percentage of “disagree”
is less than 3%. The percentages of “agree” and “discuss” are less than 20% and 10%, respectively.
Severe class imbalance exists in the FNC-1 dataset.
FNC-1 provides an official baseline 2 that may be helpful for reading files and for splitting the training
dataset into a training subset and a validation subset.
3 Involved Subtasks
The course project involves several subtasks that are required to be solved. This is a research
oriented project so you are expected to be creative and coming up with your own solutions is
strongly encouraged for any part of the project.
• Split the training set into a training subset and a validation subset with a data
proportion of about 9:1. The training subset and the validation subset should have similar ratios
of the four classes. Statistics of the ratios should be presented.
1 https://github.com/FakeNewsChallenge/fnc-1
2 https://github.com/FakeNewsChallenge/fnc-1-baseline
• Extract vector representations of headlines and bodies in all the datasets, and compute
the cosine similarity between these two vectors. You can use representations based on
bag-of-words or other methods like Word2Vec for vector based representations. You are encouraged
to explore alternative representations as well.
• Establish language model based representations of the headlines and the article bodies in all
the datasets and calculate the KL-divergence for each pair of headlines and article bodies.
Feel free to explore different smoothing techniques for language model based representations.
• Propose and implement alternative features/distances that might be helpful for the stance
detection task. Describe feature meaning and extraction process.
• Choose two kinds of representative distances/features that you think may be most important
for stance detection and plot the distance distribution for the four stances. Comment on why
you think these are the important features and try to validate their importance using the
data.
• Using the features that you have created, implement a linear regression and a logistic
regression model using gradient descent for stance classification. The implementations of these
learning algorithms should be your own.
• Analyse the performance of your models using the test set. Describe the evaluation metric
you use and explain why you think it is suited for this task. Feel free to use alternative
metrics that you think may fit. Compare and contrast the performance of the two models
you have implemented. Analyse the effect of learning rate on both models.
• Explore which features are the most important for the stance detection task by analysing
their importance for the machine learning models you have built.
• Do a literature review regarding the stance detection task; briefly summarize and compare
the features and models that have been proposed for this task.
• Propose ways to improve the machine learning models you have implemented. You can
either propose new machine learning models, new ways of sampling/using the training data,
or propose new features. You are allowed to use existing libraries/packages for this part.
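As an illustration of the two distance subtasks above (cosine similarity and KL-divergence), the following is a minimal sketch rather than a reference solution. It assumes a naive whitespace tokenizer, raw term-frequency vectors, and add-alpha smoothing over the joint vocabulary of the pair; the function names and the example sentences are hypothetical and not part of the FNC-1 baseline.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words term-frequency vectors."""
    tf_a, tf_b = Counter(tokens_a), Counter(tokens_b)
    # Only words present in both texts contribute to the dot product.
    dot = sum(tf_a[w] * tf_b[w] for w in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def unigram_lm(tokens, vocab, alpha=1.0):
    """Unigram language model with add-alpha smoothing (an assumed choice)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


# Hypothetical headline/body pair, for illustration only.
headline = tokenize("Police find mass graves near town")
body = tokenize("Officials said police found several mass graves near the town")
vocab = set(headline) | set(body)

cos = cosine_similarity(headline, body)
kl = kl_divergence(unigram_lm(headline, vocab), unigram_lm(body, vocab))
print(f"cosine = {cos:.3f}, KL(headline || body) = {kl:.3f}")
```

Smoothing matters here because an unsmoothed model assigns zero probability to words missing from one text, which makes the KL-divergence undefined; other smoothing schemes can be substituted in `unigram_lm`.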
4 What to submit
You are expected to submit all the code you have written, together with a written report up to 5
pages. Your report should describe the work you have done for each of the aforementioned steps.
Unless otherwise stated above, all the code should be your own and you are not allowed to reuse
any code that is available online. You may use either Python or Java as the programming
language.
5 Deadline
The deadline for submitting your project is midnight on April 6th.
Machine Learning for Data Mining and Information Retrieval
Association Rule Mining and
Machine Learning
Emine Yilmaz
Some slides courtesy of Andrew [email protected], Bing [email protected]
Identifying Relationships Between Items:
Association rule mining
• Proposed by Agrawal et al. in 1993.
• It is an important data mining model studied extensively by the
database and data mining community.
• Assume all data are categorical.
• Initially used for Market Basket Analysis to find how items purchased
by customers are related.
Bread → Milk [sup = 5%, conf = 100%]
Transaction data: supermarket data
• Market basket transactions:
t1: {bread, cheese, milk}
t2: {apple, eggs, salt, yogurt}
… …
tn: {biscuit, eggs, milk}
• Concepts:
• An item i: an item/article in a basket
• I = {i1, i2, …, im}: the set of all items sold in the store
• A transaction t, where t ⊆ I: items purchased in a basket
• A transactional dataset: A set of transactions T = {t1, t2, …, tn}
Transaction data: a set of documents
• A text document data set. Each document is treated as a “bag”
of keywords
doc1: Student, Teach, School
doc2: Student, School
doc3: Teach, School, City, Game
doc4: Baseball, Basketball
doc5: Basketball, Player, Spectator
doc6: Baseball, Coach, Game, Team
doc7: Basketball, Team, City, Game
The model: rules
• A transaction t contains X, a set of items (itemset) in I, if X ⊆ t.
• An association rule is an implication of the form:
X → Y, where X, Y ⊂ I, and X ∩ Y = ∅
• An itemset is a set of items.
• E.g., X = {milk, bread, cereal} is an itemset.
• A k-itemset is an itemset with k items.
• E.g., {milk, bread, cereal} is a 3-itemset
Rule strength measures
• Support: The rule holds with support sup in T (the transaction data
set) if sup% of transactions contain X ∪ Y.
• sup = Pr(X ∪ Y)
• Confidence: The rule X → Y holds in T with confidence conf if conf% of
transactions that contain X also contain Y.
• conf = Pr(Y | X)
• An association rule is a pattern that states when X occurs, Y occurs
with certain probability.
Support and Confidence
• Support count: The support count of an itemset X, denoted by X.count,
in a data set T is the number of transactions in T that contain X. Assume
T has n transactions.
• Then,
$$\text{support} = \frac{(X \cup Y).\text{count}}{n}$$
$$\text{confidence} = \frac{(X \cup Y).\text{count}}{X.\text{count}}$$
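The support and confidence definitions above translate directly into code. The sketch below is illustrative only: it represents each transaction as a Python set, reuses the seven keyword documents from the earlier slide as the data set T, and the function names are assumptions made for this example.

```python
def support_count(itemset, transactions):
    """X.count: number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(x, y, transactions):
    """Support of the rule X -> Y: (X u Y).count / n."""
    return support_count(x | y, transactions) / len(transactions)


def confidence(x, y, transactions):
    """Confidence of the rule X -> Y: (X u Y).count / X.count."""
    return support_count(x | y, transactions) / support_count(x, transactions)


# The seven keyword documents from the "a set of documents" slide.
docs = [
    {"Student", "Teach", "School"},
    {"Student", "School"},
    {"Teach", "School", "City", "Game"},
    {"Baseball", "Basketball"},
    {"Basketball", "Player", "Spectator"},
    {"Baseball", "Coach", "Game", "Team"},
    {"Basketball", "Team", "City", "Game"},
]

x, y = {"Student"}, {"School"}
print(f"sup = {support(x, y, docs):.3f}, conf = {confidence(x, y, docs):.3f}")
```

For the rule Student → School, both documents containing Student also contain School, so the support is 2/7 and the confidence is 2/2.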
Goal and key features
• Goal: Find all rules that satisfy the user-specified minimum support
(minsup) and minimum confidence (minconf).
• Key Features
• Completeness: find all rules.
• Mining with data on hard disk (not in memory)
An example
• Transaction data
t1: Beef, Chicken, Milk
t2: Beef, Cheese
t3: Cheese, Boots
t4: Beef, Chicken, Cheese
t5: Beef, Chicken, Clothes, Cheese, Milk
t6: Chicken, Clothes, Milk
t7: Chicken, Milk, Clothes
• Assume:
minsup = 30%
minconf = 80%
• An example frequent itemset:
{Chicken, Clothes, Milk} [sup = 3/7]
• Association rules from the itemset:
Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]
… …
Clothes, Chicken → Milk [sup = 3/7, conf = 3/3]
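The rules in this example can be checked by enumerating every way of splitting the frequent itemset into a left-hand and a right-hand side and keeping the splits that reach minconf. The following is a small sketch under that assumption; `count` and `rules_from_itemset` are hypothetical helper names, and the itemset passed in is assumed to occur in at least one transaction.

```python
from itertools import combinations

# The seven transactions t1..t7 from the example.
transactions = [
    {"Beef", "Chicken", "Milk"},
    {"Beef", "Cheese"},
    {"Cheese", "Boots"},
    {"Beef", "Chicken", "Cheese"},
    {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},
    {"Chicken", "Clothes", "Milk"},
    {"Chicken", "Milk", "Clothes"},
]


def count(itemset, transactions):
    """Number of transactions containing every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)


def rules_from_itemset(itemset, transactions, minconf):
    """Return (X, Y, conf) for every split of itemset with conf >= minconf.

    Assumes itemset occurs in at least one transaction, so every
    antecedent count is non-zero.
    """
    itemset = frozenset(itemset)
    full = count(itemset, transactions)
    rules = []
    for size in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), size):
            x = frozenset(antecedent)
            conf = full / count(x, transactions)
            if conf >= minconf:
                rules.append((x, itemset - x, conf))
    return rules


rules = rules_from_itemset({"Chicken", "Clothes", "Milk"}, transactions, 0.8)
for x, y, conf in rules:
    print(sorted(x), "->", sorted(y), f"conf = {conf:.2f}")
```

With minconf = 80%, only the splits whose left-hand side contains Clothes survive, since Clothes appears in exactly the three transactions that contain the whole itemset.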
Many mining algorithms
• There are a large number of them!!
• They use different strategies and data structures.
• Their resulting sets of rules are all the same.
• Given a transaction data set T, a minimum support, and a minimum confidence, the set
of association rules existing in T is uniquely determined.
• Any algorithm should find the same set of rules although their computational
efficiencies and memory requirements may be different.
• We study only one: the Apriori Algorithm
The Apriori algorithm
• Probably the best known algorithm
• Two steps:
• Find all itemsets that have minimum support (frequent itemsets, also
called large itemsets).
• Use frequent itemsets to generate rules.
• E.g., a frequent itemset
{Chicken, Clothes, Milk} [sup = 3/7]
and one rule from the frequent itemset
Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]
Step 1: Mining all frequent itemsets
• A frequent itemset is an itemset whose support is ≥ minsup.
• Key idea: The apriori property (downward closure property):
any subset of a frequent itemset is also a frequent itemset
ABC ABD ACD BCD
AB AC AD BC BD CD
A B C D
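The downward closure property is what makes a level-wise search practical: a (k+1)-itemset only needs to be counted if all of its k-item subsets are already known to be frequent. The sketch below illustrates that idea on the seven transactions from the earlier example; it is a simplified illustration written for this note, not the exact candidate-generation procedure of the Apriori algorithm, and `frequent_itemsets` is a hypothetical name.

```python
from itertools import combinations


def frequent_itemsets(transactions, minsup):
    """Level-wise search: return {itemset: support count} for frequent itemsets."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        # Count each candidate and keep those meeting minimum support.
        level = {}
        for c in candidates:
            cnt = sum(1 for t in transactions if c <= t)
            if cnt / n >= minsup:
                level[c] = cnt
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-item candidates.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: every (k-1)-item subset must itself be frequent.
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


transactions = [
    {"Beef", "Chicken", "Milk"},
    {"Beef", "Cheese"},
    {"Cheese", "Boots"},
    {"Beef", "Chicken", "Cheese"},
    {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},
    {"Chicken", "Clothes", "Milk"},
    {"Chicken", "Milk", "Clothes"},
]

result = frequent_itemsets(transactions, minsup=0.3)
for itemset, cnt in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), cnt)
```

In this run the only 3-item candidate that survives pruning is {Chicken, Clothes, Milk}; candidates such as {Beef, Chicken, Milk} are discarded without counting because {Beef, Milk} is not frequent.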
The Algorithm
• Iterative algo. (also called level-wise search): Find all 1-item
frequent itemsets; then all 2-item frequent itemsets, and so on.
• In each iteration k, only