3/6/22, 2:27 AM Assignment 4: Prediction
Assignment 4: Prediction
Due: Tuesday by 11:59pm. Points: 80. Submission: a file upload.
I. Do the following experiments:
1. Run Weka's Naive Bayes on the original loan dataset (Loan_original.arff) and on the 2-bin and 3-bin discretized data, using the training set test option. Examine the error rate and identify the wrongly classified instances. To do the latter, right-click on the current line in the result list window and then select "Visualize classifier errors". The wrongly classified instances will show in the plot as small squares. Click on an instance to get its information.
Answer the following questions:
a) How is the model represented? What do the counts mean and why are they incremented by 1 (i.e. actual value count + 1)?
b) What actually happens when you test the classifier on the training set?
c) How do errors occur? What could be the reasons?
d) How does the error rate change over the three different data sets (original, 2-bin, 3-bin)? Any guess why?
2. Run Weka's IBk algorithm on the original loan data (no discretization) with the "Use training set" test option and examine the evaluation results (correctly/incorrectly classified instances). Vary KNN (the number of neighbors) with and without distance weighting. Try for example KNN = 1, 3, 5, 20 without distance weighting and with weight = 1/distance. Compare the results and find explanations.
Answer the following questions by looking at what the algorithm does for each instance from the test set (which in this case is also the training set). Find conceptual-level explanations; no need to go into computing distances:
e) What actually happens when you test the classifier on the training set?
f) How do errors occur? What could be the reasons?
g) How does the error rate change with the KNN parameter in IBk?
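The behavior probed by questions e) through g) can be previewed outside Weka with a small pure-Python sketch of IBk-style k-nearest-neighbor voting. The two-feature data below is a toy stand-in invented for illustration, not the loan set, and the weighting option only mirrors Weka's 1/distance scheme conceptually:

```python
from collections import Counter

def knn_predict(train, query, k, weighted=False):
    """Classify `query` by majority vote among its k nearest training
    points (squared Euclidean distance). With weighted=True each
    neighbour votes with weight 1/distance, loosely mirroring Weka's
    IBk distance-weighting option."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), label)
        for x, label in train
    )
    votes = Counter()
    for d, label in dists[:k]:
        votes[label] += (1.0 / d) if (weighted and d > 0) else 1.0
    return votes.most_common(1)[0][0]

# Toy 2-D dataset standing in for the loan data (hypothetical values).
train = [((1.0, 1.0), "yes"), ((1.2, 0.9), "yes"),
         ((0.9, 1.1), "yes"), ((3.0, 3.0), "no"),
         ((3.2, 2.8), "no"), ((1.1, 2.9), "no")]

def training_error(k, weighted=False):
    """Fraction of training instances misclassified when the training
    set is also used as the test set."""
    wrong = sum(knn_predict(train, x, k, weighted) != y for x, y in train)
    return wrong / len(train)

print(training_error(1))  # 0.0: each instance is its own nearest neighbour
print(training_error(5))  # larger k can misclassify training instances
```

With k = 1 every test instance finds itself at distance 0, so evaluating on the training set gives zero error; with larger k, neighbors from the other class can outvote the instance itself, which is one conceptual answer to questions e) and g).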
3. Decide on the application of a new customer.
Prepare a test set using the information of the new customer described in Assignment 3: a customer applying for a 30-month loan with 80,000 yen monthly pay to buy a car, with the following data: male, employed, 22 years old, not married, does not live in a problematic area, has worked 1 year for his last employer and has 500,000 yen in a bank.
For more information use Handout Week 7 (https://wssu.instructure.com/courses/19153/files/2875098/download?download_frd=1), "Using Weka 3 for classification and prediction".
Run Naive Bayes and IBk with different parameters for KNN and distance weighting, all with
"Supplied test set" test option.
Compare the prediction results obtained with different algorithms.
Decide on the loan application of the new customer by using the outputs from the prediction
algorithms.
II. Write a report on the prediction experiments described above. Include the following information (DO NOT include data sets or classifier outputs):
The original 7 questions (4 about Bayes and 3 about IBk) with short answers to EACH ONE.
ONE Naive Bayes model (any version of the loan data set) with explanations of its parameters (the
answer to #1 (a) may be included here).
Results from predicting the new customer's classification (Experiments with Prediction, #3) with short
comment.
Chapter 4
Data Mining: Practical Machine Learning Tools and Techniques
Slides for Chapter 4, "Algorithms: the basic methods", of Data Mining by I. H. Witten, E. Frank, M. A. Hall and C. J. Pal
Algorithms: The basic methods
• Simple probabilistic modeling
• Linear models
• Instance-based learning
Can combine probabilities using Bayes's rule
• Famous rule from probability theory due to Thomas Bayes (born 1702 in London, England; died 1761 in Tunbridge Wells, Kent, England)
• Probability of an event H given observed evidence E:
  P(H | E) = P(E | H) P(H) / P(E)
• A priori probability of H is P(H): the probability of the event before evidence is seen
• A posteriori probability of H is P(H | E): the probability of the event after evidence is seen
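A minimal numeric illustration of the rule, using hypothetical probabilities (not taken from any dataset in these slides):

```python
def posterior(p_e_given_h, p_h, p_e):
    """Bayes' rule: P(H | E) = P(E | H) * P(H) / P(E)."""
    return p_e_given_h * p_h / p_e

# Hypothetical numbers: P(E | H) = 0.8, P(H) = 0.3, and P(E) expanded
# over both hypotheses: P(E) = P(E | H) P(H) + P(E | not H) P(not H).
p_e = 0.8 * 0.3 + 0.2 * 0.7
print(posterior(0.8, 0.3, p_e))  # ≈ 0.632: evidence raised P(H) from 0.3
```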
Naïve Bayes for classification
• Classification learning: what is the probability of the
class given an instance?
• Evidence E = instance’s non-class attribute values
• Event H = class value of instance
• Naïve assumption: evidence splits into parts (i.e.,
attributes) that are conditionally independent
• This means, given n attributes, we can write Bayes’ rule
using a product of per-attribute probabilities:
P(H | E) = P(E1 | H) P(E2 | H) … P(En | H) P(H) / P(E)
Weather data example
• Evidence E (a new instance):
  Outlook = Sunny, Temp. = Cool, Humidity = High, Windy = True, Play = ?
• Probability of class "yes":
  P(yes | E) = P(Outlook = Sunny | yes)
             × P(Temperature = Cool | yes)
             × P(Humidity = High | yes)
             × P(Windy = True | yes)
             × P(yes) / P(E)
           = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / P(E)
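The product can be checked with exact fractions. The "no" factors below are read off the standard weather-data counts (Sunny 3/5, Cool 1/5, High 4/5, True 3/5, prior 5/14); note that P(E) cancels when the two likelihoods are normalized:

```python
from fractions import Fraction as F

# Per-attribute conditional probabilities for the instance
# (Outlook=Sunny, Temp=Cool, Humidity=High, Windy=True).
like_yes = F(2, 9) * F(3, 9) * F(3, 9) * F(3, 9) * F(9, 14)
like_no  = F(3, 5) * F(1, 5) * F(4, 5) * F(3, 5) * F(5, 14)

# P(E) cancels when normalising the two likelihoods:
p_yes = like_yes / (like_yes + like_no)
print(float(like_yes))  # ≈ 0.0053
print(float(p_yes))     # ≈ 0.205, so "no" is the predicted class
```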
The "zero-frequency problem"
• What if an attribute value does not occur with every class value? (e.g., "Humidity = High" for class "yes")
• The conditional probability will be zero: P(Humidity = High | yes) = 0
• The a posteriori probability will also be zero, regardless of how likely the other values are: P(yes | E) = 0
• Remedy: add 1 to the count for every attribute value-class combination (Laplace estimator)
• Result: probabilities will never be zero
• Additional advantage: stabilizes probability estimates computed from small samples of data
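The Laplace remedy is a one-line formula. The counts below are the Outlook counts for class "yes" from the weather data (Sunny 2, Overcast 4, Rainy 3, out of 9):

```python
def laplace(count, total, n_values):
    """Laplace estimator: add 1 to every attribute-value count, so the
    denominator grows by the number of distinct attribute values."""
    return (count + 1) / (total + n_values)

# Outlook within class "yes": Sunny 2, Overcast 4, Rainy 3 (9 total).
probs = [laplace(c, 9, 3) for c in (2, 4, 3)]
print(probs)             # [0.25, 0.4166..., 0.3333...], still sums to 1
print(laplace(0, 9, 3))  # a zero count still gets probability 1/12
```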
Modified probability estimates
• In some cases adding a constant different from 1 might be more appropriate
• Example: attribute Outlook for class "yes", with μ virtual counts split among the values by weights p1, p2, p3:
  Sunny: (2 + μp1) / (9 + μ)    Overcast: (4 + μp2) / (9 + μ)    Rainy: (3 + μp3) / (9 + μ)
• Weights don't need to be equal (but they must sum to 1)
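A sketch of this generalized estimator, assuming μ virtual samples split by prior weights (the function name and default μ = 3 are choices made here for illustration):

```python
def m_estimate(count, total, prior_weight, mu=3.0):
    """Generalised Laplace: blend the observed relative frequency with
    a prior probability `prior_weight`, using mu 'virtual' samples."""
    return (count + mu * prior_weight) / (total + mu)

# Outlook for class "yes" (counts 2/4/3 out of 9), equal weights of 1/3:
est = [m_estimate(c, 9, 1 / 3) for c in (2, 4, 3)]
print(est)
# With mu=3 and equal weights this reduces to the plain Laplace
# estimator: (2+1)/(9+3), (4+1)/(9+3), (3+1)/(9+3).
```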
Missing values
• Training: the instance is not included in the frequency count for the attribute value-class combination
• Classification: the attribute is omitted from the calculation
• Example (Outlook missing):
  Outlook = ?, Temp. = Cool, Humidity = High, Windy = True, Play = ?
  Likelihood of "yes" = 3/9 × 3/9 × 3/9 × 9/14 = 0.0238
  Likelihood of "no" = 1/5 × 4/5 × 3/5 × 5/14 = 0.0343
  P("yes") = 0.0238 / (0.0238 + 0.0343) = 41%
  P("no") = 0.0343 / (0.0238 + 0.0343) = 59%
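The example can be reproduced with exact fractions, simply dropping the Outlook factor from each product:

```python
from fractions import Fraction as F

# Outlook is missing, so its conditional probability is omitted:
like_yes = F(3, 9) * F(3, 9) * F(3, 9) * F(9, 14)
like_no  = F(1, 5) * F(4, 5) * F(3, 5) * F(5, 14)

p_yes = like_yes / (like_yes + like_no)
print(float(like_yes), float(like_no))  # ≈ 0.0238, ≈ 0.0343
print(round(float(p_yes) * 100))        # 41 (percent)
```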
Numeric attributes
• Usual assumption: attributes have a normal or Gaussian probability distribution (given the class)
• The probability density function for the normal distribution is defined by two parameters:
  • the sample mean μ
  • the standard deviation σ
• Then the density function f(x) is
  f(x) = 1 / (√(2π) σ) × exp(−(x − μ)² / (2σ²))
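A direct implementation of this density; the figures 66, 73 and 6.2 are the temperature value and the class-"yes" mean and standard deviation from the weather data:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Normal density: 1/(sqrt(2*pi)*sigma) * exp(-(x-mu)^2/(2*sigma^2))."""
    return (math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
            / (math.sqrt(2 * math.pi) * sigma))

# Temperature 66 under class "yes" (mu = 73, sigma = 6.2):
print(round(gaussian_pdf(66, 73, 6.2), 4))  # 0.034
```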
Statistics for weather data

Outlook      Sunny: 2 (yes), 3 (no) → 2/9, 3/5
             Overcast: 4, 0 → 4/9, 0/5
             Rainy: 3, 2 → 3/9, 2/5
Temperature  yes: 64, 68, 69, 70, 72, …  (mean = 73, std. dev. = 6.2)
             no: 65, 71, 72, 80, 85, …  (mean = 75, std. dev. = 7.9)
Humidity     yes: 65, 70, 70, 75, 80, …  (mean = 79, std. dev. = 10.2)
             no: 70, 85, 90, 91, 95, …  (mean = 86, std. dev. = 9.7)
Windy        False: 6 (yes), 2 (no) → 6/9, 2/5
             True: 3, 3 → 3/9, 3/5
Play         yes: 9 → 9/14;  no: 5 → 5/14

• Example density value:
  f(Temperature = 66 | yes) = 1 / (√(2π) × 6.2) × exp(−(66 − 73)² / (2 × 6.2²)) = 0.0340
Classifying a new day
• A new day:
  Outlook = Sunny, Temp. = 66, Humidity = 90, Windy = true, Play = ?
• Missing values during training are not included in the calculation of the mean and standard deviation
  Likelihood of "yes" = 2/9 × 0.0340 × 0.0221 × 3/9 × 9/14 = 0.000036
  Likelihood of "no" = 3/5 × 0.0221 × 0.0381 × 3/5 × 5/14 = 0.000108
  P("yes") = 0.000036 / (0.000036 + 0.000108) = 25%
  P("no") = 0.000108 / (0.000036 + 0.000108) = 75%
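The arithmetic for the new day can be replayed directly, using the rounded density values quoted on the slide (0.0340, 0.0221, 0.0381) rather than recomputing them:

```python
# Per-factor values quoted on the slide for
# Outlook = Sunny, Temp = 66, Humidity = 90, Windy = true.
like_yes = (2 / 9) * 0.0340 * 0.0221 * (3 / 9) * (9 / 14)
like_no  = (3 / 5) * 0.0221 * 0.0381 * (3 / 5) * (5 / 14)

p_yes = like_yes / (like_yes + like_no)
print(f"{like_yes:.6f} {like_no:.6f}")  # 0.000036 and 0.000108
print(round(p_yes * 100))               # 25 (percent)
```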
Probability densities
• Probability densities f(x) can be greater than 1; hence, they are not probabilities
• However, they must integrate to 1: the area under the probability density curve must be 1
• The approximate relationship between probability and probability density can be stated as
  P(x − ε/2 ≤ X ≤ x + ε/2) ≈ ε f(x)
  assuming ε is sufficiently small
• When computing likelihoods, we can treat densities just like probabilities
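A quick numeric check of this approximation, using the weather-data temperature density (μ = 73, σ = 6.2) and a simple midpoint-rule integral over the same tiny interval:

```python
import math

def gpdf(x, mu, sigma):
    """Normal density function."""
    return (math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
            / (math.sqrt(2 * math.pi) * sigma))

# P(x - eps/2 <= X <= x + eps/2) ≈ eps * f(x) for small eps.
mu, sigma, x, eps = 73.0, 6.2, 66.0, 0.01
approx = eps * gpdf(x, mu, sigma)

# Compare against a midpoint-rule integral of the density over the
# same interval, using 1000 slices:
n = 1000
width = eps / n
integral = sum(gpdf(x - eps / 2 + (i + 0.5) * width, mu, sigma) * width
               for i in range(n))
print(approx, integral)  # the two agree to many decimal places
```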
Multinomial naïve Bayes I
• Version of naïve Bayes used for document classification with the bag-of-words model
• n1, n2, ..., nk: number of times word i occurs in the document
• P1, P2, ..., Pk: probability of obtaining word i when sampling from documents in class H
• Probability of observing a particular document E given class H (based on the multinomial distribution):
  P(E | H) = N! × ∏i (Pi^ni / ni!),  where N = n1 + n2 + ... + nk
• Note that this expression ignores the probability of generating a document of the right length
• This probability is assumed to be constant for all classes
Multinomial naïve Bayes II
• Suppose the dictionary has two words, yellow and blue
• Suppose P(yellow | H) = 75% and P(blue | H) = 25%
• Suppose E is the document "blue yellow blue"
• Probability of observing the document:
  P({blue yellow blue} | H) = 3! × (0.75¹ / 1!) × (0.25² / 2!) = 9/64 ≈ 0.14
• Suppose there is another class H' that has P(yellow | H') = 10% and P(blue | H') = 90%:
  P({blue yellow blue} | H') = 3! × (0.1¹ / 1!) × (0.9² / 2!) = 243/1000 ≈ 0.24
• Need to take the prior probability of the class into account to make the final classification using Bayes' rule
• The factorials do not actually need to be computed: they drop out
• Underflows can be prevented by using logarithms
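Both class likelihoods follow from one small function implementing the multinomial expression (yellow occurs once and blue twice in "blue yellow blue"):

```python
import math

def multinomial_likelihood(word_probs, counts):
    """P(E | H) = N! * prod(p_i^n_i / n_i!) for a bag-of-words document,
    where counts[i] is how often word i occurs and N is the doc length."""
    n = sum(counts)
    like = float(math.factorial(n))
    for p, c in zip(word_probs, counts):
        like *= p ** c / math.factorial(c)
    return like

# Document "blue yellow blue": yellow once, blue twice.
print(multinomial_likelihood([0.75, 0.25], [1, 2]))  # 0.140625 = 9/64
print(multinomial_likelihood([0.10, 0.90], [1, 2]))  # 0.243
```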
Naïve Bayes: discussion
• Naïve Bayes works surprisingly well even if the independence assumption is clearly violated
• Why? Because classification does not require accurate probability estimates as long as the maximum probability is assigned to the correct class
• However: adding too many redundant attributes will cause problems (e.g., identical attributes)
• Note also: many numeric attributes are not normally distributed (kernel density estimators can be used instead)
Classification
• Any regression technique can be used for classification
• Training: perform a regression for each class, setting the output to 1 for training instances that belong to the class and 0 for those that don't
• Prediction: predict the class corresponding to the model with the largest output value (membership value)
• For linear regression this method is also known as multi-response linear regression
• Problem: membership values are not in the [0,1] range, so they cannot be considered proper probability estimates
• In practice, they are often simply clipped into the [0,1] range and normalized to sum to 1
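The multi-response scheme can be sketched for a single numeric feature, with closed-form least squares per class followed by the clip-and-normalize step. The data and values here are invented for illustration:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (one feature)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

# One numeric feature; class "A" for small values, "B" for large (toy data).
xs = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
labels = ["A", "A", "A", "B", "B", "B"]

# One regression per class: target 1 for members, 0 for non-members.
models = {}
for cls in ("A", "B"):
    models[cls] = fit_line(xs, [1.0 if l == cls else 0.0 for l in labels])

def memberships(x):
    """Clip each model's output into [0,1] and normalize to sum to 1."""
    raw = {cls: a * x + b for cls, (a, b) in models.items()}
    clipped = {cls: min(1.0, max(0.0, v)) for cls, v in raw.items()}
    s = sum(clipped.values())
    return {cls: v / s for cls, v in clipped.items()}

m = memberships(2.0)
print(max(m, key=m.get))  # "A": the class-A model outputs the larger value
```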
Linear models: logistic regression
• Can we do better than using linear regression for classification?
• Yes, we can, by applying logistic regression
• Logistic regression builds a linear model for a transformed target variable
• Assume we have two classes
• Logistic regression replaces the original target P(1 | a1, a2, ..., ak) by this target:
  log( P(1 | a1, ..., ak) / (1 − P(1 | a1, ..., ak)) )
• This logit transformation maps [0,1] to (−∞, +∞), i.e., the new target values are no longer restricted to the