
CS 301 Fall 2022 Sample Exam Solution
Time: 2 hours 30 minutes
Total points: 45

Name --------------------------------------------------------

1. Multiple choice or single-answer questions. You do not need to show your work. (10)
a. Which one has the highest entropy (select all that apply)?
i. A fair coin
ii. An unfair coin
iii. A fair 6-sided die
iv. A fair 4-sided die
b. The accuracy of an unpruned decision tree is 40% on the test data, while that on the training data is close to 60%. Write the accuracy expression for this using the 0.632 bootstrap.

acc = (1/b) × Σ_{i=1..b} (0.632 × 0.4 + 0.368 × 0.6)
c. The following data demonstrates the relationship between Math Score and Age of
the test writers. Write down the model equation.


score = -0.8157 * Age + 79.56886
d. The dataset contains the following 10 training points (each point contains their
coordinates and a class label).
X1(1,1) - Male, X2(2,2) - Male, X3(2,2.5) - Female, X4(3,7) - Female, X5(9,9) - Male, X6(8,9) - Male, X7(3,3) - Male, X8(9,9) - Male, X9(9,10) - Female, X10(9,5) - Female
Classify a test point X(8,7) using k (=3) nearest neighbor based classification. Use
Manhattan distance.
Answer: Male
e. A transaction database contains three transactions as follows: {<a1, …, a100>, <a1, …, a75>, <a1, …, a50>}. Min support = 2.
Write down the closed itemsets and the corresponding support counts. Write down the maximal itemsets.

Answer:
Closed itemsets:
<a1, …, a75>: 2
<a1, …, a50>: 3
Max itemsets: <a1, …, a75>: 2


f. Select all that are true.
i. Supervised discretization could be obtained by applying information gain based criteria
ii. Supervised discretization could be obtained by simply finding the pure intervals where the class labels are the same
iii. Discretization could be obtained by applying the Elbow method

2. Given the dataset below and a support threshold of 3 and confidence of 100%, generate all the association rules that describe weather conditions with play outcome. Find out the closed and the max patterns. (10)
We have solved exactly this in class.

3. Given to us are the following objects A-I. Run AGNES over them and compute the dendrogram (use single link as the inter-cluster distance). Show each step and the computed dissimilarity matrix. (10)
Object  Age  Test-1  Test-2  Standing   Gender  AP courses taken
A       20   P       P       Junior     M       Chem, Math
B       19   P       P       Sophomore  F       CS, Math
C       19   F       P       Freshman   F       English
D       18   F       F       Freshman   F
E       25   F       F       Senior     M       Math, Physics
F       24   F       F       Senior     M       Math
G       21   F       P       Junior     M       CS, Chem
H       21   P       F       Sophomore  M       Physics
I       20   F       P       Junior     F       English
We have done a similar problem before; this is the same AGNES procedure covered in class. A small sketch of the computation follows.
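A minimal Python sketch of the idea, using a simple-matching dissimilarity over the four categorical attributes (Test-1, Test-2, Standing, Gender) and SciPy's single-link clustering. The choice of dissimilarity is an assumption for illustration; the in-class solution may weight or include Age and the AP-course sets differently.

    import numpy as np
    from scipy.cluster.hierarchy import linkage
    from scipy.spatial.distance import squareform

    # Categorical attributes (Test-1, Test-2, Standing, Gender) per object.
    objs = {
        "A": ("P", "P", "Junior", "M"),    "B": ("P", "P", "Sophomore", "F"),
        "C": ("F", "P", "Freshman", "F"),  "D": ("F", "F", "Freshman", "F"),
        "E": ("F", "F", "Senior", "M"),    "F": ("F", "F", "Senior", "M"),
        "G": ("F", "P", "Junior", "M"),    "H": ("P", "F", "Sophomore", "M"),
        "I": ("F", "P", "Junior", "F"),
    }
    names = sorted(objs)
    n = len(names)

    # Simple-matching dissimilarity: fraction of attributes that differ.
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            a, b = objs[names[i]], objs[names[j]]
            D[i, j] = sum(x != y for x, y in zip(a, b)) / len(a)

    # Single-link AGNES; linkage takes the condensed distance matrix.
    Z = linkage(squareform(D), method="single")
    print(Z)  # each row: the two clusters merged, their distance, new size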
4. Given below is historical data that determines the play decision based on weather
parameters.

a. Classify X (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
using Naïve Bayes classifier. (5)
i. Solution on the next page
b. Compute the Naïve Bayes scores (rounded to two decimal places) produced by D1, D2, D3, D4, D5, considering their respective actual class to be their predicted class. Draw the ROC curve of D1-D5. (5)
i. Produce the scores of D1, D2, D3, D4, D5
ii. ROC curve as we have covered in class.
c. If the cost of a false positive (positive class is Play=Yes) is 9, and a false negative is 1, find the best threshold computed for the records considered in b. (5)
You can ignore this one. I won't ask this question as this was not covered.
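Although the weather table itself is not reproduced in this document, the scoring step for part (a) can be sketched in Python. The priors and conditional probabilities below are the ones from the classic PlayTennis dataset, which this question appears to follow; treat them as assumed placeholder values, not the official answer key.

    # Naive Bayes score: Score(c) = P(c) * product of P(x_i | c).
    # Probabilities are from the classic PlayTennis table and are
    # assumptions here, since the actual data table is not reproduced.
    p_yes, p_no = 9 / 14, 5 / 14
    cond_yes = [2 / 9, 3 / 9, 3 / 9, 3 / 9]  # P(Sunny|Yes), P(Cool|Yes), P(High|Yes), P(Strong|Yes)
    cond_no = [3 / 5, 1 / 5, 4 / 5, 3 / 5]   # the same likelihoods given No

    score_yes, score_no = p_yes, p_no
    for p in cond_yes:
        score_yes *= p
    for p in cond_no:
        score_no *= p

    print(round(score_yes, 4), round(score_no, 4))  # ~0.0053 vs ~0.0206
    print("Predicted:", "Yes" if score_yes > score_no else "No")  # No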

Solution

a. iii. A 6-sided fair die
In information theory, entropy is a measure of the amount of uncertainty in a random variable. For a uniform distribution over n equally likely outcomes, the entropy is log2(n) bits, the maximum possible for that number of outcomes. A fair coin therefore has 1 bit of entropy, a fair 4-sided die has 2 bits, and a fair 6-sided die has log2(6) ≈ 2.58 bits, which is the highest among the four options. An unfair coin has less than 1 bit, because some outcomes are more likely than others.
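These values are easy to verify directly in Python (the 0.9/0.1 unfair coin is just an assumed example; any biased coin gives less than 1 bit):

    import math

    def entropy(probs):
        # Shannon entropy in bits of a discrete distribution.
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(entropy([0.5, 0.5]))   # fair coin        -> 1.0
    print(entropy([0.9, 0.1]))   # an unfair coin   -> ~0.469
    print(entropy([0.25] * 4))   # fair 4-sided die -> 2.0
    print(entropy([1/6] * 6))    # fair 6-sided die -> ~2.585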
b.
The accuracy expression using the 0.632 bootstrap for the given scenario is:
0.632 × 0.4 + 0.368 × 0.6 = 0.4736
The 0.632 bootstrap is a method used to estimate the accuracy of a classifier. It involves repeatedly sampling the training data with replacement, building a classifier on each sample, and then averaging the accuracy estimates. The 0.632 factor comes from the fact that each bootstrap sample contains, on average, about 63.2% of the distinct instances in the original training set; the held-out instances serve as the test set. In the given scenario, the accuracy on the test data is 40% and that on the training data is close to 60%, so the combined 0.632 bootstrap estimate of the overall accuracy is 0.632 × 0.4 + 0.368 × 0.6 ≈ 0.47.
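Written out in Python, the arithmetic is a one-liner:

    # 0.632 bootstrap: combine out-of-sample (test) accuracy, weight 0.632,
    # with resubstitution (training) accuracy, weight 0.368.
    acc_test, acc_train = 0.4, 0.6
    acc_boot = 0.632 * acc_test + 0.368 * acc_train
    print(round(acc_boot, 4))  # 0.4736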
c.
The given data demonstrates a linear relationship between Math Score and Age of the test writers. The model equation for this relationship can be written as:
score = -0.8157 * Age + 79.56886
This equation represents a line with a slope of -0.8157 and an intercept of 79.56886. Given an age, the equation can be used to predict the corresponding math score. For example, if the age of a test writer is 20, their predicted math score would be -0.8157 * 20 + 79.56886 = 63.25486.
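In Python, the prediction looks like this; the commented polyfit line indicates how such coefficients would be obtained if the raw (Age, Score) pairs were available (the data itself is not reproduced in this copy):

    # Linear model from the solution: score = -0.8157 * Age + 79.56886.
    slope, intercept = -0.8157, 79.56886

    def predict_score(age):
        return slope * age + intercept

    print(round(predict_score(20), 5))  # 63.25486

    # With the raw data, the coefficients would come from least squares:
    # slope, intercept = numpy.polyfit(ages, scores, deg=1)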
d.
To classify the test point X(8,7) using k-nearest-neighbor classification with k=3 and Manhattan distance, we first calculate the distance between the test point and each of the 10 training points. The Manhattan distance between two points (x1, y1) and (x2, y2) is given by |x1-x2| + |y1-y2|. Using this formula, the distances between X(8,7) and each of the training points are:
X1(1,1): |1-8| + |1-7| = 13
X2(2,2): |2-8| + |2-7| = 11
X3(2,2.5): |2-8| + |2.5-7| = 10.5
X4(3,7): |3-8| + |7-7| = 5
X5(9,9): |9-8| + |9-7| = 3
X6(8,9): |8-8| + |9-7| = 2
X7(3,3): |3-8| + |3-7| = 9
X8(9,9): |9-8| + |9-7| = 3
X9(9,10): |9-8| + |10-7| = 4
X10(9,5): |9-8| + |5-7| = 3
Next, we sort the distances in ascending order and select the k=3 closest training points. The nearest neighbor is X6(8,9) at distance 2, and X5(9,9), X8(9,9), and X10(9,5) are tied at distance 3 for the remaining two slots. Whichever two of the tied points are chosen, at least two of the three nearest neighbors are labeled "Male" (X5, X6, and X8 are Male; X10 is Female), so the test point X(8,7) is classified as "Male" by 3-nearest-neighbor classification with Manhattan distance.
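A short Python sketch of the same computation. Note that plain sorting breaks the distance-3 tie by list order; a careful implementation would need an explicit tie-breaking rule, but here every choice yields the same majority label:

    # 3-nearest-neighbor classification with Manhattan distance.
    train = [
        ((1, 1), "Male"),   ((2, 2), "Male"),   ((2, 2.5), "Female"),
        ((3, 7), "Female"), ((9, 9), "Male"),   ((8, 9), "Male"),
        ((3, 3), "Male"),   ((9, 9), "Male"),   ((9, 10), "Female"),
        ((9, 5), "Female"),
    ]

    def manhattan(p, q):
        return abs(p[0] - q[0]) + abs(p[1] - q[1])

    def knn_classify(x, k=3):
        # Sort training points by distance to x and take the k nearest.
        nearest = sorted(train, key=lambda t: manhattan(t[0], x))[:k]
        labels = [label for _, label in nearest]
        # Majority vote over the k labels.
        return max(set(labels), key=labels.count)

    print(knn_classify((8, 7)))  # Male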
e.
In a transaction database, an itemset is a set of items that occur together in a transaction. The support of an itemset is the number of transactions in which the itemset appears. A closed itemset is an itemset such that no proper superset of the itemset has the same support. A maximal (max) itemset is a frequent itemset that is not a subset of any other frequent itemset.
Given the transaction database {<a1, …, a100>, <a1, …, a75>, <a1, …, a50>} and a minimum support of 2, the closed itemsets and the corresponding support counts are:
<a1, …, a75>: 2
<a1, …, a50>: 3
Both of these itemsets are closed because no proper superset of either has the same support: every proper superset of <a1, …, a75> appears in at most one transaction, and every proper superset of <a1, …, a50> has support at most 2. The maximal itemsets are:
<a1, …, a75>: 2
This is the only maximal itemset because it is frequent and has no frequent superset; <a1, …, a50> is frequent but is a subset of <a1, …, a75>, so it is not maximal.
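A small Python check of the closed/maximal definitions on this database. Because the transactions are nested prefixes, every frequent itemset has the same support as one of the transaction itemsets, so it suffices to examine those three (this shortcut is specific to this example):

    # Transactions of the reconstructed database, as frozensets of items.
    T1 = frozenset(f"a{i}" for i in range(1, 101))  # a1..a100
    T2 = frozenset(f"a{i}" for i in range(1, 76))   # a1..a75
    T3 = frozenset(f"a{i}" for i in range(1, 51))   # a1..a50
    db = [T1, T2, T3]
    min_sup = 2

    def support(itemset):
        return sum(itemset <= t for t in db)

    candidates = [s for s in (T1, T2, T3) if support(s) >= min_sup]
    closed = [s for s in candidates
              if not any(s < c and support(c) == support(s) for c in candidates)]
    maximal = [s for s in candidates if not any(s < c for c in candidates)]

    for s in closed:
        print("closed:", f"a1..a{len(s)}", "support =", support(s))
    for s in maximal:
        print("maximal:", f"a1..a{len(s)}", "support =", support(s))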
f.
The following statements are true:
i. Supervised discretization could be obtained by applying information gain based criteria.
ii. Supervised discretization could be obtained by simply finding the pure intervals where the class labels are the same.
iii. Discretization could be obtained by applying the Elbow method.
Discretization is the process of dividing a continuous variable into a set of discrete intervals or bins, which can simplify complex data and make it easier to analyze. There are several methods for performing it, including information gain based criteria, pure-interval finding, and the Elbow method. The first method uses information gain to find the cut points that are most useful for predicting the class label. The second method finds intervals in which all the observations have the same class label. The third uses the Elbow method to determine the optimal number of intervals, for example when binning by clustering. The first two are supervised because they use the class labels; a sketch of the pure-interval method follows.
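A minimal Python sketch of method (ii), pure-interval discretization, on made-up toy data: sort by the continuous attribute and cut wherever the class label changes, so that every resulting interval is pure.

    # Toy (value, class) pairs; the values are illustrative only.
    data = [(1.0, "No"), (1.5, "No"), (2.2, "Yes"), (3.1, "Yes"), (4.0, "No")]
    data.sort(key=lambda d: d[0])

    cuts = []
    for (v1, c1), (v2, c2) in zip(data, data[1:]):
        if c1 != c2:                    # class changes -> interval boundary
            cuts.append((v1 + v2) / 2)  # midpoint as the cut point

    print(cuts)  # [1.85, 3.55]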
2.
To generate all association rules that describe weather conditions with play outcome, we need
to first find all frequent itemsets in the dataset using the support threshold of 3. For example,
the itemset {Outlook=Sunny, Play=No} has a support of 3, since it appears in transactions 1, 2,
and 8. Similarly, the itemset {Outlook=Overcast, Play=Yes}...
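Since the full weather table is not reproduced in this copy, the rule-generation procedure can only be sketched on toy transactions; the brute-force enumeration below shows how rules with 100% confidence would be extracted once the frequent itemsets are found.

    from itertools import combinations

    # Toy transactions standing in for the weather/play table.
    db = [
        {"Outlook=Sunny", "Play=No"},
        {"Outlook=Sunny", "Play=No"},
        {"Outlook=Overcast", "Play=Yes"},
        {"Outlook=Overcast", "Play=Yes"},
        {"Outlook=Overcast", "Play=Yes"},
        {"Outlook=Sunny", "Play=No"},
    ]
    min_sup, min_conf = 3, 1.0

    def support(itemset):
        return sum(itemset <= t for t in db)

    items = sorted(set().union(*db))
    frequent = [frozenset(c)
                for n in range(1, len(items) + 1)
                for c in combinations(items, n)
                if support(frozenset(c)) >= min_sup]

    # Rule A -> B with confidence = support(A u B) / support(A).
    for s in frequent:
        for n in range(1, len(s)):
            for lhs in map(frozenset, combinations(s, n)):
                rhs = s - lhs
                if support(s) / support(lhs) >= min_conf:
                    print(set(lhs), "->", set(rhs))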