CSI 5810 (Assignment #1)
1. In this exercise, you will work with the Census Income Data Set, which you can download
from the following link:
https://archive.ics.uci.edu/ml/datasets/Census+Income
Once you have downloaded the data, prepare a data visualization
report. Feel free to include any additional visualizations that help in
understanding the data. Write a paragraph describing the characteristics of the
data that the visualizations reveal.
2. This exercise is designed to familiarize you with generating data from a
multivariate normal distribution and working with the generated data.
a. Generate 300 three-dimensional vectors from a normal
distribution with mean vector [1 2 1]^T and 3x3 covariance matrix
[5 0.8 -0.3; 0.8 3 0.6; -0.3 0.6 4]
b. Make scatter plots of x1 vs x2, x1 vs x3, and x2 vs x3. Explain any
relationships you observe in these plots.
c. Calculate the mean vector and the covariance matrix using the 300
generated points.
d. Pick any 5 pairs of generated vectors and calculate the Euclidean and
the Mahalanobis distances between the two vectors in each pair.
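The steps of Exercise 2 can be sketched in NumPy as follows. This is only one
possible approach, not a required implementation. The fixed random seed, the
helper names euclidean and mahalanobis, the choice of index pairs, and the use
of the sample covariance (rather than the given covariance matrix) in the
Mahalanobis distance are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for repeatability (an assumption)

mean = np.array([1.0, 2.0, 1.0])
cov = np.array([[5.0, 0.8, -0.3],
                [0.8, 3.0, 0.6],
                [-0.3, 0.6, 4.0]])

# 300 samples, one three-dimensional vector per row
X = rng.multivariate_normal(mean, cov, size=300)

# Estimates computed from the generated points
sample_mean = X.mean(axis=0)
sample_cov = np.cov(X, rowvar=False)  # unbiased estimate (divides by n - 1)


def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return float(np.linalg.norm(u - v))


def mahalanobis(u, v, cov_matrix):
    """Mahalanobis distance between two vectors for a given covariance matrix."""
    d = u - v
    # Solve cov_matrix @ z = d instead of forming the explicit inverse
    return float(np.sqrt(d @ np.linalg.solve(cov_matrix, d)))


# Five arbitrary pairs of row indices (hypothetical choice)
pairs = [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]
for i, j in pairs:
    print(i, j, euclidean(X[i], X[j]), mahalanobis(X[i], X[j], sample_cov))
```

The columns X[:, 0], X[:, 1], and X[:, 2] correspond to x1, x2, and x3 and can
be passed pairwise to any scatter plot routine for part b.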
3. You will perform this exercise using the PCA-Exercise data posted on the course
page.
Suppose we are interested in reducing the six-dimensional records to two
dimensions by means of principal component analysis. List the eigenvalues and
eigenvectors obtained via PCA. Determine the reduced representation for all of the
records and plot the reduced representation in the form of a scatter plot.
Reconstruct the original data from the reduced representation and compute the
reconstruction error.
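The PCA workflow in Exercise 3 might look like the sketch below. The
PCA-Exercise data is not reproduced here, so the array X is a synthetic
six-dimensional stand-in with an arbitrary number of rows. The mean squared
reconstruction error per record is one common definition of the error and is
an assumption, since the exercise does not fix one.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for the six-dimensional records (20 rows, 6 columns)
X = rng.normal(size=(20, 6)) @ rng.normal(size=(6, 6))

# Center the data and form the sample covariance matrix
mu = X.mean(axis=0)
Xc = X - mu
C = np.cov(Xc, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]          # reorder to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
W = eigvecs[:, :k]        # top-k eigenvectors as columns
Z = Xc @ W                # reduced representation, shape (n, 2)
X_hat = Z @ W.T + mu      # reconstruction in the original six dimensions

# Mean squared reconstruction error per record (one possible definition)
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(eigvals)
print(mse)
```

A useful sanity check is that the total squared reconstruction error divided
by n - 1 equals the sum of the discarded eigenvalues, here eigvals[2:].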
4. In this exercise, you will apply PCA to the Spoken Arabic Digit Dataset at the following
link:
https://archive.ics.uci.edu/ml/datasets/Spoken+Arabic+Digit
Use stratified sampling to select only 100 vectors per class, and reduce the training
data to two dimensions (the class labels are not used in PCA). List all eigenvalues and
make a scatter plot of the transformed data. Show the transformed data points for any digit
pair of your choice in different colors or shapes.
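A minimal sketch of the sampling and projection steps in Exercise 4 is given
below, assuming the data has already been loaded into a feature array and a
label array. Loading the actual files is omitted. The names features, labels,
and stratified_sample are hypothetical, the data is synthetic, and the column
count of 13 is only a placeholder for the real dimensionality.

```python
import numpy as np


def stratified_sample(labels, per_class, rng):
    """Return row indices with `per_class` randomly chosen rows from each class."""
    chosen = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        chosen.append(rng.choice(idx, size=per_class, replace=False))
    return np.concatenate(chosen)


rng = np.random.default_rng(0)
# Synthetic stand-in: 10 classes with 500 vectors each (placeholder sizes)
labels = np.repeat(np.arange(10), 500)
features = rng.normal(size=(labels.size, 13)) + labels[:, None] * 0.3

idx = stratified_sample(labels, per_class=100, rng=rng)
X, y = features[idx], labels[idx]

# PCA on the sampled vectors; the labels y are not used here
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
Z = Xc @ eigvecs[:, :2]   # two-dimensional representation

# Rows belonging to one digit pair (3 and 7 are an arbitrary choice),
# e.g. to draw them in different colors or shapes in a scatter plot
pair_mask = np.isin(y, [3, 7])
Z_pair, y_pair = Z[pair_mask], y[pair_mask]
```

The labels are used only to draw the sample and to color the plot, which keeps
the projection itself unsupervised.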
5. Repeat Exercise #4 using the t-SNE visualization method to visualize the entire
training data set. Comment on the results obtained.
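For Exercise 5, one option is the TSNE class from scikit-learn, sketched below
on a small synthetic array. The use of scikit-learn, the perplexity value, the
PCA initialization, and the fixed seed are all assumptions, not requirements
of the exercise, and the data here is a placeholder rather than the real
training set.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Synthetic stand-in for the training vectors: 3 classes with 40 vectors each
y = np.repeat(np.arange(3), 40)
X = rng.normal(size=(y.size, 13)) + y[:, None] * 4.0

# perplexity must be smaller than the number of samples
tsne = TSNE(n_components=2, perplexity=15, init="pca", random_state=0)
Z = tsne.fit_transform(X)   # embedded points, shape (120, 2)
print(Z[:5])
```

The embedding can change noticeably with the perplexity and the random seed,
so it may be worth comparing a few settings before commenting on the results.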
Note: The submission should be in the form of a single PDF document.
Submission in any other format will not be graded.