Homework Assignment #5 (Individual)
Using SVMs and PCA with new data: The Palmer Penguins Dataset
Put your name here.
Put your _GitHub username_ here.
Goals for this homework assignment
By the end of this assignment, you should be able to:
Use git to track your work and turn in your assignment
Read in data and prepare it for modeling
Build, fit, and evaluate an SVC model of data
Use PCA to reduce the number of important features
Build, fit, and evaluate an SVC model of PCA-transformed data
Systematically investigate the effects of the number of PCA components on an SVC model of data
Assignment instructions:
Work through the following assignment, making sure to follow all of the directions and answer all of the
questions.
There are 47 points (+2 bonus points) possible on this assignment. Point values for each part are included in the
section headers.
This assignment is due at 11:59 pm on Friday, December 3. It should be pushed to your repo (see Part 1) and
submitted to D2L.
Imports
It's useful to put all of the imports you need for this assignment in one place. Read through the assignment to
figure out which imports you'll need or add them here as you go.
In [ ]:
# Put all necessary imports here
1. Add to your Git repository to track your progress on your assignment (4 points)
As usual, you're going to add this assignment to the cmse202-f21-turnin repository you created in
class so that you can track your progress on the assignment and preserve the final version that you turn in.
Do the following:
1. Navigate to your cmse202-f21-turnin repository and create a new directory called hw-05.
2. Move this notebook into that new directory in your repository, then add it and commit it to your repository.
3. Finally, to test that everything is working, "git push" the file so that it ends up in your GitHub repository.
Important: Make sure you've added your Professor and your TA as collaborators to your "turnin" repository
with "Read" access so that we can see your assignment (you should have done this in the previous homework
assignment).
Also important: Make sure that the version of this notebook that you are working on is the same one that you
just added to your repository! If you are working on a different copy of the notebook, none of your changes will
be tracked!
If everything went as intended, the file should now show up on your GitHub account in the "cmse202-f21-turnin"
repository inside the hw-05 directory that you just created. Periodically, you'll be asked to commit
your changes to the repository and push them to the remote GitHub location. Of course, you can always commit
your changes more often than that, if you wish. It can be good to get into a habit of committing your changes
any time you make a significant modification, or when you stop working on the project for a bit.
Do this: Before you move on, put the command that your instructor should run to clone your repository in the
markdown cell below.
# Put the command for cloning your repository here!
2. Loading a new dataset: The Palmer Penguins data (8 points)
We've seen the iris dataset a number of times in the course so far, and it has a number of nice features that
make it useful for getting some practice with some of the machine learning methods that are around today.
However, a new dataset was recently suggested as a possible replacement/alternative for the iris data: the
"Palmer Penguins" -- perhaps you've already seen it before! This dataset also has some nice properties that
make it a good playground for experimenting with machine learning tools. You can learn more about the dataset on
their website (https://allisonhorst.github.io/palmerpenguins).
Since the goal for this assignment is to practice using the SVM and PCA tools we've covered in class, we're going
to use this relatively simple dataset and avoid any complicated data wrangling headaches!
The data
The penguins dataset is pretty straightforward, but you'll need to download the data and give yourself some
time to get familiar with it.
Do This: To get started, you'll need to download the following file:
https://raw.githubusercontent.com/msu-cmse-courses/cmse202-F21-data/main/data/penguins_size.csv
Once you've downloaded the data, open the file using a text browser or other tool on your computer and take a
look at the data to get a sense for the information it contains. You'll probably also want to read through the
information on the palmerpenguins website to get a sense for what the values correspond to. The website talks
about two different versions of the data, a simplified one and a "raw" one with more values. Which one are you
working with?
2.1 Load the data
Task XXXXXXXXXXpoints): Read the penguins_size.csv file into your notebook. For the purposes of this
assignment, we're going to use "species" as the class that we'll be trying to predict with our classification
model. To make this clear, you should rename the species column to be class. The species column should
currently have the following class labels:
"Adelie"
"Chinstrap"
"Gentoo"
Once you've loaded in the data and changed the species column to class, display the DataFrame to make
sure it looks reasonable. You should have 7 columns and 344 rows.
In [ ]:
# Put your code here
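One possible way to approach the load-and-rename step is sketched below. To keep the sketch self-contained it reads a tiny inline stand-in instead of the downloaded file. The column names other than species are an assumption about what penguins_size.csv contains, and the four rows are made up for illustration.

```python
import io
import pandas as pd

# Tiny stand-in for penguins_size.csv. Column names other than "species"
# are assumed; the real file should have 344 rows.
csv_text = """species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
Adelie,Torgersen,39.1,18.7,181,3750,MALE
Adelie,Torgersen,,,,,
Chinstrap,Dream,46.5,17.9,192,3500,FEMALE
Gentoo,Biscoe,46.1,13.2,211,4500,FEMALE
"""

# With the downloaded file, pass its path instead: pd.read_csv("penguins_size.csv")
penguins = pd.read_csv(io.StringIO(csv_text))

# rename() returns a new DataFrame, so assign the result back.
penguins = penguins.rename(columns={"species": "class"})

print(penguins.shape)
print(penguins.head())
```

Checking the shape right after loading is a cheap way to confirm that the row and column counts match what the assignment expects.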
2.2 Relabeling the classes
To simplify the process of modeling the penguin data, we should convert the class labels from strings to
integers. For example, rather than Adelie, we can consider this to be class "0".
Task XXXXXXXXXXpoints): Replace all of the strings in your "class" column with integers based on the following:
original label    replaced label
Adelie            0
Chinstrap         1
Gentoo            2
Once you've replaced the labels, display your DataFrame and confirm that it looks correct.
In [ ]:
# Put your code here
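The relabeling in the table above could be done with a dictionary and Series.map, as in this minimal sketch. The four-row DataFrame is a made-up stand-in for the real data, and the variable names are illustrative only.

```python
import pandas as pd

# Toy frame standing in for the renamed penguin data.
penguins = pd.DataFrame({"class": ["Adelie", "Chinstrap", "Gentoo", "Adelie"]})

# Mapping taken from the table in the task description.
label_map = {"Adelie": 0, "Chinstrap": 1, "Gentoo": 2}

# Series.map yields NaN for any label missing from the dictionary,
# which makes a typo in the mapping easy to spot.
penguins["class"] = penguins["class"].map(label_map)

print(penguins["class"].tolist())
```

DataFrame.replace with the same dictionary is an alternative; map is used here because unmapped labels show up as NaN rather than passing through silently.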
2.3 Removing rows with missing data
At this point, you've hopefully noticed that some of the rows seem to be missing data values, as indicated by the
existence of NaN values. Since we don't necessarily know what to replace these values with, let's just play it
safe and remove all of the rows that have NaN in any of the column entries. This should help to ensure that we
don't end up with errors or confusing results when we try to classify the data.
Task XXXXXXXXXXpoint): Remove all of the rows that contain a NaN in any column. Make sure you actually store this
new version of your dataframe either in the original variable name or in a new variable name. If everything went
as intended, you should find that you have 334 rows left over.
In [ ]:
# Put your code here
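A minimal sketch of the row-removal step, using a made-up four-row DataFrame (the column names are assumptions). It illustrates why the result has to be stored: dropna returns a new DataFrame and leaves the original unchanged.

```python
import numpy as np
import pandas as pd

# Toy data with missing values in two of the four rows.
penguins = pd.DataFrame({
    "class": [0, 0, 1, 2],
    "culmen_length_mm": [39.1, np.nan, 46.5, 46.1],
    "sex": ["MALE", None, "FEMALE", None],
})

# dropna() drops every row with at least one missing value.
# It does not modify `penguins` in place, so keep the returned frame.
penguins_clean = penguins.dropna().reset_index(drop=True)

print(len(penguins), "->", len(penguins_clean))
```

Resetting the index is optional; it just avoids gaps in the row labels after rows are removed.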
2.4 Separating the "features" from the "labels"
As we've seen when working with sklearn, it can be much easier to work with the data if we have separate
variables that store the features and the labels.
Task XXXXXXXXXXpoint): Split your DataFrame so that you have two separate DataFrames, one called features,
which contains all of the penguin features, and one called labels, which contains all of the new penguin
integer labels you just created.
In [ ]:
# Put your code here
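One way to separate the two pieces is sketched here on a made-up three-row DataFrame with assumed column names. Double brackets keep labels as a DataFrame, as the task asks; single brackets would give a Series instead.

```python
import pandas as pd

# Toy frame standing in for the cleaned penguin data.
penguins = pd.DataFrame({
    "class": [0, 1, 2],
    "island": ["Torgersen", "Dream", "Biscoe"],
    "flipper_length_mm": [181.0, 192.0, 211.0],
})

# Everything except the target column becomes the feature set.
features = penguins.drop(columns=["class"])

# Double brackets return a one-column DataFrame rather than a Series.
labels = penguins[["class"]]

print(features.columns.tolist())
print(labels.columns.tolist())
```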
Question XXXXXXXXXXpoint): How balanced is your set of penguin classes? Does it matter for the set of classes to be
balanced? Why or why not? (You might need to write a bit of code to figure out how balanced your set of
penguin classes is.)
✎ Erase this and put your answer here.
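A short sketch of how class balance might be checked with value_counts. The label values below are invented for illustration and do not reflect the real penguin counts.

```python
import pandas as pd

# Made-up integer labels; the real counts will differ.
labels = pd.Series([0, 0, 0, 1, 2, 2], name="class")

# Absolute number of rows per class, ordered by class label.
counts = labels.value_counts().sort_index()

# Fraction of the dataset belonging to each class.
fractions = labels.value_counts(normalize=True).sort_index()

print(counts)
print(fractions)
```

If one class makes up a much smaller fraction than the others, overall accuracy can look good even when that class is predicted poorly, so per-class metrics are worth checking.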
2.5 Dropping the non-numeric features
The last thing we should probably do before you move on to building your classifier model is to drop the two
categorical (i.e. non-numeric) features from our set of features to avoid confusing or complicating the model.
Task XXXXXXXXXXpoint): Drop the two non-numeric columns from your new features dataframe. You should end
up with your final four features, which should all have floating point values. Display your new features
dataframe to make sure this is true.
In [ ]:
# Put your code here
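The non-numeric columns could be dropped by name, or selected away by dtype as in the sketch below. The column names and values are assumptions used only for illustration.

```python
import pandas as pd

# Toy feature frame with two categorical and two numeric columns.
features = pd.DataFrame({
    "island": ["Torgersen", "Dream", "Biscoe"],
    "culmen_length_mm": [39.1, 46.5, 46.1],
    "body_mass_g": [3750.0, 3500.0, 4500.0],
    "sex": ["MALE", "FEMALE", "FEMALE"],
})

# Keep only numeric columns; string columns such as island and sex are excluded.
numeric_features = features.select_dtypes(include="number")

print(numeric_features.dtypes)
```

Dropping by name with features.drop(columns=[...]) is equally valid and makes the removed columns explicit; selecting by dtype avoids hard-coding the names.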
STOP
Pause to commit your changes to your Git repository!
Take a moment to save your notebook, commit the changes to your Git repository using the commit message
"Committing Part 2", and push the changes to GitHub.
3. Building an SVC model (4 points)
Now, to tackle this classification problem, we will use a support vector machine just like we've done previously
(e.g. in the Day 19 and Day 20 assignments). Of course, we could easily replace this with any sklearn classifier
we choose, but for now we will just use an SVC with a linear kernel.
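For orientation, here is a minimal sketch of constructing and fitting an sklearn SVC with a linear kernel. The six points are made up and well separated; they are not penguin data, and the default regularization settings are an assumption rather than something the assignment specifies.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, well-separated 2D points in three classes (not penguin data).
X = np.array([
    [0.0, 0.0], [0.2, 0.1],
    [5.0, 5.0], [5.1, 4.9],
    [10.0, 0.0], [10.2, 0.1],
])
y = np.array([0, 0, 1, 1, 2, 2])

# kernel="linear" restricts the decision boundaries to straight lines (hyperplanes).
model = SVC(kernel="linear")
model.fit(X, y)

# Predict the class of one new point near each cluster.
predictions = model.predict([[0.1, 0.0], [5.0, 5.1], [9.9, 0.2]])
print(predictions)
```

Because every sklearn classifier shares the same fit/predict interface, swapping in a different model later only changes the line that constructs it.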
3.1 Splitting the data
But first, we need to split our data into training and testing data!
Task XXXXXXXXXXpoint): Split your data into a training and testing set with a training set representing 75% of your
data. For reproducibility, set the random_state argument to XXXXXXXXXX. Print the lengths to show you have the
right number of entries.
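A sketch of a 75/25 split with sklearn's train_test_split on made-up arrays. The SEED value is a hypothetical placeholder, since the value the assignment asks for is not visible in this copy; substitute the one given in the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 samples with 2 features each, and alternating labels.
features = np.arange(40, dtype=float).reshape(20, 2)
labels = np.array([0, 1] * 10)

# Hypothetical placeholder; use the random_state value given in the assignment.
SEED = 0

# train_size=0.75 sends 75% of the rows to the training set.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, train_size=0.75, random_state=SEED
)

print(len(X_train), len(X_test))
```

Fixing random_state makes the shuffle repeatable, so the same rows land in the training and testing sets on every run.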
Answered 5 days After Nov 22, 2021

Solution

Sathishkumar answered on Nov 27 2021