Practical Data Science with Python COSC 2670/2738 Assignment 1 (Part 2) Assessment Type Individual Due...

Question

Practical Data Science with Python
COSC 2670/2738
Assignment 1 (Part 2)

Assessment Type: Individual
Due Date: 23:59, the 3rd of May, 2020
Marks: 20

Please read all the following information before attempting your assignment. This is an individual assignment. You may not collude with any other people, or plagiarise their work. Each student is expected to present the results of his/her own thinking and writing. Never copy another student's work (even if they “explain it to you first”) and never give your written work to others. Keep any conversation high-level and never show your solution to others. Never copy from the Web or any other resource. Remember you are meant to generate the solution to the questions by yourself. Suspected collusion or plagiarism will be dealt with according to RMIT policy.

In the submission (your PDF file) you will be required to certify that the submitted solution represents your own work only by agreeing to the following statement:

I certify that this is all my own original work. If I took any parts from elsewhere, then they were non-essential parts of the assignment, and they are clearly attributed in my submission. I will show I agree to this honor code by typing “Yes”:

A sample format for this requirement is provided; please find it in Canvas -> Assignments -> Assignment1Part2.

Tasks

This is Part 2 of Assignment 1, and it includes two tasks. It is independent of your Assignment 1, so your current Assignment 1 will not affect this Part 2.

Task 1: An oral presentation of the work in Assignment 1 (10%)

The presentation should briefly describe:
• How to prepare the data?
• How to explore the data?
• What are the results from your analysis?

The presentation should be a maximum of 10 minutes. Your presentation slides should be:
• Microsoft PowerPoint slides (with audio inserted for each slide by using: Insert -> Audio -> Record Audio),
• or you can create your own presentation slides (e.g. a PDF version); in that case please submit your own recording of your presentation as well.

Task 2: Short answer question (10%)

The questions in the survey can be divided into two parts:
• one is about people's attitude or opinion about Star Wars movies, including:
– Have you seen any of the 6 films in the Star Wars franchise?
– Do you consider yourself to be a fan of the Star Wars film franchise?
– Which of the following Star Wars films have you seen? Please select all that apply. (Star Wars: Episode I The Phantom Menace; Star Wars: Episode II Attack of the Clones; Star Wars: Episode III Revenge of the Sith; Star Wars: Episode IV A New Hope; Star Wars: Episode V The Empire Strikes Back; Star Wars: Episode VI Return of the Jedi)
– Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. (Star Wars: Episode I The Phantom Menace; Star Wars: Episode II Attack of the Clones; Star Wars: Episode III Revenge of the Sith; Star Wars: Episode IV A New Hope; Star Wars: Episode V The Empire Strikes Back; Star Wars: Episode VI Return of the Jedi)
– Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her. (Han Solo, Luke Skywalker, Princess Leia Organa, Anakin Skywalker, Obi Wan Kenobi, Emperor Palpatine, Darth Vader, Lando Calrissian, Boba Fett, C-3P0, R2-D2, Jar Jar Binks, Padme Amidala, Yoda)
– Which character shot first?
– Are you familiar with the Expanded Universe?
– Do you consider yourself to be a fan of the Expanded Universe?
– Do you consider yourself to be a fan of the Star Trek franchise?
• the other is about people's demographics, including:
– Gender
– Age
– Household Income
– Education
– Location (Census Region)

We would like to build a classifier (or some classifiers, for example one classifier per demographic feature), which can classify people's demographics (gender, age, household income, education, location (census region)) based on their attitude or opinion about Star Wars movies. Please describe how to build this classifier (or these classifiers) by using the data collected in the survey (the data provided in Assignment 1).

Please note that this is a short-answer question, and no coding work is required. Your submission must be a PDF document, and must be at most 6 pages (in single column format, including figures and references) with a font size between 10 and 12 points.
Penalties will apply if the report does not satisfy the requirement.

What to Submit, When, and How

The assignment is due at 23:59, the 3rd of May, 2020.
Assignments submitted after this time will be subject to standard late submission penalties.

You need to submit the following files:
• Your presentation slides and the oral audio presentation as required in Task 1.
• Your Assignment1 Part2.pdf file, which includes your answers to Task 2.

They must be submitted as ONE single zip file, named as your student number (for example, XXXXXXXXXX.zip if your student ID is sXXXXXXXXXX).
The zip file must be submitted in Canvas: Assignments/Assignment 1 (Part 2).
Please do NOT submit other unnecessary files.

In [1]:
    # Task 1: Data Preparation
    # "You will start by loading the CSV data from the file (using appropriate pandas functions) and checking whether the loaded data is equivalent to the data in the source CSV file.
    # Then, you need to clean the data by using the knowledge we taught in the lectures. You need to deal with all the potential issues/errors in the data appropriately (such as: typos, extra whitespaces, sanity checks for impossible values, and missing values etc)."
    #
    # Please structure code as follows:
    # always provide one line of comments to explain the purpose of the code, e.g. load the data, checking the equivalence to the original data, checking typos (do this for each other type of errors)
    #
    # Code goes after this line by adding cells

In [12]:
    # Read the CSV file using pandas
    import pandas as pd
    starwars = pd.read_csv("starwars.csv", encoding="ISO-8859-1")

In [13]:
    # Print the shape of the data and list the column names
    print(starwars.shape)
    starwars.columns

    (1187, 38)

Out[13]:
    Index(['RespondentID',
           'Have you seen any of the 6 films in the Star Wars franchise?',
           'Do you consider yourself to be a fan of the Star Wars film franchise?',
           'Which of the following Star Wars films have you seen? Please select all that apply.',
           'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8',
           'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.',
           'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',
           'Unnamed: 14',
           'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.',
           'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
           'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
           'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',
           'Unnamed: 28', 'Which character shot first?',
           'Are you familiar with the Expanded Universe?',
           'Do you consider yourself to be a fan of the Expanded Universe?æ',
           'Do you consider yourself to be a fan of the Star Trek franchise?',
           'Gender', 'Age', 'Household Income', 'Education',
           'Location (Census Region)'],
          dtype='object')

In [14]:
    # Keep only the rows whose RespondentID is not null
    starwars = starwars[starwars['RespondentID'].notnull()]

Pushpendra · Accepted Answer

Project Task-2: Classification of Movie Data Analysis
 
Feature Engineering:  
 
Feature engineering is the process of using domain knowledge to extract features from raw 
data via data mining techniques. These features can be used to improve the performance 
of machine learning algorithms. 
 Feature selection is the related process of manually or automatically selecting those 
features in the data that contribute most to the prediction variable or output. Having 
irrelevant features in your data can decrease the accuracy of many models. 
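
As a rough illustration of turning survey answers into model features, the sketch below one-hot encodes a few categorical answers with pandas. The tiny DataFrame and its shortened column names (`seen_any`, `fan`, `shot_first`) are made up for illustration; they are not the real column headers of the survey file, and treating a missing answer as its own category is only one possible choice.

```python
import pandas as pd

# Hypothetical miniature of the survey: three opinion questions, four respondents.
answers = pd.DataFrame({
    "seen_any": ["Yes", "No", "Yes", "Yes"],
    "fan": ["Yes", None, "No", "Yes"],
    "shot_first": ["Han", None, "Greedo", "I don't understand this question"],
})

# Assumption: keep a missing answer as its own category instead of dropping the respondent.
answers = answers.fillna("No answer")

# One-hot encode: each (question, answer) pair becomes a 0/1 column.
features = pd.get_dummies(answers, dtype=int)

print(features.shape)
print(list(features.columns))
```

The resulting 0/1 columns are non-negative, which matters later because the chi-squared test only accepts non-negative features.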
   
 Feature selection techniques in Python with the scikit-learn library:  
 
1) Remove the features which have low variance. This can be done by setting a threshold 
value with VarianceThreshold in the sklearn library. 
 
2) Remove the features which are highly correlated with each other. The correlation can be 
positive or negative. 
 
3) Univariate Feature Selection (ANOVA or chi-squared):  
 
o Statistical tests can be used to select those features that have the strongest 
relationship with the output variable.  
o Use the chi-squared (chi2) statistical test for non-negative features to select the 
best features from the dataset. The ANOVA F-test (f_classif) is the alternative for 
numeric features. 
 
4) Recursive Feature Elimination:  
o Recursive Feature Elimination (RFE) works by recursively removing attributes and 
building a model on those attributes that remain. In scikit-learn it ranks attributes by the 
fitted model's coefficients or feature importances to identify which attributes (and 
combinations of attributes) contribute the most to predicting the target attribute.
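
The four techniques above can be sketched with scikit-learn roughly as follows. The data here is synthetic (a stand-in for encoded 0/1 survey answers), and the 0.9 correlation cut-off, `k=2`, and the logistic regression estimator are illustrative assumptions rather than recommended settings.

```python
import numpy as np
from sklearn.feature_selection import RFE, SelectKBest, VarianceThreshold, chi2
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for encoded survey answers; every value is 0 or 1.
y = rng.integers(0, 2, size=n)                            # hypothetical binary target
informative = np.where(rng.random(n) < 0.1, 1 - y, y)     # agrees with y about 90% of the time
noise = rng.integers(0, 2, size=(n, 3))                   # unrelated to y
duplicate = noise[:, 0]                                   # perfectly correlated with a noise column
constant = np.ones(n, dtype=int)                          # zero variance
X = np.column_stack([informative, noise, duplicate, constant])

# 1) Low variance: drop features whose variance does not exceed the threshold.
vt = VarianceThreshold(threshold=0.0)
X_var = vt.fit_transform(X)

# 2) High correlation: drop a column if it is strongly correlated with an earlier one.
corr = np.abs(np.corrcoef(X_var, rowvar=False))
upper = np.triu(corr, k=1)                 # count each pair of columns once
keep = ~(upper > 0.9).any(axis=0)
X_uncorr = X_var[:, keep]

# 3) Univariate selection: keep the k features with the highest chi-squared score.
kbest = SelectKBest(score_func=chi2, k=2)
X_chi = kbest.fit_transform(X_uncorr, y)

# 4) Recursive feature elimination around a logistic regression model.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X_uncorr, y)

print(X.shape, X_var.shape, X_uncorr.shape, X_chi.shape)
print("chi2 keeps columns:", kbest.get_support(indices=True))
print("RFE keeps columns:", rfe.get_support(indices=True))
```

In practice one would usually pick one or two of these steps rather than chaining all four; they are shown together only to keep the example compact.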
Training Model:  
The simplest approach is to fit the machine learning algorithm on the training dataset and use 
predictions from this same dataset to evaluate performance. This tends to overestimate 
performance, so in practice part of the data is held out for evaluation. 
Split into Train and Test Sets:  
o Large amounts of data and complex models can require very long training times, so it is 
typical to use a simple separation of the data into training and test datasets, or 
training and validation datasets, using the Python scikit-learn machine learning library. 
o Use 70% of the data for training and the remaining 30% for validation. Some libraries 
(Keras, for example) accept the validation dataset in the fit() function; in scikit-learn 
the held-out data is scored separately after fitting. 
o The key terms to understand are:  
1) Training Dataset 
2) Validation Dataset 
3) Test Dataset 
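
A minimal sketch of the 70/30 split described above, using scikit-learn's `train_test_split`. The feature matrix and target are random placeholders, and the decision tree is just one possible classifier; neither comes from the survey data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(300, 6))   # placeholder 0/1 opinion features
y = X[:, 0] | X[:, 1]                   # placeholder target derived from two features

# Hold out 30% of the rows; stratify keeps the class balance similar in both parts.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

# Accuracy on the training rows versus accuracy on the held-out rows.
train_acc = model.score(X_train, y_train)
val_acc = model.score(X_val, y_val)
print(len(X_train), len(X_val))
print(f"train accuracy={train_acc:.2f}, validation accuracy={val_acc:.2f}")
```

A large gap between the two accuracies would suggest the model is overfitting the training rows.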
Training Dataset: 
  
The sample of data used to fit the model. This is the actual dataset used to train the model; 
the model sees and learns from this data. 
 
Validation Dataset:  
 
The sample of data used to provide an unbiased evaluation of a model fit on the training dataset 
while tuning model hyperparameters. The evaluation becomes more biased as skill on the 
validation dataset is incorporated into the model configuration. 
 
The validation set is used to evaluate a given model frequently during development. 
 
This data is used to fine-tune the model hyperparameters, so the model occasionally sees this 
data but never “learns” from it. 
 
The validation set results are used to update higher-level hyperparameters, so the validation 
set affects the model, but only indirectly.
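
The role of the validation set can be sketched as below: a hyperparameter is chosen by validation accuracy, and the test set is touched only once at the end. The 60/20/20 proportions, the candidate depths, and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.integers(0, 2, size=(500, 8))                        # placeholder 0/1 features
y = np.where(rng.random(500) < 0.1, 1 - X[:, 0], X[:, 0])    # noisy placeholder target

# Split once into train (60%) and the rest, then split the rest into validation and test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0
)

# Tune the tree depth: fit on the training set, compare on the validation set.
best_depth, best_val_acc = None, -1.0
for depth in [1, 2, 4, 8]:
    candidate = DecisionTreeClassifier(max_depth=depth, random_state=0)
    candidate.fit(X_train, y_train)
    val_acc = candidate.score(X_val, y_val)
    if val_acc > best_val_acc:
        best_depth, best_val_acc = depth, val_acc

# The test set is used a single time, after the hyperparameter has been fixed.
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
final_model.fit(X_train, y_train)
test_acc = final_model.score(X_test, y_test)
print(f"chosen depth={best_depth}, validation={best_val_acc:.2f}, test={test_acc:.2f}")
```

The model never trains on the validation rows, yet they still influence it through the chosen depth, which is the indirect effect described above.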
