

Practical Data Science with Python
COSC 2670/2738
Assignment 1 (Part 2)
Assessment Type Individual
Due Date 23:59, the 3rd of May, 2020
Marks 20
Please read all the following information before attempting your assignment.
This is an individual assignment. You may not collude with any other people,
or plagiarise their work. Each student is expected to present the results of his/her own
thinking and writing. Never copy another student’s work (even if they “explain it to you
first”) and never give your written work to others. Keep any conversation high-level and
never show your solution to others. Never copy from the Web or any other resource.
Remember you are meant to generate the solution to the questions by yourself. Suspected
collusion or plagiarism will be dealt with according to RMIT policy.
In the submission (your PDF file) you will be required to certify that the submitted
solution represents your own work only by agreeing to the following statement:
I certify that this is all my own original work. If I took any parts from
elsewhere, then they were non-essential parts of the assignment, and they
are clearly attributed in my submission. I will show that I agree to this
honor code by typing “Yes”:
A sample format for this requirement is provided; please find it in Canvas ->
Assignments -> Assignment1Part2.
Tasks
This is Part 2 of Assignment 1, and it includes two tasks. It is independent of your
Assignment 1 submission, so your current Assignment 1 will not affect Part 2.
Task 1: An oral presentation of the work in Assignment 1 (10%)
The presentation should briefly describe:
• How to prepare the data?
• How to explore the data?
• What are the results from your analysis?
The presentation should be a maximum of 10 minutes. Your presentation slides should
be either:
• Microsoft PowerPoint slides (with audio inserted for each slide by using: Insert
-> Audio -> Record Audio), or
• your own presentation slides (e.g. a PDF version), submitted together with
your own recording of your presentation.
Task 2: Short answer question (10%)
The questions in the survey can be divided into two parts:
• one is about people’s attitude or opinion about Star Wars movies, including:
– Have you seen any of the 6 films in the Star Wars franchise?
– Do you consider yourself to be a fan of the Star Wars film franchise?
– Which of the following Star Wars films have you seen? Please select all that
apply. (Star Wars: Episode I The Phantom Menace; Star Wars: Episode II
Attack of the Clones; Star Wars: Episode III Revenge of the Sith; Star Wars:
Episode IV A New Hope; Star Wars: Episode V The Empire Strikes Back;
Star Wars: Episode VI Return of the Jedi)
– Please rank the Star Wars films in order of preference with 1 being your favorite
film in the franchise and 6 being your least favorite film. (Star Wars: Episode I
The Phantom Menace; Star Wars: Episode II Attack of the Clones; Star Wars:
Episode III Revenge of the Sith; Star Wars: Episode IV A New Hope; Star
Wars: Episode V The Empire Strikes Back; Star Wars: Episode VI Return of
the Jedi)
– Please state whether you view the following characters favorably, unfavorably,
or are unfamiliar with him/her. (Han Solo, Luke Skywalker, Princess Leia Organa,
Anakin Skywalker, Obi Wan Kenobi, Emperor Palpatine, Darth Vader,
Lando Calrissian, Boba Fett, C-3P0, R2-D2, Jar Jar Binks, Padme Amidala,
Yoda)
– Which character shot first?
– Are you familiar with the Expanded Universe?
– Do you consider yourself to be a fan of the Expanded Universe?
– Do you consider yourself to be a fan of the Star Trek franchise?
• the other is about people’s demographics, including
– Gender
– Age
– Household Income
– Education
– Location (Census Region)
We would like to build a classifier (or some classifiers, for example one classifier per
demographic feature), which can classify people’s demographics (gender, age, household
income, education, location (census region)) based on their attitude or opinion about
Star Wars movies. Please describe how to build this classifier (or these classifiers) by
using the data collected in the survey (the data provided in Assignment 1).
Please note that this is a short-answer question, and no coding work is required. Your
submission must be a PDF document, and must be at most 6 pages (in single column
format, including figures and references) with a font size between 10
and 12 points. Penalties will apply if the report does not satisfy these requirements.
What to Submit, When, and How
The assignment is due at
23:59, the 3rd of May, 2020.
Assignments submitted after this time will be subject to standard late submission
penalties.
You need to submit the following files:
• Your presentation slides and the oral audio presentation as required in Task 1.
• Your Assignment1 Part2.pdf file, which includes your answers to Task 2.
They must be submitted as ONE single zip file, named as your student number (for
example, XXXXXXXXXX.zip if your student ID is sXXXXXXXXXX). The zip file must be submitted in
Canvas:
Assignments/Assignment 1 (Part 2).
Please do NOT submit other unnecessary files.

{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"#Task 1: Data Preparation\n",
"# \"You will start by loading the CSV data from the file (using appropriate pandas functions) and checking whether the loaded data is equivalent to the data in the source CSV file.\n",
"# Then, you need to clean the data by using the knowledge we taught in the lectures. You need to deal with all the potential issues/errors in the data appropriately (such as: typos, extra whitespaces, sanity checks for impossible values, and missing values etc). \"\n",
"\n",
"# Please structure code as follows: \n",
"# always provide one line of comments to explain the purpose of the code, e.g. load the data, checking the equivalent to original data, checking typos (do this for each other types of errors)\n",
"\n",
"#Code goes after this line by adding cells"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"#reading csv file using pandas\n",
"import pandas as pd\n",
"starwars = pd.read_csv(\"starwars.csv\", encoding=\"ISO-8859-1\")"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(1187, 38)\n"
]
},
{
"data": {
"text/plain": [
"Index(['RespondentID',\n",
" 'Have you seen any of the 6 films in the Star Wars franchise?',\n",
" 'Do you consider yourself to be a fan of the Star Wars film franchise?',\n",
" 'Which of the following Star Wars films have you seen? Please select all that apply.',\n",
" 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8',\n",
" 'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.',\n",
" 'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',\n",
" 'Unnamed: 14',\n",
" 'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.',\n",
" 'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',\n",
" 'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',\n",
" 'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',\n",
" 'Unnamed: 28', 'Which character shot first?',\n",
" 'Are you familiar with the Expanded Universe?',\n",
" 'Do you consider yourself to be a fan of the Expanded Universe?ξ',\n",
" 'Do you consider yourself to be a fan of the Star Trek franchise?',\n",
" 'Gender', 'Age', 'Household Income', 'Education',\n",
" 'Location (Census Region)'],\n",
" dtype='object')"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#print the shape of the data and the column names\n",
"print(starwars.shape)\n",
"starwars.columns"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"#keep only the rows whose RespondentID is not null\n",
"starwars = starwars[starwars['RespondentID'].notnull()]"
]
},
{
"cell_type": "code"
Answered Same Day May 09, 2021 COSC2670

Solution

Pushpendra answered on May 10 2021
Project Task-2: Classification of Movie Data analysis

Feature Engineering:

Feature engineering is the process of using domain knowledge to extract features from raw
data via data mining techniques. These features can be used to improve the performance
of machine learning algorithms.
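
As a minimal sketch of what this could look like for survey answers, the snippet below turns a few text responses into numeric features with pandas. The column names and the four sample rows are hypothetical stand-ins, not the real contents of starwars.csv, and treating a missing answer as 0 is an assumption.

```python
import pandas as pd

# Hypothetical mini-sample standing in for the survey data; the real column
# names and values in starwars.csv may differ.
responses = pd.DataFrame({
    "fan_of_star_wars": ["Yes", "No", "Yes", None],
    "seen_episode_1": ["Star Wars: Episode I The Phantom Menace", None, None,
                       "Star Wars: Episode I The Phantom Menace"],
    "shot_first": ["Han", "Greedo", "I don't understand this question", "Han"],
})

features = pd.DataFrame(index=responses.index)

# Yes/No answers become 1/0; a missing answer is counted as 0 (an assumption).
features["fan_of_star_wars"] = (responses["fan_of_star_wars"] == "Yes").astype(int)

# "Select all that apply" columns hold the film title when ticked, else NaN.
features["seen_episode_1"] = responses["seen_episode_1"].notna().astype(int)

# Answers with several categories are one-hot encoded into 0/1 columns.
shot_first = pd.get_dummies(responses["shot_first"], prefix="shot_first", dtype=int)
features = pd.concat([features, shot_first], axis=1)

print(features)
```

The resulting table contains only non-negative numbers, which is what tests such as chi-squared expect.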
• Feature selection is the process of manually or automatically selecting the
features in the data that contribute most to the prediction variable or output. Having
irrelevant features in your data can decrease the accuracy of many models.

• Feature selection techniques in Python with the scikit-learn library:

1) Identify and remove the features which have low variance. This can be done by applying a
threshold value with VarianceThreshold in the sklearn library.

2) Remove the features which are highly correlated with each other. Correlation can be
positive or negative.

3) Univariate Feature Selection (ANOVA):

o Statistical tests can be used to select those features that have the strongest
relationship with the output variable.
o Use the chi-squared (chi2) statistical test for non-negative features to select the
best features from the dataset.
4) Recursive Feature Elimination:
o Recursive Feature Elimination (RFE) works by recursively removing attributes and
building a model on those attributes that remain. In scikit-learn, RFE ranks attributes
by the fitted model's coefficients or feature importances to identify which attributes
(and combination of attributes) contribute the most to predicting the target attribute.
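
The four techniques above can be sketched with scikit-learn as follows. This is an illustration on synthetic data, not the survey itself: the feature names, the 0.9 correlation cut-off, the choice of k=2, and the logistic regression used inside RFE are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE, SelectKBest, VarianceThreshold, chi2
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
# Synthetic, non-negative binary features standing in for encoded survey answers.
X = pd.DataFrame(rng.integers(0, 2, size=(n, 5)), columns=[f"q{i}" for i in range(5)])
X["constant"] = 1          # zero variance: carries no information
X["q0_copy"] = X["q0"]     # perfectly correlated with q0
y = X["q0"].to_numpy()     # hypothetical target that depends only on q0

# 1) Drop features whose variance does not exceed the threshold.
vt = VarianceThreshold(threshold=0.0).fit(X)
X1 = X.loc[:, vt.get_support()]

# 2) Drop one feature from each highly correlated pair (|r| > 0.9).
corr = X1.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X2 = X1.drop(columns=to_drop)

# 3) Univariate selection with the chi-squared test (needs non-negative features).
skb = SelectKBest(score_func=chi2, k=2).fit(X2, y)
chi2_selected = list(X2.columns[skb.get_support()])

# 4) Recursive feature elimination wrapped around a logistic regression.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X2, y)
rfe_selected = list(X2.columns[rfe.get_support()])

print("dropped for correlation:", to_drop)
print("chi2 picks:", chi2_selected)
print("RFE picks:", rfe_selected)
```

Because the target here is built from q0, both selectors should keep q0; on real survey data the selected columns would need to be inspected rather than assumed.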
Training Model:
First, fit the machine learning algorithm on the training dataset. Predictions on this same
dataset can be used to evaluate performance, but the estimate is optimistic, which is why
the data is split as described next.
Split into Train and Test Sets:
o A large amount of data and complex models can require very long training
times. It is typical to use a simple split of the data into training and test datasets, or
training and validation datasets, using the Python scikit-learn machine learning library.
o Use 70% for training and the remaining 30% of the data for validation. Some libraries
allow the validation dataset to be passed to the fit() function.
o The key concepts to understand are:
1) Training Dataset
2) Validation Dataset
3) Test Dataset
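
One possible way to produce the three datasets is to call scikit-learn's train_test_split twice. The sketch below uses synthetic arrays in place of the encoded survey data, and dividing the held-out 30% evenly into validation and test sets is an assumption, since the text only specifies the 70/30 split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(1000, 8))   # synthetic encoded answers
y = rng.integers(0, 2, size=1000)        # synthetic binary label, e.g. gender

# 70% for training, 30% held out; stratify keeps class proportions similar.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

# Split the held-out 30% evenly into validation and test sets (15% each).
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.50, stratify=y_hold, random_state=42)

print(len(X_train), len(X_val), len(X_test))
```

Fixing random_state makes the split reproducible, which matters when several classifiers (one per demographic feature) are compared on the same rows.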
Training Dataset:

The sample of data used to fit the model. This is the actual dataset used to train the model. The
model sees and learns from this data.

Validation Dataset:

The sample of data used to provide an unbiased evaluation of a model fit on the training dataset
while tuning model hyperparameters. The evaluation becomes more biased as skill on the
validation dataset is incorporated into the model configuration.

The validation set is used to evaluate a given model, but this is for frequent evaluation.

Use this data to fine-tune the model hyperparameters, so the model occasionally sees this data,
but never does it “Learn” from this.

We use the validation set results and update higher level hyperparameters. So the validation set
in a way affects a model, but indirectly....
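
A minimal sketch of this use of the validation set: several values of one hyperparameter are tried, each model is fitted on the training data only, and the value with the best validation accuracy is kept. The decision tree, the candidate depths, and the synthetic label (an AND of two features) are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 6))   # synthetic encoded answers
y = X[:, 0] & X[:, 1]                    # synthetic label: AND of two features

X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.30, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.50, random_state=0)

best_depth, best_val_accuracy = None, -1.0
for depth in [1, 2, 4, 8]:
    # The model is fitted on the training data only.
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    # The validation data is used only to compare hyperparameter settings.
    val_accuracy = accuracy_score(y_val, model.predict(X_val))
    if val_accuracy > best_val_accuracy:
        best_depth, best_val_accuracy = depth, val_accuracy

# The test set is touched once, after the hyperparameter has been chosen.
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, final_model.predict(X_test))

print(best_depth, best_val_accuracy, test_accuracy)
```

The model never fits on the validation rows, yet the chosen depth depends on them, which is the indirect influence described above.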