Great Deal! Get Instant $10 FREE in Account on First Order + 10% Cashback on Every Order Order Now

Practical Data Science with Python COSC 2670/2738 Assignment 1 Assessment Type Individual Due Date 23:59, the 15th of April, 2020 Marks 15 Introduction In this assignment, you will examine a data file...

1 answer below »
Practical Data Science with Python
COSC 2670/2738
Assignment 1
Assessment Type Individual
Due Date 23:59, the 15th of April, 2020
Marks 15
Introduction
In this assignment, you will examine a data file and ca
y out the first steps of the data
science process, including the cleaning and exploring of data.
You will need to develop and implement appropriate steps, in IPython, to load a data
file into memory, clean, process, and analyse it.
This assignment is intended to give you practical experience with the typical first
steps of the data science process.
The “Practical Data Science” Canvas contains further announcements and a discus-
sion board for this assignment. Please be sure to check these on a regular basis – it
is your responsibility to stay informed with regards to any announcements or changes.
Login through https:
learninghub.rmit.edu.au.
Where to Develop Your Code
You are encouraged to develop and test your code in two environments: Jupyter Note-
ook on Lab PCs and Teaching Servers.
Jupyter Notebook on Lab PCs
On Lab Computer, you can find Jupyter Notebook via:
Start → All Programs → Anaconda3 (64-bit) → Jupyter Notebook
Then,
• Select New → Python 3
• The new created ‘*.ipynd’ is created at the following location:
– C:\Users\sXXXXXXX
– where sXXXXXXX should be replaced with a string consisting of the lette
“s” followed by your student number.
https:
learninghub.rmit.edu.au
Teaching Servers
Three CSIT teaching servers are available for your use:
(titan|saturn|jupiter).csit.rmit.edu.au.
Details for how to access these servers are available in ‘‘Extra: Run Anaconda on
RMIT Coreteaching Servers’’ under the Modules/Week2: Data Curation section of
the course Canvas. You are encouraged to develop your code on these machines.
If you choose to develop your code elsewhere, it is your responsibility to ensure that
your assignment submission can be successfully run using the version of IPython installed
on Lab PCs or (titan|saturn|jupiter).csit.rmit.edu.au, as this is where your code
will be run for marking purposes.
Important: You are required to make regular backups of all of your work. This is
good practice, no matter where you are developing your assignment solutions.
Academic integrity and plagiarism (standard warning)
Academic integrity is about honest presentation of your academic work. It means ac-
knowledging the work of others while developing your own insights, knowledge and ideas.
You should take extreme care that you have:
• Acknowledged words, data, diagrams, models, frameworks and/or ideas of others
you have quoted (i.e. directly copied), summarised, paraphrased, discussed or men-
tioned in your assessment through the appropriate referencing methods
• Provided a reference list of the publication details so your reader can locate the
source if necessary. This includes material taken from Internet sites. If you do not
acknowledge the sources of your material, you may be accused of plagiarism because
you have passed off the work and ideas of another person without appropriate
eferencing, as if they were your own.
RMIT University treats plagiarism as a very serious offence constituting misconduct.
Plagiarism covers a variety of inappropriate behaviours, including:
• Failure to properly document a source
• Copyright material from the internet or databases
• Collusion between students
For further information on our policies and procedures, please refer to the following:
https:
www.rmit.edu.au/students/student-essentials
ights-and-responsibilities
academic-integrity.
All submission will be checked by TurnedIn.
General Requirements
This section contains information about the general requirements that your assignment
must meet. Please read all requirements carefully before you start.
• You must do the analysis in IPython.
2
https:
www.rmit.edu.au/students/student-essentials
ights-and-responsibilities/academic-integrity
https:
www.rmit.edu.au/students/student-essentials
ights-and-responsibilities/academic-integrity
• Parts of this assignment will include a written report, this must be in PDF format.
• Please ensure that your submission follows the file naming rules specified in the
tasks below. File names are case sensitive, i.e. if it is specified that the file name is
gryphon, then that is exactly the file name you should submit; Gryphon, GRYPHON,
griffin, and anything else but gryphon will be rejected.
Assessment details
Task 1: Data Preparation (5%)
Have a look at the file StarWars.csv, which is available in Canvas under the Assignments
-> Assignment 1 section of the course Canvas.
This file contains data behind the story America’s Favorite ‘Star Wars’ Movies (And
Least Favorite Characters)1. The author collected the data by running a poll through
SurveyMonkey Audience, surveying 1,186 respondents. The description of the questions
asked in the survey is given below.
• Have you seen any of the 6 films in the Star Wars franchise?
• Do you consider yourself to be a fan of the Star Wars film franchise?
• Which of the following Star Wars films have you seen? Please select all that apply.
(Star Wars: Episode I The Phantom Menace; Star Wars: Episode II Attack of the
Clones; Star Wars: Episode III Revenge of the Sith; Star Wars: Episode IV A New
Hope; Star Wars: Episode V The Empire Strikes Back; Star Wars: Episode VI
Return of the Jedi)
• Please rank the Star Wars films in order of preference with 1 being your favorite
film in the franchise and 6 being your least favorite film. (Star Wars: Episode I The
Phantom Menace; Star Wars: Episode II Attack of the Clones; Star Wars: Episode
III Revenge of the Sith; Star Wars: Episode IV A New Hope; Star Wars: Episode
V The Empire Strikes Back; Star Wars: Episode VI Return of the Jedi)
• Please state whether you view the following characters favorably, unfavorably, or are
unfamiliar with him/her. (Han Solo, Luke Skywalker, Princess Leia Organa, Anakin
Skywalker, Obi Wan Kenobi, Emperor Palpatine, Darth Vader, Lando Calrissian,
Boba Fett, C-3P0, R2-D2, Jar Jar Binks, Padme Amidala, Yoda)
• Which character shot first?
• Are you familiar with the Expanded Universe?
• Do you consider yourself to be a fan of the Expanded Universe?
• Do you consider yourself to be a fan of the Star Trek franchise?
• Gende
• Age
1https:
github.com/fivethirtyeight/data/tree/maste
star-wars-survey
3
gryphon
Gryphon
GRYPHON
griffin
gryphon
• Household Income
• Education
• Location (Census Region)
Being a careful data scientist, you know that it is vital to carefully check any available
data before starting to analyse it. Your task is to prepare the provided data for analysis.
You will start by loading the CSV data from the file (using appropriate pandas functions)
and checking whether the loaded data is equivalent to the data in the source CSV file.
Then, you need to clean the data by using the knowledge we taught in the lectures. You
need to deal with all the potential issues/e
ors in the data appropriately.
Task 2: Data Exploration (5%)
Explore the provided data based on the following steps:
1. Explore the survey question: Please rank the Star Wars films in order of preference
with 1 being your favorite film in the franchise and 6 being your least favorite film.
(Star Wars: Episode I The Phantom Menace; Star Wars: Episode II Attack of the
Clones; Star Wars: Episode III Revenge of the Sith; Star Wars: Episode IV A New
Hope; Star Wars: Episode V The Empire Strikes Back; Star Wars: Episode VI
Return of the Jedi), then analysis how people rate Star Wars Movies.
2. Explore the relationships between columns. You need to choose 3 pairs of columns
to focus on, and you need to generate 1 visualisation for each pair. Each pai
of columns that you choose should address a plausible hypothesis for the data
concerned.
3. Explore whether there are relationship between people’s demographics (Gender,
Age, Household Income, Education, Location) and their attitude to Start Wa
characters.
Note, each visualization (graph) shoul be complete and informative in itself, and should
e clear for readers to read and obtain information.
Task 3: Report (5%)
Write your report and save it in a file called report.pdf, and it must be in PDF format,
and must be at most 6 (in single column format) pages (including figures and
eferences) with a font size between 10 and 12 points. Penalties will apply if
the report does not satisfy the requirement. Moreover, the quality of the report will be
considered, e.g. clarity, grammar mistakes, the flow of the presentation.
Remember to clearly cite any sources (including books, research papers, course notes,
etc.) that you refe
ed to while designing aspects of your programs.
• Create a heading called “Data Preparation” in your report.
4
– Provide a
ief explanation of how you addressed the task. For the steps of
dealing with the potential issues/e
ors, please create a sub-section for each
type of e
ors you dealt with (e.g. typos, extra whitespaces, sanity checks fo
impossible values, and missing values etc), and also explain and justify how
you dealt with each kind of e
ors.
• Create a heading called “Data Exploration” in your report.
– For each numbered step in Task 2 above, create a sub-section with co
espond-
ing numbering.
What to Submit, When, and How
The assignment is due at
23:59, the 15th of April, 2020.
Assignments submitted after this time will be subject to standard late submission penal-
ties.
You need to submit the following files:
• Notebook file containing your python commands for Task 1 and Task 2, ‘assign-
ment1.ipynb’. Please use the provided solution template to organise you
solutions: assignment1 TEMPLATE.ipyn
# For the notebook files, please make sure to clean them and remove any unnecessary
lines of code (cells). Follow these steps before submission:
1. Main menu → Kernel → Restart & Run All
2. Wait till you see the output displayed properly. You should see all the data
printed and graphs displayed.
• Your report.pdf file: at most 6 (in single column format) pages (including
figures and references) with a font size between 10 and 12 points. Penalties
will apply if the report does not satisfy the requirement.
They must be submitted as ONE single zip file, named as your student number (fo
example, XXXXXXXXXXzip if your student ID is s XXXXXXXXXXThe zip file must be submitted in
Canvas:
Assignments/Assignment 1.
Please do NOT submit other unnecessary files.
5
A Marking Guidelines
Data Preparation Data Exploration Report
(Maximum = 5 marks) (Maximum = 5 marks) (Maximum = 5 marks)
5 marks 5 marks 5 marks
Data preparation is well
designed, systematic and well
explained. All potential
e
ors/issues have been
completely examined and
properly treated
Analysis is thorough and demonstrates
understanding and critical analysis. Well-
easoned exploration are provided for all
sub-tasks. All analysis, comparisons and
conclusions are evidenced by data (e.g. in
well-formatted figures and/or tables).
Very clear, well struc-
tured and accessible re-
port, an undergraduate
student can pick up the
eport and understand
it with no difficulty.
4 marks 4 marks 4 marks
Data preparation is
easonably designed,
systematic and explained.
There are at least one
obvious missing issue/e
or.
Each examined e
o
issue
have been completely checked
and properly treated.
Analysis is thorough and demonstrates
good understanding and critical
Answered Same Day Apr 18, 2021 COSC2670

Solution

Neha answered on Apr 19 2021
156 Votes
Data preparation
Pandas in the python is used as the data manipulation and analysis li
ary. It is one of the
cornerstones of the python scientific programming stack. It can be used for multiple task
which also involves data preparation. The data preparation can be done using the CRISP-DM
model. Another method is KDD process which involves the selection, preprocessing and
transformation.
Exploratory data analysis
It is one of the point from the data analysis field, data science or the machine learning
project. It can be defined as the practice of including visual and quantitative methods to
help us in understanding the dataset without assuming anything. It is an important and
crucial step before entering the machine learning or any statistical modeling.
Dealing with the missing values
Here are some common methods which can be used to deal with the missing values present
in the dataset
1) Drop instances and attributes
2) Impute the attribute mean, median and mode for all the missing...
SOLUTION.PDF

Answer To This Question Is Available To Download

Related Questions & Answers

More Questions »

Submit New Assignment

Copy and Paste Your Assignment Here