Practical Data Science with Python COSC 2670/2738 Assignment 1 Assessment Type Individual Due Date...

Question

Practical Data Science with Python
COSC 2670/2738
Assignment 1

Assessment Type: Individual
Due Date: 23:59, the 15th of April, 2020
Marks: 15

Introduction

In this assignment, you will examine a data file and carry out the first steps of the data science process, including the cleaning and exploring of data. You will need to develop and implement appropriate steps, in IPython, to load a data file into memory, clean, process, and analyse it. This assignment is intended to give you practical experience with the typical first steps of the data science process.

The “Practical Data Science” Canvas contains further announcements and a discussion board for this assignment. Please be sure to check these on a regular basis; it is your responsibility to stay informed with regards to any announcements or changes. Login through https://learninghub.rmit.edu.au.

Where to Develop Your Code

You are encouraged to develop and test your code in two environments: Jupyter Notebook on Lab PCs and Teaching Servers.

Jupyter Notebook on Lab PCs

On a Lab Computer, you can find Jupyter Notebook via:
Start → All Programs → Anaconda3 (64-bit) → Jupyter Notebook
Then,
• Select New → Python 3
• The newly created ‘*.ipynb’ file is created at the following location:
  – C:\Users\sXXXXXXX
  – where sXXXXXXX should be replaced with a string consisting of the letter “s” followed by your student number.

Teaching Servers

Three CSIT teaching servers are available for your use: (titan|saturn|jupiter).csit.rmit.edu.au. Details for how to access these servers are available in “Extra: Run Anaconda on RMIT Coreteaching Servers” under the Modules/Week2: Data Curation section of the course Canvas.
You are encouraged to develop your code on these machines. If you choose to develop your code elsewhere, it is your responsibility to ensure that your assignment submission can be successfully run using the version of IPython installed on Lab PCs or (titan|saturn|jupiter).csit.rmit.edu.au, as this is where your code will be run for marking purposes.

Important: You are required to make regular backups of all of your work. This is good practice, no matter where you are developing your assignment solutions.

Academic integrity and plagiarism (standard warning)

Academic integrity is about honest presentation of your academic work. It means acknowledging the work of others while developing your own insights, knowledge and ideas. You should take extreme care that you have:
• Acknowledged words, data, diagrams, models, frameworks and/or ideas of others you have quoted (i.e. directly copied), summarised, paraphrased, discussed or mentioned in your assessment through the appropriate referencing methods
• Provided a reference list of the publication details so your reader can locate the source if necessary. This includes material taken from Internet sites. If you do not acknowledge the sources of your material, you may be accused of plagiarism because you have passed off the work and ideas of another person without appropriate referencing, as if they were your own.

RMIT University treats plagiarism as a very serious offence constituting misconduct. Plagiarism covers a variety of inappropriate behaviours, including:
• Failure to properly document a source
• Copyright material from the internet or databases
• Collusion between students

For further information on our policies and procedures, please refer to the following: https://www.rmit.edu.au/students/student-essentials/rights-and-responsibilities/academic-integrity.

All submissions will be checked by Turnitin.

General Requirements

This section contains information about the general requirements that your assignment must meet.
Please read all requirements carefully before you start.
• You must do the analysis in IPython.
• Parts of this assignment will include a written report; this must be in PDF format.
• Please ensure that your submission follows the file naming rules specified in the tasks below. File names are case sensitive, i.e. if it is specified that the file name is gryphon, then that is exactly the file name you should submit; Gryphon, GRYPHON, griffin, and anything else but gryphon will be rejected.

Assessment details

Task 1: Data Preparation (5%)

Have a look at the file StarWars.csv, which is available under the Assignments -> Assignment 1 section of the course Canvas.

This file contains data behind the story America’s Favorite ‘Star Wars’ Movies (And Least Favorite Characters) [1]. The author collected the data by running a poll through SurveyMonkey Audience, surveying 1,186 respondents. The description of the questions asked in the survey is given below.
• Have you seen any of the 6 films in the Star Wars franchise?
• Do you consider yourself to be a fan of the Star Wars film franchise?
• Which of the following Star Wars films have you seen? Please select all that apply. (Star Wars: Episode I The Phantom Menace; Star Wars: Episode II Attack of the Clones; Star Wars: Episode III Revenge of the Sith; Star Wars: Episode IV A New Hope; Star Wars: Episode V The Empire Strikes Back; Star Wars: Episode VI Return of the Jedi)
• Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. (Star Wars: Episode I The Phantom Menace; Star Wars: Episode II Attack of the Clones; Star Wars: Episode III Revenge of the Sith; Star Wars: Episode IV A New Hope; Star Wars: Episode V The Empire Strikes Back; Star Wars: Episode VI Return of the Jedi)
• Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her. (Han Solo, Luke Skywalker, Princess Leia Organa, Anakin Skywalker, Obi Wan Kenobi, Emperor Palpatine, Darth Vader, Lando Calrissian, Boba Fett, C-3P0, R2-D2, Jar Jar Binks, Padme Amidala, Yoda)
• Which character shot first?
• Are you familiar with the Expanded Universe?
• Do you consider yourself to be a fan of the Expanded Universe?
• Do you consider yourself to be a fan of the Star Trek franchise?
• Gender
• Age
• Household Income
• Education
• Location (Census Region)

[1] https://github.com/fivethirtyeight/data/tree/master/star-wars-survey

Being a careful data scientist, you know that it is vital to carefully check any available data before starting to analyse it. Your task is to prepare the provided data for analysis. You will start by loading the CSV data from the file (using appropriate pandas functions) and checking whether the loaded data is equivalent to the data in the source CSV file. Then, you need to clean the data using the knowledge taught in the lectures. You need to deal with all the potential issues/errors in the data appropriately.

Task 2: Data Exploration (5%)

Explore the provided data based on the following steps:
1. Explore the survey question: Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. (Star Wars: Episode I The Phantom Menace; Star Wars: Episode II Attack of the Clones; Star Wars: Episode III Revenge of the Sith; Star Wars: Episode IV A New Hope; Star Wars: Episode V The Empire Strikes Back; Star Wars: Episode VI Return of the Jedi), then analyse how people rate Star Wars movies.
2. Explore the relationships between columns. You need to choose 3 pairs of columns to focus on, and you need to generate 1 visualisation for each pair. Each pair of columns that you choose should address a plausible hypothesis for the data concerned.
3. Explore whether there are relationships between people’s demographics (Gender, Age, Household Income, Education, Location) and their attitude to Star Wars characters.

Note: each visualisation (graph) should be complete and informative in itself, and should be clear for readers to read and obtain information.

Task 3: Report (5%)

Write your report and save it in a file called report.pdf. It must be in PDF format, and must be at most 6 pages (in single column format, including figures and references) with a font size between 10 and 12 points. Penalties will apply if the report does not satisfy the requirement. Moreover, the quality of the report will be considered, e.g. clarity, grammar mistakes, the flow of the presentation.

Remember to clearly cite any sources (including books, research papers, course notes, etc.) that you referred to while designing aspects of your programs.
• Create a heading called “Data Preparation” in your report.
  – Provide a brief explanation of how you addressed the task. For the steps of dealing with the potential issues/errors, please create a sub-section for each type of error you dealt with (e.g. typos, extra whitespaces, sanity checks for impossible values, and missing values etc), and also explain and justify how you dealt with each kind of error.
• Create a heading called “Data Exploration” in your report.
  – For each numbered step in Task 2 above, create a sub-section with corresponding numbering.

What to Submit, When, and How

The assignment is due at 23:59, the 15th of April, 2020. Assignments submitted after this time will be subject to standard late submission penalties.

You need to submit the following files:
• Notebook file containing your python commands for Task 1 and Task 2, ‘assignment1.ipynb’.
Please use the provided solution template to organise your solutions: assignment1_TEMPLATE.ipynb. For the notebook files, please make sure to clean them and remove any unnecessary lines of code (cells). Follow these steps before submission:
  1. Main menu → Kernel → Restart & Run All
  2. Wait till you see the output displayed properly. You should see all the data printed and graphs displayed.
• Your report.pdf file: at most 6 pages (in single column format, including figures and references) with a font size between 10 and 12 points. Penalties will apply if the report does not satisfy the requirement.

They must be submitted as ONE single zip file, named as your student number (for example, XXXXXXXXXX.zip if your student ID is sXXXXXXXXXX). The zip file must be submitted in Canvas: Assignments/Assignment 1. Please do NOT submit other unnecessary files.

A Marking Guidelines

Data Preparation, Data Exploration and Report are each marked out of a maximum of 5 marks.

5 marks
• Data Preparation: Data preparation is well designed, systematic and well explained. All potential errors/issues have been completely examined and properly treated.
• Data Exploration: Analysis is thorough and demonstrates understanding and critical analysis. Well-reasoned explorations are provided for all sub-tasks. All analysis, comparisons and conclusions are evidenced by data (e.g. in well-formatted figures and/or tables).
• Report: Very clear, well structured and accessible report; an undergraduate student can pick up the report and understand it with no difficulty.

4 marks
• Data Preparation: Data preparation is reasonably designed, systematic and explained. There is at least one obvious missing issue/error. Each examined error/issue has been completely checked and properly treated.
• Data Exploration: Analysis is thorough and demonstrates good understanding and critical

Neha · Accepted Answer

Data preparation 
pandas is Python's data manipulation and analysis library and one of the
cornerstones of the Python scientific programming stack. It can be used for many tasks,
including data preparation. Data preparation can be organised using the CRISP-DM
model. Another approach is the KDD process, which involves selection, preprocessing and
transformation.
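
As a rough illustration of the preparation steps the assignment asks for (loading with pandas, checking the load, fixing whitespace and capitalisation, and a sanity check), here is a minimal sketch. The inline CSV, its column names and the set of valid age bands are invented for the example and are not taken from StarWars.csv.

```python
import io

import pandas as pd

# Hypothetical stand-in for a survey file. The columns and values below are
# invented for illustration; they are not the contents of StarWars.csv.
raw_csv = io.StringIO(
    "RespondentID,Fan,Gender,Age\n"
    "1,Yes ,Male,18-29\n"
    "2,yes,female,30-44\n"
    "3,No,Female ,500\n"
)

# Read every column as text so nothing is silently converted on load.
df = pd.read_csv(raw_csv, dtype=str)

# Check that the loaded data matches the source: row/column counts and names.
print(df.shape)
print(df.columns.tolist())

# Extra whitespace and inconsistent capitalisation in categorical columns.
for col in ["Fan", "Gender"]:
    df[col] = df[col].str.strip().str.capitalize()

# Sanity check: anything outside the expected categories becomes missing.
# The set of valid age bands here is an assumption made for this example.
valid_ages = {"18-29", "30-44", "45-60", "> 60"}
df["Age"] = df["Age"].where(df["Age"].isin(valid_ages))

print(df)
```

With the real file, the same pattern would start from `pd.read_csv("StarWars.csv")`, and the list of columns to clean would come from inspecting the data rather than from this sketch.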
Exploratory data analysis 
Exploratory data analysis is a standard step in a data analysis, data science or machine
learning project. It can be defined as the practice of using visual and quantitative methods
to understand a dataset without making assumptions about it. It is an important
step to complete before moving on to machine learning or statistical modelling.
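
The quantitative side of exploratory analysis can be sketched with two pandas summaries: a mean rank per film and a cross-tabulation of a demographic column against an attitude column. The film names, rankings and responses below are made up for illustration and say nothing about the real survey.

```python
import pandas as pd

# Hypothetical rankings: each row is one respondent, 1 = favourite film.
rankings = pd.DataFrame({
    "Film A": [1, 1, 2, 1],
    "Film B": [2, 3, 1, 2],
    "Film C": [3, 2, 3, 3],
})

# A lower mean rank means the film is preferred on average.
mean_rank = rankings.mean().sort_values()
print(mean_rank)

# Hypothetical demographic and attitude columns for the same respondents.
people = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male"],
    "View": ["Favourably", "Favourably", "Unfavourably", "Favourably"],
})

# Counts of each attitude within each demographic group.
table = pd.crosstab(people["Gender"], people["View"])
print(table)
```

Either summary can then be turned into a chart, for example with `mean_rank.plot(kind="bar")`, assuming matplotlib is installed.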
Dealing with the missing values 
Here are some common methods that can be used to deal with missing values
in a dataset:
1) Drop instances and attributes 
2) Impute the attribute mean,
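
The two methods listed above can be sketched in pandas as follows. The small table and its column names are hypothetical, and the threshold of two non-missing values is an arbitrary choice for the example.

```python
import pandas as pd

# Hypothetical data with gaps; None becomes NaN in the numeric columns.
df = pd.DataFrame({
    "age": [25.0, None, 40.0, 31.0],
    "income": [50.0, 60.0, None, 70.0],
    "notes": [None, None, None, "ok"],
})

# 1) Drop instances (rows) that are missing a value in a chosen column.
dropped_rows = df.dropna(subset=["age"])

#    Drop attributes (columns) with fewer than 2 non-missing values.
dropped_cols = df.dropna(axis="columns", thresh=2)

# 2) Impute the attribute mean: fill each gap with the column's mean.
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].mean())

print(dropped_rows)
print(dropped_cols)
print(imputed)
```

Dropping is simple but discards information, while mean imputation keeps every row at the cost of shrinking the variance of the filled column, so the choice should be justified for each attribute.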
