

Data Science 311
Lab 4 (10 points)
Due at 10am on Oct. 31, 2022
Read all of the instructions. Late work will not be accepted.
Overview
In this lab you will create a dataset for an intended analysis that could be conducted by
future data analysts. The lab is an exercise in data curation as well as time management
and tolerance for ambiguity. You have complete freedom in choosing the original data used to
create your dataset and in describing the intended analysis. You should find a data
source that is workable with the methods taught in class, i.e., downloadable as a csv or json
file that can be readily loaded into a pandas dataframe. Please take care to make a data curation
plan that can be accomplished in a suitable time frame for the lab due date.
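As a minimal sketch of what "readily loaded into a pandas dataframe" can look like, the example below reads a small csv and a small json array with pandas. The sample text, the column names, and the helper names load_csv and load_json are made up for illustration; in the lab, the argument would be a file path or url for your own data source.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a downloaded file or a url.
CSV_TEXT = "station,temp_c\nA,11.5\nB,13.0\n"
JSON_TEXT = '[{"station": "A", "temp_c": 11.5}, {"station": "B", "temp_c": 13.0}]'


def load_csv(source):
    """Load a csv from a path, url, or file-like object into a dataframe."""
    return pd.read_csv(source)


def load_json(source):
    """Load a json array of records from a path, url, or file-like object."""
    return pd.read_json(source)


# io.StringIO lets the in-memory text behave like an opened file.
csv_df = load_csv(io.StringIO(CSV_TEXT))
json_df = load_json(io.StringIO(JSON_TEXT))
print(csv_df.shape, json_df.shape)
```

If a candidate source needs much more than one call like these to load, it may be a sign that the curation plan will not fit the time available.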
Collaboration
For this lab, you can brainstorm with any classmates about ideas for datasets and data curation
methodology. However, your dataset and corresponding explanation of the intended analysis
should be unique. Your submission must acknowledge ideas or suggestions you received from
other classmates in an acknowledgement section at the end of your readme.txt.
Details
Data
You can use any data source of your choosing for this lab. This includes any data previously
discussed in lecture and accompanying notes as well as other data that you might be interested
in getting acquainted with.
Tasks
In this lab you are dipping your feet in the deep end of data science by collecting, organizing,
and annotating data. The lab is structured into 4 parts which you will document in a plain text
file readme.txt. Carefully follow the steps outlined below to complete the lab assignment.
1. Create a file in a text editor called readme.txt. Stub out the required sections listed
below.
(a) Title: Lab 4 DS 311 Dataset Curation
(b) Author: Your name
(c) Dataset name
(d) Dataset description
(e) Data provenance
(f) Intended usage
(g) Data curation
(h) Data faults
(i) Acknowledgements
Note
As it turns out, the order of the sections in a well-structured readme is often
not the order in which you end up writing them. For instance, the interested
reader of a readme will want to see the dataset description first, but you
will know the data provenance long before you have assembled the final dataset.
The dataset description will usually be the last thing you write.
2. Decide on a data source. Search around for some accessible data that you are interested in
checking out and that can easily be loaded into a pandas dataframe. Create a file, lab4.ipynb,
and write the code to load the data from a url or Google Drive in the notebook. If the
data was accessed from a graphical user interface and then downloaded to disk, the
original data must be included in the submission zip file with the title original data.csv
so the TA can run your code without going through the data collection process.
3. Complete the Data provenance section of readme.txt with complete and detailed
instructions for acquiring the data from the original source, e.g., url, search parameters
for GUI or API, reference to relevant code for accessing the data in lab4.ipynb.
4. Complete the Intended usage section of readme.txt. This should describe the analysis
intended for the dataset you are going to curate.
5. Fill out the Data curation section of readme.txt by describing your plan to curate the
data for the intended analysis. The ultimate dataset after data curation should be in a
standard pandas dataframe format with rows corresponding to datapoints and columns
corresponding to data features. If personal data is submitted, anonymization must
be part of the data curation plan (e.g., abstract away exact locations in location data by
using relative distances).
6. Enact the data curation plan in lab4.ipynb. It’s okay at this point to adapt your
plan if insurmountable difficulties arise due to unanticipated data quirks or time
constraints. Document any difficulties you encountered while enacting the plan in the
Data faults section of readme.txt, and then document the completed data curation plan in
the Data curation section if it’s in any way changed from the original.
7. Fill out the Dataset description section of readme.txt. This can be as simple as
describing what the datapoints are and describing in words each feature of the dataset.
8. In lab4.ipynb, save your curated dataset as a csv file called dataset.csv.
9. Fill out the Acknowledgements section of readme.txt. This should include any
collaboration, tips, or inspiration you received from classmates or instructors as well as
any reference articles related to the original data source if the data source requested a
citation.
10. Think of a cool name for your dataset. Record this in the Dataset name section of
readme.txt.
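The curation and saving steps above might look roughly like the following sketch. It assumes a made-up table of personal location records with planar coordinates in kilometres (the column names person, x_km, y_km, and steps are hypothetical, as is the helper curate). It drops the identifying columns, replaces absolute positions with the distance from the first record, and writes the result with rows as datapoints and columns as features. Real latitude and longitude data would need a proper geographic distance formula instead of the straight-line distance used here.

```python
import os
import tempfile

import pandas as pd

# Hypothetical raw data: one person's recorded positions and step counts.
raw = pd.DataFrame(
    {
        "person": ["Ada", "Ada", "Ada"],
        "x_km": [10.0, 13.0, 10.0],
        "y_km": [20.0, 24.0, 32.0],
        "steps": [0, 900, 2100],
    }
)


def curate(df):
    """Drop identifying columns and keep only distance relative to the first row."""
    x0, y0 = df["x_km"].iloc[0], df["y_km"].iloc[0]
    # Straight-line (Euclidean) distance from the starting point.
    dist = ((df["x_km"] - x0) ** 2 + (df["y_km"] - y0) ** 2) ** 0.5
    return pd.DataFrame({"dist_from_start_km": dist, "steps": df["steps"]})


curated = curate(raw)

# The demo writes into a temporary folder; in lab4.ipynb the target would
# simply be "dataset.csv" in the notebook's working directory.
out_path = os.path.join(tempfile.mkdtemp(), "dataset.csv")
curated.to_csv(out_path, index=False)  # index=False avoids an extra unnamed column

reloaded = pd.read_csv(out_path)
print(reloaded)
```

Reloading the saved file, as in the last two lines, is a cheap way to confirm that the csv contains only the intended feature columns.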
Submitting Your Work
You will submit a single file, Firstname Lastname lab4.zip, containing your 3 (or 4 if
applicable) files:
1. lab4.ipynb
2. readme.txt
3. dataset.csv
4. original data.csv (if applicable)
(where spelling, spacing and capitalization matter) and upload the zip via Canvas.
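One possible way to assemble the zip without forgetting a file is sketched below using Python's standard zipfile module. The helper build_zip and the placeholder archive name submission.zip are assumptions for illustration; the real archive and the optional original data file must be named exactly as this handout specifies, so pass those names in yourself.

```python
import os
import tempfile
import zipfile

REQUIRED = ["lab4.ipynb", "readme.txt", "dataset.csv"]


def build_zip(src_dir, zip_path, optional=()):
    """Zip the required files, plus any optional files that exist in src_dir."""
    missing = [f for f in REQUIRED if not os.path.isfile(os.path.join(src_dir, f))]
    if missing:
        raise FileNotFoundError(f"missing required files: {missing}")
    extras = [f for f in optional if os.path.isfile(os.path.join(src_dir, f))]
    names = REQUIRED + extras
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in names:
            # arcname keeps the files at the top level of the archive.
            zf.write(os.path.join(src_dir, name), arcname=name)
    return names


# Demo with placeholder files in a temporary directory.
work = tempfile.mkdtemp()
for name in REQUIRED:
    with open(os.path.join(work, name), "w") as fh:
        fh.write("placeholder\n")

zip_path = os.path.join(work, "submission.zip")  # placeholder name only
packed = build_zip(work, zip_path)
with zipfile.ZipFile(zip_path) as zf:
    archived = sorted(zf.namelist())
print(archived)
```

Raising an error when a required file is missing is a deliberate choice here, since an incomplete zip would otherwise go unnoticed until grading.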
Grading
Dataset creation will be graded on the following:
• 25% of the total grade will be on data curation.
• 25% of the total grade will be on whether the dataset.csv in the zip file matches the
dataset.csv derived from running lab4.ipynb.
• 25% of the total grade will be on completeness of readme.txt.
• 25% of the total grade will be on clarity of exposition in readme.txt.
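Before submitting, you may want to check the second criterion yourself by re-running lab4.ipynb from a fresh kernel and comparing the regenerated file with the one you plan to submit. The sketch below compares two files byte for byte using SHA-256 digests. The helper names and demo file names are hypothetical, and exact byte equality is an assumption about what "matches" means; the actual grading check may be more lenient.

```python
import hashlib
import os
import tempfile


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_file(path_a, path_b):
    """True when both files have identical contents."""
    return sha256_of(path_a) == sha256_of(path_b)


# Demo with small stand-in files (hypothetical names and contents).
work = tempfile.mkdtemp()
submitted = os.path.join(work, "dataset_submitted.csv")
regenerated = os.path.join(work, "dataset_regenerated.csv")
changed = os.path.join(work, "dataset_changed.csv")

for path, text in [
    (submitted, "a,b\n1,2\n"),
    (regenerated, "a,b\n1,2\n"),
    (changed, "a,b\n1,3\n"),
]:
    with open(path, "w", newline="") as fh:
        fh.write(text)

print(same_file(submitted, regenerated), same_file(submitted, changed))
```

A common cause of a mismatch is curation code that depends on randomness or on the current date, so fixing a random seed in the notebook can help keep the output reproducible.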