Scalable Data Analytics Homework 1
Spring 2021 Deadline Feb.15 Noon, 2021
Deadlines: Homework 1 is due on Feb. 15th at 12:30pm. Late submissions incur a 50% penalty.
How to submit: Please submit a zip file to the Assignment/Homework 1 folder in
iCollege. The zip file should be named ’Yourname-Pantherid.zip’ and should contain
three separate IPython notebook files, ’1-generator.ipynb’, ’2-HOF.ipynb’, and
’3-generator-HOF.ipynb’, for the first, second, and third problems respectively.
Data Set: Citibike dataset posted in iCollege.
1. (2 points) Python’s Generators and Streaming.
Compute the median age of Citibike’s subscribed customers. You are required to read
the data line by line and are not allowed to store the entire data set in memory. In
particular, you should not have any container (e.g. list, dictionary, DataFrame, etc.)
with more than 100 elements in memory. Use yield when you want to iterate over a
sequence but don’t want to store the entire sequence in memory, as shown in Codes/Lab3.
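The yield pattern described above can be sketched as follows. This is only an illustration, not the Lab3 code: the in-memory sample, the tripduration column, the helper names ages and median_from_counts, and the reference year 2015 are assumptions made for the example. The only container kept in memory is a dictionary with one entry per distinct age.

```python
import csv
import io

# Hypothetical sample standing in for citibike.csv; only birth_year is used.
SAMPLE = "tripduration,birth_year\n300,1985\n450,\n200,1990\n610,1985\n120,1970\n"

def ages(lines, ref_year=2015):
    """Yield one age at a time so the full data set is never held in memory."""
    for row in csv.DictReader(lines):
        birth_year = row["birth_year"]
        if birth_year != "":  # skip riders with no recorded birth year
            yield ref_year - int(float(birth_year))

def median_from_counts(counts):
    """Median of a distribution given as {value: frequency}."""
    total = sum(counts.values())
    # Zero-based ranks of the middle element(s); equal when total is odd.
    lower_rank, upper_rank = (total - 1) // 2, total // 2
    seen, lower = 0, None
    for value in sorted(counts):
        seen += counts[value]
        if lower is None and seen > lower_rank:
            lower = value
        if seen > upper_rank:
            return (lower + value) / 2

counts = {}
for age in ages(io.StringIO(SAMPLE)):
    counts[age] = counts.get(age, 0) + 1  # one entry per distinct age

median_age = median_from_counts(counts)
print(median_age)
```

Because the dictionary maps each age to its frequency, it doubles as the data for a histogram, and its size is bounded by the number of distinct ages rather than the number of rides.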
What to submit:
Turn in an IPython notebook that plots the histogram of customers’ ages and prints
out a single number showing the median age of the subscribed customers.
2. (4 points) Python’s Higher Order Functions
This is how you can read the file and transform it to a list of lists.
import pandas as pd
df = pd.read_csv("citibike.csv")
rows = df.values.tolist()
(a).
Determine the number of trips that gender 1 made and that gender 2 made. We can do
this by just counting the number of occurrences of ”1” and ”2” in the gender column
(2pt):
Read file
YOUR HOF EXPRESSION
# After this, you should get something like
# (37805, 7848)
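One possible shape for such an expression is a single fold with functools.reduce. The sketch below runs on a few made-up rows; the column position of the gender code (GENDER) and the sample values are assumptions and need to be adapted to the real file layout.

```python
from functools import reduce

# Hypothetical rows; in this sketch the gender code sits in the last column.
rows = [
    [300, 1985.0, 1],
    [450, float("nan"), 0],
    [200, 1990.0, 2],
    [610, 1985.0, 1],
]
GENDER = -1  # assumed column position, adjust to the real file layout

# Fold over the rows, carrying a (gender-1 count, gender-2 count) pair.
gender_counts = reduce(
    lambda acc, row: (acc[0] + (row[GENDER] == 1), acc[1] + (row[GENDER] == 2)),
    rows,
    (0, 0),
)
print(gender_counts)  # (2, 1)
```

Adding a boolean to an integer counts it as 0 or 1, which keeps the lambda to a single expression.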
(b).
Count the number of trips per birth year using higher order functions (2pt):
Read file
YOUR HOF EXPRESSION
# After this, you should get something like
# {"1900.0": 22, "1901.0": 1, "1910.0": 2, "1922.0": 4, ... "1995.0": 256,
"1996.0": 124, "1997.0": 94, "1998.0": 59, "1999.0": 17}
Hint: math.isnan() detects nan values, so you can use it to filter them out.
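As a rough illustration of the hint, math.isnan can serve as the predicate for filter, and the surviving values can then be folded into a dictionary. The sample list below is hypothetical and stands in for the birth-year column of rows.

```python
import math
from functools import reduce

# Hypothetical birth-year column as pandas would produce it (floats, NaN for missing).
birth_years = [1985.0, float("nan"), 1990.0, 1985.0, float("nan"), 1970.0]

# math.isnan only tests a value; filter() does the actual removal.
known = filter(lambda year: not math.isnan(year), birth_years)

# Fold the remaining years into a {year: trips} dictionary.
trips_per_year = reduce(
    lambda acc, year: {**acc, str(year): acc.get(str(year), 0) + 1},
    known,
    {},
)
print(trips_per_year)
```

Building a new dictionary at each step keeps the lambda free of side effects, at the cost of extra copying; for a large file an in-place update inside a named function would be cheaper.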
What to submit:
Turn in an IPython notebook that prints out the results for problems (a) and (b).
3. (4 points) Extract the first ride of the day from a Citibike data stream. The first ride of
the day is interpreted as the ride with the earliest starting time on that day. For the
sample data, which is a week’s worth of Citibike records, your program should generate
only 7 items (one for each day).
Streaming Computation: you are asked to complete the task using streaming computation
methods. You may iterate over the data set only once, using yield as shown in Codes/Lab3.
You may store a container (e.g. list, dictionary, DataFrame, etc.) with at most 7
elements in memory. The data set has been sorted by starting time.
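Because the stream is sorted by starting time, the first row seen for each new date is that day's first ride, so a single pass that remembers only the current date is enough. The sketch below assumes a starttime column whose values begin with a YYYY-MM-DD date; that column name, the timestamp format, the sample rows, and the helper name first_rides are assumptions, not taken from the data set.

```python
import csv
import io

# Hypothetical sample, already sorted by start time; column names are assumptions.
SAMPLE = (
    "starttime,birth_year\n"
    "2015-02-01 00:03:12,1978\n"
    "2015-02-01 08:15:40,1990\n"
    "2015-02-02 00:01:05,1985\n"
    "2015-02-02 23:59:10,1969\n"
    "2015-02-03 06:30:00,\n"
)

def first_rides(lines):
    """Yield the first row seen for each calendar day in a time-sorted stream."""
    current_day = None
    for row in csv.DictReader(lines):
        day = row["starttime"][:10]  # "YYYY-MM-DD" prefix of the timestamp
        if day != current_day:       # date changed, so this row opens a new day
            current_day = day
            yield row

first = [row["birth_year"] for row in first_rides(io.StringIO(SAMPLE))]
print(first)
```

The generator keeps one string of state, and the collecting list holds one entry per day, which stays within the 7-element limit for a week of data.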
What to submit:
Turn in an IPython notebook that prints out the birth years of the first riders of each
day for this problem.
{
"cells": [
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import csv "
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"#read file\n",
"with open(\"citibike.csv\",\"r\") as fi:\n",
"    reader = csv.DictReader(fi)\n",
"    for row in reader:\n",
"        birthyear = row[\"birth_year\"]\n",
"        if birthyear != \"\":\n",
"            age = 2015-int(birthyear)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"# Define a generator so the data set is iterated only once\n",
"def citibike2gen(filename):\n",
"    with open(filename,\"r\") as fi:\n",
"        reader = csv.DictReader(fi)\n",
"        for row in reader:\n",
"            birthyear = row[\"birth_year\"]\n",
"            if birthyear != \"\":\n",
"                age = 2015-int(birthyear)\n",
"                yield age\n",
"count = {}\n",
"for age in citibike2gen(\"citibike.csv\"):\n",
" count[age] = count.get(age,0)+1"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{37: 1377,\n",
" 22: 470,\n",
" 46: 1133,\n",
" 30: 1673,\n",
" 58: 449,\n",
" 36: 1279,\n",
" 32: 1793,\n",
" 60: 413,\n",
" 33: 1455,\n",
" 27: 1358,\n",
" 24: 922,\n",
" 25: 1361,\n",
" 38: 1122,\n",
" 47: 1010,\n",
" 28: 1730,\n",
" 35: 1509,\n",
" 55: 771,\n",
" 29: 1568,\n",
" 34: 1499,\n",
" 40: 1071,\n",
" 42: 1022,\n",
" 44: 1162,\n",
" 31: 1714,\n",
" 20: 256,\n",
" 21: 392,\n",
" 49: 863,\n",
" 43: 1081,\n",
" 51: 891,\n",
" 61: 417,\n",
" 23: 493,\n",
" 26: 1322,\n",
" 45: 1347,\n",
" 54: 618,\n",
" 41: 1158,\n",
" 39: 1168,\n",
" 56: 687,\n",
" 50: 947,\n",
" 57: 783,\n",
" 48: 999,\n",
" 52: 970,\n",
" 66: 134,\n",
" 63: 247,\n",
" 70: 28,\n",
" 67: 149,\n",
" 18: 94,\n",
" 19: 124,\n",
" 53: 899,\n",
" 65: 150,\n",
" 71: 59,\n",
" 62: 346,\n",
" 59: 488,\n",
" 64: 229,\n",
" 74: 39,\n",
" 77: 24,\n",
" 81: 8,\n",
" 68: 74,\n",
" 73: 61,\n",
" 75: 21,\n",
" 72: 18,\n",
" 69: 93,\n",
" 17: 59,\n",
" 115: 22,\n",
" 16: 17,\n",
" 80: 9,\n",
" 76: 4,\n",
" 105: 2,\n",
" 89: 1,\n",
" 86: 1,\n",
" 114: 1,\n",
" 93: 4}"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"count"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" XXXXXXXXXX:00:00+00:00\n",