Scalable Data Analytics Homework 1 (Spring 2021)

Question

Scalable Data Analytics Homework 1
Spring 2021
Deadline: Feb. 15, 2021, Noon

Deadlines: Homework 1 is due on Feb. 15th at 12:30pm. 50% late submission penalty.

How to submit: Please submit a zip file to the Assignment/Homework 1 folder in iCollege. The zip file name should be 'Yourname-Pantherid.zip'. The zipped file should contain three separate ipython notebook files, '1-generator.ipynb', '2-HOF.ipynb', and '3-generator-HOF.ipynb', for the first, second and third problems respectively.

Data Set: Citibike dataset posted in iCollege.

1. (2 points) Python's Generators and Streaming.
Compute the median age of Citibike's subscribed customers. You are required to read the data line by line and are not allowed to store the entire data set in memory. Indeed, you should not have any containers (e.g. list, dictionary, DataFrame, etc.) with more than 100 elements in memory. You should use yield when you want to iterate over a sequence but don't want to store the entire sequence in memory, as shown in Codes/Lab3.

What to submit: Turn in an ipython notebook with a plot of the histogram of customers' ages, and print out a single number showing the median age of the subscribed customers.

2. (4 points) Python's Higher Order Functions
This is how you can read the file and transform it to a list of lists.

    import pandas as pd
    df = pd.read_csv("citibike.csv")
    rows = df.values.tolist()

(a) Determine the number of trips that gender 1 made, and that gender 2 made. We can do this by just counting the number of occurrences of "1" and "2" in the gender column (2pt):

    Read file
    YOUR HOF EXPRESSION
    # After this, you should get something like
    # (37805, 7848)

(b) Count the number of trips per birth year using higher order functions (2pt):

    Read file
    YOUR HOF EXPRESSION
    # After this, you should get something like
    # {"1900.0": 22, "1901.0": 1, "1910.0": 2, "1922.0": 4, ...
    # "1995.0": 256, "1996.0": 124, "1997.0": 94, "1998.0": 59, "1999.0": 17}

Hint: math.isnan() can be used to filter out all the nan values.

What to submit: Turn in an ipython notebook that prints out the results for problems (a) and (b).

3. (4 points) Extract the first ride of the day from a Citibike data stream. The first ride of the day is interpreted as the ride with the earliest starting time of a day. For the sample data, which is a week's worth of Citibike records, your program should only generate 7 items (one for each day).

Streaming Computation: you are asked to complete the task using streaming computation methods. You can only iterate over the data set once, using yield as shown in Codes/Lab3. You can store a container (e.g. list, dictionary, DataFrame, etc.) with a maximum of 7 elements in memory. The data set has been sorted by the starting time.

What to submit: Turn in an ipython notebook that prints out the birth years of the first riders each day for this problem.

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "#read file
",    "with open("citibike.csv","r") as fi:
",    "    reader = csv.DictReader(fi)
",    "    for row in reader:
",    " XXXXXXXXXXbirthyear = row["birth_year"]
",    " XXXXXXXXXXif birthyear != "":
",    " XXXXXXXXXXage = 2015-int(birthyear)"   ]  },  {   "cell_type": "code",   "execution_count": 18,   "metadata": {},   "outputs": [],   "source": [    "#def generator to iterate teh daata set only once
",    "def citibike2gen(filename):
",    "    with open("citibike.csv","r") as fi:
",    " XXXXXXXXXXreader = csv.DictReader(fi)
",    " XXXXXXXXXXfor row in reader:
",    " XXXXXXXXXXbirthyear = row["birth_year"]
",    " XXXXXXXXXXif birthyear != "":
",    " XXXXXXXXXXage = 2015-int(birthyear)
",    " XXXXXXXXXXyield age
",    "count = {}
",    "for age in citibike2gen("citibike.csv"):
",    "    count[age] = count.get(age,0)+1"   ]  },  {   "cell_type": "code",   "execution_count": 19,   "metadata": {},   "outputs": [    {     "data": {      "text/plain": [       "{37: 1377,
",       " 22: 470,
",       " 46: 1133,
",       " 30: 1673,
",       " 58: 449,
",       " 36: 1279,
",       " 32: 1793,
",       " 60: 413,
",       " 33: 1455,
",       " 27: 1358,
",       " 24: 922,
",       " 25: 1361,
",       " 38: 1122,
",       " 47: 1010,
",       " 28: 1730,
",       " 35: 1509,
",       " 55: 771,
",       " 29: 1568,
",       " 34: 1499,
",       " 40: 1071,
",       " 42: 1022,
",       " 44: 1162,
",       " 31: 1714,
",       " 20: 256,
",       " 21: 392,
",       " 49: 863,
",       " 43: 1081,
",       " 51: 891,
",       " 61: 417,
",       " 23: 493,
",       " 26: 1322,
",       " 45: 1347,
",       " 54: 618,
",       " 41: 1158,
",       " 39: 1168,
",       " 56: 687,
",       " 50: 947,
",       " 57: 783,
",       " 48: 999,
",       " 52: 970,
",       " 66: 134,
",       " 63: 247,
",       " 70: 28,
",       " 67: 149,
",       " 18: 94,
",       " 19: 124,
",       " 53: 899,
",       " 65: 150,
",       " 71: 59,
",       " 62: 346,
",       " 59: 488,
",       " 64: 229,
",       " 74: 39,
",       " 77: 24,
",       " 81: 8,
",       " 68: 74,
",       " 73: 61,
",       " 75: 21,
",       " 72: 18,
",       " 69: 93,
",       " 17: 59,
",       " 115: 22,
",       " 16: 17,
",       " 80: 9,
",       " 76: 4,
",       " 105: 2,
",       " 89: 1,
",       " 86: 1,
",       " 114: 1,
",       " 93: 4}"      ]     },     "execution_count": 19,     "metadata": {},     "output_type": "execute_result"    }   ],   "source": [    "count"   ]  },  {   "cell_type": "code",   "execution_count": 22,   "metadata": {},   "outputs": [    {     "name": "stdout",     "output_type": "stream",     "text": [      " XXXXXXXXXX:00:00+00:00
",

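For problem 2, both counts can be expressed as folds with functools.reduce over the list of rows. This is only a sketch under stated assumptions: the column positions BIRTH_YEAR_COL and GENDER_COL are hypothetical and must be checked against the real file, the gender values are assumed to be the integers 1 and 2, and the sample rows are made up.

```python
from functools import reduce
import math

# Hypothetical column positions; check them against df.columns for the real file.
BIRTH_YEAR_COL = 0
GENDER_COL = 1

# Made-up rows standing in for df.values.tolist().
rows = [
    [1978.0, 1],
    [1990.0, 2],
    [float("nan"), 0],
    [1978.0, 1],
    [1985.0, 1],
]


def count_genders(rows, gender_col=GENDER_COL):
    """Fold the rows into a (gender-1 trips, gender-2 trips) tuple."""
    return reduce(
        lambda acc, row: (acc[0] + (row[gender_col] == 1),
                          acc[1] + (row[gender_col] == 2)),
        rows,
        (0, 0),
    )


def trips_per_birth_year(rows, year_col=BIRTH_YEAR_COL):
    """Drop rows with a nan birth year, then fold the rest into {year: trips}."""
    known = filter(lambda row: not math.isnan(row[year_col]), rows)

    def bump(acc, row):
        key = str(row[year_col])
        acc[key] = acc.get(key, 0) + 1
        return acc

    return reduce(bump, known, {})


print(count_genders(rows))
print(trips_per_birth_year(rows))
```

The keys are built with str() so that they take the "1900.0" form shown in the expected output of part (b).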
Sanchi · Accepted Answer

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd
"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    " 
",
    "data = pd.read_csv('C:/Users/sanchi.kalra/Desktop/Greynodes/AS18/citibike-ltwimtfd.csv')
"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "data['starttime'] =  pd.to_datetime(data['starttime'], format='%Y-%m-%d %H:%M:%S')
"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = data.resample('D', on= 'starttime').min()
"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "y =[]
"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "for m in x['starttime']:
",
    "    y.append(m)
"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = data[data['starttime'].isin(y)]
"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "data1 = df[['starttime','birth_year']]
"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "data1 = data1.groupby(data1['starttime'].unique())
"
   ]
  },
  {
   "cell_type": "code",
