Part B - MRJob and Hive with CSV (8 marks)
In Part B your task is to answer a question about the data in a CSV file, first using MRJob and then using Hive. Answering the same question about the same file with both tools makes it easier to see how the two techniques compare.
When you click the panel on the right you'll get a connection to a server that has, in your home directory, a CSV file called "orders.csv", containing data about book orders (feel free to open the file and explore its contents).
Here are the fields in the file:
OrderDate (date)
ISBN (string)
Title (string)
Category (string)
PriceEach (decimal(5,2))
Quantity (integer)
FirstName (string)
LastName (string)
City (string)
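As a rough illustration of how the field list above maps onto one line of the file, here is a minimal sketch in plain Python. It assumes the columns appear in the order listed and are comma-separated. The `Order` type, the `parse_order` helper and the sample row are all hypothetical, made up for illustration and not taken from orders.csv. OrderDate is kept as text because the date format is not stated here.

```python
import csv
from decimal import Decimal
from typing import NamedTuple


class Order(NamedTuple):
    """One row of the file, assuming the column order in the field list."""
    order_date: str      # kept as text; the date format is not specified
    isbn: str
    title: str
    category: str
    price_each: Decimal  # decimal(5,2)
    quantity: int
    first_name: str
    last_name: str
    city: str


def parse_order(line: str) -> Order:
    """Parse one CSV line into an Order (hypothetical helper)."""
    # csv.reader handles quoted fields, e.g. a title containing a comma.
    fields = next(csv.reader([line]))
    return Order(
        fields[0], fields[1], fields[2], fields[3],
        Decimal(fields[4]), int(fields[5]),
        fields[6], fields[7], fields[8],
    )


# Made-up sample row, for illustration only.
sample = '2020-01-15,0000000000,"Example Title, Vol. 1",FICTION,19.99,2,Jane,Doe,ATLANTA'
order = parse_order(sample)

# Dollar amount of this one order line: PriceEach times Quantity.
print(order.city, order.price_each * order.quantity)  # ATLANTA 39.98
```

If the real file starts with a header row, that row would need to be skipped before parsing.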
Your task is to find the total dollar amount of orders for each city.
Your results should appear as follows:
ATLANTA XXXXXXXXXX
AUSTIN XXXXXXXXXX
BOISE XXXXXXXXXX
CHEYENNE XXXXXXXXXX
CHICAGO XXXXXXXXXX
CODY XXXXXXXXXX
EASTPOINT XXXXXXXXXX
KALMAZOO XXXXXXXXXX
MACON XXXXXXXXXX
MIAMI XXXXXXXXXX
MORRISTOWN XXXXXXXXXX
SEATTLE XXXXXXXXXX
TALLAHASSEE XXXXXXXXXX
TRENTON XXXXXXXXXX
(There is no need to sort the results or remove the quotation marks.)
First (4 marks)
Write an MRJob job to do this. A file called "job.py" has been created for you - you just need to fill in the details. You should be able to modify MRJob jobs that you have already seen in this week's content.
You can test your job by running the following command (it tells Python to execute job.py, using orders.csv as input):
$ python job.py orders.csv
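To make the map and reduce steps concrete, the sketch below imitates the shape of such a job in plain Python, without importing MRJob. It is not the contents of job.py. The `mapper`, `reducer` and `run` functions and the sample lines are hypothetical, and the column positions (4 for PriceEach, 5 for Quantity, 8 for City) assume the field order listed earlier and no header row. In an MRJob class the first two would roughly correspond to the `mapper(self, _, line)` and `reducer(self, key, values)` methods.

```python
import csv
import json
from collections import defaultdict


def mapper(line):
    """Emit (city, order amount) for one CSV line."""
    fields = next(csv.reader([line]))
    price_each = float(fields[4])  # PriceEach (assumed position)
    quantity = int(fields[5])      # Quantity (assumed position)
    city = fields[8]               # City (assumed position)
    yield city, price_each * quantity


def reducer(city, amounts):
    """Sum all order amounts emitted for one city."""
    yield city, round(sum(amounts), 2)


def run(lines):
    """Stand-in for the framework: group mapper output by key, then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    results = []
    for key, values in grouped.items():
        results.extend(reducer(key, values))
    return results


# Made-up sample lines, for illustration only (not taken from orders.csv).
sample_lines = [
    "2020-01-15,0000000000,Example A,FICTION,19.99,2,Jane,Doe,ATLANTA",
    "2020-01-16,0000000001,Example B,FICTION,5.00,1,John,Roe,ATLANTA",
    "2020-01-17,0000000002,Example C,COOKING,12.50,3,Ann,Poe,BOISE",
]

for city, total in run(sample_lines):
    # Key and value are written as JSON separated by a tab, so the
    # city name appears inside quotation marks.
    print(json.dumps(city) + "\t" + json.dumps(total))
```

The final loop mirrors MRJob's default output, which writes each key and value as JSON separated by a tab. That is most likely why the city names come out in quotation marks when the real job runs.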
Second (4 marks)
Write a Hive script to do this. A file called "script.hql" has been created for you - you just need to fill in the details. You should be able to modify Hive scripts that you have already seen in this week's content.
You can test your script by running the following command (it tells Hive to execute the commands contained in the file script.hql):
$ hive -f script.hql