

Part A - MRJob with text (6 marks) [file attached abcnews.txt]
Detecting popular and trending topics from the news articles is an important task for public opinion monitoring. In Part A, your task is to perform text data analysis over a dataset of Australian news from ABC (Australian Broadcasting Corporation) using MRJob.
The dataset you will use contains news headlines published over several years. In this text file, each line is the headline of one news article, in the format "date, term1 term2 ... ... ". The lines are sorted by date, and the terms are separated by spaces. A sample file looks like this:
XXXXXXXXXX,woman sta[..]ed adelaide shopping centre
XXXXXXXXXX,economy continue teetering edge recession
XXXXXXXXXX,coronanomics learnt coronavirus economy
XXXXXXXXXX,coronavirus home test kits selling chinese community
XXXXXXXXXX,coronavirus pacific economy foriegn aid china
XXXXXXXXXX,china builds pig apartment blocks guard swine flu
XXXXXXXXXX,economy starts bounce unemployment
XXXXXXXXXX,online shopping rise due coronavirus
XXXXXXXXXX,china close encounters elon musks
When you click the panel on the right you'll get a connection to a server that has, in your home directory, a text file called "abcnews.txt", containing some sample text (feel free to open the file and explore its contents). The entire dataset can be downloaded from https://www.kaggle.com/therohk/million-headlines.
Your task is to compute, for each term, the year in which it appears most often. That is, for each term, count how many articles contain it in each year, then select the year with the most such articles (if an article contains a term multiple times, it contributes only 1 to the frequency). If a term reaches the same highest frequency in several years, select the earliest of those years.
In your output, each line contains a key-value pair, where the key is the term, and the value is a pair of the year and this term's frequency in this year. For example, given the above data set, the output should be (there is no need to remove the quotation marks):
"adelaide" "2019:1"
"aid" "2020:1"
"apartment" "2020:1"
"blocks" "2020:1"
"bounce" "2021:1"
"builds" "2020:1"
"centre" "2019:1"
"china" "2020:2"
"chinese" "2020:1"
"close" "2021:1"
"community" "2020:1"
"continue" "2019:1"
"coronanomics" "2020:1"
"coronavirus" "2020:3"
"due" "2021:1"
"economy" "2020:2"
"edge" "2019:1"
"elon" "2021:1"
"encounters" "2021:1"
"flu" "2020:1"
"foriegn" "2020:1"
"guard" "2020:1"
"home" "2020:1"
"kits" "2020:1"
"learnt" "2020:1"
"musks" "2021:1"
"online" "2021:1"
"pacific" "2020:1"
"pig" "2020:1"
"recession" "2019:1"
"rise" "2021:1"
"selling" "2020:1"
"shopping" "2019:1"
"sta[..]ed" "2019:1"
"starts" "2021:1"
"swine" "2020:1"
"teetering" "2019:1"
"test" "2020:1"
"unemployment" "2021:1"
"woman" "2019:1"
Write an MRJob job to do this. A file called "job.py" has been created for you - you just need to fill in the details. You can test your job locally by running the following command (it tells Python to execute job.py, using abcnews.txt as input, but the results may not be sorted by years):
$ python job.py abcnews.txt
To run your code on Hadoop MapReduce, you can use the following command (the results would be sorted as you can see in "output"):
$ python job.py abcnews.txt -r hadoop > output
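One possible shape for the Part A logic is sketched below in plain Python. It is not the reference solution and it deliberately does not import mrjob: the three generator functions mirror what a mapper, a combiner and a reducer could yield, while `shuffle` and `run` are hypothetical local stand-ins for the grouping the framework does between steps. The sketch assumes every date begins with a four-digit year (the sample dates are redacted above), and the dates in the demo are made up.

```python
from collections import defaultdict


def map_headline(line):
    """Mapper: emit ((term, year), 1) once per distinct term in a headline."""
    date, _, text = line.strip().partition(",")
    year = date[:4]  # assumption: the date starts with a four-digit year
    for term in set(text.split()):  # set(): repeats in one headline count once
        yield (term, year), 1


def sum_counts(key, counts):
    """Combiner and first reducer: add up partial counts for one (term, year)."""
    yield key, sum(counts)


def pick_best_year(term, year_counts):
    """Final reducer: highest count wins; ties go to the earliest year."""
    year, count = min(year_counts, key=lambda yc: (-yc[1], yc[0]))
    yield term, f"{year}:{count}"


def shuffle(pairs):
    """Hypothetical local stand-in for the shuffle: group values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def run(lines):
    # Step 1: count the articles that contain each term in each year.
    mapped = (pair for line in lines for pair in map_headline(line))
    step1 = [out for key, vals in shuffle(mapped) for out in sum_counts(key, vals)]
    # Step 2: re-key by term so one reducer call sees every year of that term.
    rekeyed = ((term, (year, n)) for (term, year), n in step1)
    return [out for term, ycs in shuffle(rekeyed) for out in pick_best_year(term, ycs)]


if __name__ == "__main__":
    demo = [  # made-up dates, not the redacted ones from the sample
        "20190101,economy continue teetering edge recession",
        "20200101,coronanomics learnt coronavirus economy",
        "20200102,coronavirus pacific economy foriegn aid china",
    ]
    for term, best in run(demo):
        print(f'"{term}" "{best}"')
```

In job.py, functions like these would typically become methods of an MRJob subclass, chained as two steps, because the final choice needs all years of a term under a single key. Sorting on `(-count, year)` is what implements the "earliest year on a tie" rule.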
Part B - MRJob with CSV (4 marks) [file attached orders.csv]
In Part B your task is to answer a question about the data in a CSV file using MRJob. When you click the panel on the right you'll get a connection to a server that has, in your home directory, a CSV file called "orders.csv", containing data about book orders (feel free to open the file and explore its contents).
Here are the fields in the file:
OrderDate (date)
ISBN (string)
Title (string)
Category (string)
PriceEach (decimal(5,2))
Quantity (integer)
FirstName (string)
LastName (string)
City (string)
Your task is to compute the average cost of books per customer, i.e., the total spent for books of a customer divided by the number of books purchased by the customer.
The result should be rounded to two decimal places, with round(x,2), as shown below (MRJob output):
"BECCA NELSON" XXXXXXXXXX
"BONITA MORALES" XXXXXXXXXX
"CINDY GIRARD" XXXXXXXXXX
"GREG MONTIASA" XXXXXXXXXX
"JAKE LUCAS" XXXXXXXXXX
"JASMINE LEE" XXXXXXXXXX
"JENNIFER SMITH" XXXXXXXXXX
"KENNETH FALAH" XXXXXXXXXX
"KENNETH JONES" XXXXXXXXXX
"LEILA SMITH" XXXXXXXXXX
"REESE MCGOVERN" XXXXXXXXXX
"STEVE SCHELL" XXXXXXXXXX
"TAMMY GIANA" XXXXXXXXXX
"THOMAS PIERSON" XXXXXXXXXX
Write an MRJob job to do this. A file called "job.py" has been created for you - you just need to fill in the details. Note that you are required to implement a combiner to do this task.
You can test your job locally by running the following command (it tells Python to execute job.py locally, using orders.csv as the input):
$ python job.py orders.csv
To run your code on Hadoop, you can use the following command (the results would be sorted by keys as you can see in "output"):
$ python job.py orders.csv -r hadoop > output
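The combiner requirement in Part B can be illustrated with a plain-Python sketch. Again, this is not the reference solution and does not import mrjob: `map_order`, `combine_partials` and `reduce_average` mirror what a mapper, combiner and reducer could yield, and `group` and `run` are hypothetical local stand-ins for the framework. It assumes the column order listed above, that the file may begin with a header row, and that names are stored in the casing wanted in the output.

```python
import csv
from collections import defaultdict


def map_order(line):
    """Mapper: emit (customer, (amount spent, books bought)) for one CSV row."""
    row = next(csv.reader([line]), [])
    if not row or row[0] == "OrderDate":  # assumption: skip blank/header rows
        return
    price, quantity = float(row[4]), int(row[5])  # PriceEach, Quantity
    customer = f"{row[6]} {row[7]}"  # FirstName LastName
    yield customer, (price * quantity, quantity)


def combine_partials(customer, pairs):
    """Combiner: add up (spent, books) pairs; it must not divide yet."""
    spent = books = 0
    for s, b in pairs:
        spent += s
        books += b
    yield customer, (spent, books)


def reduce_average(customer, pairs):
    """Reducer: merge the partial sums, then compute the rounded average."""
    spent = books = 0
    for s, b in pairs:
        spent += s
        books += b
    yield customer, round(spent / books, 2)


def group(pairs):
    """Hypothetical local stand-in for the shuffle: group values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def run(lines, splits=2):
    combined = []
    for i in range(splits):  # each slice plays the role of one map task
        mapped = (p for line in lines[i::splits] for p in map_order(line))
        for customer, pairs in group(mapped):
            combined.extend(combine_partials(customer, pairs))
    return [out for customer, pairs in group(combined)
            for out in reduce_average(customer, pairs)]
```

The design point is that the combiner passes along partial (total spent, books bought) pairs rather than averages. A combiner may run on only part of a customer's orders, and an average of partial averages would generally differ from the true total divided by the true count.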


Website: https://edstem.org/au
Username: XXXXXXXXXX
Password: Homework1968
Click on Big Data Management.
Click on Lessons in the right corner.
Scroll down and you will see the week 5 and 6 assessment tasks.
All the files you need are in each assessment's toggle files.
Click on the terminal to test code.
All the instructions are in the descriptions, and the slides are used to choose the Part A, B, or C section of the assessment.
If you don't submit the code, it should still be saved, but press Submit to make sure. Each file can be submitted multiple times.
Answered 4 days after Jul 26, 2022

Solution

Rushendra answered on Jul 31 2022
84 Votes