Part A
Part A - Spark RDD with text (8 marks)
Detecting popular and trending topics from the news articles is an important task for public opinion monitoring. In Part A your task is to perform text data analysis over a dataset of Australian news from ABC (Australian Broadcasting Corporation) using Spark RDD.
The dataset you are going to use contains data of news headlines published over several years. In this text file, each line is a headline of a news article, in format of "date, term1 term2 ... ... ". The lines are sorted by the date, and the terms are separated by the space character. A sample file is like below:
XXXXXXXXXX,council chief executive fails to secure position
XXXXXXXXXX,council welcomes ambulance levy decision
XXXXXXXXXX,council welcomes insurance breakthrough
XXXXXXXXXX,fed opp to re introduce national insurance
XXXXXXXXXX,cowboys survive eels comeback
XXXXXXXXXX,cowboys withstand eels fightback
XXXXXXXXXX,castro vows cuban socialism to survive bush
XXXXXXXXXX,coronanomics things learnt about how coronavirus economy
XXXXXXXXXX,coronavirus at home test kits selling in the chinese community
XXXXXXXXXX,coronavirus campbell remess streams bear making classes
XXXXXXXXXX,coronavirus pacific economy foriegn aid china
XXXXXXXXXX,china builds pig apartment blocks to guard against swine flu
When you click the panel on the right you'll get a connection to a server that has, in your home directory, a text file called "abcnews.txt", containing some sample text (feel free to open the file and explore its contents). The entire dataset can be downloaded from https://www.kaggle.com/therohk/million-headlines.
Your task is to find the top-3 most frequent terms for each year. That is, for each year, select 3 terms that appeared in the most articles of that year, which represent the hot topics. If some words appear in the same number of articles, sort them in ascending order alphabetically.
Please ignore the "stop words" which are frequent but meaningless for this task, including: "to", "a", "an", "the", "for", "in", "on", "of", "at", "over", "with", "after", "and", "from", "new", "us", "by", "as", "man", "up", "says", "in", "out", "is", "be", "are", "not", "pm", "am", "off", "more", "less", "no", "how".
In your output, sort the results by years. For each year (in one line), sort the top-3 terms first by their article frequencies and then by the terms in alphabetical order. For example, given the above data set, the output should be (using Spark RDD):
XXXXXXXXXXcouncil insurance welcomes
XXXXXXXXXXcowboys eels survive
XXXXXXXXXXcoronavirus china economy
Write a Python program that uses Spark RDD to do this. A file called "rdd.py" has been created for you - you just need to fill in the details. Note that the efficiency (the time complexity) of your method will be considered for marking.
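As a starting point, here is a minimal, hedged sketch of one possible approach. It is not the provided rdd.py starter; the application name, the "year = first four characters of the date" parsing, and the tab-separated output format are assumptions for illustration only.
# Minimal sketch only - assumes abcnews.txt and stopwords.txt sit in the HDFS home
# directory, that each line looks like "YYYYMMDD,term1 term2 ...", and that the
# output format is "year<TAB>term1 term2 term3". Adjust to match the real starter.
from pyspark import SparkContext

sc = SparkContext(appName="TopTermsPerYear")   # appName is illustrative

stopwords = set(sc.textFile("stopwords.txt").collect())

def headline_terms(line):
    date, _, text = line.partition(",")
    year = date[:4]                            # assumption: year is the first 4 characters
    terms = set(text.split()) - stopwords      # count a term at most once per headline
    return [((year, term), 1) for term in terms]

counts = (sc.textFile("abcnews.txt")
            .flatMap(headline_terms)
            .reduceByKey(lambda a, b: a + b))  # ((year, term), article frequency)

top3 = (counts
          .map(lambda kv: (kv[0][0], (kv[0][1], kv[1])))                  # year -> (term, freq)
          .groupByKey()
          .mapValues(lambda tf: sorted(tf, key=lambda t: (-t[1], t[0]))[:3])
          .sortByKey())                                                   # results sorted by year

top3.map(lambda kv: kv[0] + "\t" + " ".join(t for t, _ in kv[1])) \
    .saveAsTextFile("result-rdd")
If grouping all (term, frequency) pairs per year becomes a bottleneck, an aggregateByKey that keeps only a running top-3 per year avoids materialising the whole per-year list on one executor.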
To debug your code, you can first test everything in pyspark, and then write the code in "rdd.py". To test your program, you first need to create your default directory in Hadoop, and then copy abcnews.txt to it:
$ hdfs dfs -mkdir -p /user/user
$ hdfs dfs -put abcnews.txt
Similarly, please also upload the file "stopwords.txt" to HDFS, into the same folder "/user/user".
You can run your program on Spark by running the following command:
$ spark-submit rdd.py
Please save your results in the 'result-rdd' folder in HDFS.
Part B
Part B - Spark RDD with CSV (4 marks)
In Part B your task is to answer a question about the data in a CSV file using Spark RDD. When you click the panel on the right you'll get a connection to a server that has, in your home directory, the CSV file "orders.csv". It's one that you've seen before. Here are the fields in the file:
OrderDate (date)
ISBN (string)
Title (string)
Category (string)
PriceEach (decimal)
Quantity (integer)
FirstName (string)
LastName (string)
City (string)
Your task is to find the number of books ordered each day, sorted by the number of books descending, then order date ascending.
Your results should appear as the following:
XXXXXXXXXX,10
XXXXXXXXXX,8
XXXXXXXXXX,7
XXXXXXXXXX,6
XXXXXXXXXX,5
XXXXXXXXXX,4
XXXXXXXXXX,4
First (4 marks)
Write a Python program that uses Spark RDDs to do this. A file called "rdd.py" has been created for you - you just need to fill in the details. You should be able to modify programs that you have already seen in this week's content. To sort the RDD results, you can use sortBy; here is an example of it.
Hint:
tmp = [('a', 3), ('b', 2), ('a', 1), ('d', 4), ('2', 5)]
sc.parallelize(tmp).sortBy(lambda x: (x[0],x[1])).collect()
Output:
[('2', 5), ('a', 1), ('a', 3), ('b', 2), ('d', 4)]
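Putting the hint together with the task, a rough sketch of one way Part B could look is below. It is illustrative only, not the provided starter; it assumes orders.csv has a header row, that fields contain no embedded commas, and that dates sort correctly as strings.
# Rough sketch only - naive comma splitting breaks if Title values contain commas,
# and OrderDate is assumed to sort correctly as a string (e.g. ISO yyyy-mm-dd).
from pyspark import SparkContext

sc = SparkContext(appName="BooksPerDay")       # appName is illustrative

lines = sc.textFile("orders.csv")
header = lines.first()
rows = lines.filter(lambda l: l != header).map(lambda l: l.split(","))

per_day = (rows
             .map(lambda f: (f[0], int(f[5])))     # (OrderDate, Quantity)
             .reduceByKey(lambda a, b: a + b))     # total books per day

# Number of books descending, then order date ascending, as required.
result = per_day.sortBy(lambda kv: (-kv[1], kv[0]))

result.map(lambda kv: kv[0] + "," + str(kv[1])).saveAsTextFile("result-rdd")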
To test your program you first need to create your default directory in Hadoop, and copy orders.csv to it:
$ hdfs dfs -mkdir -p /user/user
$ hdfs dfs -put orders.csv
You can test your program by running the following command:
$ spark-submit rdd.py
Please save your results in the 'result-rdd' folder in HDFS.

Part A - Hive with text (4 marks)
In Part A your task is to answer a question about the data in an unprocessed text file using Hive. When you click the panel on the right you'll get a connection to a server that has, in your home directory, a text file called "walden.txt", containing some sample text (feel free to open the file and explore its contents)(it's an extract from Walden, by Henry David Thoreau).
In this text file, each line is a sentence. It is worth noting that there are multiple spaces at the end of each line in this unprocessed text file.
Your task is to find the average word lengths according to the first letters of sentences. For example, given a toy input file as shown below:
Aaa bbb
cc.
Ab b.
The output should be:
Letter A: 2.6
Because, for A, the word lengths are 3, 3, 3, 2 and 2, so the average is (3*3 + 2*2)/5 = 2.6.
You can assume that sentences are separated by full stops, and words are separated by spaces. For simplicity, include all punctuation, such as ',' and '.', when calculating word length, as in example1 and example2 (so the length of 'cc.' is 3, not 2). The case of letters can be ignored.
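The Hive script itself is up to you, but the averaging rule can be sanity-checked in plain Python first. The sketch below is only an illustration of the arithmetic (the function name and the toy string are made up, and it assumes every sentence ends with a full stop); it is not the Hive solution.
# Plain-Python check of the averaging rule described above, not the Hive script.
from collections import defaultdict

def avg_word_length_by_first_letter(text):
    totals = defaultdict(lambda: [0, 0])          # letter -> [sum of lengths, word count]
    for sentence in text.split("."):
        words = sentence.split()
        if not words:
            continue
        letter = words[0][0].upper()              # first letter of the sentence
        words[-1] += "."                          # restore the full stop removed by split()
        totals[letter][0] += sum(len(w) for w in words)
        totals[letter][1] += len(words)
    return {k: round(s / n, 2) for k, (s, n) in totals.items()}

print(avg_word_length_by_first_letter("Aaa bbb\ncc.\nAb b."))   # {'A': 2.6}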
Given the walden.txt file as input, the format of the output is "letter: avg_word_length" (the result should be rounded to two decimal places with round(x, 2)), as shown below:
Letter A XXXXXXXXXX
Letter B XXXXXXXXXX
Letter F XXXXXXXXXX
Letter I XXXXXXXXXX
Letter S XXXXXXXXXX
Letter T XXXXXXXXXX
Letter W XXXXXXXXXX
Write a Hive script to do this. A file called "script.hql" has been created for you - you just need to fill in the details. You should be able to modify Hive scripts that you have already seen in this week's content. You might use some User-Defined Functions (UDFs) which can be found here.
You can test your script by running the following command (it tells Hive to execute the commands contained in the file script.hql):
$ hive -f script.hql
This is worth 4 marks.
When you are happy that your job and script are correct, click "Submit".
Part B - Spark SQL with CSV (2 marks)
In Part B your task is to answer a question about the data in a CSV file using Spark DataFrames and SQL. When you click the panel on the right you'll get a connection to a server that has, in your home directory, the CSV file "orders.csv". It's one that you've seen before. Here are the fields in the file:
OrderDate (date)
ISBN (string)
Title (string)
Category (string)
PriceEach (decimal)
Quantity (integer)
FirstName (string)
LastName (string)
City (string)
Your task is to find the number of books ordered each day, sorted by the number of books descending, then order date ascending.
Your results should appear as the following:
XXXXXXXXXX,10
XXXXXXXXXX,8
XXXXXXXXXX,7
XXXXXXXXXX,6
XXXXXXXXXX,5
XXXXXXXXXX,4
XXXXXXXXXX,4
Write a Python program that uses Spark DataFrames and SQL to do this. A file called "sql.py" has been created for you - you just need to fill in the details. Again, you should be able to modify programs that you have already seen in this week's content.
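For reference, a minimal sketch along these lines might look as follows. It is illustrative only, not the provided sql.py starter; it assumes orders.csv has a header row and that "number of books" means the sum of Quantity per OrderDate.
# Minimal sketch only - column names follow the field list above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BooksPerDaySQL").getOrCreate()   # appName is illustrative

orders = spark.read.csv("orders.csv", header=True, inferSchema=True)
orders.createOrReplaceTempView("orders")

result = spark.sql("""
    SELECT OrderDate, SUM(Quantity) AS NumBooks
    FROM orders
    GROUP BY OrderDate
    ORDER BY NumBooks DESC, OrderDate ASC
""")

# Write "date,count" lines into the required output folder.
result.rdd.map(lambda r: "{},{}".format(r[0], r[1])).saveAsTextFile("result-sql")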
You can test your program by running the following command:
$ spark-submit sql.py
Please save your results in the 'result-sql' folder in HDFS.
When you are happy that your two programs are correct, click "Submit".
Part C - Spark SQL with CSV (6 marks)
COVID-19 has affected our lives significantly in recent years. In Part C your task is to do a data analysis task over a COVID-19 dataset stored in CSV format using Spark DataFrames and SQL. The COVID-19 dataset contains the cases by notification date and postcode, local health district, and local government area in NSW, Australia. The dataset is updated daily, except on weekends. Here are the fields in the file:
notification_date (date) -- e.g. XXXXXXXXXX, XXXXXXXXXX, etc.
postcode (integer) -- e.g. 2011, 2035, etc.
lhd_2010_code (string) -- local health district code, e.g. X720, X760, etc.
lhd_2010_name (string) -- local health district name, e.g. South Eastern Sydney, Northern Sydney, etc.
lga_code19 (string) -- local government area code, e.g. 17200, 16550, etc.
lga_name19 (string) -- local government area name, e.g. Sydney (C), Randwick (C), etc.
When you click the panel on the right you'll get a connection to a server, and in your home directory you can see a sample of the data set named "cases-locations.csv".
Your task is to find the maximum daily number of cases in each local health district (lhd), together with the date. Each line of your result should contain the local health district name, the local health district code, the date, and the maximum daily increase of total confirmed cases. The results should be sorted first by the daily increase in descending order, then by the date in ascending order, and finally by the local health district name (lhd_2010_name) in descending order. For a given local health district, if multiple dates share the same maximum daily number of cases, return all such dates.
For example, given the sample data set, your results should be as below:
Northern Sydney,X760, XXXXXXXXXX,44
South Eastern Sydney,X720, XXXXXXXXXX,41
Western Sydney,X740, XXXXXXXXXX,24
Hunter New England,X800, XXXXXXXXXX,22
South Western Sydney,X710
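One hedged sketch of how this could be approached with Spark SQL is shown below. It is illustrative only; it assumes cases-locations.csv has a header row, that each row represents one confirmed case, and that rows without a local health district are skipped.
# Minimal sketch only - a window function keeps every date that ties for the maximum.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MaxDailyCasesByLHD").getOrCreate()   # appName is illustrative

cases = spark.read.csv("cases-locations.csv", header=True, inferSchema=True)
cases.createOrReplaceTempView("cases")

result = spark.sql("""
    WITH daily AS (
        SELECT lhd_2010_name, lhd_2010_code, notification_date,
               COUNT(*) AS daily_cases                  -- assumes one row = one case
        FROM cases
        WHERE lhd_2010_name IS NOT NULL
        GROUP BY lhd_2010_name, lhd_2010_code, notification_date
    ),
    with_max AS (
        SELECT *, MAX(daily_cases) OVER (PARTITION BY lhd_2010_name) AS max_cases
        FROM daily
    )
    SELECT lhd_2010_name, lhd_2010_code, notification_date, daily_cases
    FROM with_max
    WHERE daily_cases = max_cases                       -- keep all tying dates
    ORDER BY daily_cases DESC, notification_date ASC, lhd_2010_name DESC
""")

result.show(truncate=False)   # or save in whatever format the starter expects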