Great Deal! Get Instant $10 FREE in Account on First Order + 10% Cashback on Every Order Order Now

Spark RDD with text (8 marks) Detecting popular and trending topics from the news articles is an important task for public opinion monitoring. In Part A your task is to perform text data analysis over...

1 answer below »

Spark RDD with text (8 marks)

Detecting popular and trending topics from the news articles is an important task for public opinion monitoring. In Part A your task is to perform text data analysis over a dataset of Australian news from ABC (Australian Broadcasting Corporation) using Spark RDD.

The dataset you are going to use contains data of news headlines published over several years. In this text file, each line is a headline of a news article, in format of "date, term1 term2 ... ... ". The lines are sorted by the date, and the terms are separated by the space character. A sample file is like below:

XXXXXXXXXX,council chief executive fails to secure position XXXXXXXXXX,council welcomes ambulance levy decision XXXXXXXXXX,council welcomes insurance breakthrough XXXXXXXXXX,fed opp to re introduce national insurance XXXXXXXXXX,cowboys survive eels comeback XXXXXXXXXX,cowboys withstand eels fightback XXXXXXXXXX,castro vows cuban socialism to survive bush XXXXXXXXXX,coronanomics things learnt about how coronavirus economy XXXXXXXXXX,coronavirus at home test kits selling in the chinese community XXXXXXXXXX,coronavirus campbell remess streams bear making classes XXXXXXXXXX,coronavirus pacific economy foriegn aid china XXXXXXXXXX,china builds pig apartment blocks to guard against swine flu

When you click the panel on the right you'll get a connection to a server that has, in your home directory, a text file called "abcnews.txt", containing some sample text (feel free to open the file and explore its contents). The entire dataset can be downloaded from https://www.kaggle.com/therohk/million-headlines.

Your task is to find the top-3 most frequent terms for each year. That is, for each year, select 3 terms that appeared in the most articles of that year, which represent the hot topics. If some words appear in the same number of articles, sort them in ascending order alphabetically.

Please ignore the "stop words" which are frequent but meaningless for this task, including: "to", "a", "an", "the", "for", "in", "on", "of", "at", "over", "with", "after", "and", "from", "new", "us", "by", "as", "man", "up", "says", "in", "out", "is", "be", "are", "not", "pm", "am", "off", "more", "less", "no", "how".

In your output, sort the results by years. For each year (in one line), sort the top-3 terms first by their article frequencies and then by the terms in alphabetical order. For example, given the above data set, the output should be (using Spark RDD):

XXXXXXXXXXcouncil insurance welcomes XXXXXXXXXXcowboys eels survive XXXXXXXXXXcoronavirus china economy

Write a Python program that uses Spark RDD to do this. A file called "rdd.py" has been created for you - you just need to fill in the details. Note that the efficiency (the time complexity) of your method will be considered for marking.

To debug your code, you can first test everything in pyspark, and then write the codes in "rdd.py". To test your program, you first need to create your default directory in Hadoop, and then copy abcnews.txt to it:

$ hdfs dfs -mkdir -p /user/user $ hdfs dfs -put abcnews.txt

Similarly, please also update the file "stopwords.txt" to HDFS, also in the folder "/user/user".

You can run your program on Spark by running the following command:

$ spark-submit rdd.py

Please save your results in the 'result-rdd' folder in HDFS (i.e., use saveAsTextFile("result-rdd") in your code).

Answered Same Day Aug 09, 2022

Solution

Aditi answered on Aug 10 2022
85 Votes
SOLUTION.PDF

Answer To This Question Is Available To Download

Related Questions & Answers

More Questions »

Submit New Assignment

Copy and Paste Your Assignment Here