Detecting popular and trending topics in news articles is an important task for public opinion monitoring. In Part A, your task is to perform text data analysis over a dataset of Australian news from the ABC (Australian Broadcasting Corporation) using Spark RDD.
The dataset you are going to use contains news headlines published over several years. In this text file, each line is the headline of one news article, in the format "date,term1 term2 ... ...". The lines are sorted by date, and the terms are separated by the space character. A sample file is shown below:
XXXXXXXXXX,council chief executive fails to secure position
XXXXXXXXXX,council welcomes ambulance levy decision
XXXXXXXXXX,council welcomes insurance breakthrough
XXXXXXXXXX,fed opp to re introduce national insurance
XXXXXXXXXX,cowboys survive eels comeback
XXXXXXXXXX,cowboys withstand eels fightback
XXXXXXXXXX,castro vows cuban socialism to survive bush
XXXXXXXXXX,coronanomics things learnt about how coronavirus economy
XXXXXXXXXX,coronavirus at home test kits selling in the chinese community
XXXXXXXXXX,coronavirus campbell remess streams bear making classes
XXXXXXXXXX,coronavirus pacific economy foriegn aid china
XXXXXXXXXX,china builds pig apartment blocks to guard against swine flu
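As a small illustration of the line format, the sketch below splits one line into its year and its set of distinct terms. It assumes the date field begins with a four-digit year (the dates are elided in the sample above, so this is an assumption); the function name parse_headline and the date in the example call are hypothetical.

```python
def parse_headline(line):
    """Split 'date,term1 term2 ...' into (year, set of distinct terms)."""
    date, _, text = line.partition(",")
    year = date.strip()[:4]       # assumption: the date starts with a 4-digit year
    terms = set(text.split(" "))  # a set, so a term counts once per article
    terms.discard("")             # drop empty strings left by repeated spaces
    return year, terms


# Hypothetical date, real headline text from the sample file.
year, terms = parse_headline("20030219,council welcomes insurance breakthrough")
print(year, sorted(terms))
# 2003 ['breakthrough', 'council', 'insurance', 'welcomes']
```

Keeping the terms in a set matters later: the task counts the number of articles a term appears in, not the total number of occurrences.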
When you click the panel on the right you'll get a connection to a server that has, in your home directory, a text file called "abcnews.txt", containing some sample text (feel free to open the file and explore its contents). The entire dataset can be downloaded from https://www.kaggle.com/therohk/million-headlines.
Your task is to find the top-3 most frequent terms for each year. That is, for each year, select the 3 terms that appeared in the most articles of that year, which represent the hot topics. If several terms appear in the same number of articles, order them alphabetically (ascending).
Please ignore the "stop words" which are frequent but meaningless for this task, including: "to", "a", "an", "the", "for", "in", "on", "of", "at", "over", "with", "after", "and", "from", "new", "us", "by", "as", "man", "up", "says", "out", "is", "be", "are", "not", "pm", "am", "off", "more", "less", "no", "how".
In your output, sort the results by year. For each year (one line per year), sort the top-3 terms first by their article frequencies (highest first) and then by the terms in alphabetical order. For example, given the above data set, the output should be (using Spark RDD):
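The counting and tie-breaking rules can be checked on a handful of headlines in plain Python before moving to Spark. The sketch below is only a local illustration, not the required RDD solution; the names STOP_WORDS and top3 are made up for this example, and the four headlines are the first four sample lines, which are assumed here to belong to the same year.

```python
from collections import Counter

STOP_WORDS = {
    "to", "a", "an", "the", "for", "in", "on", "of", "at", "over", "with",
    "after", "and", "from", "new", "us", "by", "as", "man", "up", "says",
    "out", "is", "be", "are", "not", "pm", "am", "off", "more", "less",
    "no", "how",
}


def top3(headlines):
    """Return the 3 terms appearing in the most headlines of one year."""
    counts = Counter()
    for text in headlines:
        # set(): count each term once per article; then drop stop words
        counts.update(set(text.split()) - STOP_WORDS)
    # higher count first, ties broken alphabetically
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:3]]


sample = [
    "council chief executive fails to secure position",
    "council welcomes ambulance levy decision",
    "council welcomes insurance breakthrough",
    "fed opp to re introduce national insurance",
]
print(top3(sample))
# ['council', 'insurance', 'welcomes']
```

Here "insurance" and "welcomes" both appear in two articles, so the alphabetical rule places "insurance" first.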
XXXXXXXXXXcouncil insurance welcomes
XXXXXXXXXXcowboys eels survive
XXXXXXXXXXcoronavirus china economy
Write a Python program that uses Spark RDD to do this. A file called "rdd.py" has been created for you - you just need to fill in the details. Note that the efficiency (the time complexity) of your method will be considered for marking.
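One possible shape for an efficient pipeline is sketched below; it is an illustration under stated assumptions, not the reference solution. It counts (year, term) pairs with reduceByKey and then keeps only a bounded top-3 list per year with aggregateByKey, which avoids collecting and sorting every term of a year in one place. Assumptions: the date begins with a four-digit year, "stopwords.txt" holds whitespace-separated words, and the year and terms are separated by a tab in the output (the exact separator is not visible in the example output). The helper names rank_key, add_item, merge_tops, top3_per_year and main are hypothetical.

```python
import heapq
from functools import reduce


def rank_key(item):
    """Order (count, term) pairs: higher count first, then alphabetical."""
    count, term = item
    return (-count, term)


def add_item(top, item):
    """Fold one (count, term) pair into a sorted top-3 list."""
    return heapq.nsmallest(3, top + [item], key=rank_key)


def merge_tops(a, b):
    """Merge two partial top-3 lists into one."""
    return heapq.nsmallest(3, a + b, key=rank_key)


def top3_per_year(lines, stop_words):
    """lines: RDD of raw headline lines; stop_words: a Python set."""
    def pairs(line):
        date, _, text = line.partition(",")
        year = date[:4]  # assumption: the date starts with a 4-digit year
        return [((year, t), 1) for t in set(text.split()) if t not in stop_words]

    return (lines.flatMap(pairs)
                 .reduceByKey(lambda a, b: a + b)                # articles per (year, term)
                 .map(lambda kv: (kv[0][0], (kv[1], kv[0][1])))  # (year, (count, term))
                 .aggregateByKey([], add_item, merge_tops)       # bounded top-3 per year
                 .sortByKey())                                   # results ordered by year


def main():
    from pyspark import SparkContext  # imported here so the helpers run without Spark
    sc = SparkContext(appName="top3-terms")
    # assumption: stopwords.txt contains whitespace-separated words
    stop_words = set(sc.textFile("stopwords.txt").flatMap(lambda l: l.split()).collect())
    result = top3_per_year(sc.textFile("abcnews.txt"), stop_words)
    # assumption: a tab separates the year from the terms
    result.map(lambda kv: kv[0] + "\t" + " ".join(t for _, t in kv[1])) \
          .saveAsTextFile("result-rdd")
    sc.stop()


# The two combiner functions can be checked locally, without Spark:
items = [(3, "council"), (2, "welcomes"), (2, "insurance"), (1, "levy")]
print(reduce(add_item, items, []))
# [(3, 'council'), (2, 'insurance'), (2, 'welcomes')]
```

Because each per-year list never grows beyond three entries, the final stage does a constant amount of work per distinct (year, term) pair, and only the small top-3 lists are shuffled between partitions. In "rdd.py" the main() function would be called at the end of the file.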
To debug your code, you can first test everything in pyspark, and then write the code in "rdd.py". To test your program, you first need to create your default directory in Hadoop, and then copy abcnews.txt to it:
$ hdfs dfs -mkdir -p /user/user
$ hdfs dfs -put abcnews.txt
Similarly, please also upload the file "stopwords.txt" to HDFS, into the same folder "/user/user".
You can run your program on Spark by running the following command:
Please save your results in the 'result-rdd' folder in HDFS (i.e., use saveAsTextFile("result-rdd") in your code).