Programming Assignment: Machine Problem 5:
Spark MapReduce
Deadline
Pass this assignment by Mar 5, 9:59 PM PST
Instructions
1. Overview
Welcome to the Spark MapReduce programming assignment. You will implement the solution
to this machine problem in Python. To work on this assignment, you need Docker Desktop
installed.
2. General Requirements
Please note that our grader runs on a Docker container NOT connected to the internet.
Therefore, no additional libraries are allowed for this assignment (you can only use the
default Python libraries; no pip installs). Also, you will NOT be allowed to create
any file or folder outside the current folder (i.e., you can only create files and folders in
the folder that your solutions are in).
3. Setup
Download the docker file, build a docker image and run it in a container. If you have already
created this container, do not create a new one.
Copy commands below
# clone the repository and find the docker file
git clone https://github.com/UIUC-CS498-Cloud/MP5_SparkMapReduce_Template.git
cd MP5_SparkMapReduce_Template/Docker
# build an image for mp5 based on the docker file
docker build -t mp5 .
# create a container named 'mp5-cntr' for mp5 using the image mp5
docker run --name mp5-cntr -it mp5
# or start the 'mp5-cntr' container if you have created it
docker start -a mp5-cntr
4. Sorting
When selecting the top N items in a list, sorting is necessary. Use the following steps to sort:
1. Sort the list ASCENDING based on count first, then on the key. If the key is a string, sort
lexicographically.
2. Select the bottom N items of the sorted list as the top items.
This logic is implemented in the third example of the Hadoop MapReduce Tutorial.
For example, to select the top 5 items in the list {"A": 100, "B": 99, "C": 98, "D": 97, "E": 96,
"F": 96, "G": 90}, first sort the items ASCENDING:
"G": 90
"E": 96
"F": 96
"D": 97
"C": 98
"B": 99
"A": 100
Then, the bottom 5 items are A, B, C, D, F.
As another example, to select the top 5 items in the list {"43": 100, "12": 99, "44": 98, "12": 97,
"1": 96, "100": 96, "99": 90}, first sort the items ASCENDING (the keys are strings, so ties
are broken lexicographically):
"99": 90
"1": 96
"100": 96
"12": 97
"44": 98
"12": 99
"43": 100
Then, the bottom 5 items are 43, 12, 44, 12, 100.
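As a minimal plain-Python sketch of this rule (using the first example above), the ascending sort with a (count, key) tie-break looks like:

items = {"A": 100, "B": 99, "C": 98, "D": 97, "E": 96, "F": 96, "G": 90}
# Sort ascending by count first, then by key (lexicographic for strings).
ordered = sorted(items.items(), key=lambda kv: (kv[1], kv[0]))
# The bottom N items of the ascending list are the top N items.
top5 = ordered[-5:]
print([k for k, _ in top5])  # ['F', 'D', 'C', 'B', 'A']

The same (count, key) tuple also explains the second example: "1" sorts before "100" because the keys are compared as strings, not numbers.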
Submission
1. Requirements
This assignment will be graded based on Python 3.6.
2. Procedures
Step 1: Launch and go into the 'mp5-cntr' container after the setup. Note that files inside the
container and on the host machine are not shared, so you should clone the repository
again within the container. To download the templates and change into the template folder, run:
git clone https://github.com/UIUC-CS498-Cloud/MP5_SparkMapReduce_Template.git
cd MP5_SparkMapReduce_Template/PythonTemplate
Step 2: Finish the exercises by editing the provided template files. All you need to do is
complete the parts marked with TODO. Please note that you are NOT allowed to import
any additional libraries.
• Each exercise has one or more code templates. Simply edit these files.
• Our autograder runs the code on the provided Docker image.
More information about these exercises is provided in the next section.
Step 3: After you are done with the exercises, put all 5 of your Python files
(TitleCountSpark.py, TopTitleStatisticsSpark.py, OrphanPagesSpark.py,
TopPopularLinksSpark.py, PopularityLeagueSpark.py) into a .zip file named "MP5.zip".
Remember not to include the parent folder. Submit your "MP5.zip".
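For example, assuming the zip utility is available, running the following from inside the PythonTemplate folder creates the archive without a parent folder:

zip MP5.zip TitleCountSpark.py TopTitleStatisticsSpark.py OrphanPagesSpark.py TopPopularLinksSpark.py PopularityLeagueSpark.py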
Exercise A: Top Titles
In this exercise, you will implement a counter for words in Wikipedia titles and find the top
words used in these titles. We have provided a template for this exercise in the following
file: TitleCountSpark.py
You need to make the necessary changes to parts marked with TODO.
Your application takes a list of Wikipedia titles (one per line) as input and first
tokenizes them using the provided delimiters. It then lowercases the tokens and
removes the common words listed in the provided stopwords file. Next, your application selects the
top 10 words and saves their counts in the output. Use the method in the
Sorting section to select the top words.
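As a rough PySpark sketch of this pipeline (NOT the provided template: the template already parses the command-line arguments and handles file I/O, and the tokenize helper below assumes delimiters.txt is a single line of delimiter characters):

from pyspark import SparkContext

sc = SparkContext(appName="TitleCount")

# The real template reads these paths from sys.argv.
stop_words = set(open("stopwords.txt").read().split())
delimiters = open("delimiters.txt").read().strip()

def tokenize(line):
    # Replace every delimiter character with a space, then split.
    for d in delimiters:
        line = line.replace(d, " ")
    return line.split()

counts = (sc.textFile("dataset/titles/")
            .flatMap(tokenize)
            .map(lambda w: w.lower())
            .filter(lambda w: w and w not in stop_words)
            .map(lambda w: (w, 1))
            .reduceByKey(lambda a, b: a + b))

# Ascending by (count, key); the last 10 entries are the top words.
top10 = counts.sortBy(lambda kv: (kv[1], kv[0])).collect()[-10:]

# Final output: alphabetical order, tab-separated.
for word, n in sorted(top10):
    print(word + "\t" + str(n))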
You can test your output with:
# spark-submit TitleCountSpark.py stopwords.txt delimiters.txt dataset/titles/ partA
# cat partA
Here is an example output showing the top 5 words in alphabetical order. Note that the
autograder requires the top 10 (after they are chosen based on count):

The order of lines matters. Please sort the output in alphabetic order as shown above. Also,
make sure the key and value pairs in the final output are tab-separated.
Exercise B: Top Title Statistics
In this exercise, you will implement an application to find some statistics about the top words
used in Wikipedia titles. We have provided a template for this exercise in the following
file: TopTitleStatisticsSpark.py
You need to make the necessary changes to parts marked with TODO.
Your output from Exercise A will be used as the input here. The application saves the following
statistics about the top words in the output: “Mean”, “Sum”, “Minimum”, “Maximum”, and
“Variance” of the counts. All values should be floored to integers. For the sake of
simplicity, use integer arithmetic in all calculations.
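The integer arithmetic is the part that is easy to get wrong. As a plain-Python sketch (assuming partA holds tab-separated word/count lines, and leaving the RDD plumbing to the template):

counts = [int(line.split("\t")[1]) for line in open("partA")]

n = len(counts)
total = sum(counts)
mean = total // n  # floored integer mean
variance = sum((c - mean) ** 2 for c in counts) // n  # uses the floored mean
print("Mean\t%d" % mean)
print("Sum\t%d" % total)
print("Minimum\t%d" % min(counts))
print("Maximum\t%d" % max(counts))
print("Variance\t%d" % variance)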
The following is the sample command we will use to run the application:
# spark-submit TopTitleStatisticsSpark.py partA partB
# cat partB
Here is the output of an application that selects the top 5 words, though we still require the
top 10 as described above:

Make sure the stats and the corresponding results are tab-separated.
Exercise C: Orphan Pages
In this exercise, you will implement an application to find orphan pages in Wikipedia. We have
provided a template for this exercise in the following file: OrphanPagesSpark.py
You need to make the necessary changes to parts marked with TODO.
Your application takes a list of Wikipedia links (not Wikipedia titles anymore) as an input. All
pages are represented by their ID numbers. Each line starts with a page ID, followed by a list
of pages that the ID links to. The following is a sample line in the input:

In this sample, page 2 has links to pages 3, 747213, and so on. Note that links are not
necessarily two-way. The application should save the IDs of orphan pages in the output.
Orphan pages are pages to which no pages link. A page that links to itself is NOT an orphan
page.
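A rough PySpark sketch of the idea, assuming each input line looks like "source: target1 target2 ..." (the colon separator is an assumption; adjust parse to the real dataset format) and that every page appears as a source on some line:

from pyspark import SparkContext

sc = SparkContext(appName="OrphanPages")

def parse(line):
    src, _, rest = line.partition(":")
    return (src.strip(), rest.split())

pages = sc.textFile("dataset/links/").map(parse)

all_ids = pages.map(lambda p: p[0])                 # every page appears as a source line
linked = pages.flatMap(lambda p: p[1]).distinct()   # every page something links to
# A self-link puts the page in 'linked', so it is correctly not an orphan.
orphans = all_ids.subtract(linked)

for page in sorted(orphans.collect()):  # alphabetic (string) order
    print(page)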
The following is the sample command we will use to run the application:
# spark-submit OrphanPagesSpark.py dataset/links/ partC
# cat partC
# head partC
Here is a part of the output of this application:

The order of lines matters. Please sort your output (key value) in alphabetic order.
Exercise D: Top Popular Links
In this exercise, you will implement an application to find the top popular pages in Wikipedia. We have
provided a template for this exercise in the following file: TopPopularLinksSpark.py
You need to make the necessary changes to parts marked with TODO.
Your application takes a list of Wikipedia links (not Wikipedia titles anymore) as an input. All
pages are represented by their ID numbers. Each line starts with a page ID, followed by a list
of pages that the ID has a link to. The following is a sample line in the input:

In this sample, page 2 has links to pages 3, 747213, and so on. Note that links are not
necessarily two-way. The application should save the IDs of the top 10 popular pages and the
number of links to each of them in the output. A page's popularity is the number of pages that
link to it. Use the method in the Sorting section to select the top links.
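A rough PySpark sketch of counting in-links (same hypothetical colon-separated parsing as in Exercise C):

from pyspark import SparkContext

sc = SparkContext(appName="TopPopularLinks")

def parse(line):
    src, _, rest = line.partition(":")
    return (src.strip(), rest.split())

pages = sc.textFile("dataset/links/").map(parse)

# Emit (target, 1) for every incoming link, then count per page.
in_counts = (pages.flatMap(lambda p: [(t, 1) for t in p[1]])
                  .reduceByKey(lambda a, b: a + b))

# Ascending by (count, key); the bottom 10 of the sorted list are the top pages.
top10 = in_counts.sortBy(lambda kv: (kv[1], kv[0])).collect()[-10:]

for page, n in sorted(top10):  # alphabetical output, tab-separated
    print(page + "\t" + str(n))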
The following is the sample command we will use to run the application:
# spark-submit TopPopularLinksSpark.py dataset/links/ partD
# cat partD
Here is the output of an application that selects the top 5 popular links:

The order of lines matters. Please sort your output (key value) in alphabetical order. Also,
make sure the key and value pairs in the final output are tab-separated.
Exercise E: Popularity League
In this exercise, you will implement an application to find the most popular pages in
Wikipedia. Again, we have provided a template for this exercise in the following
file: PopularityLeagueSpark.py
You need to make the necessary changes to parts marked with TODO.
Your application takes a list of Wikipedia links as input. All pages are represented by their ID
numbers. Each line starts with a page ID, followed by the pages the ID links to. The following
is a sample line in the input:

In this sample, page 2 has links to pages 3, 747213, and so on. Note that links are not
necessarily two-way.
The popularity of a page is determined by the number of pages in the whole Wikipedia graph
that link to that specific page (the same number computed in Exercise D).
The application also takes a list of page IDs as an input (also called the league list). The goal of
the application is to calculate the rank of the pages in the league using their popularity.
A page's rank is the number of pages in the league with strictly lower popularity than that
page.
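A rough PySpark sketch of the ranking step (same hypothetical parsing as above, assuming league.txt holds one page ID per line; the league is assumed small enough to rank on the driver):

from pyspark import SparkContext

sc = SparkContext(appName="PopularityLeague")

def parse(line):
    src, _, rest = line.partition(":")
    return (src.strip(), rest.split())

pages = sc.textFile("dataset/links/").map(parse)
league = set(sc.textFile("dataset/league.txt").map(lambda l: l.strip()).collect())

# Popularity = number of in-links, exactly as in Exercise D.
popularity = (pages.flatMap(lambda p: [(t, 1) for t in p[1]])
                   .reduceByKey(lambda a, b: a + b)
                   .collectAsMap())

league_pop = {p: popularity.get(p, 0) for p in league}

# Rank = number of league pages with strictly lower popularity.
for page in sorted(league_pop):  # alphabetic output, tab-separated
    rank = sum(1 for v in league_pop.values() if v < league_pop[page])
    print(page + "\t" + str(rank))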
The following is the sample command we use to run the application:
# spark-submit PopularityLeagueSpark.py dataset/links/ dataset/league.txt partE
# cat partE
Here is the output with League={5300058,3294332,3078798,1804986,2370447,81615,3,1}:

Here is the output with
League={88822,774931,4861926,1650573,66877,5115901,75323,4189215}:

The order matters. Please sort your output (key value) in alphabetic order. Also, make sure
the key and value pairs in the final output are tab-separated.
Note that we will use a different League file in our autograder runs.