

Apache Spark: Streaming Data

Data set: Baby names.csv (dropbox link)

For this exercise, you will be streaming data from your local file system and outputting the results to the console. The examples used in this exercise are loosely based on examples from Spark's Structured Streaming Programming Guide.

1. Stream Directory Data

In the first part of the exercise, you will create a simple Spark streaming program that reads an input stream from a file source. The file source stream reader reads data from a directory on a file system. When a new file is added to the folder, Spark adds that file’s data to the input data stream.

You can find the input data for this exercise in the baby-names/streaming directory. This directory contains the baby names CSV file, randomized and split into 98 individual files. You will use these files to simulate incoming streaming data.

a. Count the Number of Females

In the first part of the exercise, you will create a Spark program that monitors an incoming directory. To simulate streaming data, you will copy CSV files from the baby-names/streaming directory into the incoming directory. Since you will be loading CSV data, you will need to define a schema before you initialize the streaming dataframe.
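The copy step can be automated with a small driver script. This is a sketch using only the standard library; the source and destination directory names follow the exercise text, and the 10-second interval and 10-file limit are the values it asks for:

```python
# Sketch: simulate streaming input by copying one CSV file at a time
# from the split-file directory into the watched "incoming" directory.
import shutil
import time
from pathlib import Path

def feed_files(src_dir, dest_dir, limit=10, interval=10.0):
    """Copy up to `limit` CSV files from src_dir to dest_dir, pausing
    `interval` seconds after each copy."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for i, path in enumerate(sorted(Path(src_dir).glob("*.csv"))):
        if i >= limit:
            break
        shutil.copy(path, dest / path.name)
        time.sleep(interval)

if __name__ == "__main__":
    feed_files("baby-names/streaming", "incoming")
```

Run this in a second terminal while the streaming query is active, and watch the console output change as each file lands.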

From this input data stream, you will create a simple output data stream that counts the number of females and writes the result to the console. Approximately every 10 seconds, copy a new file into the directory and report the console output. Do this for the first ten files.

2. Micro-Batching

Repeat the last step, but use a micro-batch interval to trigger the processing every 30 seconds. Approximately every 10 seconds, copy a new file into the directory and report the console output. Do this for the first ten files. How did the output differ from the previous example?

Answered Same Day: Oct 09, 2021

Solution

Abr Writing answered on Oct 20, 2021
partA.py
# Importing the necessary libraries
from pyspark.sql.types import *
from pyspark.sql import SQLContext
from pyspark import SparkContext
import pandas as pd
import os

# Defining the constants for the Spark application
master = "local"
appName = "Streaming Data"
dataDirectory = "./incoming/"
path_to_watch = dataDirectory

# Declaring the schema for the data files
schema = StructType([StructField("state", StringType(), True),
                     StructField("sex", StringType(), True),
                     StructField("year", IntegerType(), True),
                     StructField("name", StringType(), True),
                     StructField("count", IntegerType(), True)])

# Creating a Spark context and polling the incoming data directory for new files
sc = SparkContext(master, appName)
sqlContext = SQLContext(sc)
before = dict([(f, None) for f in os.listdir(path_to_watch)])
while True:
    after = dict([(f,...