Consider the following similar problems to be solved in MapReduce paradigm. We have two text files...

Question

Consider the following similar problems to be solved in the MapReduce paradigm. We have two text files as follows:

File1: Every vaccinated person has good protection against the virus. A vaccinated person is safe even when exposed to infected persons.

File2: A vaccinated person still needs to wear a mask to have good protection against the virus. A vaccinated person is not safe if he spends much time in the company of infected persons.

Task-A: Our goal is to find the frequencies of all bigrams that occur in the two files taken together. A bigram is a sequence of two consecutive words in a file. Ignore all punctuation while forming bigrams from a file. For example, the first file has the bigrams “Every vaccinated”, “vaccinated person”, “person has”, etc.

Task-B: Our goal is to find a list of only those words, with their frequencies, that contain four or more letters and occur at least two times in the entire collection.

For each of the above tasks show the following:
1. The pseudocode of your Map function.
2. Assume that File1 is assigned to Mapper1 and File2 is assigned to Mapper2. Show the output of applying your Map function code to the inputs at each of the Mappers.
3. Show the output after Hadoop has performed the shuffle, sort, and collection by key operations.
4. The pseudocode of the Reduce function.
5. The final output of the program.

Programming language: Java
Programming platform: Hadoop

Manish · Accepted Answer

MapReduce paradigm
MapReduce is a programming model, and an associated implementation, for processing and generating large data sets on a cluster using a parallel, distributed algorithm.
A MapReduce program consists of a map procedure that filters and sorts data (for example, sorting students by first name into queues, one queue per name) and a reduce procedure that performs a summary operation (such as counting the students in each queue, yielding name frequencies). The "MapReduce system" (also called the "infrastructure" or "framework") orchestrates the processing by marshalling the distributed servers, running the various tasks in parallel, managing all communication and data transfer between the parts of the system, and providing redundancy and fault tolerance.
The model is a variation on the split-apply-combine strategy for data analysis. It is inspired by the map and reduce functions commonly used in functional programming, although their purpose in the MapReduce framework is not the same as in their original forms. The key contributions of the MapReduce framework are not the map and reduce functions themselves (which resemble, for example, the reduce and scatter operations of the 1995 Message Passing Interface standard) but the scalability and fault tolerance achieved for a variety of applications by optimising the execution engine.
As a result, a single-threaded MapReduce implementation is rarely faster than a traditional (non-MapReduce) implementation; speedups are usually seen only with multi-threaded implementations on multi-processor hardware.
The paradigm pays off only when the framework's optimised distributed shuffle operation (which reduces network communication cost) and its fault-tolerance features come into play. A good MapReduce algorithm therefore depends on minimising communication cost. MapReduce libraries have been written in many programming languages, with different levels of optimisation.
Apache Hadoop is a popular open-source implementation that includes support for distributed shuffles. The name MapReduce originally referred to a proprietary Google technology but has since been genericized. By 2014, Google no longer used MapReduce as its primary large-scale data processing model, and development on Apache Mahout had moved to more capable and less disk-oriented mechanisms that incorporated full map and reduce capabilities.
Example
The canonical MapReduce example counts the appearance of each word in a set of documents:
function map(String name, String document):
    // name: document name
    // document: document contents
    for each word w in document:
        emit (w, 1)

function reduce(String word, Iterator partialCounts):
    // word: a word
    // partialCounts: the partial counts collected for this word
    sum = 0
    for each pc in partialCounts:
        sum += pc
    emit (word, sum)
Each document is split into words, and the map function emits every word with a count of 1, using the word as the result key. The framework groups all pairs with the same key and feeds them to the same call to reduce. To get the total number of occurrences of that word, reduce only needs to sum its input values.
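The word-count flow described above can be tried without a cluster. The following is a minimal sketch in plain Java that only simulates the map, shuffle/sort and reduce steps in memory; it does not use the Hadoop API. The class name WordCountSketch and the parameters minLength and minCount are hypothetical names introduced for illustration, and lower-casing words and treating every non-letter as a separator are assumptions. With minLength = 4 and minCount = 2 it covers Task-B: the length filter can be applied in map, but the frequency filter has to wait until reduce, because the total for a word is only known once the counts from both mappers have been combined.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class WordCountSketch {
    static final String FILE1 = "Every vaccinated person has good protection against the virus. "
            + "A vaccinated person is safe even when exposed to infected persons.";
    static final String FILE2 = "A vaccinated person still needs to wear a mask to have good "
            + "protection against the virus. A vaccinated person is not safe if he spends much "
            + "time in the company of infected persons.";

    // Map: emit (word, 1) for every word with at least minLength letters.
    // Assumption: words are lower-cased and any non-letter separates words.
    static List<Map.Entry<String, Integer>> map(String document, int minLength) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String w : document.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!w.isEmpty() && w.length() >= minLength) {
                pairs.add(Map.entry(w, 1));
            }
        }
        return pairs;
    }

    // Shuffle and sort: collect all values per key; the TreeMap keeps keys sorted.
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> allPairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : allPairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce: sum the partial counts and emit the word only if the total reaches minCount.
    static void reduce(String word, List<Integer> partialCounts, int minCount,
                       Map<String, Integer> output) {
        int sum = 0;
        for (int pc : partialCounts) {
            sum += pc;
        }
        if (sum >= minCount) {
            output.put(word, sum);
        }
    }

    // Driver: one simulated mapper per document, then shuffle, then reduce per key.
    static TreeMap<String, Integer> run(int minLength, int minCount, String... documents) {
        List<Map.Entry<String, Integer>> allPairs = new ArrayList<>();
        for (String doc : documents) {
            allPairs.addAll(map(doc, minLength));
        }
        TreeMap<String, Integer> output = new TreeMap<>();
        shuffle(allPairs).forEach((word, counts) -> reduce(word, counts, minCount, output));
        return output;
    }

    public static void main(String[] args) {
        // Task-B settings: at least four letters, at least two occurrences overall.
        System.out.println(run(4, 2, FILE1, FILE2));
    }
}
```

Calling run(1, 1, ...) gives the plain word count from the pseudocode. Under the assumptions above, the Task-B call on the two files should yield against 2, good 2, infected 2, person 4, persons 2, protection 2, safe 2, vaccinated 4 and virus 2.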
In a sequence of words, a bigram is a pair of adjacent words. With punctuation ignored, the bigrams in the sequence "a b. c. d." are ("a", "b"), ("b", "c") and ("c", "d").
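For Task-A, the only change from word count is the key: map emits each pair of adjacent words instead of each single word, and reduce still sums the ones. The sketch below again simulates the phases in plain Java rather than using the Hadoop API. BigramCountSketch is a hypothetical name, and the choices to lower-case words, to split on every non-letter, and to let bigrams run across sentence boundaries within a file (since punctuation is ignored) are assumptions about how the task is meant to be read.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class BigramCountSketch {
    static final String FILE1 = "Every vaccinated person has good protection against the virus. "
            + "A vaccinated person is safe even when exposed to infected persons.";
    static final String FILE2 = "A vaccinated person still needs to wear a mask to have good "
            + "protection against the virus. A vaccinated person is not safe if he spends much "
            + "time in the company of infected persons.";

    // Map: drop punctuation, then emit ("w1 w2", 1) for each pair of adjacent words.
    // Assumption: words are lower-cased; remove toLowerCase to keep the original case.
    static List<Map.Entry<String, Integer>> map(String fileContents) {
        List<String> words = new ArrayList<>();
        for (String w : fileContents.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (int i = 0; i + 1 < words.size(); i++) {
            pairs.add(Map.entry(words.get(i) + " " + words.get(i + 1), 1));
        }
        return pairs;
    }

    // Shuffle and sort: collect all values per bigram; the TreeMap keeps keys sorted.
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> allPairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : allPairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce: the frequency of a bigram is the sum of its partial counts.
    static int reduce(String bigram, List<Integer> partialCounts) {
        int sum = 0;
        for (int pc : partialCounts) {
            sum += pc;
        }
        return sum;
    }

    // Driver: one simulated mapper per file, so no bigram spans two files.
    static TreeMap<String, Integer> run(String... files) {
        List<Map.Entry<String, Integer>> allPairs = new ArrayList<>();
        for (String file : files) {
            allPairs.addAll(map(file));
        }
        TreeMap<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(allPairs).entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        run(FILE1, FILE2).forEach((bigram, count) -> System.out.println(bigram + "\t" + count));
    }
}
```

Because each file is mapped separately, the last word of File1 and the first word of File2 never form a bigram. Under the assumptions above, "vaccinated person" should come out with frequency 4 and "a vaccinated" with frequency 3 on the two files.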
