Option #1: Working with Big Data using Multithreading
The goal of this project is to use the concepts taught in this course to develop an efficient way of working with Big Data.
You should have two files on your Linux system: hugefile1.txt and hugefile2.txt, each containing one billion lines. If you do not, please go back to the Module 7 Portfolio Reminder and complete the steps there.
Create a program, in a programming language of your choice, that produces a new file, totalfile.txt, by adding the numbers on corresponding lines of the two input files. That is, each line of totalfile.txt is the sum of the corresponding lines in hugefile1.txt and hugefile2.txt.
For example, if the first 5 lines of your files look as follows:
$ head -5 hugefile*txt
==> hugefile1.txt <==
4131
29929
6483
7659
25003

==> hugefile2.txt <==
8866
19171
11029
4889
27069
then the first 5 lines of totalfile.txt look like this:
$ head -5 totalfile.txt
12997
49100
17512
12548
52072
Because files this large cannot be read into memory in their entirety, you need to use concurrency. Reading the files one line at a time will take a long time, so use what you have learned in this course to optimize the process. Be sure to record how long each version of your program takes to complete the task.
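For reference, a sequential baseline might look like the following sketch (Python is used here as one possible language choice; the file names come from the assignment). Iterating over a file object streams it line by line with internal buffering, so memory use stays bounded even for a billion lines, but all of the work happens on a single core:

import time

t0 = time.perf_counter()
with open("hugefile1.txt") as f1, open("hugefile2.txt") as f2, \
        open("totalfile.txt", "w") as out:
    # zip pairs up corresponding lines without loading either file fully
    for a, b in zip(f1, f2):
        out.write(f"{int(a) + int(b)}\n")
print(f"baseline: {time.perf_counter() - t0:.1f} s")

Timing each version the same way makes the later comparisons meaningful.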
Optimize the program by using threads, so that you benefit from the multiple cores in your CPU. Create a multithreaded program in which each thread works on its own chunk of the files.
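One possible shape for this, sketched in Python: divide the billion lines into fixed-size chunks, hand each worker one chunk, and stitch the partial outputs together at the end. The worker count, chunk arithmetic, and part-file names here are illustrative assumptions. Note that in CPython the global interpreter lock keeps threads from running Python bytecode on multiple cores at once, so for the CPU-bound parsing you would swap in ProcessPoolExecutor (in languages such as C or Java, threads map directly onto cores):

import concurrent.futures
import itertools

TOTAL_LINES = 1_000_000_000
WORKERS = 8                      # assumption: one worker per core
CHUNK = TOTAL_LINES // WORKERS

def sum_chunk(i):
    # Worker i skips ahead to its chunk and sums CHUNK line pairs.
    # (islice still scans the skipped lines; a byte-offset index built
    # in one preliminary pass would let workers seek directly instead.)
    with open("hugefile1.txt") as f1, open("hugefile2.txt") as f2, \
            open(f"chunk.{i:02d}", "w") as out:
        a = itertools.islice(f1, i * CHUNK, (i + 1) * CHUNK)
        b = itertools.islice(f2, i * CHUNK, (i + 1) * CHUNK)
        out.writelines(f"{int(x) + int(y)}\n" for x, y in zip(a, b))

if __name__ == "__main__":
    with concurrent.futures.ThreadPoolExecutor(max_workers=WORKERS) as pool:
        list(pool.map(sum_chunk, range(WORKERS)))
    # Concatenate the per-chunk outputs in order.
    with open("totalfile.txt", "w") as out:
        for i in range(WORKERS):
            with open(f"chunk.{i:02d}") as part:
                out.writelines(part)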
Now, break hugefile1.txt and hugefile2.txt into 10 files each, and run your process on all 10 pairs in parallel. How do the run times compare to those of the original process?
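A sketch of that workflow in Python (the part counts and file-name patterns are illustrative assumptions; the GNU split utility could do the splitting step equally well):

import concurrent.futures
import itertools

PARTS = 10
LINES_PER_PART = 100_000_000     # one billion lines / 10 parts

def split_file(src, prefix):
    # Stream src once, sending every LINES_PER_PART lines to its own part.
    with open(src) as f:
        for i in range(PARTS):
            with open(f"{prefix}.{i:02d}", "w") as out:
                out.writelines(itertools.islice(f, LINES_PER_PART))

def sum_pair(i):
    with open(f"hugefile1.part.{i:02d}") as f1, \
            open(f"hugefile2.part.{i:02d}") as f2, \
            open(f"totalfile.part.{i:02d}", "w") as out:
        out.writelines(f"{int(a) + int(b)}\n" for a, b in zip(f1, f2))

if __name__ == "__main__":
    split_file("hugefile1.txt", "hugefile1.part")
    split_file("hugefile2.txt", "hugefile2.part")
    # Process all 10 pairs at once, one OS process per pair.
    with concurrent.futures.ProcessPoolExecutor(max_workers=PARTS) as pool:
        list(pool.map(sum_pair, range(PARTS)))
    # Reassemble the partial results in order.
    with open("totalfile.txt", "w") as out:
        for i in range(PARTS):
            with open(f"totalfile.part.{i:02d}") as part:
                out.writelines(part)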
Explain your methods and results in detail. What conclusions can you draw about the different methods of optimizing large-file processing? How has what you learned in this course helped you accomplish this task?