CS-IT019-2-A-Data Management XXXXXXXXXXIndividual Assignment ...

Question

DATA MANAGEMENT
LEARNING OUTCOMES
1. Evaluate the various data types, data storage systems and associated techniques for indexing and retrieving data.
2. Design feature engineering techniques to transform transactional data into meaningful inputs in order to create a predictive model.
3. Propose a suitable approach to designing a data warehouse to store and process large datasets.
DATA MANAGEMENT
The machine learning pipeline involves several tasks before the development of a predictive or descriptive model. Preparing and understanding the data is an inevitable and vital part of that process. Moreover, the performance of the predictive or descriptive model depends on the choice of pre-processing techniques.
For the assignment, you are required to prepare and explore the given dataset. It is imperative to explain and justify the pre-processing, transformation, and feature engineering techniques that you choose. Your analysis should be deep and detailed, and it must go further than what has already been covered in this course.
The assignment should involve a number of experiments, and a detailed exploration and analysis of the results using SAS Studio, Apache Hadoop distribution, and Visual Analytics Tools (Tableau).
You need to do the following tasks:
1. Related Works
In this section, you should research and present other works related to the application domain.
2. Initial Data Exploration
This section should contain the following tasks.
· Indicate the type of each attribute (nominal, ordinal, interval or ratio).
· Identify the values of the summarising properties for each attribute, including frequency and spread, e.g. value ranges of the attributes, frequency of values, distributions, medians, means, variances, and percentiles. Wherever necessary, use proper visualisations for the corresponding statistics.
· Using SAS, explore your dataset and identify any outliers, missing values, "interesting" attributes and specific values of those attributes.
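The summarising properties listed above can be illustrated with a small sketch. The assignment itself requires SAS for this step, so the Python below is only a language-neutral illustration of the quantities involved. The sample values and the helper names `summarise_numeric` and `summarise_nominal` are hypothetical and not part of the dataset or the brief.

```python
import statistics
from collections import Counter

# Hypothetical sample values, invented purely for illustration.
ages = [25, 32, 32, 41, 47, 53, 60]
jobs = ["admin.", "blue-collar", "admin.", "student",
        "admin.", "blue-collar", "student"]


def summarise_numeric(values):
    """Return range, centre, spread and quartiles of a numeric attribute."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "variance": statistics.variance(values),  # sample variance
        "q1": q1,
        "q3": q3,
    }


def summarise_nominal(values):
    """Return the frequency of each distinct value of a nominal attribute."""
    return Counter(values)


age_summary = summarise_numeric(ages)
job_counts = summarise_nominal(jobs)
```

A numeric attribute such as Age gets range, centre and spread, while a nominal attribute such as Job only supports frequency counts, which is why the attribute type has to be identified first.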
3. Data Pre-processing
Investigate the required method(s) to handle the incomplete, noisy and inconsistent data.
Report each of the applied techniques with detailed explanations. Show your results and justify your approach.
NOTE: The easiest way to handle dirty data is to remove the affected feature(s) / instance(s). Choosing this method will be awarded ZERO for pre-processing.
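As a contrast to deleting rows, the sketch below fills missing entries in place. It is a minimal illustration only: median and mode imputation are just one possible choice, the treatment of "NA" as a missing marker is an assumption, and the helper `impute` and the sample rows are hypothetical. The assignment expects the chosen method to be implemented in SAS and justified.

```python
import statistics

# Markers treated as missing (assumption: "NA" denotes a missing value).
MISSING = (None, "NA", "")


def impute(rows, column, kind):
    """Fill missing entries of one column without dropping any row.

    kind="numeric" fills with the median of the observed values;
    any other kind fills with the most frequent observed value (mode).
    """
    observed = [r[column] for r in rows if r[column] not in MISSING]
    if kind == "numeric":
        fill = statistics.median(observed)
    else:
        fill = statistics.mode(observed)
    return [
        {**r, column: fill} if r[column] in MISSING else dict(r)
        for r in rows
    ]


# Hypothetical rows, invented purely for illustration.
rows = [
    {"Age": 30, "Job": "admin."},
    {"Age": None, "Job": "NA"},
    {"Age": 50, "Job": "admin."},
    {"Age": 40, "Job": "student"},
]
cleaned = impute(impute(rows, "Age", "numeric"), "Job", "nominal")
```

Every instance is retained, so the number of rows before and after cleaning is the same.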
4. Feature Engineering
Several Data Mining/Machine Learning algorithms are designed to work with either qualitative or quantitative data, and very few algorithms support mixed data. Hence, this task requires you to develop two datasets. The first dataset should represent all variables in qualitative form and the second dataset in quantitative form.
Individual attributes need to be discretized/transformed with appropriate method(s), and proper justification must be provided. In addition, metadata should be created for each dataset.
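The idea of producing one qualitative and one quantitative representation of the same record can be sketched as follows. This is an illustration under stated assumptions: the bin edges (30 and 60), the labels, and the helpers `discretise_age` and `one_hot` are hypothetical, and the sample record is invented. Suitable bins and encodings have to be chosen and justified from the actual data.

```python
def discretise_age(age):
    """Map a numeric age to an ordinal label (bin edges are illustrative)."""
    if age < 30:
        return "young"
    if age < 60:
        return "middle"
    return "senior"


def one_hot(value, categories):
    """Encode a nominal value as 0/1 indicator columns."""
    return {f"is_{c}": int(value == c) for c in categories}


# Hypothetical record, invented purely for illustration.
record = {"Age": 45, "Marital": "single"}

# Qualitative dataset: numeric attributes are discretised into labels.
qualitative = {
    "Age": discretise_age(record["Age"]),
    "Marital": record["Marital"],
}

# Quantitative dataset: nominal attributes become numeric indicators.
quantitative = {
    "Age": record["Age"],
    **one_hot(record["Marital"], ["divorced", "married", "single"]),
}
```

The metadata for each dataset would then record, per attribute, the transformation applied (for example the bin edges or the list of indicator columns).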
5. Exploratory Data Analysis (EDA)
This task requires you to perform an analysis on the two datasets generated during your feature engineering. You are evaluated based on the approaches undertaken to get familiar with the dataset.
6. Apache Hadoop
Load the dataset (cleaned or transformed) into Hive, with the tables configured for optimized read performance.
You are free to choose your own Apache Hadoop distribution (Hortonworks, Cloudera, MapR etc.).
7. Hypothesis
Formulate a minimum of FIVE (5) hypotheses based on the dataset (cleaned or transformed) with the required analytical variable(s). Interpret each hypothesis using the results of HiveQL queries and/or visualizations.
Deliverables
The deliverables include:
· A report whose structure follows the tasks of the assignment.
· SAS programs (Initial Data Exploration, Data Pre-processing, and Dataset Transformation) and Hive queries, with an individual file for each task.
Your report should include the following:
Abstract – A short, self-contained, and powerful brief that describes your work. It may contain the scope, purpose, results, and contents of the work. [180 to 250 words]
Introduction - The purpose of your report and background information about the topic. You also have to include some brief details of the methods applied in the study, and an outline of the structure of the report. [800 to 1000 words]
Related Work - Carefully structure your findings. It may be useful to do a chronological format where you discuss from the earliest to the latest research, placing your research appropriately in the chronology. Alternately, you could write in a thematic way, outlining the various themes that you discovered in the research regarding the topic. [1000 to 1500 words]
Method - This section should contain a detailed exploration of the dataset, pre-processing, feature engineering, EDA, Hive and Hypothesis. [No limit]
Discussion - Include a section title in your report for each of the tasks. Finally, you need to summarize your findings. This summary section should NOT be a narrative of your tasks, but an informative summary of what you found in the data. It should provide a detailed interpretation of the work, supported by the related works. [500 to 1000 words]
For example, it should include details like specific characteristics (or values) of some attributes, important information about the distributions, relationships or associations found between variables that should be investigated more rigorously, etc.
Conclusion – In this section, you need to state what you gained from this assignment that can contribute to other readers.
Documentation Format:
· Typeface: Times New Roman. Boldface, italic & lines can be used for emphasizing and to enhance readability.
· Font size: 12 (except titles and headings).
· Margins: 1” from the left, right, top & bottom of the edges of the A4 paper.
· Spacing: 1.5 lines between texts of a paragraph.
· Alignment: Justify.
· Headers and footers can be used, and all pages must be numbered accordingly.
· Standard cover page as available in the learning management system
Level Masters COMSATS 2019

Tutorial Business Analytics
Data Mining Cup (SS 2017)

Description
This is a dataset from one bank in the United States. Besides usual services, this bank also provides car insurance services. The bank organizes regular campaigns to attract new clients. The bank has potential customers’ data, and bank’s employees call them for advertising available car insurance options. We are provided with general information about clients (age, job, etc.) as well as more specific information about the current insurance sell campaign (communication, last contact day) and previous campaigns (attributes like previous attempts, outcome).
You have data about 4000 customers who were contacted during the last campaign and for whom the results of campaign (did the customer buy insurance or not) are known.
Classification Task
The task is to predict for 1000 customers who were contacted during the current campaign, whether they will buy car insurance or not.

Feature Overview

Feature | Description | Example
Id | Unique ID number. Predictions file should contain this feature. | “1” … “5000”
Age | Age of the client |
Job | Job of the client. | "admin.", "blue-collar", etc.
Marital | Marital status of the client | "divorced", "married", "single"
Education | Education level of the client | "primary", "secondary", etc.
Default | Has credit in default? | "yes" - 1, "no" - 0
Balance | Average yearly balance, in USD |
HHInsurance | Is household insured | "yes" - 1, "no" - 0
CarLoan | Has the client a car loan | "yes" - 1, "no" - 0
Communication | Contact communication type | "cellular", "telephone", “NA”
LastContactMonth | Month of the last contact | "jan", "feb", etc.
LastContactDay | Day of the last contact |
CallStart | Start time of the last call (HH:MM:SS) | 12:43:15
CallEnd | End time of the last call (HH:MM:SS) | 12:43:15
NoOfContacts | Number of contacts performed during this campaign for this client |
DaysPassed | Number of days that passed by after the client was last contacted from a previous campaign (numeric; -1 means client was not previously contacted) |
PrevAttempts | Number of contacts performed before this campaign and for this client |
Outcome | Outcome of the previous marketing campaign | "failure", "other", "success", “NA”
CarInsurance | Has the client subscribed a CarInsurance? | "yes" - 1, "no" - 0

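Two of the features above carry information that is easier to use after a small transformation: CallStart and CallEnd are HH:MM:SS strings, and DaysPassed uses -1 as a sentinel for "not previously contacted". The sketch below shows one possible way to handle both. The helper names `call_duration_seconds` and `previously_contacted` are hypothetical, and the code assumes a call starts and ends on the same day.

```python
from datetime import datetime

TIME_FORMAT = "%H:%M:%S"  # format given in the feature overview


def call_duration_seconds(call_start, call_end):
    """Return the length of a call in seconds.

    Assumes both times fall on the same day (no call crosses midnight).
    """
    start = datetime.strptime(call_start, TIME_FORMAT)
    end = datetime.strptime(call_end, TIME_FORMAT)
    return int((end - start).total_seconds())


def previously_contacted(days_passed):
    """Interpret the DaysPassed sentinel: -1 means never contacted before."""
    return days_passed != -1


duration = call_duration_seconds("12:43:15", "12:45:45")
```

Keeping the -1 sentinel as a number would distort means and variances of DaysPassed, so a separate indicator of this kind is one option worth considering.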

Answered Same Day Mar 05, 2021

Solution

Shikha answered on Mar 10 2021


Shikha · Accepted Answer

hadoopanswer/predictionsresult.csv
id,prediction
4001,0
4002,0
4003,0
4004,0
4005,0
4006,1
4007,0
4008,1
4009,0
4010,0
4011,0
4012,1
4013,1
4014,0
4015,0
4016,0
4017,1
4018,0
4019,0
4020,1
4021,0
4022,1
4023,0
4024,0
4025,0
4026,0
4027,1
4028,0
4029,0
4030,0
4031,0
4032,1
4033,1
4034,0
4035,0
4036,0
4037,0
4038,1
4039,1
4040,1
4041,1
4042,1
4043,0
4044,0
4045,1
4046,0
4047,1
4048,0
4049,1
4050,0
4051,0
4052,0
4053,0
4054,0
4055,0
4056,0
4057,1
4058,1
4059,0
4060,1
4061,1
4062,1
4063,0
4064,0
4065,0
4066,0
4067,0
4068,0
4069,0
4070,1
4071,0
4072,0
4073,0
4074,1
4075,1
4076,0
4077,0
4078,0
4079,1
4080,0
4081,0
4082,0
4083,1
4084,1
4085,1
4086,0
4087,1
4088,1
4089,0
4090,1
4091,1
4092,0
4093,1
4094,1
4095,1
4096,
