
# ISYS3374 Business Analytics – Final Exam

Note: Submit your answers in a Word document, transferring the results from the Excel file into it. You must also submit your Excel files, but only the Word document will be marked. If you think there is an issue with any question, state your assumptions and explain them clearly in your report.
SECTION A: Discussion Questions
1- Explain the concept of imbalanced data in classification techniques and how it should be treated when developing classification models.
2- Explain the concept of overfitting. How can overfitting be avoided?
3- Give two examples of how logistic regression can be used. You only need to explain the problem. One example is a bank that uses logistic regression to classify its new customers for loan approval: the bank wants to identify customers who are more likely to default on their loan. Explain why you cannot use linear regression in your examples.

SECTION B: QUANTITATIVE QUESTIONS
1- The first sheet of the file Toy-Info contains 500 client records for customers who have bought special toys from an e-business website. Each record includes data on the types of product purchased (between 1 and 5), purchase amount (\$), age, gender, marital status, whether the client has a membership and whether the customer has a discount card.
A business analyst has applied the k-means clustering method to all seven variables. The analyst increased the number of clusters to recommend a proper value of k. The resulting tests for k=5 and k=6, shown in the following sheets of the file, revealed k=6 as the best value.

a) Explain how the analyst found that k=6 is a proper number of clusters. Refer to the relevant sheet name, table name and the values you compared.
b) Describe all 6 clusters by their average characteristics.

2- A company provides maintenance services for washing machines in Victoria. The company's analyst aims to estimate the repair time and the service cost for each maintenance job. He treats repair time as the dependent variable, which may be related to the number of months since the last service, the type of repair and the repairperson. The following table reports 10 sample maintenance jobs.

| Repair time (hours) | Months since last service | Type of repair | Repairperson |
|---|---|---|---|
| 2.1 | 2 | Mechanical | John |
| 2.8 | 2 | Electrical | John |
| 1.6 | 3 | Mechanical | John |
| 3.9 | 4 | Electrical | Bob |
| 2.5 | 6 | Mechanical | John |
| 3.1 | 6 | Electrical | John |
| 4.5 | 7 | Electrical | Bob |
| 4.7 | 8 | Electrical | Bob |
| 3.8 | 9 | Mechanical | Bob |
| 4.6 | 9 | Electrical | Bob |

a) Create an estimated simple regression model for this data where months since last service is the independent variable. What does the model indicate about the relationship between months since last service and repair time? How strong is the relationship? Report the accuracy measures and the equation.
b) Calculate the residual errors for each repair in the table and interpret the meaning of positive and negative residuals in this analysis. Which type of repair (electrical or mechanical) is more desirable, and which repairperson (John or Bob) has worked more efficiently?
c) Create a scatter chart with months since last service on the x axis for which the points representing electrical and mechanical repairs are shown in different colors. Create a similar chart of months since last service and repair time for which the points representing repairs by John and Bob are shown in different colors. Do these charts suggest any potential modifications to your simple linear regression model? Why?
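
One way to approach parts (a) and (b) outside Excel is sketched below in plain Python. It fits the least-squares line of repair time on months since last service using the ten rows of the table, then averages the residuals by repair type and by repairperson. The function names `fit_simple_regression` and `mean_residual_by` are illustrative choices, not part of the exam material, and the repairperson spelled "Bo" in the table is assumed to be Bob.

```python
# Repair data from the table:
# (repair time in hours, months since last service, repair type, repairperson).
REPAIRS = [
    (2.1, 2, "Mechanical", "John"),
    (2.8, 2, "Electrical", "John"),
    (1.6, 3, "Mechanical", "John"),
    (3.9, 4, "Electrical", "Bob"),
    (2.5, 6, "Mechanical", "John"),
    (3.1, 6, "Electrical", "John"),
    (4.5, 7, "Electrical", "Bob"),
    (4.7, 8, "Electrical", "Bob"),
    (3.8, 9, "Mechanical", "Bob"),
    (4.6, 9, "Electrical", "Bob"),
]


def fit_simple_regression(xs, ys):
    """Return (intercept, slope, r_squared) of the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared


def mean_residual_by(group_index, residuals):
    """Average residual per group (index 2 = repair type, 3 = repairperson)."""
    groups = {}
    for row, res in zip(REPAIRS, residuals):
        groups.setdefault(row[group_index], []).append(res)
    return {name: sum(vals) / len(vals) for name, vals in groups.items()}


months = [row[1] for row in REPAIRS]
hours = [row[0] for row in REPAIRS]
intercept, slope, r_squared = fit_simple_regression(months, hours)

# Residual = actual repair time minus predicted repair time.
residuals = [y - (intercept + slope * x) for x, y in zip(months, hours)]

print(f"time = {intercept:.3f} + {slope:.3f} * months, R^2 = {r_squared:.3f}")
print("mean residual by type:", mean_residual_by(2, residuals))
print("mean residual by person:", mean_residual_by(3, residuals))
```

A negative residual means the repair took less time than the line predicts for that number of months, and a positive residual means it took longer. On this data the mean residual is negative for mechanical repairs and for John, but because John and Bob handled different mixes of repair types, the grouped averages alone should be read with caution.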

3- The following data are the results of a 4-year study conducted to assess how age, weight, and gender influence the risk of diabetes. Risk is interpreted as the probability (times 100) that the patient will have diabetes over the next 4-year period.
a) Develop a multiple regression model that relates risk of diabetes to the person’s age, weight and gender. Present the regression formula as a mathematical equation. Interpret the coefficients of the regression and comment on the strength of the regression.
b) Develop an estimated multiple regression model that relates risk of diabetes to the person’s age, weight, gender and lifestyle. Present the regression formula as a mathematical equation. Interpret the coefficients of the regression and comment on the strength of the regression.
c) What is the risk percentage of diabetes over the next 4 years for a 55-year-old man living in a big city with 70 kg weight? Use both models to estimate the risk and compare the result.

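| Age | Weight (kg) | Gender | Lifestyle | Risk (%) |
|---|---|---|---|---|
| 53 | 78 | Female | Small town | 40 |
| 24 | 77 | Male | Big city | 23 |
| 77 | 83 | Female | Country | 67 |
| 88 | 89 | Female | Small town | 71 |
| 56 | 65 | Male | Big city | 45 |
| 71 | 82 | Female | Country | 54 |
| 53 | 79 | Female | Small town | 48 |
| 70 | 66 | Male | Small town | 49 |
| 80 | 80 | Female | Big city | 65 |
| 78 | 67 | Male | Big city | 59 |
| 71 | 69 | Male | Big city | 56 |
| 70 | 78 | Female | Small town | 59 |
| 67 | 75 | Male | Country | 46 |
| 77 | 95 | Female | Big city | 64 |
| 60 | 57 | Male | Country | 39 |
| 82 | 100 | Female | Big city | 73 |
| 66 | 85 | Male | Small town | 63 |
| 80 | 96 | Male | Big city | 87 |
| 62 | 83 | Female | Country | 52 |
| 59 | 93 | Male | Big city | 61 |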
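
The two models in question 3 differ only in whether lifestyle enters as a set of dummy variables. The sketch below, in plain Python, shows one possible encoding and fits both models by solving the normal equations. The choice of "Female" and "Country" as baseline categories is an assumption made here for illustration (any baseline gives the same fitted values), and the helper names `solve_linear_system`, `fit_ols` and `encode` are hypothetical.

```python
# Study data copied from the table: (age, weight_kg, gender, lifestyle, risk_percent).
PATIENTS = [
    (53, 78, "Female", "Small town", 40),
    (24, 77, "Male", "Big city", 23),
    (77, 83, "Female", "Country", 67),
    (88, 89, "Female", "Small town", 71),
    (56, 65, "Male", "Big city", 45),
    (71, 82, "Female", "Country", 54),
    (53, 79, "Female", "Small town", 48),
    (70, 66, "Male", "Small town", 49),
    (80, 80, "Female", "Big city", 65),
    (78, 67, "Male", "Big city", 59),
    (71, 69, "Male", "Big city", 56),
    (70, 78, "Female", "Small town", 59),
    (67, 75, "Male", "Country", 46),
    (77, 95, "Female", "Big city", 64),
    (60, 57, "Male", "Country", 39),
    (82, 100, "Female", "Big city", 73),
    (66, 85, "Male", "Small town", 63),
    (80, 96, "Male", "Big city", 87),
    (62, 83, "Female", "Country", 52),
    (59, 93, "Male", "Big city", 61),
]


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ols(rows, ys):
    """Least squares via the normal equations; returns (coefficients, r_squared)."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)
    mean_y = sum(ys) / len(ys)
    sse = sum((y - sum(b * v for b, v in zip(beta, r))) ** 2 for r, y in zip(rows, ys))
    sst = sum((y - mean_y) ** 2 for y in ys)
    return beta, 1 - sse / sst


def encode(age, weight, gender, lifestyle, with_lifestyle):
    """Design row: intercept, age, weight, male dummy, optional lifestyle dummies."""
    row = [1.0, float(age), float(weight), 1.0 if gender == "Male" else 0.0]
    if with_lifestyle:
        # Assumed baseline category: "Country".
        row += [1.0 if lifestyle == "Small town" else 0.0,
                1.0 if lifestyle == "Big city" else 0.0]
    return row


risk = [p[4] for p in PATIENTS]
results = {}
for label, with_lifestyle in (("without lifestyle", False), ("with lifestyle", True)):
    design = [encode(p[0], p[1], p[2], p[3], with_lifestyle) for p in PATIENTS]
    beta, r2 = fit_ols(design, risk)
    # Part (c): 55-year-old man, 70 kg, living in a big city.
    query = encode(55, 70, "Male", "Big city", with_lifestyle)
    estimate = sum(b * v for b, v in zip(beta, query))
    results[label] = (beta, r2, estimate)
    print(label, [round(b, 3) for b in beta], round(r2, 3), round(estimate, 1))
```

Because the first model is nested in the second, the second model's R² can never be lower. Comparing adjusted R² or the significance of the lifestyle coefficients is a fairer way to judge whether the extra variables are worth keeping.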

4- An internet provider in Australia is interested in identifying why some individuals are still undecided about buying the company's new NBN service. The first sheet of the file NBN-service contains a sample of customers with variables that tracked the decision outcome.
A business analyst has created a standard partition of the data with all tracked variables and 40% of observations in the training set, 35% in the validation set, and 25% in the test set. The analyst applied two logistic regression models to classify undecided customers of the company. The resultant output of the Solver software for both models has been added in the following sheets.
a) Determine the selected input variables in each model and explain why the analyst has changed one of the input variables.
b) Write the logistic regression equation obtained for the first model, shown in worksheet “4-1-1”, and predict whether a customer with a contract duration of 16 months, bonus data of 63 GB and usage of 237 GB will decide to buy the new service. Explain how you found the prediction.
c) Find the class 1 and class 0 errors based on the sheet “4-1-2” and compare your results with the confusion matrix. Explain which of these errors is more undesirable in this model.
d) Compare the accuracy of the second model (shown in worksheet “4-2-1”) with that of the first model. Which one do you recommend?
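
For part (c), the class 1 and class 0 errors are simply the misclassification rates within each actual class. The sketch below shows the arithmetic in Python. The counts are made up for illustration and are not taken from sheet “4-1-2”, and the function name `class_error_rates` is hypothetical.

```python
def class_error_rates(tp, fn, fp, tn):
    """Return (class_1_error, class_0_error, overall_error) from confusion counts.

    tp: actual 1 predicted 1    fn: actual 1 predicted 0
    fp: actual 0 predicted 1    tn: actual 0 predicted 0
    """
    class_1_error = fn / (tp + fn)  # share of actual 1s classified as 0
    class_0_error = fp / (tn + fp)  # share of actual 0s classified as 1
    overall_error = (fn + fp) / (tp + fn + fp + tn)
    return class_1_error, class_0_error, overall_error


# Hypothetical counts, for illustration only.
c1_error, c0_error, overall = class_error_rates(tp=40, fn=10, fp=5, tn=45)
print(f"class 1 error = {c1_error:.2%}, class 0 error = {c0_error:.2%}, "
      f"overall error = {overall:.2%}")
```

Which error is worse depends on the cost of each mistake for the company, for example whether missing a likely buyer is more costly than approaching a customer who will not buy.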

5- Paul has a new job in project management. He plans to invest the same amount of \$15,000 into a retirement account at the end of every year for the next 30 years. Suppose that annual return is 6%, then:
a) Create a data table which shows Paul the balance of retirement account for various levels of annual investments and returns.
b) If Paul aims to have \$1,500,000 at the end of the 30th year, how much should he invest annually?
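
Question 5 rests on the future value of an ordinary annuity, FV = P × ((1 + r)^n − 1) / r, with deposits made at the end of each year. The Python sketch below applies it to Paul's figures and builds a small two-way table. The grid of alternative investment amounts and returns is an assumption chosen for illustration, since the question does not specify which levels to tabulate.

```python
def future_value(payment, annual_rate, years):
    """Balance after `years` equal end-of-year deposits (ordinary annuity)."""
    return payment * ((1 + annual_rate) ** years - 1) / annual_rate


def required_payment(target, annual_rate, years):
    """Annual end-of-year deposit needed to reach `target` after `years`."""
    return target * annual_rate / ((1 + annual_rate) ** years - 1)


def balance_table(payments, rates, years):
    """Two-way data table: rows are annual investments, columns are returns."""
    return {p: {r: future_value(p, r, years) for r in rates} for p in payments}


YEARS = 30
base_balance = future_value(15_000, 0.06, YEARS)
needed = required_payment(1_500_000, 0.06, YEARS)

# Assumed grid of alternative scenarios (not given in the question).
payments = [10_000, 12_500, 15_000, 17_500, 20_000]
rates = [0.04, 0.05, 0.06, 0.07, 0.08]
table = balance_table(payments, rates, YEARS)

print(f"Balance with $15,000 per year at 6%: {base_balance:,.2f}")
print(f"Annual investment needed for $1,500,000: {needed:,.2f}")
for p in payments:
    print(p, [round(table[p][r]) for r in rates])
```

In Excel the same results come from the `FV` function with a two-variable data table for part (a) and Goal Seek or the `PMT` function for part (b).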

6- FSUB is a company that intends to introduce its products by advertising them on 3 relevant websites. The marketing manager of the company has labelled these websites A, B and C. Viewer estimates, cost per advertisement, and maximum usage
Answered Same Day May 30, 2020

## Solution

Pooja answered on Jun 01 2020
Section A
1)
Approaches to handling imbalanced datasets:
1. Data-level approach (resampling techniques):
   - Random under-sampling
   - Random over-sampling
   - Cluster-based over-sampling
   - Informed over-sampling: Synthetic Minority Over-sampling Technique (SMOTE)
   - Modified Synthetic Minority Over-sampling Technique (MSMOTE)
2. Algorithmic ensemble techniques:
   - Bagging-based
   - Boosting-based

When faced with imbalanced datasets, there is no one-stop solution for improving the accuracy of the prediction model. One may need to try several methods to find the sampling technique best suited to the dataset. In most cases, synthetic techniques such as SMOTE and MSMOTE will outperform conventional oversampling and undersampling methods.
For better results, one can combine synthetic sampling methods such as SMOTE and MSMOTE with advanced boosting methods such as gradient boosting and XGBoost.
One of the advanced bagging techniques commonly used to counter the imbalanced dataset problem is SMOTE bagging. It follows an entirely different approach from conventional bagging to create each Bag/Bootstrap. It generates the positive instances by the SMOTE Algorithm by setting a SMOTE resampling rate in each iteration. The set of negative instances is bootstrapped in each iteration.
Depending on the characteristics of the imbalanced data set, the most effective techniques will vary. Relevant evaluation parameters should be considered during the model comparison.
2)
Informed over-sampling: Synthetic Minority Over-sampling Technique (SMOTE)
Overfitting occurs when a model fits the training data too closely, including its noise, and therefore performs poorly on new data. With imbalanced data it arises when exact replicas of minority instances are added to the main dataset. SMOTE is an informed over-sampling technique used to avoid this. A subset of data is taken from the minority class, and new synthetic instances similar to it are created. These synthetic instances are then added to the original dataset, and the new dataset is used as the sample on which the classification models are trained.
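
The idea of creating synthetic minority instances rather than exact replicas can be sketched in a few lines of Python. This is a simplified illustration of SMOTE-style interpolation, not the reference algorithm or any library's implementation. The function name `smote_like_oversample`, the parameter `k` and the sample points are all hypothetical.

```python
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating toward nearby minority points.

    minority: list of numeric tuples (needs at least two points).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest other minority points, ranked by squared distance to `base`.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Hypothetical minority-class observations with two features.
minority_points = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (2.5, 2.5)]
new_points = smote_like_oversample(minority_points, n_new=6)
print(new_points)
```

Each synthetic point lies on a line segment between two existing minority observations, so the classifier sees plausible new examples instead of repeated copies of the same rows.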
3)
Logistic regression is used when the dependent variable is categorical in nature.
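Example 1: I want to predict the probability of default for customers of a credit card company on the basis of income and age. In this case logistic regression is applied. The expected regression equation, where $p$ is the probability of default, is of the form:

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 \cdot \text{income} + \beta_2 \cdot \text{age}$$

Example 2: I want to predict the probability of theft in an electricity department on the basis of the number of consumption units and the grade of the area (categorized as either high or low). In this case, logistic regression is an appropriate method for the analysis. The logistic regression equation, where $p$ is the probability of theft, is of the form:

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 \cdot \text{units} + \beta_2 \cdot \text{grade}$$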
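
A short Python sketch may help show why logistic rather than linear regression suits these examples: the logistic function keeps the prediction between 0 and 1, whereas a linear score can fall outside that range. The coefficients and customers below are invented purely for illustration and do not come from any fitted model.

```python
import math


def linear_score(intercept, coefficients, features):
    """Plain linear combination, as a linear regression would predict."""
    return intercept + sum(b * x for b, x in zip(coefficients, features))


def logistic_probability(intercept, coefficients, features):
    """Logistic regression prediction: the linear score mapped into (0, 1)."""
    z = linear_score(intercept, coefficients, features)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients for Example 1: income in thousands of dollars, age in years.
INTERCEPT = 2.0
COEFFICIENTS = [-0.05, -0.02]

customers = {"low income, young": [20, 25], "high income, older": [150, 60]}
for label, features in customers.items():
    score = linear_score(INTERCEPT, COEFFICIENTS, features)
    prob = logistic_probability(INTERCEPT, COEFFICIENTS, features)
    print(f"{label}: linear score = {score:.2f}, probability of default = {prob:.3f}")
```

The second customer's linear score is negative, which cannot be read as a probability, while the logistic output remains a valid probability that can be compared with a cutoff such as 0.5.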

Section B
1)
a)
Considering the table of original coordinates in sheet 1-a-1-1, there are 5 clusters with 122, 100, 61, 105 and 112 observations (500 in total). The observations are not evenly distributed across the clusters.
In sheet 1-a-2-1, the table of original coordinates in the data summary section shows that the number of observations in each cluster is approximately 80. This indicates that all points are nearly...