ISYS3374 Business Analytics – Final Exam

Question

Note: You need to submit your answers in a Word document. You need to transfer the results from the Excel file into the Word document. In addition, you must submit your Excel files, but note that only the Word document will be marked. If you think there is any issue with any question, please make your assumptions and clearly explain them in your report.

SECTION A: Discussion Questions

1- Explain the concept of imbalanced data in classification techniques and how it should be treated when developing classification models.

2- Explain the concept of over-fitting. Explain how over-fitting can be avoided.

3- Give two examples of how logistic regression can be used. You only need to explain the problem. One example is a bank that uses logistic regression to classify its new customers for loan approval. The bank wants to identify customers who are more likely to default on their loan. Explain why you cannot use linear regression in your examples.

SECTION B: QUANTITATIVE QUESTIONS

1- There are 500 client records in the first sheet of the file Toy-Info for clients who have bought special toys from an e-Business website. Each record includes data on types of product purchased (between 1-5), purchase amount ($), age, gender, marital status, whether the client has a membership and whether the customer has a discount card. A business analyst has applied the k-means clustering method on all seven variables. The analyst increased the number of clusters to recommend a proper value of k. The resultant tests for k=5 and k=6, shown in the following sheets of the file, revealed the best k as k=6.
a) Explain how the analyst found that k=6 is a proper number of clusters. Refer to the relevant sheet name, table name and the values you compared.
b) Describe all 6 clusters by their average characteristics.

2- A company provides maintenance service for washing machines in Victoria. The analyst of the company aims to estimate the repair time and the service cost for each maintenance. He takes the repair time as the dependent variable, which can be related to the number of months since last service, the type of repair and the repairperson. The following table reports 10 samples of the maintenances.

Repair time (hours)    Months since last service    Type of repair    Repairperson
2.1                    2                            Mechanical        John
2.8                    2                            Electrical        John
1.6                    3                            Mechanical        John
3.9                    4                            Electrical        Bob
2.5                    6                            Mechanical        John
3.1                    6                            Electrical        John
4.5                    7                            Electrical        Bob
4.7                    8                            Electrical        Bob
3.8                    9                            Mechanical        Bob
4.6                    9                            Electrical        Bob

a) Create an estimated simple regression model for this data where months since last service is the independent variable. What does the model indicate about the relationship between months since last service and repair time? How strong is the relationship? Report the accuracy measures and the equation.
b) Calculate the residual errors for each repair in the table and interpret the meaning of positive and negative values of the residuals in this analysis. Which type of repair (electrical or mechanical) is more desirable and which repairperson (John or Bob) has worked more efficiently?
c) Create a scatter chart with months since last service on the x axis for which the points representing electrical and mechanical repairs are shown in different colors. Create a similar chart of months since last service and repair time for which the points representing repairs by John and Bob are shown in different colors. Do these charts suggest any potential modifications to your simple linear regression model? Why?

3- The following data are the results of a 4-year study conducted to assess how age, weight, and gender influence the risk of diabetes. Risk is interpreted as the probability (times 100) that the patient will have diabetes over the next 4-year period.
a) Develop a multiple regression model that relates risk of diabetes to the person's age, weight and gender. Present the regression formula as a mathematical equation. Interpret the coefficients of the regression and comment on the strength of the regression.
b) Develop an estimated multiple regression model that relates risk of diabetes to the person's age, weight, gender and life style. Present the regression formula as a mathematical equation. Interpret the coefficients of the regression and comment on the strength of the regression.
c) What is the risk percentage of diabetes over the next 4 years for a 55-year-old man living in a big city with 70 kg weight? Use both models to estimate the risk and compare the results.

Age    Weight (kg)    Gender    Life style    Risk (%)
53     78             Female    Small town    40
24     77             Male      Big city      23
77     83             Female    Country       67
88     89             Female    Small town    71
56     65             Male      Big city      45
71     82             Female    Country       54
53     79             Female    Small town    48
70     66             Male      Small town    49
80     80             Female    Big city      65
78     67             Male      Big city      59
71     69             Male      Big city      56
70     78             Female    Small town    59
67     75             Male      Country       46
77     95             Female    Big city      64
60     57             Male      Country       39
82     100            Female    Big city      73
66     85             Male      Small town    63
80     96             Male      Big city      87
62     83             Female    Country       52
59     93             Male      Big city      61

4- An internet provider company in Australia is interested in identifying the reasons why individuals are still undecided about buying the company's new NBN service. The first sheet of the file NBN-service contains a sample of customers with variables that tracked the decision outcome. A business analyst has created a standard partition of the data with all tracked variables and 40% of observations in the training set, 35% in the validation set, and 25% in the test set. The analyst applied two logistic regression models to classify undecided customers of the company. The resultant output of the Solver software for both models has been added in the following sheets.
a) Determine the selected input variables in each model and explain why the analyst has changed one of the input variables.
b) Write the obtained logistic regression equation for the first model shown in worksheet "4-1-1" and predict whether a customer with a Contract duration of 16 months, Bonus data of 63 GB and Usage of 237 GB will decide to buy the new service or not. Explain how you found the prediction.
c) Find the class 1 and class 0 errors based on the sheet "4-1-2" and compare your results with the confusion matrix. Explain which kind of these errors is more undesirable in this model.
d) For the second model (shown in worksheet "4-2-1"), compare the accuracy of the model with the first model. Which one do you recommend?

5- Paul has a new job in project management. He plans to invest the same amount of $15,000 into a retirement account at the end of every year for the next 30 years. Suppose that the annual return is 6%, then:
a) Create a data table which shows Paul the balance of the retirement account for various levels of annual investments and returns.
b) If Paul aims to have $1,500,000 at the end of the 30th year, how much money should he put into the investment annually?

6- FSUB is a company that intends to introduce its products by advertising them on 3 relevant websites. The names of these websites are determined as A, B and C by the marketing manager of the company. Viewer estimates, cost per advertisement, and maximum usage

Pooja · Accepted Answer

Section A
1)
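A dataset is imbalanced when one class (usually the class of interest, such as loan defaulters) has far fewer observations than the other, so a classifier can reach high accuracy simply by always predicting the majority class.
Approaches to handling imbalanced datasets
1) Data-level approach: resampling techniques
· Random under-sampling
· Random over-sampling
· Cluster-based over-sampling
· Informed over-sampling: Synthetic Minority Over-sampling Technique (SMOTE)
· Modified Synthetic Minority Over-sampling Technique (MSMOTE)
2) Algorithmic ensemble techniques
· Bagging-based
· Boosting-based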
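The two simplest data-level techniques in the list above, random over-sampling and random under-sampling, can be sketched in plain Python. This is only a minimal illustration under stated assumptions: the function names and the toy data (95 negatives, 5 positives) are made up for the example and do not come from the exam files.

```python
import random
from collections import Counter


def group_by_class(rows, labels):
    """Collect the rows of each class label into its own list."""
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    return by_class


def random_oversample(rows, labels, seed=0):
    """Duplicate rows of smaller classes at random until all classes match the largest one."""
    rng = random.Random(seed)
    by_class = group_by_class(rows, labels)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out_rows.extend(members + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of larger classes so all classes match the smallest one."""
    rng = random.Random(seed)
    by_class = group_by_class(rows, labels)
    target = min(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        out_rows.extend(rng.sample(members, target))
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Hypothetical toy data: 5 positive rows (label 1) and 95 negative rows (label 0).
rows = [[i] for i in range(100)]
labels = [1 if i < 5 else 0 for i in range(100)]

over_rows, over_labels = random_oversample(rows, labels)
under_rows, under_labels = random_undersample(rows, labels)
print(Counter(over_labels))   # both classes now have 95 rows
print(Counter(under_labels))  # both classes now have 5 rows
```

Over-sampling keeps every original row but repeats minority rows, which can encourage over-fitting; under-sampling avoids duplication but discards most of the majority class. Resampling should normally be applied to the training partition only.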
When faced with imbalanced datasets, there is no one-stop solution for improving the prediction model. Several methods may need to be tried to find the sampling technique best suited to the dataset. Synthetic techniques such as SMOTE and MSMOTE often outperform conventional random over-sampling and under-sampling, although this is not guaranteed for every dataset.
For better results, synthetic sampling methods such as SMOTE and MSMOTE can be combined with boosting methods such as gradient boosting and XGBoost.
One bagging technique commonly used to counter the imbalanced dataset problem is SMOTE bagging. It builds each bag (bootstrap sample) differently from conventional bagging: in each iteration, the positive (minority) instances are generated by the SMOTE algorithm at a set SMOTE resampling rate, while the set of negative (majority) instances is bootstrapped.
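The idea behind SMOTE, and how one bag might be assembled from it, can be sketched as follows. This is a simplified illustration rather than the published SMOTE bagging procedure: the function names, the `smote_fraction` parameter (the share of the positive side of a bag that is synthetic) and the toy points are all assumptions made for the example.

```python
import math
import random


def smote_samples(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each on the line segment between a
    minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def make_bag(negatives, positives, smote_fraction, seed=0):
    """Build one balanced bag: bootstrap the negatives, then fill the positive
    side with bootstrapped originals plus SMOTE-generated points.
    smote_fraction is a hypothetical stand-in for the SMOTE resampling rate."""
    rng = random.Random(seed)
    n = len(negatives)
    bag_neg = [rng.choice(negatives) for _ in range(n)]  # sample with replacement
    n_synth = round(n * smote_fraction)
    bag_pos = [rng.choice(positives) for _ in range(n - n_synth)]
    bag_pos += smote_samples(positives, n_synth, seed=seed)
    return bag_neg, bag_pos


# Hypothetical toy data: 4 minority points and 20 majority points in 2-D.
minority = [[1.0, 1.0], [1.5, 1.2], [2.0, 1.0], [1.2, 1.8]]
negatives = [[float(i), 0.0] for i in range(20)]

new_points = smote_samples(minority, n_new=6)
bag_neg, bag_pos = make_bag(negatives, minority, smote_fraction=0.5)
print(len(new_points), len(bag_neg), len(bag_pos))
```

Because each synthetic point is an interpolation between two existing minority points, it always lies inside the region the minority class already occupies. In a full ensemble, one classifier would be trained per bag, with the rate varied from bag to bag, and the predictions combined by voting.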
The most effective technique depends on the characteristics of the imbalanced dataset. Models should be compared using evaluation measures suited to imbalanced data, such as precision, recall and the F1 score, rather than accuracy alone.
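A small sketch shows why the choice of evaluation measure matters. It assumes a hypothetical test set of 100 cases with 5 positives and compares a model that always predicts the majority class with one that finds most positives at the cost of some false alarms. The function name and the numbers are illustrative only.

```python
def classification_metrics(actual, predicted, positive=1):
    """Compute accuracy, precision, recall and F1 for the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical test set: 5 positives followed by 95 negatives.
actual = [1] * 5 + [0] * 95

# Model A always predicts the majority class.
always_negative = [0] * 100
# Model B finds 4 of the 5 positives but raises 10 false alarms.
resampled_model = [1, 1, 1, 1, 0] + [1] * 10 + [0] * 85

metrics_a = classification_metrics(actual, always_negative)
metrics_b = classification_metrics(actual, resampled_model)
print(metrics_a)  # accuracy 0.95 but recall 0.0
print(metrics_b)  # accuracy 0.89 but recall 0.8
```

Model A looks better on accuracy even though it never identifies a single positive case, which is why recall, precision and F1 for the minority class are the more informative measures here.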
2)
Informed Over Sampling:
