ITECH7406- Business Intelligence and Data Warehousing Individual assignment – 2017 Worth – 20% ...

Question


ITECH7406- Business Intelligence and Data Warehousing Individual assignment – 2017

Worth – 20%

ANALYTIC REPORT

Learning Outcomes Assessed: S1,S2,S3,K3, K4, A1,A2

Purpose: To encourage students and give them hands-on experience of using SAP Analytics tools to explore, extract and analyse enterprise data.

This is a business analytics project aimed at generating innovative analytics solutions. It will allow students to show innovation and creativity in applying SAP Business Object/Predictive Analytics and designing useful visualization solutions and predictive models for different types of analytics problems.

The topic will be environmental issues. Your main task is to apply any of the analytical tools to develop innovative analytics visualization solutions and predictive models with regard to the environment, e.g. climate change, energy consumption, carbon footprint, greenhouse gas emissions, pollution dashboards, etc. Besides the suggested datasets/sources, you may apply any other real-world dataset to illustrate your approach (different datasets can be combined too). Please find the attached sample reports.

Some possible datasets/sources:

http://data.un.org/Explorer.aspx
http://data.worldbank.org/topic/environment
https://data.oecd.org/
http://geodata.grid.unep.ch/
http://open-data.europa.eu/en/data/publisher/eea
https://www.data.gov/

Report Submission: Hard copy to the tutor's/lecturer's assignment box in week 10. Double-sided printing for the hard copy is encouraged in order to save paper.

Add references in your report from peer-reviewed sources. Include any and all sources of information, including any person(s) you interviewed for this project.

Please note that all references must adhere to APA style. See http://owl.english.purdue.edu/owl/resource/560/01 and http://www.apastyle.org/ for details on how to format a report and how to cite references. Make sure you follow a formal report structure with cover page, introduction, use of headings, subheadings, conclusions and reference section. You are reminded to read the “Plagiarism” section of the course description. Your essay should be a synthesis of ideas from a variety of sources expressed in your own words.

All reports must use the APA referencing style. University Referencing/Citation Style Guide:

The University has published a style guide to help students correctly reference and cite information they use in assignments (American Psychological Association (APA) citation style, http://www.ballarat.edu.au/aasp/student/learning_support/generalguide/print/ch06s04.shtml, or Australian citation style).

Reports are to be presented in hard copy in size 12 Arial font and double spaced. Your report should include a list of the references used in the essay and a bibliography of the wider reading you have done to familiarize yourself with the topic.

A passing grade will be awarded to assignments adequately addressing all assessment criteria. Higher grades require better quality and more effort. For example, a minimum is set on the wider reading required. A student reading vastly more than this minimum will be better prepared to discuss the issues in depth and consequently their report is likely to be of a higher quality. So before submitting, please read through the assessment criteria very carefully.

Assignment 1- Data Analysis- Marking Scheme

Percentage 20%

Due Week 11(Friday 5pm) – Hard and Soft Copies

Tasks	Max Marks	Marks Awarded	Comments
What are the BI reporting solutions/dashboards you will need to develop for the Senior Executives of the chosen data set? You must have at least two types of analytics, i.e. predictive/prescriptive/descriptive.	10
Justify why these BI reporting solutions/dashboards are chosen and why those attributes are present and laid out in the fashion you proposed (feel free to include all other relevant justifications using academic articles). Note: To ensure that you discuss this task properly, you must include visual samples of the reports you produce (i.e. the screenshots of the BI report/dashboard must be presented and explained in the written report; use ‘Snipping tool’), and also include any assumptions that you may have made about the analysis in your assignment report (i.e. the report to the senior executive team of the company).	10
Furthermore, the CEO would like to improve the operations. Based on your BI analysis and the insights gained from “Data Set”, make some logical recommendations to the CEO, and justify why/how your proposal could enhance company operations, sales etc. Include the relevant screenshots of the BI analysis, and also any assumptions that you may have made about the analysis.	10
Report is well-written and presented professionally, containing:
· Title page	1
· Executive Summary (outlining the scope of report, key findings and recommendations)	2
· Table of Contents	1
· Appropriate use of headings within report	1
· Appropriate use of figures (i.e. graphs, summary tables)	1
· References (APA Style), in text and bibliography (10 articles)	4
Total Marks	40
Total Marks out of 20	20%

tae_636409486561929414_81840_1.docx

Answered Same Day Sep 14, 2019 ITECH7406

David · Accepted Answer

Executive Summary
This report provides an analysis and evaluation of current techniques used in spam detection. Five machine learning techniques (Support Vector Machines, K-Nearest Neighbors, AdaBoost, Random Forest and Naïve Bayes) were analyzed and compared using five performance evaluation metrics: accuracy, precision, recall, error rate and false positive rate. All training and prediction were carried out in R, using the available packages for the respective techniques and a Twitter spam dataset. The performance metrics were calculated using the formulas given in the Introduction section. The results show that the overall performance of K-Nearest Neighbors (KNN) was superior to that of the other algorithms.
The report also finds that the Naïve Bayes classifier shows promise for increasing precision and decreasing the false positive rate of the predictions. Precision was one of the major areas of weakness of all the algorithms except Naïve Bayes. For further investigation, a probabilistic KNN, combining the Naïve Bayes and KNN algorithms, is recommended. One limitation is that the features used for prediction can easily be manipulated by a spammer in his or her account to evade these algorithms.
Introduction
This paper gives an overview of the state of the art in machine learning applications for spam detection on a Twitter dataset, using the five machine learning classifiers listed below:
1. K-Nearest Neighbors (KNN)
2. Support Vector Machines (SVM)
3. Naïve Bayes (NB) Classifier
4. Adaptive Boosting (AdaBoost)
5. Random Forest Classifier
For evaluating and comparing the performance of various classifiers, five different performance evaluation metrics were used, and are defined below:
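1. Accuracy
It is the ratio of the number of correct predictions to the total number of cases examined. That is:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
2. Recall
It is the ratio of the number of correctly detected spam items to the actual number of spam items in the dataset. That is:
$$\text{Recall} = \frac{TP}{TP + FN}$$
3. Precision
It is the ratio of the number of correctly detected spam items to the total number of items predicted as spam. That is:
$$\text{Precision} = \frac{TP}{TP + FP}$$
4. Error rate
It is the ratio of the number of incorrect predictions to the total number of cases in the dataset. That is:
$$\text{Error rate} = \frac{FP + FN}{TP + TN + FP + FN}$$
5. False positive rate
It is the ratio of the number of legitimate data points (account/mail/…) incorrectly labelled as spam to the total number of legitimate data points. That is:
$$\text{False positive rate} = \frac{FP}{FP + TN}$$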
Where TP, FN, FP and TN are the cells of the following confusion matrix:
Actual condition	Predicted positive	Predicted negative
Positive	TP (True Positive)	FN (False Negative)
Negative	FP (False Positive)	TN (True Negative)
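The five metrics can be computed directly from the four confusion-matrix counts. The following is a minimal Python sketch, not the R code used in the report; the function name `classification_metrics` and the example counts are hypothetical and chosen only for illustration, and all denominators are assumed to be non-zero.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute the five evaluation metrics from confusion-matrix counts.

    Spam is treated as the positive class. Assumes every denominator is non-zero.
    """
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,          # correct predictions / all cases
        "recall": tp / (tp + fn),               # detected spam / actual spam
        "precision": tp / (tp + fp),            # detected spam / predicted spam
        "error_rate": (fp + fn) / total,        # incorrect predictions / all cases
        "false_positive_rate": fp / (fp + tn),  # mislabelled legitimate / all legitimate
    }


# Hypothetical counts for a test set of 1000 spam and 1000 non-spam tweets.
metrics = classification_metrics(tp=900, fn=100, fp=50, tn=950)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that accuracy and error rate always sum to 1, so in practice only one of the two needs to be reported.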
The data for training and testing the spam filter consists of six account-based features and seven content-based features, as listed below:
Category	Feature	Description
Account-based features	account_age	The age of an account
	no_follower	# of followers
	no_following	# of followings
	no_userfavorites	# of favorites the user received
	no_lists	# of lists of which the user is a member
	no_tweets	# of tweets that have been posted by the user
Content-based features	no_retweets	# of times this tweet has been retweeted
	no_tweetfavorites	# of favorites this tweet received
	no_hashtag	# of hashtags in this tweet
	no_usermention	# of user mentions in this tweet
	no_urls	# of URLs contained in this tweet
	no_char	# of characters in this tweet
	no_digits	# of digits in this tweet
Principal Component Analysis (PCA) was carried out on the 13 features above. Of the resulting principal components, three were retained, which together describe almost 99% of the variation in the data.
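The choice of how many principal components to keep is usually made from the cumulative share of variance they explain. The sketch below is in Python rather than the R used in the report; the function name `components_needed` and the eigenvalues are hypothetical, picked only so that three components cross a 99% threshold, and are not the values obtained in this study.

```python
def components_needed(eigenvalues, threshold=0.99):
    """Return the smallest number of leading components whose cumulative
    share of the total variance reaches the threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for count, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return count
    return len(ordered)


# Hypothetical eigenvalues for 13 standardized features (illustrative only).
eigenvalues = [9.1, 2.6, 1.2, 0.04, 0.02, 0.01, 0.01,
               0.005, 0.005, 0.004, 0.003, 0.002, 0.001]
k = components_needed(eigenvalues, threshold=0.99)
print(k)
```

With these made-up eigenvalues the first two components explain 90% of the variance and the first three about 99.2%, so three components are retained.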
The following table shows the datasets used for training and testing the Twitter spam filters:
Dataset	# of spam tweets	# of non-spam tweets
Training	1000	1000
Testing (set 1)	1000	1000
Testing (set 2)	190	1900
The second testing dataset is much more realistic because, in real life, the number of spam tweets is much lower than the number of non-spam tweets.
The Technical demonstration chapter contains all the details of the PCA and the step-by-step training and prediction on both testing datasets using the five classification methods listed above. The Performance evaluation chapter shows the performance of each classifier based on the five evaluation metrics listed above.
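Class balance matters because precision depends on the ratio of spam to non-spam tweets, even when a classifier's recall and false positive rate stay the same. The Python sketch below illustrates this with the two test-set sizes from the table; the function name `expected_precision` and the recall and false positive rate of 0.90 and 0.10 are assumptions made for illustration, not results from this report.

```python
def expected_precision(n_spam, n_legit, recall, false_positive_rate):
    """Precision implied by a fixed recall and false positive rate
    on a test set with the given class sizes."""
    true_positives = recall * n_spam
    false_positives = false_positive_rate * n_legit
    return true_positives / (true_positives + false_positives)


# Same hypothetical classifier behaviour on a balanced and an imbalanced test set.
balanced = expected_precision(1000, 1000, recall=0.90, false_positive_rate=0.10)
imbalanced = expected_precision(190, 1900, recall=0.90, false_positive_rate=0.10)
print(f"balanced: {balanced:.3f}, imbalanced: {imbalanced:.3f}")
```

Under these assumed rates, precision falls from 0.90 on the balanced set to about 0.47 on the imbalanced set, because the same false positive rate is applied to ten times as many legitimate tweets as spam tweets.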
Literature review
Spamming is one of the major problems of the information age and has been studied rigorously by researchers and practitioners in the field of security data analytics using different machine learning techniques. According to Alexa, Twitter is one of the most visited websites in the world. The ever-increasing traffic on Twitter also attracts spammers, so it is very important to detect and remove spammers from the site to protect the experience of the general public on Twitter.
The problem of spam detection is difficult because it is very easy for spammers to fabricate the features of a benign account. To detect spammers, researchers therefore have to consider many different features: account-based features such as the age of an account, the number of followers and the number of followings, and content-based features such as the number of times a tweet has been retweeted and the number of favorites a tweet has received. This increases the overall dimensionality of the analysis. Moreover, more than 100 million users log in to their Twitter accounts every day and, on average, 6000 tweets are posted every second, which makes manual spam detection impossible for any human reader. Even with such high-dimensional data, researchers have reported performance of over 90% on different evaluation metrics.
Many researchers have proposed strategies that use multiple account-based and content-based features to filter spam. Some of the most studied methods include k-nearest neighbors, support vector machines, Naïve Bayes and ensemble methods. Ensemble methods aim to combine predictions from different classifiers and can be further classified into averaging methods and boosting methods. Random Forest and AdaBoost are the best-known classifiers among the averaging and boosting methods, respectively.
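The difference between the two ensemble families shows up in how the individual predictions are combined. The Python sketch below covers only that combination step under simplifying assumptions: an averaging ensemble is reduced to an unweighted majority vote, and a boosting ensemble to a vote weighted by per-classifier weights. The function names, labels and weights are hypothetical, and the training procedures (bootstrap sampling for Random Forest, iterative reweighting for AdaBoost) are not shown.

```python
from collections import Counter


def majority_vote(predictions):
    """Averaging-style combination: every classifier has an equal say."""
    counts = Counter(predictions)
    return counts.most_common(1)[0][0]


def weighted_vote(predictions, weights):
    """Boosting-style combination: each classifier's vote counts in
    proportion to its weight (for example, a weight based on its accuracy)."""
    totals = {}
    for label, weight in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)


# Hypothetical predictions from three classifiers for a single tweet.
votes = ["spam", "legit", "legit"]
print(majority_vote(votes))                   # two of three say "legit"
print(weighted_vote(votes, [0.9, 0.3, 0.4]))  # the heavily weighted classifier wins
```

In the example, the unweighted vote returns "legit", while the weighted vote returns "spam" because the first classifier's weight (0.9) exceeds the combined weight of the other two (0.7).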
The following table summarizes the various machine learning algorithms and their use in spam filtering in previous investigations.
Classifier	Previous investigations
KNN	1, 2, 3, 4, 5, 6, 22
Support Vector Machines	7, 8, 9, 10, 11, 12, 13, 14, 5, 6, 22
Naïve Bayes	1, 15, 16, 7, 9, 17, 18, 19, 20, 5, 6, 22
Boosting	7, 21, 10, 5, 6
Random Forest	23, 24, 25, 26
As is evident from the above table, there have been many research studies, both included in and omitted from the table, comparing the performance of different machine learning classifiers. However, no investigation was found that compares all of the machine learning schemes listed above on a common dataset. Moreover, this study also applies Principal Component Analysis (PCA) to reduce the dimensionality of the data, which helps to reduce computational cost and time.
Technical demonstration
First, all the necessary libraries (packages) were installed and loaded, and the three datasets (one training and two testing datasets) were loaded into the program, as shown below:
Visualizing the data:
