ITECH7406- Business Intelligence and Data Warehousing Individual assignment – 2017

Worth – 20%

ANALYTIC REPORT

Learning Outcomes Assessed: S1,S2,S3,K3, K4, A1,A2

Purpose: To give students hands-on experience of using SAP Analytics tools to explore, extract and analyse enterprise data.

This is a business analytics project aimed at generating innovative analytics solutions. It will allow students to show innovation and creativity in applying SAP Business Object/Predictive Analytics and designing useful visualization solutions and predictive models for different types of analytics problems.

The topic will be on environmental issues. Your main task is to apply any of the analytical tools to develop innovative analytics visualization solutions and predictive models with regard to the environment, e.g. climate change, energy consumption, carbon footprint, greenhouse gas emission, pollution dashboard, etc. Besides the suggested datasets/sources, you may use any other real-world dataset to illustrate your approach (different datasets can also be combined). Please find the attached sample reports.

Some possible datasets/sources:

http://data.un.org/Explorer.aspx
http://data.worldbank.org/topic/environment
https://data.oecd.org/
http://geodata.grid.unep.ch/
http://open-data.europa.eu/en/data/publisher/eea
https://www.data.gov/

Report Submission: Hard copy to the tutors'/lecturers' assignment box in week 10. Double-sided printing of the hard copy is encouraged in order to save paper.

Add references in your report from peer-reviewed sources. Include any and all sources of information, including any person(s) you interviewed for this project.

Please note that all references must adhere to APA style. See http://owl.english.purdue.edu/owl/resource/560/01 and http://www.apastyle.org/ for details on how to format a report and how to cite references. Make sure you follow a formal report structure with a cover page, introduction, headings, subheadings, conclusions and a reference section. You are reminded to read the “Plagiarism” section of the course description. Your essay should be a synthesis of ideas from a variety of sources expressed in your own words.

All reports must use the APA referencing style. University Referencing/Citation Style Guide:

The University has published a style guide to help students correctly reference and cite information they use in assignments (American Psychological Association (APA) citation style, see http://www.ballarat.edu.au/aasp/student/learning_support/generalguide/print/ch06s04.shtml or Australian citation style).

Reports are to be presented in hard copy in size 12 Arial font and double spaced. Your report should include a list of the references used in the essay and a bibliography of the wider reading you have done to familiarize yourself with the topic.

A passing grade will be awarded to assignments adequately addressing all assessment criteria. Higher grades require better quality and more effort. For example, a minimum is set on the wider reading required. A student reading vastly more than this minimum will be better prepared to discuss the issues in depth and consequently their report is likely to be of a higher quality. So before submitting, please read through the assessment criteria very carefully.


Assignment 1 - Data Analysis - Marking Scheme

Percentage: 20%

Due: Week 11 (Friday 5pm), hard and soft copies

Marking table columns: Tasks, Max Marks, Marks Awarded, Comments

Task: What are the BI reporting solutions/dashboards you will need to develop for the Senior Executives of the chosen data set? You must have at least two types of analytics, i.e. predictive/prescriptive/descriptive.
Max Marks: 10

Task: Justify why these BI reporting solutions/dashboards are chosen and why those attributes are present and laid out in the fashion you proposed (feel free to include all other relevant justifications using academic articles).
Note: To ensure that you discuss this task properly, you must include visual samples of the reports you produce (i.e. the screenshots of the BI report/dashboard must be presented and explained in the written report; use ‘Snipping tool’), and also include any assumptions that you may have made about the analysis in your assignment report (i.e. the report to the senior executive team of the company).
Max Marks: 10

Task: Furthermore, the CEO would like to improve the operations. Based on your BI analysis and the insights gained from the “Data Set”, make some logical recommendations to the CEO, and justify why/how your proposal could enhance company operations, sales etc. Include the relevant screenshots of the BI analysis, and also any assumptions that you may have made about the analysis.
Max Marks: 10

Task: Report is well-written and presented professionally, containing:
· Title page (Max Marks: 1)
· Executive Summary (outlining the scope of report, key findings and recommendations) (Max Marks: 2)
· Table of Contents (Max Marks: 1)
· Appropriate use of headings within report (Max Marks: 1)
· Appropriate use of figures (i.e. graphs, summary tables) (Max Marks: 1)
· References (APA Style), in text and bibliography (10 articles) (Max Marks: 4)

Total Marks: 40

Total Marks out of 20: 20%

Answered Same Day | Sep 14, 2019 | ITECH7406

Solution

David answered on Dec 28 2019
Executive Summary
This report provides an analysis and evaluation of the current techniques used in spam detection. Five machine learning techniques (Support Vector Machines, K-Nearest Neighbors, AdaBoost, Random Forest and Naïve Bayes) were analyzed and compared using five performance evaluation metrics: accuracy, precision, recall, error rate and false positive rate. All training and prediction were carried out in R, using the available packages for the respective techniques, on a Twitter spam dataset. The performance metrics were calculated using the formulas given in the Introduction section. The results show that the overall performance of K-Nearest Neighbors (KNN) was superior to that of the other algorithms.
The report also finds that the Naïve Bayes classifier shows promise for increasing precision and decreasing the false positive rate of the predictions. Precision was one of the major weaknesses of all the algorithms except Naïve Bayes. For further investigation, a probabilistic KNN, which combines the Naïve Bayes and KNN algorithms, is recommended. One limitation is that the features used for prediction can easily be manipulated by a spammer in his/her account to evade these algorithms.
Introduction
This paper gives an overview of the state of the art of machine learning applications for spam detection on a Twitter dataset, using the five machine learning classifiers listed below:
1. K-Nearest Neighbors (KNN)
2. Support Vector Machines (SVM)
3. Naïve Bayes (NB) Classifier
4. Adaptive Boosting (AdaBoost)
5. Random Forest Classifier
To evaluate and compare the performance of the classifiers, five performance evaluation metrics were used. They are defined below:
1. Accuracy
It is the ratio of the number of correct predictions to the total number of cases examined. That is:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
2. Recall
It is the ratio of the number of correctly detected spam to the actual number of spam in the dataset. That is:
Recall = TP / (TP + FN)
3. Precision
It is the ratio of the number of correctly detected spam to the total number of spam predicted. That is:
Precision = TP / (TP + FP)
4. Error rate
It is the ratio of the number of incorrect predictions to the total number of cases in the dataset. That is:
Error rate = (FP + FN) / (TP + TN + FP + FN)
5. False positive rate
It is the ratio of the number of legitimate data points (account/mail/…) incorrectly labelled as spam to the total number of legitimate data points. That is:
False positive rate = FP / (FP + TN)
Where,
                                Prediction condition
                                positive               negative
    Actual Condition  positive  TP (True Positive)     FN (False Negative)
                      negative  FP (False Positive)    TN (True Negative)
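The five metrics defined above follow directly from the four cells of the confusion table. The report states that its calculations were done in R and does not show that code; the Python sketch below is only an illustration, and the names `ConfusionMatrix` and `confusion_from_labels` are hypothetical helpers introduced here, with spam treated as the positive class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Counts of a binary confusion table (spam is the positive class)."""
    tp: int  # spam correctly predicted as spam
    fn: int  # spam predicted as non-spam
    fp: int  # non-spam predicted as spam
    tn: int  # non-spam correctly predicted as non-spam

    @property
    def total(self) -> int:
        return self.tp + self.fn + self.fp + self.tn

    def accuracy(self) -> float:
        return (self.tp + self.tn) / self.total

    def recall(self) -> float:
        return self.tp / (self.tp + self.fn)

    def precision(self) -> float:
        return self.tp / (self.tp + self.fp)

    def error_rate(self) -> float:
        return (self.fp + self.fn) / self.total

    def false_positive_rate(self) -> float:
        return self.fp / (self.fp + self.tn)


def confusion_from_labels(actual, predicted, positive=1) -> ConfusionMatrix:
    """Tally the four cells from parallel lists of actual and predicted labels."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return ConfusionMatrix(tp=tp, fn=fn, fp=fp, tn=tn)


# Made-up counts, for illustration only (not results from the report).
example = ConfusionMatrix(tp=90, fn=10, fp=20, tn=80)
print(example.accuracy(), example.recall(), example.precision())
```

Note that accuracy and error rate always sum to 1, so reporting both is a convenience rather than extra information.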
The data for training and testing the spam filter consists of six account-based features and seven content-based features, as listed below:
    Category                 Feature            Description
    Account based Features   account_age        The age of an account
                             no_follower        # of followers
                             no_following       # of following
                             no_userfavorites   # of favorites the user received
                             no_lists           # of lists of which the user is a member
                             no_tweets          # of tweets that have been posted by the user
    Content based Features   no_retweets        # of times this tweet has been retweeted
                             no_tweetfavorites  # of favorites this tweet received
                             no_hashtag         # of hashtags in this tweet
                             no_usermention     # of times this tweet has been mentioned
                             no_urls            # of URLs contained in this tweet
                             no_char            # of characters in this tweet
                             no_digits          # of digits in this tweet
Principal Component Analysis (PCA) was carried out on the 13 features above. Of the resulting principal components, 3 were used; together they describe almost 99% of the variation in the data.
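To make the "share of variation described" idea concrete, the sketch below computes the fraction of total variance captured by the leading principal components. It is an illustration only: the report ran its PCA in R and does not show the procedure, and it does not say whether the features were standardized first. This sketch only centers the columns and uses power iteration with deflation as a simple stand-in for a full eigendecomposition. The function names and the toy data are assumptions.

```python
import math
import random


def covariance_matrix(rows):
    """Sample covariance matrix of the columns of `rows` (list of equal-length lists)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_eigenpair(matrix, iterations=500, seed=0):
    """Dominant eigenvalue and eigenvector of a symmetric matrix via power iteration."""
    d = len(matrix)
    rng = random.Random(seed)  # fixed seed keeps the sketch deterministic
    vec = [rng.uniform(0.1, 1.0) for _ in range(d)]
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:  # matrix has no remaining variance
            break
        vec = [x / norm for x in nxt]
    length = math.sqrt(sum(x * x for x in vec))
    vec = [x / length for x in vec]
    # Rayleigh quotient v^T M v gives the eigenvalue for the unit vector v.
    value = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(d)) for i in range(d))
    return value, vec


def explained_variance_ratios(rows, k):
    """Fraction of total variance captured by each of the first k principal components."""
    cov = covariance_matrix(rows)
    d = len(cov)
    total = sum(cov[i][i] for i in range(d))
    ratios = []
    for _ in range(k):
        value, vec = top_eigenpair(cov)
        ratios.append(value / total)
        # Deflation: remove the component just found before searching for the next one.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(d)] for i in range(d)]
    return ratios


# Toy data (not from the report): two uncorrelated columns with different spread.
toy = [[0.0, 0.0], [2.0, 0.0], [0.0, 1.0], [2.0, 1.0]]
print(explained_variance_ratios(toy, 2))
```

Summing the first few ratios gives the cumulative share of variation; the report's statement corresponds to the first three ratios summing to almost 0.99 on its 13 features.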
The following table shows the datasets used for training and testing the Twitter spam filters:
    Dataset     # of Spam tweets   # of Non-spam tweets
    Training    1000               1000
    Testing     1000               1000
                190                1900
The second testing dataset is much more realistic because, in real life, the number of spam tweets is much lower than the number of non-spam tweets.
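The class balance of a test set matters for precision even when the classifier itself does not change. The arithmetic sketch below uses the two test-set sizes from the table above together with a recall of 0.90 and a false positive rate of 0.05. Those two rates are hypothetical values chosen for illustration, not results from the report, and `precision_at` is a helper name introduced here.

```python
def precision_at(n_spam, n_nonspam, recall, false_positive_rate):
    """Expected precision for a classifier with fixed recall and false positive rate."""
    true_positives = recall * n_spam                    # spam correctly flagged
    false_positives = false_positive_rate * n_nonspam   # non-spam wrongly flagged
    return true_positives / (true_positives + false_positives)


# Same hypothetical classifier, evaluated on the two test-set sizes from the table.
balanced = precision_at(1000, 1000, recall=0.90, false_positive_rate=0.05)
imbalanced = precision_at(190, 1900, recall=0.90, false_positive_rate=0.05)
print(round(balanced, 3), round(imbalanced, 3))
```

With ten times more non-spam than spam, the same false positive rate produces many more false alarms relative to true detections, so precision falls while recall and the false positive rate stay the same.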
The Technical demonstration chapter contains all the details of the PCA and the step-by-step training and prediction on both testing datasets using the five classification methods listed above. The Performance evaluation chapter shows the performance of each classifier on the five evaluation metrics listed above.
Literature review
Spamming is one of the major problems of the information age and has been studied extensively by researchers and practitioners in the field of security data analytics using different machine learning techniques. According to Alexa, Twitter is one of the most visited websites in the world. The ever-increasing traffic on Twitter also attracts spammers, so it is important to detect and remove them from the site to protect the experience of the general public on Twitter.
The problem of spam detection is difficult because it is very easy for spammers to fabricate the features of a benign account. To detect spammers, researchers therefore have to consider many different features: account-based features such as the age of an account, the number of followers and the number of accounts followed, and content-based features such as the number of times a tweet has been retweeted and the number of favorites a tweet has received. This increases the overall dimension of the analysis. In addition, more than 100 million users log in to their Twitter accounts every day and, on average, 6000 tweets are posted every second, which makes spam detection impossible for any human reader. Even with such high-dimensional data, researchers have reported over 90% performance on different evaluation metrics.
Many researchers have proposed strategies that use multiple account-based and content-based features to filter spam. Some of the most studied methods include k-nearest neighbors, support vector machines, Naïve Bayes and ensemble methods. Ensemble methods combine the predictions of several classifiers and can be further classified into averaging methods and boosting methods. Random Forest and AdaBoost are the best-known classifiers among the averaging and boosting methods respectively.
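The difference between the two ensemble families shows up in how the individual predictions are combined. The sketch below covers only that final combination step, not how the individual classifiers are trained: an unweighted majority vote in the spirit of averaging methods such as Random Forest, and a weighted vote using the standard AdaBoost classifier weight 0.5 * ln((1 - error) / error). The labels +1 (spam) and -1 (non-spam), the function names, and the example error rates are assumptions for illustration.

```python
import math
from collections import Counter


def majority_vote(predictions):
    """Averaging-style combination: every classifier gets one equal vote."""
    return Counter(predictions).most_common(1)[0][0]


def adaboost_weight(error):
    """AdaBoost weight of a weak classifier with weighted training error `error`."""
    return 0.5 * math.log((1.0 - error) / error)


def weighted_vote(predictions, errors):
    """Boosting-style combination: votes (+1 or -1) are scaled by classifier weight."""
    score = sum(adaboost_weight(e) * p for p, e in zip(predictions, errors))
    return 1 if score >= 0 else -1


# Three hypothetical classifiers: one accurate (error 0.1), two weak (error 0.4).
votes = [1, -1, -1]
print(majority_vote(votes))                    # the two weak classifiers win
print(weighted_vote(votes, [0.1, 0.4, 0.4]))   # the accurate classifier wins
```

In this example the two schemes disagree: the equal vote follows the two weak classifiers, while the weighted vote follows the single accurate one because its weight exceeds the combined weight of the other two.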
The following table summarizes the various machine learning algorithms and their use in spam filtering in previous investigations.
    Classifier                Previous Investigations
    KNN                       1, 2, 3, 4, 5, 6, 22
    Support Vector Machines   7, 8, 9, 10, 11, 12, 13, 14, 5, 6, 22
    Naïve Bayes               1, 15, 16, 7, 9, 17, 18, 19, 20, 5, 6, 22
    Boosting                  7, 21, 10, 5, 6
    Random Forest             23, 24, 25, 26
As the table above shows, many research studies, both included in and omitted from the table, have compared the performance of different machine learning classifiers. However, no investigation was found that compares all of the machine learning schemes listed above on a common dataset. Moreover, this study also applies Principal Component Analysis (PCA) to reduce the dimensionality of the data, which helps to reduce computational cost and time.
Technical demonstration
First, all the libraries (packages) required were installed and loaded, and all three datasets (the training dataset and the two testing datasets) were loaded into the program, as shown below:
Visualizing the data:
For any data analysis it is...