Assignment 1
The development of a data mining or rule visualisation routine
There are two parts to this assignment. You are required to answer EITHER part.
Part A
You will be provided with various sets of data for mining (and you can create your own). The assignment is to develop and implement a data mining algorithm (of any kind) such that:
• it does not already exist in any commercially available system (although a significant extension to one that does is acceptable),
• it is backed by appropriate research.

Documentation (in the form of a 3-5 page description relating the research behind the algorithm and a discussion of anything that is novel/useful about your algorithm. Note that I do not require "formal" documentation).

Part B
You will be provided with various rulesets that require appropriate visualisation tools. The assignment is to develop and implement a visualisation algorithm (of any kind) such that:
• it does not already exist in any commercially available system
• it is backed by appropriate research

Documentation (in the form of a 3-5 page description relating the research behind the algorithm and a discussion of anything that is novel/useful about your algorithm.
Note that I do not require "formal" documentation).

Extensions
Other extensions (or undertaking both parts of the assignment!) will be looked on favourably, and marks will be awarded up to the maximum mark available for the assignment, i.e. a nice extension can make up for lost marks.
Marking Criteria for Both Parts
Basic algorithm coded in any language 16 marks.
Bonus for extensions 4 marks.
Documentation 10 marks.

As far as the algorithm is concerned, you will be marked on the quality of your solution as follows:

a. computational complexity of your algorithm.

b. elegance of your programming.

c. accuracy and configurability (i.e. setting thresholds).
As far as the documentation is concerned, you will be marked on:

a. your research into methods available and the novelty of your solution.

b. your explanation of your algorithm.

Submission of Assignment
All Assignment 1 submissions should be zipped into an archive (using your favourite zip package) and uploaded to FLO. The archive should include everything: the documentation, the source, the executable and any test data you developed for yourself. Name the archive surname.zip, where surname is your surname.

Data Mining and Knowledge Discovery
COMP7707
Advanced Data Mining (and Knowledge Discovery)
Prof. John Roddick
With contributions from
Aaron Ceglar, Carl Mooney and Mark Lethbridge.
Naturally occurring Cubic Pyrite
COMP7707 Advanced Data Mining, Semester 1, 2018
John F. Roddick, Flinders University
Overview of Topic
© 2018, Flinders University
Topics
Introduction
    The Role of Common Sense
    Trends in Information Management
    Fundamental Ideas
    Developing Data Mining Algorithms
    Applications of Knowledge Discovery
    Future Directions in DMKD
Data Mining Techniques
    Association Rule Mining
    Clustering Algorithms
    Classification and Prediction
    Sequential Pattern Mining
    Text Mining
    Higher Order Data Mining
    Visualisation Techniques
Including Higher Semantics
    Spatial Data Mining
    Temporal and Longitudinal Data Mining
    Interestingness
    Web Mining
Knowledge Discovery
    Ethics in Data Mining
    Knowledge Discovery Frameworks
DMKD - the discipline
    A merger of (at least) four disciplines.
    Data Mining and Knowledge Discovery draws on:
    Database Systems: VLDB, data warehousing, data modelling, data semantics, …
    Artificial Intelligence: Decision Tree Induction, Clustering, Inductive Logic, …
    Statistics: Validity, Confidence, Autocorrelation, …
    Visualisation: Data Visualisation, Dimension Reduction, …
Where it fits in ICT
    Database queries can be considered to confirm answers to fairly well formed questions or provide simple answers to (relatively) simple questions.
    Data Analysis is used to give answers to questions which might require some discussion or where the answer is at first vague.
    Data Mining allows the question itself to be ill-formed. “Tell me something interesting about …”
Terminology
    Data Mining is the term used to describe the algorithms/routines used to discover interesting aspects about a dataset.
    Knowledge Discovery is the term used to describe the overarching discovery process.
    The difference is similar to the difference between programming and software engineering.
    The terminology is misused (and misappropriated) quite a bit.
    DMKD is one of the hottest research topics to emerge in the database research area in some years.
Research Sources
    Major Conferences
    ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, SIGKDD
    IEEE International Conference on Data Mining, ICDM
    European Conference on Principles of Data Mining and Knowledge Discovery, PKDD
    Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD
    SIAM International Conference on Data Mining
    International Conference on Data Warehousing and Knowledge Discovery, DaWaK
    … plus local conferences such as AusDM
    Conferences that have many DMKD papers
    ACM SIGMOD International Conference on the Management of Data, SIGMOD
    International Conference on Information and Knowledge Management, CIKM
    International Conference on Very Large Data Bases, VLDB
    IEEE International Conference on Data Engineering, ICDE
    Journals
    Data Mining and Knowledge Discovery, DMKD
    ACM Transactions on Knowledge Discovery from Data, TKDD
    ACM Transactions on Database Systems, TODS
    IEEE Transactions on Knowledge and Data Engineering, TKDE
    Knowledge and Information Systems, KAIS
    Data and Knowledge Engineering, DKE
About ADM - the topic
    Knowledge of Database Systems, Artificial Intelligence, Statistics and Visualisation is not required for this topic.
    HOWEVER, if you find something a little difficult as a result of not having studied it, do read up on it. I will try and provide references.
    Being such a new area, some of the subject matter will come direct from research material, i.e. do not expect to find all of the things we talk about implemented in commercial systems yet.
    There is enormous scope to join the team at Flinders in doing postdoctoral, postgraduate or adjunct research.
Topic Organisation
    SAM has important details - please read
    Assignments
    I’ve kept it simple.
    You can do all of them and get the best of them - but be strategic.
    Tutorial/Discussions Sessions
    Will start in week 3
Topic Organisation 2
    Timetable
    Thursdays for 13 weeks
    Lectures.
    3pm – 5pm, 1 hr 50 mins
    Tonsley 1.03
    Tutorial - Starting wk 3.
    noon – 1pm, 50 mins
    Tonsley 1.14
    Text Book
    Tan, Steinbach and Kumar - worth the investment but not critical to buy
    Other resources available in various University libraries
Assessment
    Any two of…
    Assignment 1 - The development of a data mining or rule visualisation routine
    Assignment 2 - A research based paper
    Assignment 3 - A critique of a seminal DMKD paper
Topic 1
The Role of Common Sense
Benford’s Law
    In 1938 Benford noticed that pages of logarithms corresponding to numbers starting with the numeral 1 were much dirtier than other pages.
    The Theory …
    Ask anyone to choose numbers randomly and, over a largish number of numbers, there will be
    1/9th starting with 1,
    1/9th starting with 2, etc.
However, naturally occurring numbers do not follow this pattern. They generally have:
30% starting with 1,
18% starting with 2, etc.
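The 30% and 18% figures above agree with the usual statement of Benford's Law, P(d) = log10(1 + 1/d) for a leading digit d from 1 to 9. The formula itself is not given on the slide, so the following is only a minimal sketch under that assumption; the function names are illustrative.

```python
import math


def benford_probability(digit: int) -> float:
    """Expected share of numbers whose leading digit is `digit` (1-9)."""
    if not 1 <= digit <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / digit)


def benford_distribution() -> dict:
    """Map each leading digit 1-9 to its expected share under Benford's Law."""
    return {d: benford_probability(d) for d in range(1, 10)}


if __name__ == "__main__":
    uniform = 1 / 9  # what a naive "random digits" guess would predict
    for digit, share in benford_distribution().items():
        print(f"{digit}: Benford {share:.1%}  vs  uniform {uniform:.1%}")
```

The nine shares sum to 1, and the shares for digits 1 and 2 round to the 30% and 18% quoted above.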
Benford’s Law, cont.
    We can therefore tell if something that was supposed to be naturally occurring has been faked. For example,
    the numbers in an audited set of accounts …
    random samples from a day's stock quotations,
    a tournament's tennis scores,
    the numbers on the front page of The New York Times,
    the populations of towns,
    the molecular weights of compounds,
    the half-lives of radioactive atoms…
    Has been applied to
    fraud cases in Brooklyn
    Income tax fraud in California
(From "The First-Digit Phenomenon" by T. P. Hill, American Scientist, July-August 1998)
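One way such a check for faked figures might be sketched is to compare the observed leading-digit counts with the Benford expectation using a chi-square statistic. The slides do not say which test was used in the fraud cases, so the choice of test, the 5% significance threshold and all names below are assumptions made for illustration.

```python
import math
from collections import Counter

# Chi-square critical value for 8 degrees of freedom at the 5% level (assumed threshold).
CHI2_CRITICAL_8DF_5PCT = 15.507


def leading_digit(value: float) -> int:
    """First significant digit of a non-zero number."""
    text = f"{abs(value):.15e}"  # scientific notation puts that digit first
    return int(text[0])


def benford_chi_square(values) -> float:
    """Chi-square distance between observed leading digits and Benford's Law."""
    digits = [leading_digit(v) for v in values if v != 0]
    n = len(digits)
    if n == 0:
        raise ValueError("need at least one non-zero value")
    counts = Counter(digits)
    statistic = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        statistic += (counts.get(d, 0) - expected) ** 2 / expected
    return statistic


def looks_suspicious(values, critical: float = CHI2_CRITICAL_8DF_5PCT) -> bool:
    """True when the leading digits deviate from Benford's Law beyond the threshold."""
    return benford_chi_square(values) > critical


if __name__ == "__main__":
    benford_like = [2 ** k for k in range(1, 201)]  # powers of 2 track Benford closely
    evenly_spread = list(range(100, 1000))          # every leading digit equally often
    print(looks_suspicious(benford_like), looks_suspicious(evenly_spread))
```

A large statistic only flags a dataset for closer inspection; small samples, or data confined to a narrow range, can depart from Benford's Law for innocent reasons.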
Topic 2
Trends in Information Management
Answered Same Day May 09, 2020 COMP7707 Flinders University

Solution

Abr Writing answered on May 20 2020
DataMining
May 25, 2018
Decision trees are very often used for prediction tasks and are extremely useful for the following
reasons:
1. Decision trees perform feature selection implicitly. Feature selection is one of the most
important tasks in data analysis. When we fit a decision tree classifier to a dataset, it becomes
easy to identify the most important features in the data from the top few nodes. The higher a
node sits in the hierarchy, the more important its feature is and the better its power to split
the data and perform the classification task. We described here why feature selection is
important in analytics.
2. Decision tree classifiers need less data preparation from users than other classifiers. Data
normalization and transformation are not necessary for a decision tree because the structure
of the tree remains the same irrespective of them. For example, if we have to estimate the
passenger fare from the features available in the Titanic dataset, we can fit a regression
model and then interpret the slopes/coefficients of the resulting model, but such a fit requires
some form of normalization or scaling of the data. In addition, missing data points need not
prevent the algorithm from building the tree or splitting the training data, and outliers make
little difference to a decision tree, unlike other models such as regression.
3. The performance of a decision tree classifier does not depend on the relationships in the data
being linear. In some simple models, such as a regression, nonlinear relationships violate the
model's assumptions. Decision trees, however, do not require any assumption of linearity in
the data.
4. Decision trees are easy to interpret and explain. Decision trees are very intuitive and easy
to explain. Alongside these benefits, it is important to note the drawbacks of decision trees:
without pruning or limiting tree growth, they tend to overfit the training data, which can be
very harmful. The algorithm just grows the tree top-down. It looks
at all the variables in the input...
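The claim in point 1 that the top nodes reveal the most important features can be illustrated by scoring every candidate feature and picking the best one for the root. The preview is cut off before naming a split criterion, so information gain is assumed here purely for illustration; the rows, labels and function names are hypothetical, with toy data only loosely styled after the Titanic dataset.

```python
import math
from collections import Counter


def entropy(labels) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, feature) -> float:
    """Reduction in entropy obtained by splitting on `feature`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_split(rows, labels, features):
    """Feature a top-down tree builder would place at the current node."""
    return max(features, key=lambda f: information_gain(rows, labels, f))


# Hypothetical toy data (not taken from the real Titanic dataset).
ROWS = [
    {"sex": "female", "pclass": 1},
    {"sex": "female", "pclass": 2},
    {"sex": "female", "pclass": 3},
    {"sex": "male", "pclass": 1},
    {"sex": "male", "pclass": 2},
    {"sex": "male", "pclass": 3},
]
LABELS = [1, 1, 1, 0, 0, 0]  # 1 = survived, 0 = did not survive

if __name__ == "__main__":
    for name in ("sex", "pclass"):
        print(name, round(information_gain(ROWS, LABELS, name), 3))
    print("root split:", best_split(ROWS, LABELS, ["sex", "pclass"]))
```

In this toy data "sex" separates the classes perfectly while "pclass" carries no information, so "sex" becomes the root. A full tree builder would repeat the same selection on each resulting subset, normally with a depth limit or pruning step to curb the overfitting mentioned in point 4.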