Assignment 1
The development of a data mining or rule visualisation routine
There are two parts to this assignment. You are required to answer EITHER part.
Part A
You will be provided with various sets of data for mining (and you can create your own). The assignment is to develop and implement a data mining algorithm (of any kind) such that:
• it does not already exist in any commercially available system (although a significant extension to one that does is acceptable),
• it is backed by appropriate research.
Documentation is required in the form of a 3-5 page description relating the research behind the algorithm and a discussion of anything that is novel/useful about your algorithm. Note that I do not require "formal" documentation.
Part B
You will be provided with various rulesets that require appropriate visualisation tools. The assignment is to develop and implement a visualisation algorithm (of any kind) such that:
• it does not already exist in any commercially available system,
• it is backed by appropriate research.
Documentation is required in the form of a 3-5 page description relating the research behind the algorithm and a discussion of anything that is novel/useful about your algorithm. Note that I do not require "formal" documentation.
Extensions
Other extensions (or undertaking both parts of the assignment!) will be looked on favourably, and marks will be awarded up to the maximum mark available for the assignment, i.e. a nice extension can make up for lost marks.
Marking Criteria for Both Parts
Basic algorithm coded in any language: 16 marks.
Bonus for extensions: 4 marks.
Documentation: 10 marks.
As far as the algorithm is concerned, you will be marked on the quality of your solution as follows:
a. computational complexity of your algorithm.
b. elegance of your programming.
c. accuracy and configurability (i.e. setting thresholds).
As far as the documentation is concerned, you will be marked on:
a. your research into methods available and the novelty of your solution.
b. your explanation of your algorithm.
Submission of Assignment
All Assignment 1 submissions should be zipped into an archive (using your favourite zip package) and uploaded to FLO. The archive should include everything: the documentation, the source, the executable and any test data you developed for yourself. Name the archive surname.zip, where surname is your surname.
Data Mining and Knowledge Discovery
COMP7707
Advanced Data Mining (and Knowledge Discovery)
Prof. John Roddick
XXXXXXXXXX
With contributions from
Aaron Ceglar, Carl Mooney and Mark Lethbridge.
Naturally occurring Cubic Pyrite
COMP7707 Advanced Data Mining, Semester 1, 2018
John F. Roddick, Flinders University
Overview of Topic
© 2018, Flinders University
Topics
Introduction
The Role of Common Sense
Trends in Information Management
Fundamental Ideas
Developing Data Mining Algorithms
Applications of Knowledge Discovery
Future Directions in DMKD
Data Mining Techniques
Association Rule Mining
Clustering Algorithms
Classification and Prediction
Sequential Pattern Mining
Text Mining
Higher Order Data Mining
Visualisation Techniques
Including Higher Semantics
Spatial Data Mining
Temporal and Longitudinal Data Mining
Interestingness
Web Mining
Knowledge Discovery
Ethics in Data Mining
Knowledge Discovery Frameworks
DMKD - the discipline
A merger of (at least) four disciplines.
Data Mining and Knowledge Discovery draws on:
Database Systems: VLDB, data warehousing, data modelling, data semantics, …
Artificial Intelligence: Decision Tree Induction, Clustering, Inductive Logic, …
Statistics: Validity, Confidence, Autocorrelation, …
Visualisation: Data Visualisation, Dimension Reduction, …
Where it fits in ICT
Database queries can be considered to confirm answers to fairly well formed questions or provide simple answers to (relatively) simple questions.
Data Analysis is used to give answers to questions which might require some discussion or where the answer is at first vague.
Data Mining allows the question itself to be ill-formed. “Tell me something interesting about …”
Terminology
Data Mining is the term used to describe the algorithms/routines used to discover interesting aspects of a dataset.
Knowledge Discovery is the term used to describe the overarching discovery process.
The difference is similar to the difference between programming and software engineering.
The terminology is misused (and misappropriated) quite a bit.
DMKD is one of the hottest research topics to emerge in the database research area in some years.
Research Sources
Major Conferences
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, SIGKDD
IEEE International Conference on Data Mining, ICDM
European Conference on Principles of Data Mining and Knowledge Discovery, PKDD
Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD
SIAM International Conference on Data Mining
International Conference on Data Warehousing and Knowledge Discovery, DaWaK
… plus local conferences such as AusDM
Conferences that have many DMKD papers
ACM SIGMOD International Conference on the Management of Data, SIGMOD
International Conference on Information and Knowledge Management, CIKM
International Conference on Very Large Data Bases, VLDB
IEEE International Conference on Data Engineering, ICDE
Journals
Data Mining and Knowledge Discovery, DMKD
ACM Transactions on Knowledge Discovery from Data, TKDD
ACM Transactions on Database Systems, TODS
IEEE Transactions on Knowledge and Data Engineering, TKDE
Knowledge and Information Systems, KAIS
Data and Knowledge Engineering, DKE
About ADM - the topic
Knowledge of Database Systems, Artificial Intelligence, Statistics and Visualisation is not required for this topic.
HOWEVER, if you find something a little difficult as a result of not having studied it, do read up on it. I will try and provide references.
Because this is such a new area, some of the subject matter will come directly from research material, i.e. do not expect to find all of the things we talk about implemented in commercial systems yet.
There is enormous scope to join the team at Flinders in postdoctoral, postgraduate or adjunct research.
Topic Organisation
SAM has important details - please read
Assignments
I’ve kept it simple.
You can do all of them and get the best of them, but be strategic.
Tutorial/Discussions Sessions
Will start in week 3
Topic Organisation 2
Timetable
Thursdays for 13 weeks
Lectures.
3pm – 5pm, 1 hr 50 mins
Tonsley 1.03
Tutorial - Starting wk 3.
noon – 1pm, 50 mins
Tonsley 1.14
Text Book
Tan, Steinbach and Kumar - worth the investment but not critical to buy
Other resources available in various University libraries
Assessment
Any two of…
Assignment 1 - The development of a data mining or rule visualisation routine
Assignment 2 - A research-based paper
Assignment 3 - A critique of a seminal DMKD paper
Topic 1
The Role of Common Sense
Benford’s Law
In 1938 Benford noticed that pages of logarithms corresponding to numbers starting with the numeral 1 were much dirtier than other pages.
The Theory …
Ask anyone to choose numbers randomly and, over a largish number of numbers, there will be
1/9th starting with 1,
1/9th starting with 2, etc.
However, naturally occurring numbers do not follow this pattern. They generally have:
30% starting with 1,
18% starting with 2, etc.
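The 30% and 18% figures are consistent with the usual statement of Benford's Law, P(d) = log10(1 + 1/d) for a leading digit d. The slides do not give this formula, so the following is only a small sketch under that assumption; the names `benford_expected` and `benford_table` are made up for this illustration.

```python
import math


def benford_expected(digit: int) -> float:
    """Expected share of numbers whose first digit is `digit`,
    assuming the standard form of Benford's Law: log10(1 + 1/d)."""
    if not 1 <= digit <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / digit)


def benford_table() -> dict:
    """Map each leading digit 1..9 to its expected proportion."""
    return {d: benford_expected(d) for d in range(1, 10)}


if __name__ == "__main__":
    # Print the expected proportions next to the naive 1/9 guess.
    for digit, share in benford_table().items():
        print(f"{digit}: Benford {share:.1%}   uniform {1 / 9:.1%}")
```

The nine proportions sum to 1, and the share falls steadily from digit 1 to digit 9, which is what separates naturally occurring numbers from the uniform 1/9th guess.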
Benford’s Law, cont.
We can therefore tell if something that was supposed to be naturally occurring has been faked. For example,
the numbers in an audited set of accounts …
random samples from a day's stock quotations,
a tournament's tennis scores,
the numbers on the front page of The New York Times,
the populations of towns,
the molecular weights of compounds,
the half-lives of radioactive atoms…
It has been applied to
fraud cases in Brooklyn
income tax fraud in California
(From "The First-Digit Phenomenon" by T. P. Hill, American Scientist, July-August 1998)
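One way such a check might be carried out is to compare the observed first-digit counts of a dataset with the counts Benford's Law predicts, using a chi-square statistic. This is a sketch only, not the method used in the cases cited: the function names are hypothetical, the expected proportions assume the standard log10(1 + 1/d) form, and the 15.507 threshold is the usual 5% critical value for 8 degrees of freedom.

```python
import math
from collections import Counter

# Assumed threshold: 5% critical value of chi-square with 8 degrees of freedom.
CHI_SQUARE_5_PERCENT_8_DF = 15.507


def leading_digit(value) -> int:
    """Return the first significant digit of a non-zero number."""
    # Scientific notation puts the first significant digit first, e.g. '3.14e+02'.
    return int(f"{abs(value):.15e}"[0])


def benford_chi_square(values) -> float:
    """Chi-square distance between observed first digits and Benford's Law."""
    digits = [leading_digit(v) for v in values if v != 0]
    total = len(digits)
    if total == 0:
        raise ValueError("need at least one non-zero value")
    counts = Counter(digits)
    statistic = 0.0
    for d in range(1, 10):
        expected = total * math.log10(1 + 1 / d)
        statistic += (counts.get(d, 0) - expected) ** 2 / expected
    return statistic


def looks_suspicious(values) -> bool:
    """True when the first digits deviate from Benford's Law at the 5% level."""
    return benford_chi_square(values) > CHI_SQUARE_5_PERCENT_8_DF


if __name__ == "__main__":
    powers_of_two = [2 ** k for k in range(1, 201)]   # known to follow Benford closely
    evenly_spread = list(range(1, 1000))              # every first digit equally common
    print("powers of two suspicious?", looks_suspicious(powers_of_two))
    print("1..999 suspicious?", looks_suspicious(evenly_spread))
```

A large statistic only says the digits are unlike Benford's distribution. Small samples, or data confined to a narrow range (for example heights in centimetres), can fail the test without any faking, so the result is a prompt for further investigation, not proof.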
Topic 2
Trends in Information Management