You should write an 1,800-word structured report (see Section 3) that includes the following headings (more details on how the report will be assessed are provided below):
• Introduction - introduce the prediction problem.
• Data mining theory - provide a theoretical description of your two chosen supervised data mining methods (for example, the classification or regression techniques that you have used) and why they are appropriate to your task.
• Data preparation - describe how you explored and prepared the data for data mining. Use graphical methods provided by software, for example Tableau, to communicate characteristics about your data and rationalise your choices.
• Methods - describe the experimental setup you used in KNIME for each data mining method including a discussion of how you varied the setup in order to find the best model. When describing your experimental setup, you should ensure that you provide sufficient information for someone else to repeat your study. For example, you should explain which nodes you have used in KNIME and which parameter settings you used. It may be appropriate to present this data in table form.
Charts, tables, references and appendices are not included in the word count.
Use this dataset: https://www.kaggle.com/c/titanic
Data available for download: https://www.kaggle.com/c/titanic/data
The training set consists of 891 passengers characterised by up to 10 attributes. The aim of this challenge is to build a model that is able to predict whether or not a passenger will survive the sinking of the Titanic.
Sections:
Introduction - This section should introduce the data mining problem that is addressed in the report. You should indicate the property/data values that you want to predict and give a brief overview of the dataset and methods that you will use. (150 words)
Data Mining Theory - This section should provide an overview of your chosen algorithms for predictive data mining from a theoretical aspect. Explain why they are relevant to your prediction problem. Support your rationale by providing references to the literature in which the techniques have been applied to similar problems. Include a discussion of the most appropriate methods for evaluating the performance of your chosen data mining methods. This should include a discussion of the role of training, test and validation (if appropriate) sets as well as model performance measures. (750 words)
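As an illustration of the model performance measures mentioned above, the following minimal Python sketch computes accuracy, precision, recall and F1 from a confusion matrix for a binary target such as survival. It is not part of the required KNIME workflow: the function names and the label values are hypothetical and chosen only to show the arithmetic, and which measures are most appropriate remains something you should argue in your report.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (tp, fp, tn, fn) for a binary classification task."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive and a != positive:
            fp += 1
        elif p != positive and a != positive:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def performance_measures(actual, predicted, positive=1):
    """Derive common performance measures from the confusion counts."""
    tp, fp, tn, fn = confusion_counts(actual, predicted, positive)
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels (1 = survived, 0 = did not survive), not real results.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
measures = performance_measures(actual, predicted)
print(measures)
```

Accuracy alone can be misleading when the classes are unbalanced, which is one reason to report precision and recall alongside it.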
Data Preparation - This section should provide a brief description of the data and of the exploratory approaches you have used to understand and pre-process your data. You should present an investigation of the attributes (including the data value to be predicted) and describe any data cleaning including handling of missing data, data transformations and data aggregations that you have carried out. For example, you could use statistical methods to examine the variance of attributes and the degree of correlation between the different attributes. You should use graphical techniques such as Tableau to communicate your data to the reader and to rationalise the decisions made. (300 words)
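To make the data preparation steps concrete, here is a minimal Python sketch of two of the operations named above: filling in missing values and measuring the correlation between two attributes. Median imputation is only one possible way to handle missing data and is used here as an assumption, not as a recommended choice. The attribute names and all values are made up for illustration and are not taken from the dataset.

```python
from math import sqrt
from statistics import median


def impute_median(values):
    """Replace missing entries (None) with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical values for an age-like attribute with two missing entries.
ages = [22.0, None, 38.0, 26.0, None, 35.0]
ages_filled = impute_median(ages)

# Hypothetical values for a class-like and a fare-like attribute.
pclass = [1, 3, 2, 3, 1]
fare = [80.0, 8.0, 20.0, 7.0, 70.0]
r = pearson(pclass, fare)
print(ages_filled, round(r, 3))
```

Whatever approach you take, state it explicitly in the report (for example, which attributes had missing values and how each was treated) so that the preparation can be repeated.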
Methods - This section should describe the experimental design used for each data mining method. You should discuss how you divided the data into training, test and (if appropriate) validation sets. You should also describe the process you followed in order to find the best performing model for each method. You should describe your experimental design in sufficient detail that your study could be repeated by someone else. For example, which KNIME nodes did you use? How did you configure them? Did you try using different sets of attributes or different examples for training? Did you use cross-validation and, if so, what parameters did you use and why? (600 words)
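The partitioning and cross-validation ideas above can be sketched outside KNIME as follows. This is a minimal Python illustration of the logic only, not a description of how any KNIME node is implemented. The 80/20 split, the 10 folds, the fixed seed and the small label list are all assumptions chosen for the example; only the figure of 891 rows comes from the training set size given earlier.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Split row indices into train/test, keeping class proportions similar."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


def k_fold_indices(n_rows, k=10, seed=42):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    rng = random.Random(seed)
    order = list(range(n_rows))
    rng.shuffle(order)
    folds = [order[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [row for j, fold in enumerate(folds) if j != i
                    for row in fold]
        yield training, validation


# Hypothetical class labels, used only to demonstrate the split.
labels = [0] * 6 + [1] * 4
train_idx, test_idx = stratified_split(labels)

# Fold sizes for a table of 891 rows split into 10 folds.
fold_sizes = [len(val) for _, val in k_fold_indices(891, k=10)]
print(train_idx, test_idx, fold_sizes)
```

Fixing the random seed is what makes a split reproducible, so if your workflow uses random sampling, record the seed along with the split proportions and the number of folds.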