Faculty of Engineering, Environment and Computing
EEC 220CT
Assignment Brief 2018/19
Module Title
Data and Information Retrieval
individual Cohort
(Sep)
Module Code
220CT
Coursework Title (e.g. CWK1)
CW 1
Hand out date:
29 October 2018
Lecturer
Rachid Anane
Due date:
7 December 2018
Estimated Time (hrs): 20
Coursework type:
CW
% of Module Mark
50
Submission arrangement online via CUMoodle:
File types and method of recording:
Mark and Feedback date:
Mark and Feedback method: feedback file
Module Learning Outcomes Assessed:
1. Explain the difference between data and information and its significance as a business
resource.
2. Identify the main advantages and disadvantages of using database and
information retrieval systems.
3. Analyse, design, implement and manage a database
solution for a specified commercial or scientific objective.
4. Demonstrate understanding of
Big Data as a concept and as a business tool through the application of data analysis
techniques.
Task and Mark distribution:
1. Normalisation (25%)
2. Database design (25%)
3. MapReduce (25%)
4. Recommendation Systems (25%)
Notes:
1. You are expected to use the CUHarvard referencing format. For support and advice on this,
students can contact the Centre for Academic Writing (CAW).
2. Please notify your registry course support team and module leader for disability support.
3. Any student requiring an extension or deferral should follow the university process as outlined
here.
4. The University cannot take responsibility for any coursework lost or corrupted on disks, laptops
or personal computers. Students should therefore regularly back up any work and are advised to
save it on the University system.
5. If there are technical or performance issues that prevent students submitting coursework
through the online coursework submission system on the day of a coursework deadline, an
appropriate extension to the coursework submission deadline will be agreed. This extension
will normally be 24 hours or the next working day if the deadline falls on a Friday or over the
weekend period. This will be communicated via email and as a CUMoodle announcement.
220CT – Data and Information retrieval
This assignment is made up of four parts:
- Part 1 deals with normalisation and E-R modelling.
- Part 2 covers database design.
- Part 3 involves the application of MapReduce
- Part 4 concerns recommendation systems
Part 1: Normalisation (This task is worth 25 marks)
The International Space Station (ISS) is a habitable artificial satellite in low Earth orbit. It is
the ninth space station to be inhabited by crews, following previous orbital stations that were
launched by the US, the former Soviet Union and later Russia. The ISS is intended to be a
laboratory, observatory and factory in space as well as to provide transportation,
maintenance, and act as a staging base for possible future missions to the Moon, Mars and
beyond. In order to support the crew and overall operation of the ISS, the space agencies in
charge of running the station conduct regular missions to launch spacecraft carrying
payloads of essential or replacement equipment up to the ISS. A payload inventory (see the table
below) is recorded for each mission, consisting of the space agency leading the mission and
the equipment payload to be sent up to the ISS.
Mission No. | Agency No. | Lead Agency | Country | Mission Date | Equipment               | Qty | Equipment Weight
ISS-2237    | 178        | JAXA        | Japan   | 14/12/2016   | Potable water dispenser | 2   | 100kg
            |            |             |         |              | Flexible air duct       | 6   | 0.5kg
            |            |             |         |              | Small storage rack      | 4   | 2kg
ISS-3664    | 526        | ESA         | EU      | 16/01/2017   | Bio Filter              | 6   | 0.20kg
ISS-2356    | 167        | NASA        | USA     | 12/04/2017   | Small storage rack      | 3   | 2kg
            |            |             |         |              | Battery pack            | 2   | 5kg
            |            |             |         |              | Urine transfer tubing   | 2   | 1.5kg
            |            |             |         |              | O2 scrubber             | 1   | 50kg
ISS-1234    | 032        | Roskosmos   | Russia  | 16/04/218    | Small storage rack      | 1   | 2kg
            |            |             |         |              | Flexible air duct       | 2   | 0.5kg
1. Explain why the table is not normalised.
2. Identify and state the functional dependencies in the table.
3. Generate 1NF, 2NF and 3NF normalised relations.
- Justify clearly every step
- Produce the corresponding tables
4. Produce SQL statements to create the 3NF relations (tables), and include SQL insert
statements for each of the tables.
5. Comment critically on the normalisation process.
6. Generate the ER diagram corresponding to the table.
Part 2: Database Design (This task is worth 25 marks)
The NASA exoplanet dataset archive can be found here:
https://exoplanetarchive.ipac.caltech.edu/cgi-bin/TblView/nph-tblView?app=ExoTbls&config=planets
In the context of Big Data, you are asked to design a database solution for the exoplanet
data set above. Your solution must include the following:
1. The database solution of your choice.
2. Justification for the choice of the database.
3. A detailed explanation of how the data will be stored and accessed in the database
you choose.
4. The benefits and drawbacks of this solution in relation to the type of data above and
the size of the data set.
5. The quality of service (QoS), such as scalability, that should be provided to the user
should this solution be adopted.
Part 3: Sequential and parallel processing (This task is worth 25 marks)
Consider a flight data store with the following data structure, where all times are in GMT.
Each record consists of 13 attributes; the set of allowable values of the attributes and
their format are specified in the description (metadata).
XXXXXXXXXX | Data Value            | Description
1          | Year                  | XXXXXXXXXX
2          | Month                 | XXXXXXXXXX
3          | Day of Month          | XXXXXXXXXX
4          | Day of the Week       | 1 (Monday) – 7 (Sunday)
5          | Departure Time        | Recorded Departure time (hhmm)
6          | Actual Departure time | Scheduled Departure time (hhmm)
7          | Arrival Time          | Recorded Arrival time (hhmm)
8          | Carrier               | Carrier code (unique)
9          | Flight Number         | Flight Number
10         | Departure Delay       | minutes
11         | Arrival Delay         | minutes
12         | Cancellation          | Yes or No
13         | Weather Delay         | minutes
An example record would have the following values:
(2015, 4, 20, 5, 1430, 1400, 1820, 131, JL729, 30, 15, No, 0)
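To make the record layout concrete, the example above could be read into named fields along the following lines. This is only an illustrative sketch: the `FlightRecord` type, its field names, the `parse_record` helper and the choice of Python types are assumptions made for this example and are not prescribed by the brief.

```python
from typing import NamedTuple


class FlightRecord(NamedTuple):
    """One flight record; field names are illustrative, not prescribed."""
    year: int
    month: int
    day_of_month: int
    day_of_week: int            # 1 (Monday) to 7 (Sunday)
    departure_time: str         # hhmm, kept as text to preserve leading zeros
    actual_departure_time: str  # hhmm
    arrival_time: str           # hhmm
    carrier: str                # carrier code
    flight_number: str
    departure_delay: int        # minutes
    arrival_delay: int          # minutes
    cancelled: bool             # "Yes" or "No" in the raw data
    weather_delay: int          # minutes


def parse_record(line: str) -> FlightRecord:
    """Split a '(v1, v2, ..., v13)' line into a typed FlightRecord."""
    fields = [f.strip() for f in line.strip().strip("()").split(",")]
    if len(fields) != 13:
        raise ValueError(f"expected 13 attributes, got {len(fields)}")
    return FlightRecord(
        year=int(fields[0]),
        month=int(fields[1]),
        day_of_month=int(fields[2]),
        day_of_week=int(fields[3]),
        departure_time=fields[4],
        actual_departure_time=fields[5],
        arrival_time=fields[6],
        carrier=fields[7],
        flight_number=fields[8],
        departure_delay=int(fields[9]),
        arrival_delay=int(fields[10]),
        cancelled=fields[11].lower() == "yes",
        weather_delay=int(fields[12]),
    )


record = parse_record(
    "(2015, 4, 20, 5, 1430, 1400, 1820, 131, JL729, 30, 15, No, 0)"
)
```

Times are kept as text here so that values such as 0930 would not lose their leading zero; other representations are equally possible.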
Flight monitors would like to determine the number of flights which were delayed for each
carrier.
1. Assuming that the data is stored in a relational database, produce, with justification,
the SQL statement to create the table and the SQL statement to determine the
number of flights which were delayed for each carrier.
2. Assuming that the data is too large to be processed in a centralised manner, and that
it is stored in an ordinary file, produce a distributed solution which applies
MapReduce to the data processing.
a) Justify your decisions and all the steps of your solution, and specify clearly the
map and reduce functions.
b) Identify the advantages and drawbacks of this solution.
c) Use diagrams if required.
3. Assuming that the monitors wish to determine the number of delayed flights for a
specific year or month for example, comment on the general applicability of your
solution.
Part 4: Big Data and recommendation systems (This task is worth 25 marks)
Research and comment critically on the structure and the use of recommendation systems.
a) You should pay particular attention to the rationale, the architecture, the processes,
the effectiveness, the implications of recommendation systems and relevant issues
within a Big Data context.
Your arguments should be supported by specific examples and case studies and
should be properly referenced.
Use suitable diagrams if required.
b) Produce in your own words a well-structured and adequately referenced report that
should be no more than 1000 words.
Mark Scheme
Q1
Achieve 40%
• Evidence of partially correct applicable
and correctly identified database.
• Evidence of reasoning behind database
choice.
• For each activity a brief explanation of
design decisions should be provided.
• Models providing detail about the design
decisions and database design provided.
Achieve 70%
• A complete and correct design, including
all elements.
• A complete explanation of the reasons
behind the choice of Database.
• A complete and fully implemented database.
• For each step an explanation and
justification of how and why it was applied.
Q2
Achieve 40%
• Basic definition of what data
mining is with a few references.
• Basic understanding of sequential and
parallel processing.
• Basic application of a partially correct
SQL query.
• Partial understanding of parallel
processing.
• Partially correct MapReduce solution.
• Basic rationale for the solution
presented.
Achieve 70%
• Excellent definition of what data mining
is with a diverse set of