Week 6 Assessment: Code
Task
The Iris data set is a comprehensive data set introduced by Ronald Fisher in 1936,
detailing a number of measurements of three species of Iris flowers. It has gained some
popularity in the fields of Data Analytics and Machine Learning, as it provides a large
number of measurements across a relatively small number of categories.
The assessment task is to carry out some simple, computer-supported analysis of the
Iris data set.
Task details
The task has been broken into several, largely independent stages.
Stage 1: Reading and processing data
For this stage you need to complete the specification of the
read_and_process(csv_filename) function. This function should do the following:
• Import the csv file named csv_filename as a Pandas DataFrame
• Drop any rows that do not contain entries in all columns
• Strip ' cm' and ' mm' from each data point, and convert them to floats
• Divide the second column ('sepal_width') by 10
• Return the resulting DataFrame
You may assume that csv_filename is a readable csv file with a similar format to iris.csv
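The Stage 1 steps above could be sketched roughly as follows. This is a minimal sketch, not a reference solution: it assumes the four measurement columns carry the names used in Stage 3, and that raw values look like '5.1 cm' or '35 mm'. The MEASUREMENTS constant is a name introduced here for illustration.

```python
import pandas as pd

# Assumption: the measurement columns use the names listed in Stage 3.
MEASUREMENTS = ["sepal_length", "sepal_width", "petal_length", "petal_width"]


def read_and_process(csv_filename):
    """Load an iris-style csv file and return a cleaned DataFrame."""
    data = pd.read_csv(csv_filename)
    # Drop any row that is missing an entry in at least one column.
    data = data.dropna().copy()
    for column in MEASUREMENTS:
        # Remove the unit suffixes, then convert what is left to floats.
        stripped = (
            data[column]
            .astype(str)
            .str.replace(" cm", "", regex=False)
            .str.replace(" mm", "", regex=False)
        )
        data[column] = stripped.astype(float)
    # The second column ('sepal_width') is divided by 10.
    data["sepal_width"] = data["sepal_width"] / 10
    return data
```

Because pd.read_csv also accepts file-like objects, the function can be tried on a small in-memory sample before running it on iris.csv.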
Stage 2: User menu
For this stage you need to implement the initial interactions. When your program is run:
• Prompt the user to enter a csv file with Enter csv file:
• Read and process the user-entered file using the function from Stage 1
• Display the menu:
1. Create textual analysis
2. Create graphical analysis
3. Exit
• Prompt the user to select an option with Please select an option:
• Process the user's choice:
o If they select '1', proceed to Stage 3
o If they select '2', proceed to Stage 4
o If they select '3', exit the program with the exit() function
You may assume that only valid options are selected.
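One possible shape for the Stage 2 interaction is sketched below. The function name run_program and its parameters (load, text_stage, graph_stage, read_input) are hypothetical names chosen for this illustration; passing the stages in as arguments is just one way to keep the function free of global state.

```python
MENU_LINES = (
    "1. Create textual analysis",
    "2. Create graphical analysis",
    "3. Exit",
)


def run_program(load, text_stage, graph_stage, read_input=input):
    """Load the csv file once, then show the menu until the user exits."""
    csv_filename = read_input("Enter csv file: ")
    data = load(csv_filename)
    while True:
        for line in MENU_LINES:
            print(line)
        choice = read_input("Please select an option: ")
        if choice == "1":
            text_stage(data)
        elif choice == "2":
            graph_stage(data)
        else:
            # Only valid options are assumed, so anything else is '3'.
            exit()
```

In the real program, load would be read_and_process from Stage 1, and the two stage arguments would be the functions written for Stages 3 and 4. The read_input parameter defaults to the built-in input and exists only so the loop can be exercised without typing.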
https://en.wikipedia.org/wiki/Iris_flower_data_set
Stage 3: Text-based analysis
For this stage you will output some simple statistics based on the DataFrame loaded in
Stage 2. Upon entering this stage, the program should:
• Prompt the user for a species with: Select species (all, setosa,
versicolor, virginica):
To obtain full marks, the available species should be extracted from the DataFrame, and may be different
from those listed above. They should be arranged alphabetically, after all.
• Display the following statistics: Mean, 25%-ile, Median, 75%-ile, Standard
deviation for each of the characteristics (sepal_length, sepal_width,
petal_length, petal_width) for the species selected by the user. If the user
chose all, then the resulting table should be a summary of all the data.
• The output should be the result of printing a DataFrame with index:
['sepal_length', 'sepal_width', 'petal_length', 'petal_width'] and
column headings: Mean, 25%, Median, 75%, Std. (See Sample Interactions)
• Return to the main menu (Stage 2)
The output resulting from pandas function calls is sufficient. You do not need to manually round any
results.
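The prompt and the statistics table described in this stage might be built along these lines. This is a sketch under two assumptions: the species labels live in a column named 'species', and the helper names species_prompt and summarise are invented here for illustration.

```python
import pandas as pd

MEASUREMENTS = ["sepal_length", "sepal_width", "petal_length", "petal_width"]


def species_prompt(data):
    """Build the prompt from the species actually present in the data."""
    # 'all' comes first, followed by the species in alphabetical order.
    options = ["all"] + sorted(data["species"].unique())
    return "Select species ({}): ".format(", ".join(options))


def summarise(data, species):
    """Return the statistics table for one species, or for all rows."""
    if species != "all":
        data = data[data["species"] == species]
    values = data[MEASUREMENTS]
    # Each statistic is a Series indexed by characteristic name.
    return pd.DataFrame(
        {
            "Mean": values.mean(),
            "25%": values.quantile(0.25),
            "Median": values.median(),
            "75%": values.quantile(0.75),
            "Std": values.std(),
        },
        index=MEASUREMENTS,
    )
```

Printing the DataFrame returned by summarise gives a table with the characteristics as the index and the five statistics as column headings, with no manual rounding.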
Stage 4: Graphics-based analysis
For this stage you will output some simple graphical plots based on the DataFrame
loaded in Stage 2. Upon entering this stage, your program should:
• Prompt the user for a characteristic for the x-axis with: Choose the x-axis
characteristic (all, sepal_length, sepal_width, petal_length,
petal_width):
The available characteristics do not need to be extracted from the DataFrame
• If the user does not select all:
o Prompt the user for a characteristic for the y-axis with: Choose the y-
axis characteristic (sepal_length, sepal_width, petal_length,
petal_width):
o Plot a scatter-plot of the two chosen characteristics (does not have to be
displayed)
• If the user does select all:
o Using a scatter_matrix or pairplot, plot the relationships between all
pairs of characteristics
• In both cases, the program should prompt the user to enter a file with: Enter
save file: and then save the graphical plot to the entered file.
• Return to the main menu (Stage 2)
To obtain full marks, the outputs should differentiate the different species by colouring the data points
based on their species.
In addition to the automarked test-cases, the output of this Stage will be inspected by
your OL, and up to 5 marks awarded for output.
The marks will be based on the following criteria:
• Scatter plots of the correct characteristics (3 marks)
• Differentiation of species by colour (2 marks)
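As a rough illustration of the two plotting cases in this stage, the sketch below uses matplotlib for a single pair and pandas.plotting.scatter_matrix for all pairs. It assumes a 'species' column holds the labels; the function names plot_pair and plot_all are hypothetical, and the colour choices are arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # render straight to files; no display is required
import matplotlib.pyplot as plt
import pandas as pd

MEASUREMENTS = ["sepal_length", "sepal_width", "petal_length", "petal_width"]


def plot_pair(data, x_name, y_name, save_file):
    """Save a scatter plot of two characteristics, coloured by species."""
    figure, axes = plt.subplots()
    # One scatter call per species gives each species its own colour.
    for name, group in data.groupby("species"):
        axes.scatter(group[x_name], group[y_name], label=name)
    axes.set_xlabel(x_name)
    axes.set_ylabel(y_name)
    axes.legend(title="species")
    figure.savefig(save_file)
    plt.close(figure)


def plot_all(data, save_file):
    """Save a scatter matrix of all characteristics, coloured by species."""
    names = sorted(data["species"].unique())
    # 'C0', 'C1', ... refer to matplotlib's default colour cycle.
    palette = {name: "C{}".format(i) for i, name in enumerate(names)}
    colours = [palette[name] for name in data["species"]]
    grid = pd.plotting.scatter_matrix(data[MEASUREMENTS], c=colours)
    figure = grid[0, 0].figure
    figure.savefig(save_file)
    plt.close(figure)
```

The save_file argument would be whatever the user types at the Enter save file: prompt. A seaborn pairplot with its hue argument is an alternative for the all case.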
Stage 5: Conclusion
For this Stage you are required to complete the provided function conclusion(). Your
function should return a tuple containing the two (non-species) characteristics you
believe answer the following question: In iris.csv, which pair of characteristics is best
for separating the species? In other words, which pair of characteristics have the most
significant impact in determining what species the plant belongs to?
The two characteristics should be ordered alphabetically within the tuple, and should be
two of: 'sepal_length', 'sepal_width', 'petal_length', or 'petal_width'.
The return value should be hard-coded into the function (i.e., no calculations are required) based on your
own analysis of the data (using the program you just created, if appropriate).
If you are failing the last (hidden) test case, but passing the second last test case then add a comment
indicating the reason for your choice. Justification is not needed if you pass the last test case.
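Before submitting, the format of the returned tuple can be checked with a small helper such as the one below. The helper is_valid_conclusion is a hypothetical name introduced here; it checks only the format described above and says nothing about which pair is the right answer.

```python
VALID_NAMES = ("sepal_length", "sepal_width", "petal_length", "petal_width")


def is_valid_conclusion(pair):
    """Check that pair is a 2-tuple of valid names in alphabetical order."""
    return (
        isinstance(pair, tuple)
        and len(pair) == 2
        and all(name in VALID_NAMES for name in pair)
        # A strict comparison also rejects the same name given twice.
        and pair[0] < pair[1]
    )
```

Note that alphabetical order places both 'petal_...' names before both 'sepal_...' names, so a tuple such as ('sepal_width', 'petal_length') would fail the check.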
Subjective component
In addition to the above tasks, your code will be inspected by your OL and evaluated on
its adherence to good coding practices. Particular attention will be on the following
aspects of your code:
• Documentation: Appropriate use of comments
• Modularity: Appropriate use of functions (Note: if appropriate you should define
your own functions outside of those outlined above). All functions should "stand
alone" - that is, not be dependent on global variables
• Readability: Appropriate use of variable names
• Structure: Appropriate code layout so that the program flow is clear
Sample interactions
Enter csv file: iris.csv
1. Create textual analysis
2. Create graphical analysis
3. Exit
Please select an option: 1
Select species (all, setosa, versicolor, virginica): all
XXXXXXXXXXMean 25% Median 75% Std
sepal_length XXXXXXXXXX XXXXXXXXXX
sepal_width XXXXXXXXXX XXXXXXXXXX
petal_length XXXXXXXXXX XXXXXXXXXX
petal_width XXXXXXXXXX XXXXXXXXXX
1. Create textual analysis
2. Create graphical analysis
3. Exit
Please select an option: 3
https://www.python.org/dev/peps/pep-0008
Enter csv file: iris_test.csv
1. Create textual analysis
2. Create graphical analysis
3. Exit
Please select an option: 1
Select species (all, versicolor, virginica): versicolor
XXXXXXXXXXMean 25% Median 75% Std
sepal_length XXXXXXXXXX XXXXXXXXXX
sepal_width XXXXXXXXXX XXXXXXXXXX296507
petal_length XXXXXXXXXX XXXXXXXXXX
petal_width XXXXXXXXXX XXXXXXXXXX194066
1. Create textual analysis
2. Create graphical analysis
3. Exit
Please select an option: 3
Enter csv file: iris.csv
1. Create textual analysis
2. Create graphical analysis
3. Exit
Please select an option: 2
Choose the x-axis characteristic (all, sepal_length, sepal_width,
petal_length, petal_width): all
Enter save file: iris_all.png
1. Create textual analysis
2. Create graphical analysis
3. Exit
Please select an option: 3
After the above interaction, an example of iris_all.png would be either of the
following:
Enter csv file: iris.csv
1. Create textual analysis
2. Create graphical analysis
3. Exit
Please select an option: 2
Choose the x-axis characteristic (all, sepal_length, sepal_width,
petal_length, petal_width): sepal_width
Choose the y-axis characteristic (sepal_length, sepal_width, petal_length,
petal_width): sepal_width
Enter save file: sw_vs_sw.png
1. Create textual analysis
2. Create graphical analysis
3. Exit
Please select an option: 1
Select species (all, setosa, versicolor, virginica): all
XXXXXXXXXXMean 25% Median 75% Std
sepal_length XXXXXXXXXX XXXXXXXXXX
sepal_width XXXXXXXXXX XXXXXXXXXX
petal_length XXXXXXXXXX XXXXXXXXXX
petal_width XXXXXXXXXX XXXXXXXXXX
1. Create textual analysis
2. Create graphical analysis
3. Exit
Please select an option: 3
After the above interaction, an example of sw_vs_sw.png would be:
Note: Your plots do not have to have the same style options (e.g., colours, fonts) as the ones presented
here. Your plots will be assessed on whether they are plotting the correct data with the correct chart type
(i.e., a scatterplot).
Submission and feedback
You can click on the mark button, also used to submit your work, as many times as you
like. We will assess your last submission only.
You can see where your code differs from the expected output by examining the
feedback from the non-hidden test cases. The hidden test cases will test your code more
rigorously, but with suppressed input/output to limit dishonest attempts.
You are encouraged to test your code yourself and not rely on the provided test cases.
Two files suitable for input have been provided as part of your scaffold.
import pandas as pd
# Stage 1: Read and process