PYTHON ASSIGNMENT
Submit your code, paste the generated plots into a Word document, and explain what each plot lets you conclude.
Let us use data analysis skills to determine which factors contribute to higher medical costs.
The insurance.csv dataset is related to individual medical costs billed by health insurance companies. It also includes some personal information.
Use these libraries: matplotlib.pyplot, numpy, pandas
Assignment
Data Description
· age: age of primary beneficiary
· sex: insurance contractor gender, 1 (female), 0 (male)
· bmi: body mass index (kg / m ^ 2), an objective index of body weight computed as weight divided by height squared; it indicates whether a weight is relatively high or low for a given height, ideally 18.5 to 24.9
· children: number of children covered by health insurance / number of dependents
· smoker: 1 (smoking), 0 (non-smoking)
· region: the beneficiary's residential area in the US, 0 (southwest), 1 (southeast), 2 (northwest), 3 (northeast)
· charges: individual medical costs billed by health insurance
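The numeric codes above are easy to misread in legends and tables, so it can help to attach readable labels before plotting. The following is a minimal sketch, assuming the columns are named exactly as in the data description; the helper name add_labels and the *_label column names are hypothetical and not part of the assignment.

```python
import pandas as pd

# Code-to-label mappings copied from the data description above.
SEX_LABELS = {1: "female", 0: "male"}
SMOKER_LABELS = {1: "smoker", 0: "non-smoker"}
REGION_LABELS = {0: "southwest", 1: "southeast", 2: "northwest", 3: "northeast"}


def add_labels(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with readable label columns added (hypothetical helper)."""
    out = df.copy()  # leave the caller's DataFrame untouched
    out["sex_label"] = out["sex"].map(SEX_LABELS)
    out["smoker_label"] = out["smoker"].map(SMOKER_LABELS)
    out["region_label"] = out["region"].map(REGION_LABELS)
    return out
```

A typical call would be insurance = add_labels(pd.read_csv('insurance.csv')), assuming insurance.csv sits in the working directory.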
Questions
1. We will examine whether bmi has an impact on the medical costs. Put bmi on the x-axis and the medical costs (charges) on the y-axis. The color of each point will be set according to whether the patient is a smoker. Set the transparency to 0.7. Be sure to include the colorbar, and set appropriate labels for the x-axis, the y-axis and the colorbar. What business insights can you get?
2. We further compare the distribution of the medical costs of smokers and that of non-smokers. Plot the distribution of medical costs of smokers first. Then on the same figure, plot the distribution of medical costs of non-smokers and set the transparency to 0.6. The number of bins is 12 for both plots. Set appropriate labels and legends.
3. We study whether age is an important factor by comparing the distribution of medical costs of younger people with that of older people. On the same plot, generate a histogram of medical costs of patients younger than 40 years old, and then another histogram representing the rest of the patients. Set the transparency of the second histogram to 0.7. The number of bins is 15 for both histograms. Set appropriate labels and legends. What can you conclude from this figure?
4. Open-ended question. Now it is your turn to discover something interesting and valuable! What else can you conclude from this dataset using the data visualization skills we learned? Generate two more figures and explain your findings.
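Questions 1 to 3 can be sketched with matplotlib roughly as follows. This is one possible layout under stated assumptions, not a model answer: the function names are hypothetical, the axis label wording is a suggestion, and df is assumed to be the insurance data with the column names and 0/1 codes given in the data description.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_bmi_vs_charges(df: pd.DataFrame):
    """Question 1: charges against bmi, colored by smoker status."""
    fig, ax = plt.subplots()
    points = ax.scatter(df["bmi"], df["charges"], c=df["smoker"], alpha=0.7)
    cbar = fig.colorbar(points, ax=ax)  # colorbar for the 0/1 smoker code
    cbar.set_label("Smoker (1 = yes, 0 = no)")
    ax.set_xlabel("BMI (kg / m^2)")
    ax.set_ylabel("Medical charges")
    return ax


def plot_charges_by_smoker(df: pd.DataFrame):
    """Question 2: smokers first, then non-smokers on the same axes."""
    fig, ax = plt.subplots()
    ax.hist(df.loc[df["smoker"] == 1, "charges"], bins=12, label="Smokers")
    ax.hist(df.loc[df["smoker"] == 0, "charges"], bins=12, alpha=0.6,
            label="Non-smokers")
    ax.set_xlabel("Medical charges")
    ax.set_ylabel("Number of patients")
    ax.legend()
    return ax


def plot_charges_by_age(df: pd.DataFrame, cutoff: int = 40):
    """Question 3: patients younger than cutoff versus the rest."""
    fig, ax = plt.subplots()
    ax.hist(df.loc[df["age"] < cutoff, "charges"], bins=15,
            label=f"Younger than {cutoff}")
    ax.hist(df.loc[df["age"] >= cutoff, "charges"], bins=15, alpha=0.7,
            label=f"{cutoff} and older")
    ax.set_xlabel("Medical charges")
    ax.set_ylabel("Number of patients")
    ax.legend()
    return ax
```

Each function returns the Axes so titles or limits can be adjusted afterwards; call plt.show() (or fig.savefig) once the figures are ready. Note that each histogram picks its own bin edges from its own data, so the two groups' bars will not necessarily line up; passing a shared range argument to both hist calls is one way to align them if that matters for the comparison.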
PART 2 of Assignment
Visualization Practice: Bike Sharing Systems
Bike sharing systems are a new generation of traditional bike rentals in which the whole process of membership, rental and return has become automatic. Through these systems, a user can easily rent a bike at one location and return it at another. Currently, there are over 500 bike-sharing programs around the world, comprising over 500 thousand bicycles. Today, there is great interest in these systems because of their important role in traffic, environmental and health issues.
Apart from the interesting real-world applications of bike sharing systems, the characteristics of the data generated by these systems make them attractive for research. Unlike other transport services such as bus or subway, the duration of travel and the departure and arrival positions are explicitly recorded in these systems. This feature turns a bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most important events in the city could be detected by monitoring these data.
Data Description: We will be using the daily version of the Capital Bikeshare System dataset from the UCI Machine Learning Repository. This data set contains information about the daily count of bike rental checkouts in Washington, D.C.’s bikeshare program between 2011 and 2012. It also includes information about the weather and seasonal/temporal features for that day (like whether it was a weekday).
• day: Day of the record (relative to day 1: XXXXXXXXXX)
• season: Season (1:winter, 2:spring, 3:summer, 4:fall)
• weekday: Day of the week (0=Sunday, 6=Saturday)
• workingday: 1 if the day is neither a weekend nor a holiday, otherwise 0.
• weathersit:
– 1: Clear, Few clouds, Partly cloudy
– 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
– 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
• temp: Normalized temperature in Celsius
• windspeed: Normalized wind speed
• casual: Count of checkouts by casual/non-registered users
• registered: Count of checkouts by registered users
• cnt: Total checkouts
import pandas as pd

# Load the daily bike-sharing data and preview the first rows
daily = pd.read_csv('day.csv')
daily.head()
Questions:
1. Understand Trends. Generate a line chart to show the checkouts over time by using day column as the x-axis and cnt column as the y-axis. Label the x-axis as ‘Day’, and y-axis as ‘Check Outs’. What can you conclude?
2. Explore Relationships. We will plot the daily count of bikes that were checked out by casual/non-registered users against the temperature. Color the points to be ‘#539cab’. Set the transparency to be 0.7. Be sure to include appropriate labels for x-axis and y-axis. What insight can you get?
3. Explore Relationships with Multidimensional Information. We will plot the daily count of bikes that were checked out by casual/non-registered users against the temperature. The color of each point will be set according to whether it is a working day. Set the transparency to be 0.7. Be sure to include appropriate labels for x-axis and y-axis. Label the color bar to indicate whether it is a working day. What additional insights can you get?
4. Examine Distributions. Let’s first build a histogram of the registered bike checkouts with the number of bins as 10. Set appropriate labels. Also set the title to be “Distribution of Registered Check Outs”.
5. Compare Distributions. We now compare the distributions of registered and casual checkouts. To make the figure easy to understand, in addition to the histogram we made for the previous question, we will add a histogram of the casual checkouts with the transparency set to 0.8 and the number of bins set to 5. Set appropriate labels.
6. How do the temperatures change across the seasons? You need to choose the type of visualization that best serves this purpose. What are the mean and median temperatures?
7. What else can you conclude from this dataset by using other data exploration techniques?
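Questions 1 to 6 can be sketched along the following lines. This is a hedged outline rather than a reference solution: the function names are hypothetical, the box plot in temp_by_season is only one reasonable chart type for question 6, the second histogram title is a suggestion, and daily is assumed to be the DataFrame loaded from day.csv with the column names listed in the data description.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_checkouts_over_time(daily: pd.DataFrame):
    """Question 1: line chart of total checkouts per day."""
    fig, ax = plt.subplots()
    ax.plot(daily["day"], daily["cnt"])
    ax.set_xlabel("Day")
    ax.set_ylabel("Check Outs")
    return ax


def plot_casual_vs_temp(daily: pd.DataFrame, by_workingday: bool = False):
    """Questions 2 and 3: casual checkouts against temperature."""
    fig, ax = plt.subplots()
    if by_workingday:
        # Question 3: color encodes the 0/1 workingday column
        points = ax.scatter(daily["temp"], daily["casual"],
                            c=daily["workingday"], alpha=0.7)
        fig.colorbar(points, ax=ax).set_label("Working day (1 = yes, 0 = no)")
    else:
        # Question 2: single fixed color
        ax.scatter(daily["temp"], daily["casual"], color="#539cab", alpha=0.7)
    ax.set_xlabel("Normalized temperature")
    ax.set_ylabel("Casual check outs")
    return ax


def plot_checkout_distributions(daily: pd.DataFrame, include_casual: bool = False):
    """Questions 4 and 5: registered histogram, optionally with casual overlaid."""
    fig, ax = plt.subplots()
    ax.hist(daily["registered"], bins=10, label="Registered")
    if include_casual:
        ax.hist(daily["casual"], bins=5, alpha=0.8, label="Casual")
        ax.set_title("Registered vs. Casual Check Outs")
        ax.legend()
    else:
        ax.set_title("Distribution of Registered Check Outs")
    ax.set_xlabel("Check outs per day")
    ax.set_ylabel("Number of days")
    return ax


def temp_by_season(daily: pd.DataFrame):
    """Question 6: per-season mean/median of temp plus a box plot."""
    stats = daily.groupby("season")["temp"].agg(["mean", "median"])
    seasons = list(stats.index)
    fig, ax = plt.subplots()
    ax.boxplot([daily.loc[daily["season"] == s, "temp"].to_numpy()
                for s in seasons])
    ax.set_xticklabels([str(s) for s in seasons])  # season codes as tick labels
    ax.set_xlabel("Season")
    ax.set_ylabel("Normalized temperature")
    return stats, ax
```

For example, stats, ax = temp_by_season(daily) returns a small table with one row per season code; the overall values are available as daily['temp'].mean() and daily['temp'].median(). Call plt.show() to display the figures.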
age,sex,bmi,children,smoker,region,charges
19,1,27.9,0,1,0, XXXXXXXXXX
18,0,33.77,1,0,1, XXXXXXXXXX
28,0,33,3,0,1, XXXXXXXXXX
33,0,22.705,0,0,2, XXXXXXXXXX
32,0,28.88,0,0,2, XXXXXXXXXX
31,1,25.74,0,0,1, XXXXXXXXXX
46,1,33.44,1,0,1, XXXXXXXXXX
37,1,27.74,3,0,2, XXXXXXXXXX
37,0,29.83,2,0,3, XXXXXXXXXX
60,1,25.84,0,0,2, XXXXXXXXXX
25,0,26.22,0,0,3, XXXXXXXXXX
62,1,26.29,0,1,1, XXXXXXXXXX
23,0,34.4,0,0,0, XXXXXXXXXX
56,1,39.82,0,0,1, XXXXXXXXXX
27,0,42.13,0,1,1, XXXXXXXXXX
19,0,24.6,1,0,0, XXXXXXXXXX
52,1,30.78,1,0,3, XXXXXXXXXX
23,0,23.845,0,0,3, XXXXXXXXXX
56,0,40.3,0,0,0, XXXXXXXXXX
30,0,35.3,0,1,0, XXXXXXXXXX
60,1,36.005,0,0,3, XXXXXXXXXX
30,1,32.4,1,0,0, XXXXXXXXXX
18,0,34.1,0,0,1, XXXXXXXXXX
34,1,31.92,1,1,3, XXXXXXXXXX
37,0,28.025,2,0,2, XXXXXXXXXX
59,1,27.72,3,0,1, XXXXXXXXXX
63,1,23.085,0,0,3, XXXXXXXXXX
55,1,32.775,2,0,2, XXXXXXXXXX
23,0,17.385,1,0,2, XXXXXXXXXX
31,0,36.3,2,1,0,38711
22,0,35.6,0,1,0, XXXXXXXXXX
18,1,26.315,0,0,3, XXXXXXXXXX
19,1,28.6,5,0,0, XXXXXXXXXX
63,0,28.31,0,0,2, XXXXXXXXXX
28,0,36.4,1,1,0, XXXXXXXXXX
19,0,20.425,0,0,2, XXXXXXXXXX
62,1,32.965,3,0,2, XXXXXXXXXX
26,0,20.8,0,0,0,2302.3
35,0,36.67,1,1,3, XXXXXXXXXX
60,0,39.9,0,1,0, XXXXXXXXXX
24,1,26.6,0,0,3, XXXXXXXXXX
31,1,36.63,2,0,1, XXXXXXXXXX
41,0,21.78,1,0,1, XXXXXXXXXX
37,1,30.8,2,0,1, XXXXXXXXXX
38,0,37.05,1,0,3, XXXXXXXXXX
55,0,37.3,0,0,0, XXXXXXXXXX
18,1,38.665,2,0,3, XXXXXXXXXX
28,1,34.77,0,0,2, XXXXXXXXXX
60,1,24.53,0,0,1, XXXXXXXXXX
36,0,35.2,1,1,1, XXXXXXXXXX
18,1,35.625,0,0,3, XXXXXXXXXX
21,1,33.63,2,0,2, XXXXXXXXXX
48,0,28,1,1,0, XXXXXXXXXX
36,0,34.43,0,1,1, XXXXXXXXXX
40,1,28.69,3,0,2, XXXXXXXXXX
58,0,36.955,2,1,2, XXXXXXXXXX
58,1,31.825,2,0,3, XXXXXXXXXX
18,0,31.68,2,1,1, XXXXXXXXXX
53,1,22.88,1,1,1, XXXXXXXXXX
34,1,37.335,2,0,2, XXXXXXXXXX
43,0,27.36,3,0,3, XXXXXXXXXX
25,0,33.66,4,0,1, XXXXXXXXXX
64,0,24.7,1,0,2, XXXXXXXXXX
28,1,25.935,1,0,2, XXXXXXXXXX
20,1,22.42,0,1,2, XXXXXXXXXX
19,1,28.9,0,0,0, XXXXXXXXXX
61,1,39.1,2,0,0, XXXXXXXXXX
40,0,26.315,1,0,2, XXXXXXXXXX
40,1,36.19,0,0,1, XXXXXXXXXX
28,0,23.98,3,1,1, XXXXXXXXXX
27,1,24.75,0,1,1, XXXXXXXXXX
31,0,28.5,5,0,3, XXXXXXXXXX
53,1,28.1,3,0,0, XXXXXXXXXX
58,0,32.01,1,0,1, XXXXXXXXXX
44,0,27.4,2,0,0, XXXXXXXXXX
57,0,34.01,0,0,2, XXXXXXXXXX
29,1,29.59,1,0,1, XXXXXXXXXX
21,0,35.53,0,0,1, XXXXXXXXXX
22,1,39.805,0,0,3, XXXXXXXXXX
41,1,32.965,0,0,2, XXXXXXXXXX
31,0,26.885,1,0,3, XXXXXXXXXX
45,1,38.285,0,0,3, XXXXXXXXXX
22,0,37.62,1,1,1, XXXXXXXXXX
48,1,41.23,4,0,2, XXXXXXXXXX
37,1,34.8,2,1,0, XXXXXXXXXX
45,0,22.895,2,1,2, XXXXXXXXXX
57,1,31.16,0,1,2, XXXXXXXXXX
56,1,27.2,0,0,0, XXXXXXXXXX
46,1,27.74,0,0,2, XXXXXXXXXX
55,1,26.98,0,0,2, XXXXXXXXXX
21,1,39.49,0,0,1, XXXXXXXXXX
53,1,24.795,1,0,2, XXXXXXXXXX
59,0,29.83,3,1,3, XXXXXXXXXX
35,0,34.77,2,0,2, XXXXXXXXXX
64,1,31.3,2,1,0, XXXXXXXXXX
28,1,37.62,1,0,1, XXXXXXXXXX
54,1,30.8,3