The Abalone Data Set was acquired from the open-source UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Abalone) and was provided by the Department of Primary Industry and Fisheries, Tasmania. It contains 4177 records with 9 attributes.
Data Set Features: Multivariate
Number of records: 4177
Number of Attributes: 9
Attribute Characteristics: Categorical, Integer, Real
In detail, the attributes can be summarized as follows:
Attribute       Data Type   Units  Description
Sex             nominal     N/A    M, F, and I (infant)
Length          continuous  mm     Longest shell measurement
Diameter        continuous  mm     perpendicular to length
Height          continuous  mm     with meat in shell
Whole weight    continuous  grams  whole abalone
Shucked weight  continuous  grams  weight of meat
Viscera weight  continuous  grams  gut weight (after bleeding)
Shell weight    continuous  grams  after being dried
Rings           integer     N/A    +1.5 gives the age in years
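The Rings row implies a simple conversion from ring count to age. A minimal sketch of that conversion follows, assuming comma-separated rows in the column order of the attribute table above. The three sample rows and the helper name rings_to_age are illustrative only and are not results from this analysis.

```python
import csv
import io

# Illustrative rows in the column order of the attribute table
# (Sex, Length, Diameter, Height, four weights, Rings); not analysis results.
SAMPLE = """M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
"""


def rings_to_age(rings):
    """Estimate age in years from the ring count (Rings + 1.5)."""
    return rings + 1.5


rows = list(csv.reader(io.StringIO(SAMPLE)))
# Rings is the last column of every row.
ages = [rings_to_age(int(row[-1])) for row in rows]
print(ages)
```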
After inspecting the dataset, we can test various hypotheses based on the visualizations and arrive at different conclusions, such as:
Hypothesis 1: "the mean of Length is the same for males and females" is the null hypothesis.
Hypothesis 2: "the mean of Length is not the same for males and females" is the alternative hypothesis and the original claim.
Using a boxplot, an ANOVA test, and a t-test, with significance level α = 0.05 and p-value = 8.987874966189928e-07, we reject the null hypothesis because the p-value is smaller than α. We conclude that the mean length is not the same for the two sexes of abalone, which suggests that the growth pattern differs between males and females.
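With only two groups, a one-way ANOVA and a pooled-variance t-test are equivalent: the F statistic equals the square of the t statistic, so both tests give the same p-value. The sketch below checks that identity by hand. The two short samples are made-up numbers, not values from the data set, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(a, b):
    """Student's two-sample t statistic with pooled variance."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(sp2 * (1 / na + 1 / nb))


def anova_f(*groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    all_values = [x for g in groups for x in g]
    grand = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Made-up Length samples for illustration only.
male = [0.455, 0.350, 0.440, 0.475, 0.430, 0.490]
female = [0.530, 0.545, 0.550, 0.525, 0.565, 0.470]

t = pooled_t(male, female)
f = anova_f(male, female)
print("t =", t, "t^2 =", t * t, "F =", f)
```

Because the two statistics carry the same information here, running both tests serves as a cross-check rather than as independent evidence.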
Hypothesis: "the median of Rings is the same for females and males" is the null hypothesis.
Using the Mann-Whitney U test, the median test, and histogram plots, with α = 0.05, p-value for the Mann-Whitney U test = 6.689638084926974e-05, and p-value for the median test = 0 XXXXXXXXXX, we reject the null hypothesis. We conclude that, although the sample medians look the same, the medians of Rings for males and females are in fact not the same, which means the age distributions of males and females differ.
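The Mann-Whitney U test works on the ordering of the two samples rather than on the medians themselves, which is how it can separate samples whose medians look alike. As a rough illustration, the U statistic can be computed by counting pairs; the ring counts below are invented, and the helper name is hypothetical. The sketch stops at the statistic and does not compute a p-value.

```python
def mann_whitney_u(a, b):
    """U statistic for sample a: pairs (x, y) with x > y count 1, ties count 0.5."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


# Invented ring counts for illustration only; integer data produce many ties.
male_rings = [9, 10, 10, 11, 12, 15]
female_rings = [9, 10, 11, 11, 13, 16]

u_male = mann_whitney_u(male_rings, female_rings)
u_female = mann_whitney_u(female_rings, male_rings)
# The two U values always sum to the number of pairs, len(a) * len(b).
print(u_male, u_female, len(male_rings) * len(female_rings))
```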
Hypothesis: When an abalone's ring count is less than the median ring count of infants, its length, height, and weight increase as the ring count increases. When its ring count is larger than the median ring count of infants, its length, height, and weight are less likely to increase as the ring count increases.
Key metrics: Pearson correlation coefficient (pearsonr) and p-value
For Length (less than the median rings of infant): pearsonr=0.74 ; p=1.1e-243
For Length (larger than the median rings of infant): pearsonr=0.14; p=1.2e-13
For Height (less than the median rings of infant): pearsonr=0.54 ; p=3.2e-107
For Height (larger than the median rings of infant): pearsonr=0.27 ; p=5.6e-46
For Whole weight (less than the median rings of infant): pearsonr=0.62 ; p=4.1e-148
For Whole weight (larger than the median rings of infant): pearsonr=0.2 ; p=3e-26
Because the pearsonr values for 'larger than the median rings of infant' are all smaller than those for 'less than the median rings of infant', and the low p-values indicate that these correlations are statistically significant, we cannot reject H0. This suggests that abalone grow (in length, height, and weight) until a certain age (the median rings of infant), and that after that point their growth slows down dramatically.
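The comparison above amounts to splitting the data at a ring-count threshold and computing Pearson's r separately on each side. A minimal sketch of that procedure follows. The threshold value and the (rings, length) pairs are made up to mimic growth followed by a plateau, and pearson_r is a hand-written helper rather than the scipy function used in the appendix.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


THRESHOLD = 8  # hypothetical stand-in for the median ring count of infants

# Made-up (rings, length) pairs: steady growth up to the threshold, then a plateau.
data = [(r, 0.075 * r) for r in range(1, 9)] + [
    (9, 0.60), (10, 0.58), (11, 0.62), (12, 0.59), (13, 0.61), (14, 0.60),
]

low = [(r, length) for r, length in data if r <= THRESHOLD]
high = [(r, length) for r, length in data if r > THRESHOLD]

r_low = pearson_r(*zip(*low))    # correlation at or below the threshold
r_high = pearson_r(*zip(*high))  # correlation above the threshold
print("r_low =", r_low, "r_high =", r_high)
```

A much smaller r above the threshold than below it is the pattern the hypothesis predicts.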
Analysis: Which attributes in the dataset are likely to have a linear relationship with Rings?
Conclusion from Analysis: Because Length has the largest p-value, it is the only attribute likely to have a linear relationship with Rings.
H0: "Length and Rings have a linear relationship" is the null hypothesis.
Using linear regression with significance level α = 0.05, we obtain a p-value of 0. Because the p-value is smaller than α, we reject the null hypothesis and conclude that Length and Rings do not have a linear relationship.
H0: "Length and Rings have a linear relationship for infants" is the null hypothesis.
Using linear regression with significance level α = 0.05, we again obtain a p-value of 0, so we conclude that for infants, too, Length and Rings do not have a linear relationship.
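For reference, the simple regression line in the appendix is built from the correlation coefficient: slope B1 = r * (s_y / s_x) and intercept B0 = mean(y) - B1 * mean(x). The sketch below applies those two formulas by hand. The Length and Rings values are invented for illustration, and fit_line is a hypothetical helper, not part of the original analysis.

```python
from math import sqrt


def fit_line(x, y):
    """Least-squares line via B1 = r * (s_y / s_x) and B0 = mean(y) - B1 * mean(x)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = sqrt(sum((a - mx) ** 2 for a in x) / n)  # population standard deviation of x
    sy = sqrt(sum((b - my) ** 2 for b in y) / n)  # population standard deviation of y
    r = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n * sx * sy)
    b1 = r * sy / sx
    b0 = my - b1 * mx
    return b0, b1, r


# Invented values for illustration only.
length = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
rings = [4, 6, 7, 9, 10, 10]

b0, b1, r = fit_line(length, rings)
residuals = [y - (b0 + b1 * x) for x, y in zip(length, rings)]
print("B0 =", b0, "B1 =", b1, "r^2 =", r * r)
```

For a single predictor, r squared equals the coefficient of determination, so it gives a quick sense of how much of the variation in Rings the line accounts for.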
H0: "Height for infants follows a Gaussian distribution" is the null hypothesis.
Using a Q-Q plot and normaltest with significance level α = 0.05 and p-value = XXXXXXXXXX, the p-value is larger than α, so we cannot reject the null hypothesis and conclude that Height for infants is consistent with a Gaussian distribution.
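scipy.stats.normaltest is documented as D'Agostino and Pearson's test, which combines sample skewness and kurtosis into a single statistic. The sketch below computes only those two ingredients by hand to show what the test looks at. It does not reproduce the test statistic or its p-value, the sample values are invented, and the helper name is hypothetical.

```python
from statistics import mean


def skew_and_excess_kurtosis(values):
    """Moment-based sample skewness and excess kurtosis (both near 0 for Gaussian data)."""
    n = len(values)
    m = mean(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


# Invented samples for illustration only.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]

sym_skew, sym_kurt = skew_and_excess_kurtosis(symmetric)
rs_skew, rs_kurt = skew_and_excess_kurtosis(right_skewed)
print("symmetric:", sym_skew, sym_kurt)
print("right-skewed:", rs_skew, rs_kurt)
```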
References:
https://archive.ics.uci.edu/ml/datasets/Abalone
https://rpubs.com/AlistairGJ/Abalone
https://datahub.io/machine-learning/abalone
Appendix:
# read in the data and assign column headers
import pandas as pd
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data'
abalone = pd.read_csv(url, header=None)
abalone.columns = ["Sex", "Length", "Diameter", "Height", "Whole weight", "Shucked weight", "Viscera weight", "Shell weight", "Rings"]
# encode Sex as integers: male = 2, female = 1, infant = 0
abalone['Sex'] = abalone['Sex'].map({'M': 2, 'F': 1, 'I': 0})
sexes = pd.unique(abalone.Sex.values)
# Length values grouped by sex code
aba_data = {sex: abalone['Length'][abalone.Sex == sex] for sex in sexes}
# truncate the male sample to 1307 values so both columns have the same length
aba_df = pd.DataFrame({"Male": aba_data[2].tolist()[0:1307], "Female": aba_data[1].tolist()})
aba_df
# draw boxplot to show the shape of distribution
boxplot = aba_df.boxplot(column=['Male', 'Female'])
sex_val = {'Male_mean': aba_data[2].mean(), 'Male_std': aba_data[2].std(),'Female_mean': aba_data[1].mean(), 'Female_std': aba_data[1].std()}
df = pd.DataFrame(data=sex_val,index=[0])
# use a one-way ANOVA test to obtain the p-value
from scipy import stats
f, p = stats.f_oneway(aba_data[2], aba_data[1])
print("p-value for significance is: ", p)
if p < 0.05:
    print("reject null hypothesis")
else:
    print("fail to reject null hypothesis")
from scipy.stats import ttest_ind
# use a t-test to confirm the result
ttest, pval = ttest_ind(aba_data[2], aba_data[1])
print("p-value for significance is: ", pval)
#print(aba_data[2].mean(),aba_data[1].mean(),aba_data[2].std(),aba_data[1].std())
if pval < 0.05:
    print("we reject null hypothesis")
else:
    print("we fail to reject null hypothesis")
# calculate the median rings of females (1) and males (2)
aba_rings = {sex: abalone['Rings'][abalone.Sex == sex] for sex in sexes}
aba_rings[1].median()
aba_rings[2].median()
aba_male=abalone[abalone['Sex']==2]
aba_female=abalone[abalone['Sex']==1]
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (12,7)
aba_male["Rings"].plot(kind='hist', legend=True,title='Male hist')
aba_female["Rings"].plot(kind='hist', legend=True,title='Female hist')
import scipy.stats as stats
# Mann-Whitney U test on Rings of males vs females
u_statistic, pval = stats.mannwhitneyu(aba_male['Rings'], aba_female['Rings'])
print("P value is:", pval)
if pval < 0.05:  # alpha value is 0.05 or 5%
    print("we are rejecting null hypothesis")
else:
    print("we fail to reject null hypothesis")
from scipy.stats import median_test
# median test on Rings of males vs females
stat, p, med, tbl = median_test(aba_male['Rings'], aba_female['Rings'])
print("P value is:", p, "and the median is", med)
if p < 0.05:  # alpha value is 0.05 or 5%; use this test's own p-value
    print("we are rejecting null hypothesis")
else:
    print("we fail to reject null hypothesis")
import seaborn as sns
# joint density plots of each measurement against Rings
aba_sns = sns.jointplot(data=abalone, x='Rings', y='Length', kind='kde')
aba_sns.fig.set_figwidth(10)
aba_sns.fig.set_figheight(10)
aba_sns = sns.jointplot(data=abalone, x='Rings', y='Height', kind='kde',color='k')
aba_sns.fig.set_figwidth(10)
aba_sns.fig.set_figheight(10)
aba_sns = sns.jointplot(data=abalone, x='Rings', y='Whole weight', kind='kde')
aba_sns.fig.set_figwidth(10)
aba_sns.fig.set_figheight(10)
# calculate the median rings of infants (Sex == 0)
aba_rings = {sex: abalone['Rings'][abalone.Sex == sex] for sex in sexes}
infant_median = aba_rings[0].median()
# abalone with rings at or below the infant median, and above it
abalone_s = abalone[abalone['Rings'] <= infant_median]
abalone_l = abalone[abalone['Rings'] > infant_median]
# residual plot and Pearson correlation of each measurement against Rings;
# JointGrid.annotate was removed in seaborn 0.11, so pearsonr is printed instead
for label, subset in (("<= infant median", abalone_s), ("> infant median", abalone_l)):
    for col in ("Length", "Height", "Whole weight"):
        aba_sns = sns.jointplot(data=subset, x='Rings', y=col, kind='resid')
        aba_sns.fig.set_figwidth(10)
        aba_sns.fig.set_figheight(10)
        r, p = stats.pearsonr(subset['Rings'], subset[col])
        print(col, "(" + label + "): pearsonr =", r, "; p =", p)
# try to use Multiple Linear Regression
maba_y = pd.DataFrame(abalone.Rings)
maba_x = pd.DataFrame(abalone[abalone.columns[0:7]])
# show the separate relationship figures of each element with Rings
fig = plt.figure(figsize=(15, 10))
i = 1
for col in maba_x.columns:
    # 2x4 grid: maba_x has seven columns, which do not fit in a 2x3 grid
    ax = fig.add_subplot(2, 4, i)
    i = i + 1
    plt.scatter(maba_x[col], maba_y.Rings)
    ax.set_title(maba_x[col].name)
# instantiate the model
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
model = LinearRegression(fit_intercept=True,
                         n_jobs=4)
# split into training and test sets (80% / 20%)
mabax_train, mabax_test, mabay_train, mabay_test = train_test_split(maba_x, maba_y,
                                                                    test_size=0.2,
                                                                    random_state=42)
fit = model.fit(mabax_train, mabay_train)
# make predictions
mpreds = model.predict(mabax_test)
## plot predicted vs actual
plt.figure(figsize=(10,10))
plt.scatter(mabay_test, mpreds)
plt.xlabel("True Values")
plt.ylabel("Predictions")
import statsmodels.api as sm
from scipy import stats
X2 = sm.add_constant(maba_x)
est = sm.OLS(maba_y, X2)
est2 = est.fit()
print(est2.summary())
#Mean Absolute Error
from sklearn.metrics import mean_absolute_error
MAE = mean_absolute_error(mabay_test, mpreds)
MAE
#RMSE
import numpy as np
from sklearn.metrics import mean_squared_error
MSE = mean_squared_error(mabay_test, mpreds)
RMSE = np.sqrt(MSE)
RMSE
#R2 Score- Coefficient of Determination
from sklearn.metrics import r2_score
r2_score(mabay_test, mpreds)
#Creating a Simple Linear Regression
r_sq = abalone[["Length", "Rings"]].corr()
#Calculating Slope (B1)
import numpy as np
B1 = r_sq.values[0][1] * (np.std(abalone.Rings)/np.std(abalone["Length"]))
print("For 1 unit of change in Length, we can predict {} units of change in Rings".format(B1))
#Calculating the Intercept
B0 = abalone.Rings.mean() - (B1 * abalone["Length"].mean())
B0
#Plotting the line of best fit
plt.rcParams["figure.figsize"] = (12,7)
abalone["Rings_line"] = B0 + (B1 * abalone["Length"])
plt.scatter(abalone["Length"],abalone.Rings) # create the main scatter plot
plt.plot(abalone["Length"], abalone.Rings_line) # plot the regression line
plt.ylabel("Dependent Variable")
plt.xlabel("Independent Variable")
#Split into Training and Test Sets
from sklearn.model_selection import train_test_split
aba_y = pd.DataFrame(abalone.Rings)
aba_x =pd.DataFrame(abalone["Length"])
abax_train, abax_test, abay_train, abay_test = train_test_split(aba_x, aba_y,
                                                                test_size=0.2,
                                                                random_state=42)
#Instantiating the linear model
from sklearn.linear_model import LinearRegression
lr = LinearRegression(fit_intercept=True,
                      n_jobs=4)
fit = lr.fit(abax_train, abay_train)
#intercept
lr.intercept_
#Coefficients
coef_aba = pd.DataFrame({"feature": "Length",