The Abalone Data Set was acquired from the open-source UCI Machine Learning repository...

Question

The Abalone Data Set was acquired from the open-source UCI Machine Learning repository https://archive.ics.uci.edu/ml/datasets/Abalone, which was provided by the Department of Primary Industry and Fisheries, Tasmania. It contains 4177 recordings, each having 9 features.
    Data Set Features    Number of recordings    Number of Attributes    Attribute Characteristics
    Multivariate    4177    9    Categorical, Integer, Real
In detail, the data can be summarized as follows:
    Attribute    Data Type    Units    Description
    Sex    nominal    N/A    M, F, and I (infant)
    Length    continuous    mm    Longest shell measurement
    Diameter    continuous    mm    perpendicular to length
    Height    continuous    mm    with meat in shell
    Whole weight    continuous    grams    whole abalone
    Shucked weight    continuous    grams    weight of meat
    Viscera weight    continuous    grams    gut weight (after bleeding)
    Shell weight    continuous    grams    after being dried
    Rings    integer    N/A    +1.5 gives the age in years
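The Rings row implies a simple conversion from ring count to age. A minimal sketch of that arithmetic follows; the function name `age_from_rings` is hypothetical and is not part of the original analysis.

```python
def age_from_rings(rings):
    """Estimate age in years from the ring count (rings + 1.5, per the Rings row)."""
    return rings + 1.5


# Example: an abalone with 10 rings is estimated to be 11.5 years old.
print(age_from_rings(10))
```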
After inspecting the dataset, we can test various hypotheses based on the visualizations and arrive at different conclusions, such as:
Hypothesis 1: "the mean of Length is the same for Male and Female" is the null hypothesis
Hypothesis 2: "the mean of Length is not the same for Male and Female" is the alternative hypothesis and the original claim
Using a boxplot, the ANOVA test and the t-test, with α = 0.05 and p-value = 8.987874966189928e-07, we reject the null hypothesis because the p-value is smaller than α. We conclude that the mean of Length is not the same for the two sexes of abalone, which suggests that the growth pattern differs between male and female.
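The ANOVA test and the t-test are reported with a single p-value. That is consistent with a general fact: for exactly two groups, the one-way ANOVA F statistic equals the square of the pooled-variance t statistic, so both tests lead to the same p-value. The sketch below illustrates this with the standard library only. It assumes the pooled-variance form of the t-test (the default of scipy's `ttest_ind`), the function names are hypothetical, and the numbers are made up rather than taken from the Abalone data.

```python
from statistics import mean, variance


def pooled_t_statistic(a, b):
    """Student's t statistic for two independent samples (pooled variance)."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / (pooled_var * (1 / n1 + 1 / n2)) ** 0.5


def one_way_f_statistic(a, b):
    """One-way ANOVA F statistic for exactly two groups."""
    n1, n2 = len(a), len(b)
    grand_mean = mean(a + b)
    # Between-group sum of squares (1 degree of freedom for two groups).
    ss_between = n1 * (mean(a) - grand_mean) ** 2 + n2 * (mean(b) - grand_mean) ** 2
    # Within-group sum of squares (n1 + n2 - 2 degrees of freedom).
    ss_within = (n1 - 1) * variance(a) + (n2 - 1) * variance(b)
    return ss_between / (ss_within / (n1 + n2 - 2))


# Made-up lengths, NOT taken from the Abalone data.
male = [0.55, 0.60, 0.58, 0.62, 0.57, 0.61]
female = [0.59, 0.64, 0.63, 0.66, 0.60, 0.65]

t = pooled_t_statistic(male, female)
f = one_way_f_statistic(male, female)
print("t =", t, "t^2 =", t ** 2, "F =", f)
```

Running both tests on two groups is therefore a consistency check rather than two independent pieces of evidence.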
Hypothesis: "the median of Rings of female and male is the same" is the null hypothesis
Using the Mann-Whitney U test, the median test and a histogram plot, with α = 0.05, p-value for the Mann-Whitney U test = 6.689638084926974e-05 and p-value for the median test = 0 XXXXXXXXXX, we reject the null hypothesis. We conclude that, although the sample medians look the same, the medians of Rings of male and female are actually not the same, which means the age distributions of male and female are different.
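To make the Mann-Whitney U test less of a black box, the sketch below computes the U statistic by counting pairwise wins and derives an approximate two-sided p-value from the normal approximation. It is a simplified illustration, not scipy's `mannwhitneyu`: it applies no tie or continuity correction, the function names are hypothetical, and the ring counts are made up rather than taken from the Abalone data.

```python
from statistics import NormalDist


def mann_whitney_u(x, y):
    """Return (U_x, U_y): pairs won by each sample, with ties counted as 0.5."""
    u_x = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u_x += 1.0
            elif xi == yj:
                u_x += 0.5
    return u_x, len(x) * len(y) - u_x


def normal_approx_p_value(u, n1, n2):
    """Two-sided p-value from the normal approximation (no tie or continuity correction)."""
    mu = n1 * n2 / 2
    sigma = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (u - mu) / sigma
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Made-up ring counts, NOT taken from the Abalone data.
male_rings = [9, 10, 10, 11, 12, 9, 10, 13]
female_rings = [10, 11, 12, 11, 13, 12, 10, 14]

u_m, u_f = mann_whitney_u(male_rings, female_rings)
p = normal_approx_p_value(u_m, len(male_rings), len(female_rings))
print("U (male) =", u_m, "U (female) =", u_f, "approx. p =", p)
```

With samples this small the normal approximation is rough. It is more reasonable for group sizes like those in the full data set.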
Hypothesis: When an abalone has fewer rings than the median rings of infants, its length, height and weight increase as rings increase. When an abalone has more rings than the median rings of infants, its length, height and weight are less likely to increase as rings increase.
Necessary Numbers: Pearsonr, p-value
For Length (less than the median rings of infant): pearsonr=0.74 ; p=1.1e-243
For Length (larger than the median rings of infant): pearsonr=0.14; p=1.2e-13
For Height (less than the median rings of infant): pearsonr=0.54 ; p=3.2e-107
For Height (larger than the median rings of infant): pearsonr=0.27 ; p=5.6e-46
For Whole weight (less than the median rings of infant): pearsonr=0.62 ; p=4.1e-148
For Whole weight (larger than the median rings of infant): pearsonr=0.2 ; p=3e-26
Because the pearsonr values for 'larger than the median rings of infant' are all smaller than those for 'less than the median rings of infant', and the low p-values show the reliability of these pearsonr results, we could not reject H0. This conclusion suggests that abalones grow (in length, height and weight) until a certain age (the median rings of infant), and after that their growth slows down dramatically.
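The comparison above splits the data at a ring threshold and computes a separate Pearson correlation for each part. The sketch below reproduces that procedure on made-up numbers in which size rises steadily and then levels off. The threshold of 8, the helper `pearson_r` and the data are all assumptions for illustration; none of them come from the Abalone data.

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Made-up (rings, length) values, NOT taken from the Abalone data:
# length rises steadily up to 8 rings and then levels off.
rings = list(range(1, 17))
length = [12, 19, 31, 38, 52, 59, 71, 78, 82, 79, 83, 80, 84, 80, 83, 82]

threshold = 8  # hypothetical stand-in for the median rings of infants

low = [(r, size) for r, size in zip(rings, length) if r <= threshold]
high = [(r, size) for r, size in zip(rings, length) if r > threshold]

# zip(*pairs) turns the list of pairs back into (rings, lengths) sequences.
r_low = pearson_r(*zip(*low))
r_high = pearson_r(*zip(*high))
print("r (rings <= threshold) =", round(r_low, 2))
print("r (rings > threshold)  =", round(r_high, 2))
```

A much weaker correlation above the threshold is the pattern that the reported pearsonr values show for Length, Height and Whole weight.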
Analysis: Which elements in the dataset are likely to have a linear relationship with Rings?
Conclusion from Analysis: Because Length has the largest p-value, it is the only element likely to have a linear relationship with Rings.
H0: "Length and Rings have a linear relationship" is the null hypothesis
Using Linear Regression with significance level α = 0.05 and a p-value of 0, we reject H0 because the p-value is smaller than α, and conclude that Length and Rings don't have a linear relationship.
H0: "Length and Rings have a linear relationship for Infant" is the null hypothesis
Using Linear Regression with significance level α = 0.05 and a p-value of 0, we conclude that for infants there is also no linear relationship between Length and Rings.
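The appendix builds the simple regression line from the correlation coefficient: the slope is B1 = r * (std of Rings / std of Length) and the intercept is B0 = mean(Rings) - B1 * mean(Length). The sketch below applies the same formulas using only the standard library. It uses population standard deviations, matching the default of `np.std`; the function name and the data are made up for illustration.

```python
from statistics import mean, pstdev


def simple_linear_fit(xs, ys):
    """Return (B0, B1) of the least-squares line y = B0 + B1 * x."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    # Pearson r from the population covariance and standard deviations.
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    r = cov / (sx * sy)
    b1 = r * sy / sx      # slope
    b0 = my - b1 * mx     # intercept
    return b0, b1


# Made-up data, NOT taken from the Abalone data.
x = [1, 2, 3, 4]
y = [2, 4, 5, 8]

b0, b1 = simple_linear_fit(x, y)
print("intercept B0 =", b0, "slope B1 =", b1)
```

The slope obtained this way is the ordinary least-squares slope, because r * sy / sx simplifies to the covariance divided by the variance of x.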
H0: "Height for infant is Gaussian distribution" is the null hypothesis
Using a QQ plot and normaltest with α = 0.05 and p-value = XXXXXXXXXX, the p-value is larger than α, so we cannot reject the null hypothesis that Height for infants follows a Gaussian distribution.
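A QQ plot compares the sorted observations with the quantiles of a fitted normal distribution; points close to a straight line indicate approximate normality. The sketch below builds those point pairs and summarizes their straightness with a correlation coefficient. This is a probability-plot correlation swapped in for illustration, not the D'Agostino-Pearson test that scipy's `normaltest` implements. The samples are randomly generated with a fixed seed and are not taken from the Abalone data, and all function names are hypothetical.

```python
import random
from statistics import NormalDist, mean, stdev


def qq_points(sample):
    """Pair each sorted observation with the matching quantile of a fitted normal."""
    n = len(sample)
    fitted = NormalDist(mean(sample), stdev(sample))
    # Plotting positions (i - 0.5) / n avoid the undefined 0 and 1 quantiles.
    theoretical = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return theoretical, sorted(sample)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


random.seed(0)
# Synthetic samples, NOT taken from the Abalone data.
normal_sample = [random.gauss(0.1, 0.03) for _ in range(200)]
skewed_sample = [random.expovariate(10) for _ in range(200)]

r_normal = pearson_r(*qq_points(normal_sample))
r_skewed = pearson_r(*qq_points(skewed_sample))
print("QQ correlation, normal sample:", r_normal)
print("QQ correlation, skewed sample:", r_skewed)
```

A QQ correlation close to 1 corresponds to points lying near the diagonal of the QQ plot; a clearly skewed sample bends away from it and scores lower.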
References:
https://archive.ics.uci.edu/ml/datasets/Abalone
https://rpubs.com/AlistairGJ/Abalone
https://datahub.io/machine-learning/abalone
Appendix:
# read in the data and parse column headers
import pandas as pd
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data'
abalone = pd.read_csv(url, header=None)
abalone.columns=["Sex","Length","Diameter","Height","Whole weight","Shucked weight","Viscera weight","Shell weight","Rings"]
abalone['Sex'] = abalone['Sex'].map({'M': 2, 'F': 1, 'I':0})
sexes = pd.unique(abalone.Sex.values)
aba_data = {sex:abalone['Length'][abalone.Sex == sex] for sex in sexes}
aba_df=pd.DataFrame({"Male":aba_data[2].tolist()[0:1307],"Female":aba_data[1].tolist()})
aba_df
# draw boxplot to show the shape of distribution
boxplot = aba_df.boxplot(column=['Male', 'Female'])
sex_val = {'Male_mean': aba_data[2].mean(), 'Male_std': aba_data[2].std(),'Female_mean': aba_data[1].mean(), 'Female_std': aba_data[1].std()}
df = pd.DataFrame(data=sex_val,index=[0])
# use the ANOVA test to obtain the p-value
import scipy.stats as stats
f, p = stats.f_oneway(aba_data[2], aba_data[1])
print("p-value for significance is: ", p)
if p < 0.05:
    print("reject null hypothesis")
else:
    print("accept null hypothesis")
from scipy.stats import ttest_ind
# use the t-test to confirm the result
ttest, pval = ttest_ind(aba_data[2], aba_data[1])
print("p-value for significance is: ", pval)
#print(aba_data[2].mean(),aba_data[1].mean(),aba_data[2].std(),aba_data[1].std())
if pval < 0.05:
    print("we reject null hypothesis")
else:
    print("we accept null hypothesis")
# calculate the median rings of both sexes (1 = Female, 2 = Male)
aba_rings = {sex: abalone['Rings'][abalone.Sex == sex] for sex in sexes}
aba_rings[1].median()
aba_rings[2].median()
aba_male=abalone[abalone['Sex']==2]
aba_female=abalone[abalone['Sex']==1]
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (12,7)
aba_male["Rings"].plot(kind='hist', legend=True,title='Male hist')
aba_female["Rings"].plot(kind='hist', legend=True,title='Female hist')
import scipy.stats as stats
u_statistic, pval = stats.mannwhitneyu(aba_male['Rings'], aba_female['Rings'])
print("P value is:", pval)
if pval < 0.05:  # alpha value is 0.05 or 5%
    print(" we are rejecting null hypothesis")
else:
    print("we are accepting null hypothesis")
from scipy.stats import median_test
stat, p, med, tbl = median_test(aba_male['Rings'], aba_female['Rings'])
print("P value is:", p, "and the median is", med)
# compare the median test's own p-value (p), not the Mann-Whitney pval from above
if p < 0.05:  # alpha value is 0.05 or 5%
    print(" we are rejecting null hypothesis")
else:
    print("we are accepting null hypothesis")
import seaborn as sns
aba_sns = sns.jointplot(data=abalone, x='Rings', y='Length', kind='kde')
aba_sns.fig.set_figwidth(10)
aba_sns.fig.set_figheight(10)
aba_sns = sns.jointplot(data=abalone, x='Rings', y='Height', kind='kde',color='k')
aba_sns.fig.set_figwidth(10)
aba_sns.fig.set_figheight(10)
aba_sns = sns.jointplot(data=abalone, x='Rings', y='Whole weight', kind='kde')
aba_sns.fig.set_figwidth(10)
aba_sns.fig.set_figheight(10)
# calculate the median rings of infants (Sex == 0) and split the data at that value
aba_rings = {sex: abalone['Rings'][abalone.Sex == sex] for sex in sexes}
infant_median = aba_rings[0].median()
abalone_s = abalone[abalone['Rings'] <= infant_median]
abalone_l = abalone[abalone['Rings'] > infant_median]
# residual plots and Pearson correlations for each subset
# (JointGrid.annotate is not available in recent seaborn releases, so pearsonr is printed instead)
for label, subset in [('<= infant median', abalone_s), ('> infant median', abalone_l)]:
    for col in ['Length', 'Height', 'Whole weight']:
        aba_sns = sns.jointplot(data=subset, x='Rings', y=col, kind='resid')
        aba_sns.fig.set_figwidth(10)
        aba_sns.fig.set_figheight(10)
        r, p = stats.pearsonr(subset['Rings'], subset[col])
        print(label, col, "pearsonr =", r, "p =", p)
# try to use Multiple Linear Regression
maba_y = pd.DataFrame(abalone.Rings)
maba_x = pd.DataFrame(abalone[abalone.columns[0:7]])
# show the separate relationship figures of each element with Rings
fig = plt.figure(figsize=(15, 10))
# a 2x4 grid is used so that all seven predictor columns fit
for i, col in enumerate(maba_x.columns, start=1):
    ax = fig.add_subplot(2, 4, i)
    ax.scatter(maba_x[col], maba_y.Rings)
    ax.set_title(col)
# instantiate the model
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
model = LinearRegression(fit_intercept=True, n_jobs=4)
# train
mabax_train, mabax_test, mabay_train, mabay_test = train_test_split(
    maba_x, maba_y, test_size=0.2, random_state=42)
fit = model.fit(mabax_train, mabay_train)
# make predictions
mpreds = model.predict(mabax_test)
## plot predicted vs actual
plt.figure(figsize=(10,10))
plt.scatter(mabay_test, mpreds)
plt.xlabel("True Values")
plt.ylabel("Predictions")
import statsmodels.api as sm
from scipy import stats
X2 = sm.add_constant(maba_x)
est = sm.OLS(maba_y, X2)
est2 = est.fit()
print(est2.summary())
# Mean Absolute Error
from sklearn.metrics import mean_absolute_error
MAE = mean_absolute_error(mabay_test, mpreds)
MAE
# RMSE
import numpy as np
from sklearn.metrics import mean_squared_error
MSE = mean_squared_error(mabay_test, mpreds)
RMSE = np.sqrt(MSE)
RMSE
# R2 Score - Coefficient of Determination
from sklearn.metrics import r2_score
r2_score(mabay_test, mpreds)
# Creating a Simple Linear Regression
r_sq = abalone[["Length", "Rings"]].corr()
#Calculating Slope (B1)
import numpy as np
B1 = r_sq.values[0][1] * (np.std(abalone.Rings)/np.std(abalone["Length"]))
print("For 1 unit of change in Length, we can predict {} units of change in Rings".format(B1))
#Calculating the Intercept
B0 = abalone.Rings.mean() - (B1 * abalone["Length"].mean())
B0
#Plotting the line of best fit
plt.rcParams["figure.figsize"] = (12,7)
abalone["Rings_line"] = B0 + (B1 * abalone["Length"])
plt.scatter(abalone["Length"],abalone.Rings) # create the main scatter plot
plt.plot(abalone["Length"], abalone.Rings_line) # plot the regression line
plt.ylabel("Dependent Variable")
plt.xlabel("Independent Variable")
#Split into Training and Test Sets
from sklearn.model_selection import train_test_split
aba_y = pd.DataFrame(abalone.Rings)
aba_x =pd.DataFrame(abalone["Length"])
abax_train, abax_test, abay_train, abay_test = train_test_split(
    aba_x, aba_y, test_size=0.2, random_state=42)
# Instantiating the linear model
from sklearn.linear_model import LinearRegression
lr = LinearRegression(fit_intercept=True, n_jobs=4)
fit = lr.fit(abax_train, abay_train)
#intercept
lr.intercept_
#Coefficients
coef_aba = pd.DataFrame({"feature": "Length",


Answered Same Day Jun 03, 2021

Solution

Pooja answered on Jun 03 2021


Abalone Data
Student ID:
DATA
    Attribute    Data Type    Units    Description
    Sex    nominal    N/A    M, F, and I (infant)
    Length    continuous    mm    Longest shell measurement
    Diameter    continuous    mm    perpendicular to length
    Height    continuous    mm    with meat in shell
    Whole weight    continuous    grams    whole abalone
    Shucked weight    continuous    grams    weight of meat
    Viscera weight    continuous    grams    gut weight (after bleeding)
    Shell weight    continuous    grams    after being dried
    Rings    integer    N/A    +1.5 gives the age in years
Length of Males and Females
There is not much difference between males and females in the median of Length (the longest shell measurement).
Outliers are present in both data sets.
With p < 5%, reject H0 and conclude that the mean of Length is not the same for the two sexes of abalone.
There is sufficient evidence to conclude that growth pattern is different between male and female.
Rings of female and male
The distribution of rings for males and females is skewed to the right. There are very few males and females with high values of rings.
With Whitney...
