


Titanic Analysis (Steps 1-15)

Dataset: train.csv

Provide the code and comments as laid out in the following steps:

1.Load the data from the “train.csv” file into a DataFrame.

2.Display the dimensions of the file (so you’ll have a good idea of the amount of data you are working with).

3.Display the first 5 rows of data so you can see the column headings and the type of data for each column.

a.Notice that Survived is represented as a 1 or 0

b.Notice that missing data is represented as “NaN”

c.The Survived variable will be the “target” and the other variables will be the “features”

4.Think about some questions that might help you predict who will survive:

a.What do the variables look like? For example, are they numerical or categorical? If they are numerical, what are their distributions? If they are categorical, how many observations fall into each category?

b.Are the numerical variables correlated?

c.Are the distributions of the numerical variables the same or different between passengers who survived and those who did not? Is the survival rate different for different values? For example, were people more likely to survive if they were younger?

d.Are there different survival rates in different categories? For example, did more women survive than men?

5.Look at summary information about your data (total, mean, min, max, freq., unique, etc.). Does this present any more questions for you? Does it lead you to a conclusion yet?

6.Make some histograms of your data (“A picture is worth a thousand words!”)

a.Most of the passengers are around 20 to 30 years old and don't have siblings or relatives with them. A large share of the tickets sold cost less than $50. Very few tickets were sold at a fare over $500.

7.Make some bar charts for variables with only a few options.

a.Ticket and Cabin have more than 100 distinct values, so don’t do those!

8.To see if the data is correlated, make some Pearson Ranking charts

a.Notice that in the sample code, I have saved this png file.

b.The correlation between the variables is low (1 or -1 is high positive or high negative, 0 is low or no correlation). These results show there is “some” positive correlation but it’s not a high correlation.

9.Use a Parallel Coordinates visualization to compare the distributions of numerical variables between passengers who survived and those who did not survive.

a.That’s a cool chart, isn’t it?! Passengers traveling with siblings on the boat had a higher death rate, and passengers who paid a higher fare had a higher survival rate.

10.Use Stacked Bar Charts to compare passengers who survived to passengers who didn’t survive based on the other variables.

a.More women survived than men. 3rd-class tickets had a lower survival rate. Also, embarkation from the Southampton port had a lower survival rate.

11.Some of my questions have been answered by seeing the charts but in some ways, looking at this much data has created even more questions.

a.Now it’s time to reduce some of the features so we can concentrate on the things that matter! The features we will get rid of are: "PassengerId", "Name", "Ticket" and "Cabin". (ID doesn’t really give us any useful data; Ticket and Cabin have too many distinct values. Name might reflect that passengers are related, but we’re keeping the category about siblings for now.)

b.We can also fill in missing values. (Cabin has some missing values but we are dropping that feature.) Age has some missing values so I’ll fill them in with the average age. Embarked also has some missing values so I’ll fill those in with the most common value.

12.If you go back and look at the histogram of Fare, you’ll see that it is very skewed: many low-cost fares and not very many high-cost fares. Log transformation is a good method to use on highly skewed data.

13.Convert your categorical data into numbers (Sex, Pclass, Embarked)

14.Training - Split your data into two sets: Training and Testing.

15.Evaluation – Remember, we are trying to predict whether a passenger survived or not, so this is a classification problem. There are many algorithms that could be used but we’re going to use logistic regression.

a.Metrics for the evaluation:

i.Confusion Matrix (you should get about 84% accuracy - pretty good)

ii.Precision, Recall & F1 score (all 3 were very good)

iii.ROC curve (the dotted line represents random guessing, so anything above it is a good metric)
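
Steps 11 to 13 above can be sketched in pandas as follows. This is a minimal illustration, not the posted solution: the six-row DataFrame is a made-up stand-in for train.csv, and the choices of `np.log1p` for the log transformation (it stays defined if a fare is 0) and `pd.get_dummies` for the encoding are assumptions, since the assignment names neither function.

```python
import numpy as np
import pandas as pd

# Tiny hand-made stand-in for train.csv (illustrative values, not real rows).
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Survived": [0, 1, 1, 0, 0, 1],
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Name": ["A", "B", "C", "D", "E", "F"],
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Age": [22.0, 38.0, np.nan, 54.0, 2.0, np.nan],
    "SibSp": [1, 1, 0, 0, 3, 0],
    "Parch": [0, 0, 0, 0, 1, 0],
    "Ticket": ["T1", "T2", "T3", "T4", "T5", "T6"],
    "Fare": [7.25, 71.28, 7.92, 51.86, 21.08, 13.0],
    "Cabin": [np.nan, "C85", np.nan, "E46", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S", np.nan, "Q"],
})

# Step 11a: drop the features that will not be used.
df = df.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"])

# Step 11b: fill missing Age with the mean and missing Embarked with the mode.
df["Age"] = df["Age"].fillna(df["Age"].mean())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Step 12: log-transform the skewed Fare column.
# log1p(x) = log(1 + x) is an assumption made here so that a fare of 0 is handled.
df["Fare"] = np.log1p(df["Fare"])

# Step 13: convert the categorical columns into indicator (0/1) columns.
df = pd.get_dummies(df, columns=["Sex", "Pclass", "Embarked"])

print(df.shape)
print(df.columns.tolist())
```

One-hot encoding is used here so that no artificial ordering is imposed on Sex or Embarked; mapping each category to an integer would be another reasonable reading of step 13.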

Format: The completed task must be in a Jupyter Notebook with displayed results.
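
For step 15, the quantities behind the confusion matrix and the precision, recall and F1 scores can be worked through by hand. The sketch below uses ten made-up labels and predictions rather than output from the logistic regression model, and the helper name `confusion_counts` is hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 means 'survived'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # survived, predicted survived
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # died, predicted survived
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # survived, predicted died
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # died, predicted died
    return tp, fp, fn, tn


# Made-up test labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)

# Accuracy: share of all predictions that were correct.
accuracy = (tp + tn) / (tp + fp + fn + tn)
# Precision: of the passengers predicted to survive, how many actually did.
precision = tp / (tp + fp)
# Recall: of the passengers who actually survived, how many were found.
recall = tp / (tp + fn)
# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print("confusion matrix [[tn, fp], [fn, tp]]:", [[tn, fp], [fn, tp]])
print("accuracy:", accuracy)
print("precision:", precision, "recall:", recall, "f1:", round(f1, 3))
```

The toy data never produces a zero denominator; real code would need to guard against a model that predicts no positives.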

Answered Same Day Oct 02, 2021

Solution

Ximi answered on Oct 05 2021
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"_cell_guid": "e4a1cf77-b695-4ca0-9653-6c41ce2393d8",
"_uuid": "ca91678a9dc8cc103a7fbf799a5d903a011334ef"
},
"source": [
"## Some Background Information\n",
"\n",
"\n",
"**The sinking of the RMS Titanic in the early morning of 15 April 1912, four days into the ship's maiden voyage from Southampton to New York City, was one of the deadliest peacetime maritime disasters in history, killing more than 1,500 people. The largest passenger liner in service at the time, Titanic had an estimated 2,224 people on board when she struck an iceberg in the North Atlantic. The ship had received six warnings of sea ice but was travelling at near maximum speed when the lookouts sighted the iceberg. Unable to turn quickly enough, the ship suffered a glancing blow that buckled the starboard (right) side and opened five of sixteen compartments to the sea. The disaster caused widespread outrage over the lack of lifeboats, lax regulations, and the unequal treatment of the three passenger classes during the evacuation. Inquiries recommended sweeping changes to maritime regulations, leading to the International Convention for the Safety of Life at Sea (1914), which continues to govern maritime safety.** \n",
"*from Wikipedia*"
]
},
{
"cell_type": "markdown",
"metadata": {
"_cell_guid": "998b2a03-c60e-4fd6-9f69-784de6e6c9b8",
"_uuid": "d3086cb02907affe5a674b54e4baaedd632482c7"
},
"source": [
"**Imports**"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"_cell_guid": "872b97b2-56fe-4644-a11f-afb00f422169",
"_uuid": "efb595c75201cdb2a53388dc152a8e526e1b921a"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"gender_submission.csv\n",
"test.csv\n",
"train.csv\n",
"\n"
]
}
],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"%matplotlib inline\n",
"sns.set()\n",
"\n",
"import warnings\n",
"warnings.filterwarnings(\"ignore\", category=FutureWarning)\n",
"#warnings.filterwarnings(\"ignore\", category=DeprecationWarning)\n",
"#warnings.filterwarnings(\"ignore\")\n",
"\n",
"from subprocess import check_output\n",
"print(check_output([\"ls\", \"../input\"]).decode(\"utf8\"))\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"_uuid": "0da
93c5c6480e69cbc93616932445e614f506f"
},
"outputs": [
{
"data": {
"text/plain": [
"'0.9.0'"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sns.__version__"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"_cell_guid": "080fb327-390d-4124-b287-a561d050fe7e",
"_uuid": "0333d5086a63e3870708e7ba7a540d036c53544e"
},
"outputs": [],
"source": [
"df_train = pd.read_csv(\"../input/train.csv\")\n",
"df_test = pd.read_csv(\"../input/test.csv\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"_cell_guid": "6c7d2500-95b1-4057-98f7-39100e8a6d7f",
"_uuid": "13fd8422db7a1ceae9e
002df452e8293a9ab0c"
},
"source": [
"## Part 1: Exploratory Data Analysis"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"_cell_guid": "17a3c0a2-3aad-47f4-be6f-e8756bddf080",
"_uuid": "48a2091edbeacc9c23dad6bc0c64d0302d01b87b"
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"