Titanicdeath Writeup

titanicdeath.com

This is the writeup for titanicdeath.com. Besides having a truly fantastic name, this little website I created tells you whether you would have been likely to die on the Titanic, given a small amount of personal information.

If you went to the site and took the quiz (a few questions), the first thing you are probably asking is: why did I die? Or why did I live?

The biggest determining factor in whether you lived or died is your gender. Women were much more likely than men to survive the sinking of the Titanic. If you have ever heard the phrase ‘women and children first’ you will have an intuitive understanding of why this is.

The next biggest determinant of survival is the class you travelled in: passengers in first class were more likely to survive. This makes some sense, since first class passengers often receive extra benefits. One of the benefits on a sinking ship may have been easier access to lifeboats. Third class passengers were below deck and may not have had the means to escape.

Age is another important factor: the younger you are, the more likely it is that you survived. It seems young people and children had some level of priority on lifeboats.

Whether you were travelling alone or with others on the Titanic helps to determine whether you passed away tragically or miraculously survived. This is why titanicdeath.com asks if you are married or have siblings. Travellers who were alone on the Titanic, or who were travelling with large numbers of relatives, were more likely to have died. Travellers who were alone may have had to wait, or may have had lower priority (compared with women and children) on lifeboats. Many large families were in third class, where death rates were high.

I want to try to survive the Titanic again.

Let us start by loading the dataset:

import pandas as pd
data_df = pd.read_csv('../titanicdeath/static/train.csv')

The ‘sex’ input has the biggest effect on survival.

You can see below that the survival rate for women is about 74%, but for men it is only about 19%.

data_df[['Sex', 'Survived']].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived')
      Sex  Survived
1    male  0.188908
0  female  0.742038
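
Why does taking the mean give a survival rate? ‘Survived’ is coded as 0 or 1, so the mean within a group is just survivors divided by group size. Here is a minimal sketch on a made-up six-row table (the rows and the name toy_df are invented for illustration, they are not from train.csv):

```python
import pandas as pd

# Toy rows invented for illustration; NOT taken from train.csv.
toy_df = pd.DataFrame({
    'Sex': ['male', 'male', 'male', 'male', 'female', 'female'],
    'Survived': [0, 0, 0, 1, 1, 0],
})

# For a 0/1 column, mean = sum (survivors) / count (group size).
rates = toy_df.groupby('Sex')['Survived'].agg(['mean', 'sum', 'count'])
print(rates)
```

Printing the count next to the mean is also a cheap way to see how many passengers each rate is based on.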

The ‘class’ input has the next biggest effect on survival.

In our data the class of a passenger is stored in the ‘Pclass’ column. You can see below that survival rates are highest for first class (63%), followed by second class (47%) and third class (24%).

data_df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)
   Pclass  Survived
0       1  0.629630
1       2  0.472826
2       3  0.242363

The age input affects your chances of survival as well.

People aged 0 to 16 have the highest chance of survival (55%). Everyone older than that has a lower chance of surviving the sinking of the Titanic, and the oldest group (64 to 80) has the lowest at about 9%.

data_df['age_group'] = pd.cut(data_df['Age'], 5)
data_df[['age_group', 'Survived']].groupby(['age_group'], as_index=False).mean().sort_values(by='age_group', ascending=True)
          age_group  Survived
0    (0.34, 16.336]  0.550000
1  (16.336, 32.252]  0.369942
2  (32.252, 48.168]  0.404255
3  (48.168, 64.084]  0.434783
4      (64.084, 80]  0.090909
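
If you are wondering where bin edges like 16.336 come from: when pd.cut is given an integer, it splits the range between the smallest and largest value into that many equal-width bins, and nudges the lowest edge down a little so the minimum value is included. The sketch below uses made-up ages. It assumes only that the real ages run from 0.42 to 80, which is what the edges in the table above imply.

```python
import numpy as np
import pandas as pd

# Hypothetical ages; only the min (0.42) and max (80) determine the bin edges.
ages = pd.Series([0.42, 5, 22, 35, 47, 61, 80])

# retbins=True also returns the numeric bin edges pandas computed.
age_group, edges = pd.cut(ages, 5, retbins=True)

# Each of the 5 bins spans (max - min) / 5 years.
width = (ages.max() - ages.min()) / 5
print(width)
print(edges)
print(age_group.value_counts(sort=False))
```

Because the bins are equal in width rather than equal in headcount, the oldest bin can end up with very few passengers, so its survival rate is based on a small sample.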

Whether you are travelling alone or with family can affect your chances of survival.

In our data the number of people you are travelling with is represented by the ‘companions’ column, which is the sum of the ‘SibSp’ and ‘Parch’ columns. If you travelled with 1, 2, or 3 companions your chances of survival were decent (roughly 55-72%). If you travelled alone your chances of survival went down significantly, to 30%. If you travelled with 4 or more companions your chances were about as bad or worse (between 0% and 33%).

data_df['companions'] = data_df['SibSp'] + data_df['Parch']
data_df[['companions', 'Survived']].groupby(['companions'], as_index=False).mean().sort_values(by='Survived', ascending=False)
   companions  Survived
3           3  0.724138
2           2  0.578431
1           1  0.552795
6           6  0.333333
0           0  0.303538
4           4  0.200000
5           5  0.136364
7           7  0.000000
8          10  0.000000
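
The model further down takes an ‘is_alone’ input. One plausible way to get from ‘companions’ to a flag like that is sketched below. This is an assumption for illustration: the exact feature code lives in the notebook and is not shown in this writeup, and the rows here are made up.

```python
import pandas as pd

# Toy rows invented for illustration; NOT taken from train.csv.
toy_df = pd.DataFrame({'SibSp': [0, 1, 0, 3], 'Parch': [0, 0, 2, 2]})

# Same sum as above: siblings/spouses plus parents/children.
toy_df['companions'] = toy_df['SibSp'] + toy_df['Parch']

# Assumed definition: alone means zero companions (1 = alone, 0 = not alone).
toy_df['is_alone'] = (toy_df['companions'] == 0).astype(int)
print(toy_df)
```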

In the next section we will load a model (logistic regression) that I trained to take an individual passenger’s information and return the probability of survival.

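# sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone joblib package
import joblib
logreg = joblib.load('../titanicdeath/static/logreg.pkl')
# Feature values held fixed across the examples below
age = 2
fare = 0
embarkation = 2
title = 1
is_alone = 1
age_class = 6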

This next bit of code takes a passenger’s input values and returns the probabilities of dying and of surviving; pred[0][1] is the probability of survival. It is 7.8% here for a man (sex=0) in third class (passenger_class=3).

passenger_class = 3 
sex = 0 

passenger_input = pd.DataFrame([[passenger_class, sex, age, fare, embarkation, title, is_alone, age_class]])
pred = logreg.predict_proba(passenger_input)
pred[0][1]
0.078280156772079126
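
What predict_proba does under the hood for a two-class logistic regression: it computes a linear score (intercept plus coefficient times feature, summed) and squashes it through the sigmoid function to get a number between 0 and 1. Here is a minimal sketch that checks this by hand. The model is a stand-in fitted on eight made-up rows with only two features, not the logreg.pkl model used above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data invented for illustration: columns are [passenger_class, sex].
X = np.array([[3, 0], [3, 0], [2, 0], [1, 0], [3, 1], [2, 1], [1, 1], [1, 1]])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])

toy_model = LogisticRegression().fit(X, y)

passenger = np.array([[3, 0]])  # third class, male

# Linear score: intercept + sum(coefficient * feature).
score = toy_model.intercept_[0] + passenger[0] @ toy_model.coef_[0]

# The sigmoid turns the score into a probability between 0 and 1.
manual_prob = 1.0 / (1.0 + np.exp(-score))
sklearn_prob = toy_model.predict_proba(passenger)[0][1]
print(manual_prob, sklearn_prob)
```

The two printed numbers should agree, which is a handy sanity check that a feature with a positive coefficient pushes the survival probability up.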

If we change the sex from male to female (sex=1) we see a huge improvement in our chances of survival (now 43%).

passenger_class = 3 
sex = 1

passenger_input = pd.DataFrame([[passenger_class, sex, age, fare, embarkation, title, is_alone, age_class]])
pred = logreg.predict_proba(passenger_input)
pred[0][1]
0.43427750040904095

If we change the class from third class to first class (passenger_class=1) we improve our odds of survival even further, to 77%.

passenger_class = 1 
sex = 1

passenger_input = pd.DataFrame([[passenger_class, sex, age, fare, embarkation, title, is_alone, age_class]])
pred = logreg.predict_proba(passenger_input)
pred[0][1]
0.77444683956853377
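
The three examples above repeat the same lines with one or two values changed. If you want to try more combinations, a small helper keeps things tidy. This is a sketch: survival_probability, FEATURES, and stand_in are hypothetical names, the feature order is copied from the DataFrame rows above, and stand_in is a model fitted on random noise so the example runs on its own. Its numbers mean nothing; swap in the loaded logreg to get real predictions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Column order copied from the passenger_input rows above.
FEATURES = ['passenger_class', 'sex', 'age', 'fare',
            'embarkation', 'title', 'is_alone', 'age_class']

def survival_probability(model, **passenger):
    """Return P(survived) for one passenger given as keyword arguments."""
    row = pd.DataFrame([[passenger[name] for name in FEATURES]])
    return model.predict_proba(row)[0][1]

# Stand-in model fitted on random noise (NOT the real logreg.pkl).
rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(40, len(FEATURES)))
y = np.array([0, 1] * 20)
stand_in = LogisticRegression().fit(X, y)

base = dict(passenger_class=3, sex=0, age=2, fare=0,
            embarkation=2, title=1, is_alone=1, age_class=6)

# Same three scenarios as above: baseline, female, female in first class.
for changes in ({}, {'sex': 1}, {'sex': 1, 'passenger_class': 1}):
    p = survival_probability(stand_in, **{**base, **changes})
    print(changes, round(p, 3))
```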

You can check out some other stuff I have done here:

http://yvanscher.com/

original notebook code and website code

This was intended to be a simple exploration of the Titanic dataset. If you would like to explore this data yourself, the dataset and a really nice tutorial are here:

https://www.kaggle.com/c/titanic

https://www.kaggle.com/startupsci/titanic/titanic-data-science-solutions

If you are interested in machine learning and want a very good overview of how an algorithm like logistic regression works, I highly recommend you check out the first 3 lectures of Andrew Ng’s Coursera course:

https://www.coursera.org/learn/machine-learning