Exploratory Data Analysis On Titanic Dataset

Suryansh Shrivastava
5 min read · Jun 20, 2021

The Titanic dataset contains information about some of the passengers who were aboard the RMS Titanic at the time of its accident on 14 April 1912 in the North Atlantic Ocean, which led to the deaths of roughly 1,500 people. The dataset also records the survival status of each individual passenger.

Getting started

1. Importing Necessary Modules/Libraries

It is good practice to keep all the import statements together at the start of the notebook or program.
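The original import cell is not reproduced in the text, so the following is only a minimal sketch of what such a cell might look like. It assumes the analysis needs NumPy and pandas; any plotting libraries would be imported in the same place.

```python
# All imports live in one cell at the top of the notebook.
# Assumption: the analysis only needs NumPy and pandas.
import numpy as np
import pandas as pd

print("NumPy", np.__version__, "| pandas", pd.__version__)
```

Keeping the imports in one place makes it easy to see at a glance what the notebook depends on.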

2. Importing The Dataset

There are multiple ways to import a dataset into a Jupyter Notebook. For example, we can download the dataset in CSV format from the internet and read it with pandas (a Python library for data manipulation and analysis), using its "read_csv" function to read a file in CSV format.
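As a rough sketch of the read_csv step: the example below reads a tiny in-memory CSV so that it runs without a download. The three rows are made up for illustration (they are not real passenger records), and the filename mentioned in the comment is hypothetical.

```python
import io

import pandas as pd

# Made-up stand-in for the downloaded CSV file (NOT real passenger records).
csv_text = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Doe, Mr. John",male,22,1,0,T-0001,7.25,,S
2,1,1,"Roe, Mrs. Jane",female,38,1,0,T-0002,71.28,C85,C
3,1,3,"Poe, Miss. Ann",female,,0,0,T-0003,7.92,,S
"""

# read_csv accepts a file path, a URL or a file-like object.
# With a real download this would be something like
# pd.read_csv("train.csv") (hypothetical filename).
df = pd.read_csv(io.StringIO(csv_text))

# Empty fields in the CSV become NaN in the DataFrame.
print(df.head())
```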

3. Exploring The Dataset

We can see that our dataset has 891 rows and 12 columns.
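The row and column counts come from the DataFrame's shape attribute. A minimal sketch on a made-up toy frame (on the full dataset described here, df.shape would be (891, 12)):

```python
import pandas as pd

# Made-up toy frame, not the real data.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
})

n_rows, n_cols = df.shape  # shape is a (rows, columns) tuple
print(f"{n_rows} rows and {n_cols} columns")
print(df.dtypes)           # data type of each column
```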

Details of the columns:

PassengerId - Id of each passenger.

Survived - Survival (0 = No, 1 = Yes)

Pclass - Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)

Name - Full name and title of each passenger.

Sex - Sex (male or female)

Age - Age in years

SibSp - Number of siblings / spouses aboard the Titanic

Parch - Number of parents / children aboard the Titanic

Ticket - Ticket number

Fare - Passenger fare

Cabin - Cabin number

Embarked - Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Only 22.9% of the rows in the dataset contain an entry for Cabin, Age has a fair amount of missing data, and Embarked has 2 missing entries.
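Figures like these can be obtained by counting missing values per column. The sketch below uses a made-up toy frame, so its numbers differ from the ones quoted above.

```python
import numpy as np
import pandas as pd

# Made-up toy frame with some missing entries, not the real data.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S", "Q"],
})

missing_counts = df.isna().sum()      # number of missing entries per column
filled_pct = df.notna().mean() * 100  # % of rows that contain an entry

print(missing_counts)
print(filled_pct.round(1))
```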

Analysing the Data

We will now look at how the features contribute towards survival and make some assumptions based on our knowledge of the problem. I’m going to go over each column and share my thoughts on what we know so far.

PassengerId - Will not play a part in who survives, so it can be excluded.

Survived - This is our target, so we want to know how the other features affect this outcome.

Pclass - A categorical variable that is likely to play a part in who survives.

Name - Unlikely to play a part in who survives, although the title it contains may.

Sex - A categorical variable that is likely to play a part in who survives.

Age - A numerical variable that is likely to play a part in who survives. We should fill in the missing values.

SibSp - A numerical variable that could play a small part in who survives.

Parch - A numerical variable that could play a small part in who survives.

Ticket - A feature that is unlikely to play a part in who survives, but it may contain some useful information.

Fare - A numerical variable that could play a part in who survives.

Cabin - Has a lot of missing values, but we might be able to extract some information from it.

Embarked - A categorical feature that could play a small part in who survives. We should fill in the missing values.
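Two of the ideas in this list, pulling the title out of Name and filling in the blank rows, can be sketched as follows. The toy rows are made up, and the fill strategy (median for Age, most frequent value for Embarked) is an assumption of mine rather than something stated here; other strategies are equally valid.

```python
import numpy as np
import pandas as pd

# Made-up toy frame, not the real data.
df = pd.DataFrame({
    "Name": ["Doe, Mr. John", "Roe, Mrs. Jane", "Poe, Miss. Ann", "Coe, Master. Tim"],
    "Age": [22.0, np.nan, 26.0, 4.0],
    "Embarked": ["S", "C", np.nan, "S"],
})

# The title is the text between the comma and the first full stop in Name.
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Assumed fill strategy: median for numerical Age, mode for categorical Embarked.
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

print(df)
```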

Based on our background intuition, Sex, Age and Pclass are likely to be the main features that contribute towards survival. Let's start there.

The target outcomes are only mildly imbalanced, with an overall survival rate of 38.4%, so no special evaluation metrics are needed.
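Because Survived is coded as 0 and 1, the survival rate is simply the mean of the column. A minimal sketch on made-up values:

```python
import pandas as pd

# Made-up 0/1 outcomes, not the real data.
survived = pd.Series([0, 1, 0, 0, 1, 0, 1, 0], name="Survived")

survival_rate = survived.mean()                       # mean of a 0/1 column = share of 1s
class_shares = survived.value_counts(normalize=True)  # share of each outcome

print(f"Survival rate: {survival_rate:.1%}")
print(class_shares)
```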

There are no clearly (multi)collinear features in the data.
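One common way to check for collinearity is to inspect the pairwise correlation matrix of the numerical columns. The sketch below does this on a made-up toy frame; the 0.9 threshold is a hypothetical cut-off chosen for illustration.

```python
import pandas as pd

# Made-up toy frame, not the real data.
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 2, 3],
    "Age": [22, 38, 26, 35, 54, 2],
    "SibSp": [1, 1, 0, 1, 0, 3],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.07],
    "Sex": ["male", "female", "female", "female", "male", "male"],
})

# Pearson correlation between every pair of numerical columns.
corr = df.select_dtypes(include="number").corr()

# Collect feature pairs whose absolute correlation exceeds the threshold.
threshold = 0.9  # hypothetical cut-off
high = [
    (a, b, round(corr.loc[a, b], 2))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]

print(corr.round(2))
print("Highly correlated pairs:", high)
```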

Numerical features that are not banded should be normalised, as they are significantly skewed.
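As an illustration of the two options, the sketch below measures the skewness of a made-up Fare column, then either log-transforms it or bands it into quantile bins. The log transform is just one possible choice of normalisation and is my assumption, as are the band labels.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed fares, not the real data.
fare = pd.Series([7.25, 7.92, 8.05, 13.0, 26.0, 53.1, 71.28, 263.0], name="Fare")

print("Skewness before:", round(fare.skew(), 2))

# Option 1: log-transform to compress the long right tail.
fare_log = np.log1p(fare)
print("Skewness after log1p:", round(fare_log.skew(), 2))

# Option 2: band the values into four quantile-based bins of equal size.
fare_band = pd.qcut(fare, q=4, labels=["low", "mid", "high", "very high"])
print(fare_band.value_counts().sort_index())
```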

A woman was almost 4 times as likely to survive as a man (roughly 74% of women survived versus 19% of men), so there is a clear relationship between sex and survival.
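A comparison like this can be computed by grouping on Sex and averaging Survived. A minimal sketch on made-up rows, so the ratio differs from the real one:

```python
import pandas as pd

# Made-up toy frame, not the real data.
df = pd.DataFrame({
    "Sex": ["female"] * 4 + ["male"] * 5,
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0, 0],
})

rate_by_sex = df.groupby("Sex")["Survived"].mean()  # survival rate per sex
ratio = rate_by_sex["female"] / rate_by_sex["male"]

print(rate_by_sex)
print(f"Women survived {ratio:.2f}x as often as men in this toy sample")
```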

The overwhelming majority of women in classes 1 and 2 survived, while about half did in class 3. Across all 3 classes, men survived less often than women, especially in class 3, where most did not survive.
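A breakdown by class and sex can be built with a pivot table. The sketch below uses made-up rows, so the rates are illustrative only.

```python
import pandas as pd

# Made-up toy frame, not the real data.
df = pd.DataFrame({
    "Pclass": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "Sex": ["female", "female", "male", "male"] * 3,
    "Survived": [1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0],
})

# Rows: ticket class, columns: sex, cells: survival rate.
table = df.pivot_table(index="Pclass", columns="Sex", values="Survived", aggfunc="mean")
print(table)
```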

The relationship between age and survival is less clear. We can infer that, in general, very elderly passengers did not survive and younger passengers did.
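One way to make the age pattern easier to read is to band Age and compare survival rates per band. The sketch below uses made-up rows; the bin edges and labels are hypothetical choices of mine.

```python
import pandas as pd

# Made-up toy frame, not the real data.
df = pd.DataFrame({
    "Age": [2, 8, 15, 24, 31, 40, 47, 58, 66, 74],
    "Survived": [1, 1, 0, 1, 0, 1, 0, 0, 0, 0],
})

# Hypothetical age bands; each bin includes its right edge.
bins = [0, 12, 18, 40, 60, 80]
labels = ["child", "teen", "adult", "middle-aged", "senior"]
df["AgeBand"] = pd.cut(df["Age"], bins=bins, labels=labels)

rate_by_band = df.groupby("AgeBand", observed=False)["Survived"].mean()
print(rate_by_band)
```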

Typically, those who paid a higher Fare for their ticket were also in a higher class, so their chance of survival was greater than that of the others.
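The link between Fare, Pclass and survival can be checked by summarising both per class. A minimal sketch on made-up rows; the output column names are my own.

```python
import pandas as pd

# Made-up toy frame, not the real data.
df = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "Fare": [80.0, 60.0, 52.0, 26.0, 13.0, 21.0, 7.25, 8.05, 7.9],
    "Survived": [1, 1, 0, 1, 0, 1, 0, 0, 1],
})

# Median fare and survival rate for each ticket class.
summary = df.groupby("Pclass").agg(
    median_fare=("Fare", "median"),
    survival_rate=("Survived", "mean"),
)
print(summary)
```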

From the information above, it becomes clear that some of the features are very important for training a model to predict the outcome, i.e. whether a person survived or not. Other features, such as PassengerId, have little or no influence on the outcome and can be dropped in our case.
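Dropping an uninformative column and separating the target from the features might look like the sketch below. The toy rows are made up, and the names X and y are just the usual convention, not something taken from the original notebook.

```python
import pandas as pd

# Made-up toy frame, not the real data.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
})

df = df.drop(columns=["PassengerId"])  # identifier carries no predictive signal
X = df.drop(columns=["Survived"])      # features
y = df["Survived"]                     # target

print(X.columns.tolist())
```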

Conclusion

We can conclude that data analysis is an important step in the ML/DL domain: it helps us understand the features and their importance to the outcome that we want to predict with our model.

Suryansh Shrivastava

A tech enthusiast pursuing B.Tech from BMSIT, Bangalore. A keen learner of new technologies and exploring use cases to create an impact in the society.