Data Analysis from Scratch

In this post, we will explore a data set about athletes and analyze the results.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re

Reading the data set

athlete = pd.read_csv('/content/drive/MyDrive/ShapeAI DST 11021 Oct-Jan Batch 2021-22/Datasets/athlete_events.csv')

Creating a copy of the DataFrame

ath = athlete.copy()

Data Exploration

These return the first and last 5 rows respectively (5 is the default).

ath.head()
ath.tail()

It will return the number of rows and columns.

ath.shape

This prints a summary of the DataFrame: each column's data type (float, int, object) and its count of non-null values.

ath.info()

This will return the total number of null values in each column.

ath.isna().sum()

This will print the ratio of null to non-null values for the given column.

x=ath["Medal"].isna().sum()
y=ath["Medal"].notnull().sum()
print(x,":",y)
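The same comparison can also be expressed as a fraction. Here is a minimal sketch on a made-up Medal column (the values are invented for illustration, not taken from the data set); it relies on the fact that the mean of a boolean mask is the share of True values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Medal column (made-up values).
medal = pd.Series(["Gold", np.nan, np.nan, "Bronze", np.nan], name="Medal")

null_count = medal.isna().sum()       # number of missing values
non_null_count = medal.notna().sum()  # number of present values
null_share = medal.isna().mean()      # fraction of values that are missing

print(null_count, ":", non_null_count)
print(f"null share = {null_share:.0%}")
```

On the real data you would replace medal with ath["Medal"].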

This will return all the column names.

ath.columns

Getting max or min of a column

ath.Age.min()
ath.Age.max()

This will give summary statistics for the column: count, mean, standard deviation, min, max, and the 25th, 50th, and 75th percentiles.

ath.Age.describe()

Fill all the NaN values with the mean value of the Age column

ath["Age"] = ath["Age"].fillna(ath["Age"].mean())

Changing the age data type from float to int

ath.Age = ath.Age.astype('int')
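To see what the fill-and-cast steps do, here is a small sketch on a toy Age column (the numbers are made up). Note that astype('int') drops the fractional part rather than rounding, so if the mean is not a whole number you may want to round it first.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing age (made-up values).
toy = pd.DataFrame({"Age": [24.0, np.nan, 30.0, 21.0]})

# mean() skips NaN, so this is (24 + 30 + 21) / 3
mean_age = toy["Age"].mean()

# Assign the result back instead of using inplace=True on a single column.
toy["Age"] = toy["Age"].fillna(mean_age)

# The cast to int only works once no NaN values remain.
toy["Age"] = toy["Age"].astype("int")

print(toy)
```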

This will return the rows where the Region column is empty (NaN).

ath[ath['Region'].isna()]

This will return the unique values in a column.

ath.Medal.unique()

This will return the number of unique values (NaN is not counted by default).

ath.Medal.nunique()

This will count how many times each value occurs in a column.

ath.Medal.value_counts()

Replacing categories with integer numbers.

ath['Medal'] = ath['Medal'].replace([np.nan, 'Gold', 'Silver', 'Bronze'], [0, 1, 2, 3])
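An alternative way to encode categories is a dictionary with map. This is only a sketch on a toy Medal column (made-up values); the codes dictionary is an assumption that mirrors the 0/1/2/3 scheme used above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Medal column (made-up values).
medal = pd.Series(["Gold", np.nan, "Silver", "Bronze", np.nan])

# Hypothetical mapping that mirrors the scheme above.
codes = {"Gold": 1, "Silver": 2, "Bronze": 3}

# map() leaves NaN (and anything not in the dict) as NaN,
# so fill those with 0 before casting to int.
medal_code = medal.map(codes).fillna(0).astype("int")

print(medal_code.tolist())
```

One thing to keep in mind with map: any value missing from the dictionary silently becomes NaN (and here, 0), so it is worth checking unique() first.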

Printing all the values of one column (Team) for the rows where another column (Region) is missing.

for i in ath[ath['Region'].isna()]['Team']:
  print(i)

Dropping unwanted or irrelevant columns.

ath.drop(['ID', 'Region', 'Games'], axis=1, inplace=True)

Analyzing just particular columns.

ath[['Sport', 'Event']]

This will return the number of duplicate rows.

ath.duplicated().sum()

This will return the duplicate rows themselves.

ath[ath.duplicated()]
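By default duplicated() marks only the second and later copies of a row, not the first one. A small sketch on a toy frame (made-up names) shows the difference between the default and keep=False, which marks every copy.

```python
import pandas as pd

# Toy frame where the row ("A", "Judo") appears three times (made-up values).
toy = pd.DataFrame({
    "Name": ["A", "B", "A", "C", "A"],
    "Sport": ["Judo", "Golf", "Judo", "Judo", "Judo"],
})

# Default keep="first": the first occurrence is not counted as a duplicate.
n_dupes = toy.duplicated().sum()

# keep=False marks all copies, which helps when inspecting them side by side.
all_copies = toy[toy.duplicated(keep=False)]

print(n_dupes)
print(all_copies)
```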

This will select the rows that match a given condition, here a specific athlete name.

ath.loc[ath['Name']=='Dsir Antoine Acket']

This will drop duplicates.

ath.drop_duplicates(inplace=True)

When inplace=True is used, the operation is performed directly on the DataFrame ath, modifying it in place. This means that the original DataFrame is updated and no new DataFrame is returned. Without inplace=True, the operation would return a new DataFrame and the original would remain unchanged.
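The difference is easy to check on a toy frame (made-up values). This sketch calls drop_duplicates both ways and looks at what each call returns.

```python
import pandas as pd

# Toy frame with one duplicated row (made-up values).
toy = pd.DataFrame({"Name": ["A", "A", "B"]})

# Without inplace: a new DataFrame comes back, toy is untouched.
deduped = toy.drop_duplicates()
rows_before = len(toy)

# With inplace=True: toy itself is modified and the call returns None.
result = toy.drop_duplicates(inplace=True)
rows_after = len(toy)

print(len(deduped), rows_before, result, rows_after)
```

Because the in-place call returns None, writing toy = toy.drop_duplicates(inplace=True) would replace the DataFrame with None, which is a common mistake.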

Creating new index.

ath.reset_index(inplace=True)

By default, reset_index keeps the old index as a new column named 'index', so we drop that column.

ath.drop('index', axis=1, inplace=True)

drop('index', axis=1): This removes the column named 'index'. The axis=1 specifies that you’re dropping a column (not a row; for rows, you would use axis=0).
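The two steps above can also be done in one call with reset_index(drop=True), which discards the old index instead of keeping it as a column. A minimal sketch on a toy frame (made-up values):

```python
import pandas as pd

# After dropping duplicates, the index has a gap: 0, 2, 3 (made-up values).
toy = pd.DataFrame({"Name": ["A", "A", "B", "C"]}).drop_duplicates()
index_before = toy.index.tolist()

# drop=True throws away the old index, so no 'index' column is created.
toy = toy.reset_index(drop=True)

print(index_before, "->", toy.index.tolist())
print(toy.columns.tolist())
```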

Data Exportation

# Export to JSON
ath.to_json('olympics_dataset.json')

# Export to Excel
ath.to_excel('olympics_dataset.xlsx')

# Export to CSV
ath.to_csv('olympics_dataset.csv')
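By default to_csv also writes the row index as an extra, unnamed column. If you do not want that, pass index=False. The sketch below uses a toy frame and a temporary directory (both are assumptions for illustration) and reads the file back to check the columns.

```python
import os
import tempfile

import pandas as pd

# Toy frame (made-up values).
toy = pd.DataFrame({"Name": ["A", "B"], "Age": [24, 30]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "toy.csv")

    # index=False leaves the row index out of the file.
    toy.to_csv(csv_path, index=False)

    # Read it back to confirm that only the real columns were written.
    round_trip = pd.read_csv(csv_path)

print(round_trip.columns.tolist())
```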

Data Analysis

1. Show the relationship between height and weight in a graph.
plt.scatter(ath.Height, ath.Weight)
plt.xlabel('Height')
plt.ylabel('Weight')
plt.title('Height Vs Weight')

2. Name all the athletes who have participated in the sport ‘Judo’.

ath.loc[ath.Sport == 'Judo', ['Name', 'Sport']]
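An athlete who competed in several events or Games appears in more than one row, so the result can repeat names. If you only want each name once, a sketch like the following may help (toy data with made-up names):

```python
import pandas as pd

# Toy frame where athlete "A" has two Judo rows (made-up values).
toy = pd.DataFrame({
    "Name": ["A", "B", "A", "C"],
    "Sport": ["Judo", "Golf", "Judo", "Judo"],
})

# Select the Name column for Judo rows, then keep each name once.
judo_names = toy.loc[toy["Sport"] == "Judo", "Name"].unique()

print(list(judo_names))
```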