EDA : Post 01 Univariate Analysis

4 min read · Nov 22, 2021

Exploratory Data Analysis

EDA means performing initial investigations on data
to discover patterns,
to spot anomalies,
to test hypotheses, and
to check assumptions.

Univariate Analysis

“Uni” means one and “variate” means variable, so univariate analysis is the analysis of one variable at a time.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# Read data
wine_data = pd.read_csv(r'https://raw.githubusercontent.com/shekhar270779/Learn_ML/main/datasets/winequality-red.csv')
wine_data.sample(5)

df = wine_data.copy()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB

# replace 'space' in column names with 'underscore'
df.columns = df.columns.str.replace(" ", "_")
df.columns

Index(['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar',
       'chlorides', 'free_sulfur_dioxide', 'total_sulfur_dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')

Measures of Central Tendency

Mean, Median, Mode
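
Before turning to the wine data, here is a minimal sketch of how the mean and the median react to an extreme value. The numbers are hypothetical (not taken from the dataset), and it uses Python's built-in `statistics` module instead of NumPy so that it runs on its own.

```python
from statistics import mean, median

# Hypothetical sample, not taken from the wine dataset
values = [7, 7, 8, 8, 8, 9, 9]
with_outlier = values + [30]      # add one unusually large value

mean_before, median_before = mean(values), median(values)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print(f"Without outlier: mean={mean_before:.2f}, median={median_before:.2f}")
print(f"With outlier:    mean={mean_after:.2f}, median={median_after:.2f}")
```

The single large value pulls the mean upwards while the median stays where it was. A mean that sits noticeably above the median is often a hint that the column has a long right tail.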

# Let's take the fixed_acidity column and do its univariate analysis

# Calculate the arithmetic mean (usually just called the mean)
print(f"Mean value of fixed_acidity: {np.mean(df.fixed_acidity):.2f}")

# Calculate the median
print(f"Median value of fixed_acidity: {np.median(df.fixed_acidity):.2f}")

Mean value of fixed_acidity: 8.32
Median value of fixed_acidity: 7.90

# The mode is the most frequently occurring value
# Let's find the mode of quality

print(df['quality'].value_counts())

5    681
6    638
7    199
4     53
8     18
3     10
Name: quality, dtype: int64

Quality 5 appears most often, so the mode of quality is 5.
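
Counting frequencies and picking the largest count can be sketched with the standard library as well. The scores below are hypothetical, chosen so that two values tie for the highest count, which shows that a column can have more than one mode.

```python
from collections import Counter
from statistics import multimode

# Hypothetical quality scores, not the real wine data
scores = [5, 6, 5, 7, 6, 5, 4, 6]

counts = Counter(scores)                          # value -> frequency, similar to value_counts()
top_value, top_count = counts.most_common(1)[0]   # one of the most frequent values
modes = multimode(scores)                         # every value tied for the highest count

print(counts)
print(f"Highest count: {top_count}, modes: {modes}")
```

When there is a tie, a function that returns a single mode reports only one of the tied values, so it is worth checking the counts themselves.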

from scipy import stats
stats.mode(df.quality)

ModeResult(mode=array([5], dtype=int64), count=array([681]))

Geometric mean

The geometric mean is useful when we have to average ratings given on different scales.
Suppose product A has ratings of 4.5/5 and 68/100 under two rating systems.
Similarly, product B has ratings of 3/5 and 75/100.
Now we need to compare the two products and find which has the better rating.

gm_A = stats.gmean([4.5, 68])
gm_B = stats.gmean([3, 75])

print(f"As per Geometric Mean,\nRating of A: {gm_A:.2f}\nRating of B: {gm_B:.2f}")

As per Geometric Mean,
Rating of A: 17.49
Rating of B: 15.00
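
To see why the geometric mean suits mixed scales, here is a small sketch that computes it from its definition (the n-th root of the product of n values) and then rescales the 5-point ratings to a 100-point scale. `gmean_manual` is a helper name made up for this illustration; it is not part of SciPy.

```python
import math
from statistics import geometric_mean

def gmean_manual(values):
    """n-th root of the product of n positive values (illustrative helper)."""
    return math.prod(values) ** (1 / len(values))

a = gmean_manual([4.5, 68])   # product A, ratings from the example
b = gmean_manual([3, 75])     # product B, ratings from the example

# Rescale the 5-point ratings to a 100-point scale (multiply by 20)
a_scaled = gmean_manual([4.5 * 20, 68])
b_scaled = gmean_manual([3 * 20, 75])

print(f"Original scales: A={a:.2f}, B={b:.2f}, A/B={a / b:.3f}")
print(f"Rescaled:        A={a_scaled:.2f}, B={b_scaled:.2f}, A/B={a_scaled / b_scaled:.3f}")
```

The individual values change after rescaling, but the ratio between A and B does not, so the comparison does not depend on which scale each rating system happens to use.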

Harmonic mean

We use Harmonic mean when we are dealing with rates.

Suppose we travel from point A to point B (a distance of 10 km) at 5 km/h and return from B to A at 30 km/h. What is the average speed for the round trip?

print(f"Harmonic Mean: {stats.hmean((5, 30)):.2f}km/h")

Harmonic Mean: 8.57km/h
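
As a cross-check, the same answer falls out of the definition of average speed, total distance divided by total time. The sketch below uses the standard library's `statistics.harmonic_mean` in place of SciPy; the variable names are only illustrative.

```python
from statistics import harmonic_mean

distance_km = 10          # one-way distance from the example
speeds_kmh = [5, 30]      # outbound and return speeds

total_distance = distance_km * len(speeds_kmh)
total_time = sum(distance_km / s for s in speeds_kmh)   # hours spent on both legs
avg_speed = total_distance / total_time

hm = harmonic_mean(speeds_kmh)
arithmetic = sum(speeds_kmh) / len(speeds_kmh)

print(f"Distance / time: {avg_speed:.2f} km/h")
print(f"Harmonic mean:   {hm:.2f} km/h")
print(f"Arithmetic mean: {arithmetic:.2f} km/h")
```

The arithmetic mean overstates the average speed because far more time is spent on the slow leg than on the fast one.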

Measures of Dispersion

Dispersion describes how spread out the data is.

Range: Max - Min

# Fixed_acidity column

print(f"Max: {df.fixed_acidity.max():.2f}\nMin: {df.fixed_acidity.min():.2f}\nRange: {df.fixed_acidity.max() - df.fixed_acidity.min():.2f}")

Max: 15.90
Min: 4.60
Range: 11.30

Inter Quartile Range (IQR)

# Q1: 25th percentile, Q2: 50th percentile (median), Q3: 75th percentile
# IQR = Q3 - Q1; it covers the middle 50% of the data

print(f"IQR of fixed_acidity: {stats.iqr(df.fixed_acidity):.2f}")

IQR of fixed_acidity: 2.10

# We can also find the 25th and 75th percentiles to calculate the IQR
# (NumPy 1.22+ renames the `interpolation` keyword to `method`)
q1 = np.percentile(df.fixed_acidity, 25, interpolation='midpoint')
q3 = np.percentile(df.fixed_acidity, 75, interpolation='midpoint')
round(q3 - q1, 2)

2.1
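
A short sketch of how the range and the IQR behave when a single extreme value is present. The data is hypothetical, and `value_range` and `iqr` are helper names invented for this example, built on the standard library's `statistics.quantiles` instead of NumPy or SciPy.

```python
from statistics import quantiles

def value_range(data):
    """Max minus min (illustrative helper)."""
    return max(data) - min(data)

def iqr(data):
    """Q3 minus Q1, with quartiles interpolated between data points."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    return q3 - q1

# Hypothetical samples: identical except for the largest value
clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 100]

print(f"Range: {value_range(clean)} -> {value_range(with_outlier)}")
print(f"IQR:   {iqr(clean)} -> {iqr(with_outlier)}")
```

The range jumps because it depends only on the two most extreme points, while the IQR is unchanged because it looks only at the middle half of the data.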

Variance

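It measures how far the data points lie from the mean, as the average of the squared deviations.

# method 1: SciPy (sample variance, divides by n - 1)
print(f"Variance: {stats.tvar(df.fixed_acidity):.2f}")

Variance: 3.03

# method 2: pandas (sample variance, ddof=1 by default)
print(f"Variance: {df.fixed_acidity.var():.2f}")

Variance: 3.03

# method 3: NumPy (population variance, ddof=0 by default; pass ddof=1 to match the other two)
print(f"Variance: {np.var(df.fixed_acidity):.2f}")

Variance: 3.03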
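
The three calls above do not all use the same divisor: `stats.tvar` and pandas' `var()` divide by n - 1 (sample variance), while `np.var` divides by n (population variance) unless `ddof=1` is passed. Here is a minimal sketch of both formulas on a small hypothetical list, using the standard library for the cross-check.

```python
from statistics import mean, pvariance, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values, not from the wine data

m = mean(data)
squared_devs = [(x - m) ** 2 for x in data]

population_var = sum(squared_devs) / len(data)        # divide by n
sample_var = sum(squared_devs) / (len(data) - 1)      # divide by n - 1

# Ratio between the two definitions for the 1599 rows reported by df.info()
ratio_1599 = 1599 / 1598

print(f"Population variance: {population_var:.4f}")
print(f"Sample variance:     {sample_var:.4f}")
print(f"n / (n - 1) for n = 1599: {ratio_1599:.6f}")
```

On a tiny sample the two definitions differ visibly. With 1599 rows the ratio is so close to 1 that both round to the same two decimals, which is why all three methods print 3.03.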

Standard Deviation

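It is the square root of the variance, i.e. the root-mean-square deviation from the mean.

# method 1: SciPy (sample standard deviation, divides by n - 1)
print(f"Standard deviation: {stats.tstd(df.fixed_acidity):.2f}")

Standard deviation: 1.74

# method 2: pandas (sample standard deviation, ddof=1 by default)
print(f"Standard deviation: {df.fixed_acidity.std():.2f}")

Standard deviation: 1.74

# method 3: NumPy (population standard deviation, ddof=0 by default)
print(f"Standard deviation: {np.std(df.fixed_acidity):.2f}")

Standard deviation: 1.74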
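
To make the link with variance concrete, this sketch takes the square root of both the population and the sample variance of a small hypothetical list and compares the results with the standard library's `pstdev` and `stdev`.

```python
import math
from statistics import pstdev, pvariance, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values, not from the wine data

population_sd = math.sqrt(pvariance(data))   # square root of the divide-by-n variance
sample_sd = math.sqrt(variance(data))        # square root of the divide-by-(n - 1) variance

print(f"Population SD: {population_sd:.4f} (pstdev: {pstdev(data):.4f})")
print(f"Sample SD:     {sample_sd:.4f} (stdev:  {stdev(data):.4f})")
```

Unlike the variance, the standard deviation is expressed in the same units as the data, which makes it easier to read next to the mean.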