Dealing with Data Types in Pandas
Data Types is one of the first things we should look at after loading the data.
Data type is essentially an internal construct that a programming language uses to understand how to store and manipulate the data. Most of the times Pandas will correctly infer data types, however as per the best practice we should check data types and ensure these are of correct types otherwise we may get unexpected results or errors.
In this article we will take a small dataset of sales to understand the different datatypes of Pandas and we will make datatype conversion where necessary.
Above is the data we have loaded, at first look, data looks reasonably ‘ok’. Lets quickly check datatypes
df.dtypesID float64
SalesPerson object
H1_TotalSales object
H2_Sales object
PercentGrowth object
Total_Units_sold object
Join_Day int64
Join_Month int64
Join_Year int64
Met_Target object
Pandas uses ‘object’ dtype for storing strings. However is dtype is also assigned as ‘object’ when data is of mixed type present in column.
By default integer types are int64 and float type are float64.
For date time related there is a datetime64 data type.
Upcasting:
Types can potentially be upcasted when combined with other types, meaning they are promoted from for e.g. int
to float
Categoricals are a pandas data type corresponding to categorical variables in statistics. A categorical variable takes on a limited, and usually fixed, number of possible values. Examples are gender, social class, rating via Likert scales.
Lets start analyzing each column.
ID : is coming as float64. Since ID is a simple numeric so we can convert it into integer from float.
astype():
The most straight forward way to do data type conversion is using astype()
# to convert ID column from float to intdf['ID'] = df['ID'].astype('int')df['ID']
0 10001
1 10002
2 10003
3 10004
4 10005
As we notice decimal part from ID field is gone now and field is converted to integer type.
Lets apply the same function to convert H1_Sales and H2_Sales from Object to float
# to convert ID column from float to int
df['H1_Sales'].astype('float')We will get
ValueError: could not convert string to float: '$600'
So, as we see it is not very straight forward for astype() to work always.
Custom Function for data type conversion
We can write out own custom function to do data type conversion.
def amount_str_to_float(amt):
'''
Convert amt from string/object type to float
by removing '$', ','
'''
amt_cleaned = amt.replace(',','').replace('$', '')
return float(amt_cleaned)df['H1_Sales'] = df['H1_Sales'].apply(amount_str_to_float)
df['H2_Sales'] = df['H2_Sales'].apply(amount_str_to_float)df.dtypesID int32
SalesPerson object
H1_Sales float64
H2_Sales float64
PercentGrowth object
Total_Units_sold object
Join_Day int64
Join_Month int64
Join_Year int64
Met_Target object
dtype: object
So far we have fixed 3 columns, Lets fix PercentGrowth column.
df['PercentGrowth'] = df['PercentGrowth'].apply(lambda x : float(x.replace('%','')))df.dtypesID int32
SalesPerson object
H1_Sales float64
H2_Sales float64
PercentGrowth float64
Total_Units_sold object
Join_Day int64
Join_Month int64
Join_Year int64
Met_Target object
dtype: object
Lets pick next column ‘Total_Units_Sold’
Since this field has one value as ‘Unknown’ , so we have to fix the data itself first, before fixing data type.
# Correcting data valuedf.loc[df['ID']==10005, 'Total_Units_sold'] = 14# Coverting data type to intdf['Total_Units_sold'] = df['Total_Units_sold'].astype('int')df.dtypes
ID int32
SalesPerson object
H1_Sales float64
H2_Sales float64
PercentGrowth float64
Total_Units_sold int32
Join_Day int64
Join_Month int64
Join_Year int64
Met_Target object
dtype: object
Lets next take ‘Met_Target’ column, it has values as ’N’ and ‘Y’.
Lets convert it to boolean , first fix data , Y as True and N and False.
df['Met_Target'] = np.where(df['Met_Target']=='Y', True, False)df['Met_Target'] = df['Met_Target'].astype('bool')df.dtypesID int32
SalesPerson object
H1_Sales float64
H2_Sales float64
PercentGrowth float64
Total_Units_sold int32
Join_Day int64
Join_Month int64
Join_Year int64
Met_Target bool
Now finally lets deal with ‘Join’ columns, what we will do is combine the columns to create a new column as ‘Join_date’ then convert it into datetime
join_cols = ['Join_Day', 'Join_Month', 'Join_Year']df['Join_date'] = pd.to_datetime(df[join_cols].apply(lambda row : '/'.join(row.values.astype(str)), axis=1))
df.dtypes
ID int32
SalesPerson object
H1_Sales float64
H2_Sales float64
PercentGrowth float64
Total_Units_sold int32
Join_Day int64
Join_Month int64
Join_Year int64
Met_Target bool
Join_date datetime64[ns]
With this we have converted all data types into correct ones, now we can begin analysis with data.