Dealing with Data Types in Pandas

3 min readJan 4, 2020

Data Types is one of the first things we should look at after loading the data.
Data type is essentially an internal construct that a programming language uses to understand how to store and manipulate the data. Most of the times Pandas will correctly infer data types, however as per the best practice we should check data types and ensure these are of correct types otherwise we may get unexpected results or errors.

In this article we will take a small dataset of sales to understand the different datatypes of Pandas and we will make datatype conversion where necessary.

Above is the data we have loaded, at first look, data looks reasonably ‘ok’. Lets quickly check datatypes

df.dtypesID                  float64
SalesPerson          object
H1_TotalSales        object
H2_Sales             object
PercentGrowth        object
Total_Units_sold     object
Join_Day              int64
Join_Month            int64
Join_Year             int64
Met_Target           object

Pandas uses ‘object’ dtype for storing strings. However is dtype is also assigned as ‘object’ when data is of mixed type present in column.
By default integer types are int64 and float type are float64.
For date time related there is a datetime64 data type.
Upcasting:
Types can potentially be upcasted when combined with other types, meaning they are promoted from for e.g. int to float

Categoricals are a pandas data type corresponding to categorical variables in statistics. A categorical variable takes on a limited, and usually fixed, number of possible values. Examples are gender, social class, rating via Likert scales.

Lets start analyzing each column.
ID : is coming as float64. Since ID is a simple numeric so we can convert it into integer from float.

astype():
The most straight forward way to do data type conversion is using astype()

# to convert ID column from float to intdf['ID'] = df['ID'].astype('int')df['ID']
0    10001
1    10002
2    10003
3    10004
4    10005

As we notice decimal part from ID field is gone now and field is converted to integer type.

Lets apply the same function to convert H1_Sales and H2_Sales from Object to float
# to convert ID column from float to int

df['H1_Sales'].astype('float')We will get 
ValueError: could not convert string to float: '$600'

So, as we see it is not very straight forward for astype() to work always.

Custom Function for data type conversion
We can write out own custom function to do data type conversion.

def amount_str_to_float(amt):
    '''
    Convert amt from string/object type to float
    by removing '$', ','
    '''
    amt_cleaned = amt.replace(',','').replace('$', '')
    return float(amt_cleaned)df['H1_Sales'] = df['H1_Sales'].apply(amount_str_to_float)
df['H2_Sales'] = df['H2_Sales'].apply(amount_str_to_float)df.dtypesID                    int32
SalesPerson          object
H1_Sales            float64
H2_Sales            float64
PercentGrowth        object
Total_Units_sold     object
Join_Day              int64
Join_Month            int64
Join_Year             int64
Met_Target           object
dtype: object

So far we have fixed 3 columns, Lets fix PercentGrowth column.

df['PercentGrowth'] = df['PercentGrowth'].apply(lambda x : float(x.replace('%','')))df.dtypesID                    int32
SalesPerson          object
H1_Sales            float64
H2_Sales            float64
PercentGrowth       float64
Total_Units_sold     object
Join_Day              int64
Join_Month            int64
Join_Year             int64
Met_Target           object
dtype: object

Lets pick next column ‘Total_Units_Sold’
Since this field has one value as ‘Unknown’ , so we have to fix the data itself first, before fixing data type.

# Correcting data valuedf.loc[df['ID']==10005, 'Total_Units_sold'] = 14# Coverting data type to intdf['Total_Units_sold'] = df['Total_Units_sold'].astype('int')df.dtypes
ID                    int32
SalesPerson          object
H1_Sales            float64
H2_Sales            float64
PercentGrowth       float64
Total_Units_sold      int32
Join_Day              int64
Join_Month            int64
Join_Year             int64
Met_Target           object
dtype: object

Lets next take ‘Met_Target’ column, it has values as ’N’ and ‘Y’.
Lets convert it to boolean , first fix data , Y as True and N and False.

df['Met_Target'] = np.where(df['Met_Target']=='Y', True, False)df['Met_Target'] = df['Met_Target'].astype('bool')df.dtypesID                    int32
SalesPerson          object
H1_Sales            float64
H2_Sales            float64
PercentGrowth       float64
Total_Units_sold      int32
Join_Day              int64
Join_Month            int64
Join_Year             int64
Met_Target             bool

Now finally lets deal with ‘Join’ columns, what we will do is combine the columns to create a new column as ‘Join_date’ then convert it into datetime

join_cols = ['Join_Day', 'Join_Month', 'Join_Year']df['Join_date'] = pd.to_datetime(df[join_cols].apply(lambda row : '/'.join(row.values.astype(str)), axis=1))

df.dtypes

ID                           int32
SalesPerson                 object
H1_Sales                   float64
H2_Sales                   float64
PercentGrowth              float64
Total_Units_sold             int32
Join_Day                     int64
Join_Month                   int64
Join_Year                    int64
Met_Target                    bool
Join_date           datetime64[ns]

With this we have converted all data types into correct ones, now we can begin analysis with data.

Dealing with Data Types in Pandas

Written by shekhar pandey

No responses yet