Converting pandas DataFrame types
I have a pandas DataFrame created from a MySQL query, which returns the data as object dtype. The data is mostly numeric, with some 'na' values. How can I cast the DataFrame's types so that numeric values are appropriately typed (floats) and the 'na' values are represented as numpy NaN values?
3 Answers
Use the replace method on dataframes:
import numpy as np
import pandas as pd

df = pd.DataFrame({'k1': ['na'] * 3 + ['two'] * 4,
                   'k2': [1, 'na', 2, 'na', 3, 4, 4]})
print(df)

df = df.replace('na', np.nan)
print(df)
I think it's helpful to point out that df.replace('na', np.nan) by itself won't work, since replace does not operate in place by default; you must assign the result back to the existing DataFrame.
df = df.convert_objects(convert_numeric=True) will work in most cases.
I should note that this copies the data. It would be preferable to get it to a numeric type on the initial read. If you post your code and a small example, someone might be able to help you with that.
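For example, a minimal sketch of coercing right after the read (the frame below just stands in for whatever the MySQL call returns; the column names are placeholders, not from the question):

import numpy as np
import pandas as pd

# stand-in for the object-dtype frame a MySQL read typically produces
df = pd.DataFrame({'x': ['1', 'na', '3'], 'y': ['2.5', '4.0', 'na']}, dtype=object)

# turn the 'na' sentinels into real missing values, then coerce column by column;
# errors='ignore' leaves genuinely non-numeric columns untouched
df = df.replace('na', np.nan).apply(pd.to_numeric, errors='ignore')
print(df.dtypes)  # x    float64
                  # y    float64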
This doesn't seem to work, e.g. s = pd.Series([1, 'na', 3, 4]); s.convert_objects(convert_numeric=True)
Hmm, it works for a DataFrame. I guess they aren't using the same heuristics to recast? EDIT: I guess the example you gave didn't work. I was working with something like s = pd.DataFrame(['1', 'na', '3', '4']), and it works for that.
This is what Tom suggested and is correct:
In [134]: s = pd.Series(['1', '2.', 'na'])

In [135]: s.convert_objects(convert_numeric=True)
Out[135]:
0      1
1      2
2    NaN
dtype: float64
As Andy points out, this doesn't work directly on mixed-type input (I think that's a bug), so convert all elements to strings first, then convert:
In [136]: s2 = pd.Series(['1', '2.', 'na', 5])

In [138]: s2.astype(str).convert_objects(convert_numeric=True)
Out[138]:
0      1
1      2
2    NaN
3      5
dtype: float64
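For completeness: convert_objects was deprecated in pandas 0.17 and removed in later releases; on a modern pandas the same coercion is spelled with pd.to_numeric (a sketch, not part of the original answers):

import pandas as pd

s2 = pd.Series(['1', '2.', 'na', 5])

# errors='coerce' turns anything unparseable (like 'na') into NaN
print(pd.to_numeric(s2.astype(str), errors='coerce'))
# 0    1.0
# 1    2.0
# 2    NaN
# 3    5.0
# dtype: float64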
pandas.DataFrame.convert_dtypes
Convert columns to the best possible dtypes using dtypes supporting pd.NA.
Parameters:

infer_objects : bool, default True
    Whether object dtypes should be converted to the best possible types.
convert_string : bool, default True
    Whether object dtypes should be converted to StringDtype().
convert_integer : bool, default True
    Whether, if possible, conversion can be done to integer extension types.
convert_boolean : bool, default True
    Whether object dtypes should be converted to BooleanDtype().
convert_floating : bool, default True
    Whether, if possible, conversion can be done to floating extension types. If convert_integer is also True, preference will be given to integer dtypes if the floats can be faithfully cast to integers.
dtype_backend : {"numpy_nullable", "pyarrow"}, default "numpy_nullable"
    Which dtype_backend to use, e.g. whether a DataFrame should use nullable dtypes for all dtypes that have a nullable implementation when "numpy_nullable" is set, or pyarrow for all dtypes if "pyarrow" is set. The dtype backends are still experimental.
Returns:

Series or DataFrame
    Copy of input object with new dtype.
See also:

to_datetime
    Convert argument to datetime.
to_timedelta
    Convert argument to timedelta.
to_numeric
    Convert argument to a numeric type.
Notes:

By default, convert_dtypes will attempt to convert a Series (or each Series in a DataFrame) to dtypes that support pd.NA. By using the options convert_string, convert_integer, convert_boolean and convert_floating, it is possible to turn off individual conversions to StringDtype, the integer extension types, BooleanDtype or floating extension types, respectively.
For object-dtyped columns, if infer_objects is True, use the inference rules as during normal Series/DataFrame construction. Then, if possible, convert to StringDtype, BooleanDtype or an appropriate integer or floating extension type, otherwise leave as object.
If the dtype is integer, convert to an appropriate integer extension type.
If the dtype is numeric, and consists of all integers, convert to an appropriate integer extension type. Otherwise, convert to an appropriate floating extension type.
Changed in version 1.2: Starting with pandas 1.2, this method also converts float columns to the nullable floating extension type.
In the future, as new dtypes are added that support pd.NA , the results of this method will change to support those new dtypes.
>>> df = pd.DataFrame(
...     {
...         "a": pd.Series([1, 2, 3], dtype=np.dtype("int32")),
...         "b": pd.Series(["x", "y", "z"], dtype=np.dtype("O")),
...         "c": pd.Series([True, False, np.nan], dtype=np.dtype("O")),
...         "d": pd.Series(["h", "i", np.nan], dtype=np.dtype("O")),
...         "e": pd.Series([10, np.nan, 20], dtype=np.dtype("float")),
...         "f": pd.Series([np.nan, 100.5, 200], dtype=np.dtype("float")),
...     }
... )
Start with a DataFrame with default dtypes.
>>> df
   a  b      c    d     e      f
0  1  x   True    h  10.0    NaN
1  2  y  False    i   NaN  100.5
2  3  z    NaN  NaN  20.0  200.0
>>> df.dtypes
a      int32
b     object
c     object
d     object
e    float64
f    float64
dtype: object
Convert the DataFrame to use best possible dtypes.
>>> dfn = df.convert_dtypes()
>>> dfn
   a  b      c     d     e      f
0  1  x   True     h    10   <NA>
1  2  y  False     i  <NA>  100.5
2  3  z   <NA>  <NA>    20  200.0
>>> dfn.dtypes
a             Int32
b    string[python]
c           boolean
d    string[python]
e             Int64
f           Float64
dtype: object
Start with a Series of strings and missing data represented by np.nan.

>>> s = pd.Series(["a", "b", np.nan])
>>> s
0      a
1      b
2    NaN
dtype: object
Obtain a Series with dtype StringDtype.

>>> s.convert_dtypes()
0       a
1       b
2    <NA>
dtype: string
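As a quick illustration of the dtype_backend parameter described above (a hedged sketch; requires pandas >= 2.0, and the "pyarrow" backend additionally needs the pyarrow package installed):

>>> pd.Series([1, 2, None]).convert_dtypes(dtype_backend="pyarrow").dtype
int64[pyarrow]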
pandas.DataFrame.astype
dtype : str, data type, Series or Mapping of column name -> data type
    Use a str, numpy.dtype, pandas.ExtensionDtype or Python type to cast the entire pandas object to the same type. Alternatively, use a mapping, e.g. {col: dtype, ...}, where col is a column label and dtype is a numpy.dtype or Python type, to cast one or more of the DataFrame's columns to column-specific types.
copy : bool, default True
    Return a copy when copy=True (be very careful setting copy=False, as changes to values then may propagate to other pandas objects).
errors : {'raise', 'ignore'}, default 'raise'
    Control raising of exceptions on invalid data for provided dtype.
    - raise : allow exceptions to be raised
    - ignore : suppress exceptions. On error return original object.
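A small sketch of the difference (the example values are illustrative, not from the docs):

>>> s = pd.Series(['1', 'na', '3'])
>>> s.astype(float)                   # errors='raise' (the default) raises
Traceback (most recent call last):
    ...
ValueError: could not convert string to float: 'na'
>>> s.astype(float, errors='ignore')  # on failure, the original object comes back
0     1
1    na
2     3
dtype: object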
See also:

to_datetime
    Convert argument to datetime.
to_timedelta
    Convert argument to timedelta.
to_numeric
    Convert argument to a numeric type.
numpy.ndarray.astype
    Cast a numpy array to a specified type.
Changed in version 2.0.0: Using astype to convert from timezone-naive dtype to timezone-aware dtype will raise an exception. Use Series.dt.tz_localize() instead.
>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> df.dtypes
col1    int64
col2    int64
dtype: object
Cast all columns to int32:
>>> df.astype('int32').dtypes
col1    int32
col2    int32
dtype: object
Cast col1 to int32 using a dictionary:
>>> df.astype({'col1': 'int32'}).dtypes
col1    int32
col2    int64
dtype: object
>>> ser = pd.Series([1, 2], dtype='int32')
>>> ser
0    1
1    2
dtype: int32

>>> ser.astype('int64')
0    1
1    2
dtype: int64
Convert to categorical type:
>>> ser.astype('category')
0    1
1    2
dtype: category
Categories (2, int32): [1, 2]
Convert to ordered categorical type with custom ordering:
>>> from pandas.api.types import CategoricalDtype
>>> cat_dtype = CategoricalDtype(
...     categories=[2, 1], ordered=True)
>>> ser.astype(cat_dtype)
0    1
1    2
dtype: category
Categories (2, int64): [2 < 1]
>>> ser_date = pd.Series(pd.date_range('20200101', periods=3))
>>> ser_date
0   2020-01-01
1   2020-01-02
2   2020-01-03
dtype: datetime64[ns]
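Per the version 2.0.0 note above, making such a series timezone-aware now goes through Series.dt.tz_localize() rather than astype; a brief sketch:

>>> ser_date.dt.tz_localize('US/Eastern')
0   2020-01-01 00:00:00-05:00
1   2020-01-02 00:00:00-05:00
2   2020-01-03 00:00:00-05:00
dtype: datetime64[ns, US/Eastern]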
Converting a column within a pandas DataFrame from int to string
I have a DataFrame in pandas with mixed int and str data columns. First, I want to concatenate the columns within the DataFrame. To do that, I have to convert an int column to str. I've tried the following:
mtrx['X.3'] = mtrx.to_string(columns = ['X.3'])
but in both cases it's not working, and I'm getting an error saying "cannot concatenate 'str' and 'int' objects". Concatenating two str columns works perfectly fine.
7 Answers
In [16]: df = DataFrame(np.arange(10).reshape(5,2), columns=list('AB'))

In [17]: df
Out[17]:
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9

In [18]: df.dtypes
Out[18]:
A    int64
B    int64
dtype: object
In [19]: df['A'].apply(str)
Out[19]:
0    0
1    2
2    4
3    6
4    8
Name: A, dtype: object

In [20]: df['A'].apply(str)[0]
Out[20]: '0'
Don't forget to assign the result back (df['A'] = df['A'].apply(str)); to convert every column at once, use applymap:
In [21]: df.applymap(str)
Out[21]:
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9

In [22]: df.applymap(str).iloc[0,0]
Out[22]: '0'
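Side note, not from the original answer: in pandas 2.1+ applymap is deprecated and renamed to DataFrame.map, which is the same elementwise call:

In [23]: df.map(str).iloc[0,0]   # pandas >= 2.1
Out[23]: '0'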
I really don't understand why, but mtrx['X.3'].apply(str) does not work for me either 🙁 The dtype still shows int64; the DataFrame has 23177 rows and the X.3 column contains only numbers.

In [21]: mtrx['X.3'].dtype
Out[21]: dtype('int64')
df['A'].apply(str) is not working, but df.column_name = df.column_name.astype(str) works. No idea why.
@DmitryKonovalov in Python strings are immutable, so whenever you manipulate the data, you have to assign the result back to the variable.
Change data type of DataFrame column:

df.column_name = df.column_name.astype(str)
This is appealing, but it is about 4x slower than apply(str) from @Jeff, in my test using pd.Series(np.arange(1000000)).
This works for me. df['A'] = df['A'].apply(str) also works. The answer provided by @Jeff does not work for me.
Regarding @JohnZwinck's comment, using Python 3 it seems to be more like 2x as fast to use apply() instead of astype():

>>> timeit.Timer('c.apply(str)', setup='import pandas as pd; c = pd.Series(range(1000))').timeit(1000)
0.41499893204309046
>>> timeit.Timer('c.astype(str)', setup='import pandas as pd; c = pd.Series(range(1000))').timeit(1000)
0.8004439630312845
Warning: both solutions given (astype() and apply()) do not preserve NULL values in either the nan or the None form.
import pandas as pd
import numpy as np

df = pd.DataFrame([None, 'string', np.nan, 42], index=[0, 1, 2, 3], columns=['A'])
df1 = df['A'].astype(str)
df2 = df['A'].apply(str)

print(df.isnull())
print(df1.isnull())
print(df2.isnull())
I believe this is fixed by the implementation of to_string()
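One way to keep missing values missing while still converting is pandas' nullable string dtype (pandas >= 1.0); a minimal sketch against the frame above:

df['A'].astype(str).isnull().any()       # False: None/NaN became the literal strings 'None'/'nan'
df['A'].astype('string').isnull().any()  # True: missing values survive as <NA>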
df.column_name = df.column_name.astype('str')
There are four ways to convert columns to string:
1. astype(str)
   df['column_name'] = df['column_name'].astype(str)

2. values.astype(str)
   df['column_name'] = df['column_name'].values.astype(str)

3. map(str)
   df['column_name'] = df['column_name'].map(str)

4. apply(str)
   df['column_name'] = df['column_name'].apply(str)
Let's see the performance of each approach:
# importing libraries
import numpy as np
import pandas as pd
import time

# creating four sample dataframes using dummy data
df1 = pd.DataFrame(np.random.randint(1, 1000, size=(10000000, 1)), columns=['A'])
df2 = pd.DataFrame(np.random.randint(1, 1000, size=(10000000, 1)), columns=['A'])
df3 = pd.DataFrame(np.random.randint(1, 1000, size=(10000000, 1)), columns=['A'])
df4 = pd.DataFrame(np.random.randint(1, 1000, size=(10000000, 1)), columns=['A'])

# applying astype(str)
time1 = time.time()
df1['A'] = df1['A'].astype(str)
print('time taken for astype(str) : ' + str(time.time() - time1) + ' seconds')

# applying values.astype(str)
time2 = time.time()
df2['A'] = df2['A'].values.astype(str)
print('time taken for values.astype(str) : ' + str(time.time() - time2) + ' seconds')

# applying map(str)
time3 = time.time()
df3['A'] = df3['A'].map(str)
print('time taken for map(str) : ' + str(time.time() - time3) + ' seconds')

# applying apply(str)
time4 = time.time()
df4['A'] = df4['A'].apply(str)
print('time taken for apply(str) : ' + str(time.time() - time4) + ' seconds')
time taken for astype(str) : 5.472359895706177 seconds
time taken for values.astype(str) : 6.5844292640686035 seconds
time taken for map(str) : 2.3686647415161133 seconds
time taken for apply(str) : 2.39758563041687 seconds
If you run it multiple times, the time for each technique will vary. On average, map(str) and apply(str) take less time than the other two techniques.
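If you want steadier numbers than one-shot time.time() measurements, timeit with repeats is an option (a sketch, not from the answer; the Series is smaller here so it runs quickly):

import timeit

setup = "import pandas as pd; import numpy as np; s = pd.Series(np.random.randint(1, 1000, size=1_000_000))"
for stmt in ("s.astype(str)", "s.map(str)", "s.apply(str)"):
    # best-of-3 repetitions, 3 calls each, to damp out scheduling noise
    best = min(timeit.repeat(stmt, setup=setup, number=3, repeat=3))
    print(f"{stmt}: {best:.3f} s for 3 calls")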