A DataFrame is a 2-dimensional labeled data structure whose columns can hold different types. It is the most commonly used pandas object.
pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)
Parameters:
- data : ndarray, list, dict, Series, or DataFrame
- index : Index to use for the resulting frame. Defaults to RangeIndex if the input data carries no indexing information and no index is provided
- columns : Column labels to use for the resulting frame. Defaults to RangeIndex (0, 1, 2, …, n-1) if no column labels are provided
- dtype : Data type to force on the columns. Only a single dtype is allowed; if omitted, it is inferred from the data
- copy : Whether to copy data from the input
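As a rough illustration of how these parameters interact, the sketch below builds a frame from a small NumPy array. The row labels, column names, and values are made up for the example.

```python
import numpy as np
import pandas as pd

values = np.array([[1, 2], [3, 4], [5, 6]])

# No index or columns given: both default to a RangeIndex (0, 1, 2, ...).
default_df = pd.DataFrame(values)

# Explicit labels and a forced dtype; copy=True keeps the frame
# independent of the `values` array.
labeled_df = pd.DataFrame(
    values,
    index=["r1", "r2", "r3"],
    columns=["x", "y"],
    dtype="float64",
    copy=True,
)

values[0, 0] = 99  # labeled_df is unaffected because its data was copied

print(default_df.columns.tolist())  # [0, 1]
print(labeled_df.dtypes)
print(labeled_df.loc["r1", "x"])    # 1.0
```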
Create Pandas DataFrame
A DataFrame can accept many kinds of input:
- List
- Dict
- Series
- 2-D numpy.ndarray
- Another DataFrame
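The numbered examples that follow cover lists and dicts. For the remaining input kinds in the list above, a minimal sketch (with invented names and values) might look like this:

```python
import numpy as np
import pandas as pd

# From a Series: the Series name becomes the column label.
prices = pd.Series([50, 70, 40], index=["tea", "coffee", "sugar"], name="price")
from_series = pd.DataFrame(prices)

# From a 2-D ndarray: rows get default integer labels unless an index is given.
from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["a", "b"])

# From another DataFrame: select and reorder columns while constructing.
from_frame = pd.DataFrame(from_array, columns=["b", "a"])

print(from_series.columns.tolist())  # ['price']
print(from_array.shape)              # (3, 2)
print(from_frame.columns.tolist())   # ['b', 'a']
```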
1. From List of Lists

    import pandas as pd

    data = [['tea', 50], ['coffee', 70], ['sugar', 40]]
    df = pd.DataFrame(data, columns=['item', 'price'])
    print(df)

    Output:
         item  price
    0     tea     50
    1  coffee     70
    2   sugar     40
2. From Dict of Lists
All lists in the dict must be the same length. If an index is passed, it must also be the same length as the lists. If no index is passed, the result will be range(n), where n is the list length.

    import pandas as pd

    data = {'item': ['tea', 'coffee', 'sugar'],
            'price': [50, 70, 40, 45]}  # 4 prices for 3 items: the lengths differ
    df = pd.DataFrame(data)             # raises ValueError

    Output:
    ...
    ValueError: arrays must all be same length

    import pandas as pd

    data = {'item': ['tea', 'coffee', 'sugar'], 'price': [50, 70, 40]}
    df1 = pd.DataFrame(data)  # no index passed, so the default index 0 to n-1 is used
    df2 = pd.DataFrame(data, index=['item_1', 'item_2', 'item_3'])
    print('df1:')
    print(df1)
    print('\ndf2:')
    print(df2)

    Output:
    df1:
         item  price
    0     tea     50
    1  coffee     70
    2   sugar     40

    df2:
              item  price
    item_1     tea     50
    item_2  coffee     70
    item_3   sugar     40
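If the columns genuinely have different lengths, one possible workaround is to wrap each list in a Series, so that pandas aligns the columns on their index and pads the gaps with NaN. The sketch below reuses the mismatched data from the failing example; `padded` is just an illustrative name.

```python
import pandas as pd

data = {"item": ["tea", "coffee", "sugar"], "price": [50, 70, 40, 45]}

# Passing the raw lists fails because their lengths differ.
try:
    pd.DataFrame(data)
except ValueError as exc:
    print("ValueError:", exc)

# Wrapping each list in a Series lets pandas align on the index instead.
padded = pd.DataFrame({key: pd.Series(values) for key, values in data.items()})
print(padded)
```

Whether padding with NaN is appropriate depends on the data; a length mismatch is often a sign of an upstream bug that is better fixed at the source.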
3. From List of Dicts

    >>> import pandas as pd
    >>> data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
    >>> df1 = pd.DataFrame(data)
    >>> df1
       a   b     c
    0  1   2   NaN
    1  5  10  20.0
    >>> pd.DataFrame(data, index=['first', 'second'])
            a   b     c
    first   1   2   NaN
    second  5  10  20.0
    >>> pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])
            a   b
    first   1   2
    second  5  10
    >>> pd.DataFrame(data, columns=['a', 'd'])
       a    d
    0  1  NaN
    1  5  NaN

4. From Dict of Series

    >>> import pandas as pd
    >>> d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
    ...      'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
    >>> df1 = pd.DataFrame(d)
    >>> df1
       one  two
    a  1.0    1
    b  2.0    2
    c  3.0    3
    d  NaN    4
    >>> pd.DataFrame(d, index=['d', 'b', 'a'])
       one  two
    d  NaN    4
    b  2.0    2
    a  1.0    1
    >>> pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
       two three
    d    4   NaN
    b    2   NaN
    a    1   NaN

Column selection, addition, deletion
Operation | Syntax | Result |
---|---|---|
Select column | df[col] | Series |
Select row by label | df.loc[label] | Series |
Select row by integer location | df.iloc[loc] | Series |
Slice rows | df[5:10] | DataFrame |
Select rows by boolean vector | df[bool_vec] | DataFrame |
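Each row of the table can be tried on a small frame. The following is a minimal sketch with made-up data; the variable names are only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0, 4.0], "two": [10, 20, 30, 40]},
    index=["a", "b", "c", "d"],
)

col = df["one"]                # select column -> Series
row_by_label = df.loc["b"]     # select row by label -> Series
row_by_pos = df.iloc[2]        # select row by integer location -> Series
sliced = df[1:3]               # slice rows -> DataFrame (rows 'b' and 'c')
filtered = df[df["two"] > 20]  # boolean vector -> DataFrame (rows 'c' and 'd')

print(type(col).__name__, type(sliced).__name__)  # Series DataFrame
```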

    In [1]: import pandas as pd
       ...: d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
       ...:      'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
       ...: df = pd.DataFrame(d)
       ...: df
    Out[1]:
       one  two
    a  1.0    1
    b  2.0    2
    c  3.0    3
    d  NaN    4

    In [2]: df['one']  # Select a column by its name
    Out[2]:
    a    1.0
    b    2.0
    c    3.0
    d    NaN
    Name: one, dtype: float64

    # Create a new column by multiplying two columns
    In [3]: df['three'] = df['one'] * df['two']

    # Create a new boolean column from another column
    In [4]: df['flag'] = df['one'] > 2

    In [5]: df
    Out[5]:
       one  two  three   flag
    a  1.0    1    1.0  False
    b  2.0    2    4.0  False
    c  3.0    3    9.0   True
    d  NaN    4    NaN  False

Column Deletion: Columns can be deleted or popped, as with a dict.
The pop() method removes a column from the DataFrame and returns it as a Series.

    In [6]: del df['two']

    In [7]: three = df.pop('three')

    In [8]: df
    Out[8]:
       one   flag
    a  1.0  False
    b  2.0  False
    c  3.0   True
    d  NaN  False

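The difference between the two forms is easiest to see side by side. This is a small sketch on a throwaway frame, where `popped` is an illustrative name: `del` discards the column, while `pop()` hands it back.

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2], "two": [3, 4], "three": [5, 6]})

del df["two"]             # removes the column in place; nothing is returned
popped = df.pop("three")  # removes the column and returns it as a Series

print(df.columns.tolist())  # ['one']
print(popped.tolist())      # [5, 6]

# Popping a column that does not exist raises KeyError, as with a dict.
try:
    df.pop("missing")
except KeyError as exc:
    print("KeyError:", exc)
```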
Column Insertion
When inserting a scalar value, it will naturally be propagated to fill the column:

    In [9]: df['foo'] = 'bar'  # Create a new column 'foo'

    In [10]: df
    Out[10]:
       one   flag  foo
    a  1.0  False  bar
    b  2.0  False  bar
    c  3.0   True  bar
    d  NaN  False  bar

The insert() method inserts a column into the DataFrame at a specified location. It modifies the DataFrame in place and returns None. If the column already exists in the DataFrame, pandas raises a ValueError unless allow_duplicates=True is passed.

    In [11]: df.insert(1, 'new', [1, 2, 34, 50])

    In [12]: df
    Out[12]:
       one  new   flag  foo
    a  1.0    1  False  bar
    b  2.0    2  False  bar
    c  3.0   34   True  bar
    d  NaN   50  False  bar

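The duplicate-label behaviour can be checked with a short sketch like the one below. The frame and values are invented, and `duplicate_error` is only a helper variable for the example.

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2], "flag": [True, False]})

df.insert(1, "new", [10, 20])  # position 1, i.e. between 'one' and 'flag'
print(df.columns.tolist())     # ['one', 'new', 'flag']

# Inserting a label that already exists raises ValueError by default.
duplicate_error = None
try:
    df.insert(0, "new", [0, 0])
except ValueError as exc:
    duplicate_error = str(exc)
print(duplicate_error)

# allow_duplicates=True permits the repeated label.
df.insert(0, "new", [0, 0], allow_duplicates=True)
print(df.columns.tolist())     # ['new', 'one', 'new', 'flag']
```

Duplicate column labels make `df['new']` return a DataFrame rather than a Series, so they are usually best avoided.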
Row selection, addition, deletion

    In [1]: import pandas as pd
       ...: d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
       ...:      'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
       ...: df = pd.DataFrame(d)
       ...: df
    Out[1]:
       one  two
    a  1.0    1
    b  2.0    2
    c  3.0    3
    d  NaN    4

Row Selection

    In [2]: df.loc['b']  # select row by index label
    Out[2]:
    one    2.0
    two    2.0
    Name: b, dtype: float64

    In [3]: df.iloc[2]  # select row by integer location
    Out[3]:
    one    3.0
    two    3.0
    Name: c, dtype: float64

    In [4]: df[2:4]  # select a range of rows with a slice
    Out[4]:
       one  two
    c  3.0    3
    d  NaN    4

Row Insertion
Use pd.concat() to append rows to a DataFrame. The older DataFrame.append() method was deprecated in pandas 1.4 and removed in pandas 2.0.

    In [5]: df
    Out[5]:
       one  two
    a  1.0    1
    b  2.0    2
    c  3.0    3
    d  NaN    4

    In [6]: temp_df = pd.DataFrame([[4.0, 5]], columns=['one', 'two'])

    In [7]: temp_df
    Out[7]:
       one  two
    0  4.0    5

    In [8]: df = pd.concat([df, temp_df])

    In [9]: df
    Out[9]:
       one  two
    a  1.0    1
    b  2.0    2
    c  3.0    3
    d  NaN    4
    0  4.0    5

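The new row above keeps its original label 0, which sits oddly next to 'a' through 'd'. The sketch below shows a few ways to control the label when adding a row with pd.concat() or .loc. The frames and the label 'e' are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0], "two": [1, 2]}, index=["a", "b"])
new_row = pd.DataFrame([[4.0, 5]], columns=["one", "two"])

# Option 1: discard all existing labels and renumber the rows 0..n-1.
renumbered = pd.concat([df, new_row], ignore_index=True)

# Option 2: give the new row an explicit label before concatenating.
labeled_row = pd.DataFrame([[4.0, 5]], columns=["one", "two"], index=["e"])
labeled = pd.concat([df, labeled_row])

# Option 3 (single row): assign by label with .loc, which enlarges df in place.
df.loc["e"] = [4.0, 5]

print(renumbered.index.tolist())  # [0, 1, 2]
print(labeled.index.tolist())     # ['a', 'b', 'e']
print(df.index.tolist())          # ['a', 'b', 'e']
```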
Row Deletion: The drop() method deletes rows or columns of the DataFrame, depending on the axis parameter.
- axis = 0 for removing rows
- axis = 1 for removing columns

    In [10]: df.drop('c', axis=0, inplace=True)  # axis=0 deletes a row

    In [11]: df
    Out[11]:
       one  two
    a  1.0    1
    b  2.0    2
    d  NaN    4
    0  4.0    5

    In [12]: df.drop('two', axis=1, inplace=True)  # axis=1 deletes a column

    In [13]: df
    Out[13]:
       one
    a  1.0
    b  2.0
    d  NaN
    0  4.0

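Without inplace=True, drop() leaves the original frame untouched and returns a new one. A brief sketch of that form, along with the index= and columns= keywords that can be used instead of axis, follows; the data is illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0, 3.0], "two": [1, 2, 3]}, index=["a", "b", "c"])

without_row = df.drop("c", axis=0)    # equivalent to df.drop(index="c")
without_col = df.drop(columns="two")  # equivalent to df.drop("two", axis=1)
trimmed = df.drop(index=["a", "b"], columns=["one"])

# A missing label raises KeyError unless errors="ignore" is passed.
unchanged = df.drop("zzz", axis=0, errors="ignore")

print(df.shape)                     # (3, 2): the original is not modified
print(without_row.index.tolist())   # ['a', 'b']
print(without_col.columns.tolist()) # ['one']
print(trimmed.shape)                # (1, 1)
```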
. . .