Python Pandas – DataFrame

A DataFrame is a 2-dimensional labeled data structure whose columns can hold different types. It is the most widely used pandas data structure.

pandas.DataFrame(data, index, columns, dtype, copy)

Parameters:

  • data : ndarray, dict, Series, list, or another DataFrame
  • index : Row labels to use for the resulting frame. Defaults to a RangeIndex (0, 1, 2, …, n-1) if the input data carries no index information and no index is provided
  • columns : Column labels to use for the resulting frame. Defaults to a RangeIndex (0, 1, 2, …, n-1) if no column labels are provided
  • dtype : Data type to force on the frame. Only a single dtype is allowed; if omitted, the type of each column is inferred
  • copy : Whether to copy the data from the input
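
The parameters above can be tried out with a small sketch. The array contents and the names `values`, `default_df`, `labeled_df`, and `copied_df` are made up for illustration only.

```python
import numpy as np
import pandas as pd

# Raw values that carry no labels of their own (illustrative data).
values = np.array([[1, 2], [3, 4], [5, 6]])

# No index or columns given: both fall back to a RangeIndex.
default_df = pd.DataFrame(values)

# Explicit row labels, column labels, and a forced dtype.
labeled_df = pd.DataFrame(
    values,
    index=["r1", "r2", "r3"],
    columns=["x", "y"],
    dtype="float64",
)

# copy=True makes the frame independent of the source array.
copied_df = pd.DataFrame(values, copy=True)
values[0, 0] = 99  # later changes to the array do not reach copied_df

print(default_df.columns.tolist())  # default column labels
print(labeled_df.dtypes)            # both columns are float64
print(copied_df.iloc[0, 0])         # still the original value
```

Passing copy=True explicitly is the safer choice when the source array will be modified later, because whether a frame shares memory with its input by default can differ between pandas versions.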

Create Pandas DataFrame

A DataFrame can accept many kinds of input:

  • List
  • Dict
  • Series
  • 2-D numpy.ndarray
  • Another DataFrame
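
Lists, dicts, and dicts of Series are covered in the numbered examples that follow. The remaining input kinds from the list above (a 2-D array, a single Series, and another DataFrame) might look like the sketch below; all names and values are illustrative.

```python
import numpy as np
import pandas as pd

# A 2-D NumPy array: one row per array row, one column per array column.
arr_df = pd.DataFrame(np.zeros((2, 3)), columns=["a", "b", "c"])

# A single named Series becomes a one-column DataFrame.
prices = pd.Series([50, 70, 40], name="price")
series_df = pd.DataFrame(prices)

# Another DataFrame: `columns` selects and orders its existing columns.
subset_df = pd.DataFrame(arr_df, columns=["c", "a"])

print(arr_df.shape)
print(series_df.columns.tolist())
print(subset_df.columns.tolist())
```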

 

1. From List of Lists

import pandas as pd
data = [['tea',50],['coffee',70],['sugar',40]]
df = pd.DataFrame(data,columns=['item','price'])
print(df)
 
Output:
 
     item  price
0     tea     50
1  coffee     70
2   sugar     40

2. From Dict of Lists

All lists in the dict must be the same length. If an index is passed, it must also be the same length as the lists. If no index is passed, the index will be range(n), where n is the list length.

import pandas as pd
data = {'item' : ['tea','coffee','sugar'],
        'price': [50, 70, 40, 45]}           # 4 prices for 3 items: lengths differ
df = pd.DataFrame(data)
 
Output:
...
ValueError: All arrays must be of the same length

With lists of equal length, the DataFrame is built as expected:

import pandas as pd
data = {'item' : ['tea','coffee','sugar'],'price':[50,70,40]}
 
df1 = pd.DataFrame(data)     # no index passed, so the default index 0 to n-1 is used
df2 = pd.DataFrame(data,index=['item_1','item_2','item_3'])
 
print('df1:\n', df1, sep='')
print('\ndf2:\n', df2, sep='')
 
Output:
 
df1:
     item  price
0     tea     50
1  coffee     70
2   sugar     40

df2:
          item  price
item_1     tea     50
item_2  coffee     70
item_3   sugar     40
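
The rule that a passed index must match the list length can be checked with a short sketch. The two-label index here is deliberately wrong, and the name `mismatch_error` is only for illustration.

```python
import pandas as pd

data = {"item": ["tea", "coffee", "sugar"], "price": [50, 70, 40]}

mismatch_error = None
try:
    # Two labels for three rows: the lengths do not match.
    pd.DataFrame(data, index=["item_1", "item_2"])
except ValueError as exc:
    mismatch_error = exc
    print("ValueError:", exc)
```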

3. From List of Dicts

>>> import pandas as pd
>>> data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
>>> df1 = pd.DataFrame(data) 
>>> df1
   a   b     c
0  1   2   NaN
1  5  10  20.0

>>> pd.DataFrame(data, index=['first', 'second'])
        a   b     c
first   1   2   NaN
second  5  10  20.0

>>> pd.DataFrame(data, index=['first', 'second'],columns=['a','b'])
        a   b
first   1   2
second  5  10

>>> pd.DataFrame(data,columns=['a','d'])
   a   d
0  1 NaN
1  5 NaN

4. From Dict of Series

>>> import pandas as pd
>>> d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
...      'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
>>> df1 = pd.DataFrame(d) 
>>> df1
   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4

>>> pd.DataFrame(d, index=['d', 'b', 'a'])
   one  two
d  NaN    4
b  2.0    2
a  1.0    1

>>> pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
   two three
d    4   NaN
b    2   NaN
a    1   NaN

Column selection, addition, deletion

Operation                        Syntax          Result
Select column                    df[col]         Series
Select row by label              df.loc[label]   Series
Select row by integer location   df.iloc[loc]    Series
Slice rows                       df[5:10]        DataFrame
Select rows by boolean vector    df[bool_vec]    DataFrame

In [1]:
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d) 
df
Out[1]:
   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4

In [2]: df['one']                  # Select a column by its name
Out[2]: 
a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

# Create a new column by multiplying two columns
In [3]: df['three'] = df['one'] * df['two']  

# Create a new boolean column from another column
In [4]: df['flag'] = df['one'] > 2

In [5]: df
Out[5]: 
   one  two  three   flag
a  1.0    1    1.0  False
b  2.0    2    4.0  False
c  3.0    3    9.0   True
d  NaN    4    NaN  False
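
The table at the start of this section also lists row selection by boolean vector, which the session above does not exercise. A minimal sketch, rebuilding the same frame so it runs on its own (the names `mask` and `selected` are illustrative):

```python
import pandas as pd

d = {"one": pd.Series([1, 2, 3], index=["a", "b", "c"]),
     "two": pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])}
df = pd.DataFrame(d)

# A boolean Series aligned with the row index; NaN > 1 evaluates to False.
mask = df["one"] > 1

# Indexing with the mask keeps only the rows where it is True.
selected = df[mask]
print(selected)
```

Row 'd' is excluded because its value in column 'one' is missing, and comparisons against NaN are False.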

Column Deletion – Columns can be deleted or popped like with a dict.

The pop() method removes a column from the DataFrame and returns it as a Series, whereas del removes the column without returning anything.

In [6]: del df['two']

In [7]: three = df.pop('three')

In [8]: df
Out[8]: 
   one   flag
a  1.0  False
b  2.0  False
c  3.0   True
d  NaN  False
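
Both del and pop() change the DataFrame in place. When the original should stay intact, drop(columns=...) is one alternative that returns a new frame instead. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0], "two": [1, 2], "three": [1.0, 4.0]})

# pop() removes the column in place and hands it back as a Series.
three = df.pop("three")

# drop(columns=...) leaves df untouched and returns a new DataFrame.
without_two = df.drop(columns="two")

print(type(three).__name__)           # Series
print(df.columns.tolist())            # ['one', 'two']
print(without_two.columns.tolist())   # ['one']
```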

Column Insertion

When inserting a scalar value, it will naturally be propagated to fill the column:

In [9]: df['foo'] = 'bar'       # Create a new column 'foo'

In [10]: df
Out[10]: 
   one   flag  foo
a  1.0  False  bar
b  2.0  False  bar
c  3.0   True  bar
d  NaN  False  bar

The insert() method inserts a column into the DataFrame at a specified position. If a column with the same name already exists, pandas raises a ValueError unless allow_duplicates=True is passed.

In [11]: df.insert(1,'new',[1,2,34,50])   # modifies df in place and returns None

In [12]: df
Out[12]: 
   one  new   flag  foo
a  1.0    1  False  bar
b  2.0    2  False  bar
c  3.0   34   True  bar
d  NaN   50  False  bar
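
The duplicate-name check can be seen in a short sketch. The frame and the flag `duplicate_rejected` are illustrative, not part of the session above.

```python
import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0], "flag": [False, True]})

# Insert a new column at position 1, between 'one' and 'flag'.
df.insert(1, "new", [10, 20])

duplicate_rejected = False
try:
    # 'new' already exists, so a second insert under that name is rejected.
    df.insert(0, "new", [30, 40])
except ValueError as exc:
    duplicate_rejected = True
    print("ValueError:", exc)

print(df.columns.tolist())  # ['one', 'new', 'flag']
```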

Row selection, addition, deletion

In [1]:
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
df
Out[1]:
   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4

Row Selection

In [2]: df.loc['b']         # select row by index label
Out[2]: 
one    2.0
two    2.0
Name: b, dtype: float64

In [3]: df.iloc[2]          # select row by integer position
Out[3]:
one    3.0
two    3.0
Name: c, dtype: float64

In [4]: df[2:4]             # select rows by defining range using slice
Out[4]: 
   one  two
c  3.0    3
d  NaN    4

Row Insertion

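The pd.concat() function can be used to append rows to a DataFrame. The older DataFrame.append() method was deprecated in pandas 1.4 and removed in pandas 2.0.

In [5]: df
Out[5]:
   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4

In [6]: temp_df = pd.DataFrame([[4.0,5]],columns=['one','two'])
In [7]: temp_df
Out[7]:
   one  two
0  4.0    5

In [8]: df = pd.concat([df, temp_df])    # returns a new DataFrame; row labels are kept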
In [9]: df
Out[9]:
   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4
0  4.0    5
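
In the result above, the added row keeps its own label 0 next to the labels a to d. Two common variations are sketched below with a smaller, made-up frame: renumbering the rows with ignore_index=True, and adding a single row under a chosen label through .loc.

```python
import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0], "two": [1, 2]}, index=["a", "b"])
new_rows = pd.DataFrame([[4.0, 5]], columns=["one", "two"])

# Keep the original labels: the added row keeps its own label 0.
kept = pd.concat([df, new_rows])

# Discard the labels and renumber the rows 0..n-1.
renumbered = pd.concat([df, new_rows], ignore_index=True)

# Add a single row under a chosen label via label-based assignment.
df.loc["e"] = [9.0, 9]

print(kept.index.tolist())        # ['a', 'b', 0]
print(renumbered.index.tolist())  # [0, 1, 2]
print(df.index.tolist())          # ['a', 'b', 'e']
```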

Row Deletion – The drop() method deletes rows or columns of the DataFrame, depending on the axis parameter.

  • axis = 0 for removing rows
  • axis = 1 for removing columns

With inplace=True, drop() modifies df and returns None, so df is displayed in a separate step.

In [10]: df.drop('c',axis=0,inplace=True)   # set axis = 0 to delete row

In [11]: df
Out[11]: 
   one  two
a  1.0    1
b  2.0    2
d  NaN    4
0  4.0    5

In [12]: df.drop('two',axis=1,inplace=True) # set axis = 1 to delete column

In [13]: df
Out[13]: 
   one
a  1.0
b  2.0
d  NaN
0  4.0
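
Without inplace=True, drop() leaves the original frame alone and returns a new one. The index= and columns= keywords are an alternative to spelling out axis. A sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0, 3.0], "two": [1, 2, 3]},
                  index=["a", "b", "c"])

# index= / columns= are keyword equivalents of axis=0 / axis=1.
no_c = df.drop(index="c")
no_two = df.drop(columns="two")

# Several labels can be dropped at once by passing a list.
only_a = df.drop(index=["b", "c"])

print(no_c.index.tolist())      # ['a', 'b']
print(no_two.columns.tolist())  # ['one']
print(df.shape)                 # (3, 2): the original is unchanged
```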
