10minutes to Pandas

2022-03-24

2022-10-05

Import Library

import numpy as np
print("numpy ver." + np.__version__)
import pandas as pd
print("pandas ver." + pd.__version__)

numpy ver.1.21.5
pandas ver.1.3.5

Object Creation

Creating Series

by passing a list of values, letting pandas create a default integer index

1 2	s = pd.Series([1,3,5,np.nan, 6,8]) print(s)

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Creation DataFrame

by passing a NumPy array, with a datetime index and labeled columns

dates = pd.date_range("20130101", periods=6)
print(dates)
print()
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list("ABCD"))
print(df)

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

                   A         B         C         D
2013-01-01 -0.818896 -0.409184  0.087447 -0.935887
2013-01-02  0.681543 -1.390976  2.013105  0.644468
2013-01-03  1.017911  0.033224 -0.103912  0.634459
2013-01-04 -0.450437  0.501915  1.003776  0.691249
2013-01-05  1.633764  0.324234 -1.707570  1.163615
2013-01-06  0.282402 -0.922663 -1.641314  0.505433

by passing a dictionary of objects that can be converted into a series-like structure

df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
print(df2)
print()
print("dtypes:")
print(df2.dtypes)

     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo

dtypes:
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

1 2	# IPython # df2.<TAB>

Viewing Data

Basic Structure

1
2
3

print("Head 5:\n", df.head())
print()
print("Tail 5:\n", df.tail())

Head 5:
                    A         B         C         D
2013-01-01 -0.818896 -0.409184  0.087447 -0.935887
2013-01-02  0.681543 -1.390976  2.013105  0.644468
2013-01-03  1.017911  0.033224 -0.103912  0.634459
2013-01-04 -0.450437  0.501915  1.003776  0.691249
2013-01-05  1.633764  0.324234 -1.707570  1.163615

Tail 5:
                    A         B         C         D
2013-01-02  0.681543 -1.390976  2.013105  0.644468
2013-01-03  1.017911  0.033224 -0.103912  0.634459
2013-01-04 -0.450437  0.501915  1.003776  0.691249
2013-01-05  1.633764  0.324234 -1.707570  1.163615
2013-01-06  0.282402 -0.922663 -1.641314  0.505433

1 2	print("Index :\n", df.index) print("Columns:\n", df.columns)

Index :
 DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
Columns:
 Index(['A', 'B', 'C', 'D'], dtype='object')

DataFrame to NumPy array

This can be an expensive operation when your DataFrame has columns with different data types:
- NumPy arrays have one dtype for the entire array.
- Pandas DataFrames have one dtype per column.
With heterogeneous data, the lowest common type will have to be used.
For a mix of numeric and non-numeric types, the output array will have object dtype.

df.values

array([[-0.81889557, -0.40918379,  0.08744721, -0.93588677],
       [ 0.68154262, -1.39097644,  2.01310478,  0.64446846],
       [ 1.01791104,  0.03322364, -0.1039122 ,  0.63445887],
       [-0.45043711,  0.50191541,  1.00377617,  0.69124864],
       [ 1.633764  ,  0.32423439, -1.70757042,  1.16361464],
       [ 0.28240247, -0.92266305, -1.64131359,  0.50543323]])

1
2
3

# For df, our DataFrame of all floating-point values,
# DataFrame.to_numpy() is fast and doesn't require copying data:
df.to_numpy()

array([[-0.81889557, -0.40918379,  0.08744721, -0.93588677],
       [ 0.68154262, -1.39097644,  2.01310478,  0.64446846],
       [ 1.01791104,  0.03322364, -0.1039122 ,  0.63445887],
       [-0.45043711,  0.50191541,  1.00377617,  0.69124864],
       [ 1.633764  ,  0.32423439, -1.70757042,  1.16361464],
       [ 0.28240247, -0.92266305, -1.64131359,  0.50543323]])

1	df2.values

array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)

1
2
3

# For df2, the DataFrame with multiple dtypes,
# DataFrame.to_numpy() is relatively expensive:
df2.to_numpy()

array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)

Describe, Transpose, Sort

1	print(df.describe())

              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.391048 -0.310575 -0.058078  0.450556
std    0.917121  0.739319  1.460690  0.715967
min   -0.818896 -1.390976 -1.707570 -0.935887
25%   -0.267227 -0.794293 -1.256963  0.537690
50%    0.481973 -0.187980 -0.008232  0.639464
75%    0.933819  0.251482  0.774694  0.679554
max    1.633764  0.501915  2.013105  1.163615

1	print(df.T)

   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A   -0.818896    0.681543    1.017911   -0.450437    1.633764    0.282402
B   -0.409184   -1.390976    0.033224    0.501915    0.324234   -0.922663
C    0.087447    2.013105   -0.103912    1.003776   -1.707570   -1.641314
D   -0.935887    0.644468    0.634459    0.691249    1.163615    0.505433

print("Sorted by axis 1:")
print(df.sort_index(axis=1, ascending=False))
print()
print("Sorted by values B:")
print(df.sort_values(by="B"))

Sorted by axis 1:
                   D         C         B         A
2013-01-01 -0.935887  0.087447 -0.409184 -0.818896
2013-01-02  0.644468  2.013105 -1.390976  0.681543
2013-01-03  0.634459 -0.103912  0.033224  1.017911
2013-01-04  0.691249  1.003776  0.501915 -0.450437
2013-01-05  1.163615 -1.707570  0.324234  1.633764
2013-01-06  0.505433 -1.641314 -0.922663  0.282402

Sorted by values B:
                   A         B         C         D
2013-01-02  0.681543 -1.390976  2.013105  0.644468
2013-01-06  0.282402 -0.922663 -1.641314  0.505433
2013-01-01 -0.818896 -0.409184  0.087447 -0.935887
2013-01-03  1.017911  0.033224 -0.103912  0.634459
2013-01-05  1.633764  0.324234 -1.707570  1.163615
2013-01-04 -0.450437  0.501915  1.003776  0.691249

Selecting Data

Getting

1 2	print(df["A"]) # print(df.A)

2013-01-01   -0.818896
2013-01-02    0.681543
2013-01-03    1.017911
2013-01-04   -0.450437
2013-01-05    1.633764
2013-01-06    0.282402
Freq: D, Name: A, dtype: float64

1 2	print(df[:3]) print(df["20130102":"20130104"])

                   A         B         C         D
2013-01-01 -0.818896 -0.409184  0.087447 -0.935887
2013-01-02  0.681543 -1.390976  2.013105  0.644468
2013-01-03  1.017911  0.033224 -0.103912  0.634459
                   A         B         C         D
2013-01-02  0.681543 -1.390976  2.013105  0.644468
2013-01-03  1.017911  0.033224 -0.103912  0.634459
2013-01-04 -0.450437  0.501915  1.003776  0.691249

Selection by Label

1	print(df.loc[dates[0]])

A   -0.818896
B   -0.409184
C    0.087447
D   -0.935887
Name: 2013-01-01 00:00:00, dtype: float64

1	print(df.loc["20130102":"20130104", ["A","B"]])

                   A         B
2013-01-02  0.681543 -1.390976
2013-01-03  1.017911  0.033224
2013-01-04 -0.450437  0.501915

1	print(df.loc["20130102", ["A","B"]])

A    0.681543
B   -1.390976
Name: 2013-01-02 00:00:00, dtype: float64

1 2	print(df.loc[dates[0], "A"]) print(df.at[dates[0],"A"]) # equivalent to the prior method

-0.818895566676464
-0.818895566676464

Selection by Position

1	print(df.iloc[3])

A   -0.450437
B    0.501915
C    1.003776
D    0.691249
Name: 2013-01-04 00:00:00, dtype: float64

1	print(df.iloc[3:5, 0:2])

                   A         B
2013-01-04 -0.450437  0.501915
2013-01-05  1.633764  0.324234

1	print(df.iloc[[1,2,5],[1,3]])

                   B         D
2013-01-02 -1.390976  0.644468
2013-01-03  0.033224  0.634459
2013-01-06 -0.922663  0.505433

1	print(df.iloc[1:3,:])

                   A         B         C         D
2013-01-02  0.681543 -1.390976  2.013105  0.644468
2013-01-03  1.017911  0.033224 -0.103912  0.634459

1	print(df.iloc[:,1:3])

                   B         C
2013-01-01 -0.409184  0.087447
2013-01-02 -1.390976  2.013105
2013-01-03  0.033224 -0.103912
2013-01-04  0.501915  1.003776
2013-01-05  0.324234 -1.707570
2013-01-06 -0.922663 -1.641314

1 2	print(df.iloc[1,1]) print(df.iat[1,1]) # equivalent to the prior method

-1.3909764417520816
-1.3909764417520816

Boolean Indexing

1	print(df[df["A"]>0])

                   A         B         C         D
2013-01-02  0.681543 -1.390976  2.013105  0.644468
2013-01-03  1.017911  0.033224 -0.103912  0.634459
2013-01-05  1.633764  0.324234 -1.707570  1.163615
2013-01-06  0.282402 -0.922663 -1.641314  0.505433

1	print(df[df>0])

                   A         B         C         D
2013-01-01       NaN       NaN  0.087447       NaN
2013-01-02  0.681543       NaN  2.013105  0.644468
2013-01-03  1.017911  0.033224       NaN  0.634459
2013-01-04       NaN  0.501915  1.003776  0.691249
2013-01-05  1.633764  0.324234       NaN  1.163615
2013-01-06  0.282402       NaN       NaN  0.505433

df2 = df.copy()
df2["E"] = ['one', 'one', 'two', 'three', 'four', 'three']
print(df2)
print()
print(df2[df2["E"].isin(["two", "four"])])

                   A         B         C         D      E
2013-01-01 -0.818896 -0.409184  0.087447 -0.935887    one
2013-01-02  0.681543 -1.390976  2.013105  0.644468    one
2013-01-03  1.017911  0.033224 -0.103912  0.634459    two
2013-01-04 -0.450437  0.501915  1.003776  0.691249  three
2013-01-05  1.633764  0.324234 -1.707570  1.163615   four
2013-01-06  0.282402 -0.922663 -1.641314  0.505433  three

                   A         B         C         D     E
2013-01-03  1.017911  0.033224 -0.103912  0.634459   two
2013-01-05  1.633764  0.324234 -1.707570  1.163615  four

Setting

s1 = pd.Series([1,2,3,4,5,6], index = pd.date_range("20130102",periods=6))
print(s1)
print()
df["F"] = s1
print(df)

2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64

                   A         B         C         D    F
2013-01-01 -0.818896 -0.409184  0.087447 -0.935887  NaN
2013-01-02  0.681543 -1.390976  2.013105  0.644468  1.0
2013-01-03  1.017911  0.033224 -0.103912  0.634459  2.0
2013-01-04 -0.450437  0.501915  1.003776  0.691249  3.0
2013-01-05  1.633764  0.324234 -1.707570  1.163615  4.0
2013-01-06  0.282402 -0.922663 -1.641314  0.505433  5.0

df.at[dates[0], "A"] = 0 # setting values by label
df.iat[0,1] = 0 # setting values by position
df.loc[:, "D"] = np.array([5] * len(df)) # setting by assigning with a NumPy array
print(df)

                   A         B         C  D    F
2013-01-01  0.000000  0.000000  0.087447  5  NaN
2013-01-02  0.681543 -1.390976  2.013105  5  1.0
2013-01-03  1.017911  0.033224 -0.103912  5  2.0
2013-01-04 -0.450437  0.501915  1.003776  5  3.0
2013-01-05  1.633764  0.324234 -1.707570  5  4.0
2013-01-06  0.282402 -0.922663 -1.641314  5  5.0

1
2
3

df2 = df.copy()
df2[df2>0] = -df2 # setting values by boolean
print(df2)

                   A         B         C  D    F
2013-01-01  0.000000  0.000000 -0.087447 -5  NaN
2013-01-02 -0.681543 -1.390976 -2.013105 -5 -1.0
2013-01-03 -1.017911 -0.033224 -0.103912 -5 -2.0
2013-01-04 -0.450437 -0.501915 -1.003776 -5 -3.0
2013-01-05 -1.633764 -0.324234 -1.707570 -5 -4.0
2013-01-06 -0.282402 -0.922663 -1.641314 -5 -5.0

Missing Data

1
2
3

df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])
df1.loc[dates[0] : dates[1], "E"] = 1
print(df1)

                   A         B         C  D    F    E
2013-01-01  0.000000  0.000000  0.087447  5  NaN  1.0
2013-01-02  0.681543 -1.390976  2.013105  5  1.0  1.0
2013-01-03  1.017911  0.033224 -0.103912  5  2.0  NaN
2013-01-04 -0.450437  0.501915  1.003776  5  3.0  NaN

1
2
3

print(df1.dropna(how="any")) # how="any" (default) : where any NA values are present
print()
print(df1.dropna(how="all")) # how="all" : where all values are NA

                   A         B         C  D    F    E
2013-01-02  0.681543 -1.390976  2.013105  5  1.0  1.0

                   A         B         C  D    F    E
2013-01-01  0.000000  0.000000  0.087447  5  NaN  1.0
2013-01-02  0.681543 -1.390976  2.013105  5  1.0  1.0
2013-01-03  1.017911  0.033224 -0.103912  5  2.0  NaN
2013-01-04 -0.450437  0.501915  1.003776  5  3.0  NaN

1
2
3

print(pd.isna(df1))
print()
print(df1.fillna(value=5))

                A      B      C      D      F      E
2013-01-01  False  False  False  False   True  False
2013-01-02  False  False  False  False  False  False
2013-01-03  False  False  False  False  False   True
2013-01-04  False  False  False  False  False   True

                   A         B         C  D    F    E
2013-01-01  0.000000  0.000000  0.087447  5  5.0  1.0
2013-01-02  0.681543 -1.390976  2.013105  5  1.0  1.0
2013-01-03  1.017911  0.033224 -0.103912  5  2.0  5.0
2013-01-04 -0.450437  0.501915  1.003776  5  3.0  5.0

Operations

Stats

print("axis 0 :")
print(df.mean())
print()
print("axis 1 :")
print(df.mean(1))

axis 0 :
A    0.527531
B   -0.242378
C   -0.058078
D    5.000000
F    3.000000
dtype: float64

axis 1 :
2013-01-01    1.271862
2013-01-02    1.460734
2013-01-03    1.589444
2013-01-04    1.811051
2013-01-05    1.850086
2013-01-06    1.543685
Freq: D, dtype: float64

print(df)
print()
s = pd.Series([1, 3, 4, np.nan, 6, 8], index=dates).shift(2)
print(s)
print()
print(df.sub(s, axis=0)) # equivalent to (dataframe - other)

                   A         B         C  D    F
2013-01-01  0.000000  0.000000  0.087447  5  NaN
2013-01-02  0.681543 -1.390976  2.013105  5  1.0
2013-01-03  1.017911  0.033224 -0.103912  5  2.0
2013-01-04 -0.450437  0.501915  1.003776  5  3.0
2013-01-05  1.633764  0.324234 -1.707570  5  4.0
2013-01-06  0.282402 -0.922663 -1.641314  5  5.0

2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    4.0
2013-01-06    NaN
Freq: D, dtype: float64

                   A         B         C    D    F
2013-01-01       NaN       NaN       NaN  NaN  NaN
2013-01-02       NaN       NaN       NaN  NaN  NaN
2013-01-03  0.017911 -0.966776 -1.103912  4.0  1.0
2013-01-04 -3.450437 -2.498085 -1.996224  2.0  0.0
2013-01-05 -2.366236 -3.675766 -5.707570  1.0  0.0
2013-01-06       NaN       NaN       NaN  NaN  NaN

Apply

1	print(df.apply(np.cumsum))

                   A         B         C   D     F
2013-01-01  0.000000  0.000000  0.087447   5   NaN
2013-01-02  0.681543 -1.390976  2.100552  10   1.0
2013-01-03  1.699454 -1.357753  1.996640  15   3.0
2013-01-04  1.249017 -0.855837  3.000416  20   6.0
2013-01-05  2.882781 -0.531603  1.292846  25  10.0
2013-01-06  3.165183 -1.454266 -0.348468  30  15.0

1	print(df.apply(lambda x: x.max() - x.min()))

A    2.084201
B    1.892892
C    3.720675
D    0.000000
F    4.000000
dtype: float64

Histogramming

s = pd.Series(np.random.randint(0, 7, size=10))
print(s)
print()
print(s.value_counts())

0    1
1    6
2    5
3    1
4    1
5    3
6    2
7    1
8    1
9    6
dtype: int64

1    5
6    2
5    1
3    1
2    1
dtype: int64

String Methods

s = pd.Series(["A","Aaba", np.nan, "CABA", "cat"])
print(s.str.lower())
print()
print(s.str.upper())

0       a
1    aaba
2     NaN
3    caba
4     cat
dtype: object

0       A
1    AABA
2     NaN
3    CABA
4     CAT
dtype: object

Merge

Concat

combining together Series and DataFrame objects

1 2	df = pd.DataFrame(np.random.randn(10,4)) print(df)

          0         1         2         3
0 -0.146700  1.278642  1.837775  0.644873
1 -0.332040  1.597295 -0.681229 -1.238212
2  0.751823  1.117058 -0.453366  0.953989
3  0.074173 -1.043050  0.276312  0.926186
4 -0.090674  1.679349  2.130480  0.950658
5  0.483778 -0.330530  0.370747  0.569736
6 -0.603331  2.363939 -0.052191  0.186119
7 -2.002784 -1.237193 -1.876920 -0.876104
8  1.121755 -0.104830 -1.675228  1.250540
9  0.008456 -1.287063  0.070528 -0.642563

pieces = [df[:3], df[3:7], df[7:]]
print(pieces)
print()
print(pd.concat(pieces)) # Concatenating objects together

[          0         1         2         3
0 -0.146700  1.278642  1.837775  0.644873
1 -0.332040  1.597295 -0.681229 -1.238212
2  0.751823  1.117058 -0.453366  0.953989,           0         1         2         3
3  0.074173 -1.043050  0.276312  0.926186
4 -0.090674  1.679349  2.130480  0.950658
5  0.483778 -0.330530  0.370747  0.569736
6 -0.603331  2.363939 -0.052191  0.186119,           0         1         2         3
7 -2.002784 -1.237193 -1.876920 -0.876104
8  1.121755 -0.104830 -1.675228  1.250540
9  0.008456 -1.287063  0.070528 -0.642563]

          0         1         2         3
0 -0.146700  1.278642  1.837775  0.644873
1 -0.332040  1.597295 -0.681229 -1.238212
2  0.751823  1.117058 -0.453366  0.953989
3  0.074173 -1.043050  0.276312  0.926186
4 -0.090674  1.679349  2.130480  0.950658
5  0.483778 -0.330530  0.370747  0.569736
6 -0.603331  2.363939 -0.052191  0.186119
7 -2.002784 -1.237193 -1.876920 -0.876104
8  1.121755 -0.104830 -1.675228  1.250540
9  0.008456 -1.287063  0.070528 -0.642563

Join

SQL style merging

left = pd.DataFrame({'key':['foo', 'foo'], 'lval':[1, 2]})
right = pd.DataFrame({'key':['foo', 'foo'], 'rval':[4, 5]})
print(left)
print()
print(right)
print()
print(pd.merge(left, right, on='key'))

   key  lval
0  foo     1
1  foo     2

   key  rval
0  foo     4
1  foo     5

   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5

left = pd.DataFrame({'key':['foo', 'bar'], 'lval':[1, 2]})
right = pd.DataFrame({'key':['foo', 'bar'], 'rval':[4, 5]})
print(left)
print()
print(right)
print()
print(pd.merge(left, right, on='key'))

   key  lval
0  foo     1
1  bar     2

   key  rval
0  foo     4
1  bar     5

   key  lval  rval
0  foo     1     4
1  bar     2     5

Append

1 2	df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D']) print(df)

          A         B         C         D
0  0.120201 -0.286267 -0.205740  1.107503
1 -0.818479 -0.059564  0.192151  1.920259
2  0.687467 -0.833330 -0.426913  0.852926
3 -1.103335  0.213012  1.003000 -0.648385
4 -0.576006  1.414470  0.851227 -0.277232
5  0.189853  0.311170 -0.448372 -0.201059
6  1.707547  0.099129 -1.332999  0.478148
7  1.153861  1.831233  0.125928 -1.330353

1 2	s = df.iloc[3] print(df.append(s, ignore_index=True))

          A         B         C         D
0  0.120201 -0.286267 -0.205740  1.107503
1 -0.818479 -0.059564  0.192151  1.920259
2  0.687467 -0.833330 -0.426913  0.852926
3 -1.103335  0.213012  1.003000 -0.648385
4 -0.576006  1.414470  0.851227 -0.277232
5  0.189853  0.311170 -0.448372 -0.201059
6  1.707547  0.099129 -1.332999  0.478148
7  1.153861  1.831233  0.125928 -1.330353
8 -1.103335  0.213012  1.003000 -0.648385

Grouping

Splitting the data into groups based on some criteria
Applying a function to each group independently
Combining the results into a data structure

df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)
print(df)

     A      B         C         D
0  foo    one  0.031321 -0.238988
1  bar    one -0.626332 -0.445851
2  foo    two -1.448981  1.262838
3  bar  three -0.424664 -0.639157
4  foo    two  1.547849  2.378992
5  bar    two -0.351304 -0.521492
6  foo    one -1.903777  1.998802
7  foo  three -0.613947 -0.422391

1	print(df.groupby("A").sum())

            C         D
A                      
bar -1.402300 -1.606500
foo -2.387536  4.979252

1	print(df.groupby(["A","B"]).sum())

                  C         D
A   B                        
bar one   -0.626332 -0.445851
    three -0.424664 -0.639157
    two   -0.351304 -0.521492
foo one   -1.872457  1.759814
    three -0.613947 -0.422391
    two    0.098868  3.641830

df_min = df.groupby(["A","B"])["C"].min()
df_max = df.groupby(["A","B"])["C"].max()

print(pd.merge(df_min, df_max, on=['A',"B"], suffixes=('_min', '_max')))

              C_min     C_max
A   B                        
bar one   -0.626332 -0.626332
    three -0.424664 -0.424664
    two   -0.351304 -0.351304
foo one   -1.903777  0.031321
    three -0.613947 -0.613947
    two   -1.448981  1.547849

Reshaping

Stack

stack() : Compress a level in the DataFrame’s columns:

tuples = list(
    zip(
        *[
            ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
            ["one", "two", "one", "two", "one", "two", "one", "two"],
        ]
    )
)

index = pd.MultiIndex.from_tuples(tuples, names=["idx_1", "idx_2"])
df = pd.DataFrame(np.random.randn(8,2), index=index, columns=["A","B"])
df2 = df[:4]

print(df2)

                    A         B
idx_1 idx_2                    
bar   one   -0.748802  0.560048
      two   -0.214015 -0.658540
baz   one   -1.968829 -0.806776
      two   -1.314742 -0.174498

1 2	stacked = df2.stack() print(stacked) # level: idx_1 > idx_2 > columns

idx_1  idx_2   
bar    one    A   -0.748802
              B    0.560048
       two    A   -0.214015
              B   -0.658540
baz    one    A   -1.968829
              B   -0.806776
       two    A   -1.314742
              B   -0.174498
dtype: float64

unstack() : the inverse operation of stack()

print(stacked.unstack()) # columns(last level) / default -1
print()
print(stacked.unstack(0)) # idx_1 
print()
print(stacked.unstack(1)) # idx_2

                    A         B
idx_1 idx_2                    
bar   one   -0.748802  0.560048
      two   -0.214015 -0.658540
baz   one   -1.968829 -0.806776
      two   -1.314742 -0.174498

idx_1         bar       baz
idx_2                      
one   A -0.748802 -1.968829
      B  0.560048 -0.806776
two   A -0.214015 -1.314742
      B -0.658540 -0.174498

idx_2         one       two
idx_1                      
bar   A -0.748802 -0.214015
      B  0.560048 -0.658540
baz   A -1.968829 -1.314742
      B -0.806776 -0.174498

Pivot Table

df = pd.DataFrame(
    {
        "A": ["one", "one", "two", "three"] * 3,
        "B": ["aa", "bb", "cc"] * 4,
        "C": ["foo", "foo", "foo", "bar", "bar", "bar"] * 2,
        "D": np.random.randn(12),
        "E": np.random.randn(12),
    }
)
print(df)

        A   B    C         D         E
0     one  aa  foo  1.055556 -0.342298
1     one  bb  foo -0.463657 -0.004332
2     two  cc  foo  0.953746  0.690613
3   three  aa  bar -0.980697  0.498251
4     one  bb  bar -0.352120 -0.503475
5     one  cc  bar  0.298470 -1.316212
6     two  aa  foo  0.580929  0.483970
7   three  bb  foo  0.391527  0.200354
8     one  cc  foo -1.314898 -1.183403
9     one  aa  bar -0.058855 -0.004713
10    two  bb  bar -0.253133  1.255313
11  three  cc  bar  0.882602  0.561369

1	print(pd.pivot_table(df, values="D", index=["A","B"], columns="C"))

C              bar       foo
A     B                     
one   aa -0.058855  1.055556
      bb -0.352120 -0.463657
      cc  0.298470 -1.314898
three aa -0.980697       NaN
      bb       NaN  0.391527
      cc  0.882602       NaN
two   aa       NaN  0.580929
      bb -0.253133       NaN
      cc       NaN  0.953746

Time Series

rng = pd.date_range("1/1/2022", periods=100, freq="S") # secondly frequency
ts = pd.Series(np.random.randint(0,500,len(rng)), index=rng)
print(ts.resample("5Min").first())
print(ts.resample("5Min").last())
print(ts.resample("5Min").sum())

2022-01-01    192
Freq: 5T, dtype: int64
2022-01-01    304
Freq: 5T, dtype: int64
2022-01-01    26586
Freq: 5T, dtype: int64

1
2
3

rng = pd.date_range("3/6/2021 00:00", periods=5, freq="D") # daily frequency
ts = pd.Series(np.random.randn(len(rng)), rng)
print(ts)

2021-03-06   -1.465806
2021-03-07    0.938092
2021-03-08    2.811226
2021-03-09   -0.186533
2021-03-10   -1.048929
Freq: D, dtype: float64

ts_utc = ts.tz_localize("UTC") # UTC timezone
print("UTC :")
print(ts_utc)
print("\nUS/Eastern :")
print(ts_utc.tz_convert("US/Eastern")) # US/Eastern timezone

UTC :
2021-03-06 00:00:00+00:00   -1.465806
2021-03-07 00:00:00+00:00    0.938092
2021-03-08 00:00:00+00:00    2.811226
2021-03-09 00:00:00+00:00   -0.186533
2021-03-10 00:00:00+00:00   -1.048929
Freq: D, dtype: float64

US/Eastern :
2021-03-05 19:00:00-05:00   -1.465806
2021-03-06 19:00:00-05:00    0.938092
2021-03-07 19:00:00-05:00    2.811226
2021-03-08 19:00:00-05:00   -0.186533
2021-03-09 19:00:00-05:00   -1.048929
Freq: D, dtype: float64

rng = pd.date_range("1/1/2022", periods=5, freq="M")
ts = pd.Series(np.random.randn(len(rng)), index=rng)
print(ts) # DatetimeIndex
print()
ps = ts.to_period()
print(ps) # PeriodIndex
print()
print(ps.to_timestamp())

2022-01-31    0.438135
2022-02-28   -0.052516
2022-03-31   -1.491175
2022-04-30    0.708521
2022-05-31   -1.794057
Freq: M, dtype: float64
DatetimeIndex(['2022-01-31', '2022-02-28', '2022-03-31', '2022-04-30',
               '2022-05-31'],
              dtype='datetime64[ns]', freq='M')

2022-01    0.438135
2022-02   -0.052516
2022-03   -1.491175
2022-04    0.708521
2022-05   -1.794057
Freq: M, dtype: float64
PeriodIndex(['2022-01', '2022-02', '2022-03', '2022-04', '2022-05'], dtype='period[M]')

2022-01-01    0.438135
2022-02-01   -0.052516
2022-03-01   -1.491175
2022-04-01    0.708521
2022-05-01   -1.794057
Freq: MS, dtype: float64

prng = pd.period_range("1990Q1", "2000Q4", freq="Q-NOV") # quarterly frequency
ts = pd.Series(np.random.randn(len(prng)), prng)
print(ts.head())
print()

# month end & hour start
ts.index = (prng.asfreq("M", "e") + 1).asfreq("H", "s") + 9
print(ts.head())

1990Q1    1.479314
1990Q2    0.799696
1990Q3    0.148978
1990Q4   -0.571086
1991Q1    1.908509
Freq: Q-NOV, dtype: float64

1990-03-01 09:00    1.479314
1990-06-01 09:00    0.799696
1990-09-01 09:00    0.148978
1990-12-01 09:00   -0.571086
1991-03-01 09:00    1.908509
Freq: H, dtype: float64

Categoricals

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                   "raw_grade": ["a", "b", "b", "a", "a", "e"]}
)
print(df)

   id raw_grade
0   1         a
1   2         b
2   3         b
3   4         a
4   5         a
5   6         e

df["grade"] = df["raw_grade"].astype("category")
df["grade"].cat.categories = ["very good", "good", "very bad"] # rename categories
df["grade"] = df["grade"].cat.set_categories(
    ["very bad", "bad","medium","good","very good"]
)
df["grade"]

0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): ['very bad', 'bad', 'medium', 'good', 'very good']

1	print(df.sort_values(by="grade").reset_index(drop=True))

   id raw_grade      grade
0   6         e   very bad
1   2         b       good
2   3         b       good
3   1         a  very good
4   4         a  very good
5   5         a  very good

1	print(df.groupby("grade").size())

grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64

Plotting

import matplotlib.pyplot as plt

ts = pd.Series(np.random.randn(1000), index=pd.date_range("1/1/2010", periods=1000))
ts = ts.cumsum()
ts.plot()
plt.show()

df = pd.DataFrame(np.random.randn(1000,4), index=ts.index, columns=["A","B","C","D"])
df = df.cumsum()
df.plot()
plt.legend(loc='best')
plt.show()

Getting data in/out

CSV

1	df.to_csv("foo.csv")

1	print(pd.read_csv("foo.csv"))

     Unnamed: 0         A          B          C          D
0    2010-01-01  0.015846   0.025822   0.443290  -0.724835
1    2010-01-02 -0.084776   0.330946  -0.669265  -0.949676
2    2010-01-03  1.325583  -0.697926  -0.204375  -1.145391
3    2010-01-04  2.293831  -2.097458  -1.260249  -0.555542
4    2010-01-05  1.334001  -2.392123  -2.659330  -0.325232
..          ...       ...        ...        ...        ...
995  2012-09-22 -7.868011 -59.520981 -49.220218 -54.072522
996  2012-09-23 -7.288884 -59.584693 -49.307716 -54.783148
997  2012-09-24 -6.835061 -59.508180 -49.608991 -55.780242
998  2012-09-25 -6.863802 -58.711503 -50.458917 -55.218363
999  2012-09-26 -7.075728 -59.524064 -49.924592 -55.653215

[1000 rows x 5 columns]

HDF5

1	df.to_hdf("foo.h5", "df")

1	print(pd.read_hdf("foo.h5", "df"))

                   A          B          C          D
2010-01-01  0.015846   0.025822   0.443290  -0.724835
2010-01-02 -0.084776   0.330946  -0.669265  -0.949676
2010-01-03  1.325583  -0.697926  -0.204375  -1.145391
2010-01-04  2.293831  -2.097458  -1.260249  -0.555542
2010-01-05  1.334001  -2.392123  -2.659330  -0.325232
...              ...        ...        ...        ...
2012-09-22 -7.868011 -59.520981 -49.220218 -54.072522
2012-09-23 -7.288884 -59.584693 -49.307716 -54.783148
2012-09-24 -6.835061 -59.508180 -49.608991 -55.780242
2012-09-25 -6.863802 -58.711503 -50.458917 -55.218363
2012-09-26 -7.075728 -59.524064 -49.924592 -55.653215

[1000 rows x 4 columns]

Excel

1	df.to_excel("foo.xlsx", sheet_name="Sheet1")

1	print(pd.read_excel("foo.xlsx", "Sheet1", index_col=None, na_values=["NA"]))

    Unnamed: 0         A          B          C          D
0   2010-01-01  0.015846   0.025822   0.443290  -0.724835
1   2010-01-02 -0.084776   0.330946  -0.669265  -0.949676
2   2010-01-03  1.325583  -0.697926  -0.204375  -1.145391
3   2010-01-04  2.293831  -2.097458  -1.260249  -0.555542
4   2010-01-05  1.334001  -2.392123  -2.659330  -0.325232
..         ...       ...        ...        ...        ...
995 2012-09-22 -7.868011 -59.520981 -49.220218 -54.072522
996 2012-09-23 -7.288884 -59.584693 -49.307716 -54.783148
997 2012-09-24 -6.835061 -59.508180 -49.608991 -55.780242
998 2012-09-25 -6.863802 -58.711503 -50.458917 -55.218363
999 2012-09-26 -7.075728 -59.524064 -49.924592 -55.653215

[1000 rows x 5 columns]

pythontutorial

Import Library

Object Creation

Creating Series

Creation DataFrame

Viewing Data

Basic Structure

DataFrame to NumPy array

Describe, Transpose, Sort

Selecting Data

Getting

Selection by Label

Selection by Position

Boolean Indexing

Setting

Missing Data

Operations

Stats

Apply

Histogramming

String Methods

Merge

Concat

Join

Append

Grouping

Reshaping

Stack

Pivot Table

Time Series

Categoricals

Plotting

Getting data in/out

CSV

HDF5

Excel