Pandas Groupby & Aggregations

Python
Pandas
Code Snippets
Author

Allison Day

Published

March 4, 2023

# IMPORTS
import pandas as pd
import numpy as np
import seaborn as sns  # get dataset

# Create Random DF
df = sns.load_dataset("flights")
df
year month passengers
0 1949 Jan 112
1 1949 Feb 118
2 1949 Mar 132
3 1949 Apr 129
4 1949 May 121
... ... ... ...
139 1960 Aug 606
140 1960 Sep 508
141 1960 Oct 461
142 1960 Nov 390
143 1960 Dec 432

144 rows × 3 columns

My Most Used

Basic Aggregations all computations 🔗link

basic .agg dtypes notes
.mean() “mean”, np.mean nums
.min() “min”, np.min nums
.max() “max”, np.max nums
.count() “count” nums, str doesn’t include nan
.size() “size” nums, str includes nan values; default name is ‘0’, use .reset_index(name=<new_name>)
pd.Series.mode nums, str basically ‘mode’; returns most common value(s)
.first() nums, str
.rank() “rank” nums returns as a series, assign to (new) column
.pct_change() nums % change since last entry
df.groupby("year")["passengers"].mean()
df.groupby("year")["passengers"].min()
df.groupby("year")["passengers"].max()
df.groupby("year").count()
df.groupby("year").size().reset_index(name="size")
# mode can only be used in .agg
df.groupby("year").first()
month passengers
year
1949 Jan 112
1950 Jan 115
1951 Jan 145
1952 Jan 171
1953 Jan 196
1954 Jan 204
1955 Jan 242
1956 Jan 284
1957 Jan 315
1958 Jan 340
1959 Jan 360
1960 Jan 417

.rank() and .pct_change() both return series that need to be made into their own columns.

df["rank"] = df.groupby("year")["passengers"].rank()
df["pct_change"] = df.groupby("year")["passengers"].pct_change()
df
year month passengers rank pct_change
0 1949 Jan 112 2.0 NaN
1 1949 Feb 118 3.5 0.053571
2 1949 Mar 132 8.0 0.118644
3 1949 Apr 129 7.0 -0.022727
4 1949 May 121 6.0 -0.062016
... ... ... ... ... ...
139 1960 Aug 606 11.0 -0.025723
140 1960 Sep 508 9.0 -0.161716
141 1960 Oct 461 6.5 -0.092520
142 1960 Nov 390 1.0 -0.154013
143 1960 Dec 432 5.0 0.107692

144 rows × 5 columns

Also the best function to start with is .describe(), because it returns a multiindex table with the functions: count, mean, std, min, 25%, 50%, 75%, max.

# echo: False
# output: False

df.drop(["rank", "pct_change"], axis=1, inplace=True)
df.groupby("year").describe()
passengers
count mean std min 25% 50% 75% max
year
1949 12.0 126.666667 13.720147 104.0 118.00 125.0 135.25 148.0
1950 12.0 139.666667 19.070841 114.0 125.75 137.5 151.25 170.0
1951 12.0 170.166667 18.438267 145.0 159.00 169.0 179.50 199.0
1952 12.0 197.000000 22.966379 171.0 180.75 192.0 211.25 242.0
1953 12.0 225.000000 28.466887 180.0 199.75 232.0 238.50 272.0
1954 12.0 238.916667 34.924486 188.0 221.25 231.5 260.25 302.0
1955 12.0 284.000000 42.140458 233.0 260.75 272.0 312.75 364.0
1956 12.0 328.250000 47.861780 271.0 300.50 315.0 359.75 413.0
1957 12.0 368.416667 57.890898 301.0 330.75 351.5 408.50 467.0
1958 12.0 381.000000 64.530472 310.0 339.25 360.5 411.75 505.0
1959 12.0 428.333333 69.830097 342.0 387.50 406.5 465.25 559.0
1960 12.0 476.166667 77.737125 390.0 418.50 461.0 514.75 622.0

.agg function

df.groupby("year")["passengers"].agg("mean")
year
1949    126.666667
1950    139.666667
1951    170.166667
1952    197.000000
1953    225.000000
1954    238.916667
1955    284.000000
1956    328.250000
1957    368.416667
1958    381.000000
1959    428.333333
1960    476.166667
Name: passengers, dtype: float64

lets you pass in a list of functions

df.groupby("year")["passengers"].agg(["mean", "max", "rank"])
mean max rank
0 NaN NaN 2.0
1 NaN NaN 3.5
2 NaN NaN 8.0
3 NaN NaN 7.0
4 NaN NaN 6.0
... ... ... ...
1956 328.250000 413.0 NaN
1957 368.416667 467.0 NaN
1958 381.000000 505.0 NaN
1959 428.333333 559.0 NaN
1960 476.166667 622.0 NaN

156 rows × 3 columns

and assign functions to specific columns

df.groupby("year").agg({"passengers": ["mean", "max", "rank"], "month": "first"})
passengers month
mean max rank first
0 NaN NaN 2.0 NaN
1 NaN NaN 3.5 NaN
2 NaN NaN 8.0 NaN
3 NaN NaN 7.0 NaN
4 NaN NaN 6.0 NaN
... ... ... ... ...
1956 328.250000 413.0 NaN Jan
1957 368.416667 467.0 NaN Jan
1958 381.000000 505.0 NaN Jan
1959 428.333333 559.0 NaN Jan
1960 476.166667 622.0 NaN Jan

156 rows × 4 columns

dealing with multiindex headers

this article is the best at explaining 🔗link

# multi index header
multi = df.groupby("year").agg({"passengers": [np.min, np.mean, np.max]})

print(multi.columns)
display(multi)
MultiIndex([('passengers', 'amin'),
            ('passengers', 'mean'),
            ('passengers', 'amax')],
           )
passengers
amin mean amax
year
1949 104 126.666667 148
1950 114 139.666667 170
1951 145 170.166667 199
1952 171 197.000000 242
1953 180 225.000000 272
1954 188 238.916667 302
1955 233 284.000000 364
1956 271 328.250000 413
1957 301 368.416667 467
1958 310 381.000000 505
1959 342 428.333333 559
1960 390 476.166667 622

Notice how they columns seem to be layered, and the multi.columns is giving a list of tuples instead of the normal list of strings. To get rid of this there are a few ways - including the function .to_flat_index(). But my favorite way is to join the names with a underscore (_).

multi.columns = ["_".join(col) for col in multi.columns.values]
print(multi.columns)
display(multi)
Index(['passengers_amin', 'passengers_mean', 'passengers_amax'], dtype='object')
passengers_amin passengers_mean passengers_amax
year
1949 104 126.666667 148
1950 114 139.666667 170
1951 145 170.166667 199
1952 171 197.000000 242
1953 180 225.000000 272
1954 188 238.916667 302
1955 233 284.000000 364
1956 271 328.250000 413
1957 301 368.416667 467
1958 310 381.000000 505
1959 342 428.333333 559
1960 390 476.166667 622