includes NaN values; the resulting column is named 0 by default, so rename it with .reset_index(name=<new_name>)
pd.Series.mode
nums, str
basically ‘mode’; returns most common value(s)
.first()
nums, str
.rank()
“rank”
nums
returns as a series, assign to (new) column
.pct_change()
nums
% change since last entry
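To make the NaN note on .size() concrete, here is a minimal sketch comparing it with .count(). The DataFrame is a small made-up sample shaped like the passengers data, with one value deliberately missing; it is not the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; one passengers value is deliberately missing.
df = pd.DataFrame({
    "year": [1949, 1949, 1949, 1950, 1950],
    "month": ["Jan", "Feb", "Mar", "Jan", "Feb"],
    "passengers": [112, 118, np.nan, 115, 126],
})

# count() skips NaN values in the selected column.
counts = df.groupby("year")["passengers"].count()

# size() counts rows per group, so the row holding NaN is included.
sizes = df.groupby("year").size()

# Without a name, reset_index() labels the new column 0.
default_cols = df.groupby("year").size().reset_index().columns.tolist()

# Passing name= gives the column a readable label instead.
sizes_df = df.groupby("year").size().reset_index(name="size")

print(counts)
print(sizes_df)
```

If the two results differ for a group, that group has missing values in the counted column.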
df.groupby("year")["passengers"].mean()
df.groupby("year")["passengers"].min()
df.groupby("year")["passengers"].max()
df.groupby("year").count()
df.groupby("year").size().reset_index(name="size")
# mode can only be used in .agg
df.groupby("year").first()
     month  passengers
year
1949   Jan         112
1950   Jan         115
1951   Jan         145
1952   Jan         171
1953   Jan         196
1954   Jan         204
1955   Jan         242
1956   Jan         284
1957   Jan         315
1958   Jan         340
1959   Jan         360
1960   Jan         417
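Since mode has to go through .agg, a short sketch of that call may help. The data is a made-up sample with exactly one most common value per year. How ties are handled can differ between pandas versions, so the lambda variant, which keeps only the first mode, is included as a more predictable alternative.

```python
import pandas as pd

# Hypothetical sample data with a single most common value per year.
df = pd.DataFrame({
    "year": [1949, 1949, 1949, 1950, 1950, 1950],
    "passengers": [112, 118, 118, 115, 126, 126],
})

# Pass the function itself to .agg; there is no .mode() method on the groupby.
modes = df.groupby("year")["passengers"].agg(pd.Series.mode)

# Alternative: always keep only the first mode, which sidesteps ties.
first_modes = df.groupby("year")["passengers"].agg(lambda s: s.mode().iloc[0])

print(modes)
print(first_modes)
```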
.rank() and .pct_change() both return a series with one value per original row (not one per group), so assign the result to a new column.
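A minimal sketch of assigning both results to new columns follows. The sample values and the column names rank_in_year and pct_change_in_year are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data shaped like the passengers table.
df = pd.DataFrame({
    "year": [1949, 1949, 1949, 1950, 1950, 1950],
    "month": ["Jan", "Feb", "Mar", "Jan", "Feb", "Mar"],
    "passengers": [112, 118, 132, 115, 126, 141],
})

grouped = df.groupby("year")["passengers"]

# rank() ranks each value within its own year (1.0 = smallest by default).
df["rank_in_year"] = grouped.rank()

# pct_change() compares each row with the previous row of the same year,
# so the first row of every year is NaN.
df["pct_change_in_year"] = grouped.pct_change()

print(df)
```

Because both results share the index of the original DataFrame, plain column assignment lines them up without a merge.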
Also, the best function to start with is .describe(), because it returns a table with MultiIndex columns that holds count, mean, std, min, 25%, 50%, 75%, and max for every numeric column.
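As a rough illustration, the sketch below calls .describe() on a small made-up sample and prints the resulting columns. The name multi is assumed here to be the result of that call.

```python
import pandas as pd

# Hypothetical sample data; only the numeric column is described.
df = pd.DataFrame({
    "year": [1949, 1949, 1949, 1950, 1950, 1950],
    "passengers": [112, 118, 132, 115, 126, 141],
})

# One row per year, one block of eight statistics per numeric column.
multi = df.groupby("year").describe()

# The columns are a two-level MultiIndex: (original column, statistic).
print(multi.columns)
print(multi)
```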
Notice how the columns seem to be layered, and multi.columns gives a list of tuples instead of the normal list of strings. There are a few ways to get rid of this, including the function .to_flat_index(). But my favorite way is to join the names with an underscore (_).
multi.columns = ["_".join(col) for col in multi.columns.values]
print(multi.columns)
display(multi)
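For comparison, here is a small sketch of the .to_flat_index() route next to the underscore join. It rebuilds multi from a made-up sample, so treat the data as an assumption. The map(str, ...) call is an optional addition that keeps the join working when a level holds non-string labels such as integers.

```python
import pandas as pd

# Hypothetical sample data; multi is assumed to come from describe().
df = pd.DataFrame({
    "year": [1949, 1949, 1949, 1950, 1950, 1950],
    "passengers": [112, 118, 132, 115, 126, 141],
})
multi = df.groupby("year").describe()

# to_flat_index() collapses the MultiIndex into one level,
# but each label is still a tuple.
flat = multi.columns.to_flat_index()

# Joining with an underscore produces plain string labels instead.
joined = ["_".join(map(str, col)) for col in multi.columns]

multi.columns = joined
print(flat)
print(multi.columns)
```

The join is usually the more convenient of the two, because string labels work with ordinary column access such as multi["passengers_mean"].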