
Advanced Data Aggregation Techniques in Pandas
Learn about Pandas aggregation methods such as sum, mean, min, max, and std. Discover simple Series and DataFrame aggregations, as well as the describe method. Explore built-in aggregation functions and the groupby operation for conditional aggregations based on index labels. Understand the split, apply, and combine steps in Pandas data aggregation process.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
Pandas Aggregation Aggregation refers to getting summary information about a data set Examples: sum, mean, min, max, std, Pandas contains built-in methods for doing aggregations Professor John Carelli Kutztown University Computer Science Department
Simple Series Aggregations A Series aggregation works in a similar fashion to NumPy The result is a simple numerical value Examples: >>> s = pd.Series(range(10)) >>> s.sum() 45 >>> s.mean() 4.5 Professor John Carelli Kutztown University Computer Science Department
DataFrame Aggregations By default, DataFrame Aggregations occur on Columns but can be made to operate on rows as well Professor John Carelli Kutztown University Computer Science Department
Pandas describe The describe method provides the results of a collection of aggregations in one command count, mean, std, min, 25%, 50%, 75%, max Professor John Carelli Kutztown University Computer Science Department
Built-in Aggregations Aggregation Description count total number of items first, last first and last item mean, median mean and median min, max minimum and maximum standard deviation and variance std, var mad mean absolute deviation prod product of all items sum sum of all items Professor John Carelli Kutztown University Computer Science Department
groupby The groupby operation provides a mechanism for doing conditional aggregations based on an index label The name is derived from a similar group by operation in SQL groupby can be though of as the combination of three distinct steps: split, apply, & combine Professor John Carelli Kutztown University Computer Science Department
split, apply, & combine The split step involves breaking up and grouping a DataFrame depending on the value of the specified key. The apply step involves computing some function, usually an aggregate, transformation, or filtering, within the individual groups. The combine step merges the results of these operations into an output array. Professor John Carelli Kutztown University Computer Science Department
groupby example df = pd.DataFrame( {'key': ['A', 'B', 'C', 'A', 'B', 'C'], 'data': range(1,7)}, columns=['key', 'data ] ) df.groupby( key ).sum() See PandasAggregation.ipynb (also contains examples for remaining slides) from Python Data Science Handbook (VanderPlas) Professor John Carelli Kutztown University Computer Science Department
DataFrameGroupBy object groupby returns a DataFrameGroupBy object a special abstract view into the DataFrame object grouping is based on supplied index aggregations, etc. can be performed on this object Methods supported by groupby: aggregate() filter() transform() apply() Professor John Carelli Kutztown University Computer Science Department
aggregate aggregate() method can be used to perform multiple aggregations at once supply the aggregations in a list, string, or function df.groupby('key').aggregate(['min', np.median, max]) can also supply a dictionary to map operations to specific columns df.groupby('key').aggregate({'data1': 'min', 'data2': 'max }) Professor John Carelli Kutztown University Computer Science Department
filter filter() method can be supplied a function that returns a Boolean value the filter function is applied to the group def filter_func(x): return x['data1 ].sum() > 5 df.groupby('key').filter(filter_func) Professor John Carelli Kutztown University Computer Science Department
transform transform() returns a transformed version of the full data # recenter data by subtracting group mean df.groupby('key').transform(lambda x: x - x.mean()) Note: key is removed from the output Professor John Carelli Kutztown University Computer Science Department
apply apply() applies an arbitrary function to the group # example normalize column 1 by sum of column 2 def norm_by_data2(x): x['data1'] /= x['data2'].sum() return x df.groupby('key').apply(norm_by_data2) Professor John Carelli Kutztown University Computer Science Department
key splitting More than one key can be used to group data The key can be: Any series or list that matches the length of the DataFrame A dictionary that maps index values to a group key A Python function A combination of the above used to group on a multi-index Professor John Carelli Kutztown University Computer Science Department
Pivoting >>> df = pd.DataFrame({"foo": ['one', 'one', 'two', 'two'], ... "bar": ['A', 'B', 'A', 'B'], ... "baz": [1, 2, 3, 4]}) >>> df foo bar baz 0 one A 1 1 one B 2 2 two A 3 3 two B 4 Another mechanism for grouping data Specify columns to use for index, columns, and values >>> df.pivot(index='foo', columns='bar', values='baz') bar A B foo one 1 2 two 3 4 Professor John Carelli Kutztown University Computer Science Department
pivot index error >>> df1 = pd.DataFrame({"foo": ['one', 'one', 'two', 'two'], ... "bar": ['A', 'B', 'C', 'C'], ... "baz": [1, 2, 3, 4]}) >>> df1 foo bar baz 0 one A 1 1 one B 2 2 two C 3 3 two C 4 Grouping won't work if resulting index values are not unique Can't tell which value to put at the derived index >>> df1.pivot(index='foo', columns='bar', values='baz') ValueError: Index contains duplicate entries, cannot reshape Professor John Carelli Kutztown University Computer Science Department
pivot_table >>> df1 foo bar baz 0 one A 1 1 one B 2 2 two C 3 3 two C 4 Does both pivoting and an aggregation aggregation function can be specified default is mean >>> df1.pivot_table(index='foo', columns='bar', values='baz') bar A B C foo one 1.0 2.0 NaN two NaN NaN 3.5 >>> df1.pivot_table(index='foo', columns='bar', values='baz', aggfunc=max) bar A B C foo one 1.0 2.0 NaN two NaN NaN 4.0 Professor John Carelli Kutztown University Computer Science Department