Magic Analytics

Aries Research Note

Pandas: group-by-aggregation deep dive

10/2/2016


 
A friend once asked me: which Pandas function is similar to R's summarize (as in dplyr)? Surprisingly, I was not able to give a straight answer. After some digging, however, I finally found a (somewhat) satisfactory one.

First, let's look at how summarize in dplyr works (the code is borrowed from RStudio):

    
Setting the other functions aside and focusing on "summarise", one can specify in a single call the output aliases ("arr" and "dep"), the aggregation function ("mean"), the columns it operates on ("arr_delay" and "dep_delay"), and even extra options such as "na.rm". This is very powerful.

For the Pandas alternative, let's focus only on the "summarise" part, using the Titanic data set:
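A minimal sketch of this step could look like the following. It assumes a small hand-made stand-in for the Titanic data (the rows are illustrative, not real records), and the column choices "Pclass", "Age" and "Fare" are my assumption about what one would summarise:

```python
import pandas as pd

# Tiny hand-made stand-in for the Titanic data (illustrative values only).
titanic = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age": [38.0, 35.0, 14.0, None, 22.0, 26.0],
    "Fare": [71.28, 53.10, 30.07, 13.00, 7.25, 7.93],
})

# Group by passenger class and take the mean of two columns.
# Similar to na.rm = TRUE in dplyr, mean() skips missing values by default.
summary = titanic.groupby("Pclass")[["Age", "Fare"]].mean()
print(summary)
```

The result keeps the original column names ("Age", "Fare"), so there is no alias yet.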

    
The functionality looks similar, but what if we want not only "mean" but also "std" and "max" over the same column, each with its own alias like dplyr's "arr" and "dep"? Then we have to change the code to:
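As a hedged sketch of what several aggregations per column look like, the snippet below passes a dictionary of lists to "agg", again on a hand-made stand-in for the Titanic data (illustrative rows, assumed column names). Note that older Pandas releases also accepted a nested dictionary in "agg" to rename the outputs; that form was deprecated and later removed, so it is not used here:

```python
import pandas as pd

# Tiny hand-made stand-in for the Titanic data (illustrative values only).
titanic = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age": [38.0, 35.0, 14.0, None, 22.0, 26.0],
    "Fare": [71.28, 53.10, 30.07, 13.00, 7.25, 7.93],
})

# Several aggregations per column: the result gets two levels of column
# labels, (source column, aggregation name).
summary = titanic.groupby("Pclass").agg(
    {"Age": ["mean", "std", "max"], "Fare": ["mean", "max"]}
)
print(summary.columns.tolist())
```

Each output column is now addressed by a pair such as ("Age", "max") rather than by a single name.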

    
But what are these multiple levels of columns? This is a Pandas concept called "MultiIndex". Personally, I find a MultiIndex on columns hard to manipulate, so I prefer to drop it after the aggregation. The way to do this is:
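Two common ways to flatten the column MultiIndex are sketched below, using the same kind of hand-made stand-in data (illustrative rows, assumed column names). The names "df1" and "df2" and the underscore-joined labels are my own choices for illustration:

```python
import pandas as pd

# Tiny hand-made stand-in for the Titanic data (illustrative values only).
titanic = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age": [38.0, 35.0, 14.0, None, 22.0, 26.0],
    "Fare": [71.28, 53.10, 30.07, 13.00, 7.25, 7.93],
})

# Option 1: drop the top level and keep only "mean", "std", "max".
# This is safe here because only one source column was aggregated.
df1 = titanic.groupby("Pclass").agg({"Age": ["mean", "std", "max"]})
df1.columns = df1.columns.droplevel(0)

# Option 2: join both levels into one name, which avoids duplicate labels
# when several source columns share an aggregation such as "mean".
df2 = titanic.groupby("Pclass").agg({"Age": ["mean", "max"], "Fare": ["mean", "max"]})
df2.columns = ["_".join(col) for col in df2.columns]

print(df1.columns.tolist())
print(df2.columns.tolist())
```

Dropping a level is the shorter route, but joining the levels is the safer one as soon as more than one column is aggregated.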

    
However, is there a way to do EVERYTHING in one line? I don't like defining a "df1" and then changing its columns. Here is a trick: specify the columns right after the "groupby", and the magic will happen :)
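In current Pandas (0.25 and later), one way to get this one-line behaviour is named aggregation, sketched below. This is an assumption about a modern equivalent rather than the exact call used at the time of writing, since the dictionary-based renaming available in 2016 has since been removed. The data is again a hand-made stand-in with illustrative rows, and the aliases are hypothetical:

```python
import pandas as pd

# Tiny hand-made stand-in for the Titanic data (illustrative values only).
titanic = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age": [38.0, 35.0, 14.0, None, 22.0, 26.0],
    "Fare": [71.28, 53.10, 30.07, 13.00, 7.25, 7.93],
})

# Select the column right after groupby, then give each aggregation an alias.
one_column = titanic.groupby("Pclass")["Age"].agg(
    age_mean="mean", age_std="std", age_max="max"
)

# The same idea across several columns: alias=(source column, aggregation).
many_columns = titanic.groupby("Pclass").agg(
    age_mean=("Age", "mean"), fare_max=("Fare", "max")
)

print(one_column)
print(many_columns)
```

Both results come back with flat, aliased column names, which is the closest match to dplyr's `arr = mean(...)` style that I know of.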

    
Now mission completed :)

    Author

    Data Magician
