A friend once asked me: which function in Pandas is similar to R's summarize (as in dplyr)? Surprisingly, I was not able to give a straight answer. After some digging, though, I finally found a (somewhat) satisfactory one.

First, let's look at how summarize works in dplyr (the example is borrowed from RStudio). Other functions aside, and focusing on the "summarise" call, one can easily specify the aliases "arr" and "dep", the aggregation function "mean", the columns to work on, "arr_delay" and "dep_delay", and even extra options such as "na.rm". This is very powerful.

For the Pandas alternative, let's focus only on the "summarise" part, with the help of the Titanic data set. The functionality looks similar, but what about computing not only "mean" but also "std" and "max" over the same column, and giving the results aliases like dplyr's "arr" and "dep"? Then we have to change the code.

But wait, what are these multiple levels of columns? This is a Pandas concept called "MultiIndex". Personally, I find a MultiIndex over columns hard to manipulate, so I prefer to drop it after the aggregation. The way to do this is to store the result (say, as "df1") and then replace its columns.

However, is there a way to do EVERYTHING in one line? I don't like defining a "df1" and then changing its columns. Here is a trick: specify the columns right after the "groupby", and magic will happen :)

Mission completed :)
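To make the Pandas steps above concrete, here is a minimal sketch. It is not the original code from this post: the tiny data frame is a made-up stand-in for the Titanic data (only the column names "Pclass", "Age" and "Fare" follow the usual data set), the alias names are my own, and the last step uses named aggregation, which assumes pandas 0.25 or newer.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data set: a few made-up rows,
# NOT the real data. Only the column names follow the usual convention.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age": [38.0, 35.0, 27.0, None, 22.0, 26.0],
    "Fare": [71.0, 53.0, 13.0, 21.0, 7.0, 8.0],
})

# Step 1: one statistic per column, close to a simple summarise().
# mean() skips NaN by default, much like na.rm = TRUE in R.
simple = df.groupby("Pclass")[["Age", "Fare"]].mean()

# Step 2: several statistics per column produce MultiIndex columns,
# e.g. ("Age", "mean"), ("Age", "std"), ("Age", "max").
multi = df.groupby("Pclass").agg({"Age": ["mean", "std", "max"], "Fare": ["mean"]})

# Step 3: drop the MultiIndex by joining both levels into one flat name.
df1 = multi.copy()
df1.columns = ["_".join(col) for col in df1.columns]

# Step 4: select the column right after groupby; the columns come out flat.
one_col = df.groupby("Pclass")["Age"].agg(["mean", "std", "max"])

# Step 5 (assumes pandas >= 0.25): named aggregation sets the aliases,
# columns and functions in a single call, similar to dplyr's arr = mean(...).
named = df.groupby("Pclass").agg(
    age_mean=("Age", "mean"),
    age_max=("Age", "max"),
    fare_mean=("Fare", "mean"),
)
print(named)
```

Selecting the column after the groupby (step 4) works on a single column at a time, while named aggregation (step 5) can mix several columns and aliases, so it is probably the closest match to "summarise" in current pandas.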
Author: Data Magician
Archives: October 2017