Group-by is used frequently in SQL for aggregation statistics, and to get any big data back into a visualization, a group-by statement is almost essential. In my opinion, none of the above approaches is "perfect". For Pandas, one needs to call reset_index() to get "Survived" back as a normal column; for Spark, the result column name is changed into a descriptive, but very long, one. For Spark we can introduce the alias function on the column to make things much nicer (both fixes are sketched below).

All of the above are "simple" aggregations, i.e. ones that already exist in Pandas or Spark. What about complicated ones, like a weighted average or a sum of squares? The complicated cases can be split into two kinds:
1. aggregation on a single column (like a sum of squares)
2. aggregation across multiple columns (like a weighted average based on another column)

Certainly, before going complicated on the aggregation itself, it is always easier to just create a new column first (to do all the heavy lifting) and then simply aggregate on that specific column. Here, though, I just want to show that Pandas offers a bit more flexibility; both the new-column trick and the Pandas-only version are sketched after the first example below.
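Here is a minimal sketch of the two fixes. "Survived" comes from the post; the Pclass column, the toy values, and the variable names are my own assumptions for illustration (they follow the usual Titanic dataset layout):

```python
import pandas as pd
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Toy Titanic-style data, made up for illustration
df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3],
    "Survived": [1, 0, 1, 1, 0, 0],
})

# Pandas: the grouping key becomes the index, so reset_index() is needed
# to turn the result back into a plain DataFrame with normal columns.
pandas_rate = df.groupby("Pclass")["Survived"].mean().reset_index()

# Spark: without alias() the result column is named something like
# "avg(Survived)"; alias() gives it a short, friendly name.
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(df)
spark_rate = sdf.groupBy("Pclass").agg(F.mean("Survived").alias("Survived"))
spark_rate.show()
```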
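The "create a new column first" trick looks essentially the same in both libraries: compute the heavy part row by row, then fall back to a built-in aggregation. A sketch, again on made-up data with a hypothetical Fare column serving as the weight:

```python
import pandas as pd
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3],
    "Survived": [1, 0, 1, 1, 0, 0],
    "Fare":     [80.0, 60.0, 25.0, 20.0, 8.0, 7.0],
})
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(df[["Pclass", "Survived", "Fare"]])

# Pandas: sum of squares via a helper column, then a plain built-in sum
df["Fare_sq"] = df["Fare"] ** 2
fare_sq_sum = df.groupby("Pclass")["Fare_sq"].sum().reset_index()

# Spark: Fare-weighted survival rate via a helper column, then built-in sums
sdf = sdf.withColumn("w_survived", F.col("Survived") * F.col("Fare"))
weighted = (sdf.groupBy("Pclass")
               .agg((F.sum("w_survived") / F.sum("Fare")).alias("w_survival_rate")))
```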
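And here is the extra flexibility on the Pandas side: agg() accepts an arbitrary function per column, and apply() sees the whole group at once, so a multi-column aggregation like a weighted average needs no helper column at all. Another sketch under the same assumptions as above:

```python
import pandas as pd

df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3],
    "Survived": [1, 0, 1, 1, 0, 0],
    "Fare":     [80.0, 60.0, 25.0, 20.0, 8.0, 7.0],
})

# Single-column case: pass any function to agg(), e.g. a sum of squares
fare_sq_sum = (df.groupby("Pclass")["Fare"]
                 .agg(lambda s: (s ** 2).sum())
                 .reset_index())

# Multi-column case: apply() hands each group over as a DataFrame,
# so the Fare-weighted survival rate is a one-liner per group
def weighted_rate(g):
    return (g["Survived"] * g["Fare"]).sum() / g["Fare"].sum()

weighted = (df.groupby("Pclass")
              .apply(weighted_rate)
              .reset_index(name="w_survival_rate"))
```

In Spark, the closest equivalent would be a user-defined aggregate function or a grouped pandas UDF, which takes noticeably more setup; that is what I mean by Pandas offering a bit more flexibility here.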