Magic Analytics

Aries Research Note

PySpark vs. Pandas (Part 3: group-by related operation)

10/23/2016


 
Group-by is frequently used in SQL to compute aggregate statistics. To reduce big data to something that can be visualized, a group-by statement is almost essential.
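As a minimal sketch of a simple group-by in both libraries, assume a tiny Titanic-style table. The rows below are made up for illustration, and the helper name spark_group_mean is hypothetical; it expects an existing SparkSession to be passed in.

```python
import pandas as pd

# Toy rows with Titanic-style column names; the values are made up.
pdf = pd.DataFrame({
    "Survived": [0, 0, 1, 1, 1],
    "Age": [20.0, 40.0, 30.0, 10.0, 60.0],
    "Fare": [10.0, 40.0, 50.0, 20.0, 80.0],
})

# Pandas: the group key ("Survived") becomes the index of the result.
pandas_result = pdf.groupby("Survived")[["Age", "Fare"]].mean()


def spark_group_mean(spark, pandas_df):
    """Same aggregation in PySpark; `spark` is assumed to be a SparkSession."""
    sdf = spark.createDataFrame(pandas_df)
    # Spark keeps "Survived" as a regular column, but names the results
    # after the aggregation, along the lines of "avg(Age)" and "avg(Fare)".
    return sdf.groupBy("Survived").agg({"Age": "mean", "Fare": "mean"})
```

With a session available, spark_group_mean(spark, pdf).show() would print the Spark version of the same table.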

    
In my opinion, neither of the above approaches is "perfect". For Pandas, one needs to call "reset_index()" to get the "Survived" column back as a normal column; for Spark, the aggregated column is renamed to something descriptive but very long.
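On the Pandas side, a short sketch of the two usual ways to get the group key back as a normal column, again on made-up toy rows:

```python
import pandas as pd

# Toy rows with Titanic-style column names; the values are made up.
pdf = pd.DataFrame({
    "Survived": [0, 0, 1, 1, 1],
    "Age": [20.0, 40.0, 30.0, 10.0, 60.0],
    "Fare": [10.0, 40.0, 50.0, 20.0, 80.0],
})

# Option 1: aggregate, then move the index back into a column.
flat = pdf.groupby("Survived")[["Age", "Fare"]].mean().reset_index()

# Option 2: ask groupby not to use the key as the index in the first place.
flat_direct = pdf.groupby("Survived", as_index=False)[["Age", "Fare"]].mean()
```

Both results have "Survived", "Age" and "Fare" as plain columns, which is usually what a plotting call wants.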

For Spark, we can use the column alias function to make the result much nicer:
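A sketch of what aliasing can look like, assuming an existing SparkSession is passed in as spark. The function name and the short column names mean_age and mean_fare are hypothetical, and the toy rows are made up.

```python
import pandas as pd

# Toy rows with Titanic-style column names; the values are made up.
pdf = pd.DataFrame({
    "Survived": [0, 0, 1, 1, 1],
    "Age": [20.0, 40.0, 30.0, 10.0, 60.0],
    "Fare": [10.0, 40.0, 50.0, 20.0, 80.0],
})


def spark_mean_with_alias(spark, pandas_df):
    """Group by "Survived" and give each aggregate a short name.

    `spark` is assumed to be an existing SparkSession.
    """
    # Imported here so the definition loads even where PySpark is missing.
    from pyspark.sql import functions as F

    sdf = spark.createDataFrame(pandas_df)
    return sdf.groupBy("Survived").agg(
        F.mean("Age").alias("mean_age"),
        F.mean("Fare").alias("mean_fare"),
    )

# Usage, given a SparkSession called `spark`:
#   spark_mean_with_alias(spark, pdf).show()
```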

    
All of the above are "simple" aggregations, the kind already built into Pandas or Spark. What about more complicated ones, such as a weighted average or a sum of squares?

The complicated cases fall into two groups:
1. aggregation on a single column (like a sum of squares)
2. aggregation on multiple columns (like a weighted average based on another column)

Certainly, before reaching for a complicated aggregation, it is usually easier to create a new column (to do all the heavy lifting) and then simply aggregate on that column. Here, though, I just want to show that Pandas offers a bit more flexibility.
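A sketch of both routes in Pandas, on made-up toy rows. Using "Age" as the weight for "Fare" is an arbitrary choice for illustration, and the names age_sq, square_sum and weighted_fare are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy rows with Titanic-style column names; the values are made up.
pdf = pd.DataFrame({
    "Survived": [0, 0, 1, 1, 1],
    "Age": [20.0, 40.0, 30.0, 10.0, 60.0],
    "Fare": [10.0, 40.0, 50.0, 20.0, 80.0],
})

# Helper-column route: precompute Age squared, then use a built-in sum.
helper = pdf.assign(age_sq=pdf["Age"] ** 2)
square_sum_via_helper = helper.groupby("Survived")["age_sq"].sum()

# Case 1: custom aggregation on a single column (sum of squares of Age).
square_sum = pdf.groupby("Survived")["Age"].agg(lambda s: (s ** 2).sum())

# Case 2: aggregation across two columns (Fare averaged with Age as weight).
# apply() hands each group to the function as a small DataFrame.
weighted_fare = pdf.groupby("Survived")[["Fare", "Age"]].apply(
    lambda g: np.average(g["Fare"], weights=g["Age"])
)
```

The helper-column route carries over to Spark more directly, since it only relies on a built-in aggregation once the new column exists.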

    

    Author

    Data Magician
