Magic Analytics

Aries Research Note

PySpark vs. Pandas (Part 5: SQL window functions)

10/31/2016

In SQL, by defining a specific "window", one can perform a "calculation across a set of table rows that are somehow related to the current row" (from the PostgreSQL documentation). This greatly extends SQL's analytic power.

The PySpark implementation is syntactically quite close to SQL's: one literally has to define a "window". Pandas also has "window" functions, but as far as I can tell they behave more like rolling windows than like SQL's window functionality (correct me if I am wrong).

My preferred way to replicate window functions in Pandas and PySpark is shown below:

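A minimal sketch of the pattern, using a toy Titanic-style frame (the `Pclass`/`Age` columns and the `AvgAge` name are illustrative choices of mine); the PySpark version needs a live SparkSession, so it is shown in comments:

```python
import pandas as pd

# toy Titanic-style frame; column names are illustrative
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 2],
    "Age":    [38.0, 35.0, 27.0, 29.0, 34.0],
})

# Pandas: groupby().transform() broadcasts the group aggregate back onto
# every row, which is what assign() needs to mimic a SQL window
df = df.assign(AvgAge=df.groupby("Pclass")["Age"].transform("mean"))

# PySpark: define the window literally, then apply the aggregate over it
# (needs a live SparkSession, so shown as comments):
# from pyspark.sql import Window
# import pyspark.sql.functions as F
# w = Window.partitionBy("Pclass")
# sdf = sdf.withColumn("AvgAge", F.avg("Age").over(w))
```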
From this comparison, PySpark looks more SQL-straightforward, with nice syntax for the job. In practice, though, Pandas is more versatile: many different kinds of functionality can be built on the "assign" keyword. Good or bad? I think it depends.

PySpark vs. Pandas (Part 4: set-related operations)

10/24/2016

The "set"-related operations treat a data frame as if it were a mathematical set. The common set operations are union, intersect, and difference. Pandas and PySpark handle them in different ways.

In Pandas, because of the concept of the Index, the thinking is sometimes a little different from the traditional set operations.

Spark, by contrast, makes this easy: it stays so close to SQL that those keywords are implemented directly.
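A minimal sketch of both sides, using toy frames with a hypothetical "id" column; the Spark calls need a live SparkSession, so they appear as comments:

```python
import pandas as pd

a = pd.DataFrame({"id": [1, 2, 3]})
b = pd.DataFrame({"id": [2, 3, 4]})

# union: stack the rows, then drop duplicates
union = pd.concat([a, b]).drop_duplicates().reset_index(drop=True)

# intersection: an inner merge keeps only rows present in both frames
intersect = a.merge(b)

# difference (a minus b): indicator=True marks where each row came from
diff = a.merge(b, how="left", indicator=True)
diff = diff[diff["_merge"] == "left_only"].drop(columns="_merge")

# PySpark has the SQL keywords built in (needs a live SparkSession):
# sdf_a.union(sdf_b).distinct()   # union
# sdf_a.intersect(sdf_b)          # intersection
# sdf_a.subtract(sdf_b)           # difference
```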

So in this round of the comparison, Spark is more intuitive than Pandas for handling SQL set-related operations.

PySpark vs. Pandas (Part 3: group-by related operations)

10/23/2016

Group-by is frequently used in SQL for aggregation statistics. To bring big data back down to a size suitable for visualization, the group-by statement is almost essential.
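As a sketch, using a toy Titanic-style frame (column names are illustrative; the PySpark line needs a live SparkSession, so it is a comment):

```python
import pandas as pd

df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 2],
    "Survived": [1, 0, 1, 1, 0],
})

# Pandas: the group key "Pclass" becomes the Index of the result
surv = df.groupby("Pclass")["Survived"].mean()

# PySpark (comment only): the result column is auto-named "avg(Survived)"
# sdf.groupBy("Pclass").agg({"Survived": "mean"})
```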

In my opinion, neither of the above approaches is "perfect". In Pandas, one needs to call "reset_index()" to get the "Survived" column back as a normal column; in Spark, the column name is changed into a descriptive but very long one.

For Spark, we can introduce the alias function on the column to make things much nicer:

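A sketch of both fixes, on the same kind of toy frame (keeping the alias name "Survived" is my choice):

```python
import pandas as pd

df = pd.DataFrame({"Pclass": [1, 1, 2], "Survived": [1, 0, 1]})

# Pandas: reset_index() turns the group key back into a regular column
surv = df.groupby("Pclass")["Survived"].mean().reset_index()

# PySpark: alias() replaces the long auto-generated column name
# (needs a live SparkSession):
# import pyspark.sql.functions as F
# sdf.groupBy("Pclass").agg(F.mean("Survived").alias("Survived"))
```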
All of the above are "simple" aggregations that already exist in Pandas or Spark. What about complicated ones, like a weighted average or a sum of squares?

The complicated cases fall into two categories:
1. aggregation on a single column (like a sum of squares)
2. aggregation across multiple columns (like a weighted average based on another column)

Certainly, before getting complicated with the aggregation itself, it is always easier to create a new column first (to do all the heavy lifting) and then simply aggregate on that column. Here, though, I just want to show that Pandas offers a bit more flexibility:
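A sketch of both cases in Pandas, on a toy frame (the `Fare`/`Age` columns are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2],
    "Fare":   [80.0, 40.0, 20.0, 10.0],
    "Age":    [30.0, 50.0, 20.0, 40.0],
})

# case 1, single column: a custom function passed to agg(),
# here the sum of squares of Fare per class
sq = df.groupby("Pclass")["Fare"].agg(lambda s: (s ** 2).sum())

# case 2, multiple columns: apply() sees each group's whole sub-frame,
# here a Fare-weighted average of Age per class
wavg = df.groupby("Pclass").apply(
    lambda g: (g["Age"] * g["Fare"]).sum() / g["Fare"].sum()
)
```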


PySpark vs. Pandas (Part 2: join-related operations)

10/23/2016

Data is usually spread out across different tables, and insights are extracted when all the information is merged together: join-related operators are very important for getting this done.

There are three kinds of join operators:
1. join by key(s)
2. join as set operator on Rows
3. join as set operator on Columns

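A sketch of the first kind, with toy frames that share a non-key "Survived" column (the Spark call needs a live SparkSession, so it is a comment):

```python
import pandas as pd

left = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})
right = pd.DataFrame({"PassengerId": [1, 2], "Survived": [1, 1]})

# Pandas renames clashing non-key columns with suffixes automatically
joined = left.merge(right, on="PassengerId", suffixes=("_x", "_y"))

# PySpark keeps BOTH columns named "Survived" (comment only):
# ldf.join(rdf, on="PassengerId")
```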
The only difference (and potential problem) here is that Pandas automatically renames identical (non-key) columns by appending a suffix to avoid name duplication, while Spark just keeps the same name! Although there is a way to still refer to the right "Survived" column, it is not very convenient. So the recommended approach is to rename the colliding column first.

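A sketch of the rename-first recipe (the new name "SurvivedPred" is a hypothetical choice of mine):

```python
import pandas as pd

left = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})
right = pd.DataFrame({"PassengerId": [1, 2], "Survived": [1, 1]})

# rename the colliding column before joining, so no suffix or
# ambiguity arises in either library
right = right.rename(columns={"Survived": "SurvivedPred"})
joined = left.merge(right, on="PassengerId")

# PySpark equivalent (needs a live SparkSession):
# rdf = rdf.withColumnRenamed("Survived", "SurvivedPred")
# ldf.join(rdf, on="PassengerId")
```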
The second kind of join is more like a set operator: it treats the two DataFrames as if they were two sets and takes their "intersection", "difference", or "union".

The third kind of join extends the current data frame along its index. Most of the time it amounts to joining on the same key(s) to pick up extra columns, but in Pandas one can extend the columns according to the index.

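A sketch of the index-aligned extension in Pandas (Spark has no row index, so its closest equivalent is an ordinary key join):

```python
import pandas as pd

a = pd.DataFrame({"Name": ["Ann", "Bob"]}, index=[0, 1])
b = pd.DataFrame({"Age": [29, 41]}, index=[0, 1])

# axis=1 aligns the two frames on the Index and glues columns side by side
wide = pd.concat([a, b], axis=1)
```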

PySpark vs. Pandas (Part 1: select and filter)

10/22/2016

As long as the data fits fully into memory, Pandas is a great data-analysis tool. Once the data grows much bigger, however, Spark comes into play. Pandas and PySpark DataFrames have different APIs, and it is very easy to get confused or to miss the best practice. I want to summarize my best practices so that others take fewer detours.

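A sketch of select-and-filter in both libraries, on a toy frame (column names are illustrative; the PySpark lines need a live SparkSession, so they are comments):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cal"],
    "Age":  [29, 41, 17],
})

# Pandas: a boolean mask filters rows, a column list selects columns
adults = df[df["Age"] >= 18][["Name", "Age"]]

# PySpark (comments only):
# sdf.select("Name", "Age").filter(sdf.Age >= 18)
```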

    Author

    Data Magician

