Magic Analytics

Aries Research Note

PySpark vs. Pandas (Part 5: SQL window functions)

10/31/2016

In SQL, by defining a specific "window", one can perform a "calculation across a set of table rows that are somehow related to the current row" (from the PostgreSQL documentation). This greatly extends SQL's analytic power.
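To make the SQL side of the comparison concrete, here is a minimal, self-contained demo using Python's built-in sqlite3, which supports window functions in SQLite 3.25+ (the table name and data are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (dept TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("a", 10.0), ("a", 20.0), ("b", 30.0)])

# For each row, compute the average over the "window" of rows sharing
# the same dept -- without collapsing rows the way GROUP BY would.
rows = conn.execute(
    "SELECT dept, amount, AVG(amount) OVER (PARTITION BY dept) AS dept_avg "
    "FROM sales"
).fetchall()
print(rows)  # e.g. ('a', 10.0, 15.0), ('a', 20.0, 15.0), ('b', 30.0, 30.0)
```

Note how every input row survives, each annotated with its partition's aggregate; that is the behavior the Pandas and PySpark snippets below replicate.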

The PySpark implementation is syntactically quite close to SQL: one has to define a "window" explicitly. Pandas also has "window" functions, but I found them to be more like rolling windows than SQL's window functionality (correct me if I am wrong).

My preferred way to replicate window functions in Pandas and PySpark is below:

    
[Images: window-function code in Pandas and PySpark, side by side]
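The original screenshots are not legible here; the following is a rough sketch of the kind of replication described, assuming the same per-group average as in the SQL example (the DataFrame, column names, and aggregate are my own illustrative choices, not the original screenshots):

```python
import pandas as pd

df = pd.DataFrame({"dept": ["a", "a", "b"], "amount": [10.0, 20.0, 30.0]})

# Pandas: replicate AVG(amount) OVER (PARTITION BY dept) with assign plus
# a grouped transform, which broadcasts each group's mean back to its rows.
out = df.assign(dept_avg=df.groupby("dept")["amount"].transform("mean"))
print(out)

# PySpark equivalent (sketch, not run here): define the window explicitly,
# very close to the SQL phrasing:
#   from pyspark.sql import Window, functions as F
#   w = Window.partitionBy("dept")
#   sdf.withColumn("dept_avg", F.avg("amount").over(w))
```

As in SQL, no rows are collapsed: every row keeps its own `amount` alongside the partition-level `dept_avg`.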
From this comparison, PySpark looks more SQL-like, with a clean syntax for defining the window. In practice, however, Pandas is more versatile: many different computations can be attached through the "assign" keyword. Good or bad? I think it depends.




    Author

    Data Magician

