In SQL, by defining a specific "window", one can perform a "calculation across a set of table rows that are somehow related to the current row" (from the PostgreSQL documentation). This greatly extends SQL's analytic power. The PySpark implementation is syntactically quite close to SQL: one literally has to define a "window". Pandas also has "window" functions, but they seem to me to be more like rolling windows rather than SQL's window functionality (correct me if I am wrong). My preferred way to replicate window functions in Pandas and PySpark is sketched below.

From this comparison, PySpark looks more SQL-straightforward, with a nice syntax for the job. However, Pandas is actually more versatile: many different kinds of functionality can be defined through the "assign" keyword. Good or bad? I think it depends.
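As a minimal, runnable sketch of the comparison I have in mind (the toy table and the dept/salary column names are made up for illustration), here is the same "average salary per department" window computed both ways:

```python
import pandas as pd
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("window-demo").getOrCreate()

# Toy data; "dept" and "salary" are assumed column names for illustration.
pdf = pd.DataFrame({
    "dept":   ["a", "a", "a", "b", "b"],
    "salary": [100, 200, 300, 400, 500],
})
sdf = spark.createDataFrame(pdf)

# PySpark: define the window literally, as in SQL's AVG(...) OVER (PARTITION BY dept)
w = Window.partitionBy("dept")
sdf = sdf.withColumn("dept_avg", F.avg("salary").over(w))
sdf.show()

# Pandas: replicate the same window with groupby + transform inside assign
pdf = pdf.assign(dept_avg=pdf.groupby("dept")["salary"].transform("mean"))
print(pdf)
```

Note that transform keeps the result aligned with the original rows, which is exactly what a SQL window function does (as opposed to a plain GROUP BY, which collapses the rows).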
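And to illustrate the versatility of "assign": several window-like columns can be stacked in a single call (again, the column names are just assumptions):

```python
# Chained lambdas each receive the current DataFrame,
# so several "windows" stack in one assign call.
pdf = pdf.assign(
    dept_avg=lambda d: d.groupby("dept")["salary"].transform("mean"),
    dept_rank=lambda d: d.groupby("dept")["salary"].rank(ascending=False),
    dept_cumsum=lambda d: d.groupby("dept")["salary"].cumsum(),
)
print(pdf)
```

In PySpark, each of these would typically be its own withColumn (and, for the rank, a window with an explicit orderBy), which is where the "more versatile" impression comes from.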