Spark DataFrame is a great way to carry out data analytics on datasets of 100 GB and up. Basically,
DataFrame = RDD + Schema + (some optimization)
For a data scientist, both Pandas and Spark DataFrame offer a great way to play with data, and it is worth comparing the two tools to see where they differ and to master both.
Here I wrote a series of blog posts comparing them side by side on the most frequently used data science operations:
Spark DataFrame vs. Pandas (Part 1. select and filter)
Spark DataFrame vs. Pandas (Part 2. join related operation)
Spark DataFrame vs. Pandas (Part 3. group-by related operation)
Spark DataFrame vs. Pandas (Part 4. set related operation)
Spark DataFrame vs. Pandas (Part 5: SQL-windows function)