Data is usually spread across different tables, and insights come from merging all the information together, so join-related operators are essential for getting this done. There are three kinds of join operators:

1. join by key(s)
2. join as a set operator on rows
3. join as a set operator on columns

For joins by key, the only difference (and potential problem) is that Pandas automatically renames identically named non-key columns by appending a suffix to avoid duplicate names, while Spark simply keeps the same name for both. Although there is still a way to refer to the right "Survived" column in Spark, it is not very convenient. The recommended approach is therefore to rename the colliding column first.

The second kind of join works like a set operator: it treats the two DataFrames as two sets of rows and takes their "intersection", "difference", or "union".

The third kind of join extends the current data frame along its index by adding columns. Most of the time it behaves like a join on the same key(s) that brings in extra columns, but in Pandas one can also extend the columns according to the index.
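As a rough illustration of the three kinds of join on the Pandas side, here is a minimal sketch. The toy data, the `PassengerId` key, the `Age` column, and the `Survived_right` name are all made up for the example; only the "Survived" column name comes from the text above. The Spark side is not shown here.

```python
import pandas as pd

# Hypothetical toy tables; only the "Survived" column name comes from the post.
left = pd.DataFrame({"PassengerId": [1, 2, 3], "Survived": [0, 1, 1]})
right = pd.DataFrame({"PassengerId": [2, 3, 4], "Survived": [1, 0, 0]})

# 1. Join by key: pandas appends the default suffixes "_x" and "_y"
#    to the colliding non-key column "Survived".
by_key = left.merge(right, on="PassengerId", how="inner")

# Renaming the colliding column first keeps the result unambiguous.
renamed = left.merge(
    right.rename(columns={"Survived": "Survived_right"}),
    on="PassengerId",
    how="inner",
)

# 2. Set operators on rows.
# Union: stack both tables and drop duplicated rows.
union = pd.concat([left, right]).drop_duplicates().reset_index(drop=True)
# Intersection: with no "on", merge uses all shared columns as the key.
intersection = left.merge(right, how="inner")
# Difference: keep rows of "left" that have no match in "right".
marked = left.merge(right, how="left", indicator=True)
difference = marked[marked["_merge"] == "left_only"].drop(columns="_merge")

# 3. Extend along columns: rows are aligned by index, not by a key column.
extra = pd.DataFrame({"Age": [22, 38, 26]})
wide = pd.concat([left, extra], axis=1)

print(by_key.columns.tolist())
print(renamed.columns.tolist())
print(len(union), len(intersection), len(difference))
print(wide.columns.tolist())
```

Renaming before the merge gives the same rows as the suffixed version, but the column names are chosen explicitly instead of relying on defaults, and the same habit carries over to an engine that keeps duplicate names.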
Author: Data Magician
Archives: October 2017