Magic Analytics
  • Home
  • Python
    • Pandas
    • Matplotlib
    • Interactive Visualization
    • Folium
  • Spark
    • DataFrame
  • Machine Learning
    • Classification >
      • Logistic Regression
    • Dimension Reduction
    • Model Explanation
  • Blog
  • About

Aries Research Note

Plotly: Basic Settings for Data Science

10/25/2016

1 Comment

 
There are a million ways people can use a piece of software; this is my preferred way (maybe not optimal, but workable).
As a data scientist, I mostly want to use Plotly for interactive exploratory analytics, since it is a good way to get a better feel for the data.


PySpark vs. Pandas (Part 4: set related operation)

10/24/2016

0 Comments

 
The "set"-related operations treat the data frame as if it were a mathematical set. The common set operations are union, intersect, and difference, and Pandas and PySpark handle them differently.

In Pandas, because of the concept of the Index, the thinking is sometimes a little different from a traditional set operation.
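A sketch of the usual row-level set operations in Pandas, on made-up toy frames (the PySpark equivalents are shown as comments only and assume a live SparkSession):

```python
import pandas as pd

df1 = pd.DataFrame({"k": [1, 2, 3], "v": ["a", "b", "c"]})
df2 = pd.DataFrame({"k": [2, 3, 4], "v": ["b", "c", "d"]})

# Union: stack the rows, then drop duplicates; reset_index avoids a duplicated index.
union = pd.concat([df1, df2]).drop_duplicates().reset_index(drop=True)

# Intersection: an inner merge on all columns keeps only the common rows.
intersect = df1.merge(df2, how="inner")

# Difference (df1 minus df2): keep rows of df1 that do not appear in df2.
merged = df1.merge(df2, how="left", indicator=True)
difference = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

# PySpark equivalents map directly to the SQL keywords (sketch; sdf1/sdf2 are
# Spark DataFrames with identical schemas):
#   sdf1.union(sdf2).distinct()   # UNION
#   sdf1.intersect(sdf2)          # INTERSECT
#   sdf1.subtract(sdf2)           # EXCEPT / MINUS
```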

For Spark, on the other hand, it is quite easy: Spark stays so close to SQL that it implements these keywords directly, as union, intersect, and subtract (Spark's EXCEPT).

So in this round of the comparison, Spark is more intuitive than Pandas for handling SQL-style set operations.

PySpark vs. Pandas (Part 3: group-by related operation)

10/23/2016

0 Comments

 
Group-by is frequently used in SQL for aggregate statistics, and to get big data back into a visualization, a group-by step is almost essential.

In my opinion, neither approach above is "perfect". In Pandas, one needs to call reset_index() to get the "Survived" column back as a normal column; in Spark, the column name is changed into a descriptive but very long one.
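A sketch of the reset_index() point, on made-up Titanic-like columns (the Spark alias version appears as comments and assumes a live SparkSession):

```python
import pandas as pd

df = pd.DataFrame({"Pclass": [1, 1, 2, 2, 3],
                   "Survived": [1, 0, 1, 1, 0]})

# group-by + mean leaves "Pclass" in the index; reset_index() turns it
# back into a normal column.
agg = df.groupby("Pclass")["Survived"].mean().reset_index()

# PySpark counterpart, with alias() to shorten the generated column name
# (sketch only; assumes a Spark DataFrame `sdf`):
#   import pyspark.sql.functions as F
#   sdf.groupBy("Pclass").agg(F.mean("Survived").alias("Survived"))
```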

For Spark, we can use the alias function on the column to make things much nicer.

All of the above are "simple" aggregations that already exist in Pandas or Spark. What about complicated ones, like a weighted average or a sum of squares?

The complicated cases can be grouped as:
1. aggregation on a single column (like a sum of squares)
2. aggregation over multiple columns (like an average weighted by another column)

Certainly, before getting complicated with the aggregation itself, it is often easier to just create a new column first (to do all the heavy lifting) and then simply aggregate that column. Still, here I want to show that Pandas offers a bit more flexibility.
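A sketch of both complicated cases on a made-up frame (column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"g": ["a", "a", "b"],
                   "x": [1.0, 3.0, 5.0],
                   "w": [1.0, 1.0, 2.0]})

# 1. aggregation on a single column: sum of squares of x per group
sq_sum = df.groupby("g")["x"].agg(lambda s: (s ** 2).sum())

# 2. aggregation over multiple columns: average of x weighted by w,
#    done with apply() on the whole group
w_avg = df.groupby("g").apply(lambda d: (d["x"] * d["w"]).sum() / d["w"].sum())
```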


PySpark vs. Pandas (Part 2: join-related operation)

10/23/2016

0 Comments

 
Data is usually spread across different tables, and insights are extracted when all the information is merged together: join-related operators are essential to get this done.

There are three kinds of join operators:
1. join by key(s)
2. join as set operator on Rows
3. join as set operator on Columns

The only difference (and potential problem) here is that Pandas automatically renames identical non-key columns by adding a suffix to avoid name duplication, while Spark just keeps the same name! There is a way to still refer to the right "Survived" column, but it is not convenient, so the recommended approach is to rename the colliding column first.
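A sketch of the collision and the rename-first fix, on made-up toy frames (the PySpark line is a comment and assumes a live SparkSession):

```python
import pandas as pd

left = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})
right = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 0]})

# Pandas appends suffixes to the colliding non-key column automatically ...
auto = left.merge(right, on="PassengerId")          # -> Survived_x, Survived_y

# ... but renaming the collision up front is clearer:
joined = left.merge(right.rename(columns={"Survived": "Survived_right"}),
                    on="PassengerId")

# PySpark sketch:
#   sleft.join(sright.withColumnRenamed("Survived", "Survived_right"),
#              on="PassengerId")
```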

The second kind of join is more of a set operator: it treats the two data frames as sets and takes their intersection, difference, or union.

The third kind of join extends the current data frame along its index. Most of the time it is similar to joining on the same key(s) to pick up extra columns, but in Pandas one can extend the columns according to the index.
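A sketch on toy frames: pd.concat(axis=1) aligns on the index with an outer join, while DataFrame.join keeps the left frame's index.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2]}, index=["r1", "r2"])
df2 = pd.DataFrame({"b": [3, 4]}, index=["r2", "r3"])

# Align on the index and keep all rows (an outer join on the index):
wide = pd.concat([df1, df2], axis=1)

# df1.join(df2) does the same alignment but keeps only df1's index (a left join):
left_join = df1.join(df2)
```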


PySpark vs. Pandas (Part 1: select and filter)

10/22/2016

0 Comments

 
As long as the data fits fully into memory, Pandas is a great data-analysis tool. When the data gets much bigger, Spark comes into play. Pandas and PySpark DataFrames have different APIs, and it is very easy to get confused or miss the best practice. I want to summarize my best practices so that others can take fewer detours.
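As a taste of the two APIs, a sketch on made-up data (the PySpark lines are comments only and assume a live SparkSession):

```python
import pandas as pd

df = pd.DataFrame({"Age": [22, 38, 26], "Sex": ["male", "female", "female"]})

# Pandas: boolean-mask filter plus column selection
adults = df.loc[df["Age"] > 25, ["Age", "Sex"]]

# The same thing with query(), which reads closer to SQL
adults_q = df.query("Age > 25")[["Age", "Sex"]]

# PySpark sketch (assumes a Spark DataFrame `sdf`):
#   sdf.filter(sdf.Age > 25).select("Age", "Sex")
```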


Pandas: reshape data frame

10/2/2016

1 Comment

 
A data frame has an index, columns, and values. A selection operation usually does not affect the table's structure; it only "takes selected pieces" out. Other operations, like group-by, may change the structure. So how do we change the structure back?

In Excel there is the concept of a pivot table, which converts one or more columns into the index/columns and presents the data nicely. It is a great feature for quickly getting analytic insights. Does Pandas support it?

The answer is "for sure!" Here are a few functions used very often in Pandas to manipulate the "shape" of a data frame.

1. reset_index / set_index
    Very self-explanatory: reset_index changes an index back into a column, while set_index moves a column into the index.
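A minimal sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"k": ["a", "b"], "v": [1, 2]})

indexed = df.set_index("k")      # "k" moves from a column into the index
back = indexed.reset_index()     # ... and reset_index() moves it back
```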

2. pivot and pivot_table
    It always gets confusing (to me) how to build a pivot table in Pandas, but Nikolay Grozev's blog provides a very intuitive visualization. I will use one of his figures here for easy illustration; readers are encouraged to go to his blog for more details.
[Figure: pivot illustration from Nikolay Grozev's blog]
As you can see, pivot_table can be considered an "advanced" pivot: the table is created with more control over which aggregation function to use, while pivot provides a faster way to just "reshape" the data frame into the one needed.
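A sketch of the difference on made-up Titanic-like columns:

```python
import pandas as pd

df = pd.DataFrame({"Pclass": [1, 1, 2, 2],
                   "Sex": ["male", "female", "male", "female"],
                   "Fare": [50.0, 70.0, 20.0, 30.0]})

# pivot: pure reshape, exactly one value per (index, column) cell
shaped = df.pivot(index="Pclass", columns="Sex", values="Fare")

# pivot_table: reshape plus aggregation, for when cells hold several values
table = df.pivot_table(index="Pclass", columns="Sex",
                       values="Fare", aggfunc="mean")
```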
[Figure: pivot_table illustration from Nikolay Grozev's blog]
3. stack and unstack
   This section is also very well illustrated in Nikolay Grozev's blog. I borrow one figure here; the reader is highly recommended to check the details in the original post.
[Figure: stack/unstack illustration from Nikolay Grozev's blog]
Let's see how it works on the Titanic data set:
[Figures: stack/unstack applied to the Titanic data set]
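A small sketch of stack and unstack on toy Titanic-like columns (made-up data, not the post's screenshots):

```python
import pandas as pd

df = pd.DataFrame({"Pclass": [1, 1, 2, 2],
                   "Sex": ["male", "female", "male", "female"],
                   "Survived": [0, 1, 1, 0]})

counts = df.groupby(["Pclass", "Sex"])["Survived"].mean()

# unstack(): move the inner index level ("Sex") out into the columns
wide = counts.unstack()

# stack(): fold the columns back into the index
tall = wide.stack()
```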
Now, changing a data frame's shape should not be a problem any more.

Pandas: group-by-aggregation deep dive

10/2/2016

0 Comments

 
A friend once asked me: what is the Pandas equivalent of R's summarise (as in dplyr)? Surprisingly, I was not able to give a straight answer. After some digging, I finally found a (somewhat) satisfactory one.

First, let's look at how summarise in dplyr works; the example I used was borrowed from RStudio.

Other functions aside, focusing on summarise: one can easily specify the aliases "arr" and "dep", the aggregation function mean, the columns to work over ("arr_delay" and "dep_delay"), and even conditional arguments such as na.rm. This is very powerful.

Now let's look at the alternative in Pandas, focusing on just the "summarise" part, with the help of the Titanic data set.
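A sketch of the summarise-like step on made-up Titanic-like columns:

```python
import pandas as pd

df = pd.DataFrame({"Pclass": [1, 1, 2],
                   "Fare": [80.0, 60.0, 20.0],
                   "Age": [30.0, 50.0, 20.0]})

# One aggregation function per column: the result keeps flat column names.
out = df.groupby("Pclass").agg({"Fare": "mean", "Age": "mean"}).reset_index()
```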

The functionality looks similar, but what about computing not only the mean but also the std and max over the same column, with different aliases as in dplyr's "arr" and "dep"? Then we have to change the code.
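For instance (made-up data; asking for several statistics over the same column is what produces the multiple column levels):

```python
import pandas as pd

df = pd.DataFrame({"Pclass": [1, 1, 2, 2],
                   "Fare": [80.0, 60.0, 30.0, 10.0]})

# Several statistics over one column -> the result carries a MultiIndex
# over the columns: (Fare, mean), (Fare, std), (Fare, max)
stats = df.groupby("Pclass").agg({"Fare": ["mean", "std", "max"]})
```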

But what is this multiple level of columns? It is the Pandas concept of a MultiIndex. Personally, I find a MultiIndex over columns hard to manipulate, so I prefer to drop it after the aggregation.
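A sketch of dropping it, rebuilt on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"Pclass": [1, 1, 2, 2],
                   "Fare": [80.0, 60.0, 30.0, 10.0]})

stats = df.groupby("Pclass").agg({"Fare": ["mean", "max"]})

# Drop the MultiIndex by joining the two levels into plain column names
stats.columns = ["_".join(col) for col in stats.columns]
stats = stats.reset_index()   # columns: Pclass, Fare_mean, Fare_max
```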

However, is there a way to do EVERYTHING in one line? I don't like having to define a "df1" and then fix up its columns. Here is a trick: specify the column right after the groupby, and magic happens :)
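With current pandas, one way the one-liner can look, using the keyword (named-aggregation) form of agg for the aliases; the dict-renaming form from the 0.18 era has since been removed, and the keyword form needs pandas >= 0.25:

```python
import pandas as pd

df = pd.DataFrame({"Pclass": [1, 1, 2, 2],
                   "Fare": [80.0, 60.0, 30.0, 10.0]})

# One chained expression: pick the column right after groupby, then give
# each aggregate its own alias.
out = (df.groupby("Pclass")["Fare"]
         .agg(fare_mean="mean", fare_max="max")
         .reset_index())
```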

Now mission completed :)

Pandas: data frame sub-selection

10/1/2016

0 Comments

 
Pulling the right block of data out of a data frame used to be inelegant: one had to mix different functions (query, loc, iloc) or write expressions like "df['x'] > df['y']" directly, which is not chainable at all.

Here are a few new features in Pandas 0.18.1 that simplify the flow of data frame selection.
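One of the 0.18.1 additions is that .loc (and []) accept callables, which keeps selection inside a method chain. A sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 5, 3], "y": [2, 4, 9]})

# The callable receives the intermediate frame, so no temporary variable
# is needed even after assign() creates a new column.
out = (df.assign(z=lambda d: d["x"] + d["y"])
         .loc[lambda d: d["x"] > d["y"]])
```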

No matter how involved the selection logic gets, with the newly improved .loc things start to look neat!

Pandas: data cleaning functions

10/1/2016

1 Comment

 
Data cleaning is a key part in Pandas. Usually it requires functions such as:
- drop columns
- fill/drop missing data
- drop duplicate rows
- replace data value

Doing all of this in Pandas is quite straightforward (based on Pandas 0.18.1).
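A sketch covering all of the above on a made-up frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"name": ["Ann", "Bob", "Bob", "Cat"],
                   "age": [28.0, np.nan, np.nan, 35.0],
                   "city": ["NY", "LA", "LA", "??"],
                   "tmp": [1, 2, 3, 4]})

clean = (df.drop(columns="tmp")                     # drop a column
           .fillna({"age": df["age"].mean()})       # fill missing data
           .drop_duplicates()                       # drop duplicate rows
           .replace({"city": {"??": "unknown"}}))   # replace bad values
```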

Here are the functions used:
drop, dropna, fillna, drop_duplicates, replace

Simple, but powerful  :)

Hive: it's about TIME

9/12/2016

1 Comment

 
Having used regular SQL for a long time, I find Hive syntax hard to use, especially the TIME part. In many existing SQL systems one can compare a datetime with a string easily, while in Hive you have to get the data types exactly right. So, to make things easier, here is a summarized note on how to deal with Hive TIME-related operations.
(Based on Hive 1.2.1)

Principle:
    1. TIME can be "seen" in multiple types:
        a. Integer (actually BigInt)
        b. Timestamp
        c. Date
        d. String
    2. Often a string merely looks like a "timestamp" or "date", but it is still a string!
    3. Hive has timestamp and date data types, and both can be compared with a string or an integer.

In effect, comparisons are made at either the Integer or the String level; "Timestamp" and "Date" are only representations of the underlying Integer data.

Key functions (guarantee a date type):
    cast(date as date)
    cast(timestamp as date)
    cast(string as date)
    cast(date as timestamp)
    cast(date as string)


Examples:

>> select cast('2016-09-01' as date) > 1203828;   
>> true

>> select cast('2016-09-01' as string) > 1203828; 
>> NULL

Key functions (may return a different type):
    unix_timestamp()                                              # 1474353532 (integer)
    unix_timestamp('2016-08-01:21:00:01', 'yyyy-MM-dd:hh:mm:ss')  # 1470085201 (integer)
    datediff('2016-09-21', '2015-02-21')                          # 578 (integer)

    from_unixtime(1474353532)                                     # '2016-09-20 06:38:52' (string)
    to_date('2016-08-01 00:03:24')                                # '2016-08-01' (string)
    date_add('2016-08-01', 1)                                     # '2016-08-02' (string)
    date_sub('2016-08-01', 1)                                     # '2016-07-31' (string)

    from_utc_timestamp('2016-09-19 16:40:16.0', 'PST')            # 2016-09-19 09:40:16.0 (timestamp)
    to_utc_timestamp('2016-09-19 09:40:16.0', 'PST')              # 2016-09-19 16:40:16.0 (timestamp)
    current_timestamp()                                           # '2016-09-20 15:39:43.145' (timestamp)

    current_date()                                                # '2016-09-20' (date)


Examples:

>> select 1474353532 < unix_timestamp();  -- integer vs. integer
>> true
>> select 1474353532 < from_unixtime(unix_timestamp())  -- integer vs. string
>> NULL

>> select to_date(from_unixtime(unix_timestamp())) > '2015-08-20' -- string vs. string
>> true
>> select to_date(from_unixtime(unix_timestamp())) > unix_timestamp() -- string vs. int
>> NULL

    Author

    Data Magician
