There are many visualization packages in Python, and matplotlib is arguably the most popular one. It has been under development for around 14 years and provides almost every feature needed to draw static graphs. The recent 2.0 upgrade makes its graphs look much nicer even with the default style, so it is definitely worth learning.
Visualization in matplotlib is not very easy: one needs to know what each graph type is capable of, along with its detailed parameters. So I wrote some "sample code" to serve as a "cheatsheet": it uses more parameters than strictly needed (sometimes redundantly) to generate fairly complicated graphs. That way, whenever a customization request comes up later, referring back to this code should usually help. This post covers the following graph types: Pie, Bar, Hist, Boxplot, Violinplot. The details can be seen in the following three links: Pie-Bar, Hist, Box-Violin
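As a rough illustration of the five graph types listed above, here is a minimal sketch (not the code from the linked notebooks). The data is randomly generated, and the parameter choices such as `autopct`, `edgecolor` and `showmedians` are only examples of the kind of customization a cheatsheet would record.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so the script also runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Made-up data, for illustration only
rng = np.random.default_rng(0)
labels = ["A", "B", "C", "D"]
counts = [30, 25, 25, 20]
samples = [rng.normal(loc=mu, scale=1.0, size=200) for mu in (0.0, 1.0, 2.0)]

fig, axes = plt.subplots(1, 5, figsize=(20, 4))

# Pie: percentage labels, first wedge starting at 12 o'clock
axes[0].pie(counts, labels=labels, autopct="%1.0f%%", startangle=90)
axes[0].set_title("Pie")

# Bar: one bar per category
axes[1].bar(labels, counts, color="steelblue", edgecolor="black")
axes[1].set_title("Bar")

# Hist: distribution of a single sample
axes[2].hist(samples[0], bins=20, alpha=0.7, color="darkorange")
axes[2].set_title("Hist")

# Boxplot: one box per sample, drawn at x = 1, 2, 3
axes[3].boxplot(samples)
axes[3].set_xticklabels(["g1", "g2", "g3"])
axes[3].set_title("Boxplot")

# Violinplot: same samples, with the median marked
axes[4].violinplot(samples, showmedians=True)
axes[4].set_title("Violinplot")

fig.tight_layout()
# fig.savefig("basic_plots.png")  # uncomment to write the figure to disk
```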
Initially, in the plot_decision_boundary code, I used a hard-coded color scheme to represent each class. If the number of input classes was larger than the number of hard-coded colors, an error was raised. This may not be the best approach, so I started to think about how to define an RGB representation for each class in this specific use case. Basically, here are a few facts and requirements:
Fact:
F-1. Any data point has a probability associated with each class, and these probabilities sum to 1.

Requirement:
R-1. If a point is 100% in class A, it should have a unique color that cannot be constructed as a mixture of the other classes' colors.
R-2. Each color represents a unique combination of class probabilities.

For R-2, after some thinking, I found it is impossible to avoid duplication. As soon as the number of classes is larger than 4 (3 + 1), the space of class compositions has more dimensions than the 3D RGB space, so some duplication must happen. So I ignore this requirement.

R-1 can be satisfied as long as the RGB values of all classes form a strictly convex object in the RGB 3D space, meaning that every class color is a corner of the shape. A simple example: in an RG 2D space, if we use (0,0), (1,0), (0,1), (1,1) as the class colors, no class can be represented as a composition of the other classes (with coefficients that sum to 1). However, if we add a new class such as (0.5, 1.0), it can be written as:

(0.5, 1.0) = 50% * (1, 1) + 50% * (0, 1)

This is because the extra point makes the class color geometry no longer strictly convex.

An easier solution is to define a sphere inside the RGB 3D box and take colors on the sphere's surface as the class colors. In this case, R-1 should be satisfied. However, in practice, I found that when using just the sphere, the contrast between classes is smaller than when R-1 is ignored. Here is the comparison, for 20 colors and for 100 colors, between a sphere surface representation and an RGB 3D surface representation. Currently, this is set as the default in the plot_decision_boundary function.

One great way to understand how a classifier works is to visualize its decision boundary. The scikit-learn documentation has several nice examples of visualizing decision boundaries (plot_iris, plot_voting_decision_region); however, they usually require quite a few lines of code and are not directly reusable.
So I wrote the following function, hoping it can serve as a general way to visualize the 2D decision boundary of any classification model (see Github, the notebook is Here). (Note: a few updates since I first published this. In the current version: 1. the API is much simpler; 2. dimension reduction (PCA) was added to handle higher-dimensional cases; 3. the function is wrapped into the package (pylib).) The usage of this function is quite simple, here it is: In the random forest case, we see that the decision boundary is not as continuous as in the previous two models. This is because the decision boundary is calculated from the model's predictions: if the predicted class changes at a grid cell, that cell is identified as lying on the decision boundary. As a result, if the model behaves erratically in some region, that region is displayed as if it were a decision boundary.
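The grid-based rule described above (a cell is on the boundary if the predicted class changes there) can be sketched with numpy alone. This is not the plot_decision_boundary function from the package; `boundary_mask` and the toy `nearest_centroid_predict` classifier are hypothetical names made up for this example, and any callable that maps an (n, 2) array to class labels (for instance a fitted model's `predict` method) could take the classifier's place.

```python
import numpy as np

def boundary_mask(predict, X, steps=200, margin=0.5):
    """Evaluate `predict` on a grid covering the 2D data X and flag the
    grid cells whose predicted class differs from a direct neighbor."""
    x_min, x_max = X[:, 0].min() - margin, X[:, 0].max() + margin
    y_min, y_max = X[:, 1].min() - margin, X[:, 1].max() + margin
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, steps),
                         np.linspace(y_min, y_max, steps))
    grid = np.column_stack([xx.ravel(), yy.ravel()])
    labels = np.asarray(predict(grid)).reshape(xx.shape)

    mask = np.zeros(labels.shape, dtype=bool)
    # A change between horizontal neighbors marks both cells
    diff_h = labels[:, 1:] != labels[:, :-1]
    mask[:, 1:] |= diff_h
    mask[:, :-1] |= diff_h
    # A change between vertical neighbors marks both cells
    diff_v = labels[1:, :] != labels[:-1, :]
    mask[1:, :] |= diff_v
    mask[:-1, :] |= diff_v
    return xx, yy, labels, mask

# Toy stand-in for a trained model: assign each point to the nearest centroid
centroids = np.array([[0.0, 0.0], [4.0, 0.0]])

def nearest_centroid_predict(points):
    dist = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dist.argmin(axis=1)

X = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 2.0], [4.0, 2.0]])
xx, yy, labels, mask = boundary_mask(nearest_centroid_predict, X, steps=50)
```

For this toy classifier the flagged cells form a thin vertical band around x = 2. A model whose prediction flips back and forth inside a region would light up many scattered cells there, which is the effect seen in the random forest case.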
Happy Thanksgiving! One question constantly persists: Python vs. R, which one is better for a data scientist? I have read many blogs and forum discussions, and although some answers are great, quite a few are subjective. Here I found one great, detailed comparison, and I totally agree with the author's conclusion. The original link is here: https://www.dataquest.io/blog/python-vs-r/ I converted this blog's summary into a table for easy reference:
Group-by in Pandas is widely used, but since Pandas relies heavily on the Index, it may not be very convenient to directly "chain" a group-by statement with downstream analytics statements (especially if one needs to aggregate multiple statistics for the same column). Here is a demo. To avoid this "friction", a nicer approach (for people not familiar with using the index, including me) is the following. It makes the group-by statement in Pandas index-free (good for index-frustrated users :) ).
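One way to get an index-free group-by that chains cleanly is sketched below. It may not be the exact approach from the original demo: it assumes pandas 0.25 or later (for named aggregation), and the data frame holds made-up values.

```python
import pandas as pd

# Toy data with made-up values, for illustration only
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Fare": [80.0, 60.0, 20.0, 30.0, 8.0, 7.0],
})

summary = (
    df.groupby("Pclass", as_index=False)       # keep the group key as a regular column
      .agg(fare_mean=("Fare", "mean"),         # several statistics on the same column,
           fare_max=("Fare", "max"))           # each with its own flat column name
      .query("fare_mean > 10")                 # downstream steps chain directly
      .sort_values("fare_max", ascending=False)
)
print(summary)
```

Because `as_index=False` leaves `Pclass` as an ordinary column and the named aggregation produces flat column names, the later steps can refer to every column by name without touching the index.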
The original discussion was on Github: https://github.com/pandas-dev/pandas/issues/14581

The caveat in Part 1 is about the Pie chart: if one tries to replace any go.Bar or go.Histogram with a go.Pie chart, an error is shown (plotly version 1.12.9). The reason is that the "tools.make_subplots" function creates a set of subplots based on different xaxis and yaxis objects, while go.Pie does not require (and does not have) an axis property. The way to overcome this is to build the graph from scratch, using "domain" and "anchor". To explain the concept, take a look at the "layout" in Part 1's example. Each layout axis (x and y) is attached to a specific domain on the figure, and each axis is "anchored" to a specific data trace. subplot-1 is in the upper left, so its xaxis domain is [0, 0.45] (left side) and its yaxis domain is [0.625, 1] (upper side). For a Pie plot, since it cannot have an axis, the make_subplots approach fails, but the following approach still works. I have to say, I hate using Annotation as the way to make a subplot's title ... however, I have not yet found any other way to directly assign a subplot title. If you have a better approach to simplify the code in general, feel free to comment on this blog.
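The "domain" idea for pies can be sketched without importing plotly at all, by building the figure as a plain dictionary. `pie_grid_figure` is a hypothetical helper written for this example, not part of plotly; the defaults `x_gap=0.1` and `y_gap=0.25` are chosen so that a 2 x 2 grid reproduces the [0, 0.45] and [0.625, 1] domains mentioned above. If plotly is installed, a dictionary of this shape should be accepted by `go.Figure(...)`, but treat that as an assumption to check against your plotly version.

```python
def pie_grid_figure(pies, n_cols=2, x_gap=0.1, y_gap=0.25):
    """Build a plotly-style figure dict with one pie per grid cell.

    `pies` is a list of (title, labels, values) tuples. Each pie is placed
    through its own `domain`, because pie traces have no x/y axes.
    """
    n_rows = -(-len(pies) // n_cols)  # ceiling division
    cell_w = (1.0 - x_gap * (n_cols - 1)) / n_cols
    cell_h = (1.0 - y_gap * (n_rows - 1)) / n_rows

    data, annotations = [], []
    for i, (title, labels, values) in enumerate(pies):
        row, col = divmod(i, n_cols)
        x0 = col * (cell_w + x_gap)
        y1 = 1.0 - row * (cell_h + y_gap)  # row 0 is the top row
        data.append({
            "type": "pie",
            "labels": labels,
            "values": values,
            "domain": {"x": [x0, x0 + cell_w], "y": [y1 - cell_h, y1]},
        })
        # Subplot title as an annotation in paper coordinates, above the cell
        annotations.append({
            "text": title,
            "x": x0 + cell_w / 2,
            "y": y1,
            "xref": "paper",
            "yref": "paper",
            "xanchor": "center",
            "yanchor": "bottom",
            "showarrow": False,
        })
    return {"data": data, "layout": {"annotations": annotations}}

# Made-up values, for illustration only
pies = [(f"pie-{k}", ["a", "b", "c"], [k, 2, 3]) for k in range(1, 5)]
fig_dict = pie_grid_figure(pies)
print(fig_dict["data"][0]["domain"])
```

The annotations are still needed for the titles, so this does not remove the part I dislike; it only keeps the domain arithmetic in one place.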
In Matplotlib, subplots can be easily created as follows: However, matplotlib is not very smart about rotating axis tick labels (sometimes they overlap and become hard to read), which can be troublesome for automated visualization. In general, I found that plotly offers a better automated layout. Subplots in plotly and matplotlib are conceptually different: 1. (matplotlib) figure --- axes --- artist: one figure can contain multiple axes, and each axes has its own set of artists. For example, a legend is an artist, and each axes can have its own legend. 2. (plotly) the main components of a figure are "data" and "layout"; subplots work by creating multiple data traces and assigning them to different axes. It is still one big figure, and each subplot is just a different axis pair (x, y) placed at a different location in it. Here are a few actions

There are a million ways to use a piece of software; this is my preferred way (it may not be optimal, but it is workable). As a data scientist, I mostly want to use Plotly for interactive exploratory analytics, since it provides a way to get a better feel for the data.

A data frame has an index, columns, and the values inside. A selection operation usually does not affect the table's structure; it only takes the selected pieces out. Other operations, such as group-by, may change the structure. So how do we change the structure back? Excel has the concept of a pivot table, which converts one or more columns into the index/columns and presents the data nicely. This is quite a nice feature and provides analytic insight very quickly. Does Pandas support this? The answer is "for sure!". Here are a few functions that are used very often in Pandas to manipulate the "shape" of a data frame.

1. reset_index / set_index
Very self-explanatory: reset_index changes an index back into a column, while set_index moves a column into the index.

2. pivot and pivot_table
How to do a pivot table in Pandas always gets confusing (to me), and Nikolay Grozev's blog provides a very intuitive visualization. I will use one of his figures here for easy illustration, and I encourage you to go to his blog for more details. As you can see, pivot_table can be considered an "advanced" pivot, where the table is created with more control over which aggregation function to use, while pivot provides a faster way to just "reshape" the data frame into the one needed.

3. stack and unstack
This section is also very well illustrated in Nikolay Grozev's blog. I borrow one figure here, and the reader is highly recommended to check the details in the original blog.

Let's see how it works on the Titanic data set; this is how it looks: Now, changing a data frame's shape should not be a problem any more.
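The three groups of functions above can be tried on a small frame. This is a minimal sketch on made-up values with Titanic-like column names, not the real data set and not the example from Nikolay Grozev's blog.

```python
import pandas as pd

# Toy data with made-up values; column names mimic the Titanic data set
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Sex": ["female", "male", "female", "male", "female", "male"],
    "Fare": [90.0, 70.0, 22.0, 20.0, 12.0, 10.0],
})

# 1. set_index moves columns into the index; reset_index moves them back
indexed = df.set_index(["Pclass", "Sex"])
restored = indexed.reset_index()

# 2. pivot only reshapes, so each (Pclass, Sex) pair must appear once
wide = df.pivot(index="Pclass", columns="Sex", values="Fare")

#    pivot_table also aggregates, so repeated pairs would be averaged here
wide_mean = df.pivot_table(index="Pclass", columns="Sex",
                           values="Fare", aggfunc="mean")

# 3. stack folds the columns into an inner index level (wide -> long);
#    unstack does the reverse (long -> wide)
long = wide.stack()
wide_again = long.unstack()

print(wide)
```

Since every (Pclass, Sex) pair is unique in this toy frame, `pivot` and `pivot_table` give the same numbers; they only differ once the data contains repeated pairs, where `pivot` raises an error and `pivot_table` aggregates.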
A friend once asked me a question: what is the function in Pandas that is similar to R's summarize (as in dplyr)? Surprisingly, I was not able to give a straight answer. However, after some digging, I finally found a (somewhat) satisfactory answer. First, let's look at how summarize works in dplyr (the code is borrowed from RStudio): Other functions aside, focusing on the "summarise" function, one can easily specify the aliases "arr" and "dep", the function "mean", the columns to work on, "arr_delay" and "dep_delay", and even a conditional requirement, "na.rm". This is very powerful. Looking at the alternative in Pandas, let's focus only on the "summarise" part, with the help of the Titanic data set: The functionality looks similar, but ... what about having not only "mean" but also "std" and "max" over the same column, with different aliases like dplyr's "arr" and "dep"? Then we have to change the code into: But ... what are these multiple levels of columns? This is a Pandas concept called "MultiIndex". Personally, I find a MultiIndex over columns hard to manipulate, so I prefer to drop it after the aggregation. The way to do this is: However, is there a way to do EVERYTHING in one line? I don't like to define a "df1" and then change its columns. Here is a trick: specify the columns after the "groupby", and magic will happen :) Now mission completed :)
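The steps above can be sketched on a toy frame. This is not the original post's code: the values are made up, and the one-line version uses named aggregation, which assumes pandas 0.25 or later and may differ from the syntax the post originally used.

```python
import pandas as pd

# Toy data with made-up values; column names mimic the Titanic data set
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2],
    "Age": [30.0, 40.0, 20.0, 30.0],
    "Fare": [80.0, 60.0, 20.0, 30.0],
})

# Several statistics per column produce two-level (MultiIndex) columns:
# ("Age", "mean"), ("Age", "max"), ("Fare", "mean")
multi = df.groupby("Pclass").agg({"Age": ["mean", "max"], "Fare": ["mean"]})

# Dropping the MultiIndex by joining the two levels into flat names
flat = multi.copy()
flat.columns = ["_".join(pair) for pair in flat.columns]

# One line: select the column right after groupby, then name each statistic
one_line = (
    df.groupby("Pclass")["Age"]
      .agg(age_mean="mean", age_max="max")
      .reset_index()
)
print(one_line)
```

The keyword names `age_mean` and `age_max` play the role of dplyr's aliases "arr" and "dep", and no intermediate "df1" is needed.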
Author: Data Magician