There are many different visualization packages in Python, and matplotlib is arguably the most popular one. It has been developed for around 14 years and provides almost all the functionality needed to visualize static graphs. The recent 2.0 release also makes its graphs look much nicer even with the default style, so it is definitely worth learning.
Visualization in matplotlib is not very easy: one needs to know what each graph type is capable of, along with its detailed parameters. So here I wrote some "sample code" as a kind of "cheatsheet": it uses (sometimes redundantly) many parameters to generate fairly complicated graphs. Whenever we encounter a customization request later, referring to this code can be helpful. This post covers the following graph types: Pie, Bar, Hist, Boxplot, Violinplot. The details can be seen in the following three links: Pie-Bar, Hist, Box-Violin.
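To give a flavor of what such parameter-heavy sample code looks like, here is a minimal sketch with made-up data (not the actual cheatsheet code in the linked notebooks) that calls a few of these plot types with more parameters than strictly necessary:

```python
# A minimal sketch of the "cheatsheet" idea: call each plot type with
# more parameters than strictly necessary, so the call can later be
# copied and trimmed for a specific customization request.
# The data here is made up purely for illustration.
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
samples = [np.random.normal(loc, 1.0, size=200) for loc in (0, 1, 2)]

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# Pie: explode one slice, show percentages, control the start angle
axes[0].pie([30, 25, 25, 20],
            labels=['A', 'B', 'C', 'D'],
            explode=(0.1, 0, 0, 0),
            autopct='%1.1f%%',
            startangle=90)
axes[0].set_title('Pie')

# Hist: control bin count, transparency, face and edge colors
axes[1].hist(samples[0], bins=30, alpha=0.7,
             color='steelblue', edgecolor='white')
axes[1].set_title('Hist')

# Boxplot: notches, custom widths, show means, fillable boxes
axes[2].boxplot(samples, notch=True, widths=0.5,
                showmeans=True, patch_artist=True)
axes[2].set_title('Boxplot')

fig.tight_layout()
plt.show()
```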
Initially, in the plot_decision_boundary code, I used a hard-coded color scheme to represent each class. If the number of input classes exceeded the number of predefined colors, an error was raised, which is not the best approach. So I started to think: how should the RGB representation of each class be defined for this specific use case? Basically, here are a few facts and requirements:
Fact:
F-1. Any data point has a probability associated with each class, and these probabilities sum to 1.

Requirements:
R-1. If a point is 100% in class A, it should have a unique color that cannot be constructed as a hybrid composition of the other classes' colors.
R-2. Each color represents a unique combination of class compositions.

For R-2, after some thinking, I found that such no-duplication is impossible to satisfy: as long as the number of classes is larger than 4 (the 3 RGB dimensions plus 1), the space of class compositions has a higher dimension than RGB space, so some duplication must happen. So I ignore this requirement.

R-1 can be satisfied as long as each class's RGB value is an extreme point ("corner") of the convex hull of all the class colors. A simple example: in the RG 2D plane, if we use (0,0), (1,0), (0,1), (1,1) as the class colors, no class can be represented by a composition of the other classes (with coefficients summing to 1). However, if we add a new class at (0.5, 1.0), it can be written as

(0.5, 1.0) = 50% * (1, 1) + 50% * (0, 1)

because the extra point means the class-color geometry is no longer strictly convex. An easier solution is to define a sphere within the RGB 3D box and place the class colors on the sphere's surface; since a sphere is strictly convex, every point on it is an extreme point and R-1 is satisfied. In practice, however, I found that using just the sphere gives less contrast between classes than ignoring R-1 and using the whole cube. Here is the comparison, for 20 colors and for 100 colors, between the sphere-surface representation and the RGB 3D surface representation. Currently, I set the sphere scheme as the default in the plot_decision_boundary function.

One great way to understand how a classifier works is to visualize its decision boundary. In scikit-learn, there are several nice posts about visualizing decision boundaries (plot_iris, plot_voting_decision_region); however, they usually require quite a few lines of code and are not directly reusable. So I wrote the following function, hoping it can serve as a general way to visualize the 2D decision boundary for any classification model (see Github; the notebook is Here). (Note: a few updates since the first publication; in the current version: 1. the API is much simpler, 2. dimension reduction (PCA) was added to handle higher-dimensional cases, 3. the function is wrapped into the package (pylib).) The usage of this function is quite simple.

In the random forest case, we see that the decision boundary is not as continuous as for the previous two models. This is because the decision boundary is calculated from the model's prediction results: if the predicted class changes across a grid cell, that cell is identified as being on the decision boundary. However, if the model behaves very volatilely in some region, that region will also be displayed as if it were a decision boundary.
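The grid-prediction idea just described can be sketched in a few lines. This is not the pylib plot_decision_boundary API; the function and variable names below (e.g. plot_boundary_2d) are illustrative assumptions. The idea is simply to build a grid over the 2D feature space, predict the class at every grid point, and color the resulting regions:

```python
# A minimal sketch of the grid-prediction idea behind decision boundary
# plots; the real plot_decision_boundary in pylib also handles PCA for
# higher-dimensional inputs. Names here are illustrative assumptions,
# not the package's actual API.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def plot_boundary_2d(model, X, y, step=0.02):
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, step),
                         np.arange(y_min, y_max, step))
    # Predict the class of every grid point; wherever the predicted
    # class changes between neighboring cells is the decision boundary.
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.3)
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor='k')
    plt.show()

X, y = make_classification(n_samples=300, n_features=2, n_redundant=0,
                           n_informative=2, n_clusters_per_class=1,
                           random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
plot_boundary_2d(clf, X, y)
```

Similarly, the sphere-based color scheme discussed earlier can be sketched as follows (again an illustration under assumed names, not the exact implementation used in plot_decision_boundary): every class color lies on the surface of a sphere inscribed in the RGB cube, so no class color can be expressed as a convex combination of the others, which is requirement R-1.

```python
# A sketch of the sphere idea for class colors (illustrative only).
import numpy as np

def sphere_colors(n_classes, seed=0):
    """Sample n_classes RGB colors on a sphere inscribed in [0, 1]^3."""
    rng = np.random.RandomState(seed)
    v = rng.normal(size=(n_classes, 3))            # random directions
    v /= np.linalg.norm(v, axis=1, keepdims=True)  # unit vectors
    # Radius 0.5 around the cube center keeps every color inside the cube.
    return 0.5 + 0.5 * v

def blend_color(probs, class_colors):
    """Mix class colors by predicted class probabilities (sum to 1)."""
    return np.asarray(probs, dtype=float) @ class_colors

colors = sphere_colors(5)
print(blend_color([1, 0, 0, 0, 0], colors))      # pure class 0: its own color
print(blend_color([0.5, 0.5, 0, 0, 0], colors))  # a genuine mixture
```

As noted above, restricting the colors to the sphere trades away some contrast, but it guarantees that a pure-class point gets a color that no mixture of the other classes can reproduce.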