Initially, in the plot_decision_boundary code, I used a hard-coded color scheme to represent each class. If the number of input classes exceeded the number of hard-coded colors, an error was raised. This may not be the best approach, so I started to think about how to define an RGB representation for each class in this specific use case. Here are the facts and requirements:

Fact:
F-1. Every data point has a probability associated with each class, and these probabilities sum to 1.

Requirements:
R-1. If a point is 100% in class A, it should have a unique color that cannot be constructed as a mixture of the colors of other classes.
R-2. Each color represents a unique combination of class probabilities.

For R-2, after some thinking, I found it is impossible to avoid duplicates. RGB space has only 3 dimensions, so as soon as the number of classes is larger than 4 (3 + 1), the class probabilities have more degrees of freedom than the color, and two different compositions must map to the same color. So I ignore this requirement.

R-1 can be satisfied as long as the RGB values of all classes are in strictly convex position in RGB 3D space, that is, every class color is a vertex of the convex hull of all class colors. A simple example: in RG 2D space, if we use (0,0), (1,0), (0,1), (1,1) as the class colors, no class color can be written as a composition of the others (with coefficients that sum to 1). However, if we add a new class such as (0.5, 1.0), it can be written as:

(0.5, 1.0) = 50% * (1, 1) + 50% * (0, 1)

This is because the extra point lies on an edge of the square, so the set of class colors is no longer strictly convex.

An easier solution is to define a sphere inside the RGB 3D box and take colors on the surface of the sphere as the class colors. In this case, R-1 is satisfied. However, in practice, I found that when using just the sphere, the contrast between classes is lower than when R-1 is ignored. Here is the comparison between 20 colors and 100 colors for a sphere surface representation vs. an RGB 3D surface representation. Currently, I set it as the default in the plot_decision_boundary function.
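The sphere idea can be sketched in a few lines. This is a minimal illustration, not the code used in plot_decision_boundary: the function names sphere_class_colors and blend are hypothetical, and spreading the points with a Fibonacci lattice is my own assumption about how to place them on the sphere.

```python
import numpy as np

def sphere_class_colors(n_classes, radius=0.5):
    """Return n_classes RGB colors on a sphere centred in the RGB cube.

    Hypothetical helper: points are spread with a Fibonacci lattice,
    which is an assumption, not the original implementation.
    """
    i = np.arange(n_classes)
    z = 1.0 - 2.0 * (i + 0.5) / n_classes        # heights in (-1, 1)
    phi = np.pi * (3.0 - np.sqrt(5.0)) * i       # golden-angle steps
    r = np.sqrt(1.0 - z ** 2)                    # ring radius at height z
    unit = np.column_stack([r * np.cos(phi), r * np.sin(phi), z])
    return 0.5 + radius * unit                   # centre is (0.5, 0.5, 0.5)

def blend(proba, colors):
    """Mix class colors using class probabilities that sum to 1 (F-1)."""
    return np.asarray(proba, dtype=float) @ np.asarray(colors, dtype=float)

# 20 class colors, all on the sphere surface and inside the RGB cube.
colors = sphere_class_colors(20)

# The RG counter-example from the text: (0.5, 1.0) is a 50/50 mix
# of (0, 1) and (1, 1), so it would violate R-1 as a class color.
square = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
mix = blend([0.0, 0.0, 0.5, 0.5], square)
```

Because every point on a sphere is an extreme point of the ball, no class color produced this way is a convex mixture of the others, which is what R-1 asks for.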
One great way to understand how a classifier works is to visualize its decision boundary. In scikit-learn, there are several nice examples of visualizing decision boundaries (plot_iris, plot_voting_decision_region); however, they usually require quite a few lines of code and are not directly reusable. So I wrote the following function, hoping it can serve as a general way to visualize the 2D decision boundary of any classification model. (see Github, the notebook is Here)

(Note: a few updates since I first published this. In the current version: 1. the API is much simpler; 2. dimension reduction (PCA) was added to handle higher-dimensional cases; 3. the function is wrapped into a package (pylib).)

The usage of this function is quite simple, here it is:

In the random forest case, we see the decision boundary is not as continuous as in the previous two models. This is because the decision boundary is calculated from the model's predictions: if the predicted class changes at a grid cell, that cell is identified as being on the decision boundary. As a result, if the model's predictions are highly volatile in some region, that region will be displayed as if it were a decision boundary.
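The grid-based boundary detection described above can be sketched as follows. This is only an illustration of the idea, not the plot_decision_boundary implementation: boundary_mask and toy_predict are hypothetical names, and predict stands for any callable that maps an (n, 2) array to class labels, such as the predict method of a fitted scikit-learn model.

```python
import numpy as np

def boundary_mask(predict, x_range, y_range, steps=200):
    """Mark grid cells where the predicted class changes (hypothetical helper)."""
    xs = np.linspace(x_range[0], x_range[1], steps)
    ys = np.linspace(y_range[0], y_range[1], steps)
    xx, yy = np.meshgrid(xs, ys)

    # Predict a class label for every grid point.
    grid = np.column_stack([xx.ravel(), yy.ravel()])
    labels = np.asarray(predict(grid)).reshape(xx.shape)

    # A cell is on the boundary if its label differs from the cell
    # to its right or from the cell in the next row.
    mask = np.zeros(labels.shape, dtype=bool)
    mask[:, :-1] |= labels[:, :-1] != labels[:, 1:]
    mask[:-1, :] |= labels[:-1, :] != labels[1:, :]
    return xx, yy, labels, mask

def toy_predict(points):
    """Stand-in classifier: class 1 when the first feature is positive."""
    return (points[:, 0] > 0).astype(int)

xx, yy, labels, mask = boundary_mask(toy_predict, (-1, 1), (-1, 1), steps=5)
```

With a noisy model, neighbouring cells flip class often, so many isolated cells end up in the mask, which matches the fragmented look of the random forest plot.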
Happy Thanksgiving! One question constantly persists: Python vs. R, which one is better for data scientists? I have read many blogs and forum discussions, and although some answers are great, quite a few are subjective. Here I found one great, detailed comparison, and I totally agree with the author's conclusion. The original link is here: https://www.dataquest.io/blog/python-vs-r/ I converted that post's summary into a table for easy reference:
Group-by is widely used in Pandas, and since Pandas relies heavily on the Index, it may not be very convenient to directly "chain" a group-by statement with downstream analytics statements (especially if one needs to aggregate multiple statistics for the same column). Here is a demo. To avoid this "friction", a nicer approach (for people not familiar with using the index, including me) is the following. It makes the group-by statement in Pandas index-free (good for index-frustrated users :) ).
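The original demo is not reproduced here, so the following is only a minimal sketch of one index-free pattern, assuming pandas 0.25 or later for named aggregation. The DataFrame, its column names, and the output column names are made up for illustration.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 2.0, 6.0],
})

# Aggregate several statistics of the same column, name each result,
# then move the group key back into an ordinary column.
summary = (
    df.groupby("group")
      .agg(value_mean=("value", "mean"), value_max=("value", "max"))
      .reset_index()
)

# The result has flat columns, so it chains with ordinary column operations.
high = summary[summary["value_mean"] > 3.0]
```

The result has plain columns (group, value_mean, value_max) instead of a group index and multi-level column labels, so downstream statements can refer to columns by name.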
The original discussion was on GitHub: https://github.com/pandas-dev/pandas/issues/14581
Author: Data Magician