Git is a very popular version control system for tracking changes in computer files and coordinating work on those files among multiple people (Wikipedia). It is widely used in Data Science projects to keep track of code and support parallel development. Git can be used in very complicated ways; however, for Data Scientists, we can keep it simple. In this post, I am going to walk through the main use cases for a "Solo Master".
Note: there are many awesome resources out there about "what is Git" and the basic concepts in Git; for that, I would refer to the official Git website ("Getting Started -- Git Basics"). Now we can start with a cool project! First, go to Github to create an empty project, then configure it properly on your local laptop.

Case 1. One working space, nothing goes wrong
This is the ideal and simplest situation: all you need to do is add files, commit the code, and push to the remote master branch. Life is easy in this situation.

Case 2. One working space, mistake before "git add"
This always happens... you started playing with an idea, added some draft code to a file, and quickly figured out the idea does not work; now you want a clean slate again. How? Fortunately, if you haven't run "git add" on the file yet, this is very easy. For more details, refer to "git checkout".

Case 3. One working space, mistake before "git commit"
You thought the idea was going to work, added a few files, made some changes, ran a few "git add"s, and finally figured out the result is not right. Now you want to get rid of the mess and go back to the nice, old code. For more details, refer to "git reset".

Case 4. One working space, mistake before "git push"
You went even further this time: not only did you "git add", but the modification took a few hours and you also made a few commits! Another huge mistake -- what to do?! For more details, refer to "git reset".

Case 5. One working space, mistake after "git push"
You pushed the code to production, and other members found a big mistake/bug. Now you need to revert the code back to where it was. For more details, refer to "git revert".

Case 6. Multiple working spaces
You have two working spaces: one on your company laptop, one on your company workstation. You develop feature 2 in one working space and feature 3 in the other. Now you see the problem, and the solution is to use "git pull" first: "git pull" = "git fetch" + "git merge" (or "git fetch" + "git rebase"). For details, refer to "git pull". After both spaces pull and push, the remote branch contains both features. As long as you develop each individual feature in its own working space, this process causes no problem, and it is better practice than working on the same feature in different working spaces: if the same file is modified in different spaces, the merge will produce many conflicts, and resolving them is a huge deal for a "solo master".
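To make the cases above concrete, here is a small cheatsheet of the commands I would reach for in each case; the file name (draft.py), the commit hash placeholder, and the branch name (master) are only illustrative.

```bash
# Case 2: discard edits in a tracked file that was never staged
git checkout -- draft.py

# Case 3: unstage a file that was added by mistake (keeps the edits) ...
git reset HEAD draft.py
# ... or drop the local edits entirely
git checkout -- draft.py

# Case 4: undo local commits that were never pushed
git reset --soft HEAD~1            # keep the changes staged
git reset --hard origin/master     # or throw everything away and match the remote

# Case 5: the bad commit is already pushed -- add a new commit that undoes it
git revert <bad-commit-hash>
git push origin master

# Case 6: sync a second working space before developing a new feature
git pull --rebase origin master
```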
Great -- after these simple case studies, you have become a real "solo master" in Git. You will never lose any code (it is always pushed to the cloud) or worry about code inconsistency across multiple working spaces (as long as "git pull" is used correctly). Enjoy using Git!
I love taking various online education resources to broaden my view and knowledge base. Recently, I finished the "Executive Data Science" specialization on Coursera and found it quite helpful, so here are some thoughts on what I learned. This series of courses does not provide the technical knowledge to become a data scientist, but offers the insights and toolkit to lead a data science project and manage a data science team. Although its discussion mostly focuses on a "statistical analytics insights" data scientist setting, I think quite a few concepts still transfer to a "machine learning product" data scientist environment. The highlights of this specialization (for me) are the following:
1. How to build a team: the different focus of, and cooperation between, the "data engineer", "data scientist", and "business analyst" roles. Although in reality we often wear all three hats, it is good to recognize that there are intrinsic differences, so that as the team grows, we know what hiring or skill sets the next stage requires.
2. How to manage a project: basically, using statistical sense to identify the top priority and move agilely in the right direction.
Most of the discussion is common sense in the data science domain, and it really helps to get a summarized view of how to prioritize data science work, identify the right talent to do it, and continuously monitor and guide the project to produce results. I appreciate the faculty at Johns Hopkins for producing this great specialization. Happy learning!

There are many different visualization packages in Python, and matplotlib is arguably the most popular one. It has been developed for around 14 years and provides almost all the functionality needed to visualize static graphs. The recent 2.0 upgrade makes its graphs much nicer even with the default style, so it is definitely worth learning.
Visualization in matplotlib is not very easy: one needs to know what each graph type is capable of and its detailed parameters. So here I wrote some "sample code" as a "cheatsheet": it uses (sometimes redundantly) many parameters to generate fairly complicated graphs, so that whenever we encounter a customization request later, referring to this code will usually be helpful. This post covers the following graph types: Pie, Bar, Hist, Boxplot, Violinplot. The details can be seen in the following three links: Pie-Bar, Hist, Box-Violin.

Initially, in the plot_decision_boundary code, I used a hard-coded color scheme to represent each class, and an error was raised if the number of input classes exceeded it. That is not the best approach, so I started to think about how to define each class's RGB representation for this specific use case. Basically, here are a few facts/requirements:
Fact:
F-1. Any data point has a probability associated with each class, and the probabilities sum to 1.
Requirements:
R-1. If a point is 100% in class A, it should have a unique color that cannot be constructed as a hybrid composition of the other classes' colors.
R-2. Each color represents a unique combination of class compositions.

For R-2, after some thinking, I found it is impossible to avoid such duplication: as long as the number of classes is larger than 4 (3 + 1), the dimensionality difference means a duplication must happen, so I ignore this requirement. Requirement R-1 can be satisfied as long as the RGB values of all classes form a strictly convex object within the RGB 3D space. A simple example: in an RG 2D space, if we use (0,0), (1,0), (0,1), (1,1) as class colors, no class can be represented as a composition of the other classes (with coefficients summing to 1). However, if we add a new class like (0.5, 1.0), it can be written as (0.5, 1.0) = 50% * (1, 1) + 50% * (0, 1), because the extra point makes the class-color geometry no longer strictly convex. An easier solution is to define a sphere within the RGB 3D cube and take the colors on the sphere surface as the class colors; in this case R-1 is satisfied. However, in practice I found that using just the sphere gives lower contrast between classes than ignoring R-1. Here is a comparison of 20 colors and 100 colors between a sphere-surface representation and an RGB 3D cube-surface representation. Currently, I set the sphere as the default in the plot_decision_boundary function.

One great way to understand how a classifier works is to visualize its decision boundary. In scikit-learn, there are several nice posts about visualizing decision boundaries (plot_iris, plot_voting_decision_region); however, they usually require quite a few lines of code and are not directly reusable. So I wrote the following function, hoping it can serve as a general way to visualize the 2D decision boundary of any classification model (see Github; the notebook is here). (Note: a few updates after the first publication; in the current version: 1. the API is much simpler; 2. dimension reduction (PCA) is added to handle higher-dimension cases; 3. the function is wrapped into the package (pylib).) The usage of this function is quite simple. In the random forest case, we see the decision boundary is not as continuous as for the previous two models. This is because the decision boundary is calculated from the model's predictions: if the predicted class changes across a grid cell, that cell is identified as being on the decision boundary. So if the model behaves very volatilely in some region, that region is displayed as if it were a decision boundary.
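Neither helper below is the actual pylib implementation; this is only a minimal sketch of the two ideas above, with made-up names (sphere_colors, plot_decision_boundary_2d): class colors are spread over a sphere surface inside the RGB cube, and the boundary is obtained by predicting every point of a 2D mesh grid.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

def sphere_colors(n_classes, center=0.5, radius=0.5):
    """Spread n_classes colors over a sphere surface inside the RGB cube (the R-1 idea)."""
    i = np.arange(n_classes)
    theta = np.arccos(1 - 2 * (i + 0.5) / n_classes)   # polar angle
    phi = np.pi * (1 + 5 ** 0.5) * i                    # golden-angle spacing
    rgb = center + radius * np.stack([np.sin(theta) * np.cos(phi),
                                      np.sin(theta) * np.sin(phi),
                                      np.cos(theta)], axis=1)
    return np.clip(rgb, 0, 1)

def plot_decision_boundary_2d(model, X, y, grid_step=0.02):
    """Predict the class of every mesh-grid point and shade the resulting regions."""
    colors = sphere_colors(len(np.unique(y)))
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, grid_step),
                         np.arange(y_min, y_max, grid_step))
    zz = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.contourf(xx, yy, zz, levels=np.arange(len(colors) + 1) - 0.5,
                 colors=[tuple(c) for c in colors], alpha=0.3)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=ListedColormap(colors), edgecolors="k")
    plt.show()

# usage with any fitted scikit-learn style classifier on two features
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier

X, y = make_moons(noise=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
plot_decision_boundary_2d(clf, X, y)
```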
Happy Thanksgiving! One question constantly persists: Python vs. R -- which one is better for data scientists? I have read many blogs and forum discussions, and although some answers are great, quite a few are subjective. Here I found one great, detailed comparison, and I totally agree with the author's conclusion; the original link is here: https://www.dataquest.io/blog/python-vs-r/ I converted the blog's summary into a table for easy reference.
Group-by in Pandas is widely used, but since Pandas relies heavily on the Index, it is not always convenient to directly "chain" a group-by statement with downstream analytics statements (especially if one needs to aggregate multiple statistics for the same column); here is a demo. To avoid this "friction", a nicer approach (for people not used to working with the index, including me) is the following, which makes the group-by statement in Pandas index-free (good for index-frustrated users :) ).
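The original demo gist is not embedded in this extract, so the snippet below is my own reconstruction of the two styles on a made-up toy frame (columns "city" and "sales").

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["SF", "SF", "NY", "NY", "NY"],
    "sales": [10, 20, 5, 15, 25],
})

# index-based: the group key becomes the index and the aggregated columns keep
# the raw function names, which is awkward to chain further
by_index = df.groupby("city")["sales"].agg(["sum", "mean", "count"])

# index-free: reset the index (and rename) so the result is a plain frame that
# can be chained straight into a merge, sort_values, plot, ...
flat = (
    df.groupby("city")["sales"]
      .agg(["sum", "mean", "count"])
      .reset_index()
      .rename(columns={"sum": "sales_sum", "mean": "sales_mean", "count": "n"})
      .sort_values("sales_sum", ascending=False)
)
```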
The original discussion is on Github: https://github.com/pandas-dev/pandas/issues/14581

In SQL, by defining a specific "window", one can perform a "calculation across a set of table rows that are somehow related to the current row" (from the PostgreSQL docs). This greatly extends SQL's analytics power. The implementation in PySpark is (syntactically) quite close to SQL: one literally has to define a "window". Pandas also has "window" functions, but I found they are more like rolling windows rather than SQL's window functionality (correct me if I am wrong). My preferred way to replicate window functions in Pandas and PySpark is sketched below. From this comparison, PySpark looks more SQL-straightforward, with a nice syntax for it; however, Pandas is actually more versatile, and many different window-like calculations can be expressed through "assign". Good or bad? I think it depends.
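The original side-by-side snippet is not reproduced here, so this is a small reconstruction under made-up column names ("dept", "salary"); the PySpark half assumes a local Spark installation.

```python
import pandas as pd

df = pd.DataFrame({
    "dept":   ["a", "a", "a", "b", "b"],
    "salary": [100, 200, 300, 150, 250],
})

# Pandas: window-style columns via assign + groupby().transform()/rank(), which
# keep the original row count and just add per-group values
out_pd = df.assign(
    dept_avg=lambda d: d.groupby("dept")["salary"].transform("mean"),
    rank_in_dept=lambda d: d.groupby("dept")["salary"].rank(ascending=False),
)

# PySpark: the window is declared explicitly, very close to the SQL syntax
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(df)
w = Window.partitionBy("dept").orderBy(F.desc("salary"))
out_sp = (sdf
          .withColumn("dept_avg", F.avg("salary").over(Window.partitionBy("dept")))
          .withColumn("rank_in_dept", F.rank().over(w)))
```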
The caveat in Part 1 is about the Pie chart: if one tries to replace any go.Bar or go.Histogram with a go.Pie chart, an error is raised (plotly version 1.12.9). The reason is that the "tools.make_subplots" function creates a set of subplots based on different xaxis and yaxis entries, while go.Pie does not require (and does not have) an axis property. The way to overcome this is to build the graph from scratch, using "domain" and "anchor". To explain the concept, take a look at the "layout" in Part 1's example: each layout axis (x and y) is attached to a specific domain of the figure, with the axis "anchored" to specific data. Subplot 1 is at the upper left, so its xaxis domain is [0, 0.45] (left side) and its yaxis domain is [0.625, 1] (upper side). For a Pie plot, since it cannot have an axis, the make_subplots approach fails, but the following approach still works. I have to say, I hate using an Annotation as the way to make a subplot's title... however, I currently haven't found any other way to directly assign subplot titles. If you have a better approach to simplify the code in general, feel free to comment on this blog.
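Below is a minimal sketch of that "domain" approach (the labels, values, and domain numbers are placeholders): each go.Pie trace is placed by hand into a rectangle of the figure, and the subplot titles are faked with paper-anchored annotations.

```python
import plotly.graph_objs as go
from plotly.offline import plot

# two pies placed by hand via "domain" instead of tools.make_subplots,
# since go.Pie has no x/y axis to anchor to
fig = go.Figure(
    data=[
        go.Pie(labels=["a", "b", "c"], values=[1, 2, 3],
               domain={"x": [0.0, 0.45], "y": [0.55, 1.0]}, name="left"),
        go.Pie(labels=["a", "b", "c"], values=[3, 2, 1],
               domain={"x": [0.55, 1.0], "y": [0.55, 1.0]}, name="right"),
    ],
    layout=go.Layout(
        # subplot "titles" have to be faked with paper-anchored annotations
        annotations=[
            dict(text="left pie", x=0.20, y=1.05, xref="paper", yref="paper", showarrow=False),
            dict(text="right pie", x=0.80, y=1.05, xref="paper", yref="paper", showarrow=False),
        ]
    ),
)
plot(fig, filename="pie_subplots.html")
```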
In matplotlib, subplots can be pulled out very easily, as shown below. However, matplotlib is not smart enough to handle axis-tick rotation (sometimes the tick labels collapse together and are hard to read), and this can be troublesome for an automated visualization process; in general, I found plotly offers better automated layout. Subplots in plotly and matplotlib are conceptually different:
1. (matplotlib) figure --- axes --- artist: one figure can contain multiple axes, and each axes has its own set of artists. For example, a legend is an artist, and each axes can have its own legend.
2. (plotly) the main components of a figure are "data" and "layout"; subplots work by creating multiple traces and attaching them to different axes. It is still one big figure: each subplot is just a different axis pair (x, y) placed at a different location within that figure.
Here are the two approaches in action.
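A minimal sketch of the two mental models, with made-up data: matplotlib builds one figure holding several axes objects, while plotly builds one data list plus one layout in which each extra axis occupies its own domain.

```python
import numpy as np
import matplotlib.pyplot as plt
import plotly.graph_objs as go

x = np.arange(10)

# Matplotlib: one figure object holding multiple axes; each axes owns its own
# artists (lines, legend, ticks, ...)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(x, x ** 2, label="x^2")
ax1.legend()
ax2.plot(x, np.sqrt(x), label="sqrt(x)")
ax2.legend()
plt.tight_layout()
plt.show()

# Plotly: one "data" list plus one "layout"; the second subplot is just a trace
# anchored to a second axis pair that occupies a different domain of the figure
plotly_fig = go.Figure(
    data=[
        go.Scatter(x=x, y=x ** 2, name="x^2"),
        go.Scatter(x=x, y=np.sqrt(x), xaxis="x2", yaxis="y2", name="sqrt(x)"),
    ],
    layout=go.Layout(
        xaxis=dict(domain=[0.0, 0.45]),
        yaxis=dict(domain=[0.0, 1.0]),
        xaxis2=dict(domain=[0.55, 1.0], anchor="y2"),
        yaxis2=dict(domain=[0.0, 1.0], anchor="x2"),
    ),
)
```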