2018.12.15 Update guide

DESKTOP-SAT83DL\yimeng.zhang
2018-12-15 13:10:44 +08:00
parent baee2e14d6
commit 60f04ca903
2 changed files with 3 additions and 0 deletions


@@ -359,6 +359,7 @@ Below is some additional resource on this topic:
#### 3.2.1 Why Discretization Matters
- helps to improve model performance by grouping values with similar predictive strength
- brings in non-linearity and thus improves the fitting power of the model
- enhances interpretability through grouped values
- minimizes the impact of **extreme values / rarely-seen reversal patterns**
- prevents the overfitting that is possible with raw numerical variables
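
For illustration, a minimal sketch of equal-frequency discretization with pandas (the `age` column, its values, and the choice of 4 bins are assumptions for the example, not part of the guide):

```python
import pandas as pd

# hypothetical numeric feature
df = pd.DataFrame({"age": [22, 25, 31, 38, 45, 52, 60, 68, 75, 81]})

# equal-frequency binning: each bin holds roughly the same number of rows,
# which caps the influence of extreme values and yields interpretable groups
df["age_bin"] = pd.qcut(df["age"], q=4, labels=["q1", "q2", "q3", "q4"])
print(df)
```

Equal-width binning (`pd.cut`) is the other common choice; `qcut` is used here because quantile bins are more robust to skewed distributions.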
@@ -401,6 +402,8 @@ We must transform strings of categorical variables into numbers so that algorithms
**Note**: if we are using one-hot encoding in linear regression, we should keep k-1 binary variables to avoid multicollinearity. This holds for any algorithm that looks at all features at the same time during training, including SVM, neural networks, and clustering. Tree-based algorithms, on the other hand, need the entire set of k binary variables to be able to select the best split.
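
As a minimal sketch of the k-1 convention with pandas (the `color` column is a hypothetical example):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# drop_first=True keeps k-1 binary columns: suitable for linear models,
# since the dropped label is implied when all remaining columns are 0
k_minus_1 = pd.get_dummies(df["color"], drop_first=True)

# the full set of k columns: what tree-based models need to pick any split
full_k = pd.get_dummies(df["color"])

print(list(k_minus_1.columns))  # ['green', 'red']          (k-1 = 2)
print(list(full_k.columns))     # ['blue', 'green', 'red']  (k = 3)
```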
**Note**: it is not recommended to use one-hot encoding with tree algorithms. One-hot encoding makes the splits highly imbalanced (each label of the original categorical feature becomes its own binary feature), so neither of the two child nodes of a split gains much purity. The predictive power of the one-hot features ends up weaker than that of the original feature, because the information has been broken into many pieces.
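
A small sketch that illustrates the effect on a synthetic example (the feature, the 10% label noise, and the depth limit are all assumptions): a depth-limited tree captures the signal from the original integer-coded feature in a single split, while the one-hot version forces it to spend one split per binary column.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
cat = rng.integers(0, 10, size=1000)  # a categorical feature with 10 labels
# target depends on the label, with 10% of the labels flipped as noise
y = ((cat >= 5).astype(int) ^ (rng.random(1000) < 0.1)).astype(int)

# original feature: one threshold split (cat >= 5) recovers the signal
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0)
tree_raw.fit(cat.reshape(-1, 1), y)

# one-hot features: the same signal is scattered over 10 weak binary
# columns, and a depth-3 tree can only consult 3 of them
tree_ohe = DecisionTreeClassifier(max_depth=3, random_state=0)
tree_ohe.fit(pd.get_dummies(cat), y)

print("raw feature :", tree_raw.score(cat.reshape(-1, 1), y))
print("one-hot     :", tree_ohe.score(pd.get_dummies(cat), y))
```

On this toy data the raw-feature tree scores close to the 90% noise ceiling, while the one-hot tree falls noticeably short of it.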
A detailed introduction to WOE (Weight of Evidence) can be found [here](http://documentation.statsoft.com/StatisticaHelp.aspx?path=WeightofEvidence/WeightofEvidenceWoEIntroductoryOverview).
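
For orientation, one common form of the per-label WOE computation, sketched with hypothetical column names and data (WOE is the log of the ratio between a label's share of events and its share of non-events):

```python
import numpy as np
import pandas as pd

# hypothetical categorical feature and binary target
df = pd.DataFrame({"cat":    ["a", "a", "a", "b", "b", "b"],
                   "target": [1,   0,   0,   1,   1,   0]})

events = df.groupby("cat")["target"].sum()              # events per label
non_events = df.groupby("cat")["target"].count() - events

# WOE_i = ln( (events_i / total events) / (non_events_i / total non-events) )
woe = np.log((events / events.sum()) / (non_events / non_events.sum()))
print(woe)  # a: ln((1/3)/(2/3)) = -0.693 ; b: +0.693
```

Sign conventions vary across references (some put non-events in the numerator), so check the linked overview before comparing values.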