Overview of Tree-based Algorithms in under 10 Minutes
Machine learning is not all about AI, but it is a big part of it.
Machine Learning has emerged as a new way of communicating your wishes to a computer. It’s exciting because it allows you to automate the ineffable. Machine learning is powering most of the recent advancements in AI, including Computer Vision, NLP (Natural Language Processing), Predictive Analytics, Chatbots, and a wide range of applications. To move up the data value chain from the information level to the knowledge level, we need to apply machine learning that will enable systems to identify patterns in data and learn from those patterns to apply to new, never-seen data.
Difference between Supervised Learning, Unsupervised Learning, and Semi-Supervised Learning
Machine Learning use cases primarily fall into 3 categories: Supervised Learning, Unsupervised Learning, and Semi-Supervised Learning.
1) Supervised Learning
In the Supervised Learning use case, we have a known set of input data and the corresponding set of responses for it. The aim is to build a model which aims to learn the patterns and relationships between different input features (called as training phase) to generate a reasonable prediction as a response to the new input data (called as testing phase). Most of the real-world problems at scale, addressed by different organizations, fall in this category.
2) Unsupervised Learning
Unlike the Supervised Learning use case, Unsupervised learning involves finding inferences and hidden patterns from input data without references to any labeled responses or outcomes. This ability to discover patterns in information makes it an ideal solution for Cross-Selling strategies, Customer Segmentation, Image and Pattern recognition.
3) Semi-Supervised Learning
Semi-Supervised Learning, on the other hand, offers solutions that use the best of both worlds: Supervised and Unsupervised Learning. This approach is mostly adopted when there is an absence of good quantity or quality input data with labeled outcomes to train a supervised learning model. The smaller labeled data set is used to identify hidden patterns and the larger unlabelled dataset is used to extract features to improve upon the lack of quantity of labeled dataset.
What are Tree-based Algorithms?
Tree-based algorithms are one of the best and most used supervised learning methods, this is mainly due to the reason that these predictive models offer high accuracy, stability, and ease of interpretation as they map non-linear relationships quite well. This series is a journey designed to help readers not only to get a peek under the hood of popularly used state of the art tree-based algorithms but also to help readers to reach a level of proficiency so that one can make a better choice at adopting them to solve the problem statement at hand. This blog specifically provides an overview of the following Tree-Based Algorithms, which are popularly used in the ML community:
- Decision Trees
- Random Forests
- Gradient Boosting Machines
1) Understanding Decision Trees
Decision Trees are one of the simplest and intuitive ways of predictive modeling. As the name suggests, we construct a tree-like model of decisions. It’s a flowchart-like structure that performs several tests on different features of the data point. Based on the tests, the data point will either be assigned to a class or a continuous value outcome. That is to say, Decision Trees can be used both for Classification and Regression use cases.
But how are Decision Trees constructed, in the first place? There are various algorithmic approaches to construct a decision tree-like ID3, C4.5, CART, etc. The main parameter which separates these different algorithms is the Splitting Criterion which is used while selecting the decision nodes of the Tree. The splitting criterion helps in deciding which feature to use and the threshold value to perform the decision-making.
We want the Decision Tree Model to be small and generalized to the training set, hence the main intent while performing the splits using Decision Nodes is to obtain the purest child nodes. It’s the different ways of measuring this purity which has let us come up with different splitting criterion.
Information Gain is one such criterion. For each node of the tree, we measure how much information the feature gives us about the class. The split with the highest information gain will be taken first and the process continues until all the child nodes are pure, or the information gain is 0. Information Gain is calculated by measuring the difference between the entropy of the original dataset before and after the split. ID3 uses information gained for constructing the decision trees.
Decision Trees essentially combine complex rules to give the correct predictions, because of this design, it is prone to Overfitting. In other words, A small change in the training set can cause a large change in the structure of the decision tree causing instability. In Addition, using a Decision Tree is relatively expensive, compared to other tree-based models, when training a large dataset.
Hence, a Decision Tree becomes a good choice when we have a small to medium-sized dataset and model interpretability is required. In other cases, it proves to be inefficient and hence other tree-based algorithms are preferred which we will explore in the next section.
2) Understanding Random Forest
Random Forest is essentially inspired to solve the overfitting problems in Decision Trees. As the name suggests, this model consists of many individual Decision Trees whose predictions are aggregated (Averaging in case of Regression and Max Voting in case of Classification).
The idea behind aggregating the prediction is quite simple, it is based on the Wisdom of Crowds. It says that A large crowd with uncorrelated opinions combined provides more wisdom than an individual opinion. In Data Science terminology, A large number of relatively uncorrelated models operation as a group will outperform any individual constituents. Uncorrelated models protect from propagating the errors of the base model. While some trees might be wrong, other trees might be correct, and overall as a group, they learn the correct patterns from the data instead of overfitting like a decision tree.
But how does the Random Forest Ensure to train individual decision trees which are uncorrelated? A simple answer to this would be, to supply different samples of data to each Decision Tree and grow each Decision Tree using a specific subset of features. This technique to randomly sample with replacement to build individual trees is also known as Bootstrap Aggregation or Bagging. Bagging in combination with Feature Randomness ensures that each tree is trained on different data and a different subset of features, ensuring that results are uncorrelated.
Although Random Forest was successful in mitigating the overfitting problem (or High Variance issue) in Decision Trees it is at a cost of increased compute and resources. Since multiple Decision Trees need to be built and maintained for predictions, it tends to take longer time, higher compute, larger resources, and lacks interpretability because of the ensemble nature. Despite these drawbacks, the model’s efficiency in giving correct predictions and ability to be used both for Regression and Classification problems has led to it being a popular algorithm that is used across multiple domains.
3) Understanding Gradient Boosting Machines
To understand GBM, we first need to touch upon the concept of Boosting. Unlike Bagging, where predictions of base models are aggregated to come up with a final prediction, Boosting focuses on building a strong learner by iteratively or sequentially learning patterns from the data using a consecutive chain of weak learners. Each tree, in this chain, focuses on correcting the net error generated from the previous tree. The first tree is built on the features with the highest predictive power which then passes on the predictions and error (difference between the actual and prediction) as an input to subsequent tree along with a weight parameter. The weights are used to control the second tree to utilize only those features which will fine-tune the combined predictions of the first and second tree to be close to the target as much as possible. The final prediction of the GBM is the weighted sum of all the individual predictions.
Now that we know how Boosting works, the question arises on how we come up with the optimal weights to be assigned to the predictions of the individual weak learner. This is where the Gradient Descent Algorithm comes in place. In the context of GBM, we want to build an additive model, wherein Trees are added one at a time while existing trees in the model are not changed. Gradient Descent aims to add trees that reduce the loss function (and follow the gradient). The weights assigned to each tree are updated to minimize the error. We can limit the number of trees to be added by fixing a number or acceptable threshold of the loss function to stop the training.
Because of this design, Gradient Boosting is a Greedy Algorithm and hence will keep on improving to minimize all errors leading to overfitting. Additionally, because trees must be trained sequentially it is computationally expensive and memory exhaustive for a larger dataset. That said, GBM is great at handling complex, non-linear relationships and are more powerful and accurate as compared to Random Forest or Decision Trees. In addition, Data Pre-processing steps to be done before using GBM are minimal, as it Handles the Missing Data and works great with Categorical Data as well.
4) Understanding XGBoost
XGBoost or Extreme Gradient Boosting is a modification of the GBM Architecture designed for Speed and Performance. It is an effort that is directed to push the limits of computations resources for boosted tree algorithms. Because of the Execution Speed and Model Performance, this is one of the go-to models in hackathons and has been recently dominating the applied machine learning community and Kaggle competitions which requires building ML models on structured or tabular data.
But what makes XGBoost so special? It’s the System Optimization and Algorithmic Enhancements on top of the existing GBM framework, which makes it stands out from the rest ML algorithm. Firstly, XGBoost approaches the Sequential Tree Building using Parallelized implementation. In addition, it addresses the overfitting problem in GBM by using the “depth-first” approach to pruning trees backward which adds to the computational performance. Secondly, The library is designed with cache awareness and out-of-core computing which optimizes the disk space while handling large datasets. Lastly, The Algorithmic Enhancements such as built-in Cross-Validation, Sparsity Awareness, and Regularization make it robust and enable it to deliver good model performance.
5) Understanding LightGBM
Like XGBoost, this is an open-source library that provides more efficient and effective implementation of the Gradient Boosting Algorithm. This Tree-Based model can be used both for Classification and Regression. This extends the idea of GBM by adding in a type of Auto Feature Selection as well as focusing on data points with large gradients. This results in a dramatic speed-up of training (Hence, it is called a lighter version of GBM) and predictive performance.
The high efficiency in terms of computing and high model performance can be attributed to two novel techniques GOSS or Gradient-Based One-Side Sampling and EFB or Exclusive Feature Bundling. GOSS is a modification of the Gradient Boosting method which gives more weightage to those examples which results in a larger gradient, which speeds up the learning process. EFB is an approach to bundle sparse (mostly zero) categorical features which have been one-hot encoded, thus acting as an Automatic Feature selection method.
But how does it compare against XGBoost? LightGBM has many of the XGBoost’s advantages such as Sparse Optimization, Parallel Training, Regularization, Early Stopping, etc. But a major difference lies in the way the trees are constructed. Light GBM grows trees Leaf-wise, unlike other Tree Ensemble methods which grow trees level-wise row by row. The leaf which leads to the largest decrease in loss is selected and a Tree is grown from the output of that Leaf.
Although it’s not fair to compare LightGBM and XGBoost, prior works which tried to benchmark the optimizations in these modified versions of GBM suggest XGBoost as a powerful algorithm that reduces the maximum training time ( when used on a GPU). It was also seen that LightGBM is not fast enough to converge to a good set of hyperparameters, as compared to XGBoost. That said, both these algorithms enjoy a fair bit of popularity in the ML community when it comes to Hackathons and working on use cases with a large dataset.
6) Understanding CatBoost
The name CatBoost comes from two words “Category” and “Boosting”. CatBoost is an open-source machine learning algorithm developed by Yandex. This algorithm is built on top of a Gradient Boosting Architecture, wherein consecutive trees are spawned to decrease the net error in prediction, the only difference being CatBoost uses oblivious decision trees to grow a balanced tree. That is to say that it uses the same features to make left and right splits for each level of the tree. This design helps in more efficient usage of CPU, in turn reducing the training time.
One of the key highlights of this algorithm is the ease of dealing with categorical features in the dataset. Almost all the ML models which have been built require all the training data to be numeric. That means, if we have a categorical feature, we will have to employ Label Encoding or One Hot Encoding technique, else it would fail during the model building stage. CatBoost, on the other hand, doesn’t require explicit pre-processing to convert the categories, instead it internally uses various statistical measures to combine different categorical features and numerical features to achieve the same. It is also important to note that providing One Hot Encoded inputs to CatBoost models can decrease its efficiency in terms of model performance.
As far as the performance is concerned, this model provides state-of-the-art results and stands in the same league as other Boosting Algorithms like XGBoost, LGBM, etc. The ease of use and the robustness towards overfitting leads to building a more generalized model which performs well for both regression and classification problems.
I hope by now, you have developed a fair bit of understanding about the evolution of Tree-Based Algorithms in the Machine Learning Landscape along with the key features which differentiate each of them. Stay tuned for more such “under 10 minutes” series of blogs, in the same space, where we will delve deeper into each of the algorithms with an ML use case to understand the intricacies to be taken care of while using them.
Build machine learning models within minutes
Anirudha Nayak is a Data Scientist at Subex AI Labs. His work focuses on Building AI Solutions for Telecom Service Providers to help them with the timely detection of Fraud occurring in the system. Apart from this, he is also a Backend Developer for “Hypersense AI Studio” – a no-code AI Automation platform offered by Subex. He holds a Post Graduate Diploma in Data Science certification from the Praxis Business School with a major focus on Machine Learning, Recommendation Systems, and Deep Learning. In his leisure time, he enjoys reading books on Self-Improvement, Personal Finance, and Philosophy.