A look into Regularized Greedy Forest

Bijula Ratheesh
6 min read · Nov 15, 2020

As the name suggests, Regularized Greedy Forest is a forest that is greedy and regularized :-). Let’s delve into the details.

Introduction

The decision tree is one of the most commonly used techniques for classification and regression problems. A decision tree makes no assumptions about the underlying variables, which makes it one of the simplest algorithms for both linear and non-linear problems. A tree consists of a root node, internal nodes, and terminal nodes (leaves). Splits at the nodes are chosen based on entropy or Gini impurity, and accuracy is evaluated using the misclassification rate.
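For intuition, here is a small Python sketch (my own illustration, not taken from any particular library’s internals) of how the Gini impurity and entropy mentioned above are computed for the labels reaching a node:

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum over classes of p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy: -sum over classes of p_k * log2(p_k)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(gini([0, 0, 0, 0]))     # 0.0 -> a pure node
print(gini([0, 0, 1, 1]))     # 0.5 -> a maximally impure binary node
print(entropy([0, 0, 1, 1]))  # 1.0
```

A splitter simply picks the candidate split that reduces this impurity the most.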

Although decision trees are easy to interpret and require little pre-processing of the data, they suffer from several disadvantages:

1. They are sensitive to noise and can over-fit noisy data.

2. Small variations in the data can produce a very different tree (high variance).

3. They are biased towards the majority class on imbalanced datasets.

4. Training is expensive for large datasets.

To address these issues, several ensemble techniques were built on top of decision trees: boosting, bagging, and stacking.

Boosting builds several weak learners (small decision trees) on the whole data sequentially, so that each subsequent tree reduces the errors of the previous ones; each “boost” is essentially one additive step towards the final model. Boosting reduces bias (under-fitting), and as a consequence it can over-fit. Examples: AdaBoost, Gradient Boosting, XGBoost, LightGBM.

Bagging, or bootstrap aggregating, draws multiple bootstrap samples from the dataset (sampling with replacement), fits a weak learner to each, and aggregates the results of the individual trees into the final prediction. Bagging reduces variance (over-fitting). Example: Random Forest.

Both methods can still suffer from over-fitting or under-fitting; however, careful tuning of the hyperparameters largely avoids this.

Stacking combines the predictions of several base models through a meta-learner, which helps trade off bias and variance.
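As a quick, hedged illustration of the two main ensemble styles (the dataset and estimators below are my own choices using scikit-learn, not something prescribed by the article):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Boosting: shallow trees fitted sequentially, each one correcting the previous errors.
boosting = GradientBoostingClassifier(n_estimators=100, max_depth=3, learning_rate=0.1)

# Bagging: deep trees fitted independently on bootstrap samples, predictions averaged.
bagging = RandomForestClassifier(n_estimators=100, random_state=0)

print("boosting accuracy:", cross_val_score(boosting, X, y, cv=5).mean())
print("bagging accuracy: ", cross_val_score(bagging, X, y, cv=5).mean())
```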

Now for the regularized greedy forest: in the research paper “Learning Nonlinear Functions Using Regularized Greedy Forest”, the authors propose a method that directly learns a decision forest via fully-corrective regularized greedy search over the underlying forest structure. It is a modification of the gradient boosting algorithm.

Gradient Boosting Architecture

Gradient boosting is a boosting algorithm that uses gradient descent to find the minimum of the loss function. Let’s look at the problem set-up and its step-by-step implementation.

Let h(x) be a nonlinear function learned from training data (x_1, y_1), …, (x_n, y_n).

Our training goal is to find a nonlinear prediction function h’(x) that minimizes the empirical risk, h’(x) = argmin_h Σ_{i=1}^{n} L(h(x_i), y_i), where L is the loss function.

Step 1:

Fit a single initial model to all of the data by finding the constant prediction that optimizes

f0 = argmin_c Σ_{i=1}^{n} L(c, y_i),

where L(h(x), y) is the loss function (mean squared error for regression, log loss for classification).

Using gradient descent (or a closed-form solution, e.g. the mean of the targets for squared loss), we obtain a single constant value, so the initial prediction h’(x) is simply the constant f0.

Step 2 (iterate over the trees):

2.1 Specify the maximum number of trees K (e.g. 100).

2.2 For k = 1 to K:

2.3 Compute the pseudo-residual r_{ik} of each record as the negative gradient of the loss with respect to the current prediction, r_{ik} = −∂L(y_i, h(x_i)) / ∂h(x_i) evaluated at the current model; for squared loss this is simply y_i minus the current prediction.

2.4 Fit a regression tree to the residual values and create its leaf nodes.

2.5 For each leaf node, compute the predicted value that minimizes the loss over the records falling in that leaf.

2.6 Update the prediction as h_k(x) = h_{k−1}(x) + s · (output of the new tree), where s is the learning rate (shrinkage). It is introduced to keep each update small so that the model does not over-correct and over-fit.

Step 3:

After the iterations are complete, the final prediction for the target variable is the initial constant f0 plus the sum of the shrunken contributions of all K trees.
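Putting the three steps together, here is a minimal from-scratch sketch for regression with squared loss (my own illustrative code using scikit-learn’s DecisionTreeRegressor as the weak learner; for squared loss the pseudo-residuals are just the targets minus the current prediction):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, K=100, s=0.1, max_depth=3):
    """Gradient boosting for squared loss: a constant f0, then K residual-fitting trees."""
    f0 = y.mean()                    # Step 1: constant model minimizing squared loss
    pred = np.full(len(y), f0)       # current prediction h'(x) on the training set
    trees = []
    for k in range(K):               # Step 2: iterate over the trees
        residuals = y - pred         # 2.3: pseudo-residuals (negative gradient for squared loss)
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)       # 2.4-2.5: fit a regression tree to the residuals
        pred += s * tree.predict(X)  # 2.6: shrink each tree's contribution by the learning rate s
        trees.append(tree)
    return f0, trees

def predict(X, f0, trees, s=0.1):
    """Step 3: final prediction = initial constant + all shrunken tree outputs."""
    return f0 + s * sum(tree.predict(X) for tree in trees)
```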

A few advantages are:

1. The method can automatically capture nonlinear interactions via decision tree learning, and

2. It has relatively few tuning parameters for a nonlinear learning scheme (the main ones are the shrinkage parameter s, the number of terminal nodes per tree, and the number of trees).

Disadvantages are:

1. There is no explicit regularization in the algorithm.

2. The learning rate has to be kept small, and a small learning rate can lead to a very large model and hence high computational cost.

3. The algorithm itself is not necessarily the most effective way to construct a decision forest, since the procedure separates tree learning from forest learning.

Fully corrective GBDT

A fully-corrective gradient boosting decision tree removes the need for a small learning rate: at every iteration k it re-optimizes the coefficients s_1, …, s_k of all the trees built so far, (s_1, …, s_k) = argmin Σ_{i=1}^{n} L(Σ_{j=1}^{k} s_j h_j(x_i), y_i), instead of applying a fixed shrinkage to the newest tree only.

This algorithm, however, is still a greedy search and is therefore prone to over-fitting.
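To make the fully-corrective idea concrete, here is a hedged sketch for squared loss (my own simplification, not the paper’s exact algorithm): after each new tree is added, the coefficients of all trees built so far are jointly re-fitted by least squares instead of applying a fixed small shrinkage s.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fully_corrective_boost(X, y, K=50, max_depth=3):
    """Greedy boosting where all tree coefficients are re-optimized at every iteration."""
    trees, coefs = [], np.array([])
    pred = np.zeros(len(y))
    for k in range(K):
        residuals = y - pred
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        trees.append(tree)
        # Column j holds the outputs of tree j on the training data.
        H = np.column_stack([t.predict(X) for t in trees])
        # Fully-corrective step: jointly re-fit s_1..s_k by least squares.
        coefs, *_ = np.linalg.lstsq(H, y, rcond=None)
        pred = H @ coefs
    return trees, coefs
```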

Regularized Greedy Forest (RGF) Architecture

RGF improves on fully-corrective gradient boosting by adding an explicit regularization term and by working with an additive model defined over the nodes of the forest. Three key changes are:

1. The nonlinear function h(x) is explicitly defined as an additive model over forest nodes (rather than over trees), consistent with the underlying forest structure; this also exposes the structured sparsity of the model. In this framework it is also possible to grow multiple trees simultaneously.

2. RGF introduces an explicit regularization function R(h) on the nonlinear function h(x) and optimizes the regularized loss Q(F) = L(h_F(X), Y) + R(h_F).

3. RGF employs a fully-corrective greedy algorithm on the new nonlinear function h’(x), which repeatedly re-optimizes the coefficients of all the decision rules obtained so far while new rules are added to the forest by greedy search.

Definition

Let F represent a forest, and let each node v of F be associated with a pair (b_v, α_v), where b_v is the basis function that the node represents and α_v is the weight or coefficient assigned to it. The additive model of the forest is

h_F(x) = Σ_{v ∈ F} α_v · b_v(x),

and we need to minimize the regularized loss (mean squared error for regression, log loss for classification)

Q(F) = L(h_F(X), Y) + R(h_F),

where R(h_F) is the regularization function.

The training objective of RGF is to build a forest that minimizes Q(F). Since the exact optimum is difficult to find, the basis functions are selected greedily and the weights are then optimized. The procedure alternates between two main components, summarized symbolically after the list below:

· Fix the weights, and change the structure of the forest (which changes the basis functions) so that the loss Q(F) is reduced the most.

· Fix the structure of the forest, and change the weights so that the loss Q(F) is minimized.
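In symbols (my own paraphrase of the procedure, with N(F) denoting the set of forests reachable from F by one structural change, such as splitting a leaf or starting a new tree):

```latex
% Structure search: weights fixed, pick the structural change that most reduces Q(F)
F_{t+1} = \arg\min_{F' \in \mathcal{N}(F_t)} Q(F')

% Weight correction: structure fixed, re-optimize all node weights
\{\alpha_v\}_{v \in F_{t+1}} = \arg\min_{\{\alpha_v\}}
    L\big(h_{F_{t+1}}(X), Y\big) + R\big(h_{F_{t+1}}\big)
```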

Two types of regularization are considered here (the first is sketched after the list):

1) L2 regularization on leaf-only models: this simply chooses the given leaf-only model as the unique representation and uses the standard L2 regularization.

2) Minimum-penalty regularization: choose the model that minimizes some penalty as the unique representative of all the equivalent models.
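For the first option the penalty takes the familiar L2 form over the leaf weights (a sketch, assuming λ denotes the regularization strength; the exact min-penalty variants are defined in the paper):

```latex
R(h_F) = \lambda \sum_{v \in \mathrm{leaves}(F)} \alpha_v^{2}
```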

Hyperparameters for RGF

max_leaf: Training will be terminated when the number of leaf nodes in the forest reaches this value. It should be large enough so that a good model can be obtained at some point of training, whereas a smaller value makes training time shorter.

Loss: Loss function

LS: square loss

Expo: exponential loss

Log: logistic loss

Algorithm:

RGF: RGF with L2 regularization on leaf-only models

RGF Opt: RGF with min-penalty regularization

RGF Sib: RGF with min-penalty regularization with the sum-to-zero sibling constraints
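These names correspond to the options exposed by the third-party rgf_python package and its scikit-learn wrapper. Assuming that package is installed (pip install rgf_python), a minimal usage sketch looks like this (the parameter values are illustrative, not recommendations):

```python
from rgf.sklearn import RGFClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# algorithm: "RGF" (L2 on leaf-only models), "RGF_Opt" (min-penalty),
#            "RGF_Sib" (min-penalty with sum-to-zero sibling constraints)
# loss:      "LS" (square), "Expo" (exponential), "Log" (logistic)
clf = RGFClassifier(max_leaf=1000, algorithm="RGF", loss="Log", l2=0.1)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```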
