In the previous passage, I talked about the concept of the Decision Tree and its use. Although it is a very powerful model that can handle both regression and classification tasks, a decision tree usually suffers from high variance. This means that if we split the dataset into two halves at random and fit a decision tree to each half, we will get quite different results. Thus we need an approach that reduces variance, possibly at the expense of a little bias.
Bagging, which was designed for exactly this situation, is a procedure for reducing the variance of weak models. In bagging, a random sample of the training set is drawn with replacement - which means individual data points can be chosen more than once - and a weak learner, such as a decision tree, is fit to each sample. Finally, we aggregate the predictions of the base learners to get a more accurate estimate.
We build B distinct sample datasets from the training set using the bootstrap, calculate a prediction from each of the B training sets, and average them in order to obtain a low-variance statistical model:

f_bag(x) = (1/B) * sum_{b=1}^{B} f*b(x)

where f*b(x) is the prediction of the model fit on the b-th bootstrapped training set.
While bagging can reduce the variance of many models, it is particularly useful for decision trees. To apply bagging, we simply construct B separate bootstrapped training sets and train B individual decision trees on them. Each tree is grown deep and left unpruned, so every tree has high variance but low bias; averaging these trees therefore reduces the variance.
Bagging has three steps to complete: bootstrapping, parallel training, and aggregating.
The key benefits of bagging include a reduction in variance (and hence in overfitting), the ability to turn many unstable weak learners into a strong ensemble, and training that parallelizes easily because each base learner is fit independently.
The key disadvantages of bagging are the loss of the interpretability a single decision tree offers and the higher computational and memory cost of training and storing many models.
Now let's practice using bagging to improve the performance of models. The scikit-learn Python machine learning library provides easy access to the bagging method.
First, we use the make_classification function to construct a synthetic classification dataset to practice bagging on.
Here, we make a binary classification dataset with 3,000 observations and 30 input features, and hold out 25% of it as a test set.
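A minimal sketch of what this step might look like; the exact make_classification arguments (n_informative, n_redundant, random_state) and the 75/25 split are illustrative assumptions rather than the original code:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# 3,000 observations, 30 input features, binary target
X, y = make_classification(n_samples=3000, n_features=30,
                           n_informative=15, n_redundant=5,
                           random_state=42)

# Hold out 25% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```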
(2250, 30) (750, 30) (2250,) (750,)
To demonstrate the benefits of the bagging model, we first build a single decision tree and compare it to the bagging model.
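Below is a sketch of evaluating the baseline tree with repeated stratified 10-fold cross-validation; it reuses the X_train/y_train arrays from the split above, and the fold counts and random_state are assumptions:

```python
from numpy import mean, std
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Evaluate a single, fully grown decision tree with repeated stratified k-fold CV
tree = DecisionTreeClassifier(random_state=42)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)
tree_scores = cross_val_score(tree, X_train, y_train,
                              scoring='accuracy', cv=cv, n_jobs=-1)
print('Decision tree accuracy: %.3f (%.3f)' % (mean(tree_scores), std(tree_scores)))
```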
Now we construct an ensemble model using the bagging technique.
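A sketch of the corresponding bagging ensemble, evaluated with the same cross-validation object cv defined above; n_estimators=100 is an assumed value, and BaggingClassifier uses a decision tree as its default base estimator:

```python
from sklearn.ensemble import BaggingClassifier

# Bagging ensemble of 100 decision trees, scored with the same repeated k-fold CV
bagging = BaggingClassifier(n_estimators=100, random_state=42)
bag_scores = cross_val_score(bagging, X_train, y_train,
                             scoring='accuracy', cv=cv, n_jobs=-1)
print('Bagging accuracy: %.3f (%.3f)' % (mean(bag_scores), std(bag_scores)))
```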
Based on the results, we can see that the ensemble model reduces both bias (higher accuracy) and variance (lower standard deviation). The bagging model's accuracy is 0.066 higher than that of a single decision tree.
Make Predictions
BaggingClassifier can make predictions for new cases using the predict function.
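For example, continuing with the bagging model and test split defined above, and treating the first test row as a stand-in for a new case:

```python
# Fit the bagging model on the full training set and predict a new case
bagging.fit(X_train, y_train)

# A single observation must be passed as a 2-D array (1 row, 30 features)
print(bagging.predict(X_test[:1]))
```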
Next we build a bagging model for a regression problem. Similarly, we use the make_regression function to create a regression dataset.
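A minimal sketch of the regression dataset; the sizes, noise level, and split ratio are assumptions chosen to mirror the classification example:

```python
from sklearn.datasets import make_regression

# Synthetic regression dataset with 30 input features
X_reg, y_reg = make_regression(n_samples=3000, n_features=30,
                               n_informative=15, noise=0.1, random_state=42)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_reg, y_reg, test_size=0.25, random_state=42)
```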
As before, we use repeated k-fold cross-validation to evaluate the model. One thing, however, is different from the classification case: the cross-validation routine expects a utility function rather than a cost function. In other words, it assumes that greater is better rather than smaller.
The scikit-learn package therefore makes the metric negative, as in neg_mean_squared_error, so that it can be maximized instead of minimized. This means a negative MSE closer to zero is better. To report the MSE itself, we simply negate the returned score.
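The sketch below evaluates both a single regression tree and a bagging regressor with repeated 10-fold cross-validation and the neg_mean_squared_error metric, negating the scores when printing; the hyperparameter values are assumptions, and the data arrays come from the regression split above:

```python
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import RepeatedKFold
from sklearn.tree import DecisionTreeRegressor

cv_reg = RepeatedKFold(n_splits=10, n_repeats=3, random_state=42)

# Single decision tree regressor
tree_reg = DecisionTreeRegressor(random_state=42)
tree_mse = cross_val_score(tree_reg, Xr_train, yr_train,
                           scoring='neg_mean_squared_error', cv=cv_reg, n_jobs=-1)
print('Tree MSE: %.3f (%.3f)' % (-mean(tree_mse), std(tree_mse)))

# Bagging ensemble of regression trees
bag_reg = BaggingRegressor(n_estimators=100, random_state=42)
bag_mse = cross_val_score(bag_reg, Xr_train, yr_train,
                          scoring='neg_mean_squared_error', cv=cv_reg, n_jobs=-1)
print('Bagging MSE: %.3f (%.3f)' % (-mean(bag_mse), std(bag_mse)))
```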
The mean squared error for the decision tree and its variance are shown in the output above.
On the other hand, the bagging regressor performs much better than a single decision tree: it achieves both a lower mean squared error and a lower variance. Bagging reduces both bias and variance here.
In this section, we explore how to tune the hyperparameters for the bagging model.
We demonstrate this by performing a classification task.
Recall that bagging is implemented by building a number of bootstrapped samples and then fitting a weak learner to each of them. The number of models we build corresponds to the parameter n_estimators.
Generally, the number of estimators can be increased until the performance of the ensemble model converges; it is worth noting that a very large n_estimators will not lead to overfitting.
Now let's try different numbers of trees and examine the change in performance of the ensemble model.
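A sketch of the experiment, looping over an assumed grid of ensemble sizes and reusing the cross-validation setup from earlier; the numbers printed below come from the original run, so a re-run will differ slightly:

```python
# Compare bagging ensembles of different sizes
results, names = [], []
for n in [10, 50, 100, 200, 300, 500, 1000, 2000]:
    model = BaggingClassifier(n_estimators=n, random_state=42)
    scores = cross_val_score(model, X_train, y_train,
                             scoring='accuracy', cv=cv, n_jobs=-1)
    results.append(scores)
    names.append(str(n))
    print('Number of Trees %d:' % n, round(mean(scores), 3), round(std(scores), 3))
```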
Number of Trees 10: 0.862 0.038
Number of Trees 50: 0.887 0.025
Number of Trees 100: 0.888 0.027
Number of Trees 200: 0.89 0.027
Number of Trees 300: 0.888 0.027
Number of Trees 500: 0.888 0.028
Number of Trees 1000: 0.892 0.027
Number of Trees 2000: 0.889 0.029
Let's look at the distribution of scores.
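One way to do this is a box plot of the per-fold accuracies collected in the loop above; this matplotlib sketch assumes the results and names lists are still available:

```python
import matplotlib.pyplot as plt

# Box plot of cross-validation accuracy for each ensemble size
plt.boxplot(results, showmeans=True)
plt.xticks(range(1, len(names) + 1), names)
plt.xlabel('n_estimators')
plt.ylabel('accuracy')
plt.show()
```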
In this case, we can see that the performance of the bagging model converges at around 0.888 once we grow about 100 trees; the accuracy is essentially flat after that.
Now let's explore the number of samples drawn for each bootstrapped dataset. The default is to draw the same number of samples as the original training set contains.
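This corresponds to the max_samples parameter. The sketch below varies the sampled fraction over an assumed grid from 0.1 to 1.0 while keeping n_estimators=100 and the same cross-validation setup:

```python
# Vary the fraction of the training set drawn for each bootstrap sample
for ratio in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]:
    model = BaggingClassifier(n_estimators=100, max_samples=ratio, random_state=42)
    scores = cross_val_score(model, X_train, y_train,
                             scoring='accuracy', cv=cv, n_jobs=-1)
    print('Sample Ratio %.1f:' % ratio, round(mean(scores), 3), round(std(scores), 3))
```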
Sample Ratio 0.1: 0.801 0.04
Sample Ratio 0.2: 0.83 0.039
Sample Ratio 0.3: 0.849 0.029
Sample Ratio 0.4: 0.842 0.031
Sample Ratio 0.5: 0.856 0.039
Sample Ratio 0.6: 0.866 0.037
Sample Ratio 0.7: 0.856 0.033
Sample Ratio 0.8: 0.868 0.036
Sample Ratio 0.9: 0.866 0.025
Sample Ratio 1.0: 0.865 0.035
Similarly, we can look at the distribution of these scores.
The rule of thumb is to set max_samples to 1.0, but this does not mean that every training observation will appear in a given bootstrap sample. Since we draw data from the training set at random with replacement, only about 63% of the training instances are sampled, on average, for each predictor; the remaining 37% are not sampled and are therefore called out-of-bag instances.
Since a predictor never sees its out-of-bag (OOB) samples during training, it can be evaluated on these instances without the need for additional cross-validation after training. We can use out-of-bag evaluation in scikit-learn by setting oob_score=True.
Let's try to use the out-of-bag score to evaluate a bagging model.
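A minimal sketch, assuming the same synthetic training data as above; setting oob_score=True asks scikit-learn to compute the out-of-bag accuracy, exposed afterwards as the oob_score_ attribute:

```python
# Out-of-bag evaluation: each training instance is scored by the trees
# that did not see it during training
oob_model = BaggingClassifier(n_estimators=100, bootstrap=True,
                              oob_score=True, random_state=42)
oob_model.fit(X_train, y_train)
print(oob_model.oob_score_)
```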
According to this oob evaluation, this BaggingClassifier is likely to achieve about 87.6% accuracy on the test set. Let’s verify this:
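A sketch of the check, scoring the fitted model on the held-out test split with accuracy_score:

```python
from sklearn.metrics import accuracy_score

# Compare the OOB estimate against accuracy on the held-out test set
y_pred = oob_model.predict(X_test)
print(accuracy_score(y_test, y_pred))
```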
The BaggingClassifier class supports sampling the features as well. This is controlled by two hyperparameters: max_features and bootstrap_features. They work the same way as max_samples and bootstrap, but for feature sampling instead of instance sampling. Thus, each predictor will be trained on a random subset of the input features.
The random sampling of features is particularly useful for high-dimensional inputs, such as images. Randomly sampling both features and instances is called Random Patches. On the other hand, keeping all instances (bootstrap=False, max_samples=1.0) while sampling features (bootstrap_features=True, max_features smaller than 1.0) is called Random Subspaces.
The random subspace ensemble is an extension of the bagging ensemble model in which each base learner is trained on a random subset of the features in the training set. Although very similar to a Random Forest, the random subspace ensemble differs from it in only two aspects: each tree is fit on the entire training set rather than on a bootstrap sample, and the random feature subset is chosen once per tree rather than at every split point. A configuration sketch follows the next paragraph.
Sampling features results in even more predictor diversity, trading a bit more bias for a lower variance.
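As an illustration, the sketch below configures a BaggingClassifier as a Random Subspaces ensemble (all instances, a random half of the features per tree); max_features=0.5 is an assumed value, and the evaluation reuses the earlier cross-validation setup:

```python
# Random Subspaces: keep every training instance, sample features per tree
subspace_model = BaggingClassifier(n_estimators=100,
                                   bootstrap=False, max_samples=1.0,
                                   bootstrap_features=True, max_features=0.5,
                                   random_state=42)
scores = cross_val_score(subspace_model, X_train, y_train,
                         scoring='accuracy', cv=cv, n_jobs=-1)
print('Random Subspaces accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
```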