集成学习方法

对 \( t = 1, 2, \ldots, T \)：
- 从 \( D \) 中有放回地随机抽取 \( n \) 个样本得到 \( D_t \)
- 在 \( D_t \) 上训练基学习器 \( h_t \)
输出集成预测：
- 分类：\( H(x) = \arg\max_y \sum_{t=1}^T \mathbb{I}(h_t(x) = y) \)
- 回归：\( H(x) = \frac{1}{T}\sum_{t=1}^T h_t(x) \)

方差减少分析

若基学习器的方差为 \( \sigma^2 \)，两两之间的相关系数为 \( \rho \)，则集成后的方差为：

\[ \text{Var}\left(\frac{1}{T}\sum_{t=1}^T h_t\right) = \frac{\rho \sigma^2 (T-1) + \sigma^2}{T} = \rho\sigma^2 + \frac{(1-\rho)\sigma^2}{T} \]

当 \( T \to \infty \) 时，方差趋近于 \( \rho\sigma^2 \)。因此，降低基学习器之间的相关性是Bagging成功的关键。

Python实现

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bagging_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=1.0,      # 每次抽样比例
    max_features=1.0,     # 每次使用的特征比例
    bootstrap=True,       # 有放回抽样
    oob_score=True,
    random_state=42,
    n_jobs=-1,
)

bagging_model.fit(X_train, y_train)
print(f"OOB准确率: {bagging_model.oob_score_:.4f}")

Boosting方法

Boosting基本思想

Boosting是一族将弱学习器提升为强学习器的方法。其核心思想是：每个新的基学习器重点关注前一轮被错误分类的样本，通过逐步纠正错误来提升整体性能。

与Bagging的并行训练不同，Boosting采用串行的方式依次训练基学习器。

AdaBoost算法

算法流程

AdaBoost（Adaptive Boosting）由Freund和Schapire于1995年提出。

输入：训练集 \( D = {(x_i, y_i)}_{i=1}^n \)，其中 \( y_i \in {-1, +1} \)

初始化样本权重：\( w_1(i) = \frac{1}{n} \)，对所有 \( i = 1, \ldots, n \)

对 \( t = 1, 2, \ldots, T \)：

使用权重分布 \( w_t \) 训练基学习器 \( h_t(x) \)
计算加权误差率： \[ \epsilon_t = \sum_{i=1}^n w_t(i) \cdot \mathbb{I}(h_t(x_i) \neq y_i) \]
计算学习器权重： \[ \alpha_t = \frac{1}{2} \ln\frac{1 - \epsilon_t}{\epsilon_t} \]
更新样本权重： \[ w_{t+1}(i) = \frac{w_t(i) \cdot \exp(-\alpha_t y_i h_t(x_i))}{Z_t} \] 其中 \( Z_t \) 是归一化常数。

输出：\( H(x) = \text{sign}\left(\sum_{t=1}^T \alpha_t h_t(x)\right) \)

理论分析

AdaBoost的训练误差上界：

\[ \frac{1}{n}\sum_{i=1}^n \mathbb{I}(H(x_i) \neq y_i) \leq \prod_{t=1}^T Z_t = \prod_{t=1}^T 2\sqrt{\epsilon_t(1-\epsilon_t)} \]

若每个基学习器的误差率 \( \epsilon_t \leq \frac{1}{2} - \gamma \)，则：

\[ \prod_{t=1}^T Z_t \leq \exp(-2\gamma^2 T) \]

即训练误差以指数速率下降。

Python实现

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ada_model = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # 决策树桩
    n_estimators=200,
    learning_rate=0.1,
    algorithm='SAMME.R',
    random_state=42,
)

ada_model.fit(X_train, y_train)
y_pred_ada = ada_model.predict(X_test)
print(f"AdaBoost准确率: {ada_model.score(X_test, y_test):.4f}")

梯度提升决策树（GBDT）

算法原理

GBDT（Gradient Boosting Decision Tree）由Friedman于2001年提出，将Boosting问题视为函数空间中的梯度下降。

对于损失函数 \( L(y, F(x)) \)，GBDT在每一步拟合负梯度（伪残差）：

\[ r_{ti} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]{F=F{t-1}} \]

算法流程：

初始化：\( F_0(x) = \arg\min_\gamma \sum_{i=1}^n L(y_i, \gamma) \)
对 \( t = 1, 2, \ldots, T \)：
- 计算伪残差：\( r_{ti} = -\frac{\partial L(y_i, F_{t-1}(x_i))}{\partial F_{t-1}(x_i)} \)
- 对伪残差拟合决策树 \( h_t(x) \)
- 通过线搜索确定步长：\( \rho_t = \arg\min_\rho \sum_{i=1}^n L(y_i, F_{t-1}(x_i) + \rho h_t(x_i)) \)
- 更新模型：\( F_t(x) = F_{t-1}(x) + \eta \cdot \rho_t h_t(x) \)
输出：\( F_T(x) \)

其中 \( \eta \) 为学习率（收缩系数）。

常用损失函数

任务类型	损失函数	负梯度
回归（平方损失）	\( \frac{1}{2}(y-F)^2 \)	\( y - F \)
回归（绝对损失）	\(	y-F
回归（Huber损失）	混合	混合
分类（对数损失）	\( \log(1+e^{-yF}) \)	\( \frac{y}{1+e^{yF}} \)

Python实现

from sklearn.ensemble import GradientBoostingClassifier

gbdt_model = GradientBoostingClassifier(
    n_estimators=300,
    learning_rate=0.1,
    max_depth=5,
    min_samples_split=5,
    min_samples_leaf=3,
    subsample=0.8,         # 行采样比例
    max_features='sqrt',   # 列采样
    random_state=42,
)

gbdt_model.fit(X_train, y_train)
y_pred_gbdt = gbdt_model.predict(X_test)
print(f"GBDT准确率: {gbdt_model.score(X_test, y_test):.4f}")

XGBoost

核心改进

XGBoost（eXtreme Gradient Boosting）由陈天奇于2016年提出，在GBDT基础上做了多项优化：

正则化目标函数： \[ \text{Obj}^{(t)} = \sum_{i=1}^n L(y_i, \hat{y}i^{(t-1)} + f_t(x_i)) + \Omega(f_t) \] 其中正则化项 \( \Omega(f_t) = \gamma T + \frac{1}{2}\lambda\sum{j=1}^T w_j^2 \)，\( T \) 为叶节点数，\( w_j \) 为叶节点权重。
二阶泰勒展开： \[ \text{Obj}^{(t)} \approx \sum_{i=1}^n \left[g_i f_t(x_i) + \frac{1}{2}h_i f_t^2(x_i)\right] + \Omega(f_t) \] 其中 \( g_i = \partial_{\hat{y}^{(t-1)}} L(y_i, \hat{y}^{(t-1)}) \)，\( h_i = \partial^2_{\hat{y}^{(t-1)}} L(y_i, \hat{y}^{(t-1)}) \)。
最优叶节点权重： \[ w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda} \]
分裂增益： \[ \text{Gain} = \frac{1}{2}\left[\frac{(\sum_{i \in I_L} g_i)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i + \lambda}\right] - \gamma \]

工程优化

列块并行：预排序特征值，支持并行计算分裂点
缓存感知访问：优化内存访问模式
稀疏感知：自动处理缺失值

Python实现

import xgboost as xgb
from sklearn.metrics import accuracy_score

# 创建DMatrix（XGBoost优化的数据结构）
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# 设置参数
params = {
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'max_depth': 6,
    'learning_rate': 0.1,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'min_child_weight': 5,
    'reg_alpha': 0.1,        # L1正则化
    'reg_lambda': 1.0,       # L2正则化
    'gamma': 0.1,            # 最小分裂增益
    'tree_method': 'hist',   # 直方图加速
    'random_state': 42,
}

# 训练（带早停）
evals = [(dtrain, 'train'), (dtest, 'eval')]
xgb_model = xgb.train(
    params,
    dtrain,
    num_boost_round=500,
    evals=evals,
    early_stopping_rounds=50,
    verbose_eval=50,
)

# 预测
y_pred_xgb = (xgb_model.predict(dtest) > 0.5).astype(int)
print(f"XGBoost准确率: {accuracy_score(y_test, y_pred_xgb):.4f}")

# 特征重要性
xgb.plot_importance(xgb_model, max_num_features=10)

LightGBM

核心创新

LightGBM由微软于2017年发布，主要创新包括：

基于直方图的分裂算法（Histogram-based）：
- 将连续特征离散化为 \( k \) 个箱（bins）
- 时间复杂度从 \( O(n \cdot p) \) 降为 \( O(k \cdot p) \)
- 内存占用大幅减少
带深度限制的叶子优先生长策略（Leaf-wise）：
- 每次选择增益最大的叶节点进行分裂
- 相比层级生长（Level-wise），能更快降低损失
- 需要限制最大深度防止过拟合
梯度单侧采样（GOSS）：保留梯度大的样本，对梯度小的样本随机采样，在保持精度的同时加速训练。
互斥特征绑定（EFB）：将互斥的稀疏特征绑定为一个特征，减少特征数量。

Python实现

import lightgbm as lgb

# 创建数据集
train_data = lgb.Dataset(X_train, label=y_train)
test_data = lgb.Dataset(X_test, label=y_test, reference=train_data)

# 设置参数
params = {
    'objective': 'binary',
    'metric': 'auc',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'min_child_samples': 20,
    'reg_alpha': 0.1,
    'reg_lambda': 1.0,
    'verbose': -1,
}

# 训练
lgb_model = lgb.train(
    params, train_data, num_boost_round=1000,
    valid_sets=[test_data],
    callbacks=[lgb.early_stopping(50), lgb.log_evaluation(50)],
)

# 预测
y_pred_lgb = (lgb_model.predict(X_test) > 0.5).astype(int)
print(f"LightGBM准确率: {accuracy_score(y_test, y_pred_lgb):.4f}")

Stacking方法

基本原理

Stacking（Stacked Generalization）由Wolpert于1992年提出，其核心思想是使用一个元学习器（meta-learner）来组合多个基学习器的预测结果。

两层结构

第一层（基学习器层）：训练多个不同类型的基学习器
第二层（元学习器层）：以第一层学习器的输出作为输入，训练元学习器

训练流程

为避免过拟合，Stacking通常采用K折交叉验证生成元特征：

将训练集 \( D \) 分为 \( K \) 折
对每个基学习器 \( h_m \)（\( m = 1, \ldots, M \)）：
- 对每折 \( k \)：
  - 用除第 \( k \) 折外的数据训练 \( h_m \)
  - 对第 \( k \) 折数据做预测，得到元特征
拼接所有折的预测，得到完整的元特征矩阵 \( Z \in \mathbb{R}^{n \times M} \)
用 \( (Z, y) \) 训练元学习器 \( g \)

最终预测：

\[ H(x) = g(h_1(x), h_2(x), \ldots, h_M(x)) \]

Python实现

from sklearn.ensemble import (
    StackingClassifier,
    RandomForestClassifier,
    GradientBoostingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# 定义基学习器
base_estimators = [
    ('rf', RandomForestClassifier(n_estimators=200, random_state=42)),
    ('gbdt', GradientBoostingClassifier(n_estimators=200, random_state=42)),
    ('svm', SVC(kernel='rbf', probability=True, random_state=42)),
    ('knn', KNeighborsClassifier(n_neighbors=5)),
]

# 定义元学习器
meta_learner = LogisticRegression(random_state=42)

# 构建Stacking模型
stacking_model = StackingClassifier(
    estimators=base_estimators,
    final_estimator=meta_learner,
    cv=5,                    # 5折交叉验证生成元特征
    stack_method='predict_proba',
    n_jobs=-1,
)

stacking_model.fit(X_train, y_train)
y_pred_stack = stacking_model.predict(X_test)
print(f"Stacking准确率: {stacking_model.score(X_test, y_test):.4f}")

多层Stacking

在实际竞赛中，有时会使用多层Stacking结构。第一层由多个基学习器生成元特征，第二层由中间学习器进一步组合，第三层由最终元学习器输出预测。关键是每层都需要通过交叉验证生成元特征以避免信息泄漏。

方法对比

核心特征比较

特征	Bagging	Boosting	Stacking
训练方式	并行	串行	分层
样本使用	Bootstrap抽样	加权全样本	交叉验证
基学习器	同质（通常）	同质	异质（通常）
降低什么	方差	偏差	偏差和方差
过拟合风险	低	中等	较高
计算复杂度	中等（可并行）	较高（串行）	最高

Boosting方法内部比较

特征	AdaBoost	GBDT	XGBoost	LightGBM
提出时间	1995	2001	2016	2017
优化方式	指数损失	一阶梯度	二阶梯度	二阶梯度
正则化	无显式	有限	完善	完善
树生长策略	-	Level-wise	Level-wise	Leaf-wise
缺失值处理	不支持	不支持	自动处理	自动处理
大数据支持	一般	一般	好	最好
训练速度	快	中等	快	最快
内存占用	低	中等	中等	低

选择建议

数据量小、特征少：AdaBoost或GBDT
数据量中等：XGBoost
数据量大：LightGBM
追求稳定性和鲁棒性：Bagging/随机森林
追求最高精度：Stacking或Boosting系列
需要快速迭代：LightGBM

实际案例分析

案例：房价预测中的集成方法对比

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.ensemble import (
    RandomForestRegressor, GradientBoostingRegressor,
    BaggingRegressor, StackingRegressor,
)
from sklearn.linear_model import Ridge
import xgboost as xgb
import lightgbm as lgb

# 生成回归数据
X, y = make_regression(
    n_samples=2000, n_features=20, n_informative=10,
    noise=10, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 定义并评估多个模型
models = {
    'Bagging': BaggingRegressor(n_estimators=200, random_state=42),
    'Random Forest': RandomForestRegressor(n_estimators=200, random_state=42),
    'GBDT': GradientBoostingRegressor(
        n_estimators=300, learning_rate=0.1, max_depth=5, random_state=42
    ),
    'XGBoost': xgb.XGBRegressor(
        n_estimators=300, learning_rate=0.1, max_depth=6, random_state=42
    ),
    'LightGBM': lgb.LGBMRegressor(
        n_estimators=300, learning_rate=0.1, num_leaves=31, random_state=42
    ),
}

for name, model in models.items():
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='r2')
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    test_r2 = r2_score(y_test, y_pred)
    test_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    print(f"{name:15s}: CV R2={cv_scores.mean():.4f}, "
          f"Test R2={test_r2:.4f}, RMSE={test_rmse:.2f}")

# Stacking集成
stacking_model = StackingRegressor(
    estimators=[
        ('rf', RandomForestRegressor(n_estimators=200, random_state=42)),
        ('gbdt', GradientBoostingRegressor(n_estimators=200, random_state=42)),
        ('xgb', xgb.XGBRegressor(n_estimators=200, random_state=42)),
    ],
    final_estimator=Ridge(alpha=1.0),
    cv=5,
)
stacking_model.fit(X_train, y_train)
y_pred_stack = stacking_model.predict(X_test)
print(f"{'Stacking':15s}: Test R2={r2_score(y_test, y_pred_stack):.4f}")

可视化比较

import matplotlib.pyplot as plt

def plot_model_comparison(names, r2_scores, rmse_scores):
    """绘制模型对比图"""
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    axes[0].barh(names, r2_scores, color='steelblue', alpha=0.8)
    axes[0].set_xlabel('R2 Score')
    axes[0].set_title('各模型 R2 得分对比')
    axes[1].barh(names, rmse_scores, color='coral', alpha=0.8)
    axes[1].set_xlabel('RMSE')
    axes[1].set_title('各模型 RMSE 对比')
    plt.tight_layout()
    plt.savefig("ensemble_comparison.png", dpi=150)
    plt.show()

应用注意事项与局限性

通用注意事项

过拟合防控：
- Boosting方法需要早停（early stopping）
- 使用正则化参数（L1、L2、gamma等）
- 控制树的深度和叶节点数
学习率与迭代次数的权衡：
- 较小的学习率需要更多的迭代次数
- 通常 \( \eta \in [0.01, 0.3] \)
- 经验法则：\( \eta \times T \approx \text{const} \)
特征工程：
- 树模型对特征尺度不敏感，无需标准化
- 但特征交叉和组合仍可能提升性能
- 类别特征的编码方式会影响效果
不平衡数据处理：
- 使用scale_pos_weight（XGBoost）或is_unbalance（LightGBM）
- 过采样/欠采样
- 使用合适的评估指标（AUC、F1等）

各方法的局限性

Bagging的局限性：对偏差高的模型改善有限；无法捕捉样本间的顺序关系。

Boosting的局限性：对噪声和异常值敏感；训练时间较长（串行）；需要仔细调参防止过拟合。

Stacking的局限性：计算成本最高；容易过拟合；实现复杂度高；需要大量数据才能发挥优势。

实践建议

基线模型：先用随机森林建立基线，再尝试Boosting方法
超参数调优：使用Optuna或Hyperopt进行贝叶斯优化
交叉验证：始终使用交叉验证评估，避免信息泄漏
模型融合：竞赛中可尝试多种方法的融合
计算资源：大数据优先使用LightGBM，小数据可用XGBoost

总结

集成学习是机器学习中最强大的技术范式之一，其核心在于通过多个学习器的协作来超越单一学习器的性能上限：

Bagging：通过降低方差提升稳定性，适合高方差模型
Boosting：通过降低偏差提升精度，适合追求最优性能的场景
Stacking：通过异质集成最大化多样性，适合竞赛冲榜

在数学建模实践中，建议根据问题特点和计算资源选择合适的集成策略，同时注意模型的可解释性和部署效率之间的权衡。

Keyboard shortcuts

数学建模指南