Ensemble Methods

Overview

Ensemble methods combine several trained models to obtain better predictive performance than a single model typically achieves.

Core Principle

"Wisdom of Crowds"

If the individual models make different (weakly correlated) errors, combining them lets those errors partly cancel, which improves performance.
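The cancellation effect can be illustrated with a small calculation. The sketch below assumes a binary task and fully independent voters that all share the same accuracy; both are simplifying assumptions made for this illustration, and the function name `majority_vote_accuracy` is hypothetical.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent voters is correct.

    Assumes a binary task and independent voters with identical accuracy;
    these are simplifying assumptions for illustration only.
    """
    if n_models % 2 == 0:
        raise ValueError("use an odd number of models to avoid ties")
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


if __name__ == "__main__":
    # Each voter is right 60% of the time; the majority improves as n grows.
    for n in (1, 3, 11, 25):
        print(n, round(majority_vote_accuracy(n, 0.6), 4))
```

Under these assumptions, three voters at 60% accuracy reach 64.8% as a majority. Real models trained on the same data have correlated errors, so the gain in practice is usually smaller than this idealized figure.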

Bias-Variance Tradeoff

\[\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Noise}\]

Ensemble type	Main effect	Base models
Bagging	Reduces variance	High-variance (complex) models
Boosting	Reduces bias	High-bias (simple) models
Stacking	Can improve both	Diverse models
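One way to see why averaging mainly attacks the variance term: if n base models each have prediction variance sigma^2 and every pair has correlation rho (an equal-correlation assumption introduced here for illustration, not stated above), the variance of their average is rho * sigma^2 + (1 - rho) * sigma^2 / n. The sketch below evaluates that expression; the function name is hypothetical.

```python
def variance_of_average(sigma2: float, rho: float, n_models: int) -> float:
    """Variance of the mean of n predictors with equal variance and correlation.

    sigma2: variance of each individual predictor
    rho:    common pairwise correlation (illustrative assumption)
    """
    if n_models < 1:
        raise ValueError("n_models must be at least 1")
    # The first term is the floor set by correlation; only the second shrinks with n.
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_models


if __name__ == "__main__":
    for rho in (0.0, 0.5, 1.0):
        values = [variance_of_average(1.0, rho, n) for n in (1, 10, 100)]
        print(f"rho={rho}: {values}")
```

With uncorrelated models the variance falls as 1/n, while perfectly correlated models gain nothing from averaging. This is one reason bagging works best with unstable, high-variance base models whose errors differ across bootstrap samples.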

Taxonomy of Ensemble Methods

                        Ensemble Methods
                              │
            ┌─────────────────┼─────────────────┐
            │                 │                 │
        Bagging           Boosting          Stacking
            │                 │                 │
    ┌───────┴───────┐   ┌─────┴─────┐         Meta-
    │               │   │           │         Learner
 Random         Pasting  AdaBoost   Gradient
 Forest                             Boosting
                                        │
                                   ┌────┴────┐
                                XGBoost  LightGBM

Method Comparison

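Method	Training	Combination	Parallelization	Overfitting tendency
Bagging	Independent (parallel)	Averaging / voting	Yes	Low
Boosting	Sequential	Weighted sum	Limited	Present
Stacking	Independent, then sequential	Meta-model	Partial	Needs care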

Diversity

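Diversity is the key to ensemble performance. For an averaging ensemble evaluated with squared error, the ambiguity decomposition gives: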

\[\text{Ensemble Error} = \bar{E} - \bar{A}\]

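\(\bar{E}\): average error of the individual models
\(\bar{A}\): disagreement among the models (ambiguity)

Because \(\bar{A} \ge 0\), the ensemble error is at most the average individual error; at the same average error, more disagreement means a lower ensemble error.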
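The decomposition can be checked numerically for a single input. The sketch below assumes an unweighted averaging ensemble and squared error, and measures ambiguity as the mean squared deviation of each member from the ensemble prediction; the function name and the example numbers are made up for illustration.

```python
def ambiguity_decomposition(predictions, target):
    """Return (ensemble_error, avg_member_error, avg_ambiguity) for one input.

    Assumes an unweighted averaging ensemble and squared error.
    """
    n = len(predictions)
    ensemble = sum(predictions) / n
    # Average squared error of the individual members (E-bar).
    avg_member_error = sum((h - target) ** 2 for h in predictions) / n
    # Average squared deviation of members from the ensemble (A-bar).
    avg_ambiguity = sum((h - ensemble) ** 2 for h in predictions) / n
    ensemble_error = (ensemble - target) ** 2
    return ensemble_error, avg_member_error, avg_ambiguity


if __name__ == "__main__":
    ens, e_bar, a_bar = ambiguity_decomposition([2.0, 4.0, 6.0], target=5.0)
    print(ens, e_bar - a_bar)  # the two values agree up to rounding
```

In this toy case the members err in different directions, so the ensemble error is well below the average member error.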

Ways to Obtain Diversity

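Method	Description	Example
Data sampling	Train on different subsets	Bootstrap (Bagging)
Feature sampling	Use different features	Random Subspace
Model diversity	Use different algorithms	Stacking
Hyperparameters	Use different settings	Varying max_depth
Initialization	Use different random seeds	Neural network ensembles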
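The first two rows of the table come down to index sampling. The following sketch uses only the standard library to draw a bootstrap sample of row indices and a random feature subset; the helper names `bootstrap_indices` and `random_subspace` are hypothetical and not part of any library.

```python
import random


def bootstrap_indices(n_samples: int, rng: random.Random) -> list:
    """Draw n_samples row indices with replacement (data sampling)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def random_subspace(n_features: int, k: int, rng: random.Random) -> list:
    """Pick k distinct feature indices without replacement (feature sampling)."""
    return sorted(rng.sample(range(n_features), k))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    rows = bootstrap_indices(100, rng)
    cols = random_subspace(10, 4, rng)
    # A bootstrap sample repeats some rows and leaves others out,
    # so each base model sees a different view of the data.
    print("distinct rows:", len(set(rows)), "features:", cols)
```

Each base model would then be trained on `X[rows][:, cols]` (or the equivalent for the data structure in use), so that different models see different rows and columns.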

Voting Schemes

Classification

Hard Voting: majority vote over the predicted labels

\[\hat{y} = \text{mode}(h_1(x), h_2(x), ..., h_n(x))\]

Soft Voting: average of the predicted class probabilities

\[\hat{y} = \arg\max_k \frac{1}{n} \sum_{i=1}^{n} P_i(y=k|x)\]
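The two rules can disagree when one model is very confident. The sketch below implements both formulas in plain Python on hand-picked probabilities; the function names and the numbers are illustrative assumptions, and ties are broken in favor of the label seen first.

```python
from collections import Counter


def hard_vote(probas):
    """Majority vote over each model's most probable class."""
    labels = [max(range(len(p)), key=p.__getitem__) for p in probas]
    # most_common keeps first-seen order among equal counts (tie-breaking).
    return Counter(labels).most_common(1)[0][0]


def soft_vote(probas):
    """Class with the highest probability averaged over all models."""
    n_classes = len(probas[0])
    mean = [sum(p[k] for p in probas) / len(probas) for k in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__)


if __name__ == "__main__":
    # Two models lean slightly toward class 1, one is confident in class 0.
    probas = [[0.45, 0.55], [0.45, 0.55], [0.90, 0.10]]
    print("hard:", hard_vote(probas), "soft:", soft_vote(probas))
```

Hard voting follows the two lukewarm models and picks class 1, while soft voting lets the confident model pull the average toward class 0. Soft voting is only as trustworthy as the probability estimates it averages.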

Regression

Average:

\[\hat{y} = \frac{1}{n} \sum_{i=1}^{n} h_i(x)\]

Weighted average:

\[\hat{y} = \sum_{i=1}^{n} w_i h_i(x), \quad \sum_i w_i = 1\]
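Both regression rules fit in one helper. The sketch below normalizes the weights so that they sum to 1, matching the constraint in the formula; the function name `weighted_average` and the sample values are hypothetical.

```python
def weighted_average(predictions, weights=None):
    """Combine per-model predictions for one input.

    With weights=None this is the plain average; otherwise the weights are
    rescaled to sum to 1 before combining.
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    if len(weights) != len(predictions):
        raise ValueError("need exactly one weight per prediction")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return sum(w * h for w, h in zip(weights, predictions)) / total


if __name__ == "__main__":
    preds = [1.0, 2.0, 3.0]
    print(weighted_average(preds))             # plain average
    print(weighted_average(preds, [2, 1, 1]))  # first model counts double
```

Weights are often chosen from validation performance; picking them on the training data tends to favor whichever model overfits most.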

Basic Usage (scikit-learn)

Voting Classifier

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

# Data
X, y = load_iris(return_X_y=True)

# Individual models
clf1 = LogisticRegression(max_iter=1000)
clf2 = DecisionTreeClassifier()
# probability=True enables predict_proba, which soft voting requires
clf3 = SVC(probability=True)

# Hard voting
hard_voting = VotingClassifier(
    estimators=[('lr', clf1), ('dt', clf2), ('svc', clf3)],
    voting='hard'
)

# Soft voting
soft_voting = VotingClassifier(
    estimators=[('lr', clf1), ('dt', clf2), ('svc', clf3)],
    voting='soft'
)

# Evaluate each model and both ensembles with 5-fold cross-validation
for clf, name in [(clf1, 'LR'), (clf2, 'DT'), (clf3, 'SVC'),
                   (hard_voting, 'Hard'), (soft_voting, 'Soft')]:
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    print(f"{name}: {scores.mean():.4f} (+/- {scores.std():.4f})")

Voting Regressor

from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import cross_val_score

X, y = fetch_california_housing(return_X_y=True)
X, y = X[:2000], y[:2000]  # use a subset of rows to keep the example fast

reg1 = Ridge()
reg2 = DecisionTreeRegressor()
reg3 = SVR()

voting_reg = VotingRegressor(
    estimators=[('ridge', reg1), ('dt', reg2), ('svr', reg3)]
)

scores = cross_val_score(voting_reg, X, y, cv=5, scoring='r2')
print(f"Voting R2: {scores.mean():.4f}")

References

Dietterich, T.G. (2000). "Ensemble Methods in Machine Learning"
Zhou, Z.H. (2012). "Ensemble Methods: Foundations and Algorithms"
scikit-learn Ensemble: https://scikit-learn.org/stable/modules/ensemble.html

Topic	Link
Bagging	bagging.md
Boosting	boosting.md
Stacking	stacking.md
Random Forest	../supervised/classification/random-forest.md
XGBoost	../supervised/classification/xgboost.md
LightGBM	../supervised/classification/lightgbm.md