Decision Tree (의사결정나무)¶

개요¶

문제 정의¶

데이터를 재귀적으로 분할하여 트리 구조의 예측 모델을 구축:

각 내부 노드: 특성에 대한 조건 (예: age > 30)
각 리프 노드: 예측값 (클래스 또는 수치)
루트에서 리프까지의 경로가 결정 규칙

핵심 아이디어¶

탐욕적 분할 (Greedy Splitting):

각 노드에서 불순도(impurity)를 최대한 감소시키는 분할 선택:

Information Gain = Impurity(parent) - weighted_avg(Impurity(children))

가장 높은 Information Gain을 주는 특성과 분할점 선택.

알고리즘/수식¶

불순도 측정 (분류)¶

Gini Impurity:

Gini(t) = 1 - sum_{k=1}^{K} p_k^2

여기서 p_k는 노드 t에서 클래스 k의 비율.

Gini = 0: 완벽한 순수 (한 클래스만 존재)
Gini = 0.5: 최대 불순도 (이진 분류 시)

Entropy (정보 엔트로피):

Entropy(t) = -sum_{k=1}^{K} p_k * log_2(p_k)

Entropy = 0: 완벽한 순수
Entropy = 1: 최대 불순도 (이진 분류 시)

Information Gain:

IG(t, a) = Entropy(t) - sum_{v in values(a)} (|t_v| / |t|) * Entropy(t_v)

불순도 측정 (회귀)¶

MSE (Mean Squared Error):

MSE(t) = 1/n * sum_{i in t} (y_i - y_bar)^2

MAE (Mean Absolute Error):

MAE(t) = 1/n * sum_{i in t} |y_i - y_median|

분할 알고리즘¶

Algorithm: CART (Classification and Regression Trees)

Input: 데이터셋 D, 최대 깊이 max_depth
Output: 의사결정나무 T

function BuildTree(D, depth):
    # 종료 조건
    if depth >= max_depth or |D| < min_samples or D is pure:
        return LeafNode(prediction=majority_class(D) or mean(D))

    # 최적 분할 탐색
    best_gain = -inf
    for each feature f in features:
        for each threshold t in unique_values(D[f]):
            D_left = {x in D : x[f] <= t}
            D_right = {x in D : x[f] > t}

            gain = Impurity(D) - (|D_left|/|D|)*Impurity(D_left) 
                                - (|D_right|/|D|)*Impurity(D_right)

            if gain > best_gain:
                best_gain = gain
                best_split = (f, t)

    # 분할 적용
    if best_gain <= min_impurity_decrease:
        return LeafNode(prediction=majority_class(D))

    D_left, D_right = split(D, best_split)

    node = InternalNode(feature=best_split[0], threshold=best_split[1])
    node.left = BuildTree(D_left, depth + 1)
    node.right = BuildTree(D_right, depth + 1)

    return node

시간 복잡도¶

단계	복잡도
학습	O(n * d * n * log(n)) = O(n^2 * d * log(n))
최적화 시	O(n * d * log(n)) (정렬된 특성 사용)
예측	O(log(n)) ~ O(n) (트리 깊이)

하이퍼파라미터 가이드¶

파라미터	설명	권장 범위	기본값
max_depth	최대 트리 깊이	3 ~ 20	None
min_samples_split	분할 최소 샘플 수	2 ~ 20	2
min_samples_leaf	리프 최소 샘플 수	1 ~ 10	1
max_features	분할 시 고려할 특성 수	'sqrt', 'log2', int	None
criterion	불순도 측정 방법	'gini', 'entropy'	'gini'
min_impurity_decrease	최소 불순도 감소	0.0 ~ 0.1	0.0
class_weight	클래스 가중치	'balanced', dict	None
ccp_alpha	비용 복잡도 가지치기	0.0 ~ 0.1	0.0

과적합 방지 우선순위: 1. max_depth 제한 2. min_samples_leaf 증가 3. ccp_alpha로 가지치기 4. min_samples_split 증가

Python 코드 예시¶

기본 사용법¶

import numpy as np
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, classification_report
import matplotlib.pyplot as plt

# 데이터 로드
data = load_iris()
X, y = data.data, data.target

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 모델 학습
tree = DecisionTreeClassifier(
    max_depth=4,
    min_samples_leaf=5,
    criterion='gini',
    random_state=42
)
tree.fit(X_train, y_train)

# 예측 및 평가
y_pred = tree.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Tree depth: {tree.get_depth()}")
print(f"Number of leaves: {tree.get_n_leaves()}")

# 트리 시각화
plt.figure(figsize=(20, 10))
plot_tree(tree, 
          feature_names=data.feature_names,
          class_names=data.target_names,
          filled=True, 
          rounded=True,
          fontsize=10)
plt.savefig('decision_tree.png', dpi=150, bbox_inches='tight')
plt.show()

특성 중요도¶

# 특성 중요도 (Gini importance)
feature_importance = pd.DataFrame({
    'feature': data.feature_names,
    'importance': tree.feature_importances_
}).sort_values('importance', ascending=False)

print("\nFeature Importance:")
print(feature_importance.to_string(index=False))

# 시각화
plt.figure(figsize=(10, 6))
plt.barh(feature_importance['feature'], feature_importance['importance'])
plt.xlabel('Importance')
plt.title('Decision Tree Feature Importance')
plt.gca().invert_yaxis()
plt.savefig('dt_importance.png', dpi=150)
plt.show()

비용 복잡도 가지치기 (CCP)¶

# 비용 복잡도 경로 계산
path = tree.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas
impurities = path.impurities

# 각 alpha 값에 대해 트리 학습
trees = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(ccp_alpha=ccp_alpha, random_state=42)
    clf.fit(X_train, y_train)
    trees.append(clf)

# Train/Test 정확도 계산
train_scores = [clf.score(X_train, y_train) for clf in trees]
test_scores = [clf.score(X_test, y_test) for clf in trees]

# 최적 alpha 선택
best_idx = np.argmax(test_scores)
best_alpha = ccp_alphas[best_idx]

print(f"\nBest ccp_alpha: {best_alpha:.6f}")
print(f"Best test score: {test_scores[best_idx]:.4f}")
print(f"Tree depth at best alpha: {trees[best_idx].get_depth()}")

# 시각화
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(ccp_alphas, train_scores, marker='o', label='Train', drawstyle='steps-post')
ax.plot(ccp_alphas, test_scores, marker='s', label='Test', drawstyle='steps-post')
ax.axvline(best_alpha, color='r', linestyle='--', label=f'Best alpha={best_alpha:.4f}')
ax.set_xlabel('ccp_alpha')
ax.set_ylabel('Accuracy')
ax.set_title('Accuracy vs Alpha for Cost Complexity Pruning')
ax.legend()
plt.savefig('ccp_pruning.png', dpi=150)
plt.show()

회귀 트리¶

from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error, r2_score

# 데이터 로드
data = fetch_california_housing()
X, y = data.data[:5000], data.target[:5000]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 회귀 트리 학습
reg_tree = DecisionTreeRegressor(
    max_depth=6,
    min_samples_leaf=10,
    random_state=42
)
reg_tree.fit(X_train, y_train)

y_pred = reg_tree.predict(X_test)

print(f"R2 Score: {r2_score(y_test, y_pred):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")

하이퍼파라미터 튜닝¶

from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [3, 5, 7, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy']
}

grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)

print(f"\nBest Parameters: {grid_search.best_params_}")
print(f"Best CV Score: {grid_search.best_score_:.4f}")

결정 규칙 추출¶

from sklearn.tree import export_text

# 텍스트 형태로 규칙 출력
rules = export_text(tree, feature_names=list(data.feature_names))
print("\nDecision Rules:")
print(rules)

언제 쓰나?¶

적합한 상황: - 해석 가능한 모델이 필요할 때 (규제, 의료) - 특성 간 상호작용이 중요할 때 - 비선형 관계 학습이 필요할 때 - 특성 선택/중요도 파악이 목적일 때 - 범주형 특성이 많을 때

부적합한 상황: - 고성능이 필수인 경우 (앙상블 사용 권장) - 특성 간 선형 관계가 주요한 경우 - 데이터 변화에 안정적인 모델이 필요한 경우 (분산이 높음) - 연속적인 확률 추정이 중요한 경우

장단점¶

장점	단점
직관적 해석 가능	과적합 경향 (높은 분산)
스케일링 불필요	불안정 (데이터 변화에 민감)
비선형 관계 학습	외삽 불가능
특성 상호작용 자동 탐지	축에 평행한 결정 경계만 학습
범주형/수치형 혼합 처리	클래스 불균형에 편향
빠른 예측	최적 트리 찾기는 NP-hard