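Linear Regression

Overview

Problem Definition

Predict a continuous target variable as a linear combination of the input features:

y = w_0 + w_1*x_1 + w_2*x_2 + ... + w_d*x_d + epsilon
  = w^T * x + epsilon

where:
- y: target variable
- x: input feature vector (with x_0 = 1 prepended so that w_0 acts as the intercept)
- w: weights (regression coefficients)
- epsilon: error term (assumed normally distributed)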

Key Idea

Ordinary Least Squares (OLS):

Find the weights that minimize the residual sum of squares (RSS):

RSS = sum_{i=1}^{n} (y_i - w^T * x_i)^2 = ||y - Xw||_2^2

Closed-form solution (Normal Equation):

w* = (X^T * X)^(-1) * X^T * y

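Algorithm and Formulas

Model Assumptions

- Linearity: E[Y|X] = w^T * X
- Independence: observations are independent of one another
- Homoscedasticity: Var(epsilon_i) = sigma^2 (constant across observations)
- Normality: epsilon ~ N(0, sigma^2) (needed for exact tests and confidence intervals, not for computing the OLS estimate itself)
- No perfect multicollinearity: no exact linear relationship among the features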

Closed-form Solution

Loss function: L(w) = ||y - Xw||_2^2

Set the gradient to zero:
dL/dw = -2 * X^T * (y - Xw) = 0
X^T * Xw = X^T * y

Solution:
w* = (X^T * X)^(-1) * X^T * y

Condition: X^T * X must be invertible, i.e. X must have full column rank
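The closed-form solution can be sketched in NumPy as follows. The synthetic data, the seed, and the names `X_design` and `true_w` are assumptions made for illustration only. The sketch solves the linear system (X^T X) w = X^T y with `np.linalg.solve` instead of forming an explicit inverse, which is generally the more numerically stable choice, and cross-checks the result against `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))

# Hypothetical ground-truth weights: [intercept, w_1, w_2, w_3]
true_w = np.array([1.5, -2.0, 0.5, 3.0])

# Prepend a column of ones so the intercept w_0 is part of w.
X_design = np.column_stack([np.ones(n), X])
y = X_design @ true_w + rng.normal(scale=0.1, size=n)

# Normal equation: solve (X^T X) w = X^T y without forming the inverse.
w_normal = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)

# Reference: SVD-based least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print("normal equation:", w_normal)
print("lstsq:          ", w_lstsq)
```

Both estimates should agree to numerical precision and land close to `true_w`. When X^T X is ill-conditioned, `np.linalg.lstsq` is usually the safer of the two.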

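Gradient Descent

For large datasets, or when computing (X^T X)^(-1) is impractical:

Gradient of the mean squared error loss L(w) = (1/n) * ||y - Xw||_2^2:
dL/dw = (2/n) * X^T * (Xw - y)

Update: w := w - alpha * dL/dw (alpha: learning rate)

Variant	Batch size	Characteristics
Batch GD	All n samples	Stable updates, slow per step
Stochastic GD	1 sample	Fast per step, noisy updates
Mini-batch GD	m samples (1 < m < n)	Balance between the two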
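A minimal sketch of the batch variant, assuming the mean squared error loss (1/n) * ||y - Xw||^2 and a fixed learning rate. The function name `gradient_descent`, the defaults `alpha=0.1` and `n_iters=1000`, and the synthetic data are illustrative assumptions, not tuned recommendations.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, n_iters=1000):
    """Batch gradient descent on the MSE loss (1/n) * ||y - Xw||^2."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        grad = (2.0 / n) * X.T @ (X @ w - y)  # gradient over the full dataset
        w -= alpha * grad                      # w := w - alpha * dL/dw
    return w

rng = np.random.default_rng(0)
n = 500
# The first column of ones plays the role of the intercept feature.
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
true_w = np.array([1.0, 2.0, -3.0])  # hypothetical ground truth
y = X @ true_w + rng.normal(scale=0.1, size=n)

w_gd = gradient_descent(X, y)
w_exact, *_ = np.linalg.lstsq(X, y, rcond=None)

print("gradient descent:", w_gd)
print("least squares:   ", w_exact)
```

On well-scaled features like these, the iterates converge to the least-squares solution. Computing `grad` on a random subset of rows at each step instead of all n rows would give the stochastic and mini-batch variants.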

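Evaluation Metrics

Metric	Formula	Interpretation
MSE	(1/n) * sum((y_i - y_hat_i)^2)	Mean of squared errors
RMSE	sqrt(MSE)	Same units as the target
MAE	(1/n) * sum(|y_i - y_hat_i|)	Mean of absolute errors; less sensitive to outliers than MSE
R^2	1 - SS_res/SS_tot	Proportion of variance explained (at most 1; can be negative on held-out data)
Adjusted R^2	1 - (1-R^2)*(n-1)/(n-d-1)	R^2 penalized for the number of features d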
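The formulas in the table can be written out directly in NumPy. The helper name `regression_metrics` and the toy numbers are assumptions for illustration; in practice `sklearn.metrics` provides equivalents for all of these except adjusted R^2.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_features):
    """Compute MSE, RMSE, MAE, R^2 and adjusted R^2 from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return {
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MAE": np.mean(np.abs(residuals)),
        "R2": r2,
        "Adjusted R2": adj_r2,
    }

y_true = [3.0, 5.0, 7.0, 9.0]
good = regression_metrics(y_true, [2.0, 5.0, 8.0, 9.0], n_features=1)
bad = regression_metrics(y_true, [9.0, 7.0, 5.0, 3.0], n_features=1)

print(good)
print("R2 of the reversed predictions:", bad["R2"])
```

For the first set of predictions, SS_res = 2 and SS_tot = 20, so R^2 is 0.9 and adjusted R^2 is 0.85. The second set predicts worse than the mean of `y_true`, so its R^2 falls below zero.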

Time Complexity

Method	Complexity
Normal Equation	O(n*d^2 + d^3)
Gradient Descent	O(n*d*iter), where iter is the number of iterations

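Hyperparameter Guide

Plain OLS has no hyperparameters to tune. The options of scikit-learn's LinearRegression constructor are:

Parameter	Description	Recommended value
fit_intercept	Whether to fit an intercept	True
normalize (deprecated in scikit-learn 1.0, removed in 1.2)	Feature normalization	Apply StandardScaler separately
copy_X	Whether to copy the input	True
n_jobs	Number of parallel jobs	-1 (all cores); a speedup is only expected for sufficiently large multi-target problems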
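To illustrate why `fit_intercept=True` is the usual choice, the sketch below fits the same data with and without an intercept. The synthetic data, with an assumed true intercept of 4.0, is made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
# Hypothetical data-generating process with a nonzero intercept of 4.0
y = 4.0 + X @ np.array([1.0, -2.0]) + rng.normal(scale=0.1, size=100)

with_intercept = LinearRegression(fit_intercept=True).fit(X, y)
no_intercept = LinearRegression(fit_intercept=False).fit(X, y)

print("with intercept:   ", with_intercept.intercept_, with_intercept.score(X, y))
print("without intercept:", no_intercept.intercept_, no_intercept.score(X, y))
```

With `fit_intercept=False` the fitted line is forced through the origin, so the offset in `y` cannot be absorbed and the R^2 returned by `score` drops sharply. Dropping the intercept is reasonable only when the data are already centered or the model is known to pass through the origin.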

Python Code Examples

Basic Usage

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import matplotlib.pyplot as plt

# Load the data
data = fetch_california_housing()
X, y = data.data, data.target

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling (optional; makes coefficient magnitudes easier to compare)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit the model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Predict
y_pred = model.predict(X_test_scaled)

# Evaluate
print("=== Linear Regression Performance ===")
print(f"R2 Score: {r2_score(y_test, y_pred):.4f}")
print(f"MSE: {mean_squared_error(y_test, y_pred):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")
print(f"MAE: {mean_absolute_error(y_test, y_pred):.4f}")

Interpreting Coefficients

# Regression coefficients, sorted by absolute magnitude
coef_df = pd.DataFrame({
    'feature': data.feature_names,
    'coefficient': model.coef_
}).sort_values('coefficient', key=abs, ascending=False)

print(f"\nIntercept: {model.intercept_:.4f}")
print("\nCoefficients (standardized):")
print(coef_df.to_string(index=False))

# Visualization
plt.figure(figsize=(10, 6))
plt.barh(coef_df['feature'], coef_df['coefficient'])
plt.xlabel('Coefficient')
plt.title('Linear Regression Coefficients')
plt.axvline(x=0, color='black', linestyle='-', linewidth=0.5)
plt.tight_layout()
plt.savefig('lr_coefficients.png', dpi=150)
plt.show()

Residual Analysis

residuals = y_test - y_pred

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Residuals vs Fitted
axes[0, 0].scatter(y_pred, residuals, alpha=0.5)
axes[0, 0].axhline(y=0, color='r', linestyle='--')
axes[0, 0].set_xlabel('Fitted values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residuals vs Fitted')

# Q-Q Plot
from scipy import stats
stats.probplot(residuals, dist="norm", plot=axes[0, 1])
axes[0, 1].set_title('Normal Q-Q Plot')

# Histogram of residuals
axes[1, 0].hist(residuals, bins=50, edgecolor='black', density=True)
x_norm = np.linspace(residuals.min(), residuals.max(), 100)
axes[1, 0].plot(x_norm, stats.norm.pdf(x_norm, residuals.mean(), residuals.std()), 'r-')
axes[1, 0].set_xlabel('Residuals')
axes[1, 0].set_title('Residuals Distribution')

# Actual vs Predicted
axes[1, 1].scatter(y_test, y_pred, alpha=0.5)
axes[1, 1].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
axes[1, 1].set_xlabel('Actual')
axes[1, 1].set_ylabel('Predicted')
axes[1, 1].set_title('Actual vs Predicted')

plt.tight_layout()
plt.savefig('lr_diagnostics.png', dpi=150)
plt.show()

Multicollinearity Check (VIF)

from statsmodels.stats.outliers_influence import variance_inflation_factor

# Compute the VIF of each feature
vif_data = pd.DataFrame()
vif_data['feature'] = data.feature_names
vif_data['VIF'] = [
    variance_inflation_factor(X_train_scaled, i) 
    for i in range(X_train_scaled.shape[1])
]

print("\nVariance Inflation Factors:")
print(vif_data.sort_values('VIF', ascending=False).to_string(index=False))
print("\nVIF > 10: multicollinearity suspected")

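Detailed Analysis with Statsmodels

import statsmodels.api as sm

# Add a constant (intercept) column
X_train_sm = sm.add_constant(X_train_scaled)

# Fit OLS
ols_model = sm.OLS(y_train, X_train_sm).fit()

# Detailed summary (coefficients, standard errors, p-values, confidence intervals)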
print(ols_model.summary())

Cross-Validation

from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(
    LinearRegression(),
    X_train_scaled, y_train,
    cv=5,
    scoring='r2'
)

print(f"\nCross-validation R2:")
print(f"  Mean: {cv_scores.mean():.4f}")
print(f"  Std: {cv_scores.std():.4f}")
print(f"  Scores: {cv_scores}")

Polynomial Regression

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

# Polynomial features to capture nonlinear relationships
degrees = [1, 2, 3]

for degree in degrees:
    pipeline = Pipeline([
        ('poly', PolynomialFeatures(degree=degree, include_bias=False)),
        ('scaler', StandardScaler()),
        ('regressor', LinearRegression())
    ])

    cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='r2')
    print(f"Degree {degree}: R2 = {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")

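When to Use It?

Good fit:
- The relationship between the features and the target is roughly linear
- Coefficient interpretation matters (business insight)
- A baseline model is needed for quick validation
- There are fewer features than samples (n > d)
- Inference or statistical testing is required

Poor fit:
- The relationship is strongly nonlinear
- There are many outliers
- Multicollinearity is severe
- There are more features than samples (n < d)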
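The n < d case can be made concrete with a small sketch. With fewer samples than features, X cannot have full column rank, so X^T X is singular and the normal equation has no unique solution. The shapes and random data below are assumptions chosen only to show the effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 10  # fewer samples than features
X = rng.normal(size=(n, d))
y = rng.normal(size=n)  # pure noise target

# rank(X^T X) equals rank(X), which is at most min(n, d) = n < d.
rank = np.linalg.matrix_rank(X)
print("rank of X:", rank, "out of d =", d)

# lstsq still returns a solution (the minimum-norm one), and it fits
# the training data essentially exactly even though y is pure noise.
w_min_norm, *_ = np.linalg.lstsq(X, y, rcond=None)
train_residual = np.linalg.norm(X @ w_min_norm - y)
print("training residual norm:", train_residual)
```

A near-zero training residual on a noise target is a sign that the fit cannot generalize. In this regime a regularized model such as Ridge or Lasso is the usual alternative.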

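Pros and Cons

Pros	Cons
Easy to interpret (coefficients have a direct meaning)	Learns only linear relationships
Fast training and prediction	Sensitive to outliers
Closed-form solution exists	Vulnerable to multicollinearity
Supports statistical inference (p-values, CIs)	Does not capture feature interactions unless interaction terms are added
Low risk of overfitting (when n is much larger than d)	Performance degrades in high dimensions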