K-means++¶

논문 정보¶

항목	내용
제목	k-means++: The Advantages of Careful Seeding
저자	David Arthur, Sergei Vassilvitskii
학회/저널	ACM-SIAM Symposium on Discrete Algorithms (SODA)
연도	2007
링크	https://theory.stanford.edu/~sergei/papers/kMeansPP-soda.pdf

개요¶

문제 정의¶

표준 k-means 알고리즘의 초기 중심점(centroid) 선택 문제:

k-means는 초기 중심점에 민감하여 수렴 결과가 크게 달라짐
무작위 초기화는 최적해 대비 임의로 나쁜 결과를 생성할 수 있음
k-means 문제 자체가 NP-hard이므로 근사 알고리즘 필요

핵심 아이디어¶

데이터 포인트를 초기 중심점으로 선택하되, 기존 중심점에서 먼 포인트일수록 높은 확률로 선택:

첫 번째 중심점: 균등 분포로 무작위 선택
이후 중심점: 거리의 제곱에 비례하는 확률로 선택
이 방식으로 O(log k)-competitive 근사 보장

알고리즘/방법론¶

수식¶

데이터 포인트 x를 다음 중심점으로 선택할 확률:

P(x) = D(x)^2 / sum_{x' in X} D(x')^2

여기서 D(x)는 x에서 가장 가까운 기존 중심점까지의 거리.

단계별 설명¶

Algorithm: k-means++ Initialization

Input: 데이터셋 X, 클러스터 수 k
Output: 초기 중심점 집합 C

1. C <- {c_1} where c_1 is chosen uniformly at random from X

2. For i = 2 to k:
   a. For each x in X, compute D(x) = min_{c in C} ||x - c||
   b. Choose c_i = x with probability D(x)^2 / sum_{x'} D(x')^2
   c. C <- C union {c_i}

3. Proceed with standard k-means using C as initial centers

4. Return final clustering

시간 복잡도¶

초기화: O(nkd) where n=데이터 수, k=클러스터 수, d=차원
전체 k-means++: O(nkdi) where i=반복 횟수

주요 결과¶

이론적 보장¶

항목	결과
근사 비율	O(log k)-competitive
기대 비용	E[phi] <= 8(ln k + 2) * phi_opt

여기서 phi는 클러스터 내 제곱 거리 합, phi_opt는 최적해.

실험 결과¶

논문의 벤치마크 데이터셋 실험:

데이터셋	표준 k-means 평균 비용	k-means++ 평균 비용	개선율
Cloud	4.22	3.79	10.2%
Intrusion	40.12	35.61	11.2%
Spam	21.34	18.88	11.5%

평균적으로 초기화만으로 최종 비용 10-15% 감소
수렴 속도 향상: 반복 횟수 평균 2-3배 감소

실전 적용 가이드¶

언제 사용하나¶

대부분의 k-means 적용 상황에서 기본 선택으로 권장
클러스터가 불균형하거나 밀도 차이가 클 때 특히 효과적
초기화 품질이 중요한 대규모 데이터셋

하이퍼파라미터¶

파라미터	설명	권장값
n_clusters (k)	클러스터 수	Elbow method 또는 Silhouette score로 결정
n_init	전체 알고리즘 반복 횟수	10 (sklearn 기본값)
max_iter	k-means 반복 한도	300
tol	수렴 임계값	1e-4

주의사항¶

고차원 데이터: 차원의 저주로 거리 기반 방법 성능 저하 가능. PCA 등 차원 축소 선행 권장.
이상치 민감성: 이상치가 초기 중심점으로 선택될 가능성 높음. 전처리로 이상치 제거 필요.
구형 클러스터 가정: k-means 자체의 한계. 비구형 클러스터에는 DBSCAN, GMM 등 고려.
스케일링: 특성 간 스케일 차이가 크면 표준화 필수.

Python 코드 예시¶

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# 샘플 데이터 생성
X, y_true = make_blobs(
    n_samples=300,
    centers=4,
    cluster_std=0.60,
    random_state=42
)

# 스케일링
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# k-means++ (sklearn 기본값)
kmeans = KMeans(
    n_clusters=4,
    init='k-means++',  # 기본값
    n_init=10,
    max_iter=300,
    random_state=42
)
y_pred = kmeans.fit_predict(X_scaled)

# 결과 출력
print(f"Inertia (within-cluster sum of squares): {kmeans.inertia_:.2f}")
print(f"Number of iterations: {kmeans.n_iter_}")
print(f"Cluster centers shape: {kmeans.cluster_centers_.shape}")

# 시각화
plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=y_true, cmap='viridis', alpha=0.6)
plt.title('True Labels')

plt.subplot(1, 2, 2)
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=y_pred, cmap='viridis', alpha=0.6)
plt.scatter(
    kmeans.cluster_centers_[:, 0],
    kmeans.cluster_centers_[:, 1],
    c='red', marker='X', s=200, edgecolors='black'
)
plt.title('K-means++ Clustering')

plt.tight_layout()
plt.savefig('kmeans_result.png', dpi=150)
plt.show()

k-means++ 초기화 직접 구현¶

import numpy as np

def kmeans_plusplus_init(X, k, random_state=None):
    """
    K-means++ initialization algorithm.

    Parameters:
    -----------
    X : ndarray of shape (n_samples, n_features)
    k : int, number of clusters
    random_state : int, optional

    Returns:
    --------
    centers : ndarray of shape (k, n_features)
    """
    rng = np.random.default_rng(random_state)
    n_samples, n_features = X.shape
    centers = np.empty((k, n_features))

    # Step 1: Choose first center uniformly at random
    center_idx = rng.integers(n_samples)
    centers[0] = X[center_idx]

    # Step 2: Choose remaining centers
    for i in range(1, k):
        # Compute distances to nearest center
        distances = np.min([
            np.sum((X - centers[j])**2, axis=1)
            for j in range(i)
        ], axis=0)

        # Compute probabilities proportional to D(x)^2
        probs = distances / distances.sum()

        # Sample next center
        center_idx = rng.choice(n_samples, p=probs)
        centers[i] = X[center_idx]

    return centers

# 사용 예시
centers = kmeans_plusplus_init(X_scaled, k=4, random_state=42)
print(f"Initial centers:\n{centers}")

기법	장점	단점
Random init	빠름	결과 불안정, 수렴 느림
k-means++	이론적 보장, 빠른 수렴	초기화 비용 O(nk)
k-means\|\|	병렬화 가능, 대규모 적합	구현 복잡