Transformer Architecture

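Overview

The Transformer is an architecture proposed in Google's 2017 paper "Attention Is All You Need". It processes sequences with attention mechanisms alone, without recurrence (RNNs) or convolution (CNNs), and it is the core architecture underlying modern LLMs.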

Core Concepts

Self-Attention

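A mechanism in which each position of the input sequence attends to every position (including itself) and aggregates contextual information from them.

Formula: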

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

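where:
- Q (Query): what the current position is looking for
- K (Key): what each position offers
- V (Value): the information that is actually passed along
- d_k: the dimension of the keys; dividing by sqrt(d_k) is the scaling factor that keeps the dot products from growing with the dimension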
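
As a rough illustration of the formula above, the following dependency-free sketch computes scaled dot-product attention on plain Python lists. The function names (`softmax`, `attention`) and the toy matrices are assumptions made for this example, not part of any library.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on lists of row vectors.

    Q and K are n x d_k, V is n x d_v.
    Returns (output of shape n x d_v, weights of shape n x n).
    """
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    # scores[i][j] = (q_i . k_j) / sqrt(d_k)
    scores = [
        [scale * sum(q * k for q, k in zip(q_row, k_row)) for k_row in K]
        for q_row in Q
    ]
    # Each row of weights is a probability distribution over positions.
    weights = [softmax(row) for row in scores]
    # output[i] = sum_j weights[i][j] * V[j]
    d_v = len(V[0])
    output = [
        [sum(w * v_row[c] for w, v_row in zip(w_row, V)) for c in range(d_v)]
        for w_row in weights
    ]
    return output, weights

# Toy example: 3 positions, d_k = d_v = 2 (values chosen arbitrarily).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
output, weights = attention(Q, K, V)
```

Each output row is a weighted average of the rows of V, so its entries always lie between the smallest and largest value in the corresponding column of V.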

Multi-Head Attention

Runs several attention operations in parallel so that information is gathered from different representation subspaces.

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O

where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
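
A small sketch of the bookkeeping behind the formula, assuming the common implementation choice in which each of Q, K, and V is projected once to d_model dimensions and then sliced into h heads of size d_k = d_model / h (the PyTorch code later in this page does the same with `view` and `transpose`). The helper names below are hypothetical.

```python
def split_heads(x, n_heads):
    """Split each d_model-dimensional row of x into n_heads slices of size d_k.

    Returns heads where heads[h][i] is the slice of position i seen by head h.
    """
    d_model = len(x[0])
    assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
    d_k = d_model // n_heads
    return [[row[h * d_k:(h + 1) * d_k] for row in x] for h in range(n_heads)]

def concat_heads(heads):
    """Inverse of split_heads: concatenate the per-head slices of each position."""
    n_positions = len(heads[0])
    return [[v for head in heads for v in head[i]] for i in range(n_positions)]

def attention_weight_count(d_model):
    """Weights in W^Q, W^K, W^V and W^O combined, ignoring biases."""
    return 4 * d_model * d_model

# Two positions with d_model = 8, split into 4 heads of size 2.
x = [[float(i) for i in range(8)], [float(i) for i in range(8, 16)]]
heads = split_heads(x, n_heads=4)
restored = concat_heads(heads)
n_weights = attention_weight_count(512)
```

Under this layout the number of projection weights depends only on d_model, not on the number of heads: more heads means narrower heads, not more parameters.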

Positional Encoding

Attention by itself carries no information about token order, so a positional encoding is added:

Sinusoidal (original):

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
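
The two formulas above translate directly into code. The sketch below builds the encoding table with the standard library only; the function name and the small `max_len`/`d_model` values are chosen for illustration.

```python
import math

def sinusoidal_position_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    assert d_model % 2 == 0, "this sketch assumes an even d_model"
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(d_model // 2):
            # Each (sin, cos) pair shares one frequency: 1 / 10000^(2i / d_model).
            angle = pos / (10000 ** (2 * i / d_model))
            pe[pos][2 * i] = math.sin(angle)
            pe[pos][2 * i + 1] = math.cos(angle)
    return pe

pe = sinusoidal_position_encoding(max_len=4, d_model=8)
```

Position 0 encodes to alternating 0 and 1 (sin 0 and cos 0), and later dimensions oscillate more slowly than earlier ones.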

RoPE (Rotary Position Embedding):

f(x, m) = R_m * x

where R_m is a rotation matrix based on position m
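
To make the rotation concrete, here is a minimal sketch that assumes the formulation from the RoFormer paper: adjacent dimension pairs (2i, 2i+1) are rotated by the angle m * base^(-2i/d) with base = 10000. Some implementations pair dimensions differently, so treat the pairing and the helper names as assumptions of this example.

```python
import math

def rope_rotate(x, m, base=10000.0):
    """Rotate each adjacent pair of x by an angle proportional to position m."""
    d = len(x)
    assert d % 2 == 0, "RoPE rotates pairs, so the dimension must be even"
    out = [0.0] * d
    for i in range(d // 2):
        theta = m * base ** (-2 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = x[2 * i], x[2 * i + 1]
        # Standard 2D rotation of the pair (x0, x1) by theta.
        out[2 * i] = x0 * c - x1 * s
        out[2 * i + 1] = x0 * s + x1 * c
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
# Same relative offset (2) at different absolute positions.
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 3))
score_b = dot(rope_rotate(q, 12), rope_rotate(k, 10))
```

Because rotations compose, the query-key dot product depends only on the offset between the two positions, which is why `score_a` and `score_b` agree up to floating-point error.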

Architecture Diagrams

Encoder-Decoder (original Transformer)

                    [Output Probabilities]
                           ^
                    +------+------+
                    |   Linear    |
                    |   Softmax   |
                    +------^------+
                           |
              +============+============+
              |         DECODER         |
              |  +-------------------+  |
              |  |   Feed Forward    |  |
              |  +---------^---------+  |
              |            |            |
              |  +---------+---------+  |
              |  | Cross-Attention   |<-----+
              |  | (Encoder-Decoder) |  |   |
              |  +---------^---------+  |   |
              |            |            |   |
              |  +---------+---------+  |   |
              |  | Masked Multi-Head |  |   |
              |  |    Attention      |  |   |
              |  +-------------------+  |   |
              |        x N layers       |   |
              +============+============+   |
                           ^                |
                    [Target Embedding]      |
                                            |
              +============+============+   |
              |         ENCODER         |   |
              |  +-------------------+  |   |
              |  |   Feed Forward    |------+
              |  +---------^---------+  |
              |            |            |
              |  +---------+---------+  |
              |  |   Multi-Head      |  |
              |  |   Attention       |  |
              |  +-------------------+  |
              |        x N layers       |
              +============+============+
                           ^
                    [Input Embedding]
                           +
                    [Positional Encoding]

Decoder-Only (GPT family)

                    [Output Logits]
                          ^
                    +-----+-----+
                    |  LM Head  |
                    +-----^-----+
                          |
        +-----------------+------------------+
        |            DECODER BLOCK           |
        |  +------------------------------+  |
        |  |    Feed Forward Network      |  |
        |  |    (MLP: up -> act -> down)  |  |
        |  +--------------^---------------+  |
        |                 |                  |
        |  +--------------+---------------+  |
        |  |    Layer Normalization       |  |
        |  +--------------^---------------+  |
        |                 |                  |
        |  +--------------+---------------+  |
        |  |   Causal Self-Attention      |  |
        |  |   (Masked)                   |  |
        |  +--------------^---------------+  |
        |                 |                  |
        |  +--------------+---------------+  |
        |  |    Layer Normalization       |  |
        |  +------------------------------+  |
        |             x N layers             |
        +-----------------+------------------+
                          ^
                    +-----+-----+
                    | Embedding |
                    +-----^-----+
                          |
                    [Input Tokens]

Attention Computation in Detail

    Q           K           V
    |           |           |
    v           v           v
[n x d_k]   [n x d_k]   [n x d_v]
    |           |           |
    |     +-----+           |
    |     |                 |
    v     v                 |
  [Q @ K^T]                 |
  [n x n]                   |
    |                       |
    v                       |
  / sqrt(d_k)               |
    |                       |
    v                       |
  [Mask] (optional)         |
    |                       |
    v                       |
  softmax                   |
    |                       |
    v                       v
  [n x n] @ [n x d_v] = [n x d_v]
              |
              v
         [Output]

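Representative Models

Model	Type	Parameters	Notes
GPT-4	Decoder-only	~1.8T (unofficial estimate)	OpenAI's flagship model
Claude 3 Opus	Decoder-only	Undisclosed	Long context (200K tokens)
Llama 3	Decoder-only	8B/70B/405B	Meta's open-weight model family (405B arrived with the 3.1 release)
Gemini	Decoder-only	Undisclosed	Multimodal support
T5	Encoder-Decoder	60M to 11B	Text-to-text paradigm
BERT	Encoder-only	110M/340M	Bidirectional encoding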

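Pros and Cons

Advantages

High parallelism: all positions of a sequence can be processed at once during training
Long-range dependencies: attention connects distant positions directly
Flexible structure: applicable to a wide range of tasks
Well-established ecosystem: abundant tooling for training, inference, and optimization

Disadvantages

Quadratic complexity: standard attention costs O(n^2) time and memory in the sequence length n
KV cache memory: during generation the cache grows with sequence length and becomes a major memory cost for long sequences
Limited positional generalization: extrapolating beyond the training length is difficult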
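
The first two disadvantages can be put into numbers with a back-of-the-envelope calculation. The sketch below assumes 16-bit (2-byte) elements and a hypothetical model with 32 layers, 32 heads, and head dimension 128; these values are illustrative and do not describe any specific model in the table above.

```python
def attention_scores_bytes(seq_len, n_heads, bytes_per_elem=2, batch_size=1):
    """Memory needed to materialize the full n x n score matrix of one layer."""
    return batch_size * n_heads * seq_len * seq_len * bytes_per_elem

def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, batch_size=1):
    """Memory for cached keys and values over all layers (factor 2 = K and V)."""
    return (2 * batch_size * n_layers * n_kv_heads * head_dim
            * seq_len * bytes_per_elem)

GIB = 1024 ** 3

# Hypothetical configuration, chosen only to make the arithmetic concrete.
scores_4k = attention_scores_bytes(seq_len=4096, n_heads=32)
scores_8k = attention_scores_bytes(seq_len=8192, n_heads=32)
cache_4k = kv_cache_bytes(seq_len=4096, n_layers=32, n_kv_heads=32, head_dim=128)
cache_8k = kv_cache_bytes(seq_len=8192, n_layers=32, n_kv_heads=32, head_dim=128)

print(f"scores per layer: {scores_4k / GIB:.0f} GiB -> {scores_8k / GIB:.0f} GiB")
print(f"KV cache:         {cache_4k / GIB:.0f} GiB -> {cache_8k / GIB:.0f} GiB")
```

Doubling the sequence length quadruples the score matrix (quadratic growth) but only doubles the KV cache (linear growth); the cache still dominates in practice because it must be kept for every layer throughout generation.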

Code Examples

PyTorch Self-Attention Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

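class MultiHeadAttention(nn.Module):
    """Multi-head self-attention over inputs of shape (batch, seq_len, d_model)."""

    def __init__(self, d_model: int, n_heads: int, dropout: float = 0.1):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"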

        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # mask must broadcast to (B, H, N, N); positions where mask == 0 are not attended to
        batch_size, seq_len, _ = x.shape

        # Linear projections
        Q = self.W_q(x)  # (B, N, D)
        K = self.W_k(x)
        V = self.W_v(x)

        # Reshape for multi-head: (B, N, H, D_k) -> (B, H, N, D_k)
        Q = Q.view(batch_size, seq_len, self.n_heads, self.d_k).transpose(1, 2)
        K = K.view(batch_size, seq_len, self.n_heads, self.d_k).transpose(1, 2)
        V = V.view(batch_size, seq_len, self.n_heads, self.d_k).transpose(1, 2)

        # Scaled dot-product attention
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)

        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        attn_weights = F.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Apply attention to values
        context = torch.matmul(attn_weights, V)  # (B, H, N, D_k)

        # Concatenate heads
        context = context.transpose(1, 2).contiguous()
        context = context.view(batch_size, seq_len, self.d_model)

        return self.W_o(context)


class TransformerBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()

        self.attention = MultiHeadAttention(d_model, n_heads, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )

    def forward(self, x, mask=None):
        # Pre-norm architecture (modern style)
        x = x + self.attention(self.norm1(x), mask)
        x = x + self.ffn(self.norm2(x))
        return x

Creating a Causal Mask

import torch


def create_causal_mask(seq_len: int, device: torch.device) -> torch.Tensor:
    """
    Create causal (autoregressive) attention mask.
    Returns mask where mask[i][j] = 1 if j <= i, else 0
    """
    mask = torch.tril(torch.ones(seq_len, seq_len, device=device))
    return mask.unsqueeze(0).unsqueeze(0)  # (1, 1, N, N)
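
To see what such a mask does to the attention weights, the following dependency-free sketch mirrors the `masked_fill(mask == 0, -inf)` followed by `softmax` steps of the implementation above on plain Python lists. The helper names are hypothetical and the all-zero scores are chosen so that the result is easy to read.

```python
import math

def causal_mask(seq_len):
    """mask[i][j] = 1 if j <= i, else 0 (same convention as the torch.tril mask)."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]

def masked_softmax(scores, mask):
    """Row-wise softmax after setting masked-out scores to -inf."""
    weights = []
    for score_row, mask_row in zip(scores, mask):
        masked = [s if keep else float("-inf") for s, keep in zip(score_row, mask_row)]
        m = max(masked)  # finite, because the diagonal is never masked
        exps = [math.exp(v - m) for v in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# With equal scores, each position spreads its weight evenly over itself
# and the positions before it, and gives exactly zero weight to the future.
scores = [[0.0] * 4 for _ in range(4)]
weights = masked_softmax(scores, causal_mask(4))
```

A row that is fully masked would produce a division by zero here (and NaN in the PyTorch version), which is one reason the causal mask always keeps the diagonal.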

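Efficient Attention Variants

Technique	Complexity	Description
Flash Attention	O(n^2) time, O(n) memory	IO-aware, memory-efficient exact implementation
Multi-Query Attention	O(n^2)	K and V shared across all heads (smaller KV cache)
Grouped-Query Attention	O(n^2)	Middle ground between MQA and MHA
Sliding Window	O(n*w)	Attention within a fixed window of size w
Linear Attention	O(n)	Based on kernel approximation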
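
Two rows of the table lend themselves to a short sketch: the sliding-window mask, and the way grouped-query attention assigns query heads to shared K/V heads. Both helpers are hypothetical illustrations; the window here is assumed to be causal and to include the current position, and query heads are assumed to be grouped contiguously.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] = 1 if j is one of the last `window` positions up to and including i."""
    return [
        [1 if 0 <= i - j < window else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]

def kv_head_for_query_head(q_head, n_q_heads, n_kv_heads):
    """Index of the K/V head that a given query head reads from."""
    assert n_q_heads % n_kv_heads == 0, "query heads must split evenly into groups"
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size

# Sliding window: the number of attended pairs grows like n * w, not n^2.
mask = sliding_window_mask(seq_len=8, window=3)
attended_pairs = sum(sum(row) for row in mask)
full_causal_pairs = 8 * (8 + 1) // 2

# 8 query heads: GQA with 2 K/V heads, MQA with 1, standard MHA with 8.
gqa = [kv_head_for_query_head(h, 8, 2) for h in range(8)]
mqa = [kv_head_for_query_head(h, 8, 1) for h in range(8)]
mha = [kv_head_for_query_head(h, 8, 8) for h in range(8)]
```

MQA and GQA leave the O(n^2) attention cost unchanged; what shrinks is the number of K/V heads that must be stored in the KV cache.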

References

Vaswani, A., et al. (2017). "Attention Is All You Need." NeurIPS 2017.
arXiv: https://arxiv.org/abs/1706.03762
Radford, A., et al. (2018). "Improving Language Understanding by Generative Pre-Training." (GPT)
Radford, A., et al. (2019). "Language Models are Unsupervised Multitask Learners." (GPT-2)
Brown, T., et al. (2020). "Language Models are Few-Shot Learners." (GPT-3)
arXiv: https://arxiv.org/abs/2005.14165
Dao, T., et al. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness."
arXiv: https://arxiv.org/abs/2205.14135
Su, J., et al. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding."
arXiv: https://arxiv.org/abs/2104.09864
Shazeer, N. (2019). "Fast Transformer Decoding: One Write-Head is All You Need." (MQA)
arXiv: https://arxiv.org/abs/1911.02150