Convolutional Neural Network (CNN)¶

Paper information¶

Field	Details
Title	Gradient-based learning applied to document recognition
Authors	Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner
Venue	Proceedings of the IEEE
Year	1998
Link	http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf

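Overview¶

Problem definition¶

Limitations of conventional fully connected (FC) networks for image processing:

Parameter explosion: a 224x224x3 image already has 150,528 input nodes, and every hidden unit needs one weight per input
Loss of spatial information: the 2D structure is flattened into a 1D vector
No translation invariance: a pattern learned at one position must be relearned when the object moves elsewhere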
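
To make the parameter explosion concrete, the sketch below compares a single fully connected layer with a single convolutional layer on the same 224x224x3 input. The hidden width of 1,000 units and the 64 filters of size 3x3 are illustrative assumptions, not values taken from the paper.

```python
# Illustrative comparison; the layer sizes below are assumptions for this example.
height, width, channels = 224, 224, 3
num_inputs = height * width * channels                 # 150,528 input nodes

hidden_units = 1000                                    # assumed FC layer width
fc_params = num_inputs * hidden_units + hidden_units   # weights + biases

num_filters, kernel = 64, 3                            # assumed conv layer: 64 filters of 3x3
conv_params = num_filters * channels * kernel * kernel + num_filters

print(f"inputs:      {num_inputs:,}")
print(f"FC params:   {fc_params:,}")
print(f"conv params: {conv_params:,}")
```

Under these assumptions the FC layer needs roughly 150 million parameters, while the convolutional layer needs fewer than two thousand, because its weights do not depend on the image size.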

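Core ideas¶

Principle	Description
Local connectivity	Each unit connects only to a small neighborhood of pixels, so it learns local patterns
Weight sharing	The same filter is applied at every position of the image
Hierarchical feature extraction	Features build up from low level (edges) to high level (objects)
Translation invariance	Pooling makes the output robust to small shifts in position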

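Convolution operation¶

2D Convolution¶

Convolution of an input \(\mathbf{X} \in \mathbb{R}^{H \times W}\) with a kernel \(\mathbf{K} \in \mathbb{R}^{k \times k}\) (written here without flipping the kernel, i.e. as cross-correlation, which is the usual convention in deep learning frameworks):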

\[(\mathbf{X} * \mathbf{K})[i, j] = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} \mathbf{X}[i+m, j+n] \cdot \mathbf{K}[m, n]\]

Output size calculation¶

\[H_{out} = \left\lfloor \frac{H_{in} + 2p - k}{s} \right\rfloor + 1\]

Symbol	Meaning
\(H_{in}\)	Input height
\(k\)	Kernel size
\(p\)	Padding
\(s\)	Stride
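
The output size formula can be checked with a small helper. This is a minimal sketch; the function name `conv_output_size` and the sample layer settings are chosen here for illustration.

```python
def conv_output_size(size_in: int, kernel: int, padding: int = 0, stride: int = 1) -> int:
    """Output length along one spatial axis: floor((size_in + 2p - k) / s) + 1."""
    return (size_in + 2 * padding - kernel) // stride + 1

print(conv_output_size(32, kernel=5))                        # 28
print(conv_output_size(32, kernel=3, padding=1))             # 32
print(conv_output_size(224, kernel=7, padding=3, stride=2))  # 112
print(conv_output_size(28, kernel=2, stride=2))              # 14
```

The same function applies to the width, and to pooling layers, since they use the same sliding-window geometry.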

Multi-channel convolution¶

Input: \(C_{in}\) channels, output: \(C_{out}\) channels

\[\mathbf{Y}[c_{out}] = \sum_{c_{in}=1}^{C_{in}} \mathbf{X}[c_{in}] * \mathbf{K}[c_{out}, c_{in}] + b[c_{out}]\]

Number of parameters: \(C_{out} \times C_{in} \times k \times k + C_{out}\) (weights plus one bias per output channel)
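
As a quick check of the parameter formula, the following sketch evaluates it for a few layer shapes. The helper name `conv_params` is hypothetical, and the shapes are examples only.

```python
def conv_params(c_in: int, c_out: int, kernel: int, bias: bool = True) -> int:
    """Parameters of a 2D convolution: C_out * C_in * k * k weights, plus C_out biases."""
    weights = c_out * c_in * kernel * kernel
    return weights + (c_out if bias else 0)

print(conv_params(1, 6, 5))                # 156
print(conv_params(3, 32, 3))               # 896
print(conv_params(64, 64, 3, bias=False))  # 36864
```

Note that the count does not depend on the input height or width, which is the practical effect of weight sharing.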

Convolution visualization¶

Input (5x5)          Kernel (3x3)         Output (3x3)
+--+--+--+--+--+     +--+--+--+
| 1| 1| 1| 0| 0|     | 1| 0| 1|           +--+--+--+
+--+--+--+--+--+     +--+--+--+           | 4| 3| 4|
| 0| 1| 1| 1| 0|     | 0| 1| 0|    ==>    +--+--+--+
+--+--+--+--+--+     +--+--+--+           | 2| 4| 3|
| 0| 0| 1| 1| 1|     | 1| 0| 1|           +--+--+--+
+--+--+--+--+--+     +--+--+--+           | 2| 3| 4|
| 0| 0| 1| 1| 0|                          +--+--+--+
+--+--+--+--+--+
| 0| 1| 1| 0| 0|
+--+--+--+--+--+

Worked example (top-left output cell, products listed row by row):
1*1 + 1*0 + 1*1 + 0*0 + 1*1 + 1*0 + 0*1 + 0*0 + 1*1 = 4
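
The whole output grid can be reproduced with a few lines of plain Python. This is a minimal sketch of the sliding-window sum (no padding, stride 1); the function name `conv2d_valid` is chosen here for illustration.

```python
def conv2d_valid(x, k):
    """Slide kernel k over input x (no padding, stride 1) and sum the elementwise products."""
    kh, kw = len(k), len(k[0])
    out_h = len(x) - kh + 1
    out_w = len(x[0]) - kw + 1
    return [
        [
            sum(x[i + m][j + n] * k[m][n] for m in range(kh) for n in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

result = conv2d_valid(image, kernel)
for row in result:
    print(row)
# [4, 3, 4]
# [2, 4, 3]
# [2, 3, 4]
```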

Pooling¶

Max Pooling¶

Selects the maximum value within each pooling window \(R_{ij}\):

\[\text{MaxPool}(\mathbf{X})[i, j] = \max_{(m, n) \in R_{ij}} \mathbf{X}[m, n]\]

Average Pooling¶

Computes the mean value within each pooling window \(R_{ij}\):

\[\text{AvgPool}(\mathbf{X})[i, j] = \frac{1}{|R_{ij}|} \sum_{(m, n) \in R_{ij}} \mathbf{X}[m, n]\]
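
Both operations differ only in the reduction applied to each window, which the sketch below makes explicit. It assumes non-overlapping square windows (stride equal to the window size); the helper names and the 4x4 feature map are made up for illustration.

```python
def pool2d(x, size, reduce_fn):
    """Apply reduce_fn to each non-overlapping size x size window of x."""
    out = []
    for i in range(0, len(x) - size + 1, size):
        row = []
        for j in range(0, len(x[0]) - size + 1, size):
            window = [x[i + m][j + n] for m in range(size) for n in range(size)]
            row.append(reduce_fn(window))
        out.append(row)
    return out

def mean(values):
    return sum(values) / len(values)

feature_map = [
    [1, 3, 2, 0],
    [5, 7, 1, 1],
    [0, 2, 4, 8],
    [2, 0, 6, 2],
]

max_pooled = pool2d(feature_map, 2, max)
avg_pooled = pool2d(feature_map, 2, mean)
print(max_pooled)  # [[7, 2], [2, 8]]
print(avg_pooled)  # [[4.0, 1.0], [1.0, 5.0]]
```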

Global Average Pooling (GAP)¶

Reduces each channel's entire spatial map to a single value (commonly used in place of FC layers):

\[\text{GAP}(\mathbf{X})[c] = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \mathbf{X}[c, i, j]\]
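
A plain-Python sketch of the GAP formula, assuming the input is stored as a list of channels, each a list of rows. The function name and the sample values are illustrative.

```python
def global_average_pool(x):
    """x is a list of channels, each an H x W list of rows; returns one mean per channel."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in x
    ]

features = [
    [[1, 2], [3, 4]],  # channel 0
    [[0, 0], [0, 8]],  # channel 1
    [[5, 5], [5, 5]],  # channel 2
]

pooled = global_average_pool(features)
print(pooled)  # [2.5, 2.0, 5.0]
```

The output length equals the number of channels regardless of H and W, which is why GAP lets a classifier head accept feature maps of different spatial sizes.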

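Pooling comparison¶

Type	Advantages	Disadvantages	Typical use
Max Pooling	Preserves strong activations	Discards the remaining values	General purpose
Avg Pooling	Smooth downsampling	Dilutes salient features	Early layers
Global Avg	Replaces FC layers, fewer parameters	Loses spatial information	Final layers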

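Major architectures¶

LeNet-5 (1998)¶

One of the first practical CNNs, built for handwritten digit recognition. The parameter counts below assume full channel connectivity in Conv2 and parameter-free pooling; the original paper uses a sparse connection table and trainable subsampling layers, so its total is slightly lower (about 60K).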

Input(32x32x1) -> Conv(5x5) -> Pool(2x2) -> Conv(5x5) -> Pool(2x2) -> FC -> FC -> Output(10)

Architecture:
Layer      | Output    | Params
-----------|-----------|--------
Input      | 32x32x1   | -
Conv1(5x5) | 28x28x6   | 156
Pool1(2x2) | 14x14x6   | -
Conv2(5x5) | 10x10x16  | 2,416
Pool2(2x2) | 5x5x16    | -
FC1        | 120       | 48,120
FC2        | 84        | 10,164
Output     | 10        | 850
-----------|-----------|--------
Total      |           | ~62K
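
The per-layer counts in the table can be reproduced from the layer shapes. The sketch below assumes, as the table does, full channel connectivity in Conv2 and pooling layers without parameters; the helper names are hypothetical.

```python
def conv_params(c_in, c_out, kernel):
    """Weights plus one bias per output channel."""
    return c_out * c_in * kernel * kernel + c_out

def fc_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out

layers = {
    "Conv1": conv_params(1, 6, 5),
    "Conv2": conv_params(6, 16, 5),
    "FC1": fc_params(5 * 5 * 16, 120),
    "FC2": fc_params(120, 84),
    "Output": fc_params(84, 10),
}

for name, count in layers.items():
    print(f"{name:<7}{count:>8,}")

total = sum(layers.values())
print(f"{'Total':<7}{total:>8,}")  # 61,706, i.e. about 62K
```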

VGGNet (2014)¶

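Demonstrated the power of depth, using only 3x3 kernels.

Feature	Description
Depth	16-19 layers
Kernel size	All 3x3
Channel growth	64 -> 128 -> 256 -> 512
Parameters	138M (VGG-16)

Two stacked 3x3 convolutions have the same receptive field as a single 5x5 convolution, but add an extra nonlinearity and use fewer weights (counted per input/output channel pair):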

\[2 \times (3 \times 3) = 18 < 5 \times 5 = 25\]
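
The receptive field claim and the weight comparison can both be checked numerically. This sketch assumes stride-1 convolutions and the same channel count in and out of every layer; the value of 64 channels is an arbitrary example.

```python
def receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1 convolutions: 1 + sum(k - 1)."""
    return 1 + sum(k - 1 for k in kernel_sizes)

def stack_weights(kernel_sizes, channels):
    """Weights (biases ignored) when every layer maps `channels` to `channels`."""
    return sum(k * k * channels * channels for k in kernel_sizes)

channels = 64  # assumed channel count for illustration

print(receptive_field([3, 3]), receptive_field([5]))  # 5 5
print(receptive_field([3, 3, 3]))                     # 7, same as one 7x7
print(stack_weights([3, 3], channels))                # 73728
print(stack_weights([5], channels))                   # 102400
```

The ratio 18/25 holds for any channel count, since both sides scale with the square of the number of channels.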

ResNet (2015)¶

Skip connections make very deep networks trainable.

Residual Block:

\[\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x}\]

        x
        │
        ├───────────────┐
        │               │
        ▼               │
    ┌───────┐           │
    │ Conv  │           │
    └───────┘           │
        │               │
        ▼               │
    ┌───────┐           │
    │  BN   │           │
    └───────┘           │
        │               │
        ▼               │
    ┌───────┐           │
    │ ReLU  │           │ (identity)
    └───────┘           │
        │               │
        ▼               │
    ┌───────┐           │
    │ Conv  │           │
    └───────┘           │
        │               │
        ▼               │
    ┌───────┐           │
    │  BN   │           │
    └───────┘           │
        │               │
        ▼               │
       (+)◄─────────────┘
        │
        ▼
    ┌───────┐
    │ ReLU  │
    └───────┘
        │
        ▼
        y

Bottleneck Block (ResNet-50+):

1x1 Conv (256 -> 64)  : reduce dimensions
3x3 Conv (64 -> 64)   : spatial computation
1x1 Conv (64 -> 256)  : restore dimensions
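
A rough weight count shows why the bottleneck design pays off. The sketch compares the three-layer bottleneck above with a hypothetical alternative of two 3x3 convolutions that stay at 256 channels. Biases and batch-norm parameters are ignored.

```python
def conv_weights(c_in, c_out, kernel):
    """Convolution weights without bias."""
    return c_in * c_out * kernel * kernel

bottleneck = (
    conv_weights(256, 64, 1)    # 1x1 reduce
    + conv_weights(64, 64, 3)   # 3x3 spatial
    + conv_weights(64, 256, 1)  # 1x1 restore
)
plain = 2 * conv_weights(256, 256, 3)  # assumed baseline: two 3x3 convs at 256 channels

print(f"bottleneck: {bottleneck:,}")    # 69,632
print(f"plain:      {plain:,}")         # 1,179,648
print(f"ratio:      {plain / bottleneck:.1f}x")
```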

Architecture comparison¶

Model	Year	Depth	Params	Top-1 Acc (ImageNet)	Innovation
LeNet-5	1998	5	62K	-	Early practical CNN
AlexNet	2012	8	60M	63.3%	ReLU, Dropout
VGG-16	2014	16	138M	74.4%	3x3 kernels
GoogLeNet	2014	22	6.8M	74.8%	Inception
ResNet-50	2015	50	25.6M	76.1%	Skip Connection
ResNet-152	2015	152	60M	77.8%	Very deep network
EfficientNet-B7	2019	-	66M	84.3%	Compound Scaling

PyTorch implementation¶

Basic CNN model¶

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Convolutional layers
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(32)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(64)
        self.conv3 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.bn3 = nn.BatchNorm2d(128)

        self.pool = nn.MaxPool2d(2, 2)
        self.dropout = nn.Dropout(0.5)

        # Fully connected layers
        self.fc1 = nn.Linear(128 * 4 * 4, 512)
        self.fc2 = nn.Linear(512, num_classes)

    def forward(self, x):
        # Block 1: 32x32 -> 16x16
        x = self.pool(F.relu(self.bn1(self.conv1(x))))
        # Block 2: 16x16 -> 8x8
        x = self.pool(F.relu(self.bn2(self.conv2(x))))
        # Block 3: 8x8 -> 4x4
        x = self.pool(F.relu(self.bn3(self.conv3(x))))

        # Flatten
        x = x.view(x.size(0), -1)

        # FC layers
        x = self.dropout(F.relu(self.fc1(x)))
        x = self.fc2(x)
        return x

model = SimpleCNN()
print(f"Total params: {sum(p.numel() for p in model.parameters()):,}")
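
The total reported by the print statement can be cross-checked by hand. The sketch below assumes 32x32 inputs (as in the comments of `forward`) and counts two learnable parameters per BatchNorm channel (scale and shift); the helper names are hypothetical.

```python
def conv_params(c_in, c_out, kernel):
    return c_out * c_in * kernel * kernel + c_out

def bn_params(channels):
    return 2 * channels  # learnable scale and shift; running statistics are not parameters

def fc_params(n_in, n_out):
    return n_in * n_out + n_out

# Spatial size: each 3x3 conv with padding=1 keeps the size, each 2x2 max pool halves it.
size = 32
for _ in range(3):
    size //= 2
flat_features = 128 * size * size  # input width of fc1

total = (
    conv_params(3, 32, 3) + bn_params(32)
    + conv_params(32, 64, 3) + bn_params(64)
    + conv_params(64, 128, 3) + bn_params(128)
    + fc_params(flat_features, 512)
    + fc_params(512, 10)
)
print(size, flat_features)  # 4 2048
print(f"{total:,}")         # 1,147,914
```

More than 90% of the parameters sit in `fc1`, which is the motivation for replacing the flatten-plus-FC head with global average pooling in later architectures.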

ResNet Residual Block¶

class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)

        # Skip connection (identity or projection)
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride, bias=False),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += self.shortcut(x)  # Skip connection
        out = F.relu(out)
        return out

class ResNet(nn.Module):
    def __init__(self, block, num_blocks, num_classes=10):
        super().__init__()
        self.in_channels = 64

        self.conv1 = nn.Conv2d(3, 64, 3, 1, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(64)

        self.layer1 = self._make_layer(block, 64, num_blocks[0], stride=1)
        self.layer2 = self._make_layer(block, 128, num_blocks[1], stride=2)
        self.layer3 = self._make_layer(block, 256, num_blocks[2], stride=2)
        self.layer4 = self._make_layer(block, 512, num_blocks[3], stride=2)

        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512, num_classes)

    def _make_layer(self, block, out_channels, num_blocks, stride):
        strides = [stride] + [1] * (num_blocks - 1)
        layers = []
        for stride in strides:
            layers.append(block(self.in_channels, out_channels, stride))
            self.in_channels = out_channels
        return nn.Sequential(*layers)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.layer1(out)
        out = self.layer2(out)
        out = self.layer3(out)
        out = self.layer4(out)
        out = self.avgpool(out)
        out = out.view(out.size(0), -1)
        out = self.fc(out)
        return out

def ResNet18():
    return ResNet(ResidualBlock, [2, 2, 2, 2])

def ResNet34():
    return ResNet(ResidualBlock, [3, 4, 6, 3])
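
The names ResNet18 and ResNet34 follow from the block counts. A small sketch, assuming the usual convention of counting the stem convolution, the two convolutions in each residual block, and the final FC layer (projection shortcuts are not counted):

```python
def weighted_layers(num_blocks, convs_per_block=2):
    """Stem conv + convs inside all residual blocks + final FC layer."""
    return 1 + convs_per_block * sum(num_blocks) + 1

print(weighted_layers([2, 2, 2, 2]))  # 18
print(weighted_layers([3, 4, 6, 3]))  # 34
```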

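Using pretrained models¶

import torch.nn as nn
import torchvision.models as models
from torchvision import transforms

# Load a ResNet-50 pretrained on ImageNet
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Transfer learning: replace only the final layer
num_classes = 100
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Feature extraction: freeze everything, then train only the final layer.
# requires_grad must be set on the parameters; setting it on the module has no effect.
for param in model.parameters():
    param.requires_grad = False
for param in model.fc.parameters():
    param.requires_grad = True

# Fine-tuning: additionally unfreeze some of the later layers
for param in model.layer4.parameters():
    param.requires_grad = True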

Image classification example¶

Training on CIFAR-10¶

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Data preprocessing
transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616))
])

transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616))
])

# Load data
train_dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform_train)
test_dataset = datasets.CIFAR10(root='./data', train=False, download=True, transform=transform_test)

train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True, num_workers=4)
test_loader = DataLoader(test_dataset, batch_size=128, shuffle=False, num_workers=4)

# Model, loss function, optimizer (ResNet18 is defined in the ResNet section above)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = ResNet18().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)

# Training
def train(epoch):
    model.train()
    total_loss = 0
    correct = 0
    total = 0

    for batch_idx, (inputs, targets) in enumerate(train_loader):
        inputs, targets = inputs.to(device), targets.to(device)

        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        _, predicted = outputs.max(1)
        total += targets.size(0)
        correct += predicted.eq(targets).sum().item()

    acc = 100. * correct / total
    print(f'Epoch {epoch}: Train Loss={total_loss/len(train_loader):.3f}, Acc={acc:.2f}%')

# Evaluation
@torch.no_grad()
def test():
    model.eval()
    correct = 0
    total = 0

    for inputs, targets in test_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        outputs = model(inputs)
        _, predicted = outputs.max(1)
        total += targets.size(0)
        correct += predicted.eq(targets).sum().item()

    acc = 100. * correct / total
    print(f'Test Accuracy: {acc:.2f}%')
    return acc

# Run
for epoch in range(200):
    train(epoch)
    test()
    scheduler.step()
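
To see what learning rates the cosine schedule produces over the 200 epochs, the closed-form cosine annealing formula can be evaluated directly. This is a sketch in plain Python that assumes a minimum learning rate of 0 and one scheduler step per epoch; it mirrors the formula in the PyTorch documentation rather than calling the scheduler itself.

```python
import math

def cosine_annealing_lr(epoch, base_lr=0.1, t_max=200, eta_min=0.0):
    """eta_min + 0.5 * (base_lr - eta_min) * (1 + cos(pi * epoch / t_max))."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))

for epoch in (0, 50, 100, 150, 200):
    print(f"epoch {epoch:>3}: lr = {cosine_annealing_lr(epoch):.4f}")
```

The rate starts at 0.1, passes through 0.05 at the midpoint, and decays smoothly toward 0 at epoch 200.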

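When to use it?¶

Good fit:
- Image classification, object detection, segmentation
- Data where spatial structure matters
- Large image datasets
- Cases where transfer learning is available

Poor fit:
- Sequential data (time series, text)
- Tabular data
- Very small datasets (without transfer learning)
- Inputs whose image size varies, unless the network ends in global or adaptive pooling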

References¶

LeCun, Y. et al. (1998). "Gradient-based learning applied to document recognition"
Simonyan, K. & Zisserman, A. (2014). "Very Deep Convolutional Networks for Large-Scale Image Recognition"
He, K. et al. (2015). "Deep Residual Learning for Image Recognition"
PyTorch Vision: https://pytorch.org/vision/stable/models.html

Topic	Link
Deep learning basics	README.md
RNN/LSTM	rnn-lstm.md
Attention	attention.md
Transformer	../../architecture/transformer.md