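PaliGemma / PaliGemma 2

1. Overview

Item	Details
Developer	Google DeepMind
Release date	PaliGemma: May 2024; PaliGemma 2: December 2024
Model type	Open weights (Gemma license for the weights; Apache 2.0 for the reference code)
Access	Hugging Face, Kaggle, Vertex AI

PaliGemma is Google's open vision-language model (VLM), built by combining a SigLIP vision encoder with a Gemma LLM. Its smallest variant has about 3B parameters, which makes the family practical for research and for deployment on edge hardware.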

2. Model Family

2.1 PaliGemma (v1)

Model	Parameters	Vision encoder	LLM
paligemma-3b	3B	SigLIP-So400m	Gemma 2B

2.2 PaliGemma 2

Model	Parameters	Vision encoder	LLM
paligemma2-3b	3B	SigLIP-So400m	Gemma 2 2B
paligemma2-10b	10B	SigLIP-So400m	Gemma 2 9B
paligemma2-28b	28B	SigLIP-So400m	Gemma 2 27B

3. Architecture

3.1 Structure

[Image 224/448/896px]
        |
        v
[SigLIP Vision Encoder (400M)]
        |
        v
[Linear Projection]
        |
        v
[Gemma 2 LLM] <-- [Text tokens]
        |
        v
[Output tokens]
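
The data flow in the diagram can be sketched in plain Python. This is a toy illustration rather than the model's actual implementation: the function names `linear` and `build_llm_input` and the tiny vector sizes are made up for readability, and the real model works on batched tensors with far larger hidden dimensions.

```python
# Toy sketch of the diagram: project each image feature into the LLM's
# embedding space, then place the projected image tokens before the text tokens.
# All names and dimensions here are illustrative assumptions.

def linear(x, weight, bias):
    """Apply y = W x + b to one vector (weight is [out_dim][in_dim])."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def build_llm_input(image_features, text_embeddings, weight, bias):
    """Return the sequence the LLM would consume: image tokens, then text tokens."""
    projected = [linear(feature, weight, bias) for feature in image_features]
    return projected + text_embeddings


if __name__ == "__main__":
    image_features = [[1.0, 0.0, 2.0]] * 4          # 4 image tokens, encoder dim 3
    weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]     # projects dim 3 -> dim 2
    bias = [0.5, -0.5]
    text_embeddings = [[0.0, 0.0], [1.0, 1.0]]      # 2 text tokens, LLM dim 2
    sequence = build_llm_input(image_features, text_embeddings, weight, bias)
    print(len(sequence), "tokens of width", len(sequence[0]))
```

The point of the sketch is the ordering: after projection, image tokens and text tokens share one embedding width and form a single sequence for the LLM.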

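3.2 Core Components

Component	Specification
Vision encoder	SigLIP-So400m (about 400M parameters)
Image tokens	256 (224px), 1024 (448px), 4096 (896px)
Projection	Linear layer
LLM	Gemma 2 (2B/9B/27B)

3.3 SigLIP Characteristics

Contrastive pretraining with a pairwise sigmoid loss
Improved performance over CLIP
Stronger multilingual support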
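
To make the first point concrete, here is a minimal pure-Python sketch of a pairwise sigmoid contrastive loss: every image-text pair in the batch is treated as an independent binary classification (matching pairs labeled +1, all others -1). The function name `siglip_loss` and the default `temperature` and `bias` values are illustrative assumptions, not values taken from this document.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def siglip_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over a batch of (already normalized) embeddings.

    Pair (i, j) is a positive when i == j and a negative otherwise. The loss is
    the negative mean (per image) of log sigmoid(label * (t * <x_i, y_j> + b)).
    """
    n = len(image_embs)
    total = 0.0
    for i, img in enumerate(image_embs):
        for j, txt in enumerate(text_embs):
            label = 1.0 if i == j else -1.0
            logit = temperature * dot(img, txt) + bias
            total += log_sigmoid(label * logit)
    return -total / n


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    matched_texts = [[1.0, 0.0], [0.0, 1.0]]
    swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
    # Correctly paired embeddings should give the lower loss.
    print(siglip_loss(images, matched_texts) < siglip_loss(images, swapped_texts))
```

Unlike a softmax-based contrastive loss, no normalization across the batch is needed, because each pair contributes its own independent term.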

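4. Image Processing

4.1 Resolution Options

Resolution	Image tokens	Use case
224x224	256	Fast inference, low-resolution inputs
448x448	1024	Balanced (recommended default)
896x896	4096	High resolution, document OCR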
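
The token counts in the table follow from splitting the image into square patches. The sketch below assumes a patch size of 14 pixels, which is not stated in this document but reproduces all three rows; `image_token_count` is a hypothetical helper name.

```python
PATCH_SIZE = 14  # assumed patch size in pixels; consistent with the table above


def image_token_count(resolution: int, patch_size: int = PATCH_SIZE) -> int:
    """Number of image tokens for a square image of the given side length."""
    if resolution <= 0 or resolution % patch_size != 0:
        raise ValueError("resolution must be a positive multiple of the patch size")
    patches_per_side = resolution // patch_size
    return patches_per_side * patches_per_side


if __name__ == "__main__":
    for res in (224, 448, 896):
        print(res, image_token_count(res))
```

Doubling the resolution quadruples the number of image tokens, which is why the 896px checkpoints cost considerably more compute and memory than the 224px ones.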

4.2 Performance by Resolution (PaliGemma 2)

Benchmark	224px	448px	896px
TextVQA	65.2	73.1	76.8
DocVQA	52.3	78.4	87.2
ChartQA	58.7	72.4	81.3

5. Benchmark Performance

5.1 PaliGemma 2-10B (448px)

Benchmark	Score
MMMU (val)	49.2%
MathVista	52.1%
AI2D	72.3%
ChartQA	72.4%
DocVQA	78.4%
TextVQA	73.1%
VQAv2	79.8%
COCO Caption	141.2 CIDEr

5.2 Performance by Model Size

Benchmark	3B	10B	28B
MMMU	42.1	49.2	53.4
DocVQA	71.2	78.4	83.6
TextVQA	68.3	73.1	76.2

5.3 Specialized Task Performance

Task	Score	Notes
Object Detection	45.3 mAP	COCO
Segmentation	44.2 mIoU	ADE20K
OCR	82.1%	Scene Text

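6. Usage

6.1 Hugging Face Transformers

from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import torch

model_id = "google/paligemma2-10b-pt-448"

processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
).eval()

# Load the image as RGB; the processor resizes it to the checkpoint's resolution.
image = Image.open("image.jpg").convert("RGB")
# Task-prefix prompt (see 6.2). The "pt" checkpoints respond to these prefixes
# rather than to free-form instructions.
prompt = "caption en"

inputs = processor(images=image, text=prompt, return_tensors="pt")
# Cast floating-point tensors (the pixel values) to bfloat16, then move to the model's device.
inputs = inputs.to(torch.bfloat16).to(model.device)
input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=256)

# generate() returns the prompt tokens followed by the new tokens; decode only the new ones.
result = processor.decode(outputs[0][input_len:], skip_special_tokens=True)
print(result)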

6.2 Task-Specific Prompts

# Object detection: class names separated by " ; "
prompt = "detect person ; car ; dog"

# Image captioning: "caption <language code>"
prompt = "caption en"  # English caption
prompt = "caption ko"  # Korean caption

# OCR
prompt = "ocr"

# Visual question answering: "answer <language code> <question>"
prompt = "answer en What color is the car?"

# Referring expression segmentation
prompt = "segment the red car"
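
Detection prompts return box coordinates as special tokens, so the raw output needs parsing. The sketch below assumes each detection is written as four `<locNNNN>` tokens in the order y_min, x_min, y_max, x_max (values binned into 1024 steps), followed by the class label, with detections separated by ` ; `. That format is not described in this document, so check it against the model card before relying on it. `parse_detections` is a hypothetical helper.

```python
import re

# Four location tokens followed by a label that runs until the next ";" or "<".
_DETECTION = re.compile(
    r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^;<]+)"
)


def parse_detections(text: str, width: int, height: int):
    """Convert assumed '<loc....>' detection output into pixel-space boxes.

    Returns a list of dicts with 'label' and 'box' = (x_min, y_min, x_max, y_max).
    """
    detections = []
    for match in _DETECTION.finditer(text):
        # Token order is assumed to be y_min, x_min, y_max, x_max on a 0..1023 grid.
        y_min, x_min, y_max, x_max = (int(match.group(k)) / 1024 for k in range(1, 5))
        detections.append(
            {
                "label": match.group(5).strip(),
                "box": (x_min * width, y_min * height, x_max * width, y_max * height),
            }
        )
    return detections


if __name__ == "__main__":
    sample = "<loc0256><loc0128><loc0768><loc0512> car ; <loc0000><loc0000><loc1023><loc1023> dog"
    for det in parse_detections(sample, width=1024, height=512):
        print(det)
```

Pass the original image size, not the model's input resolution, so the boxes land in the coordinate system of the image you plan to draw on.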

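6.3 Using Quantization

import torch
from transformers import BitsAndBytesConfig, PaliGemmaForConditionalGeneration

model_id = "google/paligemma2-10b-pt-448"

# Store weights as 4-bit NF4 and compute in bfloat16 (requires the bitsandbytes package).
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)

model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
)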

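6.4 Fine-tuning Example

from transformers import Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model

# Assumes `model` is the PaliGemma model loaded in 6.1 (or 6.3 for QLoRA) and that
# `train_dataset` has already been prepared; neither is defined in this snippet.
lora_config = LoraConfig(
    r=16,                                 # LoRA rank
    lora_alpha=32,                        # LoRA scaling factor
    target_modules=["q_proj", "v_proj"],  # attention query/value projections
    lora_dropout=0.05,
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity check: only the adapters should be trainable

training_args = TrainingArguments(
    output_dir="./paligemma-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    bf16=True,  # match the bfloat16 weights loaded in 6.1
)

# `train_dataset` must yield processor outputs (input_ids, pixel_values, labels),
# or a data_collator that builds them must be passed to the Trainer.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)

trainer.train()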
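
To get a feel for what `r=16` costs, recall that LoRA adds two low-rank matrices to each targeted linear layer, A of shape (r, in_features) and B of shape (out_features, r), so each layer gains r * (in_features + out_features) trainable parameters. The sketch below is only this arithmetic; the layer dimensions in the example are illustrative placeholders, not dimensions confirmed for any PaliGemma checkpoint.

```python
def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer (matrices A and B)."""
    if min(in_features, out_features, r) <= 0:
        raise ValueError("all sizes must be positive")
    return r * (in_features + out_features)


def total_lora_params(layer_shapes, r: int) -> int:
    """Sum over (in_features, out_features) pairs for every targeted layer."""
    return sum(lora_param_count(i, o, r) for i, o in layer_shapes)


if __name__ == "__main__":
    # Hypothetical shapes for one q_proj and one v_proj layer.
    shapes = [(2304, 2048), (2304, 1024)]
    print(total_lora_params(shapes, r=16))
```

For the actual count on a loaded model, `model.print_trainable_parameters()` from PEFT reports it directly.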

7. VRAM Requirements

7.1 PaliGemma 2

Model	FP32	BF16	INT8	INT4
3B (224px)	12GB	6GB	3GB	2GB
3B (448px)	14GB	7GB	4GB	2.5GB
10B (448px)	40GB	20GB	10GB	6GB
28B (448px)	112GB	56GB	28GB	14GB
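
The table roughly tracks a weights-only estimate: parameter count times bytes per parameter. The sketch below reproduces that arithmetic, assuming 1 GB = 10^9 bytes and ignoring activations, the KV cache, and quantization overhead, which is why some table entries (for example the INT4 column for the smaller models) sit slightly above the estimate. `weight_memory_gb` is a hypothetical helper.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion: float, dtype: str) -> float:
    """Weights-only memory in GB (1 GB = 1e9 bytes) for a model of the given size."""
    if dtype not in BYTES_PER_PARAM:
        raise ValueError(f"unknown dtype: {dtype}")
    return params_billion * BYTES_PER_PARAM[dtype]


if __name__ == "__main__":
    for size in (3, 10, 28):
        row = {dtype: weight_memory_gb(size, dtype) for dtype in BYTES_PER_PARAM}
        print(f"{size}B", row)
```

Treat the result as a lower bound: higher input resolutions add image tokens, and the activations and KV cache for those tokens come on top of the weights.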

7.2 Inference Speed (A100 40GB, BF16)

Model	Tokens/s	Time to first token
3B (224px)	85	120ms
3B (448px)	72	180ms
10B (448px)	45	350ms
28B (448px)	18	800ms
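
The two columns combine into a rough end-to-end latency estimate: the time to first token, plus the remaining tokens divided by the decoding throughput. The sketch below applies that approximation to the 10B (448px) row; `estimated_latency_s` is a hypothetical helper, and the model assumes constant throughput after the first token.

```python
def estimated_latency_s(first_token_ms: float, tokens_per_s: float, new_tokens: int) -> float:
    """Approximate seconds needed to generate `new_tokens` tokens.

    The first token costs `first_token_ms`; each later token is assumed to
    arrive at a constant rate of `tokens_per_s`.
    """
    if new_tokens < 1 or tokens_per_s <= 0:
        raise ValueError("need at least one token and a positive throughput")
    return first_token_ms / 1000.0 + (new_tokens - 1) / tokens_per_s


if __name__ == "__main__":
    # 10B (448px) row from the table above: 45 tokens/s, 350 ms to first token.
    latency = estimated_latency_s(first_token_ms=350, tokens_per_s=45, new_tokens=256)
    print(f"{latency:.2f} s for 256 tokens")
```

For short outputs such as detection strings or one-line captions, the time to first token dominates; for long descriptions, throughput does.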

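8. Strengths

Strength	Description
Open weights	Gemma license for the weights; Apache 2.0 for the reference code
Lightweight	Practical performance even with the 3B model
Task coverage	Supports detection and segmentation
Resolution choice	224/448/896px options
Easy fine-tuning	Supports LoRA and QLoRA
Edge deployment	Feasible on mobile and Jetson-class devices

9. Weaknesses

Weakness	Description
Performance ceiling	Lower than closed models
English-centric	Limited multilingual performance
Complex reasoning	Weak at multi-step reasoning
No video support	Processes images only
Gemma license	Commercial-use terms need to be checked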

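10. Use Cases

10.1 Suitable Use Cases

Document OCR (high-resolution mode)
Product image analysis
Visual question answering
Image captioning
Object detection and segmentation (research use)
Edge device deployment
VLM research and experimentation

10.2 Unsuitable Use Cases

Complex multi-step reasoning
Video analysis
Production chatbots (limited performance)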

11. Model Checkpoints

11.1 PaliGemma 2 Checkpoints

Checkpoint	Description
paligemma2-Xb-pt-{res}	Pretrained (general purpose)
paligemma2-Xb-ft-docci-{res}	Fine-tuned on DOCCI
paligemma2-Xb-mix-{res}	Fine-tuned on a mixture of tasks

11.2 Hugging Face Model IDs

google/paligemma2-3b-pt-224
google/paligemma2-3b-pt-448
google/paligemma2-3b-pt-896
google/paligemma2-10b-pt-448
google/paligemma2-28b-pt-448
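
The IDs above follow the naming pattern from 11.1, so they can be assembled programmatically. The helper below is a hypothetical convenience function; it only validates against the sizes, variants, and resolutions mentioned in this document and does not guarantee that every combination is actually published on Hugging Face.

```python
SIZES = ("3b", "10b", "28b")
VARIANTS = ("pt", "ft-docci", "mix")
RESOLUTIONS = (224, 448, 896)


def paligemma2_model_id(size: str, variant: str = "pt", resolution: int = 448) -> str:
    """Build a Hugging Face model ID following the paligemma2 naming pattern."""
    if size not in SIZES:
        raise ValueError(f"size must be one of {SIZES}")
    if variant not in VARIANTS:
        raise ValueError(f"variant must be one of {VARIANTS}")
    if resolution not in RESOLUTIONS:
        raise ValueError(f"resolution must be one of {RESOLUTIONS}")
    return f"google/paligemma2-{size}-{variant}-{resolution}"


if __name__ == "__main__":
    print(paligemma2_model_id("10b"))
    print(paligemma2_model_id("3b", resolution=896))
```

The returned string can be passed as `model_id` to `from_pretrained` in the snippets of section 6.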
