Transformer Architecture: Self-Attention and Multi-Head Attention Explained

Source: transformer-architecture-huggingface Type: Article (Hugging Face Transformers Course) Valid as of: 2026-04-26

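Key Takeaways

The Transformer's breakthrough: introduced in the 2017 paper “Attention Is All You Need”. It removes the sequential-processing bottleneck of RNNs/LSTMs by using Self-Attention to compute the relationships between all tokens in a sequence at once (Source: transformer-architecture-huggingface > transformer의-탄생-배경)
How Self-Attention works: each token uses Query, Key, and Value vectors to score its relevance to every other token. Formula: Attention(Q, K, V) = softmax(Q·K^T / √d_k) · V (Source: transformer-architecture-huggingface > self-attention의-핵심-원리)
Multi-Head Attention: a single attention head learns only one perspective. Eight or more parallel heads learn syntactic, semantic, and structural relationships at the same time (Source: transformer-architecture-huggingface > multi-head-attention)
Positional Encoding: Self-Attention by itself is insensitive to input order, so position information is added explicitly with sinusoidal functions (Source: transformer-architecture-huggingface > positional-encoding)
Recent improvements: Flash Attention (about 3x faster), GQA (about 40% less memory), and Rotary Position Embedding make longer sequences practical; exact gains depend on the model and hardware (Source: transformer-architecture-huggingface > 최신-llm의-개선-사항-2026년-기준)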

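Detailed Summary

Background of the Transformer

First introduced in the 2017 paper “Attention Is All You Need”.

Problems with RNNs/LSTMs

Sequential processing: tokens are processed one at a time, so training is slow
No parallelization: a token cannot be processed until the previous token is finished
Weak long-range dependencies: relationships between distant tokens are hard to learn

The Transformer's Solution

Self-Attention mechanism: computes the relationships between all tokens in the sequence at once
Parallel processing: all tokens are processed simultaneously, which is much faster on parallel hardware
Long-range dependencies: attention connects any two tokens directly, regardless of the distance between them

Result

A core building block of modern LLMs such as GPT, BERT, and Claude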

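Core Principles of Self-Attention

1. Query-Key-Value Matrices (QKV)

Sentence: “The cat sat on the mat”

Each word is represented by three vectors:

Query (Q): “What am I looking for in the other words?”
Key (K): “How do I present myself to the other words?”
Value (V): “What information do I contribute to the final output?”

Example (d_k = 4 dimensions):

Word: "cat"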
  Q = [0.5, 0.2, 0.8, 0.1]
  K = [0.6, 0.1, 0.7, 0.3]
  V = [0.9, 0.4, 0.2, 0.5]
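
In practice the three vectors are not hand-picked: each is usually obtained by multiplying the token's embedding by a learned weight matrix. The following is a minimal sketch in plain Python, assuming a hypothetical 4-dimensional embedding `x_cat` and made-up matrices `W_Q`, `W_K`, `W_V` (none of these values come from the source). It also computes the raw similarity between the Q and K vectors listed above for "cat".

```python
def project(x, W):
    """Multiply row vector x (length d_in) by matrix W (d_in x d_out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


# Hypothetical embedding for "cat" and hypothetical learned weights (not from the source).
x_cat = [1.0, 0.0, 2.0, 1.0]
W_Q = [[0.5, 0.0, 0.0, 0.0],
       [0.0, 0.5, 0.0, 0.0],
       [0.0, 0.0, 0.5, 0.0],
       [0.0, 0.0, 0.0, 0.5]]
W_K = [[0.0, 1.0, 0.0, 0.0],
       [1.0, 0.0, 0.0, 0.0],
       [0.0, 0.0, 0.0, 1.0],
       [0.0, 0.0, 1.0, 0.0]]
W_V = [[1.0, 0.0, 0.0, 0.0],
       [0.0, 1.0, 0.0, 0.0],
       [0.0, 0.0, 1.0, 0.0],
       [0.0, 0.0, 0.0, 1.0]]

# The same embedding yields three different views of the token.
q = project(x_cat, W_Q)
k = project(x_cat, W_K)
v = project(x_cat, W_V)

# Raw (unscaled) similarity between the example Q and K vectors for "cat".
q_cat = [0.5, 0.2, 0.8, 0.1]
k_cat = [0.6, 0.1, 0.7, 0.3]
raw_score = dot(q_cat, k_cat)

print("q =", q)
print("k =", k)
print("v =", v)
print("raw score =", raw_score)
```

Because the three matrices are learned independently, the same token can ask one question (Q), advertise something different (K), and pass on yet another piece of information (V).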

2. Computing Attention Weights (Softmax)

Formula:

Attention(Q, K, V) = softmax(Q·K^T / √d_k) · V

Step-by-step example (how “the” attends to “cat”):

1) Compute Q·K^T (similarity between my Query and the other token's Key)
   Q of "the" · K of "cat" = 0.78

2) Scale by √d_k (√4 = 2)
   0.78 / 2 = 0.39

3) Compute the scaled score against every word
   "the" → "the": 0.46
   "the" → "cat": 0.39
   "the" → "sat": 0.325
   ... (all 6 words)

4) Apply softmax (all weights sum to 1.0)
   exp(0.46) / Σ = 0.28 (28% attention on itself)
   exp(0.39) / Σ = 0.26 (26% attention on "cat")
   exp(0.325) / Σ = 0.24 (24% attention on "sat")
   ... (sum = 1.0)

5) Take the weighted average of the Value vectors
   Output = 0.28×V("the") + 0.26×V("cat") + 0.24×V("sat") + ...

Result: "the" attends most strongly to itself (28%) and to "cat" (26%)
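
The five steps above can be sketched as one small function. This is a minimal plain-Python illustration, not an optimized implementation: the three-token `Q`, `K`, and `V` matrices are hypothetical toy values, and real models run the same arithmetic as batched matrix operations.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1 (shifted by max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Steps 1-3: similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Step 4: softmax turns the scores into attention weights.
        weights = softmax(scores)
        # Step 5: weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical toy input: 3 tokens, d_k = 4.
Q = [[0.5, 0.2, 0.8, 0.1], [0.1, 0.9, 0.3, 0.4], [0.7, 0.1, 0.2, 0.6]]
K = [[0.6, 0.1, 0.7, 0.3], [0.2, 0.8, 0.1, 0.5], [0.4, 0.3, 0.9, 0.2]]
V = [[0.9, 0.4, 0.2, 0.5], [0.1, 0.7, 0.6, 0.3], [0.5, 0.5, 0.8, 0.1]]

outputs, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each output row is a convex combination of the value vectors, so every output component stays between the smallest and largest value in that column of `V`.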

Multi-Head Attention

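Why is it needed?

A single attention head learns only one perspective. With several heads, each can specialize, for example:

Head 1: syntactic relationships (linking subject and verb)
Head 2: semantic relationships (what a pronoun refers to)
Head 3: structural relationships (the scope of a prepositional phrase)

Parallel structure (8-head example)

Input sentence (n_words = 6, d_model = 512)
        ↓
[Head 1] → Attention(Q₁, K₁, V₁) → 64-dim output
[Head 2] → Attention(Q₂, K₂, V₂) → 64-dim output
[Head 3] → Attention(Q₃, K₃, V₃) → 64-dim output
...
[Head 8] → Attention(Q₈, K₈, V₈) → 64-dim output
        ↓
Concatenate the 8 outputs → 512-dim (8 × 64)
        ↓
Linear(W_o) → final output (512-dim)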
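
The split, attend, concatenate, project pipeline above can be sketched in plain Python. To keep the example fast it assumes a scaled-down configuration (`d_model = 8`, 2 heads of 4 dimensions) instead of the 512/8/64 setup in the diagram, and all weight matrices are random placeholders rather than trained parameters.

```python
import math
import random


def matmul(A, B):
    """Multiply matrix A (n x m) by matrix B (m x p)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention for one head."""
    d_k = len(K[0])
    out = []
    for q in Q:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K])
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out


def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    d_model = len(X[0])
    d_head = d_model // n_heads
    # Project once, then slice the columns so each head sees its own d_head-wide view.
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    heads = []
    for h in range(n_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        heads.append(attention([r[lo:hi] for r in Q],
                               [r[lo:hi] for r in K],
                               [r[lo:hi] for r in V]))
    # Concatenate the per-head outputs back to d_model, then mix them with W_o.
    concat = [sum((head[t] for head in heads), []) for t in range(len(X))]
    return matmul(concat, W_o)


rng = random.Random(0)


def rand_matrix(rows, cols):
    """Random placeholder weights (hypothetical, untrained)."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


n_words, d_model, n_heads = 6, 8, 2
X = rand_matrix(n_words, d_model)
W_q, W_k, W_v, W_o = (rand_matrix(d_model, d_model) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(len(out), len(out[0]))
```

Slicing one large projection into column blocks is equivalent to giving each head its own smaller projection matrix, which is why the concatenated result returns to `d_model` dimensions.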

Attention Visualization

Example: resolving the pronoun “it”

Sentence: “The animal didn’t cross the street because it was tired”

Attention distribution for "it" (the 8th word):

Word             Attention weight
animal   ████████████ 0.42  ← highest (the correct referent)
street   █████ 0.16
was      ███ 0.12
it (self) ██ 0.10
...others █ 0.20

→ The model has learned that "it" refers to "animal"

This is what makes pronoun resolution, that is, identifying what a pronoun refers to, possible.

Positional Encoding

Self-Attention by itself is insensitive to token order, so position information must be added explicitly.

Formula:

Position Encoding(pos, 2i) = sin(pos / 10000^(2i/d_model))
Position Encoding(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Example (d_model = 512, values rounded to two decimals):

Position 0 ("The"):  [0.00, 1.00, 0.00, 1.00, ...]
Position 1 ("cat"):  [0.84, 0.54, 0.82, 0.57, ...]
Position 2 ("sat"):  [0.91, -0.42, 0.94, -0.35, ...]

Final embedding = Word Embedding + Positional Encoding
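
The sinusoidal formula can be evaluated directly. The sketch below is a plain-Python illustration that assumes `d_model = 512` (the value used elsewhere in this note) and a hypothetical all-zero word embedding, so that the addition step is easy to inspect.

```python
import math


def positional_encoding(pos, d_model):
    """Sinusoidal encoding: sin on even dimensions, cos on odd dimensions."""
    pe = []
    for k in range(d_model):
        i = k // 2  # dimensions 2i and 2i+1 share the same frequency
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return pe


def add_position(embedding, pos):
    """Final embedding = word embedding + positional encoding."""
    pe = positional_encoding(pos, len(embedding))
    return [e + p for e, p in zip(embedding, pe)]


d_model = 512
for pos in range(3):
    first_four = positional_encoding(pos, d_model)[:4]
    print(pos, [round(v, 2) for v in first_four])

# Hypothetical all-zero word embedding, so the result equals the encoding itself.
final = add_position([0.0] * d_model, 1)
```

Low dimensions oscillate quickly and high dimensions slowly, so each position receives a distinct pattern across the vector.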

Overall Transformer Architecture

Input text: "The cat sat on the mat"
        ↓
[Tokenization] → ["The", "cat", "sat", "on", "the", "mat"]
        ↓
[Embedding] → (6 tokens, 512 dimensions)
        ↓
[Positional Encoding] + Embedding
        ↓
[6 Encoder Layers]
  Each layer:
    - Multi-Head Attention (8 heads)
    - Feed-Forward Network (2 Linear layers + ReLU)
    - Residual Connection + Layer Norm
        ↓
[Decoder Layers] (only for translation/generation)
        ↓
[Linear + Softmax] → next-token probabilities
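
The contents of one encoder layer can be sketched as follows. This is a simplified plain-Python illustration under several assumptions: the attention sub-layer uses the input directly as Q, K, and V (no learned projections, single head), layer normalization has no learned gain or bias, normalization is applied after each residual addition, and the same random feed-forward weights are reused across all six layers for brevity.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def simple_self_attention(X):
    """Single-head attention with Q = K = V = X (simplifying assumption)."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[j] for e, v in zip(exps, X)) for j in range(d)])
    return out


def feed_forward(x, W1, b1, W2, b2):
    """Two linear layers with a ReLU in between, applied to one token."""
    hidden = [max(0.0, sum(xi * W1[i][j] for i, xi in enumerate(x)) + b1[j])
              for j in range(len(b1))]
    return [sum(h * W2[j][k] for j, h in enumerate(hidden)) + b2[k]
            for k in range(len(b2))]


def encoder_layer(X, ffn):
    # Sub-layer 1: self-attention, residual connection, layer norm.
    attended = simple_self_attention(X)
    X = [layer_norm([a + b for a, b in zip(x, y)]) for x, y in zip(X, attended)]
    # Sub-layer 2: position-wise feed-forward, residual connection, layer norm.
    return [layer_norm([a + b for a, b in zip(x, ffn(x))]) for x in X]


rng = random.Random(0)
d_model, d_ff, n_tokens = 4, 8, 3  # scaled-down, hypothetical sizes
W1 = [[rng.uniform(-1, 1) for _ in range(d_ff)] for _ in range(d_model)]
W2 = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]
b1, b2 = [0.0] * d_ff, [0.0] * d_model

X = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(n_tokens)]
for _ in range(6):  # stack of 6 layers; a real model has separate weights per layer
    X = encoder_layer(X, lambda x: feed_forward(x, W1, b1, W2, b2))
print([[round(v, 3) for v in row] for row in X])
```

The residual additions let each sub-layer refine the representation rather than replace it, and the shape of `X` stays (tokens, d_model) from the first layer to the last.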

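Key Concepts

Concept	Description	Purpose
Query (Q)	“What information am I looking for?”	Searches the other tokens
Key (K)	“What my information is about”	Acts as the lookup key
Value (V)	“The information I contribute”	Contributes to the final weighted sum
Attention Weight	Q·K similarity → softmax	Importance of each token (0 to 1)
Multi-Head	8 or more parallel heads	Learns multiple perspectives
Positional Encoding	Position of each token	Preserves order information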

Worked Calculation Example

Sentence: “Claude is an AI assistant”

Q(Claude) = [0.7, 0.2, 0.5, ...]
K(Claude) = [0.6, 0.3, 0.4, ...]
K(is) = [0.2, 0.8, 0.1, ...]
K(AI) = [0.8, 0.1, 0.6, ...]

Attention Score:
Claude → Claude: 0.71
Claude → is: 0.31
Claude → AI: 0.75 ← high

After softmax (over the three tokens shown):
Claude → Claude: 0.37
Claude → is: 0.25
Claude → AI: 0.38  ← "Claude" attends most to "AI"

Conclusion: the highest weight falls on "AI", which suggests the model associates "Claude" with "AI"
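
The softmax step for the three attention scores listed above can be checked with a few lines of plain Python. This sketch assumes the scores are already scaled and that the softmax runs over only the three tokens shown.

```python
import math

# Attention scores from the example above (query: "Claude").
scores = {"Claude": 0.71, "is": 0.31, "AI": 0.75}

# Softmax: exponentiate each score, then divide by the total.
exps = {token: math.exp(score) for token, score in scores.items()}
total = sum(exps.values())
weights = {token: value / total for token, value in exps.items()}

for token, weight in weights.items():
    print(f"Claude -> {token}: {weight:.2f}")

most_attended = max(weights, key=weights.get)
print("highest weight:", most_attended)
```

Because softmax depends only on the differences between scores, scores that are close together produce weights that are close together.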

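Recent LLM Improvements (as of 2026)

Improvement	Description	Effect
Flash Attention	Memory-efficient attention computation	About 3x faster
GQA (Grouped Query Attention)	Keys/values shared across groups of query heads	About 40% less memory
ALiBi (Attention with Linear Biases)	Replaces positional encoding	Longer sequences possible
Rotary Position Embedding	Absolute → relative position	Better extrapolation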
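
As one illustration from the table, the absolute-to-relative idea behind Rotary Position Embedding can be sketched in plain Python. The sketch assumes the common formulation (adjacent dimension pairs rotated by a position-dependent angle, base 10000); the query and key vectors are hypothetical. Each vector is rotated using its own absolute position, yet the resulting dot product depends only on the distance between the two positions.

```python
import math


def rope(x, pos, base=10000.0):
    """Rotate each adjacent pair (x[i], x[i+1]) by an angle proportional to pos."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # lower pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


# Hypothetical query and key vectors (d = 4).
q = [0.5, 0.2, 0.8, 0.1]
k = [0.6, 0.1, 0.7, 0.3]

# Same relative offset (query 2 positions after the key), different absolute positions.
score_near = dot(rope(q, 3), rope(k, 1))
score_far = dot(rope(q, 10), rope(k, 8))
print(score_near, score_far)
```

The two scores agree up to floating-point error, which is the property that lets attention scores encode relative position without a separate positional vector being added to the embeddings.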

Related Wiki Pages

attention-mechanism: attention mechanism concepts
nlp-fundamentals: NLP fundamentals
llm-architecture: LLM architecture
huggingface: the Hugging Face framework
prompt-engineering-techniques: prompting techniques
rag-langchain-implementation-datacamp: using Transformers in RAG
genai-design-patterns-devto-skala: generative AI service patterns
