JYP Garden


tags	llm-architecture, attention, long-context, deepseek, knowledge

May 7, 2026 · 2 min read

Sparse Attention

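Instead of dense attention, where each token attends to every previous token, sparse attention lets each token attend selectively to a small number of important tokens. It is a key technique for sharply reducing the cost of long-context processing (compute plus KV cache memory).

Core idea

As the context grows longer, not all tokens are equally important, so attending only to a small set of important tokens (sparse attention) can still preserve performance.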
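To make the idea concrete, here is a minimal sketch of single-query attention with an optional top-k restriction. It is a toy illustration in pure Python, not the mechanism of any particular model; the function name `attend`, the `top_k` parameter, and the example vectors are all assumptions chosen for this note.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values, top_k=None):
    """Single-query scaled dot-product attention.

    If top_k is set (hypothetical parameter), only the top_k
    highest-scoring keys take part in the softmax.
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, k) * scale for k in keys]
    idx = list(range(len(keys)))
    if top_k is not None and top_k < len(keys):
        # Keep the indices of the top_k largest scores, in position order.
        idx = sorted(idx, key=lambda i: scores[i], reverse=True)[:top_k]
        idx.sort()
    weights = softmax([scores[i] for i in idx])
    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, idx)) for d in range(dim)]
    return out, idx

# Toy data: two keys align strongly with the query, the rest barely matter.
keys = [[4.0, 0.0], [0.0, 1.0], [3.0, 0.0], [0.0, -1.0], [-2.0, 0.0], [0.1, 0.1]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [5.0, 5.0], [2.0, 2.0]]
query = [2.0, 0.0]

dense_out, dense_idx = attend(query, keys, values)
sparse_out, sparse_idx = attend(query, keys, values, top_k=2)
```

In this toy case the sparse output stays close to the dense one because almost all of the softmax mass already sits on the two selected keys. That is the situation sparse attention relies on; it is not guaranteed for arbitrary inputs.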

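Why training is hard

The top-k selection operation itself is non-differentiable, so the gradient path through the selection is cut off
Training sparse attention from scratch causes training instability
A conclusion shared by several Chinese labs: "training with sparse attention alone, without dense attention, is very hard"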
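The non-differentiability point can be checked numerically. The sketch below builds a hard top-k mask and estimates its derivative with finite differences; the helper names `topk_mask` and `finite_diff` are hypothetical. Because the mask is piecewise constant, a small change to any score leaves it unchanged, so the selection step passes no gradient signal back to the scores that produced it.

```python
def topk_mask(scores, k):
    """Return a 0/1 mask marking the k largest scores (hard selection)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:k])
    return [1.0 if i in keep else 0.0 for i in range(len(scores))]

def finite_diff(scores, k, i, eps=1e-4):
    """Finite-difference estimate of d(mask) / d(scores[i])."""
    bumped = list(scores)
    bumped[i] += eps
    base = topk_mask(scores, k)
    after = topk_mask(bumped, k)
    return [(a - b) / eps for a, b in zip(after, base)]

scores = [0.9, 0.1, 0.5, 0.3]
mask = topk_mask(scores, k=2)

# One row per perturbed score; every entry is zero because the
# ranking does not change under a tiny perturbation.
grads = [finite_diff(scores, k=2, i=i) for i in range(len(scores))]
```

The derivative is zero almost everywhere and undefined exactly where two scores tie at the selection boundary, which is why the selector cannot be trained by plain backpropagation through the mask.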

DeepSeek-V4's solution

The first 1T tokens are trained with dense attention as a warm-up, and the remaining 30T+ tokens are then trained with sparse attention. This is a compromise: close to from-scratch, but not entirely from-scratch.
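As a rough sketch of what such a schedule looks like in a training loop, the snippet below switches the attention mode on the number of tokens seen. The 1T and 30T figures come from this note; the function name, the hard switch at the boundary, and the use of exactly 30T for the sparse phase are assumptions (the note says "30T+", and says nothing about how the transition is handled).

```python
DENSE_WARMUP_TOKENS = 1_000_000_000_000    # 1T tokens, from the note
SPARSE_PHASE_TOKENS = 30_000_000_000_000   # "30T+" in the note; 30T used as a lower bound

def attention_mode(tokens_seen, warmup_tokens=DENSE_WARMUP_TOKENS):
    """Hypothetical helper: which attention variant to train with at this point."""
    return "dense" if tokens_seen < warmup_tokens else "sparse"

# Share of the total token budget spent in dense warm-up (an upper bound,
# since the sparse phase may be longer than 30T).
warmup_fraction = DENSE_WARMUP_TOKENS / (DENSE_WARMUP_TOKENS + SPARSE_PHASE_TOKENS)
```

Under these numbers the dense warm-up covers at most about 3% of training (1/31), which is why the recipe reads as "almost from-scratch" sparse training.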

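DeepSeek-V4 three-component structure

Sliding Window Attention: attends only to the most recent ~500 tokens (local context)
Block-sparse Attention: compresses the whole sequence at 100:1, then applies full attention (global summary)
Compressed Sparse Attention: 4:1 compression plus Lightning Indexer top-k selection (importance-based)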
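The ratios above translate into a simple back-of-the-envelope count of how many key/value entries one query touches in each component. The window size and compression ratios are taken from this note; the `top_k` value is not given there, so the 2048 used below is purely a placeholder assumption, as are the function and key names.

```python
import math

def kv_entries_per_query(context_len, window=500, block_ratio=100,
                         csa_ratio=4, top_k=2048):
    """Rough per-query KV counts for each component.

    top_k is a hypothetical placeholder; the note does not state it.
    """
    csa_candidates = math.ceil(context_len / csa_ratio)
    return {
        "dense_baseline": context_len,
        # Local path: only the most recent `window` tokens.
        "sliding_window": min(context_len, window),
        # Global path: one summary entry per `block_ratio` tokens.
        "block_sparse": math.ceil(context_len / block_ratio),
        # Importance path: candidates after 4:1 compression, then top-k.
        "csa_candidates": csa_candidates,
        "csa_selected": min(top_k, csa_candidates),
    }

counts = kv_entries_per_query(1_000_000)
```

For a 1M-token context this gives 500 local entries, 10,000 global summary entries, and 250,000 compressed candidates from which the indexer keeps only `top_k`. How the three paths are combined within the model is not described in this note, so the counts are reported separately rather than summed.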

Related notes

DeepSeek-V4-Architecture
yt-rJEMaldMyLE-DeepSeek-V4-Paper-Reading

