Computational and Communication Modeling of LLM Serving Systems

Notation and Basic Parameters

Basic Parameters

| Parameter | Symbol | Example | Meaning | Performance Impact |
|---|---|---|---|---|
| Batch Size | $B$ | 16 | Number of concurrent requests | $B$↑ → hardware utilization↑, KV usage↑ |
| Sequence Length | $S$ | 4096 | Number of tokens in the input | $S$↑ → computation↑ (Prefill), memory↑ (Decode) |
| Hidden Dimension | $d$ | 7168 | Transformer internal representation dimension | $d$↑ → computation ∝ $d^2$, memory ∝ $d$ |
| Number of Layers | $L$ | 61 | Number of Transformer blocks | $L$↑ → total params ∝ $L$, inference time ∝ $L$ |
| Attention Heads | $H$ | 128 | Number of multi-head attention heads | $H$↑ → TP upper limit↑ |

Parallelism Strategy Parameters

| Parameter | Symbol | Example | Meaning | Key Relationships |
|---|---|---|---|---|
| Tensor Parallelism | $TP$ | 8 | Split a single tensor | $H \bmod TP = 0$; each GPU processes $H/TP$ heads |
| Pipeline Parallelism | $PP$ | 4 | Split layers across GPUs | Each stage processes $L/PP$ layers; has bubble overhead |
| Data Parallelism | $DP$ | 2 | Split the batch | Each GPU processes $B/DP$ requests; replicas infer independently |
| Expert Parallelism | $EP$ | 64 | Split experts across GPUs | Each GPU stores $E/EP$ experts; requires All-to-All communication |
| Total GPUs | $N_{GPU}$ | 512 | Total GPUs in the cluster | $N_{GPU} = TP \times PP \times DP \times EP$ |
| Communication Group Size | $TP_{comm}$ | 8 | Number of GPUs in a single communication group | TP requires All-Reduce; EP requires All-to-All |
| Parameters per GPU | $P_{GPU}$ | ~50 GB | Parameters stored per GPU | $P_{GPU} \approx P_{total}/(TP \times PP \times EP)$; DP doesn't affect the per-GPU parameter count |

Continuous Batching Parameters (Real Serving Scenario)

| Parameter | Symbol | Meaning | Relationship to Simplified Model |
|---|---|---|---|
| Prefill Requests | $B_p$ | Number of Prefill requests | Changes dynamically |
| Decode Requests | $B_d$ | Number of Decode requests | Changes dynamically |
| Total Requests | $B = B_p + B_d$ | Total requests in the current batch | $B$ is fixed in the simplified model |
| Total Prefill Tokens | $T_p = \sum_{i=1}^{B_p} S_i$ | Total tokens across Prefill requests | Each request has a different sequence length $S_i$ |
| Decode Tokens | $T_d = B_d$ | Tokens generated by Decode requests | Each request generates 1 token |
| Total Tokens | $T_{total} = T_p + T_d$ | Actual total tokens processed | Communication ∝ $T_{total}$, not $B \times S$ |

Detailed Parallelism Strategies (Simplified Model)

Note: The following analysis is a simplified model, assuming all requests in a batch are in the same stage (pure Prefill or pure Decode). Real serving scenarios use continuous batching, detailed in later sections.

TP Impact on Attention

TP All-Reduce Communication Derivation

Complete Computation (No TP)

$$Y = X \times W$$

  • $X: (N, d_{in})$ - Input
  • $W: (d_{in}, d_{out})$ - Weight matrix
  • $Y: (N, d_{out})$ - Output

TP Partitioning (Row-wise Weight Split)

  1. Input Split: Each GPU holds part of input $(N, d_{in}/TP)$

  2. Row-wise Weight Split:

```text
Complete weight: W = (d_in, d_out)

Split into TP parts:
GPU i: W_i = (d_in/TP, d_out)   <- Rows i×(d_in/TP) to (i+1)×(d_in/TP)-1
```

  3. Local Computation: Each GPU computes: $Y_i = X_i \times W_i = (N, d_{in}/TP) \times (d_{in}/TP, d_{out}) \to (N, d_{out})$

    Note: Each GPU produces a $(N, d_{out})$ output, but these are partial results!

  4. Why Partial Results?

    The complete matrix multiplication is: $$Y = \begin{bmatrix} X_0 & X_1 & \cdots & X_{TP-1} \end{bmatrix} \times \begin{bmatrix} W_0 \\ W_1 \\ \vdots \\ W_{TP-1} \end{bmatrix} = \sum_{i=0}^{TP-1} X_i W_i$$

    Each GPU computes only one term, so an All-Reduce summation is required to obtain the final result.

  5. All-Reduce Communication:

    • Data size to synchronize: $(N \times d_{out})$ elements
    • Actual communication (Ring All-Reduce algorithm, per GPU): $$\text{Send} + \text{Receive} = 2 \times \frac{TP-1}{TP} \times N \times d_{out} \text{ elements}$$
    • When $TP=8$: Actually transfers $1.75 \times N \times d_{out}$ elements

General Rule

For all layers that use TP with a row-wise weight split (output projection, second FFN layer, etc.), the All-Reduce communication depends on the output shape: $$\text{To sync} = N \times d_{out}, \quad \text{Actual transfer (per GPU)} = 2 \times \frac{TP-1}{TP} \times N \times d_{out}$$

Simplified Convention

For brevity, "communication volume" in the following text refers to the data size to synchronize (i.e., the output size); the actual transfer is obtained by multiplying by the factor $2(TP-1)/TP$.
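
To make the derivation concrete, here is a minimal Python/numpy sketch (shapes are arbitrary, chosen only for illustration) that splits the input column-wise and the weight row-wise, verifies that the partial products sum to the full result, and evaluates the Ring All-Reduce traffic formula above.

```python
import numpy as np

TP, N, d_in, d_out = 8, 4, 16, 32
X = np.random.randn(N, d_in)
W = np.random.randn(d_in, d_out)

# Column-split the input and row-split the weight into TP shards.
X_shards = np.split(X, TP, axis=1)   # each (N, d_in/TP)
W_shards = np.split(W, TP, axis=0)   # each (d_in/TP, d_out)

# Each "GPU" produces a full-shaped (N, d_out) partial result.
partials = [x @ w for x, w in zip(X_shards, W_shards)]

# All-Reduce is an element-wise sum of the partial results.
Y = sum(partials)
assert np.allclose(Y, X @ W)

# Ring All-Reduce traffic per GPU: send + receive 2*(TP-1)/TP of the output.
to_sync = N * d_out
per_gpu = 2 * (TP - 1) / TP * to_sync
print(f"to sync: {to_sync} elements, per-GPU transfer: {per_gpu:.0f} elements")
```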

Essential Difference: Prefill vs Decode

Although both are matrix-matrix multiplication (when batch size $B > 1$), their properties are completely different:

| Dimension | Prefill | Decode |
|---|---|---|
| Input Shape | $(B, S, d)$, $S$ large (e.g., 2048) | $(B, 1, d)$, sequence length is 1 |
| Matrix Mult | $(B, S, d) \times (d, d)$, large matrix | $(B, d) \times (d, d)$, small matrix |
| Computation | $2BSd^2$ | $2Bd^2$ |
| Arithmetic Intensity | $\frac{2BSd^2}{2d^2} \approx BS$ (high) | $\frac{2Bd^2}{2d^2} \approx B$ (low) |
| Bottleneck | Compute-bound | Memory-bound |
| GPU Utilization | High (80-90%) | Low (10-30%) |
| Optimization | Increase parallelism (TP), compute optimization | Reduce memory access (quantization, KV Cache compression) |
| Communication Impact | Can be hidden by computation | Latency directly exposed |
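
The following Python sketch instantiates the arithmetic-intensity row of the table. The ridge point (FLOPs per element moved at which an accelerator flips from memory-bound to compute-bound) is a placeholder assumption, not a measured value.

```python
def gemm_arithmetic_intensity(tokens: int, d: int) -> float:
    """FLOPs per weight element read for a (tokens, d) x (d, d) GEMM."""
    flops = 2 * tokens * d * d      # multiply-accumulate count
    weight_elems = d * d            # dominant memory traffic when tokens << d
    return flops / weight_elems     # ~= tokens

B, S, d = 16, 2048, 7168
ridge_point = 300.0                 # hypothetical accelerator ridge point

for name, tokens in (("Prefill", B * S), ("Decode", B)):
    ai = gemm_arithmetic_intensity(tokens, d)
    bound = "compute-bound" if ai > ridge_point else "memory-bound"
    print(f"{name}: intensity ~ {ai:.0f} -> {bound}")
```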

Standard MHA (Multi-Head Attention)

Prefill Phase

Step 1: QKV Projection

Computation:

  • Complete computation: $Q, K, V = X \times W_Q, X \times W_K, X \times W_V$

    • Input $X: (B, S, d)$
    • Weights $W_Q, W_K, W_V: (d, d)$
    • Output $Q, K, V: (B, S, d)$ → reshape to $(B, S, H, d_h)$
  • TP Split: Each GPU processes $H/TP$ heads, weight matrices column-wise split

  • Shape per GPU:

    • Input $X$: $(B, S, d)$ (broadcast to all GPUs)
    • Weights $W_Q, W_K, W_V$: $(d, \frac{d}{TP})$ (column split)
    • Output $Q, K, V$: $(B, S, \frac{d}{TP})$ → reshape to $(B, S, \frac{H}{TP}, d_h)$
  • Computation per GPU: $3 \times \frac{2BS \times d \times d}{TP} = \frac{6BSd^2}{TP}$ (Q, K, V three projections)

  • KV Cache Usage:

    • Per GPU per layer: $2 \times B \times S \times \frac{d}{TP}$ ($\frac{H}{TP}$ heads, K and V each $B \times S \times d_h$)
    • Total across all GPUs: $2 \times B \times S \times d$

Communication: None (TP column split, each GPU computes independently)

Step 2: Attention Computation

Computation: Scaled Dot-Product Attention

  • Shape per GPU:

    • Input Q, K, V: $(B, S, \frac{H}{TP}, d_h)$ (each GPU processes $\frac{H}{TP}$ heads)
    • Merge batch and sequence dimensions: $(B \times S, \frac{H}{TP}, d_h)$ or view as each head processes $(B, S, d_h)$
  • Computation flow (each head independent):

    1. $\text{Score} = \frac{Q \times K^T}{\sqrt{d_h}}$: $(B, S, d_h) \times (B, d_h, S) \to (B, S, S)$ (attention score matrix)
    2. $\text{AttentionWeights} = \text{Softmax}(\text{Score})$: $(B, S, S)$ (softmax per row)
    3. $\text{Output} = \text{AttentionWeights} \times V$: $(B, S, S) \times (B, S, d_h) \to (B, S, d_h)$
  • Output merge:

    • Each GPU output: $(B, S, \frac{H}{TP}, d_h)$
    • After concat heads: $(B, S, \frac{d}{TP})$
      • Concatenate $\frac{H}{TP}$ heads’ dimension $d_h$: $\frac{H}{TP} \times d_h = \frac{d}{TP}$
  • Computation per GPU: $\frac{4BS^2 \times d}{TP}$ ($Q \times K^T$ and $\text{AttentionWeights} \times V$ contribute $\frac{2BS^2 d}{TP}$ each)

Communication: None (each GPU computes its own heads independently)

Step 3: Output Projection

Computation:

  • Complete computation: $Y = \text{Concat}(\text{head}_1, \ldots, \text{head}_H) \times W_O$

    • Input (concatenated attention output): $(B, S, d)$
    • Weight $W_O$: $(d, d)$
    • Output $Y$: $(B, S, d)$
  • TP Split: Weight matrix row-wise split

  • Shape per GPU:

    • Input: $(B, S, \frac{d}{TP})$ (each GPU’s attention output)
    • Weight $W_O$: $(\frac{d}{TP}, d)$ (row split)
    • Local output: $(B, S, \frac{d}{TP}) \times (\frac{d}{TP}, d) \to (B, S, d)$ (partial result)
  • Computation per GPU: $\frac{2BS \times d \times d}{TP}$

Communication:

  • All-Reduce: Merge partial results from all GPUs $(B, S, d)$
  • Communication volume: $B \times S \times d$ elements (per layer)
  • Large communication pressure when $S$ is large, but can be hidden by computation
  • Communication/Computation ratio: $\frac{BSd}{BSd^2/TP} = \frac{TP}{d} \approx 0.1\%$ (when $d=7168$, $TP=8$)
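
The sketch below is a rough, illustrative per-layer, per-GPU cost model for MHA Prefill under TP, assembled from the formulas above (element counts, not bytes); it is not the serving system's actual accounting.

```python
def mha_prefill_per_gpu(B: int, S: int, d: int, TP: int):
    qkv_flops = 6 * B * S * d * d / TP        # Q/K/V projections
    attn_flops = 4 * B * S * S * d / TP       # QK^T plus Weights x V
    out_flops = 2 * B * S * d * d / TP        # output projection
    allreduce_elems = B * S * d               # data to synchronize per layer
    comm_ratio = allreduce_elems / (B * S * d * d / TP)   # = TP / d
    return qkv_flops + attn_flops + out_flops, allreduce_elems, comm_ratio

flops, comm, ratio = mha_prefill_per_gpu(B=16, S=2048, d=7168, TP=8)
print(f"FLOPs/layer: {flops:.3e}, All-Reduce elems/layer: {comm:.3e}, ratio: {ratio:.4%}")
```
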
Decode Phase

Step 1: QKV Projection

Computation:

  • Complete computation: $Q, K, V = X_{new} \times W_Q, X_{new} \times W_K, X_{new} \times W_V$

    • Input $X_{new}$ (new token): $(B, d)$
    • Weights $W_Q, W_K, W_V$: $(d, d)$
    • Output $Q, K, V$: $(B, d)$ → reshape to $(B, H, d_h)$
    • K and V appended to KV Cache
  • TP Split: Weight matrices column-wise split

  • Shape per GPU:

    • Input $X_{new}$: $(B, d)$ (broadcast to all GPUs)
    • Weights $W_Q, W_K, W_V$: $(d, \frac{d}{TP})$ (column split)
    • Output $Q, K, V$: $(B, \frac{d}{TP})$ → reshape to $(B, \frac{H}{TP}, d_h)$
  • Computation: $3 \times \frac{2B \times d^2}{TP} = \frac{6Bd^2}{TP}$ (Q, K, V three projections)

  • KV Cache Usage ($S$ gradually growing):

    • Per GPU per layer: $2 \times B \times S \times \frac{d}{TP}$
    • Total across all GPUs: $2 \times B \times S \times d$

Communication: None

Step 2: Attention Computation

Computation: New token’s Q attends to all historical KV

  • Shape per GPU:

    • $Q$ (new token): $(B, \frac{H}{TP}, d_h)$ (sequence length is 1)
    • $K$ (history, from KV Cache): $(B, S, \frac{H}{TP}, d_h)$ ($S$ is history length, including new token)
    • $V$ (history, from KV Cache): $(B, S, \frac{H}{TP}, d_h)$
  • Computation flow (each head independent, each GPU processes $\frac{H}{TP}$ heads):

    1. $\text{Score} = \frac{Q \times K^T}{\sqrt{d_h}}$: $(B, 1, d_h) \times (B, d_h, S) \to (B, 1, S)$
    2. $\text{Weights} = \text{Softmax}(\text{Score})$: $(B, 1, S)$
    3. $\text{Output} = \text{Weights} \times V$: $(B, 1, S) \times (B, S, d_h) \to (B, 1, d_h)$
  • Output merge:

    • Each GPU output: $(B, \frac{H}{TP}, d_h)$
    • Concat heads dimension: $(B, \frac{d}{TP})$
  • Computation per GPU: $\frac{4B \times S \times d}{TP}$ ($S$ is the history length; $Q \times K^T$ and $\text{Weights} \times V$ contribute $\frac{2BSd}{TP}$ each)

Memory access (main bottleneck): Need to read all KV Cache

  • Memory access per GPU: $2 \times B \times S \times L \times \frac{d}{TP}$ (K and V for L layers)
  • Total across all GPUs: $2BSLd$

Communication: None

Step 3: Output Projection

Computation:

  • Complete computation: $Y = \text{Concat}(\text{Output heads}) \times W_O$

    • Input (concatenated attention output): $(B, d)$
    • Weight $W_O$: $(d, d)$
    • Output $Y$: $(B, d)$
  • TP Split: Weight matrix row-wise split

  • Shape per GPU:

    • Input: $(B, \frac{d}{TP})$ (each GPU’s attention output)
    • Weight $W_O$: $(\frac{d}{TP}, d)$ (row split)
    • Local output: $(B, \frac{d}{TP}) \times (\frac{d}{TP}, d) \to (B, d)$ (partial result)
  • Computation per GPU: $\frac{2B \times d \times d}{TP}$

Communication:

  • All-Reduce: Merge partial results from all GPUs
  • Communication volume: $B \times d$ elements (a factor of $S$ smaller than in Prefill)
  • Key issue: communication latency becomes the bottleneck
    • Prefill: large communication volume, but even larger computation, so communication can be hidden
    • Decode: computation is fast (memory-bound), so communication latency is exposed
  • Communication/Computation ratio, same as Prefill: $\frac{Bd}{Bd^2/TP} = \frac{TP}{d} \approx 0.1\%$
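
The same kind of per-layer, per-GPU sketch for MHA Decode makes the memory-bound nature explicit: FLOPs shrink by roughly $S$, but the whole KV Cache for the layer must still be read. This is illustrative only; the FP16 byte size is an assumption.

```python
def mha_decode_per_gpu(B: int, S: int, d: int, TP: int, bytes_per_elem: int = 2):
    flops = (6 * B * d * d + 4 * B * S * d + 2 * B * d * d) / TP   # QKV + attention + out-proj
    kv_read_bytes = 2 * B * S * (d / TP) * bytes_per_elem          # K and V of this layer
    allreduce_elems = B * d                                        # data to synchronize
    return flops, kv_read_bytes, allreduce_elems

flops, kv_bytes, comm = mha_decode_per_gpu(B=16, S=4096, d=7168, TP=8)
print(f"FLOPs: {flops:.3e}, KV bytes read: {kv_bytes:.3e}, All-Reduce elems: {comm:.3e}")
```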

MLA (Multi-head Latent Attention, DeepSeek V3)

Core Idea

Through low-rank joint compression, K and V are compressed into a low-dimensional latent space for storage, and Q is also compressed to reduce training activation memory. During inference, matrix absorption is used to avoid explicitly decompressing the large projection matrices, yielding substantial KV Cache and memory savings. This analysis simplifies those details; see the separate blog post for specifics.

DeepSeek V3 Key Parameters:

  • $d = 7168$ (Model Hidden Size)
  • $H = 128$ (number of attention heads; MLA keeps a true multi-head structure)
  • $d_v^h = 128$ (Per-head V Dimension)
  • $d_{qk}^h = 192$ (Per-head Q/K Dimension)
    • Where $d_{qk}^h = d_C^h + d_R^h = 128 + 64$
  • $d_C^h = 128$ (Q, K Non-RoPE head dimension)
  • $d_R^h = 64$ (Q, K RoPE head dimension, Decoupled)
  • $d^c_{kv} = 512$ (KV compression dimension, in latent space)
  • $d^c_q = 1536$ (Query compression dimension, in latent space)

Latent Vector

  • Position: In compressed dimension ($d^c_{kv} = 512$, $d^c_q = 1536$)
  • Form: One-dimensional vector, no explicit multi-head tensor shape
  • Storage: lives in the KV Cache, but because it is compressed and not in per-head form, every TP rank must hold a complete copy.
  • Example: $c_{KV}: (B, S, 512)$ carries the information of all 128 heads without exposing a head dimension

Attention Head Division and Parallelism

  • Position: During Up-Projection, attention heads reappear, at which point TP can be used to split matrices.

Prefill Phase

Step 1: Q Down-Projection (Query Compression, No Head Concept)

Computation: Project input to low-dimensional latent space

$$c^Q_t = h_t \times W^{DQ}$$

  • Input $h_t$: $(B, S, d) = (B, S, 7168)$
  • Weight $W^{DQ}$: $(d, d^c_q) = (7168, 1536)$
  • Output $c^Q_t$: $(B, S, d^c_q) = (B, S, 1536)$

Key Features:

  • This step has no “head” concept
  • $c^Q_t$ is a single vector per token: the information is compressed into 1536 dimensions, and each GPU holds a copy of this latent vector.

Computation per GPU: $2BS \times d \times d^c_q$

Step 2: Q Up-Projection and Multi-head Expansion

First: linear projection to expand the dimension

$$q_t^C = c^Q_t \times W^{UQ}$$

  • Input $c^Q_t$: $(B, S, 1536)$
  • Weight $W^{UQ}$: $(d^c_q, H \times d_{C}^h) = (1536, 128 \times 128) = (1536, 24576)$
  • Output $q_t^C$: $(B, S, 24576)$

Second: Reshape to multi-head structure

$$q_t^C = \text{reshape}(q_t^C, (B, S, H, d_C^h)) = (B, S, 128, 128)$$

  • Now has $H=128$ heads’ explicit structure
  • Each head’s Non-RoPE dimension: $d_C^h = 128$

Third: RoPE part

$$q^R_t = \text{RoPE}(c^Q_t \times W^{QR})$$

  • Input $c^Q_t$: $(B, S, 1536)$
  • Weight $W^{QR}$: $(d^c_q, H \times d_R^h) = (1536, 128 \times 64) = (1536, 8192)$
  • Output $q^R_t$: $(B, S, 8192)$ → reshape to $(B, S, 128, 64)$
  • Apply RoPE positional encoding

Fourth: Concatenate to complete Q

$$q_{t} = [q_t^C; q_t^R] = (B, S, 128, 192)$$

TP Split (column-wise split):

  • Input $c^Q_t$: $(B, S, d^c_q)$
  • Weights $W^{UQ}$: $(d^c_q, \frac{H}{TP} \times d_C^h)$, $W^{QR}$: $(d^c_q, \frac{H}{TP} \times d_R^h)$
  • Output $q_t$: $(B, S, \frac{H \times d_{qk}^h}{TP})$ → reshape to $(B, S, \frac{H}{TP}, d_{qk}^h)$

Computation per GPU: $\frac{2BS \times d^c_q \times H \times d_{qk}^h}{TP}$

Step 3: KV Down-Projection (K, V Joint Compression, No Head Concept)

Computation:

$$c^{KV}_t = h_t \times W^{DKV}$$

  • Input $h_t$: $(B, S, d) = (B, S, 7168)$
  • Weight $W^{DKV}$: $(d, d^c_{kv}) = (7168, 512)$
  • Output $c^{KV}_t$: $(B, S, d^c_{kv}) = (B, S, 512)$

Key Features:

  • This step also has no “head” concept
  • The K and V information of all 128 heads is compressed into a single 512-dimensional vector

KV Cache Storage:

  • Every TP rank must store a duplicate copy of $c^{KV}_t$: $(B, S, 512)$
  • Because there is no head structure, the cache can no longer be partitioned across TP ranks by head

Per GPU compared to standard MHA:

  • MHA KV Cache: $\frac{2 \times B \times S \times d }{TP} = \frac{2 \times B \times S \times 7168}{TP}$
  • MLA KV Cache: $B \times S \times d^c_{kv} = B \times S \times 512$
  • The larger the TP degree, the more KV Cache duplication, and the smaller MLA's advantage (see the sketch at the end of this step)

Computation per GPU: $2 \times B \times S \times d \times d^c_{kv}$
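
The sketch below compares the per-GPU, per-layer KV Cache size (in elements) of head-sharded MHA against the TP-replicated MLA latent, showing how MLA's advantage shrinks as the TP degree grows. Parameter values follow the text; this is an illustration, not a memory planner.

```python
def kv_cache_per_gpu(B: int, S: int, TP: int, d: int = 7168, d_c_kv: int = 512):
    mha = 2 * B * S * d / TP      # K and V, sharded across TP ranks by head
    mla = B * S * d_c_kv          # latent c_KV, replicated on every TP rank
    return mha, mla

for TP in (1, 4, 8, 16):
    mha, mla = kv_cache_per_gpu(B=16, S=4096, TP=TP)
    print(f"TP={TP:2d}  MHA/GPU={mha:.2e}  MLA/GPU={mla:.2e}  MHA/MLA={mha / mla:.1f}x")
```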

Step 4: K Up-Projection

First: linear projection to expand the dimension

$$k_t^C = c^{KV}_t \times W^{UK}$$

  • Input $c^{KV}_t$: $(B, S, 512)$
  • Weight $W^{UK}$: $(d^c_{kv}, H \times d_C^h) = (512, 128 \times 128) = (512, 16384)$
  • Output $k_t^C$: $(B, S, 16384)$

Second: Reshape to multi-head structure

$$k_t^C = \text{reshape}(k_t^C, (B, S, H, d_C^h)) = (B, S, 128, 128)$$

Third: RoPE part

$$k^R_t = \text{RoPE}(h_t \times W^{KR})$$

  • Input $h_t$: $(B, S, d)= (B, S, 7168)$
  • Weight $W^{KR}$: $(d, d_R^h) = (7168, 64)$
  • Output $k^R_t$: $(B, S, 64)$ (note only one head)

Fourth: Concatenate to complete K

$$k_t = [k_t^C; k^R_t] = (B, S, 128, 192)$$

TP Split (the RoPE step above is not split by TP):

  • Input $c^{KV}_t$: $(B, S, d^c_{kv})$
  • Weight $W^{UK}$: $(d^c_{kv}, \frac{H}{TP} \times d_C^h)$
  • Output $k_t^C$: $(B, S, \frac{H \times d_C^h}{TP})$

Computation per GPU: $\frac{2BS \times d^c_{kv} \times H \times d_C^h}{TP} + 2BS \times d \times d_R^h$

Step 5: V Up-Projection

$$v_{t} = c^{KV}_t \times W^{UV}$$

Per GPU computation:

  • Input $c^{KV}_t$: $(B, S, 512)$
  • Weight $W^{UV}$: $(d^c_{kv}, \frac{H}{TP} \times d_v^h) = (512, \frac{128}{TP} \times 128)$
  • Output $v_{t}$: $(B, S, \frac{128}{TP}, 128)$

Computation: $\frac{2BS \times d^c_{kv} \times H \times d_v^h}{TP}$

Step 6: Attention Computation (128 multi-heads)

Shape per GPU:

  • $Q$: $(B, S, \frac{H}{TP}, d_{qk}^h) = (B, S, \frac{128}{TP}, 192)$
  • $K$: $(B, S, \frac{H}{TP}, d_{qk}^h) = (B, S, \frac{128}{TP}, 192)$
  • $V$: $(B, S, \frac{H}{TP}, d_{v}^h) = (B, S, \frac{128}{TP}, 128)$

Computation flow (128 heads compute independently):

  1. Score computation: $$\text{Score} = \frac{Q \times K^T}{\sqrt{d_{qk}^h}}$$

    • Shape: $(B, \frac{H}{TP}, S, S)$
  2. Attention weights: $$\text{Weights} = \text{Softmax}(\text{Score})$$

    • Shape: $(B, \frac{H}{TP}, S, S)$
  3. Weighted sum: $$\text{Output} = \text{Weights} \times V$$

    • Output: $(B, S, \frac{H \times d_{v}^h}{TP})$

Computation:

  • Score: $\frac{H}{TP}\times2BS^2 \times d_{qk}^h$
  • Attention×V: $\frac{H}{TP}\times2BS^2 \times d_{v}^h$
  • Total: $\frac{H}{TP} \times 2BS^2 \times (d_{qk}^h + d_{v}^h)$

Step 7: Output Projection (Reduce back to hidden dimension)

Computation:

$$y_t = \text{output} \times W_O$$

  • Input (Attention output): $(B, S, H \times d_{v}^h) = (B, S, 16384)$
  • Weight $W_O$: $(H \times d_{v}^h, d) = (16384, 7168)$
  • Output $y_t$: $(B, S, d) = (B, S, 7168)$

TP Split (row-wise split):

  • Input: $(B, S, \frac{H \times d_{v}^h}{TP})$
  • Weight $W_O$: $(\frac{H \times d_{v}^h}{TP}, d)$
  • Output: $(B, S, d)$ (partial result)

Computation: $\frac{2BS \times H \times d_{v}^h \times d}{TP}$

Communication:

  • All-Reduce: $BS \times d = 7168 \times BS$ elements (same as MHA)

Prefill Phase Summary

| Step | Involves Heads | Computation (per GPU) | Description |
|---|---|---|---|
| Q Down | ❌ No | $2BS \times d \times d^c_q$ | Query compression, no head concept |
| Q Up | ✅ 128 | $\frac{2BS \times d^c_q \times H \times (d_C^h + d_R^h)}{TP}$ | Query decompression, multi-head expansion + RoPE |
| KV Down | ❌ No | $2BS \times d \times d^c_{kv}$ | KV joint compression, no head concept, KV Cache duplicated per TP rank |
| K Up | ✅ 128 | $\frac{2BS \times d^c_{kv} \times H \times d_C^h}{TP} + 2BS \times d \times d_R^h$ | K decompression, multi-head expansion + RoPE |
| V Up | ✅ 128 | $\frac{2BS \times d^c_{kv} \times H \times d_v^h}{TP}$ | V decompression into per-head form |
| Attention | ✅ 128 | $\frac{2BS^2 \times H \times (d_{qk}^h + d_v^h)}{TP}$ | Score and weighted sum |
| Output | ✅ 128→1 | $\frac{2BS \times H \times d_v^h \times d}{TP}$ | Multi-head merge, project back to hidden dim |

Decode Phase

Step 1: Q Down-Projection (Query Compression, No Head Concept)

Same as Prefill

Step 2: Q Up-Projection and Multi-head Expansion

First: linear projection to expand the dimension

$$q_t^C = c^Q_t \times W^{UQ}$$

  • Input $c^Q_t$: $(B, 1, 1536)$
  • Weight $W^{UQ}$: $(d^c_q, H \times d_{C}^h) = (1536, 128 \times 128) = (1536, 24576)$
  • Output $q_t^C$: $(B, 1, 24576)$

Second: Reshape to multi-head structure

$$q_t^C = \text{reshape}(q_t^C, (B, 1, H, d_C^h)) = (B, 1, 128, 128)$$

  • Now has $H=128$ heads’ explicit structure
  • Each head’s Non-RoPE dimension: $d_C^h = 128$

Third: Absorb matrix $W^{UK}$ into $q_t^C$

$$q_t^{C,\text{absorb}} = q_t^C \times {W^{UK}}^T$$

  • Input $q_t^C$: $(B, 1, 128, 128)$
  • Weight $W^{UK}$: $(d^c_{kv}, H, d_C^h) = (512, 128, 128)$
  • Output $q_t^{C,\text{absorb}}$: $(B, 1, H, d^c_{kv}) = (B, 1, 128, 512)$

Fourth: RoPE part

$$q^R_t = \text{RoPE}(c^Q_t \times W^{QR})$$

  • Input $c^Q_t$: $(B, 1, 1536)$
  • Weight $W^{QR}$: $(d^c_q, H \times d_R^h) = (1536, 128 \times 64) = (1536, 8192)$
  • Output $q^R_t$: $(B, 1, 8192)$ → reshape to $(B, 1, 128, 64)$
  • Apply RoPE positional encoding

Fifth: Concatenate to complete Q

$$q_{t} = [q_t^{C,\text{absorb}}; q_t^R] = (B, 1, 128, 576)$$

TP Split (column-wise split):

  • Input $c^Q_t$: $(B, 1, d^c_q)$
  • Weights $W^{UQ}$: $(d^c_q, \frac{H}{TP} \times d_C^h)$, $W^{UK}$: $(d^c_{kv}, \frac{H}{TP}\times d_C^h)$, $W^{QR}$: $(d^c_q, \frac{H}{TP} \times d_R^h)$
  • Output $q_t$: $(B, 1, \frac{H \times d_{qk}^h}{TP})$ → reshape to $(B, 1, \frac{H}{TP}, d_{qk}^h)$

Computation per GPU (ignoring the cost of applying RoPE itself): $\frac{2B \times d^c_q \times H \times d_{qk}^h + 2B \times H \times d_C^h \times d^c_{kv}}{TP}$
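
A Python/einsum sketch of the absorption step, with shapes taken from the text (an illustration, not DeepSeek's actual kernel): per head, $W^{UK}$ is folded into $q_t^C$ so that attention can later run directly against the latent $c^{KV}$.

```python
import numpy as np

B, H, d_C, d_c_kv = 2, 128, 128, 512
q_C = np.random.randn(B, 1, H, d_C)          # up-projected Q, reshaped to heads
W_UK = np.random.randn(d_c_kv, H, d_C)       # K up-projection weight

# Per head h: q_absorb[..., h, :] = q_C[..., h, :] @ W_UK[:, h, :].T
q_absorb = np.einsum("bthc,khc->bthk", q_C, W_UK)
print(q_absorb.shape)                        # (2, 1, 128, 512)
```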

Step 3: KV Down-Projection

First (same as in Prefill):

$$c^{KV}_t = h_t \times W^{DKV}$$

  • Input $h_t$: $(B, 1, d) = (B, 1, 7168)$
  • Weight $W^{DKV}$: $(d, d^c_{kv}) = (7168, 512)$
  • Output $c^{KV}_t$: $(B, 1, d^c_{kv}) = (B, 1, 512)$

The new token's $c^{KV}_t$ is appended to the KV Cache; the historical latent vectors are reused directly from the cache.

Second: RoPE part

$$k^R_t = \text{RoPE}(h_t \times W^{KR})$$

  • Input $h_t$: $(B, 1, d)= (B, 1, 7168)$
  • Weight $W^{KR}$: $(d, d_R^h) = (7168, 64)$
  • Output $k^R_t$: $(B, 1, 64)$ (note only one head)

Third: Concatenate to complete K

$$k_t = [c^{KV}_t; k^R_t] = (B, 1, 576)$$

TP cannot split this step

Computation per GPU: ${2B \times d \times (d^c_{kv} + d_R^h)}$

Step 4: Attention Computation (MQA)

Shape per GPU:

  • $Q$: $(B, 1, \frac{H}{TP}, d_{kv}^c+d^h_R) = (B, 1, \frac{128}{TP}, 576)$
  • $K$: $(B, S, d_{kv}^c+d^h_R) = (B, S, 576)$
  • $V$: $(B, S, d_{kv}^c) = (B, S, 512)$

Computation flow (128 heads compute independently):

  1. Score computation: $$\text{Score} = \frac{Q \times K^T}{\sqrt{d_{qk}^h}}$$

    • Shape: $(B, \frac{H}{TP}, 1, S)$
  2. Attention weights: $$\text{Weights} = \text{Softmax}(\text{Score})$$

    • Shape: $(B, \frac{H}{TP}, 1, S)$
  3. Weighted sum: $$\text{Output} = \text{Weights} \times V$$

    • V doesn’t need to be upscaled to multi-head form at this point
    • Output: $(B, \frac{H}{TP}, 1, d_{kv}^c)$

Computation per GPU:

  • Score: $\frac{H}{TP}\times2BS \times (d_{kv}^c+d^h_R)$
  • Attention×V: $\frac{H}{TP}\times2BS \times d_{kv}^c$
  • Total: $\frac{H}{TP}\times2BS \times (2d_{kv}^c+d^h_R)$
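
A Python sketch of the Decode-phase MQA computation over the shared latent KV (TP = 1 for brevity). Per the text, the stored latent is $[c^{KV}; k^R]$, V is its first 512 dimensions, and the score is scaled by $\sqrt{d_{qk}^h}$; the shapes and values here are illustrative only.

```python
import numpy as np

B, S, H = 2, 1024, 128
d_lat, d_rope, d_qk_h = 512, 64, 192
q = np.random.randn(B, 1, H, d_lat + d_rope)        # absorbed Q, 576 dims per head
kv_latent = np.random.randn(B, S, d_lat + d_rope)   # shared [c_KV; k_R] from the cache
v = kv_latent[..., :d_lat]                          # latent V = the 512-dim c_KV part

score = np.einsum("bthd,bsd->bhts", q, kv_latent) / np.sqrt(d_qk_h)  # (B, H, 1, S)
weights = np.exp(score - score.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)                       # softmax over S
out = np.einsum("bhts,bsd->bhtd", weights, v)                        # (B, H, 1, 512)
print(out.shape)
```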

Step 5: V Up-Projection applied to MQA result

$$Out_{up} = \text{Output} \times W^{UV}$$

Per GPU computation:

  • Input MQA Output from previous step: $(B, \frac{H}{TP}, 1, d_{kv}^c)=(B, \frac{H}{TP}, 1, 512)$
  • Weight $W^{UV}$: $(\frac{H}{TP}, d^c_{kv}, d_v^h) = (\frac{128}{TP} ,512, 128)$
  • Output $Out_{up}$: $(B, 1, \frac{128}{TP}, 128)$

Computation: $\frac{2B \times d^c_{kv} \times H \times d_v^h}{TP}$

Step 6: Output Projection (Reduce back to hidden dimension)

$$y_t = Out_{up} \times W_O$$

  • Input $Out_{up}$: $(B, 1, H \times d_{v}^h) = (B, 1, 16384)$
  • Weight $W_O$: $(H \times d_{v}^h, d) = (16384, 7168)$
  • Output $y_t$: $(B, 1, d) = (B, 1, 7168)$

TP Split:

  • Input: $(B, 1, \frac{H \times d_{v}^h}{TP})$
  • Weight $W_O$: $(\frac{H \times d_{v}^h}{TP}, d)$
  • Output: $(B, 1, d)$ (partial result)

Computation: $\frac{2B \times H \times d_{v}^h \times d}{TP}$

Communication:

  • All-Reduce: $B \times d = 7168 \times B$ elements

Decode Phase Summary

| Step | Involves Heads | Computation (per GPU; $B$: batch, $S$: history length) | Key Features |
|---|---|---|---|
| Step 1: Q Down | ❌ No | $2B \times d \times d^c_q$ | Query compression: $h_t$ is projected to the latent space, no head concept. |
| Step 2: Q Up | ✅ 128 | $\frac{2B \cdot d^c_q \cdot H \cdot d_C^h}{TP} + \text{AbsorbCost}^* + \frac{2B \cdot d^c_q \cdot H \cdot d_R^h}{TP}$ | Query decompression + absorbed K-Up: expand multi-head Q and absorb $W^{UK}$; generate the RoPE part and concatenate. Q now lives in the latent-KV space. |
| Step 3: KV Down | ❌ No | $2B \times d \times (d^c_{kv} + d_R^h)$ | KV generation (single head): only latent $c^{KV}$ and $k^R$ are produced; no multi-head expansion, stored directly in the KV Cache. |
| Step 4: Attention | ✅ 128 (Q) vs 1 (KV) | Score: $\frac{H}{TP} \times 2B \cdot S \times (d^c_{kv} + d_R^h)$; Agg: $\frac{H}{TP} \times 2B \cdot S \times d^c_{kv}$ | MQA (Multi-Query Attention): multi-head Q attends to the shared latent KV Cache; cost grows linearly with history length $S$. |
| Step 5: V Up | ✅ 128 | $\frac{2B \cdot d^c_{kv} \cdot H \cdot d_v^h}{TP}$ | V decompression (post-attention): only the attention output is up-projected, restoring the multi-head Value expressiveness. |
| Step 6: Output | ✅ 128→1 | $\frac{2B \cdot H \cdot d_v^h \cdot d}{TP}$ | Output projection: multi-head merge, project back to hidden dimension $d$; requires All-Reduce communication. |

  • $^*$AbsorbCost: the computational overhead of fusing $W^{UK}$ into Q, typically $\frac{2B \cdot H \cdot d_C^h \cdot d^c_{kv}}{TP}$, depending on the specific operator-fusion implementation.

| Dimension | Prefill Phase | Decode Phase | Core Difference |
|---|---|---|---|
| Attention Type | MHA (standard) | MQA (multi-query) | Decode runs MQA directly on the latent vector |
| KV Cache Content | $c^{KV}$ (latent) | $c^{KV}$ (latent) | Storage is identical; Decode reuses the Prefill-generated cache with no conversion. |
| KV Up-Proj Timing | Before attention | After attention (V); fused into Q (K) | Prefill explicitly restores all KV heads; Decode avoids restoring historical KV and only up-projects the result. |
| K Form | Multi-head $K$ $(B, S, H, d_{qk}^h)$ | Single-head latent $K$ $(B, S, 1, d^c_{kv}+d_R^h)$ | Prefill's K is physically multi-head; Decode's K is the latent vector plus RoPE, broadcast logically across heads. |
| Computational Complexity | $O(S^2)$ | $O(S)$ | Prefill can still use TP profitably; in Decode, TP communication is hard to hide and the replicated KV Cache is inefficient. |

MHA vs MLA Comparison Summary

  • MHA has independent heads, so under TP the only communication is the final All-Reduce after the output projection.
  • MLA's latent vector must be replicated on every TP rank, so TP is less efficient for it.
  • Because MLA alleviates the memory bottleneck, the impact of TP communication becomes relatively more prominent.
  • MLA inference practice commonly uses DP-Attention.
  • MLA’s Latent Vector can compress KV Cache size, advantageous for memory-intensive Decode.

TP Impact on FFN

Prefill and Decode Phases are Consistent

The only difference is that $S = 1$ in the Decode phase.

Step 1: First Layer Projection ($d \to d_{ff}$)

Computation:

  • Complete computation: $H = X \times W_1$

    • Input $X$: $(B, S, d)$
    • Weight $W_1$: $(d, d_{ff})$ (first FFN layer weight)
    • Output $H$: $(B, S, d_{ff})$
  • TP Split: Weight matrix column-wise split

  • Shape per GPU:

    • Input $X$: $(B, S, d)$ (broadcast to all GPUs)
    • Weight $W_1$: $(d, \frac{d_{ff}}{TP})$ (column split)
    • Output $H$: $(B, S, \frac{d_{ff}}{TP})$
  • Computation per GPU: $\frac{2BS \times d \times d_{ff}}{TP}$

Communication: None

Step 2: Activation Function

Computation: Each GPU computes independently (SwiGLU/GELU)

  • Shape per GPU: Input and output both $(B, S, \frac{d_{ff}}{TP})$
  • Computation: $O(BS \times \frac{d_{ff}}{TP})$ (relatively small)

Communication: None

Step 3: Second Layer Projection ($d_{ff} \to d$)

Computation:

  • Complete computation: $Y = \text{Activation}(H) \times W_2$

    • Input (after activation): $(B, S, d_{ff})$
    • Weight $W_2$: $(d_{ff}, d)$ (second FFN layer weight)
    • Output $Y$: $(B, S, d)$
  • TP Split: Weight matrix row-wise split

  • Shape per GPU:

    • Input: $(B, S, \frac{d_{ff}}{TP})$
    • Weight $W_2$: $(\frac{d_{ff}}{TP}, d)$ (row split)
    • Local output: $(B, S, \frac{d_{ff}}{TP}) \times (\frac{d_{ff}}{TP}, d) \to (B, S, d)$ (partial result)
  • Computation per GPU: $\frac{2BS \times d_{ff} \times d}{TP}$

Communication:

  • All-Reduce: Merge TP split results
  • Communication volume: $BS \times d$ elements (same as Attention)
    • The FFN output is added back to the residual stream just like the attention output, so the communication pattern is identical
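
A short per-layer, per-GPU sketch of the dense FFN under TP; the intermediate size $d_{ff}$ used below is a placeholder value, not one given in the text.

```python
def ffn_per_gpu(tokens: int, d: int, d_ff: int, TP: int):
    flops = (2 * tokens * d * d_ff + 2 * tokens * d_ff * d) / TP   # both projections
    allreduce_elems = tokens * d            # same shape as the attention All-Reduce
    return flops, allreduce_elems

B, S, d, d_ff, TP = 16, 2048, 7168, 18432, 8
print("Prefill:", ffn_per_gpu(tokens=B * S, d=d, d_ff=d_ff, TP=TP))
print("Decode :", ffn_per_gpu(tokens=B, d=d, d_ff=d_ff, TP=TP))
```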

Communication Overhead Comparison

Communication Volume Derivation Summary for Parallelism Strategies

Convention: In this section's tables, "data to synchronize" refers to the logical tensor size (i.e., the number of activation elements that must be synchronized). The actual transfer volume is the total data sent/received per GPU over physical links, obtained by multiplying by an algorithm-dependent constant factor.

| Parallelism Strategy | Communication Pattern | Data to Synchronize (Logical Size) | Actual Communication (per GPU, order of magnitude) | Key Derivation Logic |
|---|---|---|---|---|
| TP | All-Reduce | $N \times d$ | $\approx 2\frac{TP-1}{TP} \times N d$ | Row-wise weight split: each GPU computes a partial sum that Ring All-Reduce must accumulate. $N$ is the token count seen by the operator. |
| EP | All-to-All (combined with DP-Attention) | $N \times k \times d$ | $\approx 4 \times \frac{N k d}{EP} \times \frac{EP-1}{EP}$ | The MoE layer contains two All-to-Alls (Dispatch + Combine), each with logical volume $N \times k \times d$ (each token selects $k$ experts). The final $\frac{EP-1}{EP}$ factor is because no self-communication is needed. |
| PP | P2P Send/Recv | $N \times d$ | $\approx N \times d$ (unidirectional) | Point-to-point transfer at stage boundaries: the previous stage finishes computing the activations of $N$ tokens and sends them to the next stage. |
| DP | None | 0 | 0 | In pure inference, data-parallel replicas do not communicate. |

Key Variable Description:

  • Prefill Phase: $N \approx B \times S$ (Batch $\times$ Sequence Length)
  • Decode Phase: $N \approx B$ (Batch $\times$ 1)
  • Conclusion: in the Decode phase, the communication volume of every strategy drops by a factor of roughly $S$ compared to Prefill.
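
Plugging the table's per-GPU formulas in with illustrative values ($k$, $EP$, and $TP$ below are assumptions, not values from the text) shows the roughly $S$-fold gap between Prefill and Decode.

```python
def comm_per_gpu(N: int, d: int, TP: int = 8, EP: int = 64, k: int = 8):
    tp = 2 * (TP - 1) / TP * N * d                 # Ring All-Reduce
    ep = 4 * (N * k * d / EP) * (EP - 1) / EP      # Dispatch + Combine All-to-All
    pp = N * d                                     # one P2P hop at a stage boundary
    return tp, ep, pp

B, S, d = 16, 2048, 7168
for phase, N in (("Prefill", B * S), ("Decode", B)):
    tp, ep, pp = comm_per_gpu(N, d)
    print(f"{phase}: TP={tp:.2e}  EP={ep:.2e}  PP={pp:.2e} elements/layer")
```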

Prefill vs Decode TP / EP / PP Communication Comparison

The table below distinguishes the Prefill- and Decode-phase communication characteristics of each parallelism strategy. DP is ignored; the focus is on per-layer / per-operation overhead.

  • Notation: $B$ (Batch Size), $S$ (Seq Len), $d$ (Hidden Dim), $k$ (Top-k experts), $d_{act}$ (Activation Dim).

| Parallelism Strategy | Phase | Logical Communication Data Size (Single Layer) | Rough Communication (per GPU) | Characteristic Analysis |
|---|---|---|---|---|
| TP | Prefill | $BS \times d$ | $\approx 1.75 \times BS d$ (TP=8) | Large volume, but easily hidden by computation thanks to high arithmetic intensity. |
| TP | Decode | $B \times d$ | $\approx 1.75 \times B d$ | Small volume, but computation is also small; latency is directly exposed and limited by NVLink bandwidth and latency. |
| EP | Prefill | $BS \times k \times d$ | $\approx 4 \frac{BSkd}{EP}\times \frac{EP-1}{EP}$ | Cross-node All-to-All; the large Prefill data volume easily hits the cross-machine interconnect bandwidth bottleneck. |
| EP | Decode | $B \times k \times d$ | $\approx 4 \frac{Bkd}{EP}\times \frac{EP-1}{EP}$ | Small data volume, but All-to-All startup latency and network congestion control become the main bottleneck. |
| PP | Prefill | $BS \times d_{act}$ | $\approx BS d$ | Huge activation transfers make pipeline bubbles hard to fill and significantly increase time-to-first-token (TTFT). |
| PP | Decode | $B \times d_{act}$ | $\approx B d$ | Small communication; the main issue is latency accumulation from the pipeline's serial dependency. |

  • TP is simple to implement and supports MHA well, but it demands strong interconnect performance: every layer requires an All-Reduce.
  • EP targets MoE architectures. Its advantage is that a GPU does not need to load all experts, matching MoE's goal of reducing memory loading; DeepSeek commonly pairs it with DP-Attention. Every MoE layer needs All-to-All communication, and the attention layers still need their own parallelism strategy.
  • PP only communicates at layer split points, so its computation-to-communication ratio is higher; it suits scenarios with limited memory but weak interconnects.

Static vs Continuous Batching

In real production systems, the batch is no longer a static $(B, S)$ block but a dynamic stream of requests.

| Dimension | Static Batching (Simplified Model) | Continuous Batching (Real Model) | Core Difference |
|---|---|---|---|
| Definition | Assumes all requests in the batch are in the same state (all Prefill or all Decode). | Allows mixing within a batch: $B_p$ requests doing Prefill, $B_d$ doing Decode. | Higher resource utilization, but more scheduling complexity. |
| Total Token Count | Prefill: $T = B \times S$; Decode: $T = B$ | $T_{total} = \sum_{i=1}^{B_p} S_i + B_d$ | Communication is no longer determined by $B$, but by the current step's effective total token count $T_{total}$. |
| TP Communication | Fixed $BSd$ or $Bd$ per layer | Per layer $\propto T_{total} \times d$ | $T_{total}$ fluctuates sharply between steps; bandwidth demand is bursty. |
| EP Communication | Fixed $BSkd$ or $Bkd$ per layer | Per layer $\propto T_{total} \times k \times d$ | The EP All-to-All load fluctuates with $T_{total}$, and expert load may be imbalanced. |
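
A sketch of how per-step communication tracks the effective token count $T_{total}$ under continuous batching; the request lengths and parallelism degrees below are made up for illustration.

```python
prefill_lens = [4096, 512, 2048]   # S_i of the B_p prefill requests in this step
B_d = 48                           # decode requests, one token each
d, k, TP, EP = 7168, 8, 8, 64

T_total = sum(prefill_lens) + B_d
tp_elems = 2 * (TP - 1) / TP * T_total * d             # per-layer TP All-Reduce
ep_elems = 4 * (T_total * k * d / EP) * (EP - 1) / EP  # per-layer EP All-to-All
print(f"T_total={T_total}  TP={tp_elems:.2e}  EP={ep_elems:.2e} elements/layer")
```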

System Design Impact of Continuous Batching

Introducing continuous batching brings the following challenges to the design of the parallel communication system:

  1. Communication Bandwidth Dynamics:

    • Because $T_{total}$ depends on the Prefill lengths $S_i$ of incoming requests, the communication volume of a given step can surge suddenly.
    • The network must be provisioned for peak rather than average load, otherwise the arrival of a large Prefill causes severe head-of-line blocking.
  2. Scheduling Strategy Tradeoffs:

    • Prefill priority: maximizes $T_{total}$ and computation density (arithmetic intensity), hiding TP/EP communication latency. The downside is a sudden memory surge that may squeeze out Decode capacity.
    • Decode priority: guarantees low latency, but $T_{total}$ is small, computation cannot hide communication, and the GPU waits for communication to finish at low utilization.
  3. Memory (KV Cache) Management:

    • The KV Cache usage becomes $\text{Mem}_{KV} = \left(\sum_i S_{\text{curr},i}\right) \times d_{kv} \times L$, summing the current length of every in-flight request over the $L$ layers.
    • TP/PP must keep memory allocation consistent across all devices, while EP may distribute load unevenly across experts (but in MLA/MHA architectures where KV is shared, the KV Cache is typically not split by EP).

Chunked Prefill Impact on Communication and Pipeline

To prevent long-sequence Prefill from blocking Decode, Chunked Prefill is typically introduced: a long sequence $S_{total}$ is split into multiple chunks of $S_{chunk}$ tokens (e.g., 512) that are processed step by step.

A. Impact on TP / EP Communication

  • Total volume unchanged, but peaks are shaved (see the sketch below):
    • Without chunking: each communication moves $B \times S_{total} \times d$ elements; such huge packets can cause network congestion.
    • With chunking: each communication moves $B \times S_{chunk} \times d$ elements. The total bytes transferred are unchanged, but each transfer is smaller and easier to overlap with compute kernels.
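
A minimal arithmetic check of the "total unchanged, peak shaved" claim; the sequence and chunk sizes below are illustrative.

```python
B, S_total, S_chunk, d = 1, 8192, 512, 7168

num_chunks = S_total // S_chunk
per_transfer_no_chunk = B * S_total * d      # one huge transfer
per_transfer_chunked = B * S_chunk * d       # many small transfers
total_chunked = num_chunks * per_transfer_chunked

assert total_chunked == per_transfer_no_chunk
print(f"{num_chunks} chunks: {per_transfer_no_chunk:.2e} -> "
      f"{per_transfer_chunked:.2e} elements per transfer")
```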

B. Decisive Role on PP (Pipeline Parallelism)

  • Eliminating pipeline bubbles:
    • Without chunking: once a long Prefill request enters Stage 1, all subsequent Decode requests must wait for it to flow through the entire pipeline, causing large latency jitter.
    • With chunking: Prefill chunks can be interleaved with Decode requests in the pipeline.
    • Effect: the completion time (TTFT) of a single Prefill request does not drop much, but the system's average throughput and the P99 latency of Decode requests improve significantly.