Computational and Communication Modeling of LLM Serving Systems

Notation and Basic Parameters

Basic Parameters

| Parameter | Symbol | Example | Meaning | Performance Impact |
|---|---|---|---|---|
| Batch Size | $B$ | 16 | Number of concurrent requests | $B$↑ → hardware utilization↑, KV usage↑ |
| Sequence Length | $S$ | 4096 | Number of tokens in the input | $S$↑ → computation↑ (Prefill), memory↑ (Decode) |
| Hidden Dimension | $d$ | 7168 | Transformer internal representation dimension | $d$↑ → computation ∝ $d^2$, memory ∝ $d$ |
| Number of Layers | $L$ | 61 | Number of Transformer blocks | $L$↑ → total params ∝ $L$, inference time ∝ $L$ |
| Attention Heads | $H$ | 128 | Number of multi-head attention heads | $H$↑ → TP upper limit↑ |

Parallelism Strategy Parameters

| Parameter | Symbol | Example | Meaning | Key Relationships |
|---|---|---|---|---|
| Tensor Parallelism | $TP$ | 8 | Split a single tensor | $H \bmod TP = 0$; each GPU processes $H/TP$ heads |
| Pipeline Parallelism | $PP$ | 4 | Split layers across GPUs | Each stage processes $L/PP$ layers; has bubble overhead |
| Data Parallelism | $DP$ | 2 | Split the batch | Each GPU processes $B/DP$ requests; replicas infer independently |
| Expert Parallelism | $EP$ | 64 | Split experts across GPUs | Each GPU stores $E/EP$ experts; requires All-to-All communication |
| Total GPUs | $N_{GPU}$ | 512 | Total GPUs in the cluster | $N_{GPU} = TP \times PP \times DP \times EP$ |
| Communication Group Size | $TP_{comm}$ | 8 | Number of GPUs in a single communication group | TP requires All-Reduce; EP requires All-to-All |
| Parameters per GPU | $P_{GPU}$ | ~50 GB | Parameters stored per GPU | $P_{GPU} \approx P_{total}/(TP \times PP \times EP)$; DP doesn't affect the per-GPU parameter count |

Continuous Batching Parameters (Real Serving Scenario)

| Parameter | Symbol | Meaning | Relationship to Simplified Model |
|---|---|---|---|
| Prefill Requests | $B_p$ | Number of Prefill requests | Changes dynamically |
| Decode Requests | $B_d$ | Number of Decode requests | Changes dynamically |
| Total Requests | $B = B_p + B_d$ | Total requests in the current batch | $B$ is fixed in the simplified model |
| Total Prefill Tokens | $T_p = \sum_{i=1}^{B_p} S_i$ | Total tokens across Prefill requests | Each request has a different sequence length $S_i$ |
| Decode Tokens | $T_d = B_d$ | Tokens generated by Decode requests | Each request generates 1 token |
| Total Tokens | $T_{total} = T_p + T_d$ | Actual total tokens processed | Communication ∝ $T_{total}$, not $B \times S$ |

Detailed Parallelism Strategies (Simplified Model)

Note: The following analysis is a simplified model, assuming all requests in a batch are in the same stage (pure Prefill or pure Decode). Real serving scenarios use continuous batching, detailed in later sections.

TP Impact on Attention

TP All-Reduce Communication Derivation

Complete Computation (No TP)

$$Y = X \times W$$

  • $X: (N, d_{in})$ - Input
  • $W: (d_{in}, d_{out})$ - Weight matrix
  • $Y: (N, d_{out})$ - Output

TP Partitioning (Row-wise Weight Split)

  1. Input Split: Each GPU holds part of input $(N, d_{in}/TP)$

  2. Row-wise Weight Split:

```text
Complete weight: W = (d_in, d_out)

Split into TP parts:
GPU i: W_i = (d_in/TP, d_out)   <- Rows i×(d_in/TP) to (i+1)×(d_in/TP)-1
```

  3. Local Computation: Each GPU computes: $Y_i = X_i \times W_i = (N, d_{in}/TP) \times (d_{in}/TP, d_{out}) \to (N, d_{out})$

    Note: Each GPU produces a $(N, d_{out})$ output, but these are partial results!

  4. Why Partial Results?

    The complete matrix multiplication is: $$Y = \begin{bmatrix} X_0 & X_1 & \cdots & X_{TP-1} \end{bmatrix} \times \begin{bmatrix} W_0 \\ W_1 \\ \vdots \\ W_{TP-1} \end{bmatrix} = \sum_{i=0}^{TP-1} X_i W_i$$

    Each GPU computes only one term, so an All-Reduce summation is required to obtain the final result.

  5. All-Reduce Communication:

    • Data size to synchronize: $(N \times d_{out})$ elements
    • Actual communication (Ring All-Reduce algorithm, per GPU): $$\text{Send} + \text{Receive} = 2 \times \frac{TP-1}{TP} \times N \times d_{out} \text{ elements}$$
    • When $TP=8$: Actually transfers $1.75 \times N \times d_{out}$ elements

General Rule

For all layers that use TP with a row-wise weight split (output projection, second FFN layer, etc.), the All-Reduce communication depends on the output shape: $$\text{To sync} = N \times d_{out}, \quad \text{Actual transfer (per GPU)} = 2 \times \frac{TP-1}{TP} \times N \times d_{out}$$

Simplified Convention

For brevity, "communication volume" in the following text refers to the data size to synchronize (i.e., the output size); the actual transfer is obtained by multiplying by the factor $2(TP-1)/TP$.
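
To make the derivation concrete, here is a minimal Python/numpy sketch (shapes are arbitrary, chosen only for illustration) that splits the input column-wise and the weight row-wise, verifies that the partial products sum to the full result, and evaluates the Ring All-Reduce traffic formula above.

```python
import numpy as np

TP, N, d_in, d_out = 8, 4, 16, 32
X = np.random.randn(N, d_in)
W = np.random.randn(d_in, d_out)

# Column-split the input and row-split the weight into TP shards.
X_shards = np.split(X, TP, axis=1)   # each (N, d_in/TP)
W_shards = np.split(W, TP, axis=0)   # each (d_in/TP, d_out)

# Each "GPU" produces a full-shaped (N, d_out) partial result.
partials = [x @ w for x, w in zip(X_shards, W_shards)]

# All-Reduce is an element-wise sum of the partial results.
Y = sum(partials)
assert np.allclose(Y, X @ W)

# Ring All-Reduce traffic per GPU: send + receive 2*(TP-1)/TP of the output.
to_sync = N * d_out
per_gpu = 2 * (TP - 1) / TP * to_sync
print(f"to sync: {to_sync} elements, per-GPU transfer: {per_gpu:.0f} elements")
```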

Essential Difference: Prefill vs Decode

Although both are matrix-matrix multiplication (when batch size $B > 1$), their properties are completely different:

| Dimension | Prefill | Decode |
|---|---|---|
| Input Shape | $(B, S, d)$, $S$ large (e.g., 2048) | $(B, 1, d)$, sequence length is 1 |
| Matrix Mult | $(B, S, d) \times (d, d)$, large matrix | $(B, d) \times (d, d)$, small matrix |
| Computation | $2BSd^2$ | $2Bd^2$ |
| Arithmetic Intensity | $\frac{2BSd^2}{2d^2} \approx BS$ (high) | $\frac{2Bd^2}{2d^2} \approx B$ (low) |
| Bottleneck | Compute-bound | Memory-bound |
| GPU Utilization | High (80-90%) | Low (10-30%) |
| Optimization | Increase parallelism (TP), compute optimization | Reduce memory access (quantization, KV Cache compression) |
| Communication Impact | Can be hidden by computation | Latency directly exposed |
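
The following Python sketch instantiates the arithmetic-intensity row of the table. The ridge point (FLOPs per element moved at which an accelerator flips from memory-bound to compute-bound) is a placeholder assumption, not a measured value.

```python
def gemm_arithmetic_intensity(tokens: int, d: int) -> float:
    """FLOPs per weight element read for a (tokens, d) x (d, d) GEMM."""
    flops = 2 * tokens * d * d      # multiply-accumulate count
    weight_elems = d * d            # dominant memory traffic when tokens << d
    return flops / weight_elems     # ~= tokens

B, S, d = 16, 2048, 7168
ridge_point = 300.0                 # hypothetical accelerator ridge point

for name, tokens in (("Prefill", B * S), ("Decode", B)):
    ai = gemm_arithmetic_intensity(tokens, d)
    bound = "compute-bound" if ai > ridge_point else "memory-bound"
    print(f"{name}: intensity ~ {ai:.0f} -> {bound}")
```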

Standard MHA (Multi-Head Attention)

Prefill Phase

Step 1: QKV Projection

Computation:

  • Complete computation: $Q, K, V = X \times W_Q, X \times W_K, X \times W_V$

    • Input $X: (B, S, d)$
    • Weights $W_Q, W_K, W_V: (d, d)$
    • Output $Q, K, V: (B, S, d)$ → reshape to $(B, S, H, d_h)$
  • TP Split: Each GPU processes $H/TP$ heads, weight matrices column-wise split

  • Shape per GPU:

    • Input $X$: $(B, S, d)$ (broadcast to all GPUs)
    • Weights $W_Q, W_K, W_V$: $(d, \frac{d}{TP})$ (column split)
    • Output $Q, K, V$: $(B, S, \frac{d}{TP})$ → reshape to $(B, S, \frac{H}{TP}, d_h)$
  • Computation per GPU: $3 \times \frac{2BS \times d \times d}{TP} = \frac{6BSd^2}{TP}$ (Q, K, V three projections)

  • KV Cache Usage:

    • Per GPU per layer: $2 \times B \times S \times \frac{d}{TP}$ ($\frac{H}{TP}$ heads, K and V each $B \times S \times d_h$)
    • Total across all GPUs: $2 \times B \times S \times d$

Communication: None (TP column split, each GPU computes independently)

Step 2: Attention Computation

Computation: Scaled Dot-Product Attention

  • Shape per GPU:

    • Input Q, K, V: $(B, S, \frac{H}{TP}, d_h)$ (each GPU processes $\frac{H}{TP}$ heads)
    • Merge batch and sequence dimensions: $(B \times S, \frac{H}{TP}, d_h)$ or view as each head processes $(B, S, d_h)$
  • Computation flow (each head independent):

    1. $\text{Score} = \frac{Q \times K^T}{\sqrt{d_h}}$: $(B, S, d_h) \times (B, d_h, S) \to (B, S, S)$ (attention score matrix)
    2. $\text{AttentionWeights} = \text{Softmax}(\text{Score})$: $(B, S, S)$ (softmax per row)
    3. $\text{Output} = \text{AttentionWeights} \times V$: $(B, S, S) \times (B, S, d_h) \to (B, S, d_h)$
  • Output merge:

    • Each GPU output: $(B, S, \frac{H}{TP}, d_h)$
    • After concat heads: $(B, S, \frac{d}{TP})$
      • Concatenate $\frac{H}{TP}$ heads’ dimension $d_h$: $\frac{H}{TP} \times d_h = \frac{d}{TP}$
  • Computation per GPU: $\frac{4BS^2 \times d}{TP}$ ($Q \times K^T$ and $\text{AttentionWeights} \times V$ contribute $\frac{2BS^2 d}{TP}$ each)

Communication: None (each GPU computes its own heads independently)

Step 3: Output Projection

Computation:

  • Complete computation: $Y = \text{Concat}(\text{head}_1, \ldots, \text{head}_H) \times W_O$

    • Input (concatenated attention output): $(B, S, d)$
    • Weight $W_O$: $(d, d)$
    • Output $Y$: $(B, S, d)$
  • TP Split: Weight matrix row-wise split

  • Shape per GPU:

    • Input: $(B, S, \frac{d}{TP})$ (each GPU’s attention output)
    • Weight $W_O$: $(\frac{d}{TP}, d)$ (row split)
    • Local output: $(B, S, \frac{d}{TP}) \times (\frac{d}{TP}, d) \to (B, S, d)$ (partial result)
  • Computation per GPU: $\frac{2BS \times d \times d}{TP}$

Communication:

  • All-Reduce: Merge partial results from all GPUs $(B, S, d)$
  • Communication volume: $B \times S \times d$ elements (per layer)
  • Large communication pressure when $S$ is large, but can be hidden by computation
  • Communication/Computation ratio: $\frac{BSd}{BSd^2/TP} = \frac{TP}{d} \approx 0.1\%$ (when $d=7168$, $TP=8$)
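
The sketch below is a rough, illustrative per-layer, per-GPU cost model for MHA Prefill under TP, assembled from the formulas above (element counts, not bytes); it is not the serving system's actual accounting.

```python
def mha_prefill_per_gpu(B: int, S: int, d: int, TP: int):
    qkv_flops = 6 * B * S * d * d / TP        # Q/K/V projections
    attn_flops = 4 * B * S * S * d / TP       # QK^T plus Weights x V
    out_flops = 2 * B * S * d * d / TP        # output projection
    allreduce_elems = B * S * d               # data to synchronize per layer
    comm_ratio = allreduce_elems / (B * S * d * d / TP)   # = TP / d
    return qkv_flops + attn_flops + out_flops, allreduce_elems, comm_ratio

flops, comm, ratio = mha_prefill_per_gpu(B=16, S=2048, d=7168, TP=8)
print(f"FLOPs/layer: {flops:.3e}, All-Reduce elems/layer: {comm:.3e}, ratio: {ratio:.4%}")
```
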
Decode Phase

Step 1: QKV Projection

Computation:

  • Complete computation: $Q, K, V = X_{new} \times W_Q, X_{new} \times W_K, X_{new} \times W_V$

    • Input $X_{new}$ (new token): $(B, d)$
    • Weights $W_Q, W_K, W_V$: $(d, d)$
    • Output $Q, K, V$: $(B, d)$ → reshape to $(B, H, d_h)$
    • K and V appended to KV Cache
  • TP Split: Weight matrices column-wise split

  • Shape per GPU:

    • Input $X_{new}$: $(B, d)$ (broadcast to all GPUs)
    • Weights $W_Q, W_K, W_V$: $(d, \frac{d}{TP})$ (column split)
    • Output $Q, K, V$: $(B, \frac{d}{TP})$ → reshape to $(B, \frac{H}{TP}, d_h)$
  • Computation: $3 \times \frac{2B \times d^2}{TP} = \frac{6Bd^2}{TP}$ (Q, K, V three projections)

  • KV Cache Usage ($S$ gradually growing):

    • Per GPU per layer: $2 \times B \times S \times \frac{d}{TP}$
    • Total across all GPUs: $2 \times B \times S \times d$

Communication: None

Step 2: Attention Computation

Computation: New token’s Q attends to all historical KV

  • Shape per GPU:

    • $Q$ (new token): $(B, \frac{H}{TP}, d_h)$ (sequence length is 1)
    • $K$ (history, from KV Cache): $(B, S, \frac{H}{TP}, d_h)$ ($S$ is history length, including new token)
    • $V$ (history, from KV Cache): $(B, S, \frac{H}{TP}, d_h)$
  • Computation flow (each head independent, each GPU processes $\frac{H}{TP}$ heads):

    1. $\text{Score} = \frac{Q \times K^T}{\sqrt{d_h}}$: $(B, 1, d_h) \times (B, d_h, S) \to (B, 1, S)$
    2. $\text{Weights} = \text{Softmax}(\text{Score})$: $(B, 1, S)$
    3. $\text{Output} = \text{Weights} \times V$: $(B, 1, S) \times (B, S, d_h) \to (B, 1, d_h)$
  • Output merge:

    • Each GPU output: $(B, \frac{H}{TP}, d_h)$
    • Concat heads dimension: $(B, \frac{d}{TP})$
  • Computation per GPU: $\frac{4B \times S \times d}{TP}$ ($S$ is the history length; $Q \times K^T$ and $\text{Weights} \times V$ contribute $\frac{2BSd}{TP}$ each)

Memory access (main bottleneck): Need to read all KV Cache

  • Memory access per GPU: $2 \times B \times S \times L \times \frac{d}{TP}$ (K and V for L layers)
  • Total across all GPUs: $2BSLd$

Communication: None

Step 3: Output Projection

Computation:

  • Complete computation: $Y = \text{Concat}(\text{Output heads}) \times W_O$

    • Input (concatenated attention output): $(B, d)$
    • Weight $W_O$: $(d, d)$
    • Output $Y$: $(B, d)$
  • TP Split: Weight matrix row-wise split

  • Shape per GPU:

    • Input: $(B, \frac{d}{TP})$ (each GPU’s attention output)
    • Weight $W_O$: $(\frac{d}{TP}, d)$ (row split)
    • Local output: $(B, \frac{d}{TP}) \times (\frac{d}{TP}, d) \to (B, d)$ (partial result)
  • Computation per GPU: $\frac{2B \times d \times d}{TP}$

Communication:

  • All-Reduce: Merge partial results from all GPUs
  • Communication volume: $B \times d$ elements (a factor of $S$ smaller than in Prefill)
  • Key issue: communication latency becomes the bottleneck
    • Prefill: large communication volume, but even larger computation, so communication can be hidden
    • Decode: computation is fast (memory-bound), so communication latency is exposed
  • Communication/Computation ratio, same as Prefill: $\frac{Bd}{Bd^2/TP} = \frac{TP}{d} \approx 0.1\%$
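
The same kind of per-layer, per-GPU sketch for MHA Decode makes the memory-bound nature explicit: FLOPs shrink by roughly $S$, but the whole KV Cache for the layer must still be read. This is illustrative only; the FP16 byte size is an assumption.

```python
def mha_decode_per_gpu(B: int, S: int, d: int, TP: int, bytes_per_elem: int = 2):
    flops = (6 * B * d * d + 4 * B * S * d + 2 * B * d * d) / TP   # QKV + attention + out-proj
    kv_read_bytes = 2 * B * S * (d / TP) * bytes_per_elem          # K and V of this layer
    allreduce_elems = B * d                                        # data to synchronize
    return flops, kv_read_bytes, allreduce_elems

flops, kv_bytes, comm = mha_decode_per_gpu(B=16, S=4096, d=7168, TP=8)
print(f"FLOPs: {flops:.3e}, KV bytes read: {kv_bytes:.3e}, All-Reduce elems: {comm:.3e}")
```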

MLA (Multi-head Latent Attention, DeepSeek V3)

Core Idea

Through low-rank joint compression, K and V are compressed into a low-dimensional latent space for storage, and Q is also compressed to reduce training activation memory. During inference, matrix absorption is used to avoid explicitly decompressing the large projection matrices, yielding substantial KV Cache and memory savings. This analysis simplifies those details; see the separate blog post for specifics.

DeepSeek V3 Key Parameters:

  • $d = 7168$ (Model Hidden Size)
  • $H = 128$ (number of attention heads; MLA keeps a true multi-head structure)
  • $d_v^h = 128$ (Per-head V Dimension)
  • $d_{qk}^h = 192$ (Per-head Q/K Dimension)
    • Where $d_{qk}^h = d_C^h + d_R^h = 128 + 64$
  • $d_C^h = 128$ (Q, K Non-RoPE head dimension)
  • $d_R^h = 64$ (Q, K RoPE head dimension, Decoupled)
  • $d^c_{kv} = 512$ (KV compression dimension, in latent space)
  • $d^c_q = 1536$ (Query compression dimension, in latent space)

Latent Vector

  • Position: In compressed dimension ($d^c_{kv} = 512$, $d^c_q = 1536$)
  • Form: One-dimensional vector, no explicit multi-head tensor shape
  • Storage: lives in the KV Cache, but because it is compressed and not in per-head form, every TP rank must hold a complete copy.
  • Example: $c_{KV}: (B, S, 512)$ carries the information of all 128 heads without exposing a head dimension

Attention Head Division and Parallelism

  • Position: During Up-Projection, attention heads reappear, at which point TP can be used to split matrices.

Prefill Phase

Step 1: Q Down-Projection (Query Compression, No Head Concept)

Computation: Project input to low-dimensional latent space

$$c^Q_t = h_t \times W^{DQ}$$

  • Input $h_t$: $(B, S, d) = (B, S, 7168)$
  • Weight $W^{DQ}$: $(d, d^c_q) = (7168, 1536)$
  • Output $c^Q_t$: $(B, S, d^c_q) = (B, S, 1536)$

Key Features:

  • This step has no “head” concept
  • $c^Q_t$ is a single vector per token: the information is compressed into 1536 dimensions, and each GPU holds a copy of this latent vector.

Computation per GPU: $2BS \times d \times d^c_q$

Step 2: Q Up-Projection and Multi-head Expansion

First: linear projection to expand the dimension

$$q_t^C = c^Q_t \times W^{UQ}$$

  • Input $c^Q_t$: $(B, S, 1536)$
  • Weight $W^{UQ}$: $(d^c_q, H \times d_{C}^h) = (1536, 128 \times 128) = (1536, 24576)$
  • Output $q_t^C$: $(B, S, 24576)$

Second: Reshape to multi-head structure

$$q_t^C = \text{reshape}(q_t^C, (B, S, H, d_C^h)) = (B, S, 128, 128)$$

  • Now has $H=128$ heads’ explicit structure
  • Each head’s Non-RoPE dimension: $d_C^h = 128$

Third: RoPE part

$$q^R_t = \text{RoPE}(c^Q_t \times W^{QR})$$

  • Input $c^Q_t$: $(B, S, 1536)$
  • Weight $W^{QR}$: $(d^c_q, H \times d_R^h) = (1536, 128 \times 64) = (1536, 8192)$
  • Output $q^R_t$: $(B, S, 8192)$ → reshape to $(B, S, 128, 64)$
  • Apply RoPE positional encoding

Fourth: Concatenate to complete Q

$$q_{t} = [q_t^C; q_t^R] = (B, S, 128, 192)$$

TP Split (column-wise split):

  • Input $c^Q_t$: $(B, S, d^c_q)$
  • Weights $W^{UQ}$: $(d^c_q, \frac{H}{TP} \times d_C^h)$, $W^{QR}$: $(d^c_q, \frac{H}{TP} \times d_R^h)$
  • Output $q_t$: $(B, S, \frac{H \times d_{qk}^h}{TP})$ → reshape to $(B, S, \frac{H}{TP}, d_{qk}^h)$

Computation per GPU: $\frac{2BS \times d^c_q \times H \times d_{qk}^h}{TP}$

Step 3: KV Down-Projection (K, V Joint Compression, No Head Concept)

Computation:

$$c^{KV}_t = h_t \times W^{DKV}$$

  • Input $h_t$: $(B, S, d) = (B, S, 7168)$
  • Weight $W^{DKV}$: $(d, d^c_{kv}) = (7168, 512)$
  • Output $c^{KV}_t$: $(B, S, d^c_{kv}) = (B, S, 512)$

Key Features:

  • This step also has no “head” concept
  • The K and V information of all 128 heads is compressed into a single 512-dimensional vector

KV Cache Storage:

  • Every TP rank must store a duplicate copy of $c^{KV}_t$: $(B, S, 512)$
  • Because there is no head structure, the cache can no longer be partitioned across TP ranks by head

Per GPU compared to standard MHA:

  • MHA KV Cache: $\frac{2 \times B \times S \times d }{TP} = \frac{2 \times B \times S \times 7168}{TP}$
  • MLA KV Cache: $B \times S \times d^c_{kv} = B \times S \times 512$
  • The larger the TP degree, the more KV Cache duplication, and the smaller MLA's advantage (see the sketch at the end of this step)

Computation per GPU: $2 \times B \times S \times d \times d^c_{kv}$
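
The sketch below compares the per-GPU, per-layer KV Cache size (in elements) of head-sharded MHA against the TP-replicated MLA latent, showing how MLA's advantage shrinks as the TP degree grows. Parameter values follow the text; this is an illustration, not a memory planner.

```python
def kv_cache_per_gpu(B: int, S: int, TP: int, d: int = 7168, d_c_kv: int = 512):
    mha = 2 * B * S * d / TP      # K and V, sharded across TP ranks by head
    mla = B * S * d_c_kv          # latent c_KV, replicated on every TP rank
    return mha, mla

for TP in (1, 4, 8, 16):
    mha, mla = kv_cache_per_gpu(B=16, S=4096, TP=TP)
    print(f"TP={TP:2d}  MHA/GPU={mha:.2e}  MLA/GPU={mla:.2e}  MHA/MLA={mha / mla:.1f}x")
```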

Step 4: K Up-Projection

First: linear projection to expand the dimension

$$k_t^C = c^{KV}_t \times W^{UK}$$

  • Input $c^{KV}_t$: $(B, S, 512)$
  • Weight $W^{UK}$: $(d^c_{kv}, H \times d_C^h) = (512, 128 \times 128) = (512, 16384)$
  • Output $k_t^C$: $(B, S, 16384)$

Second: Reshape to multi-head structure

$$k_t^C = \text{reshape}(k_t^C, (B, S, H, d_C^h)) = (B, S, 128, 128)$$

Third: RoPE part

$$k^R_t = \text{RoPE}(h_t \times W^{KR})$$

  • Input $h_t$: $(B, S, d)= (B, S, 7168)$
  • Weight $W^{KR}$: $(d, d_R^h) = (7168, 64)$
  • Output $k^R_t$: $(B, S, 64)$ (note only one head)

Fourth: Concatenate to complete K

$$k_t = [k_t^C; k^R_t] = (B, S, 128, 192)$$

TP Split (the RoPE step above is not split by TP):

  • Input $c^{KV}_t$: $(B, S, d^c_{kv})$
  • Weight $W^{UK}$: $(d^c_{kv}, \frac{H}{TP} \times d_C^h)$
  • Output $k_t^C$: $(B, S, \frac{H \times d_C^h}{TP})$

Computation per GPU: $\frac{2BS \times d^c_{kv} \times H \times d_C^h}{TP} + 2BS \times d \times d_R^h$

Step 5: V Up-Projection

$$v_{t} = c^{KV}_t \times W^{UV}$$

Per GPU computation:

  • Input $c^{KV}_t$: $(B, S, 512)$
  • Weight $W^{UV}$: $(d^c_{kv}, \frac{H}{TP} \times d_v^h) = (512, \frac{128}{TP} \times 128)$
  • Output $v_{t}$: $(B, S, \frac{128}{TP}, 128)$

Computation: $\frac{2BS \times d^c_{kv} \times H \times d_v^h}{TP}$

Step 6: Attention Computation (128 multi-heads)

Shape per GPU:

  • $Q$: $(B, S, \frac{H}{TP}, d_{qk}^h) = (B, S, \frac{128}{TP}, 192)$
  • $K$: $(B, S, \frac{H}{TP}, d_{qk}^h) = (B, S, \frac{128}{TP}, 192)$
  • $V$: $(B, S, \frac{H}{TP}, d_{v}^h) = (B, S, \frac{128}{TP}, 128)$

Computation flow (128 heads compute independently):

  1. Score computation: $$\text{Score} = \frac{Q \times K^T}{\sqrt{d_{qk}^h}}$$

    • Shape: $(B, \frac{H}{TP}, S, S)$
  2. Attention weights: $$\text{Weights} = \text{Softmax}(\text{Score})$$

    • Shape: $(B, \frac{H}{TP}, S, S)$
  3. Weighted sum: $$\text{Output} = \text{Weights} \times V$$

    • Output: $(B, S, \frac{H \times d_{v}^h}{TP})$

Computation:

  • Score: $\frac{H}{TP}\times2BS^2 \times d_{qk}^h$
  • Attention×V: $\frac{H}{TP}\times2BS^2 \times d_{v}^h$
  • Total: $\frac{H}{TP} \times 2BS^2 \times (d_{qk}^h + d_{v}^h)$

Step 7: Output Projection (Reduce back to hidden dimension)

Computation:

$$y_t = \text{output} \times W_O$$

  • Input (Attention output): $(B, S, H \times d_{v}^h) = (B, S, 16384)$
  • Weight $W_O$: $(H \times d_{v}^h, d) = (16384, 7168)$
  • Output $y_t$: $(B, S, d) = (B, S, 7168)$

TP Split (row-wise split):

  • Input: $(B, S, \frac{H \times d_{v}^h}{TP})$
  • Weight $W_O$: $(\frac{H \times d_{v}^h}{TP}, d)$
  • Output: $(B, S, d)$ (partial result)

Computation: $\frac{2BS \times H \times d_{v}^h \times d}{TP}$

Communication:

  • All-Reduce: $BS \times d = 7168 \times BS$ elements (same as MHA)

Prefill Phase Summary

| Step | Involves Heads | Computation (per GPU) | Description |
|---|---|---|---|
| Q Down | ❌ No | $2BS \times d \times d^c_q$ | Query compression, no head concept |
| Q Up | ✅ 128 | $\frac{2BS \times d^c_q \times H \times (d_C^h + d_R^h)}{TP}$ | Query decompression, multi-head expansion + RoPE |
| KV Down | ❌ No | $2BS \times d \times d^c_{kv}$ | KV joint compression, no head concept, KV Cache duplicated per TP rank |
| K Up | ✅ 128 | $\frac{2BS \times d^c_{kv} \times H \times d_C^h}{TP} + 2BS \times d \times d_R^h$ | K decompression, multi-head expansion + RoPE |
| V Up | ✅ 128 | $\frac{2BS \times d^c_{kv} \times H \times d_v^h}{TP}$ | V decompression into per-head form |
| Attention | ✅ 128 | $\frac{2BS^2 \times H \times (d_{qk}^h + d_v^h)}{TP}$ | Score and weighted sum |
| Output | ✅ 128→1 | $\frac{2BS \times H \times d_v^h \times d}{TP}$ | Multi-head merge, project back to hidden dim |

Decode Phase

Step 1: Q Down-Projection (Query Compression, No Head Concept)

Same as Prefill

Step 2: Q Up-Projection and Multi-head Expansion

First: linear projection to expand the dimension

$$q_t^C = c^Q_t \times W^{UQ}$$

  • Input $c^Q_t$: $(B, 1, 1536)$
  • Weight $W^{UQ}$: $(d^c_q, H \times d_{C}^h) = (1536, 128 \times 128) = (1536, 24576)$
  • Output $q_t^C$: $(B, 1, 24576)$

Second: Reshape to multi-head structure

$$q_t^C = \text{reshape}(q_t^C, (B, 1, H, d_C^h)) = (B, 1, 128, 128)$$

  • Now has $H=128$ heads’ explicit structure
  • Each head’s Non-RoPE dimension: $d_C^h = 128$

Third: Absorb matrix $W^{UK}$ into $q_t^C$

$$q_t^{C,\text{absorb}} = q_t^C \times {W^{UK}}^T$$

  • Input $q_t^C$: $(B, 1, 128, 128)$
  • Weight $W^{UK}$: $(d^c_{kv}, H, d_C^h) = (512, 128, 128)$
  • Output $q_t^{C,\text{absorb}}$: $(B, 1, H, d^c_{kv}) = (B, 1, 128, 512)$

Fourth: RoPE part

$$q^R_t = \text{RoPE}(c^Q_t \times W^{QR})$$

  • Input $c^Q_t$: $(B, 1, 1536)$
  • Weight $W^{QR}$: $(d^c_q, H \times d_R^h) = (1536, 128 \times 64) = (1536, 8192)$
  • Output $q^R_t$: $(B, 1, 8192)$ → reshape to $(B, 1, 128, 64)$
  • Apply RoPE positional encoding

Fifth: Concatenate to complete Q

$$q_{t} = [q_t^{C,\text{absorb}}; q_t^R] = (B, 1, 128, 576)$$

TP Split (column-wise split):

  • Input $c^Q_t$: $(B, 1, d^c_q)$
  • Weights $W^{UQ}$: $(d^c_q, \frac{H}{TP} \times d_C^h)$, $W^{UK}$: $(d^c_{kv}, \frac{H}{TP}\times d_C^h)$, $W^{QR}$: $(d^c_q, \frac{H}{TP} \times d_R^h)$
  • Output $q_t$: $(B, 1, \frac{H \times d_{qk}^h}{TP})$ → reshape to $(B, 1, \frac{H}{TP}, d_{qk}^h)$

Computation per GPU (ignoring the cost of applying RoPE itself): $\frac{2B \times d^c_q \times H \times d_{qk}^h + 2B \times H \times d_C^h \times d^c_{kv}}{TP}$
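
A Python/einsum sketch of the absorption step, with shapes taken from the text (an illustration, not DeepSeek's actual kernel): per head, $W^{UK}$ is folded into $q_t^C$ so that attention can later run directly against the latent $c^{KV}$.

```python
import numpy as np

B, H, d_C, d_c_kv = 2, 128, 128, 512
q_C = np.random.randn(B, 1, H, d_C)          # up-projected Q, reshaped to heads
W_UK = np.random.randn(d_c_kv, H, d_C)       # K up-projection weight

# Per head h: q_absorb[..., h, :] = q_C[..., h, :] @ W_UK[:, h, :].T
q_absorb = np.einsum("bthc,khc->bthk", q_C, W_UK)
print(q_absorb.shape)                        # (2, 1, 128, 512)
```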

Step 3: KV Down-Projection

First (same as in Prefill):

$$c^{KV}_t = h_t \times W^{DKV}$$

  • Input $h_t$: $(B, 1, d) = (B, 1, 7168)$
  • Weight $W^{DKV}$: $(d, d^c_{kv}) = (7168, 512)$
  • Output $c^{KV}_t$: $(B, 1, d^c_{kv}) = (B, 1, 512)$

The new token's $c^{KV}_t$ is appended to the KV Cache; the historical latent vectors are reused directly from the cache.

Second: RoPE part

$$k^R_t = \text{RoPE}(h_t \times W^{KR})$$

  • Input $h_t$: $(B, 1, d)= (B, 1, 7168)$
  • Weight $W^{KR}$: $(d, d_R^h) = (7168, 64)$
  • Output $k^R_t$: $(B, 1, 64)$ (note only one head)

Third: Concatenate to complete K

$$k_t = [c^{KV}_t; k^R_t] = (B, 1, 576)$$

TP cannot split this step

Computation per GPU: ${2B \times d \times (d^c_{kv} + d_R^h)}$

Step 4: Attention Computation (MQA)

Shape per GPU:

  • $Q$: $(B, 1, \frac{H}{TP}, d_{kv}^c+d^h_R) = (B, 1, \frac{128}{TP}, 576)$
  • $K$: $(B, S, d_{kv}^c+d^h_R) = (B, S, 576)$
  • $V$: $(B, S, d_{kv}^c) = (B, S, 512)$

Computation flow (128 heads compute independently):

  1. Score computation: $$\text{Score} = \frac{Q \times K^T}{\sqrt{d_{qk}^h}}$$

    • Shape: $(B, \frac{H}{TP}, 1, S)$
  2. Attention weights: $$\text{Weights} = \text{Softmax}(\text{Score})$$

    • Shape: $(B, \frac{H}{TP}, 1, S)$
  3. Weighted sum: $$\text{Output} = \text{Weights} \times V$$

    • V doesn’t need to be upscaled to multi-head form at this point
    • Output: $(B, \frac{H}{TP}, 1, d_{kv}^c)$

Computation per GPU:

  • Score: $\frac{H}{TP}\times2BS \times (d_{kv}^c+d^h_R)$
  • Attention×V: $\frac{H}{TP}\times2BS \times d_{kv}^c$
  • Total: $\frac{H}{TP}\times2BS \times (2d_{kv}^c+d^h_R)$
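
A Python sketch of the Decode-phase MQA computation over the shared latent KV (TP = 1 for brevity). Per the text, the stored latent is $[c^{KV}; k^R]$, V is its first 512 dimensions, and the score is scaled by $\sqrt{d_{qk}^h}$; the shapes and values here are illustrative only.

```python
import numpy as np

B, S, H = 2, 1024, 128
d_lat, d_rope, d_qk_h = 512, 64, 192
q = np.random.randn(B, 1, H, d_lat + d_rope)        # absorbed Q, 576 dims per head
kv_latent = np.random.randn(B, S, d_lat + d_rope)   # shared [c_KV; k_R] from the cache
v = kv_latent[..., :d_lat]                          # latent V = the 512-dim c_KV part

score = np.einsum("bthd,bsd->bhts", q, kv_latent) / np.sqrt(d_qk_h)  # (B, H, 1, S)
weights = np.exp(score - score.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)                       # softmax over S
out = np.einsum("bhts,bsd->bhtd", weights, v)                        # (B, H, 1, 512)
print(out.shape)
```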

Step 5: V Up-Projection applied to MQA result

$$Out_{up} = \text{Output} \times W^{UV}$$

Per GPU computation:

  • Input MQA Output from previous step: $(B, \frac{H}{TP}, 1, d_{kv}^c)=(B, \frac{H}{TP}, 1, 512)$
  • Weight $W^{UV}$: $(\frac{H}{TP}, d^c_{kv}, d_v^h) = (\frac{128}{TP} ,512, 128)$
  • Output $Out_{up}$: $(B, 1, \frac{128}{TP}, 128)$

Computation: $\frac{2B \times d^c_{kv} \times H \times d_v^h}{TP}$

Step 6: Output Projection (Reduce back to hidden dimension)

$$y_t = Out_{up} \times W_O$$

  • Input $Out_{up}$: $(B, 1, H \times d_{v}^h) = (B, 1, 16384)$
  • Weight $W_O$: $(H \times d_{v}^h, d) = (16384, 7168)$
  • Output $y_t$: $(B, 1, d) = (B, 1, 7168)$

TP Split:

  • Input: $(B, 1, \frac{H \times d_{v}^h}{TP})$
  • Weight $W_O$: $(\frac{H \times d_{v}^h}{TP}, d)$
  • Output: $(B, 1, d)$ (partial result)

Computation: $\frac{2B \times H \times d_{v}^h \times d}{TP}$

Communication:

  • All-Reduce: $B \times d = 7168 \times B$ elements

Decode Phase Summary

| Step | Involves Heads | Computation (per GPU; $B$: batch, $S$: history length) | Key Features |
|---|---|---|---|
| Step 1: Q Down | ❌ No | $2B \times d \times d^c_q$ | Query compression: $h_t$ is projected to the latent space, no head concept. |
| Step 2: Q Up | ✅ 128 | $\frac{2B \cdot d^c_q \cdot H \cdot d_C^h}{TP} + \text{AbsorbCost}^* + \frac{2B \cdot d^c_q \cdot H \cdot d_R^h}{TP}$ | Query decompression + absorbed K-Up: expand multi-head Q and absorb $W^{UK}$; generate the RoPE part and concatenate. Q now lives in the latent-KV space. |
| Step 3: KV Down | ❌ No | $2B \times d \times (d^c_{kv} + d_R^h)$ | KV generation (single head): only latent $c^{KV}$ and $k^R$ are produced; no multi-head expansion, stored directly in the KV Cache. |
| Step 4: Attention | ✅ 128 (Q) vs 1 (KV) | Score: $\frac{H}{TP} \times 2B \cdot S \times (d^c_{kv} + d_R^h)$; Agg: $\frac{H}{TP} \times 2B \cdot S \times d^c_{kv}$ | MQA (Multi-Query Attention): multi-head Q attends to the shared latent KV Cache; cost grows linearly with history length $S$. |
| Step 5: V Up | ✅ 128 | $\frac{2B \cdot d^c_{kv} \cdot H \cdot d_v^h}{TP}$ | V decompression (post-attention): only the attention output is up-projected, restoring the multi-head Value expressiveness. |
| Step 6: Output | ✅ 128→1 | $\frac{2B \cdot H \cdot d_v^h \cdot d}{TP}$ | Output projection: multi-head merge, project back to hidden dimension $d$; requires All-Reduce communication. |

  • $^*$AbsorbCost: the computational overhead of fusing $W^{UK}$ into Q, typically $\frac{2B \cdot H \cdot d_C^h \cdot d^c_{kv}}{TP}$, depending on the specific operator-fusion implementation.

| Dimension | Prefill Phase | Decode Phase | Core Difference |
|---|---|---|---|
| Attention Type | MHA (standard) | MQA (multi-query) | Decode runs MQA directly on the latent vector |
| KV Cache Content | $c^{KV}$ (latent) | $c^{KV}$ (latent) | Storage is identical; Decode reuses the Prefill-generated cache with no conversion. |
| KV Up-Proj Timing | Before attention | After attention (V); fused into Q (K) | Prefill explicitly restores all KV heads; Decode avoids restoring historical KV and only up-projects the result. |
| K Form | Multi-head $K$ $(B, S, H, d_{qk}^h)$ | Single-head latent $K$ $(B, S, 1, d^c_{kv}+d_R^h)$ | Prefill's K is physically multi-head; Decode's K is the latent vector plus RoPE, broadcast logically across heads. |
| Computational Complexity | $O(S^2)$ | $O(S)$ | Prefill can still use TP profitably; in Decode, TP communication is hard to hide and the replicated KV Cache is inefficient. |

MHA vs MLA Comparison Summary

  • MHA has independent heads, so under TP the only communication is the final All-Reduce after the output projection.
  • MLA's latent vector must be replicated on every TP rank, so TP is less efficient for it.
  • Because MLA alleviates the memory bottleneck, the impact of TP communication becomes relatively more prominent.
  • MLA inference practice commonly uses DP-Attention.
  • MLA’s Latent Vector can compress KV Cache size, advantageous for memory-intensive Decode.

TP Impact on FFN

Prefill and Decode Phases are Consistent

The only difference is that $S = 1$ in the Decode phase.

Step 1: First Layer Projection ($d \to d_{ff}$)

Computation:

  • Complete computation: $H = X \times W_1$

    • Input $X$: $(B, S, d)$
    • Weight $W_1$: $(d, d_{ff})$ (first FFN layer weight)
    • Output $H$: $(B, S, d_{ff})$
  • TP Split: Weight matrix column-wise split

  • Shape per GPU:

    • Input $X$: $(B, S, d)$ (broadcast to all GPUs)
    • Weight $W_1$: $(d, \frac{d_{ff}}{TP})$ (column split)
    • Output $H$: $(B, S, \frac{d_{ff}}{TP})$
  • Computation per GPU: $\frac{2BS \times d \times d_{ff}}{TP}$

Communication: None

Step 2: Activation Function

Computation: Each GPU computes independently (SwiGLU/GELU)

  • Shape per GPU: Input and output both $(B, S, \frac{d_{ff}}{TP})$
  • Computation: $O(BS \times \frac{d_{ff}}{TP})$ (relatively small)

Communication: None

Step 3: Second Layer Projection ($d_{ff} \to d$)

Computation:

  • Complete computation: $Y = \text{Activation}(H) \times W_2$

    • Input (after activation): $(B, S, d_{ff})$
    • Weight $W_2$: $(d_{ff}, d)$ (second FFN layer weight)
    • Output $Y$: $(B, S, d)$
  • TP Split: Weight matrix row-wise split

  • Shape per GPU:

    • Input: $(B, S, \frac{d_{ff}}{TP})$
    • Weight $W_2$: $(\frac{d_{ff}}{TP}, d)$ (row split)
    • Local output: $(B, S, \frac{d_{ff}}{TP}) \times (\frac{d_{ff}}{TP}, d) \to (B, S, d)$ (partial result)
  • Computation per GPU: $\frac{2BS \times d_{ff} \times d}{TP}$

Communication:

  • All-Reduce: Merge TP split results
  • Communication volume: $BS \times d$ elements (same as Attention)
    • The FFN output is added back to the residual stream just like the attention output, so the communication pattern is identical
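
A short per-layer, per-GPU sketch of the dense FFN under TP; the intermediate size $d_{ff}$ used below is a placeholder value, not one given in the text.

```python
def ffn_per_gpu(tokens: int, d: int, d_ff: int, TP: int):
    flops = (2 * tokens * d * d_ff + 2 * tokens * d_ff * d) / TP   # both projections
    allreduce_elems = tokens * d            # same shape as the attention All-Reduce
    return flops, allreduce_elems

B, S, d, d_ff, TP = 16, 2048, 7168, 18432, 8
print("Prefill:", ffn_per_gpu(tokens=B * S, d=d, d_ff=d_ff, TP=TP))
print("Decode :", ffn_per_gpu(tokens=B, d=d, d_ff=d_ff, TP=TP))
```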

Communication Overhead Comparison

Communication Volume Derivation Summary for Parallelism Strategies

Convention: In this section's tables, "data to synchronize" refers to the logical tensor size (i.e., the number of activation elements that must be synchronized). The actual transfer volume is the total data sent/received per GPU over physical links, obtained by multiplying by an algorithm-dependent constant factor.

| Parallelism Strategy | Communication Pattern | Data to Synchronize (Logical Size) | Actual Communication (per GPU, order of magnitude) | Key Derivation Logic |
|---|---|---|---|---|
| TP | All-Reduce | $N \times d$ | $\approx 2\frac{TP-1}{TP} \times N d$ | Row-wise weight split: each GPU computes a partial sum that Ring All-Reduce must accumulate. $N$ is the token count seen by the operator. |
| EP | All-to-All (combined with DP-Attention) | $N \times k \times d$ | $\approx 4 \times \frac{N k d}{EP} \times \frac{EP-1}{EP}$ | The MoE layer contains two All-to-Alls (Dispatch + Combine), each with logical volume $N \times k \times d$ (each token selects $k$ experts). The final $\frac{EP-1}{EP}$ factor is because no self-communication is needed. |
| PP | P2P Send/Recv | $N \times d$ | $\approx N \times d$ (unidirectional) | Point-to-point transfer at stage boundaries: the previous stage finishes computing the activations of $N$ tokens and sends them to the next stage. |
| DP | None | 0 | 0 | In pure inference, data-parallel replicas do not communicate. |

Key Variable Description:

  • Prefill Phase: $N \approx B \times S$ (Batch $\times$ Sequence Length)
  • Decode Phase: $N \approx B$ (Batch $\times$ 1)
  • Conclusion: in the Decode phase, the communication volume of every strategy drops by a factor of roughly $S$ compared to Prefill.
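
Plugging the table's per-GPU formulas in with illustrative values ($k$, $EP$, and $TP$ below are assumptions, not values from the text) shows the roughly $S$-fold gap between Prefill and Decode.

```python
def comm_per_gpu(N: int, d: int, TP: int = 8, EP: int = 64, k: int = 8):
    tp = 2 * (TP - 1) / TP * N * d                 # Ring All-Reduce
    ep = 4 * (N * k * d / EP) * (EP - 1) / EP      # Dispatch + Combine All-to-All
    pp = N * d                                     # one P2P hop at a stage boundary
    return tp, ep, pp

B, S, d = 16, 2048, 7168
for phase, N in (("Prefill", B * S), ("Decode", B)):
    tp, ep, pp = comm_per_gpu(N, d)
    print(f"{phase}: TP={tp:.2e}  EP={ep:.2e}  PP={pp:.2e} elements/layer")
```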

Prefill vs Decode TP / EP / PP Communication Comparison

The table below distinguishes the Prefill- and Decode-phase communication characteristics of each parallelism strategy. DP is ignored; the focus is on per-layer / per-operation overhead.

  • Notation: $B$ (Batch Size), $S$ (Seq Len), $d$ (Hidden Dim), $k$ (Top-k experts), $d_{act}$ (Activation Dim).

| Parallelism Strategy | Phase | Logical Communication Data Size (Single Layer) | Rough Communication (per GPU) | Characteristic Analysis |
|---|---|---|---|---|
| TP | Prefill | $BS \times d$ | $\approx 1.75 \times BS d$ (TP=8) | Large volume, but easily hidden by computation thanks to high arithmetic intensity. |
| TP | Decode | $B \times d$ | $\approx 1.75 \times B d$ | Small volume, but computation is also small; latency is directly exposed and limited by NVLink bandwidth and latency. |
| EP | Prefill | $BS \times k \times d$ | $\approx 4 \frac{BSkd}{EP}\times \frac{EP-1}{EP}$ | Cross-node All-to-All; the large Prefill data volume easily hits the cross-machine interconnect bandwidth bottleneck. |
| EP | Decode | $B \times k \times d$ | $\approx 4 \frac{Bkd}{EP}\times \frac{EP-1}{EP}$ | Small data volume, but All-to-All startup latency and network congestion control become the main bottleneck. |
| PP | Prefill | $BS \times d_{act}$ | $\approx BS d$ | Huge activation transfers make pipeline bubbles hard to fill and significantly increase time-to-first-token (TTFT). |
| PP | Decode | $B \times d_{act}$ | $\approx B d$ | Small communication; the main issue is latency accumulation from the pipeline's serial dependency. |

  • TP is simple to implement and supports MHA well, but it demands strong interconnect performance: every layer requires an All-Reduce.
  • EP targets MoE architectures. Its advantage is that a GPU does not need to load all experts, matching MoE's goal of reducing memory loading; DeepSeek commonly pairs it with DP-Attention. Every MoE layer needs All-to-All communication, and the attention layers still need their own parallelism strategy.
  • PP only communicates at layer split points, so its computation-to-communication ratio is higher; it suits scenarios with limited memory but weak interconnects.

Static vs Continuous Batching

In real production systems, the batch is no longer a static $(B, S)$ block but a dynamic stream of requests.

| Dimension | Static Batching (Simplified Model) | Continuous Batching (Real Model) | Core Difference |
|---|---|---|---|
| Definition | Assumes all requests in the batch are in the same state (all Prefill or all Decode). | Allows mixing within a batch: $B_p$ requests doing Prefill, $B_d$ doing Decode. | Higher resource utilization, but more scheduling complexity. |
| Total Token Count | Prefill: $T = B \times S$; Decode: $T = B$ | $T_{total} = \sum_{i=1}^{B_p} S_i + B_d$ | Communication is no longer determined by $B$, but by the current step's effective total token count $T_{total}$. |
| TP Communication | Fixed $BSd$ or $Bd$ per layer | Per layer $\propto T_{total} \times d$ | $T_{total}$ fluctuates sharply between steps; bandwidth demand is bursty. |
| EP Communication | Fixed $BSkd$ or $Bkd$ per layer | Per layer $\propto T_{total} \times k \times d$ | The EP All-to-All load fluctuates with $T_{total}$, and expert load may be imbalanced. |
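
A sketch of how per-step communication tracks the effective token count $T_{total}$ under continuous batching; the request lengths and parallelism degrees below are made up for illustration.

```python
prefill_lens = [4096, 512, 2048]   # S_i of the B_p prefill requests in this step
B_d = 48                           # decode requests, one token each
d, k, TP, EP = 7168, 8, 8, 64

T_total = sum(prefill_lens) + B_d
tp_elems = 2 * (TP - 1) / TP * T_total * d             # per-layer TP All-Reduce
ep_elems = 4 * (T_total * k * d / EP) * (EP - 1) / EP  # per-layer EP All-to-All
print(f"T_total={T_total}  TP={tp_elems:.2e}  EP={ep_elems:.2e} elements/layer")
```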

System Design Impact of Continuous Batching

Introducing continuous batching brings the following challenges to the design of the parallel communication system:

  1. Communication Bandwidth Dynamics:

    • Because $T_{total}$ depends on the Prefill lengths $S_i$ of incoming requests, the communication volume of a given step can surge suddenly.
    • The network must be provisioned for peak rather than average load, otherwise the arrival of a large Prefill causes severe head-of-line blocking.
  2. Scheduling Strategy Tradeoffs:

    • Prefill priority: maximizes $T_{total}$ and computation density (arithmetic intensity), hiding TP/EP communication latency. The downside is a sudden memory surge that may squeeze out Decode capacity.
    • Decode priority: guarantees low latency, but $T_{total}$ is small, computation cannot hide communication, and the GPU waits for communication to finish at low utilization.
  3. Memory (KV Cache) Management:

    • The KV Cache usage becomes $\text{Mem}_{KV} = \left(\sum_i S_{\text{curr},i}\right) \times d_{kv} \times L$, summing the current length of every in-flight request over the $L$ layers.
    • TP/PP must keep memory allocation consistent across all devices, while EP may distribute load unevenly across experts (but in MLA/MHA architectures where KV is shared, the KV Cache is typically not split by EP).

Chunked Prefill Impact on Communication and Pipeline

To prevent long-sequence Prefill from blocking Decode, Chunked Prefill is typically introduced: a long sequence $S_{total}$ is split into multiple chunks of $S_{chunk}$ tokens (e.g., 512) that are processed step by step.

A. Impact on TP / EP Communication

  • Total volume unchanged, but peaks are shaved (see the sketch below):
    • Without chunking: each communication moves $B \times S_{total} \times d$ elements; such huge packets can cause network congestion.
    • With chunking: each communication moves $B \times S_{chunk} \times d$ elements. The total bytes transferred are unchanged, but each transfer is smaller and easier to overlap with compute kernels.
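
A minimal arithmetic check of the "total unchanged, peak shaved" claim; the sequence and chunk sizes below are illustrative.

```python
B, S_total, S_chunk, d = 1, 8192, 512, 7168

num_chunks = S_total // S_chunk
per_transfer_no_chunk = B * S_total * d      # one huge transfer
per_transfer_chunked = B * S_chunk * d       # many small transfers
total_chunked = num_chunks * per_transfer_chunked

assert total_chunked == per_transfer_no_chunk
print(f"{num_chunks} chunks: {per_transfer_no_chunk:.2e} -> "
      f"{per_transfer_chunked:.2e} elements per transfer")
```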

B. Decisive Role on PP (Pipeline Parallelism)

  • Eliminating pipeline bubbles:
    • Without chunking: once a long Prefill request enters Stage 1, all subsequent Decode requests must wait for it to flow through the entire pipeline, causing large latency jitter.
    • With chunking: Prefill chunks can be interleaved with Decode requests in the pipeline.
    • Effect: the completion time (TTFT) of a single Prefill request does not drop much, but the system's average throughput and the P99 latency of Decode requests improve significantly.