Chapter 3: Action Chunking
Learning Objectives
- Understand the action chunking paradigm and why predicting action sequences outperforms single-step predictions
- Learn how Action Chunking with Transformers (ACT) and Diffusion Policy generate smooth robot trajectories
- Implement action chunk generation and temporal ensembling for robust execution
- Compare action chunking with traditional single-step behavior cloning
The Single-Step Problem
In traditional behavior cloning, a policy predicts one action at a time: given the current observation, output the next joint position or end-effector target. This frame-by-frame approach has several fundamental problems:
- Compounding errors — small prediction errors accumulate over hundreds of steps, causing the robot to drift far from the demonstrated trajectory
- Multi-modality — when multiple valid actions exist (go left or right around an obstacle), averaging them produces an invalid action (go straight into the obstacle)
- Temporal inconsistency — consecutive predictions may conflict, causing jerky, oscillating motion
Action chunking solves these problems by predicting an entire sequence of future actions at once—a "chunk" of 10 to 100 timesteps. This gives the model a planning horizon and ensures temporal consistency within each chunk.
How Action Chunking Works
Instead of predicting a single action a_t, an action-chunking policy predicts a sequence:
π(o_t) → [a_t, a_t+1, a_t+2, ..., a_t+H-1]
where H is the chunk size (typically 10–100 steps at 10–50 Hz control frequency). The robot executes the first few actions from the chunk, then re-predicts a new chunk from the updated observation. This overlapping execution creates smooth, robust behavior.
Temporal Ensembling
When chunks overlap, multiple predictions exist for the same future timestep. Temporal ensembling combines these predictions using exponentially weighted averaging:
# temporal_ensemble.py — Temporal ensembling for overlapping action chunks
"""Implement temporal ensembling to combine overlapping action chunk predictions."""
import math
def exponential_weights(num_predictions: int, decay: float = 0.01) -> list[float]:
"""Generate exponentially decaying weights for temporal ensembling.
More recent predictions get higher weight.
Args:
num_predictions: Number of overlapping predictions for a timestep
decay: Exponential decay rate (smaller = more uniform weighting)
Returns:
Normalized weight list (most recent prediction last, highest weight)
"""
raw_weights = [math.exp(-decay * (num_predictions - 1 - i))
for i in range(num_predictions)]
total = sum(raw_weights)
return [w / total for w in raw_weights]
def temporal_ensemble(
action_chunks: list[list[list[float]]],
chunk_starts: list[int],
target_timestep: int,
decay: float = 0.01,
) -> list[float]:
"""
Combine overlapping action chunk predictions for a single timestep.
Args:
action_chunks: List of chunks, each is a list of action vectors
chunk_starts: Starting timestep for each chunk
target_timestep: The timestep to compute the ensembled action for
decay: Exponential decay rate for weighting
Returns:
Ensembled action vector for the target timestep
"""
# Collect all predictions that cover the target timestep
predictions: list[list[float]] = []
for chunk, start in zip(action_chunks, chunk_starts):
offset = target_timestep - start
if 0 <= offset < len(chunk):
predictions.append(chunk[offset])
if not predictions:
raise ValueError(f"No predictions cover timestep {target_timestep}")
# Weight more recent predictions higher
weights = exponential_weights(len(predictions), decay)
# Weighted average across action dimensions
action_dim = len(predictions[0])
ensembled = [0.0] * action_dim
for pred, weight in zip(predictions, weights):
for d in range(action_dim):
ensembled[d] += pred[d] * weight
return ensembled
if __name__ == "__main__":
# Simulate 3 overlapping chunks (chunk size=4, action dim=2)
chunks = [
[[1.0, 0.0], [1.1, 0.1], [1.2, 0.2], [1.3, 0.3]], # chunk 0 (t=0..3)
[[1.05, 0.08], [1.15, 0.18], [1.25, 0.28], [1.35, 0.38]], # chunk 1 (t=1..4)
[[1.12, 0.16], [1.22, 0.26], [1.32, 0.36], [1.42, 0.46]], # chunk 2 (t=2..5)
]
starts = [0, 1, 2]
print("Temporal Ensembling Results:")
for t in range(6):
try:
action = temporal_ensemble(chunks, starts, t)
coverage = sum(1 for c, s in zip(chunks, starts) if 0 <= t - s < len(c))
print(f" t={t}: action={[round(a, 3) for a in action]} "
f"(from {coverage} overlapping chunks)")
except ValueError:
print(f" t={t}: no coverage")
# Expected output:
# t=0: action=[1.0, 0.0] (from 1 overlapping chunks)
# t=1: action=[~1.07, ~0.09] (from 2 overlapping chunks)
# t=2: action=[~1.15, ~0.18] (from 3 overlapping chunks)
# t=3: action=[~1.28, ~0.28] (from 3 overlapping chunks)
# t=4: action=[~1.32, ~0.35] (from 2 overlapping chunks)
# t=5: action=[1.42, 0.46] (from 1 overlapping chunks)
ACT: Action Chunking with Transformers
ACT (Action Chunking with Transformers) is a foundational algorithm from Tony Zhao et al. (2023) that combines a CVAE (Conditional Variational Autoencoder) with a Transformer decoder to predict action chunks from visual observations:
Architecture
- Visual encoder (ResNet-18) extracts features from camera images
- CVAE encoder (training only) encodes the expert action sequence into a style variable z
- Transformer decoder takes visual features + style variable and autoregressively generates the action chunk
- At inference, z is sampled from the prior (standard Gaussian), providing diversity in multi-modal scenarios
Why ACT Works
- The chunk prediction eliminates compounding errors within the chunk horizon
- The CVAE style variable handles multi-modality (multiple valid ways to do a task)
- The Transformer decoder captures long-range temporal dependencies within the chunk
- Temporal ensembling during execution smooths transitions between chunks
# act_config.py — Configure ACT model hyperparameters
"""Configuration for Action Chunking with Transformers (ACT)."""
from dataclasses import dataclass
@dataclass
class ACTConfig:
"""Hyperparameters for ACT model training and inference."""
# Action space
action_dim: int = 7 # 6-DOF pose + gripper
chunk_size: int = 50 # Number of future actions per prediction
control_freq_hz: int = 50
# Architecture
hidden_dim: int = 512
num_encoder_layers: int = 4
num_decoder_layers: int = 7
num_heads: int = 8
feedforward_dim: int = 3200
dropout: float = 0.1
# CVAE
latent_dim: int = 32
kl_weight: float = 10.0 # Weight of KL divergence loss
# Visual encoder
backbone: str = "resnet18"
num_cameras: int = 2 # e.g., wrist + overhead
image_size: tuple[int, int] = (480, 640)
# Training
lr: float = 1e-5
weight_decay: float = 1e-4
batch_size: int = 8
num_epochs: int = 3000
# Temporal ensembling at inference
ensemble_decay: float = 0.01
execution_length: int = 10 # Execute this many steps before re-predicting
def chunk_duration_seconds(self) -> float:
"""Duration of one action chunk in seconds."""
return self.chunk_size / self.control_freq_hz
def total_params_estimate(self) -> int:
"""Rough estimate of model parameters."""
transformer_params = (
self.num_encoder_layers * 4 * self.hidden_dim * self.hidden_dim
+ self.num_decoder_layers * 4 * self.hidden_dim * self.hidden_dim
)
backbone_params = 11_000_000 # ResNet-18 baseline
return transformer_params + backbone_params
if __name__ == "__main__":
config = ACTConfig(
chunk_size=50,
num_cameras=2,
action_dim=7,
)
print(f"ACT Configuration:")
print(f" Chunk size: {config.chunk_size} steps "
f"({config.chunk_duration_seconds():.1f}s at {config.control_freq_hz}Hz)")
print(f" Action dim: {config.action_dim}")
print(f" Cameras: {config.num_cameras}")
print(f" CVAE latent dim: {config.latent_dim}")
print(f" Transformer: {config.num_encoder_layers}enc + "
f"{config.num_decoder_layers}dec layers, "
f"{config.hidden_dim}D, {config.num_heads} heads")
print(f" ~{config.total_params_estimate():,} parameters")
print(f" Ensemble decay: {config.ensemble_decay}")
# Expected output:
# ACT Configuration:
# Chunk size: 50 steps (1.0s at 50Hz)
# Action dim: 7
# Cameras: 2
# CVAE latent dim: 32
# Transformer: 4enc + 7dec layers, 512D, 8 heads
# ~35,651,584 parameters
# Ensemble decay: 0.01
Diffusion Policy: Action Chunking via Denoising
An alternative approach to action chunking uses diffusion models—the same generative models behind Stable Diffusion and DALL-E, but generating action trajectories instead of images. Diffusion Policy (Chi et al., 2023) iteratively denoises a random noise vector into a coherent action sequence:
- Start with random noise of shape
(chunk_size, action_dim) - Condition on visual observations
- Run K denoising steps (typically 10–100)
- Output the denoised action chunk
Diffusion Policy excels at multi-modal tasks where multiple equally-valid action sequences exist—it naturally represents this diversity in its denoising distribution.
Choosing Chunk Size
The chunk size H is a critical hyperparameter:
| Chunk Size | Pros | Cons |
|---|---|---|
| Small (5–10) | Reactive, adapts quickly | More compounding errors, jerky motion |
| Medium (20–50) | Good balance of planning and reactivity | Standard choice for most tasks |
| Large (50–100) | Very smooth trajectories, long-horizon planning | Slow to react to perturbations |
For contact-rich manipulation (assembly, insertion), medium chunks (20–50 steps) are standard. For free-space motion (reaching, pick-and-place), larger chunks (50–100) enable smoother trajectories.
Key Takeaways
- Action chunking predicts sequences of future actions rather than single actions, eliminating compounding errors and ensuring temporal consistency
- Temporal ensembling smoothly combines overlapping chunk predictions using exponentially weighted averaging
- ACT uses a CVAE + Transformer to generate diverse action chunks, handling multi-modal demonstrations effectively
- Diffusion Policy applies denoising diffusion to action trajectory generation, excelling at multi-modal tasks
- Chunk size is a key design choice: larger chunks give smoother motion but slower reactivity to perturbations