Chapter 2: Isaac Gym
Learning Objectives
- Understand how Isaac Gym enables massively parallel GPU-accelerated reinforcement learning for robotics
- Set up Isaac Gym environments with thousands of simultaneous robot instances
- Design observation spaces, action spaces, and reward functions for robot control tasks
- Train and evaluate robot locomotion and manipulation policies using GPU-parallel RL
What Is Isaac Gym?
Isaac Gym is NVIDIA's GPU-accelerated physics simulation environment designed specifically for reinforcement learning (RL) research. Unlike traditional simulators that run physics on the CPU and rendering on the GPU, Isaac Gym runs everything on the GPU—physics computation, state observation, reward calculation, and policy inference all happen in GPU memory without transfers to CPU. This eliminates the CPU-GPU data transfer bottleneck that limits training speed in other simulators.
The result is staggering throughput: Isaac Gym can simulate tens of thousands of robot environments simultaneously on a single GPU, achieving billions of simulation steps per day. This scale makes it practical to train complex robot behaviors—humanoid locomotion, dexterous manipulation, whole-body control—that would take months with traditional CPU-based simulation.
Isaac Gym vs Traditional RL Pipelines
| Aspect | Traditional (Gym + MuJoCo) | Isaac Gym |
|---|---|---|
| Physics computation | CPU | GPU |
| Number of parallel environments | 8–64 (CPU cores) | 4,096–65,536 (GPU) |
| Observation transfer | GPU→CPU→GPU | GPU only |
| Steps per second (humanoid) | ~10K | ~1M+ |
| Time to 1B steps | Days–weeks | Hours |
| Hardware required | Multi-core CPU | Single RTX GPU |
Setting Up Isaac Gym Environments
Isaac Gym environments are defined as Python classes that configure the robot, task parameters, and observation/action spaces. The framework handles parallelization automatically—you define a single environment, and Isaac Gym replicates it across the GPU:
# env_config.py — Isaac Gym environment configuration builder
"""Build and validate Isaac Gym environment configuration."""
from dataclasses import dataclass, field
@dataclass
class IsaacGymEnvConfig:
"""Configuration for an Isaac Gym parallel environment."""
# Environment settings
env_name: str = "HumanoidLocomotion"
num_envs: int = 4096
env_spacing: float = 2.0 # meters between environment copies
# Physics settings
physics_engine: str = "physx" # physx or flex
dt: float = 1.0 / 60.0 # simulation timestep
substeps: int = 2 # physics substeps per step
num_threads: int = 4 # CPU threads for broad-phase
# Robot settings
robot_asset: str = "humanoid.xml"
num_dofs: int = 21
num_bodies: int = 13
# RL settings
obs_dim: int = 60 # observation vector size
action_dim: int = 21 # one action per DOF
episode_length: int = 1000 # max steps per episode
reward_scale: float = 1.0
# Randomization
randomize: bool = True
friction_range: tuple[float, float] = (0.5, 1.5)
mass_scale_range: tuple[float, float] = (0.8, 1.2)
push_force_range: tuple[float, float] = (0.0, 50.0)
def compute_effective_dt(self) -> float:
"""Compute the effective physics timestep including substeps."""
return self.dt / self.substeps
def estimate_steps_per_second(self, gpu_multiplier: float = 0.85) -> float:
"""Estimate training throughput in steps per second.
Assumes ~85% GPU utilization for physics + inference.
"""
wall_time_per_step = self.dt # real-time equivalent
parallel_factor = self.num_envs * gpu_multiplier
return parallel_factor / wall_time_per_step
def validate_env_config(config: IsaacGymEnvConfig) -> dict:
"""Validate Isaac Gym environment configuration."""
issues: list[str] = []
if config.num_envs < 1:
issues.append("num_envs must be >= 1")
if config.num_envs > 65536:
issues.append("num_envs above 65536 may exceed GPU memory")
if config.dt <= 0:
issues.append("dt must be positive")
if config.substeps < 1:
issues.append("substeps must be >= 1")
if config.obs_dim < config.num_dofs:
issues.append("obs_dim should be >= num_dofs (need at least joint positions)")
if config.action_dim != config.num_dofs:
issues.append(f"action_dim ({config.action_dim}) should match num_dofs ({config.num_dofs})")
if config.physics_engine not in ("physx", "flex"):
issues.append(f"Unknown physics engine: {config.physics_engine}")
effective_dt = config.compute_effective_dt()
estimated_sps = config.estimate_steps_per_second()
return {
"valid": len(issues) == 0,
"env_name": config.env_name,
"num_envs": config.num_envs,
"effective_dt": effective_dt,
"estimated_sps": estimated_sps,
"issues": issues,
}
if __name__ == "__main__":
config = IsaacGymEnvConfig(
env_name="HumanoidWalk",
num_envs=4096,
num_dofs=21,
obs_dim=60,
action_dim=21,
)
result = validate_env_config(config)
print(f"Environment: {result['env_name']}")
print(f"Parallel envs: {result['num_envs']}")
print(f"Effective dt: {result['effective_dt']:.6f} s")
print(f"Estimated throughput: {result['estimated_sps']:,.0f} steps/s")
if result["valid"]:
print("✓ Configuration valid")
else:
for issue in result["issues"]:
print(f" ✗ {issue}")
# Expected output:
# Environment: HumanoidWalk
# Parallel envs: 4096
# Effective dt: 0.008333 s
# Estimated throughput: 209,100 steps/s
# ✓ Configuration valid
Observation Space Design
Designing the observation vector is critical for successful RL training. For a humanoid locomotion task, a typical observation includes:
- Joint positions (21 DOFs) — current angles of each joint
- Joint velocities (21 DOFs) — angular velocities of each joint
- Base orientation (4 values) — quaternion of the torso
- Base angular velocity (3 values) — rotational velocity of the torso
- Base linear velocity (3 values) — translational velocity of the torso
- Gravity vector in body frame (3 values) — tells the policy which way is "down"
- Previous actions (21 values) — the last action taken (helps with smooth control)
This gives an observation dimension of about 76 values—compact enough for a small neural network but rich enough to capture the robot's full dynamic state.
Reward Function Engineering
The reward function drives what behavior the robot learns. For humanoid walking, a well-designed reward balances forward progress, energy efficiency, and stability:
# humanoid_reward.py — Reward computation for humanoid locomotion
"""Compute reward components for humanoid walking task in Isaac Gym."""
import math
def compute_locomotion_reward(
linear_velocity: tuple[float, float, float],
angular_velocity: tuple[float, float, float],
torso_height: float,
joint_torques: list[float],
target_velocity: float = 1.0,
target_height: float = 0.9,
alive_bonus: float = 2.0,
velocity_weight: float = 1.5,
energy_weight: float = 0.01,
height_weight: float = 1.0,
angular_penalty_weight: float = 0.1,
fall_threshold: float = 0.4,
) -> dict:
"""
Compute multi-component reward for humanoid walking.
Components:
- alive_bonus: positive reward for staying upright
- velocity_reward: reward for matching target forward velocity
- energy_penalty: penalize high joint torques (encourage efficiency)
- height_reward: reward for maintaining target torso height
- angular_penalty: penalize excessive torso rotation
"""
# Check if robot has fallen
fallen = torso_height < fall_threshold
if fallen:
return {
"total_reward": -10.0,
"alive_bonus": 0.0,
"velocity_reward": 0.0,
"energy_penalty": 0.0,
"height_reward": 0.0,
"angular_penalty": 0.0,
"fallen": True,
"done": True,
}
# Forward velocity reward (exponential for smooth gradient)
vx = linear_velocity[0]
velocity_error = abs(vx - target_velocity)
velocity_reward = velocity_weight * math.exp(-2.0 * velocity_error)
# Energy penalty (sum of squared torques)
energy = sum(t * t for t in joint_torques)
energy_penalty = -energy_weight * energy
# Height reward (keep torso at target height)
height_error = abs(torso_height - target_height)
height_reward = height_weight * math.exp(-5.0 * height_error)
# Angular velocity penalty (minimize torso spin)
angular_speed = math.sqrt(sum(w * w for w in angular_velocity))
angular_penalty = -angular_penalty_weight * angular_speed
total = alive_bonus + velocity_reward + energy_penalty + height_reward + angular_penalty
return {
"total_reward": total,
"alive_bonus": alive_bonus,
"velocity_reward": velocity_reward,
"energy_penalty": energy_penalty,
"height_reward": height_reward,
"angular_penalty": angular_penalty,
"fallen": False,
"done": False,
}
if __name__ == "__main__":
result = compute_locomotion_reward(
linear_velocity=(0.95, 0.01, 0.0),
angular_velocity=(0.1, 0.05, 0.02),
torso_height=0.88,
joint_torques=[5.0, 3.0, 2.0, 4.0, 1.0] * 4 + [0.5],
target_velocity=1.0,
)
print("Humanoid locomotion reward breakdown:")
for key, value in result.items():
if isinstance(value, float):
print(f" {key}: {value:.4f}")
else:
print(f" {key}: {value}")
# Expected output:
# total_reward: ~4.25
# alive_bonus: 2.0000
# velocity_reward: ~1.36
# energy_penalty: ~-2.50
# height_reward: ~0.90
# angular_penalty: ~-0.01
# fallen: False
# done: False
Domain Randomization
Domain randomization is a technique where simulation parameters are varied randomly during training so the resulting policy is robust to real-world variations. Isaac Gym supports randomizing:
- Physics parameters — friction, restitution, mass, center of mass
- Visual appearance — textures, colors, lighting (for vision-based policies)
- External disturbances — random pushes, varying ground slopes
- Observation noise — simulating sensor noise in joint encoders and IMU
By training across these variations, the policy learns to be robust to the inevitable differences between simulation and reality—the cornerstone of sim-to-real transfer.
Training at Scale
Isaac Gym pairs naturally with NVIDIA's rl_games library for PPO training:
# Train a humanoid walking policy with Isaac Gym + rl_games
python train.py task=Humanoid \
num_envs=4096 \
max_iterations=5000 \
headless=True
# Evaluate the trained policy with rendering
python train.py task=Humanoid \
num_envs=64 \
test=True \
checkpoint=runs/Humanoid/nn/Humanoid.pth
# Expected training metrics (after 5000 iterations):
# Mean reward: ~6000-8000
# Mean episode length: ~900-1000 steps
# Training time: ~2-4 hours on RTX 3090
Key Takeaways
- Isaac Gym runs physics, observations, rewards, and inference entirely on the GPU, eliminating CPU-GPU data transfer bottlenecks
- It can simulate tens of thousands of parallel environments on a single GPU, achieving million-step-per-second throughput
- Observation space design should capture the robot's full dynamic state: joint positions/velocities, base pose, gravity, and previous actions
- Multi-component reward functions balance task objectives (forward velocity) with constraints (energy, stability, smoothness)
- Domain randomization across physics and visual parameters is essential for policies that transfer from simulation to real robots