Chapter 2: Isaac Gym

Learning Objectives

Understand how Isaac Gym enables massively parallel GPU-accelerated reinforcement learning for robotics
Set up Isaac Gym environments with thousands of simultaneous robot instances
Design observation spaces, action spaces, and reward functions for robot control tasks
Train and evaluate robot locomotion and manipulation policies using GPU-parallel RL

What Is Isaac Gym?

Isaac Gym is NVIDIA's GPU-accelerated physics simulation environment designed specifically for reinforcement learning (RL) research. Unlike traditional simulators that run physics on the CPU and rendering on the GPU, Isaac Gym runs everything on the GPU—physics computation, state observation, reward calculation, and policy inference all happen in GPU memory without transfers to CPU. This eliminates the CPU-GPU data transfer bottleneck that limits training speed in other simulators.

The result is staggering throughput: Isaac Gym can simulate tens of thousands of robot environments simultaneously on a single GPU, achieving billions of simulation steps per day. This scale makes it practical to train complex robot behaviors—humanoid locomotion, dexterous manipulation, whole-body control—that would take months with traditional CPU-based simulation.

Isaac Gym vs Traditional RL Pipelines

Aspect	Traditional (Gym + MuJoCo)	Isaac Gym
Physics computation	CPU	GPU
Number of parallel environments	8–64 (CPU cores)	4,096–65,536 (GPU)
Observation transfer	GPU→CPU→GPU	GPU only
Steps per second (humanoid)	~10K	~1M+
Time to 1B steps	Days–weeks	Hours
Hardware required	Multi-core CPU	Single RTX GPU

Setting Up Isaac Gym Environments

Isaac Gym environments are defined as Python classes that configure the robot, task parameters, and observation/action spaces. The framework handles parallelization automatically—you define a single environment, and Isaac Gym replicates it across the GPU:

# env_config.py — Isaac Gym environment configuration builder
"""Build and validate Isaac Gym environment configuration."""
from dataclasses import dataclass, field


@dataclass
class IsaacGymEnvConfig:
    """Configuration for an Isaac Gym parallel environment."""

    # Environment settings
    env_name: str = "HumanoidLocomotion"
    num_envs: int = 4096
    env_spacing: float = 2.0  # meters between environment copies

    # Physics settings
    physics_engine: str = "physx"  # physx or flex
    dt: float = 1.0 / 60.0  # simulation timestep
    substeps: int = 2  # physics substeps per step
    num_threads: int = 4  # CPU threads for broad-phase

    # Robot settings
    robot_asset: str = "humanoid.xml"
    num_dofs: int = 21
    num_bodies: int = 13

    # RL settings
    obs_dim: int = 60  # observation vector size
    action_dim: int = 21  # one action per DOF
    episode_length: int = 1000  # max steps per episode
    reward_scale: float = 1.0

    # Randomization
    randomize: bool = True
    friction_range: tuple[float, float] = (0.5, 1.5)
    mass_scale_range: tuple[float, float] = (0.8, 1.2)
    push_force_range: tuple[float, float] = (0.0, 50.0)

    def compute_effective_dt(self) -> float:
        """Compute the effective physics timestep including substeps."""
        return self.dt / self.substeps

    def estimate_steps_per_second(self, gpu_multiplier: float = 0.85) -> float:
        """Estimate training throughput in steps per second.

        Assumes ~85% GPU utilization for physics + inference.
        """
        wall_time_per_step = self.dt  # real-time equivalent
        parallel_factor = self.num_envs * gpu_multiplier
        return parallel_factor / wall_time_per_step


def validate_env_config(config: IsaacGymEnvConfig) -> dict:
    """Validate Isaac Gym environment configuration."""
    issues: list[str] = []

    if config.num_envs < 1:
        issues.append("num_envs must be >= 1")
    if config.num_envs > 65536:
        issues.append("num_envs above 65536 may exceed GPU memory")
    if config.dt <= 0:
        issues.append("dt must be positive")
    if config.substeps < 1:
        issues.append("substeps must be >= 1")
    if config.obs_dim < config.num_dofs:
        issues.append("obs_dim should be >= num_dofs (need at least joint positions)")
    if config.action_dim != config.num_dofs:
        issues.append(f"action_dim ({config.action_dim}) should match num_dofs ({config.num_dofs})")
    if config.physics_engine not in ("physx", "flex"):
        issues.append(f"Unknown physics engine: {config.physics_engine}")

    effective_dt = config.compute_effective_dt()
    estimated_sps = config.estimate_steps_per_second()

    return {
        "valid": len(issues) == 0,
        "env_name": config.env_name,
        "num_envs": config.num_envs,
        "effective_dt": effective_dt,
        "estimated_sps": estimated_sps,
        "issues": issues,
    }


if __name__ == "__main__":
    config = IsaacGymEnvConfig(
        env_name="HumanoidWalk",
        num_envs=4096,
        num_dofs=21,
        obs_dim=60,
        action_dim=21,
    )
    result = validate_env_config(config)
    print(f"Environment: {result['env_name']}")
    print(f"Parallel envs: {result['num_envs']}")
    print(f"Effective dt: {result['effective_dt']:.6f} s")
    print(f"Estimated throughput: {result['estimated_sps']:,.0f} steps/s")
    if result["valid"]:
        print("✓ Configuration valid")
    else:
        for issue in result["issues"]:
            print(f"  ✗ {issue}")
    # Expected output:
    #   Environment: HumanoidWalk
    #   Parallel envs: 4096
    #   Effective dt: 0.008333 s
    #   Estimated throughput: 209,100 steps/s
    #   ✓ Configuration valid

Observation Space Design

Designing the observation vector is critical for successful RL training. For a humanoid locomotion task, a typical observation includes:

Joint positions (21 DOFs) — current angles of each joint
Joint velocities (21 DOFs) — angular velocities of each joint
Base orientation (4 values) — quaternion of the torso
Base angular velocity (3 values) — rotational velocity of the torso
Base linear velocity (3 values) — translational velocity of the torso
Gravity vector in body frame (3 values) — tells the policy which way is "down"
Previous actions (21 values) — the last action taken (helps with smooth control)

This gives an observation dimension of about 76 values—compact enough for a small neural network but rich enough to capture the robot's full dynamic state.

Reward Function Engineering

The reward function drives what behavior the robot learns. For humanoid walking, a well-designed reward balances forward progress, energy efficiency, and stability:

# humanoid_reward.py — Reward computation for humanoid locomotion
"""Compute reward components for humanoid walking task in Isaac Gym."""
import math


def compute_locomotion_reward(
    linear_velocity: tuple[float, float, float],
    angular_velocity: tuple[float, float, float],
    torso_height: float,
    joint_torques: list[float],
    target_velocity: float = 1.0,
    target_height: float = 0.9,
    alive_bonus: float = 2.0,
    velocity_weight: float = 1.5,
    energy_weight: float = 0.01,
    height_weight: float = 1.0,
    angular_penalty_weight: float = 0.1,
    fall_threshold: float = 0.4,
) -> dict:
    """
    Compute multi-component reward for humanoid walking.

    Components:
    - alive_bonus: positive reward for staying upright
    - velocity_reward: reward for matching target forward velocity
    - energy_penalty: penalize high joint torques (encourage efficiency)
    - height_reward: reward for maintaining target torso height
    - angular_penalty: penalize excessive torso rotation
    """
    # Check if robot has fallen
    fallen = torso_height < fall_threshold
    if fallen:
        return {
            "total_reward": -10.0,
            "alive_bonus": 0.0,
            "velocity_reward": 0.0,
            "energy_penalty": 0.0,
            "height_reward": 0.0,
            "angular_penalty": 0.0,
            "fallen": True,
            "done": True,
        }

    # Forward velocity reward (exponential for smooth gradient)
    vx = linear_velocity[0]
    velocity_error = abs(vx - target_velocity)
    velocity_reward = velocity_weight * math.exp(-2.0 * velocity_error)

    # Energy penalty (sum of squared torques)
    energy = sum(t * t for t in joint_torques)
    energy_penalty = -energy_weight * energy

    # Height reward (keep torso at target height)
    height_error = abs(torso_height - target_height)
    height_reward = height_weight * math.exp(-5.0 * height_error)

    # Angular velocity penalty (minimize torso spin)
    angular_speed = math.sqrt(sum(w * w for w in angular_velocity))
    angular_penalty = -angular_penalty_weight * angular_speed

    total = alive_bonus + velocity_reward + energy_penalty + height_reward + angular_penalty

    return {
        "total_reward": total,
        "alive_bonus": alive_bonus,
        "velocity_reward": velocity_reward,
        "energy_penalty": energy_penalty,
        "height_reward": height_reward,
        "angular_penalty": angular_penalty,
        "fallen": False,
        "done": False,
    }


if __name__ == "__main__":
    result = compute_locomotion_reward(
        linear_velocity=(0.95, 0.01, 0.0),
        angular_velocity=(0.1, 0.05, 0.02),
        torso_height=0.88,
        joint_torques=[5.0, 3.0, 2.0, 4.0, 1.0] * 4 + [0.5],
        target_velocity=1.0,
    )
    print("Humanoid locomotion reward breakdown:")
    for key, value in result.items():
        if isinstance(value, float):
            print(f"  {key}: {value:.4f}")
        else:
            print(f"  {key}: {value}")
    # Expected output:
    #   total_reward: ~4.25
    #   alive_bonus: 2.0000
    #   velocity_reward: ~1.36
    #   energy_penalty: ~-2.50
    #   height_reward: ~0.90
    #   angular_penalty: ~-0.01
    #   fallen: False
    #   done: False

Domain Randomization

Domain randomization is a technique where simulation parameters are varied randomly during training so the resulting policy is robust to real-world variations. Isaac Gym supports randomizing:

Physics parameters — friction, restitution, mass, center of mass
Visual appearance — textures, colors, lighting (for vision-based policies)
External disturbances — random pushes, varying ground slopes
Observation noise — simulating sensor noise in joint encoders and IMU

By training across these variations, the policy learns to be robust to the inevitable differences between simulation and reality—the cornerstone of sim-to-real transfer.

Training at Scale

Isaac Gym pairs naturally with NVIDIA's rl_games library for PPO training:

# Train a humanoid walking policy with Isaac Gym + rl_games
python train.py task=Humanoid \
  num_envs=4096 \
  max_iterations=5000 \
  headless=True

# Evaluate the trained policy with rendering
python train.py task=Humanoid \
  num_envs=64 \
  test=True \
  checkpoint=runs/Humanoid/nn/Humanoid.pth

# Expected training metrics (after 5000 iterations):
#   Mean reward: ~6000-8000
#   Mean episode length: ~900-1000 steps
#   Training time: ~2-4 hours on RTX 3090

Key Takeaways

Isaac Gym runs physics, observations, rewards, and inference entirely on the GPU, eliminating CPU-GPU data transfer bottlenecks
It can simulate tens of thousands of parallel environments on a single GPU, achieving million-step-per-second throughput
Observation space design should capture the robot's full dynamic state: joint positions/velocities, base pose, gravity, and previous actions
Multi-component reward functions balance task objectives (forward velocity) with constraints (energy, stability, smoothness)
Domain randomization across physics and visual parameters is essential for policies that transfer from simulation to real robots

Learning Objectives​

What Is Isaac Gym?​

Isaac Gym vs Traditional RL Pipelines​

Setting Up Isaac Gym Environments​

Observation Space Design​

Reward Function Engineering​

Domain Randomization​

Training at Scale​

Key Takeaways​