The KAUST Supercomputing Core Lab invites you to join the Distributed Deep Learning Workshop on IBEX, a hands-on training designed to help users efficiently scale AI workloads across multiple GPUs and compute nodes using IBEX’s high-performance computing environment.
This workshop provides a practical introduction to the essential distributed training frameworks for accelerating model training on IBEX GPUs using data and model parallelism. We will focus on PyTorch Distributed Data Parallel (DDP), DeepSpeed, Fully Sharded Data Parallel (FSDP), and NVIDIA NeMo, demonstrating how to scale from one to many GPUs on single and multiple nodes of IBEX.
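As a preview of the hands-on sessions, the sketch below shows the core of a PyTorch DDP training step. It is a minimal illustration, not workshop material: the model, data, and hyperparameters are placeholders, and it assumes a launch via torchrun (which sets RANK, LOCAL_RANK, and WORLD_SIZE for each process).

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun supplies the rendezvous info via environment variables.
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
device = f"cuda:{local_rank}"

model = torch.nn.Linear(128, 10).to(device)   # placeholder model
model = DDP(model, device_ids=[local_rank])   # wrap for gradient sync
opt = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(10):                        # placeholder training loop
    x = torch.randn(32, 128, device=device)  # each rank sees its own batch
    loss = model(x).square().mean()
    opt.zero_grad()
    loss.backward()                           # DDP all-reduces gradients here
    opt.step()

dist.destroy_process_group()
```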
Register here: Distributed Deep Learning Workshop on IBEX
Who should attend
- Researchers working with ML and DL models
- Data scientists and computational scientists
- AI engineers working with GPU-intensive workloads
- Anyone interested in scaling model training on HPC systems
Learning outcomes
After attending, participants will be able to:
- Use distributed training frameworks (DDP, DeepSpeed, FSDP, NVIDIA NeMo)
- Launch and manage multi-GPU and multi-node jobs using SLURM on IBEX (see the sketch after this list)
- Understand, through hands-on exercises, how models and frameworks limit scaling across multiple GPUs – “using more compute resources doesn’t always mean faster model training”
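To illustrate the SLURM integration mentioned above, here is a minimal sketch of how a multi-node PyTorch job can derive its rendezvous settings from standard SLURM environment variables (e.g., when launched with `srun python train.py`). The variable names below are standard SLURM; the exact launch recipe taught for IBEX may differ.

```python
import os
import subprocess
import torch.distributed as dist

# Standard variables SLURM exports to each task.
rank = int(os.environ["SLURM_PROCID"])         # global rank of this task
world_size = int(os.environ["SLURM_NTASKS"])   # total number of tasks
local_rank = int(os.environ["SLURM_LOCALID"])  # rank within this node

# Use the first node in the allocation as the rendezvous host.
master = subprocess.check_output(
    ["scontrol", "show", "hostnames", os.environ["SLURM_NODELIST"]],
    text=True,
).split()[0]
os.environ.setdefault("MASTER_ADDR", master)
os.environ.setdefault("MASTER_PORT", "29500")  # any free port

dist.init_process_group("nccl", rank=rank, world_size=world_size)
print(f"rank {rank}/{world_size} (local rank {local_rank}) initialized")
dist.destroy_process_group()
```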
Important Note on Workshop Scope
This workshop focuses on scaling and distributing existing deep learning workloads rather than teaching fundamental Python or neural network concepts. Attendees are expected to have prior familiarity with Python-based ML frameworks (e.g., PyTorch) and basic model training. The sessions will emphasize practical usage of distributed training frameworks and optimizing performance at scale on IBEX—not introductory model development.
Agenda
Day 1:
9:00 – 10:00 — Distributed Deep Learning Overview
10:00 – 10:15 — Coffee break
10:15 – 12:00 — Hands-On Session: PyTorch Distributed Data Parallel
12:00 – 1:00 — Lunch Break
1:00 – 1:45 — Hands-On Session: DeepSpeed
1:45 – 2:00 — Coffee break
2:00 – 3:00 — Hands-On Session: DeepSpeed (continued)
Day 2:
9:00 – 10:00 — Hands-On Session: Fully Sharded Data Parallel
10:00 – 10:15 — Coffee break
10:15 – 12:00 — Hands-On Session: Fully Sharded Data Parallel (continued)
12:00 – 1:00 — Lunch Break
1:00 – 2:15 — Hands-On Session: NVIDIA NeMo
Register here: Distributed Deep Learning Workshop on IBEX
For any questions, please contact: training@hpc.kaust.edu.sa
This opportunity is brought to you by the KAUST Core Labs – Supercomputing Core Lab.