Member of Technical Staff, Distributed Training Systems

Bagel Labs is an AI research lab that trains frontier diffusion models across commodity hardware rather than on a single uniform GPU cluster. Our method, Distributed Diffusion Models (DDM), trains many smaller expert models independently with no gradient synchronization, then combines them at inference with a lightweight router. Paris-1 proved the idea on images and Paris-2 proved it on video, where three 11B experts and a router beat a monolithic baseline trained on the same compute by more than 50% on FVD. We are now applying DDM to physical AI, where training runs are long, expensive, and easy to break.

We do not screen on years of experience or pedigree. If you have strong systems taste and can make messy research infrastructure reliable under pressure, we want to hear from you. Every requirement below is flexible for someone with the engineering judgment to back it up.

Role Overview

You will build the systems layer that turns frontier research into results we can trust. Training across mixed hardware with no gradient sync breaks the usual playbook, so much of this infrastructure does not exist yet and you will invent it. The work spans distributed training, GPU orchestration, observability, benchmark harnesses, experiment tracking, and data and model pipelines. Physical AI workloads are the focus, but the core skill is building high-leverage ML systems that researchers actually want to use.

What You'll Do

Build and operate distributed training for diffusion-heavy workloads across heterogeneous compute.
Make experiments first-class, with solid launchers, configs, checkpointing, logging, metrics, run comparison, and reproducibility.
Build benchmark and evaluation harnesses for physical AI research, including robotics and world model experiments.
Own the data and model pipelines so every result traces back to a dataset, a version, and a config.
Add observability for GPU utilization, failure modes, data quality, routing behavior, model quality, and training stability.
Turn fragile research prototypes into repeatable runs and trustworthy artifacts.

Who You Might Be

You have made messy experiments reliable before, maybe in ML infrastructure, distributed training, research engineering, GPU systems, or data infrastructure. Strong candidates come from large model training, video generation, inference systems, infrastructure startups, or research teams where prototypes had to grow into dependable systems. You think about the researcher on the other side of your tools, and you build things they will actually use.

Desired Skills

Hands-on experience with distributed training, GPU workloads, experiment infrastructure, or large-scale ML systems.
The ability to debug performance, reliability, and reproducibility problems in complex training and evaluation workflows.
A taste for simple tools that researchers will actually adopt.
Clear communication and strong ownership.

What We Offer

Competitive compensation and meaningful equity.
A deeply technical culture where research and systems work stay closely tied together.
Ownership of foundational infrastructure that decides whether frontier ideas turn into results we can trust.
Paid travel to the top ML and systems conferences around the world.