North America / Full-time / Systems

Member of Technical Staff, Distributed Training Systems

Bagel Labs is an AI research lab and infrastructure company building distributed training systems for diffusion-heavy physical AI. Our work started with the Paris family of image and video models and now extends toward physical-AI workloads where world models, action representations, simulation, and heterogeneous compute become first-order bottlenecks.

We ignore years of experience and pedigree. If you have high agency, strong systems taste, and can make research infrastructure reliable under ambiguity, we want to hear from you. Every requirement below is flexible for a candidate with enough engineering judgment.

Role Overview

You will build the systems layer that lets Bagel turn research into credible proof. This role spans distributed training, GPU orchestration, observability, benchmark infrastructure, experiment tracking, data/model pipelines, and the engineering glue needed to run serious physical-AI experiments.

The role is broader than data-pipeline work or benchmark maintenance. Physical-AI workloads are the wedge, but the core skill is building high-leverage ML systems.

Key Responsibilities

  • Build and operate distributed training infrastructure for diffusion-heavy workloads across heterogeneous compute.
  • Improve experiment infrastructure: launchers, configs, checkpointing, logging, metrics, run comparison, and reproducibility.
  • Build benchmark and evaluation harnesses for physical-AI research, including robotics/world-model-style experiments where relevant.
  • Own data and model pipelines that keep large experiments traceable from dataset/version/config to result.
  • Add observability for GPU utilization, failure modes, data quality, routing behavior, model quality, and training stability.
  • Work with researchers to turn fragile prototypes into repeatable runs and credible artifacts.

First 90 Days

  • Map the current Paris 3 / physical-AI experiment stack and identify the highest-friction bottlenecks.
  • Ship one infrastructure improvement that makes experiments more reproducible, observable, or cheaper to run.
  • Stand up or harden a benchmark/evaluation path for a physical-AI workload.
  • Produce a clear runbook for future experiments, including failure diagnosis and result packaging.

Who You Might Be

You might come from ML infrastructure, distributed training, research engineering, GPU systems, observability, data infrastructure, benchmark engineering, simulation infrastructure, or model-platform work.

You do not need to be a robotics specialist. Strong candidates may come from large-model training, computer vision, video generation, inference systems, infra startups, AI labs, GPU/cloud platforms, or research teams where messy experiments had to become reliable.

Desired Skills

  • Strong engineering ability in Python and at least one systems-oriented stack used in ML infrastructure.
  • Experience with distributed training, GPU workloads, experiment infrastructure, data/model pipelines, or large-scale ML systems.
  • Ability to debug performance, reliability, reproducibility, and observability issues in complex training or evaluation workflows.
  • Taste for simple tools that researchers will actually use.
  • Clear communication and strong ownership.

What We Offer

  • Top-of-market compensation and meaningful equity.
  • A deeply technical culture where research and systems work are tightly connected.
  • Ownership over infrastructure that can decide whether frontier ideas become credible proof.
  • Flexible location for the right candidate, including relocation support when appropriate.
  • Paid travel opportunities to top ML and systems conferences around the world.