We are Bagel Labs, an artificial intelligence research lab pioneering distributed training of frontier diffusion models on commodity hardware.
We ignore years of experience and pedigree. If you have high agency, meaning your default assumption is that you can control the outcome of whatever situation you are in, we want to hear from you. Every requirement below is flexible for a candidate with high enough agency and tolerance for ambiguity.
Role Overview
You will design, build, and relentlessly optimize the infrastructure that trains and serves large diffusion models. Your job is to make GPUs go faster, make clusters behave, and make training and inference scale across multiple nodes, regions, and hardware types without turning into a reliability tax.
This role sits at the intersection of systems engineering, performance engineering, and research enablement. You will touch kernels, networking, orchestration, compilers, and model code when needed.
Key Responsibilities
- Build and operate distributed training stacks for diffusion models across multi-node GPU clusters.
- Implement and tune parallelism strategies for training and inference, including data parallel, tensor parallel, pipeline parallel, ZeRO/FSDP-style sharding, expert parallel, and diffusion-specific tricks.
- Profile end-to-end GPU performance and remove bottlenecks across kernels, memory, communications, and I/O.
- Own inference serving for diffusion workloads with high throughput and predictable latency.
- Design robust orchestration for heterogeneous and preemptible environments, including checkpointing, resumability, and fault tolerance.
- Build useful observability for diffusion: step-time breakdowns, denoising throughput, VRAM headroom, NCCL health, queueing, tail latency, error budgets, and cost per sample.
- Implement pragmatic quantization and precision strategies for diffusion inference and training.
- Improve developer velocity through reproducible environments, performance CI, and automation for cluster bring-up and rollouts.
- Write clear internal docs and occasional public technical deep-dives when it helps the community and hiring.
Who You Might Be
You are the person teammates call when GPUs underperform, distributed training deadlocks, or a simple deployment turns into a week of debugging. You like the truth in traces and profiler timelines. You can move between high-level architecture and low-level debugging without getting lost.
You probably have scars from chasing down NCCL hangs, fixing memory fragmentation, learning why one flag caused a slowdown, or shipping a system that stays up while people are actively trying to break it.
Required Skills
- Strong Linux fundamentals, networking basics, and the ability to debug production incidents without panic.
- Deep GPU performance instincts: profiling, memory behavior, kernel-level thinking, and practical CUDA tooling literacy.
- Hands-on experience scaling training and/or inference across multiple GPUs and nodes.
- Comfort implementing parallelism and sharding in modern frameworks such as PyTorch, NCCL, torch.distributed, FSDP, or ZeRO-style systems.
- Experience building reliable deployment pipelines across containers, rollouts, versioning, rollback, secrets, and config management.
- The ability to read model code and change it when infrastructure and performance require it.
Bonus Skills
- Contributions to open-source performance or distributed systems projects.
- Experience with diffusion-specific serving and optimization.
- TensorRT or compiler experience and a habit of measuring instead of guessing.
- Experience building multi-tenant GPU platforms with isolation, fair scheduling, and predictable QoS.
- Comfort with cost engineering in GPU clusters.
What We Offer
- Top of the market compensation.
- A deeply technical culture where bold frontier ideas are debated, stress-tested, and built.
- High autonomy and direct ownership of critical systems.
- In-person role at our Toronto office.
- Work that can set the direction for frontier diffusion models.
- Paid travel opportunities to top ML conferences.