Distributed Superintelligence.
Bagel Labs is an Artificial Intelligence Research Lab developing novel methods for distributed training of frontier diffusion models on commodity hardware. Our work enables training of state-of-the-art generative models for robotics, video, and world modelling across heterogeneous hardware, unlocking compute capacity that current training architectures can't touch.
Decentralized Diffusion Models.
Decentralized Diffusion Models (DDM) replace a single large diffusion model with an ensemble of smaller expert models, each trained independently on a partition of the dataset with no gradient synchronization between nodes. At inference, a lightweight router ensembles their outputs. This removes the tight coupling that forces conventional training onto homogeneous GPU superclusters.
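As a rough illustration of the routing idea (not the actual Paris router; all names and the similarity heuristic here are hypothetical), a top-k router can score each expert against the conditioning signal and average the selected experts' noise predictions:

```python
import math

def route_and_ensemble(cond_embedding, expert_centroids, expert_outputs, k=2):
    """Hypothetical top-k router: score each expert by cosine similarity
    between the conditioning embedding and that expert's data-partition
    centroid, then average the noise predictions of the k best experts."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    scores = [cosine(cond_embedding, c) for c in expert_centroids]
    top_k = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    # Average the selected experts' predictions elementwise.
    n = len(expert_outputs[0])
    return [sum(expert_outputs[i][j] for i in top_k) / k for j in range(n)]
```

Because no gradients flow between experts, the only cross-node coupling in this design is the cheap inference-time ensembling step.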
Paris-1.
Paris is the first publicly released DDM. Despite using 14x less data and 16x less compute than prior decentralized baselines, it outperforms models trained on traditional monolithic clusters, achieving a 24% FID improvement (22.60 vs 29.64) on standard benchmarks.
| Inference Strategy | FID-50K ↓ |
|---|---|
| Monolithic (single) | 29.64 |
| Top-1 | 30.60 |
| Top-2 | 22.60 |
| Full Ensemble | 47.89 |
| Improvement (Monolithic − Top-2) | 7.04 |
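The 24% headline figure is the relative FID reduction of Top-2 routing over the monolithic baseline, taken from the table's own numbers:

```python
monolithic, top2 = 29.64, 22.60
delta = monolithic - top2              # absolute FID gap from the table
relative = delta / monolithic * 100    # relative improvement in percent
print(round(delta, 2), round(relative, 1))
```

This prints `7.04 23.8`, which the text reports as a ~24% improvement.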
We are Bagel Labs, an artificial intelligence research lab pioneering distributed training of frontier diffusion models on commodity hardware.
We ignore years of experience and pedigree. If you have high agency, meaning your default assumption is that you can control the outcome of whatever situation you are in, we want to hear from you. Every requirement below is flexible for a candidate with high enough agency and tolerance for ambiguity.
Role Overview
You will design, build, and relentlessly optimize the infrastructure that trains and serves large diffusion models. Your job is to make GPUs go faster, make clusters behave, and make training and inference scale across multiple nodes, regions, and hardware types without turning into a reliability tax.
This role sits at the intersection of systems engineering, performance engineering, and research enablement. You will touch kernels, networking, orchestration, compilers, and model code when needed.
Key Responsibilities
- Build and operate distributed training stacks for diffusion models (U-Net, DiT, video diffusion, world-model variants) across multi-node GPU clusters.
- Implement and tune parallelism strategies for training and inference, including data parallel, tensor parallel, pipeline parallel, ZeRO/FSDP-style sharding, expert parallel, and diffusion-specific tricks (timestep-level scheduling, CFG parallelism, microbatching).
- Profile end-to-end GPU performance and remove bottlenecks across kernels, memory, comms, and I/O (CUDA graphs, kernel fusion, attention kernels, NCCL tuning, overlap of compute and comms).
- Own inference serving for diffusion workloads with high throughput and predictable latency, including dynamic batching, variable resolution handling, caching, prefill/conditioning optimization, and multi-GPU execution.
- Design robust orchestration for heterogeneous and preemptible environments (on-prem, bare metal, cloud, spot), including checkpointing, resumability, and fault tolerance.
- Build observability that is actually useful for diffusion: step-time breakdowns, denoising throughput, VRAM headroom, NCCL health, queueing, tail latency, error budgets, and cost per sample.
- Implement pragmatic quantization and precision strategies for diffusion inference and training, balancing quality, speed, and stability (BF16/FP16/TF32/FP8, weight-only INT8/INT4 where it makes sense, selective quantization of submodules).
- Improve developer velocity through reproducible environments, CI for performance regressions, and automation for cluster bring-up and rollouts.
- Write clear internal docs and occasional public technical deep-dives on blog.bagel.com when it helps the community and hiring.
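One pattern from the list above, fault tolerance on preemptible nodes, reduces to a checkpoint/resume loop with atomic writes. A minimal sketch (file format, names, and the stand-in "training step" are illustrative, not our actual stack):

```python
import json
import os
import tempfile

def save_checkpoint(path, step, state):
    """Write atomically so a preemption mid-write never corrupts the file."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)  # atomic rename on POSIX

def train(path, total_steps):
    """Resume from the last checkpoint if one exists, else start at step 0."""
    step, state = 0, {"loss": None}
    if os.path.exists(path):
        with open(path) as f:
            ckpt = json.load(f)
        step, state = ckpt["step"], ckpt["state"]
    while step < total_steps:
        step += 1
        state["loss"] = 1.0 / step  # stand-in for a real training step
        if step % 10 == 0 or step == total_steps:
            save_checkpoint(path, step, state)
    return step, state
```

The same shape holds for real runs; only the state payload (model shards, optimizer state, data-loader cursor) and the storage backend change.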
Who You Might Be
You are the person teammates call when GPUs underperform, distributed training deadlocks, or a "simple" deployment turns into a week of whack-a-mole. You like the ugly truth in traces and profiler timelines. You can move between high-level architecture and low-level debugging without getting lost.
You probably have scars from at least a few of these:
- chasing down NCCL hangs, stragglers, and clock drift
- fixing memory fragmentation and OOMs that should not happen
- turning a 2x slowdown into a 10 percent regression by changing one flag, then learning why
- shipping a system that stays up while people are actively trying to break it
Required Skills (flexible)
- Strong Linux fundamentals, networking basics, and the ability to debug production incidents without panic.
- Deep GPU performance instincts: profiling, memory behavior, kernel-level thinking, and practical CUDA tooling literacy (even if you are not writing CUDA daily).
- Hands-on experience scaling training and/or inference across multiple GPUs and nodes.
- Comfort implementing parallelism and sharding in modern frameworks (PyTorch, NCCL, torch.distributed, FSDP/ZeRO-style systems, or equivalent).
- Experience building reliable deployment pipelines (containers, rollouts, versioning, rollback, secrets, config management).
- The ability to read model code and change it when infrastructure and performance require it.
Bonus Skills
- Contributions to open-source performance or distributed systems projects (PyTorch internals, Triton kernels, xFormers/FlashAttention, NCCL tooling, Ray, Kubernetes operators, etc.).
- Experience with diffusion-specific serving and optimization (Diffusers, ComfyUI, custom schedulers/solvers, distillation, few-step generation, VAE decode optimization, tiled generation).
- TensorRT or compiler experience (torch.compile/Inductor, XLA, CUDA graphs), and a habit of measuring instead of guessing.
- Experience building multi-tenant GPU platforms with isolation, fair scheduling, and predictable QoS.
- Comfort with cost engineering: understanding where dollars burn in GPU clusters and how to reduce it without fragility.
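The "measuring instead of guessing" habit above is cheap to codify. A minimal timing harness of the kind we mean (names illustrative; real GPU benchmarking additionally needs device synchronization):

```python
import statistics
import time

def bench(fn, *args, warmup=3, iters=20):
    """Median-of-N wall-clock timing. Warmup runs absorb one-time costs
    (allocator warmup, JIT/compile caches) so they don't skew the result."""
    for _ in range(warmup):
        fn(*args)
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)
```

We prefer medians over means because stragglers and one-off stalls dominate tail samples.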
What We Offer
- Top-of-market compensation.
- A deeply technical culture where bold frontier ideas are debated, stress-tested, and built.
- High autonomy and direct ownership of critical systems.
- In-person role at our Toronto office.
- Work that can set the direction for frontier diffusion models.
- Paid travel opportunities to the top ML conferences around the world.
We are Bagel Labs - an artificial intelligence research lab pioneering distributed training of frontier diffusion models on commodity hardware.
We ignore years of experience and pedigree. If you have high agency - meaning your default assumption is that you can control the outcome of whatever situation you are in - we want to hear from you. Every requirement below is flexible for a candidate with high enough agency and tolerance for ambiguity.
Role Overview
We encourage curiosity-driven research and welcome bold, untested concepts. You will explore frontiers in continual learning, world modelling, and reinforcement learning on diffusion models. We prize provocative ideas that challenge conventional paradigms.
Key Responsibilities
- Advance decentralized diffusion models (DDM) and pioneer next-generation architectures including rectified flows, EDM variants, and latent consistency models.
- Develop novel sampling algorithms, guidance mechanisms, and conditioning strategies that unlock new capabilities in controllable generation.
- Push the frontier of video generation and synthesis, including temporal modeling and multi-modal architectures.
- Publish at top-tier ML venues and share insights through blog posts, open-source contributions, and community engagement.
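To ground the rectified-flow direction above: sampling reduces to integrating an ODE dx/dt = v(x, t) from t = 0 to 1, where v would be a learned network. A toy Euler-integration sketch (the constant-velocity "flow" here is a stand-in for a trained model):

```python
def euler_sample(x0, velocity, steps=8):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps.
    In a rectified flow, `velocity` would be a learned network v_theta."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy check: on a straight (rectified) path the true velocity is the
# constant x1 - x0, so Euler recovers x1 regardless of the step count.
x0, x1 = [0.0, 0.0], [1.0, -2.0]
out = euler_sample(x0, lambda x, t: [b - a for a, b in zip(x0, x1)], steps=4)
```

Straightening the flow is exactly what makes few-step generation viable: the straighter the path, the fewer Euler steps needed.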
Who You Might Be
You are extremely curious. You actively consume the latest ML research - scanning arXiv, attending conferences, dissecting new open-source releases, and integrating breakthroughs into your own experimentation. You thrive on first-principles reasoning, see potential in unexplored ideas, and view learning as a perpetual process.
Desired Skills
- Deep expertise in modern diffusion models including training, sampling, denoising schedulers, score matching, flow matching, consistency training, and distillation techniques.
- Experience with transformer architectures such as DiT, MM-DiT, and attention mechanisms.
- Hands-on experience with distributed training at scale across multi-GPU and multi-node setups, with familiarity in mixed-precision training (FP8, BF16).
- Experience with video generation and synthesis, including temporal modeling and 3D positional encodings.
- Knowledge of VAE architectures such as HunyuanVAE, DC-AE, and latent representations, as well as motion modeling and optical flow.
- Strong mathematical foundation in SDEs, ODEs, optimal transport, and variational inference for designing novel generative objectives.
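As one concrete instance of those foundations, the conditional flow-matching objective underlying rectified flows regresses a velocity network onto the straight-line displacement between paired endpoints (conventions for which endpoint is noise and which is data vary by paper):

```latex
x_t = (1 - t)\, x_0 + t\, x_1, \qquad
\mathcal{L}_{\mathrm{FM}}(\theta)
  = \mathbb{E}_{t \sim \mathcal{U}[0,1],\; x_0,\, x_1}
    \left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|^2
```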
What We Offer
- Top-of-market compensation and time to pursue open-ended research.
- A deeply technical culture where bold, frontier ideas are debated, stress-tested, and built.
- In-person role at our Toronto office.
- Ownership of work that can set the direction for frontier diffusion models.
- Paid travel opportunities to the top ML conferences around the world.
We are Bagel Labs - an artificial intelligence research lab pioneering distributed training of frontier diffusion models on commodity hardware.
We ignore years of experience and pedigree. If you have high agency - meaning your default assumption is that you can control the outcome of whatever situation you are in - we want to hear from you. Every requirement below is flexible for a candidate with high enough agency and tolerance for ambiguity.
Role
Work directly with the CEO and own go-to-market end to end: land the first partners, loop their feedback into product, and iterate to product-market fit (PMF).
Responsibilities
- Map the partner landscape, rank by impact, and pursue.
- Lead high level technical and commercial conversations.
- Represent Bagel at ML events.
- Track metrics, report learnings, and adjust strategy.
Requirements (flexible)
- Track record in GTM, BD, or partnerships at early-stage deep-tech startups.
- Deep understanding of the AI stack; knowledge of decentralized-AI tooling is a plus.
- Existing network of builders, investors, or partners in AI.
- Bias to action and data.
- Crisp written and verbal communication.
What We Offer
- Competitive salary plus founding-team-level equity.
- Direct influence on company trajectory and culture.
We are Bagel Labs - an artificial intelligence research lab pioneering distributed training of frontier diffusion models on commodity hardware.
We ignore years of experience and pedigree. If you have high agency - meaning your default assumption is that you can control the outcome of whatever situation you are in - we want to hear from you. Every requirement below is flexible for a candidate with high enough agency and tolerance for ambiguity.
Role Overview
You will build the data foundation for our frontier video and image generation diffusion models, turning massive, messy collections of media into clean, well-labeled datasets that researchers can trust for training. You will own pipelines end to end and work closely with the modeling team to unblock experiments and catch data issues before they quietly degrade model quality.
Key Responsibilities
- Build pipelines that ingest, filter, and transform millions of video clips and images into training-ready shards.
- Run quality scoring and synthetic captioning at scale across GPU clusters.
- Own dataset versioning so researchers can trace any training run back to an exact snapshot.
- Optimize storage and compute to keep PB-scale processing fast and cost-efficient.
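The versioning bullet above can start as simply as content-addressing the shard list, so a training run records one short ID that resolves to exact bytes. A hypothetical sketch:

```python
import hashlib

def snapshot_id(shard_paths, shard_checksums):
    """Derive a deterministic dataset-snapshot ID from the sorted list of
    shards and their content checksums. Any training run that logs this ID
    can later be traced back to the exact data it trained on."""
    h = hashlib.sha256()
    for path, checksum in sorted(zip(shard_paths, shard_checksums)):
        h.update(f"{path}\t{checksum}\n".encode())
    return h.hexdigest()[:16]
```

Because the pairs are sorted before hashing, the ID is independent of enumeration order but changes whenever any shard's content changes.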
Who You Might Be
You stay on top of the latest vision model releases and are often one of the first to try new open-source tools when they drop. You have strong intuition for what makes a good training sample and get frustrated when bad data silently hurts model quality. You are pragmatic about making systems work reliably even when requirements shift mid-flight.
Desired Skills
- Comfort with video and image data at the file level, whether that means transcoding, cropping, or detecting scene boundaries.
- Experience running filters, scorers, or captioning models across large media datasets.
- Python proficiency for batch processing and moving petabytes through object storage.
- Experience with large-scale storage systems including NAS, object storage, and distributed filesystems.
- Familiarity with text encoders and vision-language models such as T5 and CLIP for embedding precomputation and captioning at scale.
- Basic understanding of video codecs and containers: knowing the difference between H.264/H.265, keyframe structures, and variable frame rates matters at this scale.
- Understanding of how diffusion models work is a plus.
- Active Hugging Face or GitHub presence with open-source contributions is a plus.
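One way the variable-frame-rate point above shows up in practice: given a clip's decoded presentation timestamps, flag VFR clips before they poison temporal training. A heuristic sketch (the tolerance is illustrative):

```python
def is_variable_frame_rate(timestamps, tolerance=1e-3):
    """Heuristic VFR check: constant-frame-rate clips have (near-)equal
    gaps between consecutive presentation timestamps; large spread in the
    gaps indicates variable frame rate."""
    if len(timestamps) < 3:
        return False
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return max(gaps) - min(gaps) > tolerance
```

In a real pipeline the timestamps would come from the demuxer, and flagged clips get resampled to a fixed rate or dropped.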
What We Offer
- Top-of-market compensation with equity upside and time to pursue open-ended research.
- A deeply technical culture where bold, frontier ideas are debated, stress-tested, and built.
- In-person role at our Toronto office.
- Paid travel opportunities to the top ML conferences around the world.