"One GPU, Many Models: What Works and What Segfaults" (2026)

Saturday at 13:55, 20 minutes, UD2.120 (Chavanne), AI Plumbers. Speaker: Yash Panchal. Slides, video.

Serving multiple models on a single GPU sounds great until something segfaults.

Two approaches dominate parallel inference on NVIDIA GPUs: MIG (Multi-Instance GPU, hardware partitioning) and MPS (Multi-Process Service, software sharing). Both promise efficient GPU sharing.

I tested both strategies by running video generation workloads in parallel.

This talk digs into what actually happened: where things worked, where memory isolation fell apart, which configs crashed, and what survives under load.

By the end, you'll know:

  1. How to utilize unused GPU capacity.
  2. How to set up MIG and MPS.
  3. Which memory issues, crashes, and failures to expect.
  4. Workload-specific configurations.
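
The setup in point 2 can be sketched with the standard `nvidia-smi` and MPS tooling. This is a hedged outline, not the speaker's exact commands: the GPU index, the MIG profile IDs, and the two-instance layout are assumptions that vary by GPU model and driver version.

```shell
# Assumptions: MIG-capable GPU (e.g. A100-class) at index 0, recent driver, root access.

# --- MIG: hardware partitioning ---
sudo nvidia-smi -i 0 -mig 1          # enable MIG mode (may require a GPU reset)
nvidia-smi mig -lgip                 # list the GPU instance profiles your GPU supports
sudo nvidia-smi mig -cgi 9,9 -C     # example: two instances of profile ID 9, with compute instances
nvidia-smi -L                        # shows MIG device UUIDs to target via CUDA_VISIBLE_DEVICES

# --- MPS: software sharing ---
export CUDA_VISIBLE_DEVICES=0
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS # route all clients through the MPS daemon
nvidia-cuda-mps-control -d           # start the MPS control daemon
# ... launch inference processes; they now share the GPU through MPS ...
echo quit | nvidia-cuda-mps-control  # shut down MPS when done
```

Note the design difference this exposes: MIG carves the GPU into isolated slices with their own memory, while MPS lets processes share the whole GPU through one daemon, which is where memory isolation can fall apart.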