Events in room H.1308 (Rolin)

Sat

 "eBPF Hookpoint Gotchas: Why Your Program Fires (or Fails) in Unexpected Ways" ( 2026 )

Saturday at 10:30, 30 minutes, H.1308 (Rolin), eBPF; Donia Chaiehloudj, Chris Tarazi; slides, video

eBPF programs often behave differently than developers expect, not because of incorrect logic, but because of subtle behaviours of the hookpoints themselves. In this talk, we focus on a small set of high-impact, commonly misunderstood attachment types (kprobes/fentry, tracepoints, and uprobes) and expose the internal kernel mechanics that cause surprising edge cases.

Rather than attempting to cover all eBPF hooks, this session distills a practical set of real-world gotchas that routinely affect production tools, explaining why they occur and how to work around them.
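To make this kind of gotcha concrete, here is a minimal, illustrative sketch (not taken from the talk) that attaches a handler as a kprobe using BCC from Python; whether and when such a program fires depends on details like symbol inlining, renaming across kernel versions, and BTF availability for fentry:

```python
# Minimal BCC sketch (illustrative only): attach one handler as a kprobe.
# Gotcha territory: the target symbol can be inlined or renamed on other
# kernel versions, and an fentry attachment would additionally require BTF.
from bcc import BPF

prog = r"""
int trace_exec(struct pt_regs *ctx) {
    bpf_trace_printk("execve entered\n");
    return 0;
}
"""

b = BPF(text=prog)
# Resolve the architecture-specific syscall symbol instead of hard-coding it.
b.attach_kprobe(event=b.get_syscall_fnname("execve"), fn_name="trace_exec")
b.trace_print()
```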

 "Lessons from scaling BPF to detect RDMA Device Drivers Bugs in real time" ( 2026 )

Saturday at 11:00, 30 minutes, H.1308 (Rolin), eBPF; Prankur Gupta, Maksim Samoilov; slides, video

Training large models requires significant resources, and the failure of any GPU or host can significantly prolong training times. At Meta, we observed that 17% of our jobs fail due to RDMA-related syscall errors that arise from bugs in the RDMA driver code. Unlike other parts of the kernel, RDMA-related syscalls are opaque, and the errors create a mismatched application/kernel view of hardware resources. As a result of this opacity and mismatch, existing observability tools provided limited visibility and DevOps found it challenging to triage – we required a new scalable framework to analyze kernel state and identify the cause of this mismatch.

Direct approaches, such as tracing the kernel calls and capturing the metadata involved, turned out to be prohibitively expensive. In this talk, we will describe the set of optimizations used to scale tracking of kernel state and the map-based systems designed to efficiently export relevant state without impacting production workloads.

 "Optimizing eBPF loading with reachability analysis" ( 2026 )

Saturday at 11:30, 30 minutes, H.1308 (Rolin), eBPF; Dylan Reimerink; slides, video

Any eBPF project started in the last couple of years is most likely written to take advantage of CO-RE: compiling eBPF programs ahead of time and being able to run them on a wide range of kernels and machines.

Before CO-RE it was common to ship the whole toolchain and compile on target, which is what Cilium currently still does. Compiling on target enabled a core value of Cilium: "you do not pay for what you do not use". But it turns out that with CO-RE you sometimes DO pay for what you do not use, which makes it painful to switch over.

This cost mostly comes in the form of unused maps that still have to be created and tail calls that are loaded but will never be called.

We created what we call "reachability analysis", which allows us to predict in userspace which parts of an eBPF program will be unused when loaded with a given set of global constants. This lets us avoid creating maps that will never be used or loading tail calls that will never be called, opening the way for Cilium's migration to CO-RE.

I would like to show how this works.

 "Performance and reliability pitfalls of eBPF" ( 2026 )

Saturday at 12:00, 30 minutes, H.1308 (Rolin), eBPF; Usama Saqib; slides, video

This talk will go over a number of performance and reliability pitfalls of the different eBPF program types we have discovered while building production-ready eBPF-based products at Datadog. These include the changing performance characteristics of kprobes over different kernel versions, reliability issues with fentry due to a kernel bug, and the pains of scaling uprobes, among other things.

 "OOMProf: profiling Go heap memory at OOM time" ( 2026 )

Saturday at 12:30, 30 minutes, H.1308 (Rolin), eBPF; Tommy Reilly; slides, video

OOMProf is a Go library that installs eBPF programs listening to the Linux kernel tracepoints involved in OOM killing and records a memory profile before your Go program is dead and gone. The memory profile can be logged as a pprof file or sent to a Parca server for storage and analysis. This talk will be a deep dive into the implementation, its limitations, and possible future directions.
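As a rough illustration of the mechanism (this is not OOMProf's implementation, which is written in Go and hooks the OOM tracepoints), a user-space observer can react to OOM kills with a kprobe on the kernel's oom_kill_process(), for example via BCC:

```python
# Hedged sketch: react to OOM kills from user space before the victim is gone.
# OOMProf itself attaches to OOM-related tracepoints from Go; this only shows
# the general idea using a kprobe on oom_kill_process() with BCC.
from bcc import BPF

prog = r"""
int on_oom(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    bpf_trace_printk("OOM kill triggered in context of pid %d\n", pid);
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event="oom_kill_process", fn_name="on_oom")
print("Waiting for OOM events... Ctrl-C to exit")
b.trace_print()
```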

 "Extending AF_XDP for fast co-located packet transfer" ( 2026 )

Saturday at 13:15, 30 minutes, H.1308 (Rolin), eBPF; Debojeet Das; slides, video

XDP and AF_XDP provide a high-performance mechanism for driver-layer packet processing and zero-copy delivery of packets into userspace, while maintaining access to standard kernel networking constructs — capabilities that distinguish them from full kernel-bypass frameworks such as DPDK. However, the current AF_XDP implementation offers no efficient in-kernel mechanism for forwarding packets between AF_XDP sockets in a zero-copy manner. Because AF_XDP operates without the conventional full network stack socket abstraction, even basic localhost redirection requires an external switch or additional hardware-assisted NIC capabilities, limiting both performance and usability.

In this talk, we introduce FLASH, an extension to the AF_XDP subsystem that enables low-overhead, in-kernel packet transfer between AF_XDP sockets. FLASH provides zero-copy delivery for sockets that share a memory area and a fast single-copy datapath for sockets backed by independent memories. The design incorporates several performance-oriented mechanisms, including smart blocking with backpressure for congestion handling and an adaptive interrupt-to-busypoll transition to reduce latency under load.

We demonstrate that co-located applications using AF_XDP can leverage FLASH to achieve up to 2.5× higher throughput compared to SR-IOV-based approaches, while preserving the programming model and flexibility of the XDP/AF_XDP ecosystem. The talk will also outline future directions and how FLASH can be one of the use cases of the XDP_EGRESS PoC.

Resources

FLASH PoC Linux Kernel
FLASH userspace library
FLASH paper @ SoCC'25

 "Lightweight XDP Profiling" ( 2026 )

Saturday at 13:45, 30 minutes, H.1308 (Rolin), eBPF; Andrea Monterubbiano, Vladimiro Paschali; slides, video

The eBPF eXpress Data Path (XDP) enables high-speed packet processing applications. Achieving high throughput requires careful design and profiling of XDP applications. However, existing profiling tools lack eBPF support. We introduce InXpect, a lightweight monitoring framework that profiles eBPF programs with fine granularity and minimal overhead, making it suitable for XDP-based in-production systems. We demonstrate how InXpect outperforms existing tools in profiling overhead and capabilities. InXpect is the first XDP/eBPF profiling system that provides real-time statistics streaming, enabling immediate detection of changes in program behavior.

 "XDP Virtual Server: An eBPF Load Balancer library" ( 2026 )

Saturday at 14:15, 30 minutes, H.1308 (Rolin), eBPF; David Coles; slides, video

Faced with the looming retirement of our traditional load balancer appliances, we decided to give XDP a try. Facebook's Katran library did not support layer 2 switching, which was still a requirement, so we built an eBPF application in C and a supporting library in Go.

We came across a few issues along the way - driver support for network cards gave me headaches - but on the whole eBPF has made what would have been practically unthinkable a few years ago into a relatively straightforward task.

The library is used by an application which adds configuration management, BGP, metrics, etc., and after testing on smaller services for some time the balancer now handles streaming audio and website content for the UK's largest commercial radio broadcaster, delivering tens of gigabits per second to our audience. COTS servers handle high volumes of traffic and can be scaled or migrated when updated hardware comes along as simply as running an Ansible job.

The library

The application

 "A Unified I/O Monitoring Framework Using eBPF" ( 2026 )

Saturday at 14:45, 30 minutes, H.1308 (Rolin), eBPF; Mahendra Paipuri; slides, video

The interoperability of I/O monitoring and profiling tools is very limited due to their strong dependence on the underlying file system (LUSTRE, Spectrum Scale, NFS, etc.) and resource managers (batch jobs, VMs, containerized workloads, etc.). Widely adopted generic monitoring tools often lack the temporal information of the I/O activity, which is often required to understand the I/O behavior of applications. The increasing diversity of applications and computing platforms demands greater flexibility and scope in I/O characterization.

This talk proposes a framework for monitoring I/O activity using extended Berkeley Packet Filter (eBPF) technology, which has gained much traction in the observability and cloud-native landscape. By tracing the kernel’s Virtual File System (VFS) functions with eBPF, it is possible to monitor the I/O activity on different types of platforms like HPC, cloud hypervisors or Kubernetes. By storing the metrics traced by eBPF programs in a high-performance time series database like Prometheus, it is possible to perform system-wide monitoring of computing platforms that use different types of local or remote file systems in a unified manner.

The talk presents the basics of eBPF and discusses the framework that is used to monitor I/O activity in a file-system- and application-agnostic way. It also presents experimental results quantifying the overhead and accuracy of the proposed framework, using IOR benchmark results as the reference. The results indicate that there is negligible overhead in using the framework and that the bandwidths reported by the proposed methodology are in very good agreement with the ones from IOR tests. Finally, results from a production HPC platform that uses the proposed framework to monitor I/O activity on the LUSTRE file system are presented.
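As a flavour of the VFS-level approach (a simplified sketch, not the framework itself; it assumes BCC and the vfs_read/vfs_write symbols are available), per-process byte counters can be collected independently of the underlying file system:

```python
# Illustrative sketch: count bytes moved through the VFS layer per process,
# the kind of file-system-agnostic signal such a framework can build on.
import time
from bcc import BPF

prog = r"""
BPF_HASH(bytes_read, u32, u64);
BPF_HASH(bytes_written, u32, u64);

// Return probes expose the number of bytes actually transferred.
int kretprobe__vfs_read(struct pt_regs *ctx) {
    long ret = PT_REGS_RC(ctx);
    if (ret <= 0) return 0;
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    bytes_read.increment(pid, ret);
    return 0;
}

int kretprobe__vfs_write(struct pt_regs *ctx) {
    long ret = PT_REGS_RC(ctx);
    if (ret <= 0) return 0;
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    bytes_written.increment(pid, ret);
    return 0;
}
"""

b = BPF(text=prog)
time.sleep(10)  # sample window; a real exporter would scrape periodically
for pid, count in b["bytes_read"].items():
    print(f"pid {pid.value}: {count.value} bytes read")
```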

 "String kfuncs - simplifying string handling in eBPF programs" ( 2026 )

Saturday at 15:15, 30 minutes, H.1308 (Rolin), eBPF; Viktor Malik; slides, video

When it comes to string handling, C is not the most ergonomic language for the job. But, at least, the standard library provides a basic set of functions for finding characters, comparing strings, or finding sub-strings. The same has not been true for eBPF programs, where developers had to implement all these operations manually, byte by byte.

This has changed in kernel version 6.17, which added a set of eBPF kernel functions (so-called kfuncs) to perform the most common string processing operations. While implementing these sounded like a very straightforward job at the beginning (just call the in-kernel implementations of the respective functions, right?), it turned out that eBPF programs have a number of specifics which required the implementation to be much more complicated.

In this talk, I will walk you through this journey and show how and why the kfuncs have to be implemented differently from the in-kernel implementations. We'll also dive into the API specifics and demonstrate how string kfuncs have been adopted by bpftrace [1] and what benefits they brought.

[1] https://bpftrace.org/

 "eBPF with Nix: laptop to testbed" ( 2026 )

Saturday at 16:00, 30 minutes, H.1308 (Rolin), eBPF; Yifei Sun; video

Setting up an eBPF development environment often requires some effort: getting the correct headers, managing compiler versions, and tweaking kconfig knobs, just to get a program running. In this session, we'll cover how to solve these problems using Nix [1] (NixOS not required). Unlike traditional workflows that rely on imperative package managers, Nix allows us to define the kernel, userspace tooling, and testing infrastructure reproducibly.

We'll explore a workflow that bridges the gap between local prototyping and experiment/production environments using NixOS VM tests [2], which allow developers to easily spin up multiple QEMU VMs with custom kernels (e.g. with patches or non-conventional config/build flags) and network connections.

We'll then demonstrate how to scale the exact environment from a laptop to testbeds like Grid'5000 [3]. With Nix and NixOS-Compose [4], we can deploy multi-node experiments with bit-perfect* reproducibility. In the demo, we'll use a trivial eBPF program (using bpf_override_return to mandate CONFIG_BPF_KPROBE_OVERRIDE + ALLOW_ERROR_INJECTION and mock syscalls), test it locally, and deploy to a cluster to collect live telemetry and visualizations.

[1] https://nixos.org/

[2] https://wiki.nixos.org/wiki/NixOS_VM_tests

[3] https://www.grid5000.fr/w/Grid5000:Home

[4] https://github.com/oar-team/nixos-compose

[*] https://reproducible.nixos.org/

 "PythonBPF - writing eBPF programs in Python" ( 2026 )

Saturday at 16:30, 30 minutes, H.1308 (Rolin), eBPF; Pragyansh Chaturvedi, Varun R Mallya; slides, video

This talk aims to present the first major release of PythonBPF and how other developers can start using it. The speakers will discuss the progress of this project since it was demoed at LPC 2025 in December 2025 (what actions were taken on the feedback gathered at LPC).

PythonBPF is a project that enables developers to write eBPF programs in pure Python. We allow a reduced Python grammar to be used for the eBPF-specific parts of the code. This allows users to:

  • Write both eBPF logic and userspace code in Python (even in the same file), so the Python dev-tools apply to the whole file instead of just the non-BPF parts.
  • Process eBPF data and visualize it using Python's ecosystem, and interactively develop and debug eBPF programs using Python notebooks.

 "Using eBPF within your Python program using EBPFCat" ( 2026 )

Saturday at 17:00, 30 minutes, H.1308 (Rolin), eBPF; Martin Teichmann; slides, video

eBPF is a powerful technology, but it is often hard to use, because its toolchain is non-trivial. In my talk I present EBPFCat, a pure Python library that can generate eBPF directly without any dependency on other code beyond the Linux kernel. Unlike most eBPF implementations, no compiler is involved. Instead, the user writes Python code which generates eBPF on-the-fly at runtime.

In EBPFCat, user- and kernel space are tightly integrated, so that both eBPF and Python code can access the same data structures. This way, one can use eBPF to write the performance-critical parts of a program, while retaining the versatility of Python for the bulk of the code. This opens eBPF to a large audience who are interested in the performance boost offered by eBPF, but are hesitant to learn an entirely new tool set for it.

I will go through a simple example to show that using EBPFCat it is possible to fit an entire eBPF program including its user space counterpart on a single presentation slide. For this example I also show how EBPFCat generates the eBPF bytecode.

While EBPFCat can be employed for usual eBPF use cases, I present one well beyond typical systems-level applications: motion control. EtherCAT is a standard field bus protocol based on Ethernet. Using eBPF we can reduce the latency of communication with EtherCAT devices, allowing for real-time performance. I will show a real-world combined motion system for physics research that routinely uses EBPFCat.

Developers new to eBPF will get a jump start into everything needed for their first project, while experienced kernel developers will be surprised just how far eBPF can take you beyond system programming.

Links: EBPFCat on GitHub, EBPFCat on Read the Docs

 "Aya - what's new in Rust for eBPF?" ( 2026 )

Saturday at 17:30, 30 minutes, H.1308 (Rolin), eBPF; Michal Rostecki; slides, video

Aya is a library that allows writing eBPF programs, as well as their user-space counterparts, entirely in Rust. It has been presented in previous editions of FOSDEM, but the project has evolved since then. In this talk, we will highlight what has changed, what’s coming next, and how these developments shape the Rust-and-eBPF ecosystem.

Over the last year, Aya has gained support for several new eBPF program types, as well as additional map types, such as the family of storage maps (sk_storage, task_storage, inode_storage). We’ve worked with the LLVM community to enable BTF generation for Rust eBPF programs. The overall developer experience has continued to improve, and an increasing number of open-source projects are now building on top of Aya.

We will also share updates on our ongoing work, most notably our efforts to promote Rust’s eBPF targets to Tier 2, paving the way for building eBPF programs on Rust stable without requiring nightly toolchains. Alongside this, we are developing support for BTF relocation emission, refining the user-space XDP API, and broadening coverage of program and map types.

The talk will dive into the technical details behind these features, the architectural decisions that shaped them, and the challenges ahead. We will conclude with our vision for Aya’s future and how we see it moving forward.

 "eBPF Observability on RISC: What Works, What Breaks, and How to Test It" ( 2026 )

Saturday at 18:00, 30 minutes, H.1308 (Rolin), eBPF; Yuning Liang, Bruce Gain; slides, video

eBPF powers modern observability, but its behavior varies significantly across architectures. This talk examines whether eBPF can be used reliably on RISC-class systems—ARM64 and RISC-V—and what limitations appear in real workloads.

We use reproducible test environments to run tracing, profiling, and networking eBPF tools on x86_64, ARM64, and RISC-V, revealing practical differences in verifier constraints, helper availability, JIT maturity, and performance overhead. RISC-V support exists but remains incomplete, and we show exactly which features succeed, fail, or behave unpredictably.

Using a database benchmark as a workload generator, we compare instrumentation accuracy, latency impact, and stability across architectures. Attendees gain a clear understanding of eBPF’s practical portability and how to build a realistic multi-architecture observability testbed.

 "BPF Tokens in Linux Distributions: A Path to Safe User-Space eBPF" ( 2026 )

Saturday at 18:30, 30 minutes, H.1308 (Rolin), eBPF; Daniel Mellado; video

BPF Tokens are a new Linux kernel mechanism for delegating restricted eBPF privileges to unprivileged processes. This talk explains how distributions can adopt them to provide safer access to tracing, observability, and networking tools—without granting root or CAP_SYS_ADMIN.

We’ll show how token-based delegation could reshape developer workflows, container runtimes, and system services in Fedora or other distros.

The session includes a walkthrough of real token policies and discusses how distributions can help build a secure, less-privileged eBPF ecosystem.

Sun

 "Accelerating scientific code on AI hardware with Reactant.jl" ( 2026 )

Sunday at 09:00, 25 minutes, H.1308 (Rolin), HPC, Big Data & Data Science; Mosè Giordano, Jules Merckx; slides, video

Scientific models are today limited by compute resources, forcing approximations driven by feasibility rather than theory. They consequently miss important physical processes and decision-relevant regional details. Advances in AI-driven supercomputing — specialized tensor accelerators, AI compiler stacks, and novel distributed systems — offer unprecedented computational power. Yet, scientific applications such as ocean models, often written in Fortran, C++, or Julia and built for traditional HPC, remain largely incompatible with these technologies. This gap hampers performance portability and isolates scientific computing from rapid cloud-based innovation for AI workloads.

In this talk we present Reactant.jl, a free and open-source optimising compiler framework for the Julia programming language, based on MLIR and XLA. Reactant.jl preserves high-level semantics (e.g. linear algebra operations), enabling aggressive cross-function, high-level optimisations, and generating efficient code for a variety of backends (CPU, GPU, TPU and more). Furthermore, Reactant.jl combines with Enzyme to provide high-performance multi-backend automatic differentiation.

As a practical demonstration, we will show the integration of Reactant.jl with Oceananigans.jl, a state-of-the-art GPU-based ocean model. We show how the model can be seamlessly retargeted to thousands of distributed TPUs, unlocking orders-of-magnitude increases in throughput. This opens a path for scientific modelling software to take full advantage of next-generation AI and cloud hardware — without rewriting the codebase or sacrificing high-level expressiveness.

 "ROCm™ on TheRock(s)" ( 2026 )

Sunday at 09:30, 25 minutes, H.1308 (Rolin), HPC, Big Data & Data Science; Jan-Patrick Lehr; video

ROCm™ has been AMD’s software foundation for both high-performance computing (HPC) and AI workloads and continues to support the distinct needs of each domain. As these domains increasingly converge, ROCm™ is evolving into a more modular and flexible platform. Soon, the distribution model will shift to a core SDK with domain-specific add-ons—such as HPC—allowing users to select only the components they need. This reduces unnecessary overhead while maintaining a cohesive and interoperable stack.

To support this modularity, AMD is transitioning to TheRock, an open-source build system that enables component-level integration, nightly and weekly builds, and streamlined delivery across the ROCm™ stack. TheRock is designed to handle the complexity of building and packaging ROCm™ in a way that’s scalable and transparent for developers. It plays a central role in how ROCm™ is assembled and delivered, especially as the platform moves toward more frequent and flexible release cycles.

In this talk, we’ll cover the entire development and delivery pipeline—from the consolidation into three super-repos to how ROCm™ is built, tested, and shipped. This includes an overview of the development process, the delivery mechanism, TheRock’s implementation, and the testing infrastructure. We’ll also explain how contributors can engage with ROCm™—whether through code, documentation, or domain-specific enhancements—making it easier for developers to help shape the platform.

Online resources

TheRock: https://github.com/ROCm/TheRock
rocm-libraries: https://github.com/ROCm/rocm-libraries
rocm-systems: https://github.com/ROCm/rocm-systems

Most projects are under MIT license.

Speaker: JP Lehr, Senior Member of Technical Staff, ROCm™ GPU Compiler, AMD

© 2026 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, ROCm, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies. LLVM is a trademark of LLVM Foundation. The OpenMP name and the OpenMP logo are registered trademarks of the OpenMP Architecture Review Board.

 "JUBE: An Environment for systematic benchmarking and scientific workflows" ( 2026 )

Sunday at 10:00, 25 minutes, H.1308 (Rolin), HPC, Big Data & Data Science; Thomas Breuer; video

Wherever research software is developed and used, it needs to be installed, tested in various ways, benchmarked, and set up within complex workflows. Typically, in order to perform such tasks, either individual solutions are implemented - imposing significant restrictions due to the lack of portability - or the necessary steps are performed manually by developers or users, a time-consuming process, highly susceptible to errors. Furthermore, particularly in the field of high-performance computing (HPC), where large amounts of data are processed and the computer systems used are unique worldwide, not only performance, scalability, and efficiency of the applications are important, but so are modern research software engineering (RSE) principles such as reproducibility, reusability, and documentation.

With these challenges and requirements in mind, JUBE [1] (Jülich Benchmarking Environment) has been developed at the Jülich Supercomputing Centre (JSC), enabling automated and transparent scientific workflows. JUBE is a generic, lightweight, configurable environment to run, monitor and analyze application execution in a systematic way. It is free, open-source software implemented in Python that operates on a "definition-based" paradigm where the “experiment” is described declaratively in a configuration file (XML or YAML). The JUBE engine is responsible for translating this definition into shell scripts, job submission files, and directory structures. Due to its standardized configuration format, it simplifies collaboration and usability of research software. JUBE also complements Continuous Integration and Continuous Delivery (CI/CD) capabilities, leading to Continuous Benchmarking.

To introduce and facilitate JUBE’s usage, the documentation includes a tutorial with simple and advanced examples, an FAQ page, a description of the command line interface, and a glossary with all accepted keywords [2]. In addition, a dedicated Carpentries course offers an introduction to the JUBE framework [3] (basic knowledge of the Linux shell and either XML or YAML are beneficial when getting started with JUBE). A large variety of scientific codes and standard HPC benchmarks have already been automated using JUBE and are also available open-source [4].

In this presentation, an overview of JUBE will be provided, including its fundamental concepts, current status, and roadmap of future developments (external code contributions are welcome). Additionally, three illustrative use cases will be introduced to offer a comprehensive understanding of JUBE's practical applications: - benchmarking as part of the procurement of JUPITER, Europe’s first exascale supercomputer; - a complex scientific workflow for energy system modelling [5]; - continuous insight into HPC system health by regular execution of applications, and the subsequent graphical presentation of their results.

JUBE is a well-established software, which has already been used in several national and international projects and on numerous and diverse HPC systems [6-13]. Besides being available via EasyBuild [14] and Spack [15], further software has been built up based on JUBE [16,17]. Owing to its broad scope and range of applications, JUBE is likely to be of interest to audiences in the HPC sector, as well as those involved in big data and data science.

[1] https://github.com/FZJ-JSC/JUBE
[2] https://apps.fz-juelich.de/jsc/jube/docu/index.html
[3] https://carpentries-incubator.github.io/hpc-workflows-jube/
[4] https://github.com/FZJ-JSC/jubench
[5] https://elib.dlr.de/196232/1/2023-09_UNSEEN-Compendium.pdf
[6] MAX CoE: https://max-centre.eu/impact-outcomes/key-achievements/benchmarking-and-profiling/
[7] RISC2: https://risc2-project.eu/?p=2251
[8] EoCoE: https://www.eocoe.eu/technical-challenges/programming-models/
[9] DEEP: https://deep-projects.eu/modular-supercomputing/software/benchmarking-and-tools/
[10] DEEP-EST: https://cordis.europa.eu/project/id/754304/reporting
[11] IO-SEA: https://cordis.europa.eu/project/id/955811/results
[12] EPICURE: https://epicure-hpc.eu/wp-content/uploads/2025/07/EPICURE-BEST-PRACTICE-GUIDE-Power-measurements-in-EuroHPC-machines_v1.0.pdf
[13] UNSEEN: https://juser.fz-juelich.de/record/1007796/files/UNSEEN_ISC_2023_Poster.pdf
[14] EasyBuild: https://github.com/easybuilders/easybuild-easyconfigs/tree/develop/easybuild/easyconfigs/j/JUBE
[15] Spack: https://packages.spack.io/package.html?name=jube
[16] https://github.com/edf-hpc/unclebench
[17] https://dl.acm.org/doi/10.1145/3733723.3733740

 "Scaling Gmsh-based FEM on LUMI: Efficiently Handling Thousands of Partitions" ( 2026 )

Sunday at 10:30, 25 minutes, H.1308 (Rolin), HPC, Big Data & Data Science; Boris Martin; video

Content

High-frequency wave simulations in 3D (with e.g. Finite Elements) involve systems with hundreds of millions of unknowns (up to 600M in our runs), prompting the use of massively parallel algorithms. In the harmonic regime, we favor Domain Decomposition Methods (DDMs) where local problems are solved in smaller regions (subdomains) and the full solution of the PDE is recovered iteratively. This requires each rank to own a portion of the mesh and to have a view on neighboring partitions (ghost cells or overlaps). In particular, the Optimized Restricted Additive Schwarz algorithm requires assembling matrices at the boundary of overlaps, which requires creating additional elements after the partitioning.

During the last two years, I pushed our in-house FEM code (GmshFEM) to run increasingly large jobs, from 8 MPI ranks on a laptop, through local and national clusters, up to more than 30,000 ranks on LUMI. Each milestone provided its own challenges in the parallel implementation: as the problem size increases, simple global reductions can go from being a minor synchronization to being a major bottleneck, redundant information in partitioned meshes can eat hundreds of gigabytes of RAM, and load-balancing issues can become dominant.

In this talk, I will describe how we tackled these challenges and how the future versions of Gmsh will take into account these issues. In particular, the next version of the MSH file format will be optimized to reduce data duplication across subdomains. I will also present the new API for querying information about partitioned meshes, such as retrieving elements in overlapping regions.

About Gmsh

Gmsh (https://gmsh.info/) is an open-source (GPL-2) finite element mesh generator widely used in scientific and engineering applications. It provides a graphical interface, a scripting language for automation, and language bindings (C/C++, Fortran, Python, Julia). In this work, Gmsh serves as the front-end mesh generator for large-scale distributed FEM simulations using our in-house solver GmshFEM (https://gitlab.onelab.info/gmsh/fem).
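For orientation, this is roughly what the partitioning step looks like through Gmsh's Python API (a hedged sketch; the mesh size, partition count, and output name are arbitrary, and the real workflow adds ghost/overlap handling on top):

```python
# Sketch: build a small 3D mesh with Gmsh's Python API and partition it,
# one partition per intended MPI rank. Values here are purely illustrative.
import gmsh

gmsh.initialize()
gmsh.model.add("box")
gmsh.model.occ.addBox(0, 0, 0, 1, 1, 1)
gmsh.model.occ.synchronize()
gmsh.option.setNumber("Mesh.MeshSizeMax", 0.05)
gmsh.model.mesh.generate(3)

gmsh.model.mesh.partition(8)   # split into 8 partitions
gmsh.write("box.msh")
gmsh.finalize()
```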

 "Productive Parallel Programming with Chapel and Arkouda" ( 2026 )

Sunday at 11:00, 25 minutes, H.1308 (Rolin), HPC, Big Data & Data Science; Jade Abraham; slides, video

As the computing needs of the world have grown, the need for parallel systems has grown to match. However, the programming languages used to target those systems have not had the same growth. General parallel programming targeting distributed CPUs and GPUs is frequently locked behind low-level and unfriendly programming languages and frameworks. Programmers must choose between parallel performance with low-level programming or productivity with high-level languages.

Chapel is a programming language for productive parallel programming that scales from laptops to supercomputers. This talk will focus on the ways that Chapel addresses the above gap, giving programmers used to high level languages like Python access to distributed parallel performance. Chapel has long been open-source, but recently moved to become one of the many amazing projects hosted under the High Performance Software Foundation.

The talk will include a description of Chapel and its performance as well as a few examples of Chapel programs. I will also present Arkouda, an exploratory data science tool for massive scales of data. Arkouda is built in Chapel and completely closes the accessibility gap for Python programmers to access supercomputer-scale data analysis.
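To give a feel for the model (a rough sketch; it assumes a running arkouda_server, and the host/port are placeholders), Arkouda exposes NumPy-style operations in Python while the heavy lifting happens in the Chapel-based server:

```python
# Sketch: NumPy-like calls from Python, executed server-side by Chapel.
import arkouda as ak

ak.connect("localhost", 5555)      # placeholder host/port for arkouda_server
a = ak.randint(0, 100, 10**8)      # large arrays live on the server
b = ak.randint(0, 100, 10**8)
print((a + b).sum())               # arithmetic and reductions run in Chapel
ak.disconnect()
```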

 "Track Energy & Emissions of User Jobs on HPC/AI Platforms using CEEMS" ( 2026 )

Sunday at 11:30, 25 minutes, H.1308 (Rolin), HPC, Big Data & Data Science; Mahendra Paipuri; slides, video

With the rapid acceleration of ML/AI research in the last couple of years, the already energy-hungry HPC platforms have become even more demanding. A major part of this energy consumption is due to users’ workloads, and it is only with the participation of end users that the overall energy consumption of the platforms can be reduced. However, most HPC platforms do not provide any metrics related to energy consumption, nor performance metrics, out of the box, which in turn does not encourage end users to optimize their workloads.

The Compute Energy & Emissions Monitoring Stack (CEEMS) has been designed to address this issue. CEEMS can report energy consumption and equivalent emissions of user workloads in real time for SLURM (HPC), Openstack (Cloud) and Kubernetes platforms alike. It leverages the Linux perf subsystem and eBPF to monitor the performance metrics of the applications, which can help the end users to identify the bottlenecks in their workflows rapidly and consequently optimize them to reduce the energy and carbon footprint. CEEMS supports eBPF-based continuous profiling and it is the first monitoring stack to support continuous profiling on HPC platforms. Another advantage of CEEMS is that it can systematically monitor all the jobs on the platform without the end users having to modify their workflows or codes.

Besides CPU energy usage, it supports reporting energy usage and performance metrics of workloads on NVIDIA and AMD GPU accelerators. CEEMS has been built around the prominent open-source tools in the observability ecosystem, like Prometheus and Grafana. CEEMS has been designed to be extensible and it allows the HPC center operators to easily define the energy estimation rules of user workloads based on the underlying hardware. CEEMS monitors I/O and network metrics in a file system agnostic manner, allowing it to work on any parallel file system used by HPC platforms. Finally, the talk will conclude by showing how CEEMS monitoring is used on the Jean-Zay HPC platform with more than 2000 nodes that have a daily job churn rate of around 20k jobs.

 "Partly Cloudy with a Chance of Zarr: A Virtualized Approach to Zarr Stores from ECMWF's Fields Database" ( 2026 )

Sunday at 12:00, 25 minutes, H.1308 (Rolin), HPC, Big Data & Data Science; Tobias Kremer; slides, video

ECMWF manages petabytes of meteorological data critical for weather and climate research. But traditional storage formats pose challenges for machine learning, big-data analytics, and on-demand workflows.

We propose a solution which introduces a Zarr store implementation for creating virtual views of ECMWF’s Fields Database (FDB), enabling users to access GRIB data as if it were a native Zarr dataset. Unlike existing approaches such as VirtualiZarr or Kerchunk, our solution leverages the domain-specific MARS language to define virtual Zarr v3 stores directly from scientific requests, bridging GRIB and Zarr for dynamic, cloud-native access.
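The core idea can be pictured as a key-to-request translation: when Zarr asks for a chunk, the store turns the chunk key into a MARS-style request and returns the decoded GRIB bytes. The sketch below is purely hypothetical (the class, the fetch_grib callable, and the request fields are invented for illustration; the actual implementation targets the Zarr v3 store API and FDB):

```python
# Hypothetical sketch of a virtual "store": chunk keys are translated into
# MARS-style requests instead of being read from files on disk.
from collections.abc import Mapping


class VirtualFDBStore(Mapping):
    def __init__(self, param, dates, fetch_grib):
        self.param = param
        self.dates = dates             # one date per chunk along the time axis
        self.fetch_grib = fetch_grib   # callable that executes the MARS request

    def __getitem__(self, key):
        # e.g. key == "t2m/c/3/0/0" -> chunk index 3 along the time dimension
        time_index = int(key.split("/")[2])
        request = {"param": self.param, "date": self.dates[time_index]}
        return self.fetch_grib(request)   # raw chunk bytes decoded from GRIB

    def __iter__(self):
        return (f"{self.param}/c/{i}/0/0" for i in range(len(self.dates)))

    def __len__(self):
        return len(self.dates)
```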

This work is developed as part of the WarmWorld Easier project, aiming to make climate and weather data more interoperable and accessible for the scientific community. By combining the efficiency of FDB with the flexibility of Zarr, we unlock new possibilities for HPC, big-data analytics, and machine learning pipelines.

In this talk, we will explore the architecture, discuss performance considerations, and demonstrate how virtual Zarr views accelerate integration in open-source workflows.

This session will:

  • Explain the motivation behind creating virtual Zarr views of ECMWF’s Fields Database.
  • Detail the design and implementation of a custom Zarr Store that translates Zarr access patterns into MARS requests.
  • Discuss performance trade-offs and scalability in HPC contexts.
  • Showcase real-world examples of how this approach may support data science workflows, machine learning, and distributed computing.

 "Zero‑Touch HPC Nodes: NetBox, Tofu and Packer for a Self‑Configuring SLURM Cluster" ( 2026 )

Sunday at 12:30, 25 minutes, H.1308 (Rolin), HPC, Big Data & Data Science; Erich Birngruber, Ümit Seren, Leon Schwarzäugl; slides, video

Over the last five years, we ran an HPC system for life sciences on top of OpenStack, with a deployment pipeline built from Ansible and manual steps (see our FOSDEM 2020 talk). It worked—but it wasn’t something we could easily rebuild from scratch or apply consistently to other parts of our infrastructure.

As we designed our new HPC system (coming online in early 2026), we set ourselves a goal: treat the cluster as something we can declare and then recreate, not pet and nurture. The result is a “zero‑touch” style pipeline where a new node can go from “just racked” to “in SLURM and running jobs” with no manual intervention.

In this talk, we walk through the end‑to‑end workflow:

  • NetBox as DCIM and source of truth: racking a server and adding it to NetBox is the trigger; MACs, serials and IPs are automatically imported from vendor tools and IPAM/DNS into our automation.
  • Using Tofu/Terragrunt (instead of OpenStack's Heat orchestration service) to provision OpenStack/Ironic, SLURM infrastructure and network fabric across three environments (dev plus two interchangeable prod clusters for blue/green rollouts).
  • Image‑based deployment with Packer and Ansible: we split roles into “install” and “configure”. Packages and heavy setup are baked into images, while an ansible-init service runs locally on first boot to apply configuration and join the cluster.
  • Making nodes self‑sufficient, including fetching the secrets they need via short‑lived credentials and a minimal external dependency chain.

Come and see how we built a reproducible HPC/Big-Data cluster on open‑source tooling, reusing as much of the stack as possible for the rest of our infrastructure.

 "Accelerating complex Bioinformatics AI pipelines with Kubernetes" ( 2026 )

Sunday at 13:00, 10 minutes, H.1308 (Rolin), HPC, Big Data & Data Science; Alessandro Pilotti; slides, video

Bioinformatics is an interdisciplinary scientific field that deals with large amounts of biological data. The advent of transformer models applied to this field brought very interesting scientific innovations, including the introduction of Protein Language Models (PLMs) and Antibody Language Models (AbLMs). The complexity of training or fine tuning PLMs/AbLMs along with inference tasks requires a non-trivial amount of GPU resources and a disciplined approach, where DevOps and MLOps methodologies fit very well. In this session we will present a series of tasks related to fine tuning PLMs/AbLMs for classification of SARS-CoV-2's spike proteins. We will highlight how Kubernetes can be used to execute large numbers of computationally intensive tasks on GPU hosts, including best practices for sharing Nvidia GPUs (MIG, Time Slicing, MPS) as part of an open source stack orchestrated with Apache Airflow. While these methodologies can be applied to any Kubernetes cluster, including on hyperscalers, this talk is meant to facilitate the (re)use of on-prem hardware infrastructure, presenting a fully open-source stack that can be easily deployed and maintained on bare metal.

Source code:
https://github.com/alexpilotti/bbk-mres
https://github.com/alexpilotti/bbk-mres-airflow

Overview of the scientific research made possible by this pipeline: https://cloudba.se/NeBzX

 "Observability for AI Workloads on HPC: Beyond GPU Utilization Metrics" ( 2026 )

Sunday at 13:10, 10 minutes, H.1308 (Rolin), HPC, Big Data & Data Science; Samuel Desseaux; video

When you run LLMs or large-scale ML training on HPC clusters, traditional monitoring falls short. GPU utilization at 95% tells you nothing about model quality. Memory bandwidth looks healthy while your inference latency silently degrades. Your job scheduler reports success while concept drift erodes prediction accuracy. This talk introduces a practical observability framework specifically designed for AI workloads on HPC infrastructure, what I call "Cognitive SLIs" (Service Level Indicators for AI systems). I'll cover three critical gaps in current HPC monitoring:

1. Model-aware metrics that matter
2. GPU observability beyond utilization
3. Energy and cost accountability

The demo shows a complete stack built with open source tools: VictoriaMetrics with custom AI-specific exporters, Grafana dashboards designed for ML engineers (not just sysadmins), and OpenTelemetry instrumentation patterns for PyTorch/JAX workloads.
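As a pointer to what "custom AI-specific exporters" can mean in practice, here is a minimal sketch (metric names and values are invented for illustration) that exposes model-level gauges through the standard Prometheus client, so VictoriaMetrics can scrape them next to GPU counters:

```python
# Minimal custom-exporter sketch: publish model-aware gauges on /metrics.
import random
import time

from prometheus_client import Gauge, start_http_server

inference_p99 = Gauge("model_inference_latency_p99_seconds",
                      "99th percentile inference latency")
accuracy = Gauge("model_validation_accuracy",
                 "Accuracy on a rolling validation sample")

start_http_server(9400)  # scrape endpoint at :9400/metrics

while True:
    # In a real exporter these values would come from the serving stack
    # or a periodic evaluation job, not from a random generator.
    inference_p99.set(random.uniform(0.05, 0.2))
    accuracy.set(random.uniform(0.90, 0.99))
    time.sleep(15)
```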

Attendees will leave with the following resources:

1. Architecture patterns for instrumenting HPC AI workloads
2. VictoriaMetrics recording rules and alerting strategies for ML metrics
3. Grafana dashboard templates (GitHub repo provided)
4. An understanding of how AI Act logging requirements intersect with HPC operations

 "Developing software tools for accelerated and differentiable scientific computing using JAX" ( 2026 )

Sunday at 13:20, 10 minutes, H.1308 (Rolin), HPC, Big Data & Data Science; Matt Graham; video

JAX is an open-source Python package for high-performance numerical computing. It provides a familiar NumPy style interface but with the advantages of allowing computations to be dispatched to accelerator devices such as graphics and tensor processing units, and supporting transformations to automatically differentiate, vectorize and just-in-time compile functions. While extensively used in machine learning applications, JAX's design also makes it ideal for scientific computing tasks such as simulating numerical models and fitting them to data.

This lightning talk will introduce JAX's interface and computation model, and some of its key function transformations. I will also briefly introduce the Python Array API standard and explain how it can be used to write portable code which works across JAX, NumPy and other array backends.
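A short example of the transformations mentioned above (standard JAX usage; the function itself is just a toy):

```python
# Differentiate, JIT-compile, and vectorize a NumPy-style function with JAX.
import jax
import jax.numpy as jnp


def loss(w, x):
    return jnp.sum((x @ w) ** 2)


grad_loss = jax.grad(loss)                   # automatic differentiation w.r.t. w
fast_grad = jax.jit(grad_loss)               # just-in-time compile for CPU/GPU/TPU
batched = jax.vmap(loss, in_axes=(None, 0))  # vectorize over a batch of inputs

w = jnp.ones(3)
x = jnp.arange(12.0).reshape(4, 3)
print(fast_grad(w, x[0]))   # gradient for a single sample
print(batched(w, x))        # per-sample losses for the whole batch
```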

 "High Performance Jupyter Notebooks with Zasper" ( 2026 )

Sunday at 13:35, 10 minutes, H.1308 (Rolin), HPC, Big Data & Data Science; Prasun Anand; video

Data science tools have come far, with Project Jupyter at the core. But what if we could greatly boost their performance, without leaving the Python ecosystem?

Introducing Zasper, an IDE for Jupyter notebooks built in Go that uses up to 5× less CPU and 40× less RAM, and that's also blazingly fast.

 "Update on the High Performance Software Foundation (HPSF)" ( 2026 )

Sunday at 14:00, 25 minutes, H.1308 (Rolin), HPC, Big Data & Data Science; Xavier Delaruelle; slides, video

The High Performance Software Foundation (HPSF) is a hub for open-source, high performance software with a growing set of member organizations and projects across the US, Europe, and Asia. It aims to advance portable software for diverse hardware by increasing adoption, aiding community growth, and enabling development efforts. It also fosters collaboration through working groups such as Continuous Integration, Benchmarking, and Binary distribution.

This talk will give an overview of HPSF and an update on its latest activities. We’ll talk about new member projects and new member organizations, give an update on plans for the European HPSF Community Summit 2026 and HPSFCon 2026, and discuss how HPSF is supporting member projects and building collaborations that advance the HPSF community, along with project support and outreach activities.

Find out how you can benefit from joining or collaborating with HPSF, and help to improve the HPC open source world.

 "Package management in the hands of users: dream and reality" ( 2026 )

Sunday at 14:30, 25 minutes, H.1308 (Rolin), HPC, Big Data & Data Science; Ludovic Courtès; slides, video

Are HPC users autonomous? How much flexibility does one have when deploying software on a supercomputer? How close to one’s laptop development environment is it? How have EasyBuild, Spack, Guix, and Apptainer helped improve the situation in the past decade?

In this talk, I will look at the situation with lucidity. While Spack and EasyBuild enable software deployment by users, their primary user base appears to be HPC system administrators. Thus most HPC admins let users bring their own Singularity/Apptainer images when their needs are not satisfied—effectively “giving up” on complex deployment.

Brave and fearless, the Guix-HPC effort has not given up on the goal of putting reproducible package management in the hands of users, with successes and disappointments. I will report on our experience with Tier-2 supercomputers now providing Guix, and on ongoing work with French national supercomputers (“Tier-1”) as part of NumPEx, the French national program for HPC.

We will look back at the set of challenges overcome in past years—from supporting rootless execution of the build daemon, to making the bring-your-own-MPI approach viable and to enhancing support for CPU micro-architecture optimizations—and those yet to come.

 "Spack v1.0 and Beyond: Managing HPC Software Stacks" ( 2026 )

Sunday at 15:00, 25 minutes, H.1308 (Rolin), HPC, Big Data & Data Science; Harmen Stoppels; slides, video

Abstract

Spack is a flexible multi-language package manager for HPC, Data Science, and AI, designed to support multiple versions, configurations, and compilers of software on the same system. Since the last FOSDEM, the Spack community has reached a major milestone with the release of Spack v1.0, followed closely by v1.1. This talk will provide a comprehensive overview of the "What's New" in these releases, highlighting the changes that improve robustness, performance, and user experience. We will cover, among other things, the shift to modeling compilers as dependencies, the package repository split, and the new jobserver-aware parallel installer.

Description

With the release of Spack v1.0 in July 2025 and v1.1 in November 2025, the project has introduced significant architectural changes and new features requested by the community. In this talk, we will dive into the key features introduced across these releases:

  • Compilers as dependencies. Spack has fulfilled an old promise from FOSDEM 2018. Compilers are modeled as first-class dependencies, dependency resolution is more accurate, and binary distribution and ABI compatibility checks are more robust.
  • The separation of the package repository from the core tool and the introduction of a versioned Package API allows users to pin the package repository version independently from Spack itself and enables regular package repository releases.
  • Parallel builds with a new user interface. Spack has a new scheduler that coordinates parallel builds using the POSIX jobserver protocol, allowing efficient resource sharing across all build processes. The decades-old jobserver protocol is experiencing a major renaissance, adopted recently by Ninja v1.13 (July 2025) and the upcoming LLVM 22 release. We’ll talk about how this enables composable parallelism across make, ninja, cargo, GCC, LLVM, Spack, and other tools.

Expected Prior Knowledge / Intended Audience

This talk is aimed at Research Software Engineers (RSEs), HPC system administrators, and Data Scientists who use or manage software stacks. Familiarity with Spack is helpful but not strictly required; the talk will be accessible to anyone interested in package management and software reproducibility in scientific computing.

 "Status update on EESSI, the European Environment for Scientific Software Installations" ( 2026 )

Sunday at 15:30, 25 minutes, H.1308 (Rolin), HPC, Big Data & Data Science; Helena Vela Beltran; video

A few years ago, the European Environment for Scientific Software Installations (EESSI) was introduced at FOSDEM as a pilot project for improving software distribution and deployment everywhere, from HPC environments to cloud environments, or even a personal workstation or a Raspberry Pi. Since then, it has gained wide adoption across dozens of HPC systems in Europe, being installed natively on EuroHPC systems and becoming a component within the EuroHPC Federation Platform.

This session will highlight the progress EESSI has made, including the addition of new CPU and GPU targets, with broader support for modern computing technologies and much more software, featuring 600+ unique software projects (or over 3500 if you count individual Python packages and R libraries that are included) shipped with it. EESSI's capabilities have expanded significantly, turning it into a key service for managing and deploying software across a wide range of infrastructures.

We will provide an overview of the current status of EESSI, focusing on its new capabilities, the integration with tools like Spack and Open OnDemand, as well as its growing software ecosystem. Through a live hands-on demo, we will showcase how EESSI is being used in real-world HPC environments and cloud systems, and discuss the future direction of the platform. Looking ahead, we will cover upcoming features and improvements that will continue to make EESSI a solid enabler for HPC software management in Europe and beyond.

 "Using OpenMP's interop for calling GPU-vendor libs with GCC" ( 2026 )

Sunday at 16:00, 25 minutes, H.1308 (Rolin), HPC, Big Data & Data Science; Tobias Burnus; slides, video

GPU vendors provide highly optimized libraries for math operations such as fast Fourier transformation or linear algebra (FFT, (sparse)BLAS/LAPACK, …) to perform those on devices. And OpenMP is a popular, vendor-agnostic method for parallelization on the CPU but increasingly also for offloading calculations to the GPU.

This talk shows how OpenMP can be used to reduce vendor-specific code, make calling it more convenient, and combine OpenMP offloading with those libraries. While the presentation illustrates the use with the GNU Compiler Collection (GCC), the feature is a generic feature of OpenMP 5.2, extended in 6.0, and is supported by multiple compilers.

In terms of OpenMP features, the 'interop' directive provides the interoperability support, and the 'declare variant' directive with the 'adjust_args' and 'append_args' clauses enables writing neater code; means for memory allocation, memory transfer, and running code blocks on the GPU (the 'target' construct) complete the required feature set.

  • The OpenMP specification, current, past and future version, errata and example documents can be found at https://www.openmp.org/specifications/; a list of compilers and tools for OpenMP is at https://www.openmp.org/resources/openmp-compilers-tools/
  • GCC's OpenMP documentation is available at https://gcc.gnu.org/onlinedocs/libgomp/ (API routines, implementation status, …) and, in particular, the supported interop foreign runtimes are documented at https://gcc.gnu.org/onlinedocs/libgomp/Offload-Target-Specifics.html; GCC supports offloading to Nvidia and AMD GPUs. GCC has supported OpenMP interop since GCC 15, covering most of the OpenMP 6.0 additions, including the Fortran API routines.

 "A Brief* overview of what makes modern accelerators interesting for HPC" ( 2026 )

Sunday at 16:30, 25 minutes, H.1308 (Rolin), HPC, Big Data & Data Science; FelixCLC; video

Evaluating and discussing what makes different types of accelerators interesting for which types of workloads, and the mental model most appropriate for choosing them.

Why it's sometimes a good idea to ignore them all and *just* use a CPU, all the way to when FPGAs become interesting as a means of doing more science.