
NERSC Data Seminars

Flux: Overcoming Scheduling Challenges for Exascale Workflows

Dong Ahn & Stephen Herbein (Lawrence Livermore National Laboratory)

Abstract

Many emerging scientific workflows that target high-end HPC systems require complex interplay with the resource and job management software (RJMS). However, portable, efficient, and easy-to-use scheduling and execution of these workflows is still an unsolved problem. In this talk, I will present Flux, a next-generation RJMS designed specifically to address the key scheduling challenges of modern workflows in a scalable, easy-to-use, and portable manner. At the heart of Flux lies its ability to be seamlessly nested within batch allocations created by itself as well as by other system schedulers (e.g., SLURM, MOAB, LSF), serving the target workflows as their “personal RJMS instances”. In particular, Flux’s consistent and rich set of well-defined APIs portably and efficiently supports workflows that feature non-traditional execution patterns, such as requirements for complex co-scheduling, massive ensembles of small jobs, and coordination among jobs in an ensemble. As part of this talk, I will also discuss Flux’s graph-based resource data model, which is Flux’s response to the need to schedule increasingly diverse resources, and how this model is becoming the center of our industry co-design efforts: for example, multi-tiered storage scheduling co-design with HPE and cloud resource co-design with IBM T.J. Watson and Red Hat OpenShift.

Bio

Dong H. Ahn is a computer scientist. He has worked for Livermore Computing (LC) at Lawrence Livermore National Laboratory since 2001 and currently leads the next-generation computing enabling (NGCE) project within the ASC ATDM sub-program. During this period, Dong has worked on several projects in code-development tools and in resource management and scheduling software frameworks, with the common goal of providing highly capable and scalable tool ecosystems for large computing systems. Towards this goal, he has architected an extreme-scale debugging strategy that led to the Stack Trace Analysis Tool (STAT), a 2011 R&D 100 Award winner, and the PRUNERS Toolset, a 2017 R&D 100 Award Finalist.

Stephen Herbein is a computer scientist in Livermore Computing at Lawrence Livermore National Laboratory. His research interests include batch job scheduling, parallel I/O, and data analytics. He is part of the Flux team, developing next-generation I/O-aware and multi-level schedulers for HPC.