
Nebius Launches Soperator, World's First Fully Featured Open-Source Kubernetes Operator For Slurm, To Help AI And HPC Pros Optimize Workload Management And Orchestration
By PR Newswire

AMSTERDAM, Sept. 25, 2024 /PRNewswire/ -- Nebius, a leading AI infrastructure company, is excited to announce the open-source release of Soperator, the world's first fully featured Kubernetes operator for Slurm, designed to optimize workload management and orchestration in modern machine-learning (ML) and high-performance computing (HPC) environments.
Soperator has been developed by Nebius to merge the power of Slurm, a job orchestrator designed to manage large-scale HPC clusters, with Kubernetes' flexible and scalable container orchestration. It delivers simplicity and efficient job scheduling when working in compute-intensive environments, particularly for GPU-heavy workloads, making it ideal for ML training and distributed computing tasks.
Narek Tatevosyan, Director of Product Management for the Nebius Cloud Platform, said:
"Nebius is rebuilding cloud for the AI age by responding to the challenges that we know AI and ML professionals are facing. Currently there is no workload orchestration product on the market that is specialized for GPU-heavy workloads. By releasing Soperator as an open-source solution, we aim to put a powerful new tool into the hands of the ML and HPC communities.
"We are strong believers in community-driven innovation, and our team has a strong track record of open-sourcing innovative products. We're excited to see how this technology will continue to evolve and enable AI professionals to focus on enhancing their models and building new products."
Danila Shtan, Chief Technology Officer at Nebius, added:
"By open-sourcing Soperator, we're not just releasing a tool – we're standing by our commitment to open-source innovation in an industry where many keep their solutions proprietary. We're pushing for a cloud-native approach to traditionally conservative HPC workloads, modernizing workload orchestration for GPU-intensive tasks. This strategic initiative reflects our dedication to fostering community collaboration and advancing AI and HPC technologies globally."
Key features of Soperator include:
- Enhanced scheduling and orchestration: Soperator provides precise workload distribution across large compute clusters, optimizing GPU resource usage and enabling parallel job execution. This minimizes idle GPU capacity, optimizes costs, and facilitates more efficient collaboration, making it a crucial tool for teams working on large-scale ML projects.
- Fault-tolerant training: Soperator includes a hardware health check mechanism that monitors GPU status and automatically reallocates resources when hardware issues occur. This improves training stability even in highly distributed environments and reduces the GPU hours required to complete a training task.
- Simplified cluster management: By using a shared root file system across all cluster nodes, Soperator eliminates the challenge of maintaining identical states across multi-node installations. Together with the Terraform operator, this simplifies the user experience, allowing ML teams to focus on their core tasks without the need for extensive DevOps expertise.
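As a rough illustration of the fault-tolerance idea described above, the following minimal Python sketch moves jobs off nodes that fail a health check and onto the least-loaded healthy node. It is not Soperator's implementation: the `Node` class, the `reallocate` function, and the node and job names are all hypothetical, and the real operator works through Slurm and Kubernetes rather than in-memory lists.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """Hypothetical GPU node: a name, a health flag, and its assigned jobs."""
    name: str
    healthy: bool = True
    jobs: List[str] = field(default_factory=list)


def reallocate(nodes: List[Node]) -> List[Tuple[str, str, str]]:
    """Move every job from unhealthy nodes to the least-loaded healthy node.

    Returns a list of (job, from_node, to_node) moves.
    """
    healthy = [n for n in nodes if n.healthy]
    if not healthy:
        raise RuntimeError("no healthy nodes available")

    moved = []
    for node in nodes:
        if node.healthy:
            continue
        while node.jobs:
            job = node.jobs.pop(0)
            # Pick the healthy node currently running the fewest jobs.
            target = min(healthy, key=lambda n: len(n.jobs))
            target.jobs.append(job)
            moved.append((job, node.name, target.name))
    return moved


# Assumed example cluster: gpu-1 has failed its health check.
nodes = [
    Node("gpu-0", jobs=["train-a"]),
    Node("gpu-1", healthy=False, jobs=["train-b", "train-c"]),
    Node("gpu-2"),
]
moved = reallocate(nodes)
for job, src, dst in moved:
    print(f"{job}: {src} -> {dst}")
```

In this toy setup the failed node ends up empty and its two jobs are spread across the remaining healthy nodes. A production scheduler would also need to handle checkpoint recovery and resource limits, which this sketch leaves out.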
Planned enhancements include improvements to security, stability, scalability and node management, as well as updates to keep pace with new software and hardware releases.
The first public release of Soperator is available from today as an open-source solution to all ML and HPC professionals on the Nebius GitHub, along with relevant deployment tools and packages. Nebius also invites anyone running ML training or HPC calculations on multi-node GPU installations to try out the solution; the company's solution architects are ready to provide assistance and guidance through the installation and deployment process in the Nebius environment.
For more information about Soperator, please read the blog post published today on Nebius's website: https://nebius.ai/blog/posts/soperator-in-open-source-explained
About Nebius
Nebius is a technology company building full-stack infrastructure to service the explosive growth of the global AI industry, including large-scale GPU clusters, cloud platforms, and tools and services for developers. Headquartered in Amsterdam and listed on Nasdaq, the company has a global footprint with R&D hubs across Europe, North America and Israel.
Nebius's core business is an AI-centric cloud platform built for intensive AI workloads. With proprietary cloud software architecture and hardware designed in-house (including servers, racks and data center design), Nebius gives AI builders the compute, storage, managed services and tools they need to build, tune and run their models.
An NVIDIA preferred cloud service provider, Nebius offers high-end infrastructure optimized for AI training and inference. The company boasts a team of over 500 skilled engineers, delivering a true hyperscale cloud experience tailored for AI builders.
To learn more, please visit www.nebius.com.
SOURCE Nebius
First published on Wed, Sep 25, 2024