Storage
New Mlperf Storage V1.0 Benchmark Results Show Storage Systems Play A Critical Role In AI Model Training Performance
By Business Wire
Storage system providers showcase innovative solutions to keep pace with faster accelerators.
SAN FRANCISCO--(BUSINESS WIRE)--#ai--Today, MLCommons® announced results for its industry-standard MLPerf® Storage v1.0 benchmark suite, which is designed to measure the performance of storage systems for machine learning (ML) workloads in an architecture-neutral, representative, and reproducible manner. The results show that as accelerator technology has advanced and datasets continue to increase in size, ML system providers must ensure that their storage solutions keep up with the compute needs. This is a time of rapid change in ML systems, where progress in one technology area drives new demands in other areas. High-performance AI training now requires storage systems that are both large-scale and high-speed, lest access to stored data becomes the bottleneck in the entire system. With the v1.0 release of MLPerf Storage benchmark results, it is clear that storage system providers are innovating to meet that challenge.
Version 1.0 storage benchmark breaks new ground
The MLPerf Storage benchmark is the first and only open, transparent benchmark to measure storage performance in a diverse set of ML training scenarios. It emulates the storage demands across several scenarios and system configurations covering a range of accelerators, models, and workloads. By simulating the accelerators' "think time" the benchmark can generate accurate storage patterns without the need to run the actual training, making it more accessible to all. The benchmark focuses the test on a given storage system’s ability to keep pace, as it requires the simulated accelerators to maintain a required level of utilization.
Three models are included in the benchmark to ensure diverse patterns of AI training are tested: 3D-UNet, Resnet50, and CosmoFlow. These workloads offer a variety of sample sizes, ranging from hundreds of megabytes to hundreds of kilobytes, as well as wide-ranging simulated “think times” from a few milliseconds to a few hundred milliseconds.
The benchmark emulates NVIDIA A100 and H100 models as representatives of the currently available accelerator technologies. The H100 accelerator reduces the per-batch computation time for the 3D-UNet workload by 76% compared to the earlier V100 accelerator in the v0.5 round, turning what was typically a bandwidth-sensitive workload into much more of a latency-sensitive workload.
In addition, MLPerf Storage v1.0 includes support for distributed training. Distributed training is an important scenario for the benchmark because it represents a common real-world practice for faster training of models with large datasets, and it presents specific challenges for a storage system not only in delivering higher throughput but also in serving multiple training nodes simultaneously.
V1.0 benchmark results show performance improvement in storage technology for ML systems
The broad scope of workloads submitted to the benchmark reflect the wide range and diversity of different storage systems and architectures. This is testament to how important ML workloads are to all types of storage solutions, and demonstrates the active innovation happening in this space.
“The MLPerf Storage v1.0 results demonstrate a renewal in storage technology design,” said Oana Balmau, MLPerf Storage working group co-chair. “At the moment, there doesn’t appear to be a consensus ‘best of breed’ technical architecture for storage in ML systems: the submissions we received for the v1.0 benchmark took a wide range of unique and creative approaches to providing high-speed, high-scale storage.”
The results in the distributed training scenario show the delicate balance needed between the number of hosts, the number of simulated accelerators per host, and the storage system in order to serve all accelerators at the required utilization. Adding more nodes and accelerators to serve ever-larger training datasets increases the throughput demands. Distributed training adds another twist, because historically different technologies – with different throughputs and latencies – have been used for moving data within a node and between nodes. The maximum number of accelerators a single node can support may not be limited by the node’s own hardware but instead by the ability to move enough data quickly to that node in a distributed environment (up to 2.7 GiB/s per emulated accelerator). Storage system architects now have few design tradeoffs available to them: the systems must be high-throughput and low-latency, to keep a large-scale AI training system running at peak load.
“As we anticipated, the new, faster accelerator hardware significantly raised the bar for storage, making it clear that storage access performance has become a gating factor for overall training speed,” said Curtis Anderson, MLPerf Storage working group co-chair. “To prevent expensive accelerators from sitting idle, system architects are moving to the fastest storage they can procure – and storage providers are innovating in response.”
MLPerf Storage v1.0
The MLPerf Storage benchmark was created through a collaborative engineering process across more than a dozen leading storage solution providers and academic research groups. The open-source and peer-reviewed benchmark suite offers a level playing field for competition that drives innovation, performance, and energy efficiency for the entire industry. It also provides critical technical information for customers who are procuring and tuning AI training systems.
The v1.0 benchmark results, from a broad set of technology providers, demonstrate the industry’s recognition of the importance of high-performance storage solutions. MLPerf Storage v1.0 includes over 100 performance results from 13 submitting organizations: DDN, Hammerspace, Hewlett Packard Enterprise, Huawei, IEIT SYSTEMS, Juicedata, Lightbits Labs, MangoBoost, Nutanix, Simplyblock, Volumez, WEKA, and YanRong Tech.
“We’re excited to see so many storage providers, both large and small, participate in the first-of-its-kind v1.0 Storage benchmark,” said David Kanter, Head of MLPerf at MLCommons. “It shows both that the industry is recognizing the need to keep innovating in storage technologies to keep pace with the rest of the AI technology stack, and also that the ability to measure the performance of those technologies is critical to the successful deployment of ML training systems. As a trusted provider of open, fair, and transparent benchmarks, MLCommons ensures that technology providers know the performance target they need to meet, and consumers can procure and tune ML systems to maximize their utilization – and ultimately their return on investment.”
We invite stakeholders to join the MLPerf Storage working group and help us continue to evolve the benchmark. Future work includes improving and increasing accelerator emulations and AI training scenarios.
View the Results
To view the results for MLPerf Storage v1.0, please visit the Storage benchmark results.
About MLCommons
MLCommons is the world leader in building benchmarks for AI. It is an open engineering consortium with a mission to make AI better for everyone through benchmarks and data. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 125+ members, global technology providers, academics, and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire AI industry through benchmarks and metrics, public datasets, and measurements for AI Safety.
For additional information on MLCommons and details on becoming a member, please visit MLCommons.org or contact participation@mlcommons.org.
Contacts
Kelly Berschauer, kelly@mlcommons.org
First published on Thu, Sep 26, 2024
Enjoyed what you've read so far? Great news - there's more to explore!
Stay up to date with the latest news, a vast collection of tech articles including introductory guides, product reviews, trends and more, thought-provoking interviews, hottest AI blogs and entertaining tech memes.
Plus, get access to branded insights such as informative white papers, intriguing case studies, in-depth reports, enlightening videos and exciting events and webinars from industry-leading global brands.
Dive into TechDogs' treasure trove today and Know Your World of technology!
Disclaimer - Reference to any specific product, software or entity does not constitute an endorsement or recommendation by TechDogs nor should any data or content published be relied upon. The views expressed by TechDogs' members and guests are their own and their appearance on our site does not imply an endorsement of them or any entity they represent. Views and opinions expressed by TechDogs' Authors are those of the Authors and do not necessarily reflect the view of TechDogs or any of its officials. All information / content found on TechDogs' site may not necessarily be reviewed by individuals with the expertise to validate its completeness, accuracy and reliability.
Trending Business Wire
Freecast Launches Youbundle To Simplify Streaming Subscriptions
By Business Wire
Genius Sports Launches Fanhub, Worlds First Advertising & Activation Platform To Reach & Engage Sports Fans
By Business Wire
HPE Positioned As A Leader For Seven Years Running In 2024 Gartner Magic Quadrant For SD-WAN Report
By Business Wire
KUBRA & Silverblaze Partner To Upgrade Billing, Payment, And Customer Engagement Solutions
By Business Wire
MG Digital Direct Expands Multi-Channel Marketing Services With Radisys Engage Digital Platform
By Business Wire
Join Our Newsletter
Get weekly news, engaging articles, and career tips-all free!
By subscribing to our newsletter, you're cool with our terms and conditions and agree to our Privacy Policy.
Join The Discussion