mCAP: Memory-Centric Partitioning for Large-Scale Pipeline-Parallel DNN Training

Open Access
Authors
Publication date 2022
Host editors
  • J. Cano
  • P. Trinder
Book title Euro-Par 2022: Parallel Processing
Book subtitle 28th International Conference on Parallel and Distributed Computing, Glasgow, UK, August 22–26, 2022, Proceedings
ISBN (print)
  • 978-3-031-12596-6
ISBN (electronic)
  • 978-3-031-12597-3
Series Lecture Notes in Computer Science
Event 28th International European Conference on Parallel and Distributed Computing, Euro-Par 2022
Pages (from-to) 155-170
Number of pages 16
Publisher Cham: Springer
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract

Memory usage is becoming an increasingly pressing bottleneck in the training process of Deep Neural Networks (DNNs), especially when training on Graphics Processing Units (GPUs). Existing solutions for multi-GPU training setups partition the neural network over the GPUs in a way that favors training throughput over memory usage, and thus over the maximum trainable network size. We propose mCAP, a partitioning solution for pipeline-parallel DNN training that focuses specifically on memory usage. It evenly distributes deep learning models over the available resources with respect to per-device peak memory usage. Our partitioning approach uses a novel incremental profiling strategy to extract per-layer memory usage statistics. A model-based predictor then uses the profiling data to recommend a partitioning that balances peak memory usage across devices. Our approach is DL-framework agnostic and orthogonal to existing memory optimizations found in large-scale DNN training systems. Our results show that our approach enables training of neural networks that are 1.55 times larger, in number of parameters, than those trainable with existing partitioning solutions.
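
The abstract describes the partitioning approach only at a high level. As a rough, hypothetical illustration of what balancing per-device peak memory can look like, the Python sketch below splits a sequence of per-layer memory estimates into contiguous pipeline stages so that the largest per-stage total is minimized, via a binary search over a stage-memory cap. The function names, the additive per-stage memory model, and the example numbers are assumptions made for this sketch; they are not mCAP's actual profiler or predictor.

```python
# Hypothetical sketch (not mCAP's actual algorithm): split a sequence of
# per-layer peak-memory estimates into contiguous pipeline stages so that
# the largest per-stage total is as small as possible.
from typing import List


def stages_needed(layer_mem: List[float], cap: float) -> int:
    """Contiguous stages required if no stage's total may exceed `cap`."""
    stages, current = 1, 0.0
    for m in layer_mem:
        if current + m > cap:
            stages, current = stages + 1, m
        else:
            current += m
    return stages


def balanced_partition(layer_mem: List[float], num_gpus: int) -> List[List[int]]:
    """Binary-search the smallest stage-memory cap that fits into `num_gpus`
    stages, then cut the layer sequence greedily at that cap.

    Simplifying assumption: a stage's peak memory is the sum of its layers'
    footprints. Real peak usage also depends on activations, optimizer state,
    and the pipeline schedule, which is why mCAP relies on profiling.
    """
    lo, hi = max(layer_mem), sum(layer_mem)
    for _ in range(60):  # plenty of iterations for float convergence
        mid = (lo + hi) / 2
        if stages_needed(layer_mem, mid) <= num_gpus:
            hi = mid  # feasible cap: try to shrink it
        else:
            lo = mid  # infeasible cap: must grow it
    # Materialize the cuts at the found cap `hi`, recording layer indices.
    stages, current = [[]], 0.0
    for i, m in enumerate(layer_mem):
        if current + m > hi and stages[-1]:
            stages.append([])
            current = 0.0
        stages[-1].append(i)
        current += m
    return stages


if __name__ == "__main__":
    # Made-up per-layer peak-memory estimates (GB), e.g. from profiling runs.
    mem = [1.2, 0.8, 2.5, 0.6, 1.9, 1.1, 0.7, 2.2]
    print(balanced_partition(mem, num_gpus=4))
    # -> [[0, 1], [2, 3], [4, 5], [6, 7]]
```

The min-max objective mirrors the stated goal of evenly distributing the model with respect to per-device peak memory: the device holding the heaviest stage determines whether the network fits at all, so minimizing that maximum maximizes the trainable model size under this simplified memory model.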

Document type Conference contribution
Language English
Published at https://doi.org/10.1007/978-3-031-12597-3_10
Other links
  • https://doi.org/10.6084/m9.figshare.20000960
  • https://www.scopus.com/pages/publications/85135773608