NERSC Data Seminars

An Empirical Model of Large-Batch Training

Sam McCandlish (OpenAI)

How quickly can neural networks be trained using large batch sizes? The limits of data parallelism seem to differ from domain to domain, ranging from batches of tens of thousands for ImageNet classifiers to batches of millions for RL agents that play the game Dota 2. We describe a simple and easy-to-measure statistic called the gradient noise scale that predicts the largest useful batch size across many applications. Our empirically motivated theory also describes the tradeoff between compute-efficiency and time-efficiency, and provides a rough model of the benefits of adaptive batch-size training.
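The abstract centers on the gradient noise scale, which compares the variance of per-example gradients to the squared norm of the true gradient. Below is a minimal Python sketch of one way to estimate a "simple" noise scale from gradients measured at two batch sizes, following the general idea described in the talk; the function name, synthetic data, and single-measurement setup are illustrative assumptions, not the authors' implementation (which would average such estimates over many steps during training).

```python
import numpy as np

def simple_noise_scale(g_small, g_big, b_small, b_big):
    """Rough estimate of the gradient noise scale B_noise ~ tr(Sigma) / |G|^2,
    using gradient vectors averaged over two different batch sizes.

    g_small, g_big : gradients averaged over b_small and b_big examples.
    """
    g2_small = np.dot(g_small, g_small)   # |G_small|^2
    g2_big = np.dot(g_big, g_big)         # |G_big|^2

    # Unbiased estimate of the true gradient's squared norm |G|^2:
    # E[|G_B|^2] = |G|^2 + tr(Sigma) / B, so a weighted difference cancels the noise term.
    g2 = (b_big * g2_big - b_small * g2_small) / (b_big - b_small)
    # Unbiased estimate of the per-example gradient variance trace tr(Sigma).
    s = (g2_small - g2_big) / (1.0 / b_small - 1.0 / b_big)

    return s / g2   # rough critical batch size


# Illustrative usage with synthetic gradients: a fixed "true" gradient plus
# per-example noise whose variance shrinks with batch size.
rng = np.random.default_rng(0)
dim = 1000
true_grad = rng.normal(size=dim)          # |G|^2 ~ 1000
b_small, b_big = 32, 1024
g_small = true_grad + rng.normal(size=dim, scale=5.0 / np.sqrt(b_small))
g_big = true_grad + rng.normal(size=dim, scale=5.0 / np.sqrt(b_big))

# tr(Sigma) ~ 25 * 1000, so the estimate should land near 25.
print("estimated noise scale:", simple_noise_scale(g_small, g_big, b_small, b_big))
```

In practice a single pair of measurements like this is noisy; the estimate becomes useful when averaged over many training steps, which is consistent with the abstract's emphasis on the statistic being easy to measure during ordinary training.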