AI Chips Depreciate Quickly Due To Heat Stress During Model Training Runs

The Michael Burry view is that companies are depreciating their GPUs over five-year periods despite technological obsolescence in 18 months or less. The implication is that profits and rates of return on these investments are actually negative, despite corporate claims to the contrary.
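To make the accounting point concrete, here is a rough sketch using a hypothetical $30,000 GPU purchase price (an illustrative assumption, not a quoted figure): a five-year straight-line schedule books $6,000 of depreciation a year, while an 18-month schedule books $20,000 a year, so the longer schedule flatters reported profit.

```python
# Rough sketch (hypothetical numbers) of how the depreciation schedule
# changes the annual expense booked for a single GPU. The $30,000
# purchase price is an illustrative assumption, not a quoted figure.

def straight_line_annual_expense(purchase_price: float, useful_life_years: float) -> float:
    """Annual depreciation expense under a straight-line schedule."""
    return purchase_price / useful_life_years

price = 30_000.0  # hypothetical cost of one data-center GPU, USD

print(f"5-year schedule:   ${straight_line_annual_expense(price, 5.0):,.0f} per year")
print(f"18-month schedule: ${straight_line_annual_expense(price, 1.5):,.0f} per year")
```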

But heat stress also depreciates GPUs. AI chips run marathon sessions during model estimation (training), often exceeding a month, and GPUs frequently fail under heat stress: "Imagine you had a 10,000 or even a 20,000 GPU data center. You should expect on the statistics a chip to fail about every 3 or 4 hours. So long before I get to the point where I’m rapidly turning these over because there’s a new generation of chips, I’m turning over a vast chunk of my chips just because they’re failing under thermal stress." Source: https://paulkrugman.substack.com/p/talking-with-paul-kedrosky. GPUs used to estimate large language models are like ultramarathon (50, 100, and 166 kilometers) runners on a 38-degree day.
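A minimal sketch of the fleet-level arithmetic behind Kedrosky's point, assuming failures are independent and occur at a constant rate. The ~20% annualized per-GPU failure rate used here is an illustrative assumption, not a figure from the source; with it, a 10,000-GPU fleet loses a chip roughly every 4.4 hours and a 20,000-GPU fleet roughly every 2.2 hours, in the same ballpark as the "every 3 or 4 hours" claim.

```python
# Minimal sketch of the fleet-level failure arithmetic, assuming independent
# failures at a constant rate. The 20% annualized per-GPU failure rate is an
# illustrative assumption, not a figure from the source.

HOURS_PER_YEAR = 365 * 24

def hours_between_fleet_failures(num_gpus: int, annual_failure_rate: float) -> float:
    """Expected hours between failures anywhere in the fleet."""
    failures_per_hour = num_gpus * annual_failure_rate / HOURS_PER_YEAR
    return 1.0 / failures_per_hour

for fleet_size in (10_000, 20_000):
    hrs = hours_between_fleet_failures(fleet_size, annual_failure_rate=0.20)
    print(f"{fleet_size:,} GPUs -> one failure every {hrs:.1f} hours")
```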

It took Facebook 54 days to estimate the Llama 3 open-source large language model using 16,384 Nvidia H100 80GB GPUs. There was a hardware failure roughly every three hours, with GPU chip heat deaths making up the majority of them. This despite the fact that GPUs are designed to slow down as temperature rises.
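A back-of-envelope check using only the figures above (16,384 GPUs, a 54-day run, one failure roughly every three hours) implies about 430 failures over the run and an annualized per-GPU failure rate of roughly 18%. That derived rate is my estimate, not a number reported by Meta.

```python
# Back-of-envelope check using only the figures quoted in the text:
# 16,384 GPUs, a 54-day run, roughly one hardware failure every 3 hours.
# The annualized per-GPU failure rate below is a derived estimate, not a
# number reported by Meta.

gpus = 16_384
run_days = 54
hours_between_failures = 3.0  # "roughly every three hours"

run_hours = run_days * 24                        # 1,296 hours
failures = run_hours / hours_between_failures    # ~432 failures over the run
per_gpu_rate_run = failures / gpus               # ~2.6% of GPUs fail during the run
per_gpu_rate_annualized = per_gpu_rate_run * 365 / run_days  # ~18% per GPU-year

print(f"Failures over the run:      ~{failures:.0f}")
print(f"Per-GPU failure rate (run): {per_gpu_rate_run:.1%}")
print(f"Annualized per-GPU rate:    {per_gpu_rate_annualized:.0%}")
```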

Full details here: https://www.datacenterdynamics.com/en/news/meta-report-details-hundreds-of-gpu-and-hbm3-related-interruptions-to-llama-3-training-run/

[Screenshot: headline stating that META reports large-scale AI chip failures due to heat stress]

