Recent reports have shed light on the stark contrast between hardware resources and actual utilization at xAI, the AI company founded by Elon Musk. Despite operating approximately 550,000 NVIDIA GPUs, largely H100- and H200-series chips, the company's model FLOPs utilization (MFU) hovers around just 11%. This discrepancy has quickly become a hot topic, raising widespread questions across the industry about how efficiently those computational resources are being used.
An internal memo obtained by The Information reveals that xAI's President, Michael Nicolls, candidly acknowledged this low efficiency to his team. In the document, he described the model's MFU as "embarrassingly low," emphasizing the gap between theoretical computational capacity and actual usage. Nicolls set a target for the coming months: raising the figure to 50%, a dramatic improvement over the current level.
While the scale of xAI's hardware deployment impresses industry insiders, analysts note that an 11% utilization level does not mean the remaining 89% of the GPUs sit idle. Rather, MFU is a stringent measure of what fraction of the hardware's peak theoretical throughput is actually achieved during training.
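The metric itself is simple arithmetic: the FLOPs a training step actually delivers per second, divided by the fleet's theoretical peak. The sketch below illustrates the calculation; the GPU count, per-step FLOPs, and per-GPU peak rating are hypothetical round numbers chosen for illustration, not figures from xAI's cluster.

```python
def mfu(flops_per_step: float, step_time_s: float,
        num_gpus: int, peak_flops_per_gpu: float) -> float:
    """Model FLOPs utilization: achieved throughput / peak theoretical throughput."""
    achieved = flops_per_step / step_time_s      # FLOP/s actually delivered
    peak = num_gpus * peak_flops_per_gpu         # theoretical ceiling of the fleet
    return achieved / peak

# Hypothetical example: a step requiring 6e18 FLOPs finishes in 27 s
# on 1,000 GPUs each rated at 2e15 FLOP/s (illustrative numbers only).
print(round(mfu(6e18, 27.0, 1000, 2e15), 3))  # → 0.111, i.e. ~11%
```

Under these made-up numbers the fleet delivers about one-ninth of its theoretical capacity, matching the roughly 11% figure reported for xAI.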
Compared to industry benchmarks, xAI's efficiency falls notably short. Companies like Meta and Google typically maintain MFUs between 35% and 45% during large-scale model training, thanks to extensive software optimization and mature training pipelines. Even during the early, less efficient days of GPT-3, MFU hovered around 21% to 26%. At a mere 11%, xAI underperforms even those earlier eras of AI development.
Industry experts point out that the core issue isn't hardware deficiency but rather software shortcomings. Although xAI has adopted NVIDIA's standard deployment strategies, its software stack, parallelization techniques, and model engineering have lagged behind the rapid hardware expansion. A significant bottleneck is HBM memory bandwidth: data cannot be read from memory as fast as the compute units can consume it, leaving chips idle while waiting for data. Network topology and GPU-to-GPU communication overhead, amplified at the scale of hundreds of thousands of GPUs, compound these inefficiencies. Memory pressure, excessive recomputation of activations, and cross-GPU communication demands have all been identified as systemic contributors to low MFU.
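The memory-bandwidth bottleneck described above can be made concrete with a simple roofline-style check: an operation is memory-bound when its arithmetic intensity (FLOPs performed per byte moved) falls below the chip's compute-to-bandwidth ratio. The sketch below uses approximate H100-class numbers for illustration; they are ballpark datasheet figures, not measurements from xAI's cluster.

```python
def is_memory_bound(flops: float, bytes_moved: float,
                    peak_flops: float, peak_bandwidth: float) -> bool:
    """Roofline check: compare arithmetic intensity to the machine balance point."""
    intensity = flops / bytes_moved          # FLOPs per byte of HBM traffic
    balance = peak_flops / peak_bandwidth    # FLOPs the chip can do per byte fetched
    return intensity < balance

# Ballpark H100-class figures: ~1e15 FLOP/s peak, ~3.35e12 B/s HBM bandwidth.
# An elementwise op doing 1 FLOP per 8 bytes moved is deeply memory-bound:
print(is_memory_bound(flops=1.0, bytes_moved=8.0,
                      peak_flops=1e15, peak_bandwidth=3.35e12))  # → True
```

With a balance point of roughly 300 FLOPs per byte, low-intensity operations spend most of their time waiting on memory rather than computing, which drags down MFU regardless of how many GPUs are deployed.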
Interestingly, xAI's rapid infrastructure growth is itself an industry-leading feat: the Colossus supercomputer was constructed in just 122 days. However, that same pace of expansion has exposed the gaps in software optimization, making the low utilization all the more pronounced.
In sum, xAI’s case exemplifies how having the most advanced hardware doesn’t automatically translate into efficient AI training. The real challenge lies in developing software solutions that can keep pace with hardware scaling, ensuring the immense investment in resources translates into tangible performance gains.
[This article is courtesy of Kuai Keji. Reproduction must credit the source.]