Elon Musk has once again grabbed headlines by giving the world a glimpse of Cortex, Tesla's AI training supercomputer under construction at the Giga Texas plant. In a video that's both awe-inspiring and surreal, Musk showed off what a cool $1 billion in AI GPUs actually looks like. And if that wasn't enough to make tech enthusiasts' jaws drop, Musk took to his platform, X, to reveal that the real showstopper has officially come online: Colossus, a 100,000-GPU H100 training cluster.
What exactly are AI clusters?
Think of an AI cluster as a giant brain made up of thousands of computers working together to process massive amounts of information at lightning speed. Instead of relying on a single machine, clusters like Colossus use thousands of specialized servers, each equipped with powerful chips called GPUs, designed to handle the incredibly complex calculations needed for artificial intelligence.
These clusters train AI models by feeding them vast amounts of data—think of it like teaching a student by giving them thousands of books to read in a short time.
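To make the "thousands of machines working together" idea concrete, here is a minimal sketch of data-parallel training using PyTorch's DistributedDataParallel, the same basic pattern large clusters rely on, just at a vastly smaller scale. The toy model, batch size, and hyperparameters below are illustrative assumptions, not anything xAI has published about Colossus.

```python
# Minimal data-parallel training sketch. Each GPU runs a copy of this
# script as one "rank"; gradients are averaged across all GPUs so every
# copy of the model stays in sync. Launch with:
#   torchrun --nproc_per_node=<num_gpus> train.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def train(local_rank: int) -> None:
    # Join the process group; torchrun supplies the rendezvous env vars.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    # A toy model standing in for a large neural network.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(1000):
        # In a real job each rank reads a different shard of the dataset;
        # random tensors stand in for that data here.
        batch = torch.randn(32, 1024, device=local_rank)
        loss = model(batch).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()  # DDP all-reduces gradients across every GPU here
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    train(int(os.environ.get("LOCAL_RANK", 0)))
```

Every GPU processes a different slice of the data, and after each step the gradients are averaged across all of them, so adding more GPUs lets you chew through more data per step. Colossus applies this same idea across 100,000 GPUs, which is where the specialized networking and engineering effort comes in.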
All details regarding xAI's Colossus

Musk didn't hold back on the bragging rights, claiming that Colossus is "the most powerful AI training system in the world." Even more impressive is the fact that this mammoth project was built "from start to finish" in just 122 days.
Considering the scale and complexity involved, that's no small feat. Servers for the xAI cluster were supplied by Dell and Supermicro, and while Musk didn't drop an exact figure, estimates place the cost at a staggering $3 billion to $4 billion.
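For rough context, that estimate squares with simple per-unit arithmetic: assuming roughly $30,000 to $40,000 per H100, a commonly cited street price rather than anything Musk confirmed, 100,000 GPUs alone would come to $3 billion to $4 billion, before counting networking, power, and the facility itself.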
This weekend, the @xAI team brought our Colossus 100k H100 training cluster online. From start to finish, it was done in 122 days.
Colossus is the most powerful AI training system in the world. Moreover, it will double in size to 200k (50k H200s) in a few months.
Excellent…
— Elon Musk (@elonmusk) September 2, 2024
Now, here's where things get really interesting. Although the system is online, it's unclear exactly how many of its 100,000 GPUs are fully operational today. That's not uncommon for systems of this magnitude, which require extensive debugging and optimization before they run at full throttle. But at the scale of Colossus, even a fraction of its full potential could outperform most other systems.
The future looks even more intense. Colossus is set to double in size to 200,000 GPUs, with the added 100,000 units split between Nvidia's current H100s and 50,000 of the newer H200 chips, per Musk's post. This upgrade will primarily power the training of xAI's next and most advanced AI model, Grok-3, which aims to push the boundaries of what we consider possible in AI.
Featured image credit: BoliviaInteligente/Unsplash