GPU Rich vs. GPU Poor
The new silicon divide probably won't be as bad as it looks
I was actually planning to write about the futility of watermarking AI-generated content today, but this week’s hot debate on GPU costs derailed me!
The scarcity of the powerful GPUs needed to train the most influential models is a central industry talking point. The crunch in chip supply has driven giant seed investments in some companies and high valuations for GPU-centric cloud providers, as well as making NVIDIA shareholders very happy.
This week, the industry analyst blog SemiAnalysis published a piece (partly paywalled) arguing that the GPU Rich vs. Poor gap is already significant. The post namechecks multiple players (EU countries and a slew of high-profile startups) that it says are on the wrong side of the divide. Other commentators also chimed in.
There’s no doubt that access to GPUs is a significant pain point for many AI teams (both commercial and R&D), and there are haves and have-nots. This will also have some effect on competitive outcomes.
I doubt the situation is as bad as it sounds, however…
SemiAnalysis
The team at SemiAnalysis essentially argues that:
- There are only between 5 and 10 organizations worldwide with 20,000+ GPUs and the freedom to pursue all the R&D they would like. (They likely lack people rather than chips.)
- Very large numbers of teams with few chips (access to tens or hundreds of GPUs, or fewer, or only via cloud services they lack the budget for) are struggling to progress, wasting time solving problems that would not exist with more chip access.
- GPU-poor teams are making poor choices as they bump up against hardware limits, trying to use dense, compute-expensive models when they should perhaps switch to other model types (a minimal sketch of one such alternative follows this summary).
- European startups and governments are especially behind in the GPU race. (*1)
- Some well-known companies, such as HuggingFace, Databricks (MosaicML), and Together, are also in the same trap. They would need hundreds of millions more than they can raise to compete with the more prominent players.
- There is a risk that many smaller applications will be relegated to simply making API calls to prominent players such as OpenAI.
- NVIDIA is cleaning up by expanding its software, API, and cloud solutions to capture markets beyond chips.
- Lastly, Google is potentially in the best position to challenge NVIDIA’s strength due to its long-running chip efforts and experience in cloud infrastructure.
This paints quite a dark picture of the road ahead.
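The dense-versus-sparse point above is worth a concrete illustration. Below is a minimal sketch, in PyTorch, of a top-1 mixture-of-experts layer, one of the “other model types” that GPU-poor teams might reach for: total parameters grow with the number of experts, while each token only pays the compute cost of a single expert. The `TinyMoE` class, the sizes, and the routing scheme are illustrative assumptions of mine, not anything taken from the SemiAnalysis post.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Sparse mixture-of-experts layer with top-1 routing.

    Capacity scales with n_experts, but each token runs through
    exactly one expert, so per-token compute stays roughly flat.
    """

    def __init__(self, d_model: int = 64, n_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.ReLU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Route each token to a single expert.
        choice = self.router(x).argmax(dim=-1)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = choice == i
            if mask.any():
                out[mask] = expert(x[mask])  # only routed tokens pay for this expert
        return out

x = torch.randn(10, 64)
print(TinyMoE()(x).shape)  # torch.Size([10, 64])
```

Whether a sparse model is actually the right call is exactly the kind of judgment SemiAnalysis argues GPU-poor teams are being forced into prematurely.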
AI Inequality would be terrible, but there are countervailing forces
One of the biggest dangers with AI, in my view, is a high level of inequality in who gets access both to use it and to develop it further. Specifically, a world in which just a few companies or governments control powerful AI systems, with very few others having access, could end up being very dystopian.
However, I don’t think we’re there yet, and there are countervailing forces to SemiAnalysis’ take:
- Constraints are good. More GPU access would, no doubt, help many teams. However, the last 12 months have seen giant leaps in finetuning, training, inference, and other compute-intensive processes, and many of these innovations are driven by constraints (a minimal sketch of one such technique follows this list). I’m vividly reminded of two research colleagues who worked on execution environments for handheld devices in the late 1990s. Their target device was the PalmPilot (yes, the 1990s had an “iPhone”), and their objective was to run Java Virtual Machines on it. Each time they neared their goal with a new version of the Java VM (an impressive engineering feat), Palm would release a new handheld with roughly double the compute resources, obsoleting their work.
- HuggingFace, Databricks, and many others don’t need to run compute for their customers. Each of these players helps its customers build models. While they can run compute on customers’ behalf, in most cases they act as facilitators, helping manage the code and models that run on hyperscaler cloud infrastructure. A large Databricks customer, for example, may have petabytes of data managed with Databricks, but much of that data will sit on Amazon, Azure, or Google Cloud. Further, when models are trained on this data (e.g., using MosaicML), the training cycles will likely run on the customer’s own Amazon, Azure, or Google Cloud accounts.
- Other players will up their chip game. Given NVIDIA’s massive lead, it will take time for other players to reach similar performance levels, but the whole industry is now aligned to try. AMD, Apple, Arm, Amazon, Google, Meta, and others will redouble their efforts. While they may not beat NVIDIA in a straight horse race anytime soon, they are all optimizing for slightly different use cases, so many improvements will be made.
- The speed advantage is real but somewhat bounded. As SemiAnalysis points out, within 12-18 months the number of chips will have increased significantly. Couple this with the previous point, and available capacity should slowly grow relative to demand. The question is: what is a two-year head start worth in R&D or product rollout? For some applications, it may be enough to determine market winners. In many, though, this seems unlikely. LLMs have properties that make them impressive out of the box and hard to tame in the long run. Later entrants may have less mature technology, but they will benefit from a more robust toolset, lower costs, and having seen everyone else’s failures. Given the choice, one would still rather start now and run fast, but given the data needs, the challenges in controlling LLMs, and the newness of so many applications, it’s unclear that starting later would mean game over.
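As a concrete example of the constraint-driven progress mentioned in the first bullet, here is a minimal sketch of low-rank adaptation (LoRA)-style finetuning in plain PyTorch. The `LoRALinear` wrapper, the rank, and the scaling are illustrative choices of mine rather than a production recipe; the point is how little trainable state is needed to adapt a large layer.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pretrained linear layer plus a trainable low-rank update.

    The effective weight is W + (alpha / r) * B @ A, with only the small
    rank-r matrices A and B receiving gradients. This slashes gradient
    and optimizer-state memory, the kind of trick constrained teams use.
    """

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weights
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base projection plus the scaled low-rank correction.
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale

# Toy usage: adapt a single 4096x4096 projection.
layer = LoRALinear(nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable:,} of {total:,}")  # 65,536 of 16,846,848
```

Here under 0.4% of the layer’s parameters are trained, which is why techniques like this let small teams finetune models that would otherwise be far out of budget.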
Application trickle-down
The last, but perhaps strongest, countervailing force against chip lockup is that in the medium and long term, chips need to be paid for by delivering value to someone. Users (individuals and organizations) will need to see value in the applications and services that rely on the AI systems being developed.
Most companies buying outsized numbers of chips today are (at least in part) infrastructure companies that will need to resell access to this infrastructure to others.
Much of the “value” of AI to the end user will come through applications and services they already know, as well as through well-known brands (high street banks, fashion, auto manufacturers, etc.). The infrastructure players will actively want to ensure this value can be realized; otherwise, their chip investments won’t be worth the silicon they are printed on.
The grey zone is that each of the largest tech companies in the world (Apple, Amazon, Google, Meta, Microsoft) is active in a slew of business areas that could benefit directly from AI, which may make competing with them in those areas a particularly unequal fight. This is arguably an expansion of an existing concern: many large brands already worry about working with dominant cloud players that may also be competitors (Walmart and Amazon, for example). Look out for antitrust lawsuits around AI access, though probably centered on trained models rather than chips.
So we’re all good?
GPU Rich vs. Poor is a serious debate, and there are genuinely valid concerns. For startups currently in the field, GPU cost is a very serious issue, particularly in the EU, where GDPR and other constraints limit where workloads can be hosted. The sooner costs come down, the better.
However, in the medium and long term, there should be corrective forces that will play out.
In the meantime, where possible, it makes sense to:
- Spend more on smart people and novel techniques than on hardware. Those optimization techniques might be worth more than a pile of chips (a back-of-envelope sketch follows this list). Even better, they’ll burn less fossil fuel.
- Try to use the latest tools and techniques as they come out. Right now, many things are being built from scratch (probably multiple times), so avoid building anything that isn’t differentiating. Hundreds of venture investments have been made in tooling, some of which will reduce costs for early adopters.
- Look at your long-term goal. If there are moats other than the raw power of your trained model (such as collected data, visualization techniques, etc.), focus on those while the model catches up.
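To put some rough numbers behind the first point in this list, here is a back-of-envelope sketch of how one such technique, weight quantization, substitutes cleverness for hardware. The model size and per-GPU memory figures are assumptions chosen for easy arithmetic, not measurements.

```python
# Back-of-envelope: how precision choices change the number of GPUs
# needed just to hold a model's weights. All figures are assumptions.

PARAMS = 70e9        # assumed 70B-parameter model
GPU_MEMORY_GB = 80   # assumed 80 GB per accelerator

BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16/bf16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}

for precision, bytes_per_param in BYTES_PER_PARAM.items():
    weights_gb = PARAMS * bytes_per_param / 1e9
    # Minimum GPUs whose combined memory fits the weights (ceiling division),
    # ignoring activations, KV cache, and overhead, which all add more.
    gpus = -(-weights_gb // GPU_MEMORY_GB)
    print(f"{precision:>10}: {weights_gb:6.0f} GB of weights -> at least {gpus:.0f} GPU(s)")
```

Under these assumptions, moving from fp32 to int4 turns a four-GPU serving problem into a single-GPU one, which is the kind of leverage smart techniques can buy while chips stay scarce.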
If you’re fighting for GPUs, where are you finding them? How much of a competitive barrier is access going to be?
Have a wonderful week!
Watermarking up next week!
Notes
- (*1) The call-out of European countries and governments for lagging on chip availability does have some truth to it, but there are also other indicators. Last week, the well-funded startup Poolside moved to France, and one of the key motivators was access to infrastructure providers such as Scaleway.