By Steven Woo, Rambus Fellow
Last week, I had the pleasure of hosting a panel at the AI Hardware & Edge AI Summit on the topic of “Memory Challenges for Next-Generation AI/ML Computing.” I was joined by David Kanter of MLCommons, Brett Dodds of Microsoft, and Nuwan Jayasena of AMD, three accomplished experts who brought differing views on the importance of memory for AI/ML. Our discussion focused on some of the challenges and opportunities for DRAMs and memory systems. As the performance requirements for AI/ML continue to grow rapidly, so does the importance of memory.
In fact, we’re seeing demands for “all of the above” when it comes to memory for AI, specifically:
- More capacity – model sizes are huge and growing rapidly. David cited embedding tables used by Baidu in their recommender system requiring 10 TB. Tables of that magnitude require a growing amount of DDR main memory capacity (see the sizing sketch after this list).
- More bandwidth – with the enormous amount of data to be moved, we’re witnessing the continued race to higher data rates across all DRAM types to provide more memory bandwidth (a sample peak-bandwidth calculation also follows the list).
- Lower latency – another aspect of this need for speed is lower latency so processor cores aren’t left idle waiting for data.
- Lower power – unfortunately, we’re running up against the limits of physics, and power has become an important limiter in AI systems. The demand for higher data rates is driving up power consumption. To mitigate this, IO voltages are being reduced (a power-scaling sketch follows this list), but lower voltages shrink voltage margins and increase the chance of errors, which brings us to…
- Higher reliability – error rates rise at higher speeds, lower voltages, and smaller process geometries, so we’re seeing increasing use of on-die ECC and advanced signaling techniques to compensate (a toy error-correction example appears after this list).
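To put the capacity point in perspective, here is a minimal back-of-the-envelope sketch in Python. The row count, embedding dimension, and data type are assumptions chosen for illustration, not Baidu’s actual configuration; they simply show how quickly recommender embedding tables reach the 10 TB scale David cited.

```python
# Back-of-the-envelope sizing for a recommender-system embedding table.
# All parameters are illustrative assumptions, not Baidu's configuration;
# they simply show how such tables reach the multi-terabyte scale.

num_rows = 5_000_000_000   # e.g., one row per user/item ID (assumed)
embedding_dim = 512        # vector length per row (assumed)
bytes_per_element = 4      # float32

table_bytes = num_rows * embedding_dim * bytes_per_element
print(f"Embedding table size: {table_bytes / 1e12:.1f} TB")  # -> 10.2 TB
```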
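Peak-bandwidth reasoning is similarly mechanical: per-pin data rate times interface width. The device figures below are representative published numbers, used here only to illustrate the calculation.

```python
# Peak bandwidth = per-pin data rate (Gb/s) * interface width (bits) / 8.
# Data rates and widths are representative figures for illustration.

devices = {
    "DDR5-6400 DIMM (64-bit)":         (6.4, 64),
    "GDDR6 device (16 Gb/s, 32-bit)":  (16.0, 32),
    "HBM3 stack (6.4 Gb/s, 1024-bit)": (6.4, 1024),
}

for name, (gbps_per_pin, width_bits) in devices.items():
    gb_per_s = gbps_per_pin * width_bits / 8
    print(f"{name}: {gb_per_s:.1f} GB/s peak")
```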
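On power, the dynamic-power relation P = C·V²·f explains why IO voltages keep coming down: voltage enters as a square, so a modest supply reduction offsets much of the cost of a higher data rate. A small sketch, with purely illustrative numbers:

```python
# Dynamic switching power: P = C * V^2 * f, with C the switched capacitance,
# V the supply voltage, and f the switching frequency.

def relative_power(v, f, v0=1.0, f0=1.0):
    """Power relative to a (v0, f0) baseline, holding capacitance fixed."""
    return (v / v0) ** 2 * (f / f0)

print(f"{relative_power(v=1.0, f=2.0):.2f}x")  # 2x data rate alone -> 2.00x
print(f"{relative_power(v=0.8, f=2.0):.2f}x")  # 2x rate at 0.8x V -> 1.28x
```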
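And on reliability, the sketch below is a toy Hamming(7,4) single-error-correcting code, included only to illustrate the principle behind on-die ECC. Production on-die ECC uses much wider codewords and vendor-specific codes; none of this reflects any particular DRAM implementation.

```python
# Toy Hamming(7,4) single-error-correcting code, to illustrate the idea
# behind on-die ECC. Real DRAM on-die ECC uses much wider codewords;
# this 4-bit example is only a sketch.

def encode(d):                      # d: list of 4 data bits
    p1 = d[0] ^ d[1] ^ d[3]         # covers codeword positions 1,3,5,7
    p2 = d[0] ^ d[2] ^ d[3]         # covers positions 2,3,6,7
    p4 = d[1] ^ d[2] ^ d[3]         # covers positions 4,5,6,7
    return [p1, p2, d[0], p4, d[1], d[2], d[3]]  # positions 1..7

def decode(c):                      # c: 7-bit codeword, maybe 1 bit flipped
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the flipped bit
    if syndrome:
        c[syndrome - 1] ^= 1         # correct the single-bit error
    return [c[2], c[4], c[5], c[6]]  # recover the 4 data bits

word = encode([1, 0, 1, 1])
word[4] ^= 1                         # inject a single-bit error
assert decode(word) == [1, 0, 1, 1]  # error detected and corrected
```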
Another big topic we discussed was the challenges and opportunities for new memory technologies in AI. New technologies have many potential benefits, including:
- Optimizing capacity, bandwidth, latency, and power for a focused set of use cases. AI is a large and important market with a lot of money behind it, a great combination that can drive the development of new memory technologies. In the past, GDDR (developed for the graphics market), LPDDR (developed for the mobile market), and HBM (developed for high-bandwidth applications including AI) were created to meet the needs of use cases that could not be satisfied with existing memories.
- CXL™ – CXL offers the opportunity to greatly scale up memory capacity and improve bandwidth, while also abstracting the memory type from the processor. In this way, CXL provides a great interface for incorporating new memory technologies. The CXL memory controller provides the translation layer between the processor and memory, allowing a new memory tier to be inserted after locally attached memory (a simple two-tier model follows this list).
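As a rough way to reason about what a CXL memory tier buys, here is a small two-tier model of blended capacity and average latency. The capacity and load-to-use latency figures are assumptions for illustration, not measurements of any particular platform.

```python
# Simple two-tier memory model: locally attached DDR plus a CXL-attached
# expansion tier behind a CXL memory controller. All numbers are assumed
# values for illustration only.

tiers = [
    # (name, capacity in GB, typical load-to-use latency in ns; assumed)
    ("Local DDR5",        512,  100),
    ("CXL-attached DRAM", 2048, 250),
]

total_gb = sum(cap for _, cap, _ in tiers)

# If accesses land on each tier in proportion to the fraction of the
# working set it holds, the blended average latency is a weighted mean.
avg_ns = sum(cap / total_gb * lat for _, cap, lat in tiers)

print(f"Total capacity: {total_gb} GB ({total_gb / 512:.0f}x local DDR alone)")
print(f"Blended average latency: {avg_ns:.0f} ns")
```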
While new memory types targeting specific use cases can be beneficial for many applications, they face additional challenges:
- DRAM, on-chip SRAM, and Flash memory are here to stay for the foreseeable future, so don’t expect anything to completely replace them. Yearly R&D and capex investment in these technologies, together with decades of experience in high-yield manufacturing, make wholesale replacement essentially impossible in the near term. Any new memory technology must work well alongside these incumbents in order to be adopted.
- The scale of AI deployments and the risk associated with developing new memory technologies make it difficult to adopt brand-new memories. Memory development typically takes 2-3 years, but AI is advancing so fast that it is difficult to predict which specific features will be needed that far into the future. The stakes are high, and so is the risk of depending on a new technology being ready and available.
- The performance benefits of any new technology must be high enough to offset any additional cost and risk. Given the demands on infrastructure engineering and deployment teams, this translates to a very high hurdle that new memory technologies need to overcome.
Memory will continue to be a key enabler for future AI systems. The industry must continue to innovate for future systems to deliver faster and more meaningful AI, and it is responding.