Apple is partnering with Nvidia to speed up artificial intelligence (AI) models.
The company is advancing its generative AI strategy with its Recurrent Drafter (ReDrafter) method, which enhances the inference speed of large language models (LLMs) on Nvidia GPUs.
Apple combined ReDrafter with Nvidia’s TensorRT-LLM framework to make AI models run faster while maintaining their accuracy.
Earlier this year, Apple published and open-sourced ReDrafter, a novel approach to speculative decoding that, according to the company, achieves state-of-the-art performance.
The ReDrafter method combines two approaches: beam search (which lets the model explore multiple candidate continuations) and dynamic tree attention (an efficient way to process those overlapping candidates). According to Apple’s research, ReDrafter can generate up to 3.5 tokens per generation step on open-source models, surpassing prior speculative decoding techniques.
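To make the idea concrete, here is a minimal, purely illustrative Python sketch of the draft-and-verify loop that speculative decoding relies on. Everything in it is a toy stand-in rather than Apple’s implementation: target_next_token mimics one expensive forward pass of the full LLM, and draft_tokens mimics a cheap draft head guessing several tokens ahead.

```python
# Toy sketch of the draft-and-verify loop behind speculative decoding.
# All names are hypothetical stand-ins, not Apple's or Nvidia's code.
# A real implementation verifies the whole draft in a single batched
# forward pass, and ReDrafter additionally uses beam search plus
# dynamic tree attention to score overlapping candidates efficiently.

def target_next_token(context: list[int]) -> int:
    # Deterministic toy "model": next token is a hash of the context.
    return (sum(context) * 31 + 7) % 100

def draft_tokens(context: list[int], k: int = 4) -> list[int]:
    # The drafter guesses k tokens ahead; it is deliberately wrong at
    # some positions so the verify step has something to reject.
    ctx, out = list(context), []
    for _ in range(k):
        guess = (sum(ctx) * 31 + 7) % 100
        if len(ctx) % 3 == 0:          # inject occasional draft errors
            guess = (guess + 1) % 100
        out.append(guess)
        ctx.append(guess)
    return out

def generate(context: list[int], steps: int = 8) -> list[int]:
    accepted = []
    for _ in range(steps):
        draft = draft_tokens(context)
        n_ok = 0
        # Keep the longest draft prefix the target model agrees with...
        for tok in draft:
            if target_next_token(context) == tok:
                context.append(tok)
                n_ok += 1
            else:
                break
        # ...then append one token from the target model itself, so
        # every step yields at least one guaranteed-correct token.
        context.append(target_next_token(context))
        accepted.append(n_ok + 1)
    print("avg tokens per generation step:", sum(accepted) / len(accepted))
    return context

generate([1, 2, 3])
```

The more often the drafter’s guesses match the target model, the more tokens land per step, which is why Apple reports the gain in tokens per generation step rather than raw wall-clock time.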
While ReDrafter improved the model’s efficiency, on its own it did not boost speed as much as expected, so Apple integrated ReDrafter with Nvidia’s TensorRT-LLM framework.
As part of this partnership, Nvidia added new features to improve the decoding process. With ReDrafter running on Nvidia’s platform, Apple saw a 2.7x increase in token generation speed for some AI tasks. According to the company, this can reduce the latency of AI processing while using fewer GPUs and consuming less power.
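As a rough back-of-the-envelope illustration of why tokens per step translates into lower latency (our arithmetic, not Apple’s published methodology): if verifying a whole draft costs about one forward pass of the full model, and the draft head adds a small fractional overhead, the speedup is roughly the accepted tokens per step divided by that per-step cost.

```python
# Back-of-the-envelope latency model (illustrative assumptions only):
# each generation step costs ~1 full-model forward pass to verify the
# draft, plus a small hypothetical overhead for the draft head itself.

def effective_speedup(tokens_per_step: float,
                      drafter_overhead: float = 0.1) -> float:
    # drafter_overhead: assumed cost of the draft head, expressed as a
    # fraction of one full-model forward pass (made-up figure).
    return tokens_per_step / (1.0 + drafter_overhead)

for tps in (1.0, 2.0, 3.5):
    print(f"{tps:.1f} tokens/step -> ~{effective_speedup(tps):.1f}x faster")
```

Note that at 1.0 tokens per step the drafter is pure overhead and decoding is actually slightly slower than the baseline; the technique only pays off when the drafter’s guesses are accepted often. Fewer full-model passes per generated token is also why the company frames the gain in terms of GPU count and power, not just latency.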