Spaces: Arm/README

ARMMcBrideT committed on Sep 26, 2024

Commit e32329a • 1 Parent(s): 4735641

Update README.md

README.md +1 -1

@@ -10,7 +10,7 @@ pinned: false
 <p>Arm’s AI development resources ensure you can deploy at pace, achieving best performance on Arm by default. Our aim is to make your AI development easier, ensuring integration with all major operating systems and AI frameworks, enabling portability for deploying AI on Arm at scale.</p>
     <p>Discover below some key resources and content from Arm, including our software libraries and tools, that enable you to optimize for Arm architectures and pass-on significant performance uplift for models – from traditional ML and computer vision workloads to small and large language models - running on Arm-based devices.</p>
     <br>
-    <strong>Arm and Meta: Llama 3.2<br>Accelerated cloud to edge AI performance</strong>
+    <strong>Arm and Meta: <a href="https://huggingface.co/collections/meta-llama/llama-32-66f448ffc8c32f949b04c8cf" target="_blank">Llama 3.2</a><br>Accelerated cloud to edge AI performance</strong>
     <p>The availability of smaller LLMs that enable fundamental text-based generative AI workloads, such as Llama 3.2 1B and 3B, are critical to enabling AI inference at scale. Running the new Llama 3.2 3B LLM on Arm-powered mobile devices through the Arm CPU optimized kernel leads to a 5x improvement in prompt processing and 3x improvement in token generation, achieving 19.92 tokens per second in the generation phase. This means less latency when processing AI workloads on the device and a far faster overall user experience. Also, the more AI processed at the edge, the more power that is saved from data traveling to and from the cloud, leading to energy and cost savings.</p>
     <p>Alongside running small models at the edge, we are also able to run larger models, such as Llama 3.2 11B and 90B, in the cloud. The 11B and 90B models are a great fit for CPU based inference workloads in the cloud that generate text and image, as our data on Arm Neoverse V2 shows. When we run the 11B image and text model on the Arm-based AWS Graviton4, we can achieve 29.3 tokens per second in the generation phase. When you consider that the human reading speed is around 5 tokens per second, it’s far outpacing that.</p>
   <ul><p>