Inside Meta’s Llama 3: What’s packed in the latest AI model?


WEB DESK: Meta recently released Llama Guard 3 to address safety concerns related to its latest AI model, Llama 3.

While Llama 3 includes advanced features, such as image, video, and speech integration, these capabilities are still under development and not yet widely available.

Read more: Apple to unveil ‘Apple Intelligence’ with Dynamic AI Processing at WWDC 2024

The Llama 3 AI model incorporates a 405-billion-parameter transformer and a 128,000-token context window. It has shown performance comparable to leading systems such as GPT-4.

Architecture and training

Llama 3 is built on what engineers call a “dense transformer architecture,” a design that processes information through a single large, uniformly connected network. The model handles long stretches of data efficiently through techniques such as grouped query attention (GQA), which lets groups of attention heads share memory, along with specialised attention masks.
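
For readers curious how GQA saves memory, the sketch below shows the core idea: several query heads share a single key/value head, so the cache of stored keys and values shrinks proportionally. The split of 128 query heads into 8 key/value heads is an illustrative assumption for this example, not Meta’s published configuration.

```python
def kv_head_for_query_head(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Map a query head to its shared key/value head.

    Each group of (n_q_heads // n_kv_heads) query heads reads the same
    key/value head, shrinking the KV cache by that same factor.
    """
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size

# With 128 query heads sharing 8 KV heads, query heads 0-15 all use KV head 0.
print(kv_head_for_query_head(0, 128, 8))    # first group
print(kv_head_for_query_head(127, 128, 8))  # last group
```

Compared with standard multi-head attention (one key/value head per query head), this layout would cut the key/value cache by a factor of 16 in the example above.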

It supports a wide range of languages with a vocabulary of 128,000 tokens (word pieces and symbols). The model’s ability to handle long and complex text is boosted by a special encoding method called rotary positional encoding (RoPE), which helps it keep track of word order over extended contexts.
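
The idea behind RoPE can be sketched in a few lines: each pair of features in a token’s representation is rotated by an angle that depends on the token’s position, so relative word order emerges naturally when the model compares tokens. This is a minimal illustration of the technique, not Meta’s implementation; the base constant of 10,000 is the value commonly used in the RoPE literature.

```python
import math

def rope_rotate(x: list[float], position: int, base: float = 10000.0) -> list[float]:
    """Rotate consecutive feature pairs by a position-dependent angle.

    Low-index pairs rotate quickly and high-index pairs slowly, so together
    they encode position at many scales. Rotation never changes a vector's
    length, only its direction.
    """
    out = []
    dim = len(x)
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out

# Position 0 means zero rotation: the vector comes back unchanged.
print(rope_rotate([1.0, 0.0], 0))
```

Because the encoding is a pure rotation, the dot product between two rotated tokens depends only on the *difference* of their positions, which is what makes RoPE work well over long contexts.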

The model itself is enormous, with 405 billion parameters, the learned numerical values that encode what the model knows. It has 126 layers stacked one after another and 128 attention heads, which let it focus on different parts of the data simultaneously.

Training Llama 3 required massive computational power. Meta used up to 16,000 advanced graphics processors (GPUs) to train the model. This process was optimized to make the most efficient use of these resources, allowing the model to learn effectively while managing data and memory efficiently.

To determine the best size for the model, Meta followed guidelines that balanced how well the model performs on various tasks with the computing power needed for training. This approach ensures that Llama 3 is both powerful and practical for real-world applications.

Post-training refinement

The Llama 3 AI model underwent a thorough post-training process to better align it with human feedback. This involved two key techniques: supervised finetuning (SFT) and direct preference optimisation (DPO). SFT gave the model additional training on curated examples to improve its performance, while DPO adjusted the model based on human preference data, bringing its behaviour closer to human expectations.
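
The core of DPO can be sketched on a single preference pair: the loss rewards the model (the “policy”) for assigning a larger log-probability margin to the preferred response, relative to a frozen reference model. The log-probability values and the beta setting below are illustrative assumptions, not figures from Meta’s training runs.

```python
import math

def dpo_loss(pol_chosen: float, pol_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair of responses.

    Inputs are log-probabilities of the chosen and rejected responses under
    the policy being trained and under a frozen reference model. The loss is
    -log sigmoid of the scaled margin: it falls as the policy favours the
    chosen response more strongly than the reference does.
    """
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When policy and reference agree exactly, the margin is 0 and loss is log 2.
print(dpo_loss(-1.0, -2.0, -1.0, -2.0))
```

A key design point of DPO is that this loss needs no separately trained reward model at optimisation time: the preference signal is expressed directly through the two log-probability margins.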

The post-training phase also included creating a reward model, which used feedback from human annotators to refine the model’s responses. Both human-generated and synthetic data were used in the finetuning process to enhance the model’s capabilities. New features were added, such as improved handling of multiple chat messages and better formatting, to make interactions more fluid and efficient.

To ensure high data quality, the training data was carefully curated through methods such as topic classification, quality scoring, and semantic deduplication. Techniques such as rejection sampling and PagedAttention were employed to boost the model’s efficiency, enabling it to process and respond to large volumes of data more swiftly.
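
Semantic deduplication can be sketched as a greedy filter over document embeddings: a document is kept only if it is not too similar to anything already kept. The 0.95 similarity threshold and the tiny two-dimensional embeddings below are illustrative assumptions; production pipelines use high-dimensional embeddings and approximate nearest-neighbour search.

```python
def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 for identical directions, 0.0 for unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

def semantic_dedup(embeddings: list[list[float]], threshold: float = 0.95) -> list[int]:
    """Return indices of documents to keep, dropping near-duplicates.

    A document survives only if every already-kept document's embedding
    sits below the similarity threshold.
    """
    kept: list[int] = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# The middle document points almost the same way as the first, so it is dropped.
print(semantic_dedup([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]))
```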

Evaluation and performance

The evaluation of Llama 3 included assessments at both the pre-training and post-training stages, with particular attention to safety.

The model demonstrated strong performance on standard benchmarks such as reading comprehension and coding, showing improved reliability and consistency. It excelled at complex tasks, although its performance differed between adversarial and non-adversarial scenarios.

Analysis of training data contamination indicated that overlaps in the data influenced the evaluation results, with varying effects on different benchmarks.

Optimisation techniques

To enhance its processing efficiency, Llama 3 employs two key techniques: pipeline parallelism and 8-bit floating point (FP8) quantisation. Because the model’s weights in brain floating point 16 (BF16) format are too large to fit in a single GPU’s memory, the model is split and run in parallel across multiple GPUs.
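
The division of labour in pipeline parallelism can be sketched as assigning consecutive blocks of layers to successive groups of GPUs (“stages”). The 16-stage layout below is an illustrative assumption, not Meta’s published configuration; only the 126-layer count comes from the figures mentioned earlier in this article.

```python
def pipeline_stage(layer: int, n_layers: int = 126, n_stages: int = 16) -> int:
    """Assign a transformer layer to a pipeline stage.

    Layers are split into contiguous blocks; each stage (a group of GPUs)
    runs one block and passes its activations to the next stage.
    """
    per_stage = -(-n_layers // n_stages)  # ceiling division
    return layer // per_stage

# The first layers land on stage 0; the final layer lands on the last stage.
print(pipeline_stage(0), pipeline_stage(125))
```

Within each stage, tensor parallelism then splits the work of individual layers across the GPUs of that stage, as the next paragraph describes.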

Pipeline parallelism helps manage communication between different machines, while tensor parallelism is used within each machine to handle large amounts of data efficiently. FP8 quantisation improves the speed of data processing by up to 50 per cent during the pre-fill stage, while still delivering performance levels similar to BF16.
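
The heart of FP8 quantisation is choosing a scale factor so that a tensor’s values fit the narrow 8-bit range before casting, then scaling results back afterwards. The sketch below computes a per-tensor scale for the common E4M3 format, whose largest representable magnitude is 448; scaling per whole tensor (rather than per row or per block, as production systems often do) is a simplifying assumption.

```python
def per_tensor_scale(values: list[float], fp8_max: float = 448.0) -> float:
    """Choose a scale so the largest magnitude maps onto FP8's range.

    Values are divided by this scale before the 8-bit cast, and outputs are
    multiplied back by it after the matrix multiply, so precision is spent
    where the tensor's values actually live.
    """
    amax = max(abs(v) for v in values)
    return amax / fp8_max if amax > 0 else 1.0

# A tensor whose peak magnitude is already 448 needs no rescaling.
print(per_tensor_scale([448.0, -100.0]))
```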

Multimodal capabilities

Integrating visual recognition into Llama 3 involves a two-step process. Initially, a pre-trained image encoder is combined with cross-attention layers to analyze visual data.

This is followed by adding video cross-attention layers to enable the model to understand sequences over time.

This approach sidesteps the difficulties associated with joint pre-training and boosts efficiency in processing visual information. The model is trained on pairs of images and text, and later fine-tuned with higher-resolution data to enhance its performance.

Speech integration

Llama 3 integrates speech capabilities using a layered approach. It combines a speech encoder with a text-to-speech system to handle both speech recognition and generation.

Read more: AI washing: What is it and why you should worry

The model supports 34 languages and includes features for speech recognition and translation. Evaluations show that Llama 3 performs well in translating speech and producing natural-sounding speech output.