At the just-concluded Worldwide Developers Conference, Apple announced Apple Intelligence, a new personal intelligence system that is deeply integrated into iOS 18, iPadOS 18, and macOS Sequoia.
Apple Intelligence comprises multiple highly capable generative models designed for users' everyday tasks. In a newly published blog post, Apple detailed two of these models:
- An on-device language model with approximately 3 billion parameters;
- A larger server-based language model that runs on Apple servers via Private Cloud Compute.
These two base models are part of Apple’s generative model family, and Apple said they will share more information about this model family in the near future.
In the blog post, Apple spends considerable time explaining how it developed high-performance, fast, and energy-efficient models; how it trains these models; how it fine-tunes adapters for specific user needs; and how it evaluates the models' ability to provide help while avoiding unintended harm.
Modeling overview of the Apple base models
Pre-training
The base models are trained on the AXLearn framework, an open-source project Apple released in 2023. The framework is built on JAX and XLA, enabling efficient and scalable training across a variety of hardware and cloud platforms, including TPUs and both cloud and on-premises GPUs. In addition, Apple uses techniques such as data parallelism, tensor parallelism, sequence parallelism, and FSDP to scale training along multiple dimensions, including data, model size, and sequence length.
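Apple has not published the corresponding AXLearn code, but data parallelism in JAX can be sketched roughly as follows; the toy linear model, mesh layout, and names below are assumptions for illustration, not Apple's setup.

```python
# Minimal JAX data-parallel training step (illustrative; not Apple's AXLearn code).
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# 1-D device mesh; the single "data" axis shards the batch across devices.
mesh = Mesh(np.array(jax.devices()), axis_names=("data",))
batch_sharding = NamedSharding(mesh, P("data"))   # shard rows of the batch
replicated = NamedSharding(mesh, P())             # replicate parameters

def loss_fn(params, x, y):
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)

@jax.jit
def train_step(params, x, y, lr=1e-3):
    grads = jax.grad(loss_fn)(params, x, y)
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

params = {"w": jnp.zeros((16, 1)), "b": jnp.zeros((1,))}
params = jax.device_put(params, replicated)
x = jax.device_put(jnp.ones((32, 16)), batch_sharding)  # batch split over devices
y = jax.device_put(jnp.ones((32, 1)), batch_sharding)
params = train_step(params, x, y)  # XLA inserts the cross-device gradient reduction
```

Tensor, sequence, and FSDP-style parallelism follow the same pattern with additional mesh axes that shard the model weights and sequence dimension as well as the batch.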
Apple trains its base models on licensed data, including data specifically selected to enhance certain features, as well as publicly available data collected by Apple's web crawler, AppleBot. Web publishers can opt out of having their content used to train Apple Intelligence through data usage controls.
Apple never uses users' private personal data when training its base models. To protect privacy, it applies filters to remove personally identifiable information that is publicly available on the Internet, such as credit card numbers, and it filters out profanity and other low-quality content to keep it out of the training corpus. Beyond these filtering measures, Apple also performs text extraction and deduplication, and uses model-based classifiers to identify and select high-quality documents for training.
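Apple's actual pipeline is not public in code form; the sketch below only illustrates the kinds of steps described (PII scrubbing, low-quality filtering, deduplication), with the regex, word list, and thresholds invented for the example.

```python
# Illustrative data-cleaning sketch for web text (not Apple's actual pipeline).
import hashlib
import re

CREDIT_CARD = re.compile(r"\b(?:\d[ -]?){13,16}\b")   # crude credit-card-like pattern
PROFANITY = {"badword1", "badword2"}                   # placeholder word list

def scrub_pii(text: str) -> str:
    """Remove publicly posted identifiers such as credit-card-like numbers."""
    return CREDIT_CARD.sub("[REDACTED]", text)

def is_low_quality(text: str) -> bool:
    """Drop profanity and very short documents; a real system would also use a
    model-based quality classifier here."""
    words = text.lower().split()
    return len(words) < 20 or any(w in PROFANITY for w in words)

def dedup_key(text: str) -> str:
    """Hash of normalized text for exact-duplicate removal."""
    return hashlib.sha256(" ".join(text.split()).lower().encode()).hexdigest()

def clean_corpus(docs):
    seen = set()
    for doc in docs:
        doc = scrub_pii(doc)
        if is_low_quality(doc):
            continue
        key = dedup_key(doc)
        if key in seen:
            continue
        seen.add(key)
        yield doc
```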
Post-training
Apple found that data quality is critical to model quality, so it adopted a hybrid data strategy during training, combining human-annotated data and synthetic data, along with comprehensive data curation and filtering procedures. Apple developed two new algorithms for the post-training stage: (1) a rejection sampling fine-tuning algorithm with a "teacher committee", and (2) a reinforcement learning from human feedback (RLHF) algorithm with mirror descent policy optimization and a leave-one-out advantage estimator. These two algorithms significantly improved the model's instruction-following quality.
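Apple has not released the full algorithm details, but the leave-one-out advantage estimator itself is simple to illustrate: for each of K responses sampled for the same prompt, the baseline is the mean reward of the other K-1 responses, and the advantage is the reward minus that baseline. The reward numbers below are made up.

```python
# Sketch of a leave-one-out advantage estimator for RLHF (illustrative only;
# Apple's exact formulation in its mirror-descent RLHF setup is not public).
import numpy as np

def leave_one_out_advantages(rewards: np.ndarray) -> np.ndarray:
    """rewards holds r_1..r_K for K responses to one prompt. The baseline for
    sample i is the mean reward of the other K-1 samples; the advantage is
    the reward minus that baseline."""
    k = rewards.shape[0]
    baselines = (rewards.sum() - rewards) / (k - 1)
    return rewards - baselines

# Example: four sampled responses scored by a reward model (made-up numbers).
print(leave_one_out_advantages(np.array([0.2, 0.9, 0.4, 0.5])))
```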
Optimization
In addition to ensuring that the generative models themselves perform well, Apple uses a variety of techniques to optimize them both on device and in its private cloud for speed and efficiency. In particular, it has heavily optimized inference for generating the first token and subsequent tokens (a token is the basic unit of text a model processes, roughly a word or word fragment) so the models respond quickly and run efficiently.
Apple uses grouped query attention in both the on-device and server models to improve efficiency. To reduce memory requirements and inference cost, it uses shared input and output vocabulary embedding tables, so the embedding matrices are not duplicated. The on-device model has a vocabulary of 49,000 tokens, while the server model has a vocabulary of 100,000 tokens.
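For intuition, here is a minimal grouped-query-attention sketch in NumPy; the head counts and dimensions are invented and causal masking is omitted, so this is not the Apple models' actual configuration.

```python
# Minimal grouped-query attention (GQA) sketch (illustrative only).
import numpy as np

def gqa(q, k, v, n_q_heads=8, n_kv_heads=2):
    """q: (T, n_q_heads, d); k, v: (T, n_kv_heads, d).
    Each group of n_q_heads // n_kv_heads query heads shares one K/V head,
    shrinking the KV cache relative to standard multi-head attention."""
    T, _, d = q.shape
    group = n_q_heads // n_kv_heads
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group                              # shared K/V head index
        scores = q[:, h] @ k[:, kv].T / np.sqrt(d)   # (T, T)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[:, h] = weights @ v[:, kv]
    return out

T, d = 16, 64
q = np.random.randn(T, 8, d); k = np.random.randn(T, 2, d); v = np.random.randn(T, 2, d)
print(gqa(q, k, v).shape)  # (16, 8, 64)
```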
For on-device inference, Apple uses low-bit palletization, a key optimization technique that meets the necessary memory, power, and performance requirements. To maintain model quality, Apple also developed a new framework that uses LoRA adapters with a mixed 2-bit and 4-bit configuration strategy (averaging 3.5 bits per weight) to achieve the same accuracy as the uncompressed model.
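Palletization here means clustering weights into a small lookup table (a "palette") and storing only per-weight indices. The sketch below is an illustrative toy version, not Apple's implementation; the closing comment shows one mix of 2-bit and 4-bit tensors that would average 3.5 bits per weight, though Apple does not state the actual proportions.

```python
# Sketch of weight palletization: cluster weights into a small palette and
# store per-weight indices (illustrative; not Apple's implementation).
import numpy as np

def palettize(weights: np.ndarray, bits: int):
    """Quantize weights to 2**bits palette entries via a tiny Lloyd/k-means loop."""
    k = 2 ** bits
    centers = np.quantile(weights, np.linspace(0, 1, k))  # initial palette
    for _ in range(10):
        idx = np.abs(weights[:, None] - centers[None, :]).argmin(axis=1)
        for c in range(k):
            if np.any(idx == c):
                centers[c] = weights[idx == c].mean()
    return centers, idx                                    # palette + indices

w = np.random.randn(4096)
palette, codes = palettize(w, bits=2)     # 4-entry palette, 2 bits per weight
reconstructed = palette[codes]
print(palette.shape, codes.max())         # (4,) 3
# A 1:3 mix of 2-bit and 4-bit tensors would average 0.25*2 + 0.75*4 = 3.5 bits
# per weight (the exact mix Apple uses is not stated; 3.5 is the published average).
```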
In addition, Apple used Talaria, an interactive model latency and power analysis tool, applied activation quantization and embedding quantization, and developed an approach for efficient key-value (KV) cache updates on its Neural Engine.
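Apple's Neural Engine implementation is not public, but the general idea of a KV cache (precomputing and reusing the keys and values of earlier tokens so each decoding step only processes the newest token) can be sketched generically.

```python
# Generic KV cache sketch for autoregressive decoding (illustrative;
# Apple's Neural Engine implementation will differ).
import numpy as np

class KVCache:
    """Preallocated cache updated in place, so each decoding step only computes
    K/V for the newest token instead of re-encoding the whole prefix."""
    def __init__(self, max_len: int, n_kv_heads: int, head_dim: int):
        self.k = np.zeros((max_len, n_kv_heads, head_dim), dtype=np.float16)
        self.v = np.zeros((max_len, n_kv_heads, head_dim), dtype=np.float16)
        self.pos = 0

    def append(self, k_new: np.ndarray, v_new: np.ndarray):
        self.k[self.pos] = k_new
        self.v[self.pos] = v_new
        self.pos += 1

    def view(self):
        return self.k[: self.pos], self.v[: self.pos]

cache = KVCache(max_len=2048, n_kv_heads=2, head_dim=64)
cache.append(np.random.randn(2, 64), np.random.randn(2, 64))
k, v = cache.view()   # only the filled prefix is attended over
```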
With this series of optimizations, on an iPhone 15 Pro the on-device model reaches a time-to-first-token latency of about 0.6 milliseconds per prompt token and a generation rate of 30 tokens per second.
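For a rough sense of scale (the 512-token prompt length below is a made-up example, not a figure from Apple):

```python
# Back-of-the-envelope latency arithmetic from the figures above.
prompt_tokens = 512                     # hypothetical prompt length
ttft_ms = 0.6 * prompt_tokens           # ~0.6 ms per prompt token -> ~307 ms
per_token_ms = 1000 / 30                # 30 tokens/s -> ~33 ms per generated token
print(f"time to first token: ~{ttft_ms:.0f} ms")
print(f"per generated token: ~{per_token_ms:.0f} ms")
```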
Model Adaptation
Apple fine-tunes the base models for users' everyday activities and can dynamically specialize them for the task at hand.
The research team used adapters (small neural network modules that can be plugged into the layers of a pre-trained model) to fine-tune the model for specific tasks. Specifically, the adapters adjust the attention matrices, the attention projection matrices, and the fully connected layers in the point-wise feed-forward networks.
By fine-tuning only the adapter layer, the original parameters of the pre-trained base model remain unchanged, preserving the general knowledge of the model while tailoring the adapter layer to support specific tasks.
Adapter parameter values are represented with 16 bits. For the on-device model with about 3 billion parameters, the parameters of a rank-16 adapter typically require tens of megabytes. Adapters can be dynamically loaded, temporarily cached in memory, and swapped, which allows the base model to specialize on the fly for the current task while managing memory efficiently and keeping the operating system responsive.
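As an illustration of the adapter approach, here is a minimal LoRA-style layer with a back-of-the-envelope size check; the dimensions are invented and this is not Apple's code.

```python
# LoRA-style adapter sketch: a frozen linear layer plus a low-rank update
# (illustrative dimensions; not the Apple models' actual shapes or code).
import numpy as np

class LoRALinear:
    def __init__(self, w_frozen: np.ndarray, rank: int = 16):
        d_in, d_out = w_frozen.shape
        self.w = w_frozen                              # pre-trained weight, never updated
        self.a = np.random.randn(d_in, rank) * 0.01    # trainable down-projection
        self.b = np.zeros((rank, d_out))               # trainable up-projection, init zero

    def __call__(self, x: np.ndarray) -> np.ndarray:
        return x @ self.w + x @ self.a @ self.b        # base output + low-rank correction

layer = LoRALinear(np.random.randn(2048, 2048), rank=16)
y = layer(np.random.randn(4, 2048))

# Size check: a rank-16 adapter on a 2048x2048 layer adds 16*(2048+2048) = 65,536
# parameters; at 16 bits (2 bytes) each that is ~128 KB per adapted matrix, so
# adapting many matrices across a ~3B-parameter model plausibly lands in the
# tens-of-megabytes range reported above.
```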
To facilitate training of adapters, Apple created an efficient infrastructure to quickly retrain, test, and deploy adapters when the base model or training data is updated.
Performance Evaluation
Apple focuses on human evaluation when benchmarking its models because the results of human evaluation are highly correlated with the user experience of the product.
To evaluate the product-specific summarization features, the research team used a set of 750 responses carefully sampled for each use case. The evaluation dataset emphasizes the variety of inputs that the product features may face in production and includes a hierarchical mixture of single and stacked documents of different content types and lengths. Experimental results found that the model with the adapter was able to generate better summaries than similar models.
As part of responsible development, Apple identified and evaluated specific risks inherent in summarization. For example, summaries can occasionally drop important nuance or other details. However, the research team found that in over 99% of targeted adversarial examples, the summarization adapter did not amplify sensitive content.
Figure 3: Share of “good” and “bad” responses for the summary use case.
In addition to evaluating the specific features supported by the base model and adapters, the research team also evaluated the general functionality of the on-device model and the server-based model. Specifically, the research team used a comprehensive set of real-world prompts to test the model capabilities, covering tasks such as brainstorming, classification, closed question answering, encoding, extraction, mathematical reasoning, open question answering, rewriting, security, summarization, and writing.
The research team compared the model with open source models (Phi-3, Gemma, Mistral, DBRX) and commercial models of comparable size (GPT-3.5-Turbo, GPT-4-Turbo). The results showed that Apple’s model was more favored by human evaluators than most competing models. For example, Apple’s on-device model has about 3B parameters and outperforms larger models, including Phi-3-mini, Mistral-7B, and Gemma-7B; the server model compares favorably with DBRX-Instruct, Mixtral-8x22B, and GPT-3.5-Turbo, while being highly efficient.
Figure 4: Proportion of preferred responses in evaluations of the Apple-based model and comparable models.
The research team also used a set of different adversarial prompts to test the model’s performance on harmful content, sensitive topics, and facts, measuring the model’s violation rate as assessed by human evaluators, with lower numbers being better. In the face of adversarial prompts, both the on-device model and the server model were robust, with lower violation rates than the open source and commercial models.
Figure 5: Proportion of responses for harmful content, sensitive topics, and factual violations (lower is better). Apple’s model is very robust when faced with adversarial prompts.
Given the extensive capabilities of large language models, Apple is actively engaging with internal and external teams on manual and automated red teaming to further assess the security of the models.
Figure 6: Proportion of preferred responses in a side-by-side evaluation of the Apple base model and comparable models on safety prompts. Human evaluators found the Apple base model's responses to be safer and more helpful.
To further evaluate the models, the research team used the Instruction-Following Eval (IFEval) benchmark to compare their instruction-following capabilities with models of comparable size. The results showed that both the on-device and server models followed detailed instructions better than similarly sized open source and commercial models.
Figure 7: Instruction-following capability of the Apple base models and similarly sized models (measured with the IFEval benchmark).
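For context, IFEval scores models on "verifiable" instructions that can be checked programmatically (for example, word limits or required keywords); the simplified checks below illustrate the idea and are not the benchmark's actual implementation.

```python
# Simplified illustration of IFEval-style verifiable instruction checks
# (not the actual IFEval code).
def within_word_limit(response: str, max_words: int) -> bool:
    return len(response.split()) <= max_words

def contains_keyword(response: str, keyword: str) -> bool:
    return keyword.lower() in response.lower()

response = "Apple Intelligence pairs an on-device model with a server model."
checks = [
    within_word_limit(response, max_words=50),   # "answer in at most 50 words"
    contains_keyword(response, "on-device"),     # "mention on-device"
]
print(f"instructions followed: {sum(checks)}/{len(checks)}")
```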
Apple also evaluated the model’s writing abilities, involving a variety of writing instructions.
Figure 8: Writing ability (higher is better).
Reference link: machinelearning.apple.com/research/in…