Apple has released a detailed technical paper outlining the development of its Apple Intelligence models, which power a variety of new AI features coming soon to iOS, macOS, and iPadOS. In this paper, Apple addresses concerns about the ethical implications of its training methods, emphasizing its commitment to user privacy and responsible data usage.
Apple’s paper comes as a response to accusations that it used questionable methods to train some of its AI models. The company insists that no private Apple user data was used in training; instead, it relied on publicly available and licensed material. “The pre-training data set consists of data we have licensed from publishers, curated publicly available or open-sourced datasets, and publicly available information crawled by our web crawler, Applebot,” Apple states.
A report by Proof News alleged that Apple used a dataset called The Pile, which includes subtitles from a vast number of YouTube videos. This raised concerns because many YouTube creators neither knew about nor consented to their content being used this way. Apple responded by clarifying that models trained on that data would not power its AI features.
The technical paper sheds light on the Apple Foundation Models (AFM), first introduced at WWDC 2024. Apple emphasizes that the AFM models were trained using data sourced in a “responsible” manner. This includes publicly available web data and licensed data from undisclosed publishers. According to reports, Apple engaged in negotiations with several major publishers, such as NBC and Condé Nast, for multi-year deals to use their news archives for model training.
Apple also trained its AFM models on open-source code hosted on GitHub, spanning languages such as Swift, Python, C, and JavaScript. The practice is contentious among developers, many of whom argue that their licenses were never intended to cover AI training. Apple says it made an effort to filter for code repositories with minimal usage restrictions, such as those under MIT, ISC, or Apache licenses.
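The paper doesn’t describe the filtering pipeline itself, but a license allowlist of this kind is easy to sketch. The Python snippet below is a hypothetical illustration, not Apple’s tooling: it assumes each repository record carries an SPDX-style license identifier in its metadata, and the `ALLOWED_LICENSES` set and sample records are inventions for this example.

```python
# Hypothetical sketch of a license allowlist filter. The repo records,
# metadata fields, and ALLOWED_LICENSES set are assumptions for this
# example; Apple's actual pipeline is not public.

ALLOWED_LICENSES = {"MIT", "ISC", "Apache-2.0"}  # permissive, low-restriction

def filter_permissive(repos):
    """Yield only repositories whose declared license is on the allowlist."""
    for repo in repos:
        if repo.get("license") in ALLOWED_LICENSES:
            yield repo

repos = [
    {"name": "swift-utils", "license": "Apache-2.0"},
    {"name": "copyleft-lib", "license": "GPL-3.0"},  # excluded: copyleft terms
    {"name": "unlabeled", "license": None},          # excluded: unknown license
]

print([r["name"] for r in filter_permissive(repos)])  # ['swift-utils']
```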
To strengthen the models’ mathematical capabilities, Apple included math questions and answers drawn from web pages, math forums, blogs, and tutorials. It also used high-quality, publicly available datasets whose licenses permit AI training, with sensitive information filtered out.
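Apple hasn’t published how that filtering works, but one common baseline technique is regex-based redaction of obvious personal data before text enters a training corpus. The sketch below assumes that approach; the patterns and placeholder tokens are illustrative only, and production pipelines typically pair such rules with learned classifiers.

```python
import re

# Hypothetical sketch of regex-based PII redaction. The patterns and
# placeholders are assumptions for illustration; Apple's actual
# sensitive-data filter has not been disclosed.

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text: str) -> str:
    """Replace each match of a PII pattern with a placeholder token."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

sample = "Solve 3x + 5 = 20. Email tutor@example.com or call +1 555-123-4567."
print(redact(sample))
# Solve 3x + 5 = 20. Email [EMAIL] or call [PHONE].
```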
The training dataset for the AFM models comprises about 6.3 trillion tokens, a substantial corpus, though well under half the 15 trillion tokens Meta used to train its Llama 3.1 405B model. Apple also fine-tuned the AFM models with human feedback and synthetic data, aiming to curb undesirable behaviors such as toxicity.
Apple asserts that its AI models are designed to assist users with everyday tasks across Apple products, adhering to the company’s core values and responsible AI principles. “Our models have been created with the purpose of helping users do everyday activities across their Apple products, grounded in Apple’s core values, and rooted in our responsible AI principles at every stage,” the company states.
While the paper doesn’t reveal any groundbreaking insights, it aligns with Apple’s strategy to avoid legal complications while positioning itself as an ethical leader in the AI industry. The company notes that webmasters can block its crawler from accessing their data, though this doesn’t entirely resolve concerns for individual creators whose content may be used without consent.
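The blocking mechanism is the standard robots.txt protocol, which Apple documents Applebot as honoring. Apple also documents a separate Applebot-Extended user agent that lets sites opt out of having their content used to train its foundation models while remaining crawlable for search features such as Siri and Spotlight. A minimal opt-out file might look like this:

```
# robots.txt — example opt-out directives for Apple's crawler.
# Applebot-Extended controls use of content for Apple's AI training;
# Applebot itself powers search features such as Siri and Spotlight.

User-agent: Applebot-Extended
Disallow: /

# To block Apple's crawler from the site entirely:
User-agent: Applebot
Disallow: /
```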
The legality of training AI models with scraped public web data remains a contentious issue, subject to ongoing lawsuits and debates over fair use doctrine. Apple aims to navigate these complexities by maintaining transparency and ethical practices in its AI development.
Apple’s technical paper serves as a reassurance to its users and stakeholders, highlighting its responsible approach to AI model training. As the landscape of generative AI continues to evolve, Apple strives to set a standard for ethical data use and user privacy.