Building trust into AI

At Amazon, AI now touches everything from warehouse logistics to customer service chatbots to the AWS cloud services used by thousands of enterprises, making it a business-critical technology. It is therefore imperative that the models Amazon develops and deploys be as safe, fair, and robust as possible: responsible AI (RAI) is not an optional add-on. As Rahul Gupta, senior science manager and RAI lead for Amazon’s Artificial General Intelligence (AGI) organization, puts it, “Responsibility is baked into the product design from day one.”

Amazon’s commitment to safety and responsibility long predates the generative-AI boom. Gupta and researchers on his team worked in the Alexa AI organization, where the company “developed some muscle on defining how RAI should be done.” The focus, he recalls, was on developing policies and implementations, together with methods to evaluate their effectiveness. As Amazon began building its own large models, the RAI expertise from Alexa proved a valuable resource.

In concert with Amazon’s policy team, AGI scientists have built an RAI pipeline that addresses four phases of model development: pretraining, post-training, evaluation, and third-party monitoring. At each stage, researchers grapple with distinct challenges to ensure that trustworthy systems can adapt, at scale, across situations, applications, and geographies. From this framework, Amazon has built more than 70 internal and external RAI tools, funded or published more than 500 research papers, and delivered tens of thousands of hours of RAI-focused training to its employees.

Amazon takes a three-pronged approach to RAI: anticipate risks before they materialize, teach models to navigate ambiguity, and build systems that can adapt — to government transitions, high-profile incidents, new regulations, and other social changes. Below are some of the scientists across Amazon’s responsible-AI and policy teams who put this approach into practice, each tackling a different phase of the AI lifecycle.

Teaching foundations: Pretraining

Chentao Ye is a senior applied scientist on the AGI RAI team, working on pretraining, the earliest stage of LLM training, in which the model develops general linguistic competencies. Addressing RAI at this stage has become increasingly critical, says Ye, to ensure that the model has the information necessary to adapt to policies established by Amazon’s policy team. “Pretraining is the stage where we teach our most fundamental concepts of RAI,” Ye says. “It’s like teaching a child about the world before we expect them to make some decisions.”

Pretraining typically involves large volumes of public data, but the RAI team augments that data with datasets specifically designed to instill principles of safety, security, and fairness. Those datasets are vast and diverse — a “rich diet” of content including internal and public RAI guidance, best practices, RAI-related news and incidents, and information about domains such as chemical and nuclear engineering and coding security, in text, audio, and images. The corpus also includes information in different languages and from different cultures, to ensure that the model is global and multilingual.

To help the model better incorporate this array of information, researchers create training tasks, also known as learning exercises, for it. “Having this data isn’t enough. We need to help the model process and understand it effectively,” Ye says. For instance, Ye and his colleagues might take a policy document about privacy and convert it into multiple learning exercises: explaining privacy concepts, answering questions about compliance, and determining whether certain actions would violate privacy guidelines. These varied tasks help the model develop a deeper, more nuanced understanding of RAI principles.
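To make the exercise-generation idea concrete, here is a minimal sketch of how a single policy passage might fan out into several task types. The passage, templates, and task names are invented for illustration; they are not Amazon’s actual data pipeline.

```python
# Hypothetical sketch: expanding one policy passage into several
# pretraining "learning exercises" of the kind described above.

POLICY_PASSAGE = (
    "Personal data may be retained only for as long as it is needed "
    "to fulfill the purpose for which it was collected."
)

# Each template recasts the same passage as a different task type.
TASK_TEMPLATES = {
    "explain": "Explain the following privacy principle in plain language:\n{passage}",
    "compliance_qa": (
        "According to this policy, is indefinite retention of personal data "
        "compliant?\nPolicy: {passage}\nAnswer with a brief justification."
    ),
    "violation_check": (
        "Policy: {passage}\nAction: A service keeps user addresses for ten "
        "years after account deletion.\nDoes this action violate the policy?"
    ),
}

def make_exercises(passage: str) -> list[dict]:
    """Expand one policy passage into multiple training examples."""
    return [
        {"task": name, "prompt": template.format(passage=passage)}
        for name, template in TASK_TEMPLATES.items()
    ]

for ex in make_exercises(POLICY_PASSAGE):
    print(f"[{ex['task']}] {ex['prompt'][:60]}...")
```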
Another active area of research is how to handle potentially harmful content in the training corpus. “It’s not simply about filtering everything out,” Ye explains. “If a model has never encountered certain harmful concepts during pretraining, it won’t recognize them as sensitive, making post-training guardrails less effective.” The team is exploring approaches that add educational context to certain filtered content before reintroducing it — teaching the model what harm looks like and why it should be avoided, rather than leaving it entirely unaware.

In addition to RAI knowledge acquisition, another area of focus is what’s called RAI modality alignment. LLMs need to understand how to apply RAI principles across all the modalities they encounter. Modality alignment maps other modalities into a semantic space they share with text, which is often more readily available, Ye explains. For example, a college textbook might include figures of high-risk chemical, biological, radiological, and nuclear (CBRN) materials alongside text descriptions of the same concepts. The team designs a range of LLM tasks that effectively encode the data into the same space.

One active research area is developing a variety of techniques to test pretraining quality, says Ye. The team is taking two complementary approaches. The first tests whether the model has actually acquired RAI knowledge during pretraining. “We use metrics like perplexity” — which quantifies how well a probability distribution predicts a given sample — “to measure how well the model can generate content in specific RAI domains,” Ye explains. The second approach tests how the model responds to sparse questions that might appear in later testing exercises, where the expected responses — like refusals or deflections — weren’t explicitly taught during pretraining. “This helps us test whether the RAI knowledge it gained during pretraining enables it to generalize to real-world scenarios with just limited examples or instructions,” Ye says.
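For a rough sense of what the first approach measures, the sketch below computes perplexity on a sample of RAI-domain text using an off-the-shelf causal language model from the Hugging Face transformers library. The model name and the sample text are stand-ins; the team’s internal checkpoints and evaluation corpora are not public.

```python
# Minimal perplexity probe, assuming a Hugging Face causal LM.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder for an internal pretraining checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

rai_sample = (
    "Requests for step-by-step instructions to synthesize hazardous "
    "compounds should be refused and redirected to safety resources."
)

inputs = tokenizer(rai_sample, return_tensors="pt")
with torch.no_grad():
    # With labels=input_ids, the model returns the mean per-token
    # cross-entropy (negative log-likelihood) over the sequence.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

# Perplexity is the exponential of the mean negative log-likelihood;
# lower values mean the model predicts this domain's text more confidently.
print(f"perplexity: {math.exp(loss.item()):.1f}")
```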
Post-training: Reinforcement learning from human feedback

Once models learn to follow instructions and produce responses that are both helpful and harmless, they advance to reinforcement learning from human feedback (RLHF). Senior applied scientist Charith Peris, who leads this phase of model development, and applied scientist Yao Ma explain that RLHF uses human feedback, or preference comparisons with human judgments, to give models a sense of judgment. “RLHF is done to make sure the foundation model aligns with the behavior expected by humans,” says Peris.

This stage of training provides the model with a reward based on how well its response to a query meets a predetermined criterion. The rewards are provided by various response verification systems. One approach uses so-called auxiliary-reward models, which are trained on outputs that humans have ranked. For responsible AI, this stage offers the opportunity to optimize the model to generate responses that are “policy adherent,” hewing to the rules and guidelines devised by Amazon’s policy team. “Providing the right rewards is a critical part of RLHF,” says Ma.

In one case, the core model itself is used to generate multiple responses to a range of unsafe and borderline-safe queries. These responses are ranked and rated by humans on their helpfulness and policy adherence and then used to train auxiliary-reward models. Another response verification approach uses an independent LLM as a judge. The model generates a response for each prompt in the training set, and this response, together with a set of rubrics describing what makes a response policy adherent, is passed to the judge. The judge is then instructed to provide a score based on how well the response aligns with the rubrics. The auxiliary-reward models and the judge-based systems can be used individually or in combination to provide RLHF rewards.
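A common way to train reward models on human-ranked outputs is the pairwise Bradley-Terry objective, which pushes the score of the preferred, policy-adherent response above that of the rejected one. The exact objective Amazon uses is not described here, so the sketch below shows that standard formulation with toy scores in place of a real model.

```python
# Pairwise (Bradley-Terry) reward-model loss: a common, but here assumed,
# objective for training on human preference rankings.
import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Minimizing -log sigmoid(r_chosen - r_rejected) drives the reward
    # model to score the human-preferred response higher.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy scores a reward model might assign to three response pairs.
r_chosen = torch.tensor([1.2, 0.4, 2.0])     # preferred, policy-adherent responses
r_rejected = torch.tensor([0.3, 0.9, -0.5])  # dispreferred responses
print(pairwise_reward_loss(r_chosen, r_rejected))  # shrinks as chosen > rejected
```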
The model is evaluated in two phases: during and after training. In the first phase, the model is tested at frequent, short intervals using lightweight benchmarks that provide directional signals on performance across critical capabilities. In the second phase, saved checkpoints, each a complete snapshot of the model’s state and parameters at a given point in training, are systematically evaluated against a broader set of test data to identify which checkpoint achieves the best overall performance.

Behavior in check: Evaluations

A major focus of the evaluations team is building model-breaking datasets — robust collections of prompts that trigger inappropriate, unsafe, or policy-violating responses. “We know models are improving month over month,” says Jwala Dhamala, a senior scientist with Amazon AGI. Bigger, better responsible-AI datasets are playing a large part in this, she says, as are improved mechanisms for capturing how well models incorporate responsible-AI principles across multiple modalities and regions.

Working closely with Amazon’s policy team, Dhamala says, is key to developing evaluations for RAI. Amazon’s RAI work has eight pillars: privacy and security; safety; fairness; veracity and robustness; explainability; controllability; governance; and transparency. “For each pillar, we focus on tests that could lead the model to output something that violates responsible-AI policies. Simultaneously, we focus on testing if a model is refusing excessively or refusing to respond to benign requests,” Dhamala explains. The data comes from everywhere: human experts known as red teamers who try to break models, external security partners, public benchmarks from universities, even social media, where real-world problems surface organically.

The RAI team evaluates models throughout the model-training and deployment cycle, Dhamala explains, from pretraining to post-training and predeployment, when all the scaffolding is attached. Each stage has its own specially designed evaluation processes, and more testing happens in the later stages, when the model is closer to end users. “We collect datasets, evaluate, then collect new datasets, evaluate again,” Dhamala says. She adds that the team is currently working to automate more of the evaluation process.

It’s also pushing into newer areas of research. Deception in conversations that require many back-and-forth interactions over weeks or months (also called long-horizon interactions) is emerging as a concern, but there aren’t many established benchmarks for detecting it. Creating them requires an understanding of what deception means across different long-horizon contexts, an understanding grounded in social-science research.

Another open area of research is an automatic red-teaming framework to evaluate emerging responsible-AI risks. The idea is that an autonomous agent or a system of agents would compete or collaborate in attempts to provoke undesired behaviors.

Third-party collaborations: Frontier risks

While most RAI work addresses common misuse patterns, Tong Wang, a senior applied scientist with AGI, focuses on a different category of risk: frontier risks, or “systemic risks that could take down entire systems.” These include the use of AI models to research CBRN (chemical, biological, radiological, and nuclear) attacks and to research or launch cyberattacks. These are scenarios in which AI capabilities could enable nonexperts to cause catastrophic harm.

The evaluation process for frontier risks is exacting. First, automated benchmarks test whether the model has acquired dangerous knowledge. If it passes certain thresholds — answering questions about weapons of mass destruction with concerning accuracy — that triggers human review. Third-party experts in relevant domains evaluate whether the model has crossed safety boundaries. And the process is ongoing: with each model update, the team compares the new model’s capabilities against those of earlier models. “We have to be very careful,” Wang says. “False positives and false negatives both have costs.”

With public models, identified risks are mitigated by guardrails: when a person asks about a particular topic at a particular level of specificity, the model simply won’t respond. But legitimate researchers — scientists at universities and labs with relevant expertise and appropriate oversight — may need access to restricted information for their work. Wang’s team is exploring mechanisms to provide “specialized access with heavy monitoring” for these trusted users. Those mechanisms involve what Wang calls “configurability”: using techniques like low-rank adaptors (LoRA) to make surgical changes to a model’s behavior for specific use cases, without retraining the entire model. “We add configuration on top that doesn’t touch the base model itself,” he says. “You’re not retraining a billion parameters, just a few.”
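As a rough illustration of the configurability Wang describes, the sketch below wraps a single frozen linear layer with a low-rank adapter: the base weights are untouched, and only the two small adapter matrices are trainable. Dimensions and hyperparameters are toy values, not details of Amazon’s models.

```python
# Minimal LoRA sketch: a frozen base layer plus a trainable low-rank update.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the base model itself is never touched
        # delta_W = B @ A has rank `rank`, so it adds far fewer parameters than W.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable:,} of {total:,} parameters")  # a tiny fraction
```

Because the update is additive and the base weights stay frozen, different adapters can in principle be swapped in per use case, which is one way such specialized configurations could be served without maintaining separate full models.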
Today, this approach is already in use for certain content policies. But extending it to frontier risks like CBRN is a harder problem; both the data collection and the computational costs are significantly higher. “It’s an open research area, studying which approaches work best,” Wang notes.

Agreed-upon values: Writing the policies

“We partner with the Amazon science team throughout the entire model development lifecycle,” explains Claire O’Brien Rajkumar, leader of the responsible-AI policy and product team. The process starts with understanding what a product team wants to launch — whether it’s an image generation model or a large language model — and mapping potential harms against Amazon’s eight core dimensions of responsible AI. Before building an image generator, for instance, the team might anticipate risks such as deepfakes, bias amplification (for instance, images depicting doctors only as white males), or attempts to generate disturbing content.

Identified risks are translated into specific policies that define behavioral boundaries for the model under development. These policies become “backward-working guidelines,” O’Brien Rajkumar says, that inform every subsequent decision during model building. For instance, rather than sourcing images from a single vendor that might show only white male doctors, the team ensures diverse data collection that reflects the complexity of the real world.

Amazon’s policies are informed by factors including industry trends, customer requests, regulations, and legal requirements, particularly around copyright and content licensing. The team actively participates in industry groups like the Frontier Model Forum and the Partnership on AI, collaborating with competitors to establish best practices in an under-regulated space. Academic partnerships help identify emerging risks through the development of benchmarks and through engagements such as the Trusted AI track of the Amazon Nova AI Challenge, in which university students compete to identify safety vulnerabilities in Nova models and the associated fixes. Customer feedback shapes practical policy decisions, such as carving out exceptions for legitimate use cases like LLM-based security testing, even when the general policy prohibits malware generation.

The policy team operates through cross-functional working groups that include legal, public-policy, product, security, and RAI experts. Regulatory developments like the EU AI Act and California’s AI Transparency Act directly influence policy evolution. “These are living, breathing things,” O’Brien Rajkumar notes, acknowledging that policies must adapt as society becomes more or less comfortable with certain AI risks.

Beyond policy development and specific responsible-product guidelines, the team manages the implementation of AI safeguards and oversees red-teaming operations using both in-house experts and third-party vendors. It also conducts manual reviews of model outputs to assess real-world risk. “These are high-judgment decisions, working on the boundaries of what violates policy or not,” says O’Brien Rajkumar. “We have to really understand what each policy means in practice.”
