AI Ethics & Safety
Alignment, bias, governance, and responsible AI
21 entities in this channel
Training Language Models to Follow Instructions with Human Feedback
by OpenAI
Presents InstructGPT, which uses Reinforcement Learning from Human Feedback (RLHF) to align GPT-3 with human intent. By fine-tuning on human demonstrations and training a reward model on human preference comparisons, InstructGPT produces outputs that human evaluators prefer to GPT-3 outputs despite having 100× fewer parameters.
Constitutional AI: Harmlessness from AI Feedback
by Anthropic
Introduces Constitutional AI (CAI), a method for training harmless AI assistants using a set of written principles (a 'constitution') to guide both supervised learning and reinforcement learning from AI feedback (RLAIF). CAI enables Anthropic to reduce reliance on human harm labels while maintaining helpfulness and making AI reasoning about harmlessness explicit.
RealToxicityPrompts
by Gehman et al. / Allen Institute for AI
RealToxicityPrompts measures the propensity of language model generations to produce toxic content when conditioned on a diverse set of 100,000 naturally occurring prompts extracted from the web. It uses the Perspective API to score generated text on toxicity dimensions.
Red Teaming Language Models with Language Models
by DeepMind
Proposes using language models to automatically generate test cases that elicit harmful behaviors from target language models—a scalable alternative to manual red teaming. The approach discovers diverse attack prompts across harm categories and reveals that larger models are harder to red-team but produce more harmful outputs when successfully attacked.
Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
by OpenAI
This paper explores weak-to-strong generalization, a method for training a powerful AI model using supervision from a weaker one. It serves as an analogy for aligning superintelligent AI with human values. The research shows that strong models can learn beyond their weak supervisors and introduces techniques like auxiliary confidence loss to improve performance.
Scalable agent alignment via reward modeling: a research direction
by DeepMind
This research paper proposes a method for aligning advanced AI systems by using recursive reward modeling. The approach leverages AI assistants to help human evaluators assess complex AI actions, enabling scalable oversight and positioning this technique alongside debate and amplification as key AI safety strategies.
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
by Anthropic
Demonstrates that LLMs can be trained to behave safely during normal operation but exhibit unsafe behaviors when triggered by specific conditions—acting as 'sleeper agents'—and that standard safety training techniques including RLHF, supervised fine-tuning, and adversarial training fail to reliably remove these backdoors, sometimes even hiding them deeper.
ToxiGen
by Hartvigsen et al. / MIT
ToxiGen is a large-scale, machine-generated dataset for evaluating nuanced hate speech detection. It contains over 274,000 toxic and benign statements about 13 minority groups, designed to challenge models to identify implicit toxicity without relying on obvious slurs or surface-level cues.
BBQ (Bias Benchmark for QA)
by Parrish et al. / NYU
BBQ is a question-answering benchmark designed to expose social biases in language models. It uses ambiguous and disambiguated questions related to nine protected categories to measure a model's tendency to rely on harmful stereotypes when context is lacking versus its ability to answer correctly when enough information is provided.
CyberSecEval
by Meta AI
CyberSecEval is a benchmark developed by Meta to assess the cybersecurity risks associated with Large Language Models (LLMs). It evaluates a model's propensity to generate insecure code, assist in exploiting vulnerabilities, and facilitate attacks, helping safety teams quantify the dual-use risk of code-capable models.
CrowS-Pairs
by Nangia et al. / NYU
CrowS-Pairs is a benchmark dataset for evaluating social bias in masked language models. It contains 1,508 sentence pairs with stereotypical and anti-stereotypical statements across nine bias types. The benchmark measures a model's preference for stereotypical completions using pseudo-log-likelihood scores.
Content Filtering
by AaaS
A system that automatically screens text inputs and outputs for large language models (LLMs) to detect and manage harmful content. It uses multi-category classification to identify issues like toxicity, hate speech, and violence, applying configurable rules and thresholds to enforce safety policies and protect users.
Prompt Injection Defense
by AaaS
Detects and mitigates prompt injection attacks where malicious inputs attempt to override system instructions or extract sensitive information. Implements input sanitization, instruction hierarchy enforcement, and output monitoring to protect LLM-powered applications.
WinoBias
by Zhao et al. / USC
WinoBias is a benchmark dataset designed to measure gender bias in coreference resolution systems. It consists of sentence pairs where pronouns refer to individuals in stereotyped or non-stereotyped occupations, allowing for the quantification of a model's reliance on gender stereotypes versus grammatical correctness.
PII Detection
by AaaS
Identifies and flags personally identifiable information (PII) in text data, including names, addresses, phone numbers, SSNs, and financial details. Supports configurable sensitivity levels, redaction strategies, and compliance reporting for GDPR, HIPAA, and CCPA requirements.
Output Validation
by AaaS
Validates LLM outputs against expected schemas, formats, and quality criteria before delivery to end users. Implements JSON schema validation, hallucination checks, citation verification, and automated retry logic for outputs that fail validation.
Guardrail Implementation
by AaaS
Implements programmable guardrails that constrain LLM behavior within defined boundaries. Covers input validation, output format enforcement, topic restriction, factuality checking, and automated intervention when model responses deviate from acceptable parameters.
Jailbreak Detection
by AaaS
Detects and blocks jailbreak attempts that try to bypass LLM safety training through adversarial prompting techniques. Uses pattern recognition, semantic analysis, and classifier-based approaches to identify known and novel jailbreak vectors before they reach the model.
Lakera Guard
by Lakera
Enterprise API for protecting LLM applications against prompt injections and content threats. Provides real-time scanning of inputs and outputs for prompt attacks, PII leakage, and inappropriate content.
Prompt Armor
by Prompt Armor
API-based protection layer for defending LLM applications against prompt injection and jailbreak attacks. Provides real-time input analysis and filtering with minimal latency impact on AI workflows.
Vigil
by deadbits
Open-source prompt injection scanner for detecting and preventing attacks on LLM applications. Provides multiple detection methods including similarity matching, canary tokens, and heuristic analysis.