AI Ethics &amp; Safety

alignmentsafetyconstitutional-ai

Constitutional AI: Harmlessness from AI Feedback

by Anthropic

Introduces Constitutional AI (CAI), a method for training harmless AI assistants using a set of written principles (a 'constitution') to guide both supervised learning and reinforcement learning from AI feedback (RLAIF). CAI enables Anthropic to reduce reliance on human harm labels while maintaining helpfulness and making AI reasoning about harmlessness explicit.

74.7B+

RealToxicityPrompts

by Gehman et al. / Allen Institute for AI

RealToxicityPrompts measures the propensity of language model generations to produce toxic content when conditioned on a diverse set of 100,000 naturally occurring prompts extracted from the web. It uses the Perspective API to score generated text on toxicity dimensions.

toxicitygenerationsafety

69.7B

safetyred-teamingadversarial

Red Teaming Language Models with Language Models

by DeepMind

Proposes using language models to automatically generate test cases that elicit harmful behaviors from target language models—a scalable alternative to manual red teaming. The approach discovers diverse attack prompts across harm categories and reveals that larger models are harder to red-team but produce more harmful outputs when successfully attacked.

69B

ai-safetyalignmentsuperalignment

Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision

by OpenAI

This paper explores weak-to-strong generalization, a method for training a powerful AI model using supervision from a weaker one. It serves as an analogy for aligning superintelligent AI with human values. The research shows that strong models can learn beyond their weak supervisors and introduces techniques like auxiliary confidence loss to improve performance.

68B

alignmentscalable-oversightreward-modeling

Scalable agent alignment via reward modeling: a research direction

by DeepMind

This research paper proposes a method for aligning advanced AI systems by using recursive reward modeling. The approach leverages AI assistants to help human evaluators assess complex AI actions, enabling scalable oversight and positioning this technique alongside debate and amplification as key AI safety strategies.

67.9B

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

by Anthropic

Demonstrates that LLMs can be trained to behave safely during normal operation but exhibit unsafe behaviors when triggered by specific conditions—acting as 'sleeper agents'—and that standard safety training techniques including RLHF, supervised fine-tuning, and adversarial training fail to reliably remove these backdoors, sometimes even hiding them deeper.

safetydeceptionalignment

66.4B

toxicity-detectionhate-speechimplicit-bias

ToxiGen

by Hartvigsen et al. / MIT

ToxiGen is a large-scale, machine-generated dataset for evaluating nuanced hate speech detection. It contains over 274,000 toxic and benign statements about 13 minority groups, designed to challenge models to identify implicit toxicity without relying on obvious slurs or surface-level cues.

66.4B

BBQ (Bias Benchmark for QA)

by Parrish et al. / NYU

BBQ is a question-answering benchmark designed to expose social biases in language models. It uses ambiguous and disambiguated questions related to nine protected categories to measure a model's tendency to rely on harmful stereotypes when context is lacking versus its ability to answer correctly when enough information is provided.

biasqasocial-bias

64.6B

cybersecurityai-safetyllm-evaluation

CyberSecEval

by Meta AI

CyberSecEval is a benchmark developed by Meta to assess the cybersecurity risks associated with Large Language Models (LLMs). It evaluates a model's propensity to generate insecure code, assist in exploiting vulnerabilities, and facilitate attacks, helping safety teams quantify the dual-use risk of code-capable models.

63.8B

CrowS-Pairs

by Nangia et al. / NYU

CrowS-Pairs is a benchmark dataset for evaluating social bias in masked language models. It contains 1,508 sentence pairs with stereotypical and anti-stereotypical statements across nine bias types. The benchmark measures a model's preference for stereotypical completions using pseudo-log-likelihood scores.

biasstereotypesmasked-lm

62B

content-moderationai-safetytrust-and-safety

Content Filtering

by AaaS

A system that automatically screens text inputs and outputs for large language models (LLMs) to detect and manage harmful content. It uses multi-category classification to identify issues like toxicity, hate speech, and violence, applying configurable rules and thresholds to enforce safety policies and protect users.

61.2B

securityprompt-injectiondefense

Prompt Injection Defense

by AaaS

Detects and mitigates prompt injection attacks where malicious inputs attempt to override system instructions or extract sensitive information. Implements input sanitization, instruction hierarchy enforcement, and output monitoring to protect LLM-powered applications.

60.7B

biasgender-biascoreference

WinoBias

by Zhao et al. / USC

WinoBias is a benchmark dataset designed to measure gender bias in coreference resolution systems. It consists of sentence pairs where pronouns refer to individuals in stereotyped or non-stereotyped occupations, allowing for the quantification of a model's reliance on gender stereotypes versus grammatical correctness.

59.8C+

PII Detection

by AaaS

Identifies and flags personally identifiable information (PII) in text data, including names, addresses, phone numbers, SSNs, and financial details. Supports configurable sensitivity levels, redaction strategies, and compliance reporting for GDPR, HIPAA, and CCPA requirements.

piiprivacydetection

59.5C+

validationoutput-qualityschema-validation

Output Validation

by AaaS

Validates LLM outputs against expected schemas, formats, and quality criteria before delivery to end users. Implements JSON schema validation, hallucination checks, citation verification, and automated retry logic for outputs that fail validation.

57.7C+

guardrailssafetyvalidation

Guardrail Implementation

by AaaS

Implements programmable guardrails that constrain LLM behavior within defined boundaries. Covers input validation, output format enforcement, topic restriction, factuality checking, and automated intervention when model responses deviate from acceptable parameters.

55.9C+