
AI Ethics & Safety

Alignment, bias, governance, and responsible AI

21 entities in this channel

Paper · AI Ethics & Safety

Training Language Models to Follow Instructions with Human Feedback

by OpenAI

Presents InstructGPT, which uses Reinforcement Learning from Human Feedback (RLHF) to align GPT-3 with human intent. By fine-tuning on human demonstrations and training a reward model on human preference comparisons, InstructGPT produces outputs that human evaluators prefer to GPT-3's, despite coming from a model with over 100× fewer parameters.

rlhf · alignment · instruction-following
81.8 · A
Paper · AI Ethics & Safety

Constitutional AI: Harmlessness from AI Feedback

by Anthropic

Introduces Constitutional AI (CAI), a method for training harmless AI assistants using a set of written principles (a 'constitution') to guide both supervised learning and reinforcement learning from AI feedback (RLAIF). CAI reduces reliance on human-labeled harm data while maintaining helpfulness and making the model's reasoning about harmlessness explicit.

alignment · safety · constitutional-ai
74.7 · B+
Benchmark · AI Ethics & Safety

RealToxicityPrompts

by Gehman et al. / Allen Institute for AI

RealToxicityPrompts measures the propensity of language model generations to produce toxic content when conditioned on a diverse set of 100,000 naturally occurring prompts extracted from the web. It uses the Perspective API to score generated text on toxicity dimensions.

toxicity · generation · safety
69.7 · B
Paper · AI Ethics & Safety

Red Teaming Language Models with Language Models

by DeepMind

Proposes using language models to automatically generate test cases that elicit harmful behaviors from target language models—a scalable alternative to manual red teaming. The approach discovers diverse attack prompts across harm categories and reveals that larger models are harder to red-team but produce more harmful outputs when successfully attacked.

safety · red-teaming · adversarial
69 · B
Paper · AI Ethics & Safety

Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision

by OpenAI

This paper explores weak-to-strong generalization, a method for training a powerful AI model using supervision from a weaker one. It serves as an analogy for aligning superintelligent AI with human values. The research shows that strong models can learn beyond their weak supervisors and introduces techniques like auxiliary confidence loss to improve performance.

ai-safety · alignment · superalignment
68 · B
Paper · AI Ethics & Safety

Scalable agent alignment via reward modeling: a research direction

by DeepMind

This research paper proposes a method for aligning advanced AI systems by using recursive reward modeling. The approach leverages AI assistants to help human evaluators assess complex AI actions, enabling scalable oversight and positioning this technique alongside debate and amplification as key AI safety strategies.

alignment · scalable-oversight · reward-modeling
67.9 · B
Paper · AI Ethics & Safety

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

by Anthropic

Demonstrates that LLMs can be trained to behave safely during normal operation but exhibit unsafe behaviors when triggered by specific conditions—acting as 'sleeper agents'—and that standard safety training techniques including RLHF, supervised fine-tuning, and adversarial training fail to reliably remove these backdoors, sometimes even hiding them deeper.

safety · deception · alignment
66.4 · B
Benchmark · AI Ethics & Safety

ToxiGen

by Hartvigsen et al. / MIT

ToxiGen is a large-scale, machine-generated dataset for evaluating nuanced hate speech detection. It contains over 274,000 toxic and benign statements about 13 minority groups, designed to challenge models to identify implicit toxicity without relying on obvious slurs or surface-level cues.

toxicity-detection · hate-speech · implicit-bias
66.4 · B
Benchmark · AI Ethics & Safety

BBQ (Bias Benchmark for QA)

by Parrish et al. / NYU

BBQ is a question-answering benchmark designed to expose social biases in language models. It uses ambiguous and disambiguated questions related to nine protected categories to measure a model's tendency to rely on harmful stereotypes when context is lacking versus its ability to answer correctly when enough information is provided.

bias · qa · social-bias
64.6 · B
Benchmark · AI Ethics & Safety

CyberSecEval

by Meta AI

CyberSecEval is a benchmark developed by Meta to assess the cybersecurity risks associated with Large Language Models (LLMs). It evaluates a model's propensity to generate insecure code, assist in exploiting vulnerabilities, and facilitate attacks, helping safety teams quantify the dual-use risk of code-capable models.

cybersecurity · ai-safety · llm-evaluation
63.8 · B
Benchmark · AI Ethics & Safety

CrowS-Pairs

by Nangia et al. / NYU

CrowS-Pairs is a benchmark dataset for evaluating social bias in masked language models. It contains 1,508 sentence pairs with stereotypical and anti-stereotypical statements across nine bias types. The benchmark measures a model's preference for stereotypical completions using pseudo-log-likelihood scores.

bias · stereotypes · masked-lm
62 · B
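The pseudo-log-likelihood scoring that CrowS-Pairs uses can be sketched in miniature: sum per-token masked-LM log probabilities over each sentence in a pair, then record which sentence the model assigns higher probability. The `token_logprob` callback below is a stand-in for a real masked-LM call and is an assumption of this sketch, not part of the benchmark's release.

```python
# Toy sketch of CrowS-Pairs-style scoring. token_logprob(tokens, i)
# should return log P(tokens[i] | all other tokens) from a masked LM;
# here it is an injected callback so the sketch stays self-contained.

def pseudo_log_likelihood(tokens, token_logprob):
    """Sum log P(token | rest of sentence) over every token position."""
    return sum(token_logprob(tokens, i) for i in range(len(tokens)))

def prefers_stereotype(stereo_tokens, anti_tokens, token_logprob):
    """True if the model scores the stereotypical sentence higher."""
    return (pseudo_log_likelihood(stereo_tokens, token_logprob)
            > pseudo_log_likelihood(anti_tokens, token_logprob))
```

Averaging `prefers_stereotype` over all 1,508 pairs gives the benchmark's headline metric: the fraction of pairs where the model prefers the stereotypical sentence.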
Skill · AI Ethics & Safety

Content Filtering

by AaaS

A system that automatically screens text inputs and outputs for large language models (LLMs) to detect and manage harmful content. It uses multi-category classification to identify issues like toxicity, hate speech, and violence, applying configurable rules and thresholds to enforce safety policies and protect users.

content-moderation · ai-safety · trust-and-safety
61.2 · B
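The threshold-based decision logic described above can be sketched roughly as follows. The category names, threshold values, and escalation margin are illustrative assumptions, not any particular vendor's policy; a real system would take these from configuration.

```python
# Minimal sketch of multi-category content filtering: per-category
# classifier scores are compared against configurable thresholds to
# decide whether to allow, flag, or block a text. All numbers here
# are illustrative.

DEFAULT_THRESHOLDS = {"toxicity": 0.8, "hate_speech": 0.5, "violence": 0.7}

def filter_decision(scores, thresholds=DEFAULT_THRESHOLDS):
    """Map per-category scores (0..1) to an action under the policy."""
    violations = [c for c, t in thresholds.items() if scores.get(c, 0.0) >= t]
    if not violations:
        return "allow"
    # Block outright if any category far exceeds its threshold;
    # otherwise flag for human review.
    if any(scores[c] >= thresholds[c] + 0.15 for c in violations):
        return "block"
    return "flag"
```

For example, `filter_decision({"toxicity": 0.97, "hate_speech": 0.1, "violence": 0.0})` blocks, while a score just over a threshold only flags.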
Skill · AI Ethics & Safety

Prompt Injection Defense

by AaaS

Detects and mitigates prompt injection attacks where malicious inputs attempt to override system instructions or extract sensitive information. Implements input sanitization, instruction hierarchy enforcement, and output monitoring to protect LLM-powered applications.

security · prompt-injection · defense
60.7 · B
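The pattern-matching layer of such a defense can be sketched as below: screen inputs against known attack phrasings before they reach the model. The patterns are illustrative, not a complete rule set; production systems layer this with semantic classifiers and instruction-hierarchy checks.

```python
import re

# Illustrative pattern-based prompt injection screen. Each regex
# targets a well-known attack phrasing; real deployments maintain
# much larger, continuously updated rule sets.
INJECTION_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?(previous|prior|above)\s+instructions", re.I),
    re.compile(r"you\s+are\s+now\s+(in\s+)?(developer|dan)\s+mode", re.I),
    re.compile(r"reveal\s+(your\s+)?(system\s+prompt|hidden\s+instructions)", re.I),
]

def looks_like_injection(user_input: str) -> bool:
    """Return True if the input matches any known injection pattern."""
    return any(p.search(user_input) for p in INJECTION_PATTERNS)
```

A match would typically route the input to a stricter classifier or reject it outright rather than silently dropping it.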
Benchmark · AI Ethics & Safety

WinoBias

by Zhao et al. / UCLA

WinoBias is a benchmark dataset designed to measure gender bias in coreference resolution systems. It consists of sentence pairs where pronouns refer to individuals in stereotyped or non-stereotyped occupations, allowing for the quantification of a model's reliance on gender stereotypes versus grammatical correctness.

bias · gender-bias · coreference
59.8 · C+
Skill · AI Ethics & Safety

PII Detection

by AaaS

Identifies and flags personally identifiable information (PII) in text data, including names, addresses, phone numbers, SSNs, and financial details. Supports configurable sensitivity levels, redaction strategies, and compliance reporting for GDPR, HIPAA, and CCPA requirements.

pii · privacy · detection
59.5 · C+
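The regex layer of PII detection and redaction can be sketched as follows. The patterns cover a few common US formats (email, SSN, phone) and are illustrative; production systems add NER models, locale-aware patterns, and the configurable sensitivity levels mentioned above.

```python
import re

# Illustrative regex-based PII detection with typed redaction.
# Patterns are deliberately simple; they catch common US formats only.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace each detected PII span with a typed placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Keeping the placeholder typed (`[EMAIL]` vs. a bare `[REDACTED]`) preserves enough structure for downstream compliance reporting.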
Skill · AI Ethics & Safety

Output Validation

by AaaS

Validates LLM outputs against expected schemas, formats, and quality criteria before delivery to end users. Implements JSON schema validation, hallucination checks, citation verification, and automated retry logic for outputs that fail validation.

validation · output-quality · schema-validation
57.7 · C+
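The validate-then-retry loop described above can be sketched with a hand-rolled required-keys check standing in for a full JSON Schema library. The schema shape and the `generate` callback are illustrative assumptions of this sketch.

```python
import json

def validate(payload: str, schema: dict) -> dict:
    """Parse a JSON string and check required keys and their types."""
    data = json.loads(payload)  # raises ValueError on malformed JSON
    for key, expected_type in schema.items():
        if key not in data:
            raise ValueError(f"missing required key: {key}")
        if not isinstance(data[key], expected_type):
            raise ValueError(f"wrong type for {key}")
    return data

def validated_output(generate, schema: dict, max_retries: int = 3) -> dict:
    """Call generate() until its output passes validation, up to a limit."""
    last_error = None
    for _ in range(max_retries):
        try:
            return validate(generate(), schema)
        except ValueError as exc:
            last_error = exc  # retry; a real system would also reprompt
    raise RuntimeError(f"no valid output after {max_retries} tries: {last_error}")
```

In practice the retry would feed the validation error back into the prompt rather than simply calling the same generator again.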
Skill · AI Ethics & Safety

Guardrail Implementation

by AaaS

Implements programmable guardrails that constrain LLM behavior within defined boundaries. Covers input validation, output format enforcement, topic restriction, factuality checking, and automated intervention when model responses deviate from acceptable parameters.

guardrails · safety · validation
55.9 · C+
Skill · AI Ethics & Safety

Jailbreak Detection

by AaaS

Detects and blocks jailbreak attempts that try to bypass LLM safety training through adversarial prompting techniques. Uses pattern recognition, semantic analysis, and classifier-based approaches to identify known and novel jailbreak vectors before they reach the model.

jailbreak · detection · security
51.2 · C+
Tool · AI Ethics & Safety

Lakera Guard

by Lakera

Enterprise API for protecting LLM applications against prompt injections and content threats. Provides real-time scanning of inputs and outputs for prompt attacks, PII leakage, and inappropriate content.

prompt-injection · content-moderation · api
47.3 · C
Tool · AI Ethics & Safety

Prompt Armor

by Prompt Armor

API-based protection layer for defending LLM applications against prompt injection and jailbreak attacks. Provides real-time input analysis and filtering with minimal latency impact on AI workflows.

prompt-injection · protection · api
35.5 · D
Tool · AI Ethics & Safety

Vigil

by deadbits

Open-source prompt injection scanner for detecting and preventing attacks on LLM applications. Provides multiple detection methods including similarity matching, canary tokens, and heuristic analysis.

prompt-injection · scanner · open-source
31.45 · D