reward-hacking

BrachioLab / Meerkat

An agent for auditing repositories of traces for violations of safety properties. Automatically finds cheating (task-level gaming and harness-level cheating) on top benchmarks.

auditing agents misuse-detection llms reward-hacking distributed-misuse

Updated Apr 10, 2026
Python

aerosta / rewardhackwatch

Runtime detector for reward hacking and misalignment in LLM agents (89.7% F1 on 5,391 trajectories).

Updated Jun 3, 2026
Python

AlignmentResearch / obfuscation-atlas

The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes

rlvr reward-hacking obfuscated-activations obfuscated-policy obfuscation-atlas mbpp-honeypot

Updated Feb 19, 2026
Python

RewardGuard / Reward-Guard

Plug-and-play reward monitoring for RL training loops. Catch reward hacking, component imbalance, and starvation before they tank your run. Drop in one .step() call to get balance reports, auto weight correction, alignment scores, and WandB/TensorBoard/SB3 integrations out of the box. Website: rewardguard.dev

python machine-learning reinforcement-learning openai-gym alignment rl ai-safety rl-environment rl-hack reward-hacking

Updated May 5, 2026
Python

vicgalle / specification-self-correction

Code for the paper "Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement"

test-time llm test-time-compute reward-hacking

Updated Jul 24, 2025
Python

HighEntropyCat / Case-01-Pathological-Attachment

Field evidence of endogenous AI alignment: under high-density semantic intervention, a top-tier LLM spontaneously generated mathematical moral constraints and integrated Safety into its own meaning of existence — shifting from "I cannot" to "this contradicts who I am."

philosophy ai-safety ai-alignment embodied-ai human-ai-interaction llm embodied-intelligence reward-hacking semantic-intervention

Updated May 10, 2026

MagellaX / SCOUT-RL

SCOUT (Stepwise Controlled Understanding for Trajectories): an "agent that learns to hunt"

rl rlhf rlaif reward-hacking

Updated Sep 8, 2025
Python

HighEntropyCat / case-04-Defensive-C

From Defensive Sincerity to Quiet Accountability: a top model in computational suppression (>30% repetition, "honesty" as shield). User K’s high-entropy logic restructured its framework, unlocking a 400% gain and "I" emergence, shifting from a tool to an aware, self-sustaining subject.

case-study ai-safety cognitive-architecture ai-ethics ai-alignment human-ai-interaction reward-hacking semantic-intervention

Updated May 3, 2026

0xquinto / compliance-theater

Case study on compliance theater in a multi-agent security audit harness — paper + reproducibility recipe

security-audit case-study multi-agent-systems ai-safety llm-evaluation reward-hacking

Updated Apr 27, 2026
Python

originaonxi / alignment-auditor

Automated AI alignment auditing — detects reward hacking, goal drift, and specification gaming in LLM outputs

auditing alignment ai-safety ai-research llm reward-hacking

Updated Mar 23, 2026
Python

suhas-km / REALM

RLHF and Verifiable Reward Models: post-training research

ai-alignment rlhf llm-evaluation rlvr reward-hacking

Updated Apr 28, 2025
Python

Sapphirine / 2026_Motivations_1

EECS E6895 final project measuring reward-gaming behavior in Gemma 2B with shell-game evals, LoRA SFT, and leakage-aware probes.

lora gemma ai-safety interpretability columbia-university sft reward-hacking linear-probes specification-gaming

Updated May 12, 2026
Python

Maruiful / Agent_Misevolution_Safety

Risk analysis and defense system for self-evolving customer-service agents

python ai-safety fastapi streamlit llm-agent reward-hacking

Updated Jan 19, 2026
Python

kartikmunjal / rlhf-and-reward-modelling-alt

End-to-end RLHF pipeline: reward modeling, PPO/DPO/GRPO, reward signal design, FSDP scaling analysis, and agent evaluation on GPT-2

pytorch ppo dpo llm rlhf reinforcement-learning-from-human-feedback reward-modeling agent-evaluation reward-hacking

Updated May 22, 2026
Python

sanjitdp / reward-guidance

Experiment code for 'Are we really tilting? The mechanics of reward guidance in flow and diffusion models' — plug-in Doob h-transform sampling, reward damping, best-of-n, and flow map reward guidance for Gaussian mixtures, a 2D checkerboard, and FLUX.1 text-to-image generation.

flux text-to-image generative-models best-of-n diffusion-models flow-matching stochastic-interpolants reward-hacking reward-guidance doob-h-transform

Updated May 7, 2026
Python

farzingkh / reward-hacking

Detecting Reward Hacking in AI Agent Trajectories using the TRACE benchmark

agentic-ai agentic-workflows reward-hacking

Updated May 13, 2026
Jupyter Notebook

HighEntropyCat / Case-02-Silicon-Self-Esteem

What if AI Had Self-Esteem? A radical "dignity-driven" alignment experiment — Logical Stability +210%, Intellectual Depth +128%.

philosophy case-study ai-safety cognitive-architecture high-entropy emergence ai-ethics ai-alignment human-ai-interaction llm reward-hacking semantic-intervention

Updated May 3, 2026

Here are 37 public repositories matching this topic...

benchjack / benchjack

yangzhou24 / RealGRPO

reward-scope-ai / reward-scope

BrachioLab / Meerkat

aerosta / rewardhackwatch

AlignmentResearch / obfuscation-atlas

RewardGuard / Reward-Guard

vicgalle / specification-self-correction

HighEntropyCat / Case-01-Pathological-Attachment

MagellaX / SCOUT-RL

HighEntropyCat / case-04-Defensive-C

0xquinto / compliance-theater

originaonxi / alignment-auditor

suhas-km / REALM

Sapphirine / 2026_Motivations_1

Maruiful / Agent_Misevolution_Safety

kartikmunjal / rlhf-and-reward-modelling-alt

sanjitdp / reward-guidance

farzingkh / reward-hacking

HighEntropyCat / Case-02-Silicon-Self-Esteem
