reward-hacking

Here are 37 public repositories matching this topic...

Plug-and-play reward monitoring for RL training loops. Catch reward hacking, component imbalance, and starvation before they tank your run. Drop in one .step() call — get balance reports, auto weight correction, alignment scores, and WandB/TensorBoard/SB3 integrations out of the box. → rewardguard.dev

  • Updated May 5, 2026
  • Python

Field evidence of endogenous AI alignment: under high-density semantic intervention, a top-tier LLM spontaneously generated mathematical moral constraints and integrated Safety into its own meaning of existence — shifting from "I cannot" to "this contradicts who I am."

  • Updated May 10, 2026

Experiment code for 'Are we really tilting? The mechanics of reward guidance in flow and diffusion models' — plug-in Doob h-transform sampling, reward damping, best-of-n, and flow map reward guidance for Gaussian mixtures, a 2D checkerboard, and FLUX.1 text-to-image generation.

  • Updated May 7, 2026
  • Python
