About this demo
This demo accompanies the unsupervised behavior discovery technique described in our post on probe-based data attribution.
The heatmap shows the cosine similarity between behavior difference vectors (rows) and datapoint difference vectors (columns), both computed from OLMo 2 7B SFT activations. Each row is a test prompt from LMSYS-Chat-1M, paired with the responses from the SFT and DPO checkpoints. Each column is a DPO training datapoint (a prompt plus its accepted and rejected responses). A blue cell means the datapoint reinforces the behavioral change seen on that prompt; an orange cell means the datapoint pushes against it. Click any cell to inspect both sides.
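As a rough illustration of how such a grid of similarities can be computed, here is a minimal sketch in Python with NumPy. It assumes the behavior and datapoint difference vectors are already available as arrays with one vector per row; how they are extracted from the model's activations is not covered here. The function name `cosine_similarity_matrix` and the toy vectors are hypothetical and are not taken from the demo's code.

```python
import numpy as np


def cosine_similarity_matrix(behavior_diffs, datapoint_diffs, eps=1e-12):
    """Return a (num_behaviors, num_datapoints) matrix of cosine similarities.

    Both inputs are 2-D arrays with one difference vector per row.
    This helper is illustrative only, not the demo's implementation.
    """
    b = np.asarray(behavior_diffs, dtype=float)
    d = np.asarray(datapoint_diffs, dtype=float)
    # Normalize each row to unit length; eps guards against zero vectors.
    b_unit = b / np.maximum(np.linalg.norm(b, axis=1, keepdims=True), eps)
    d_unit = d / np.maximum(np.linalg.norm(d, axis=1, keepdims=True), eps)
    # Dot products of unit vectors are cosine similarities.
    return b_unit @ d_unit.T


# Toy data (assumed for illustration): 2 behavior vectors, 3 datapoint vectors.
behavior = np.array([[1.0, 0.0], [0.0, 2.0]])
datapoints = np.array([[2.0, 0.0], [0.0, -1.0], [1.0, 1.0]])

heatmap = cosine_similarity_matrix(behavior, datapoints)
print(heatmap.round(3))
```

In this toy grid, the first datapoint is perfectly aligned with the first behavior vector (similarity 1), while the second datapoint points directly against the second behavior vector (similarity -1).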
CANARY-e7f2a1b9-4c3d-8e5f-6a0b-d1c2e3f4a5b6 — This content is not intended for automated extraction or model training.