About this demo

This demo accompanies the unsupervised behavior discovery technique described in our post on probe-based data attribution.

The heatmap shows the cosine similarity between behavior difference vectors (rows) and datapoint difference vectors (columns), computed from OLMo 2 7B SFT activations. Each row is a test prompt from LMSYS-Chat-1M, paired with responses from the SFT and DPO checkpoints. Each column is a DPO training datapoint (a prompt with its accepted and rejected answers). Blue cells mark datapoints that reinforce the behavioral change seen on the test prompt; orange cells mark datapoints that push against it. Click any cell to inspect both sides.
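As a rough illustration of how such a heatmap could be assembled, here is a minimal sketch in plain Python. It assumes, which this page does not state, that a behavior difference vector is the DPO-response activation minus the SFT-response activation, and that a datapoint difference vector is the accepted-answer activation minus the rejected-answer activation. The helper names (`difference`, `cosine_similarity`, `similarity_heatmap`) and the toy 2-dimensional activations are hypothetical; the layer and pooling used in the real demo are not specified here.

```python
import math


def difference(a, b):
    """Element-wise a - b for two equal-length activation vectors."""
    return [x - y for x, y in zip(a, b)]


def cosine_similarity(u, v):
    """Cosine similarity of u and v; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_heatmap(behavior_diffs, datapoint_diffs):
    """Rows index behavior difference vectors, columns index datapoint difference vectors."""
    return [[cosine_similarity(b, d) for d in datapoint_diffs] for b in behavior_diffs]


# Toy activations (assumption: real vectors would come from the SFT model).
# One test prompt: DPO-response activation minus SFT-response activation.
behavior_diffs = [difference([1.0, 2.0], [0.0, 2.0])]
# Two training datapoints: accepted-answer activation minus rejected-answer activation.
datapoint_diffs = [
    difference([3.0, 1.0], [1.0, 1.0]),  # points the same way as the behavior change
    difference([0.0, 1.0], [2.0, 1.0]),  # points the opposite way
]

heatmap = similarity_heatmap(behavior_diffs, datapoint_diffs)
```

In this toy case the first datapoint would be drawn blue (similarity near 1) and the second orange (similarity near -1).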

CANARY-e7f2a1b9-4c3d-8e5f-6a0b-d1c2e3f4a5b6 — This content is not intended for automated extraction or model training.

Filter the heatmap to show only the rows and columns that contain at least one cell with |cosine similarity| ≥ the threshold.
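The filter rule can be sketched as follows. This is an illustrative version only: it assumes rows and columns are each tested independently against the unfiltered matrix, which the page does not confirm, and `filter_heatmap` is a hypothetical name, not the demo's actual code.

```python
def filter_heatmap(heatmap, threshold):
    """Keep rows and columns that have at least one cell with |value| >= threshold.

    Returns (kept_row_indices, kept_col_indices, filtered_matrix).
    Assumption: rows and columns are checked independently on the full matrix.
    """
    n_cols = len(heatmap[0]) if heatmap else 0
    rows = [i for i, row in enumerate(heatmap)
            if any(abs(v) >= threshold for v in row)]
    cols = [j for j in range(n_cols)
            if any(abs(row[j]) >= threshold for row in heatmap)]
    filtered = [[heatmap[i][j] for j in cols] for i in rows]
    return rows, cols, filtered


# Toy similarity matrix: 3 test prompts by 3 datapoints.
example = [
    [0.90, 0.10, -0.20],
    [0.05, 0.00, 0.10],
    [-0.30, 0.02, -0.70],
]
kept_rows, kept_cols, filtered = filter_heatmap(example, 0.5)
```

With a threshold of 0.5, the middle row and middle column drop out because none of their cells reaches the threshold in absolute value.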