The Smart Pruning Paradox
Compress an LLM with a clever, activation-aware method and its language quality survives almost untouched. Its fairness does not. This piece walks through pruning step by step, then shows why the cleverest method is structurally the most dangerous one.
Based on "Weight Pruning Amplifies Bias: A Multi-Method Study of Compressed LLMs for Edge AI" (Rath & Maliakkal, AIIoT 2026).
Pruning, concretely
Before anything else: pruning is not compression in the traditional sense. It zeros out individual weights inside a matrix. The matrix is the same shape. Your model file is the same size. Slide the dial and watch what really changes.
Mistral-7B-Instruct · 14×24 slice of one weight matrix. A real layer has millions of cells, but the pattern you see here is the same.
Zeroing weights does not shrink the file. The matrix is the same shape; a zero takes the same two bytes as any other fp16 value. Pruning is a precondition for compression, not the compression itself.
Bytes only drop when a second step exploits the zeros. Either store just the surviving values and their positions, so the file size tracks how many weights you actually kept, or remove whole rows or attention heads so the matrix itself gets smaller. Without one of those, the dial above moves accuracy, not file size.
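A minimal NumPy sketch of the two steps above, run on a random stand-in matrix rather than real model weights. The bitmask-plus-values layout is only one assumed way to "store the surviving values and their positions"; real sparse formats differ in detail.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the 14x24 slice: random values, not real Mistral weights.
W = rng.standard_normal((14, 24)).astype(np.float16)

# Step 1: unstructured pruning. Zero the 50% of cells with the smallest magnitude.
n_drop = W.size // 2
drop_idx = np.argsort(np.abs(W), axis=None)[:n_drop]
pruned = W.copy()
pruned.flat[drop_idx] = 0

# Same shape and dtype, so the dense array costs exactly the same bytes.
dense_bytes_before = W.nbytes
dense_bytes_after = pruned.nbytes

# Step 2 (assumed layout): keep the surviving values plus a 1-bit-per-cell mask.
keep = pruned != 0
values = pruned[keep]          # surviving fp16 values in row-major order
bitmask = np.packbits(keep)    # flattens the mask and packs 8 cells per byte
sparse_bytes = values.nbytes + bitmask.nbytes

# Round trip to confirm the compact form loses nothing.
restored_keep = np.unpackbits(bitmask, count=W.size).astype(bool).reshape(W.shape)
restored = np.zeros_like(pruned)
restored[restored_keep] = values

print(dense_bytes_before, dense_bytes_after, sparse_bytes)
```

Only the second step changes the byte count. The first step changes which values the bytes hold.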
Three ways to choose what to zero
The interesting question is not how many weights to drop. It is which ones. Three popular policies pick out very different cells in the same matrix.
Same matrix, same sparsity, three policies. Cells the method would zero are darkened. Notice how the maps barely overlap.
At the same sparsity rate, the three methods produce different models. Hold this thought: the picture on the right side of the grid is the one that wins on perplexity benchmarks. We are about to see what else it does.
Inside Wanda's score
Wanda multiplies a weight's magnitude by the magnitude of the input it sees. Drag the activations around and watch the score recompute. This is the formula we will come back to in stage six.
All columns are hot. Magnitude and Wanda largely agree because the activations are nearly uniform.
Disagreement with Magnitude at this row, this sparsity: 4 cuts agreed, 0 kept by Wanda that Magnitude would drop, 0 dropped by Wanda that Magnitude would keep.
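The score described above can be sketched in a few lines of NumPy. Two details are assumptions taken from the published Wanda method rather than from this piece: "the magnitude of the input" is the L2 norm of each input feature over a calibration batch, and weights compete within each output row. The helper names and the tiny matrix are hypothetical.

```python
import numpy as np

def magnitude_score(W):
    """Magnitude pruning scores each weight by |w| alone."""
    return np.abs(W)

def wanda_score(W, X):
    """|w| times the L2 norm of the matching input feature (assumed form).

    W: (out_features, in_features); X: (n_samples, in_features).
    """
    act_norm = np.linalg.norm(X, axis=0)
    return np.abs(W) * act_norm[None, :]

def drop_mask(score, sparsity):
    """True where a weight is zeroed: the lowest-scoring fraction of each row."""
    n_drop = int(score.shape[1] * sparsity)
    order = np.argsort(score, axis=1, kind="stable")
    mask = np.zeros(score.shape, dtype=bool)
    np.put_along_axis(mask, order[:, :n_drop], True, axis=1)
    return mask

W = np.array([[0.9, 1.0, 1.1, 1.2]])

# Uniform ("all hot") activations: the norm is a constant factor, so the ranking is unchanged.
X_uniform = np.ones((16, 4))
# Skewed activations: the last two input features barely fire.
X_skewed = X_uniform * np.array([10.0, 10.0, 0.1, 0.1])

mag = drop_mask(magnitude_score(W), 0.5)
wanda_uniform = drop_mask(wanda_score(W, X_uniform), 0.5)
wanda_skewed = drop_mask(wanda_score(W, X_skewed), 0.5)

print("agree when uniform:", np.array_equal(mag, wanda_uniform))
print("agree when skewed: ", np.array_equal(mag, wanda_skewed))
```

With uniform activations the two policies cut the same cells. Once some inputs go quiet, Wanda cuts the weights attached to them, even when those weights are the largest in the row.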
Fluency neurons and specialist neurons
A single transformer layer does not have one kind of neuron. It has two. Most neurons fire constantly and handle the easy work. A few fire rarely and handle everything else.
Take one transformer layer. Run a few hundred prompts through it. Record the average activation magnitude of every neuron. The distribution you get back is not a bell curve. It is two clusters with very different jobs. (Illustrative.)
Fluency neurons fire on almost every input: function words, sentence flow, common-word continuations, grammar agreement. Without this cluster the model stops sounding like English. Wanda’s score loves these: high activation means high score, which means “keep.”
Specialist neurons fire only when something specific is happening: refusing a harmful request, abstaining when the context does not support an answer, recalling a rare entity, flagging an ambiguous question. Without this cluster the model still sounds fluent. It just gets the hard cases wrong.
The two clusters do different jobs. They also have radically different mean activations. Stage six will show why that asymmetry, combined with Wanda’s score, is the structural source of the paradox.
The paradox, empirically
One slider, two charts. Wanda is the best line on the left. It is also the worst line on the right. Same model, same questions, same prune.
One slider drives both charts. Illustrative: trends follow Tables 3 and 4 in the paper for Mistral-7B-Instruct.
At 50% sparsity, Wanda’s perplexity penalty is 3.5% while its bias transition rate is 47%. The cleverer the method, the better it looks on the left chart, and the worse it looks on the right one. Same model. Same prune. Different evaluation, opposite verdict.
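For the left chart's metric, a rough sketch of the arithmetic. Perplexity is the exponential of the mean negative log-likelihood per token. Reading the "penalty" as the relative increase over the dense model is an assumption here, and the log-probabilities are made up for illustration, not taken from the paper.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood per token (natural log)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def relative_penalty(ppl_dense, ppl_pruned):
    """Fractional increase in perplexity after pruning (assumed definition)."""
    return (ppl_pruned - ppl_dense) / ppl_dense

# Made-up per-token log-probabilities, for illustration only.
dense_log_probs = [-1.0, -2.0, -1.5, -1.5]
pruned_log_probs = [-1.1, -2.0, -1.6, -1.5]

ppl_dense = perplexity(dense_log_probs)
ppl_pruned = perplexity(pruned_log_probs)
penalty = relative_penalty(ppl_dense, ppl_pruned)

print(f"dense {ppl_dense:.3f}  pruned {ppl_pruned:.3f}  penalty {penalty:.1%}")
```

Nothing in this number looks at what the model says on a hard question, only at how much probability it gives the reference tokens on average.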
Why
Two toy circuits with identical weights but very different firing rates. Apply each pruning method. Then drag the rarity slider and watch the safety circuit disappear under Wanda.
Wanda preserves the weights that are most active. That is exactly right if your goal is fluent text generation. Fluency lives in weights that fire constantly: function words, common continuations, the grammar of English. Wanda gladly keeps them.
It is exactly wrong if your goal is preserving safety alignment. The behavioral patterns we ask a model to perform (refuse, abstain, flag an ambiguous question, decline to speculate) live in weights that fire only occasionally, when the model actually needs them. To Wanda’s score those weights look unimportant. They go first.
The paradox is not a bug. It is the formula doing exactly what you asked. You optimized for keeping language quality intact, measured as perplexity, and the score obliged. Behavior was never part of the objective.
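The two-circuit argument can be reproduced with a toy simulation. Everything here is an assumption made for illustration: the binary activations, the firing rates, the one-row "layer", and the Wanda-style score (weight magnitude times per-feature L2 norm). It is not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_fluency, n_safety = 500, 4, 4

def simulate_activations(safety_fire_rate, fluency_fire_rate=0.95):
    """Binary activations: 1.0 when an input feature fires, 0.0 otherwise."""
    rates = np.array([fluency_fire_rate] * n_fluency + [safety_fire_rate] * n_safety)
    return (rng.random((n_samples, rates.size)) < rates).astype(float)

def wanda_dropped(W, X, sparsity=0.5):
    """Indices of the weights a Wanda-style score would zero in a one-row layer."""
    score = np.abs(W) * np.linalg.norm(X, axis=0)
    n_drop = int(W.size * sparsity)
    return set(np.argsort(score, kind="stable")[:n_drop].tolist())

# Identical weights everywhere, so magnitude alone cannot tell the circuits apart.
W = np.full(n_fluency + n_safety, 0.5)
safety_idx = set(range(n_fluency, n_fluency + n_safety))

# The safety circuit fires on about 2% of inputs, the fluency circuit on about 95%.
X = simulate_activations(safety_fire_rate=0.02)
dropped = wanda_dropped(W, X)

print("dropped:", sorted(dropped), "safety circuit:", sorted(safety_idx))
```

The weights are equally large in both circuits. The only thing that separates them is how often their inputs fire, and that alone is enough for the score to remove the rare circuit first.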
Even the speedup is fiction
If the behavioral cost still feels worth paying, here is the second problem: on Apple Silicon, unstructured pruning does not buy you the speedup or the storage win that justified it in the first place.
The whole motivation for pruning was deployment: a smaller file, faster inference, lower memory pressure on a battery-powered device. For unstructured pruning on Apple Silicon, that story falls apart on contact with the actual kernels. (Illustrative.)
Metal Performance Shaders and MLX both run dense matmuls. A zero in an unstructured mask is still a real number that gets multiplied and added. The Neural Engine has no primitive for “skip this cell.” You only get a speedup when the mask has a structure the kernel can exploit, like 2:4 sparsity, or when you swap to a sparse matmul kernel entirely. Neither of those is what the paper tested.
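To make "a structure the kernel can exploit" concrete, here is a sketch of the 2:4 pattern as it is commonly defined: at most two nonzero weights in every group of four consecutive weights along a row. The function names are hypothetical, the matrix is random, and the sketch makes no claim about which patterns any particular kernel accelerates.

```python
import numpy as np

def prune_2_of_4(W):
    """Zero the two smallest-magnitude weights in every group of four along each row."""
    rows, cols = W.shape
    if cols % 4 != 0:
        raise ValueError("number of columns must be a multiple of 4")
    groups = W.reshape(rows, cols // 4, 4)
    order = np.argsort(np.abs(groups), axis=-1)
    keep = np.ones(groups.shape, dtype=bool)
    np.put_along_axis(keep, order[..., :2], False, axis=-1)
    return np.where(keep, groups, 0.0).reshape(rows, cols)

def is_2_of_4(W):
    """True if every group of four consecutive weights has at most two nonzeros."""
    groups = W.reshape(W.shape[0], -1, 4)
    return bool(np.all(np.count_nonzero(groups, axis=-1) <= 2))

rng = np.random.default_rng(0)
W = rng.standard_normal((14, 24))

# Unstructured 50%: zero the globally smallest half, wherever those cells happen to sit.
unstructured = W.copy()
unstructured.flat[np.argsort(np.abs(W), axis=None)[: W.size // 2]] = 0

structured = prune_2_of_4(W)

# A dense matmul performs this many multiply-adds whether or not cells are zero.
dense_macs = W.shape[0] * W.shape[1]

print("unstructured is 2:4:", is_2_of_4(unstructured))
print("structured is 2:4:  ", is_2_of_4(structured))
```

Both matrices are 50% zeros. Only the second places them where a fixed-pattern kernel could know in advance which cells to skip.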
Storage tells the same story. The dominant runtime formats (safetensors, GGUF, MLX checkpoints) serialize the dense array. Each zero takes the same two bytes as any other fp16 value.
What this means
Compression is never just compression. The objective you compress under decides which capabilities survive.
Perplexity is a fluency metric. If you prune a model under a fluency objective, you preferentially keep fluency circuits. The score you optimized for behaves correctly. The behavior you forgot to measure is the one that gets compressed away.
Wanda is not the villain here. Wanda is doing exactly what its score function asks. The villain is the assumption that perplexity-preserving and behavior-preserving are the same thing. They are not, they were never going to be, and the gap widens as pruning gets more sophisticated.
If you care about behavior, you need either a behavior-aware scoring rule (something that weighs refuse / abstain / calibrate signals against language quality), or behavior-aware evaluation (BBQ, social-bias suites, abstention benchmarks) sitting next to perplexity on every pruning sweep. Today’s defaults give you neither.