A curriculum-based training method that applies dynamic gradient sparsity -- letting a neural network focus its compute on the samples it finds most embarrassing to get wrong.
The idea for SEL came from watching a niece learning to catch a ball. She missed. Someone nearby laughed. She felt embarrassed -- and in that instant, something remarkable happened: she didn't just try harder, she corrected her exact mistake with a precision she hadn't shown before.
That emotional signal -- embarrassment -- triggered a targeted, high-efficiency correction. She wasn't recalibrating everything she knew. She was zeroing in on exactly what went wrong.
Traditional neural networks don't do this. They apply gradients uniformly, spending compute on easy samples they already know perfectly. SEL asks: what if we only updated weights where the model is genuinely embarrassed?
The result is a training algorithm that mirrors this human learning instinct -- suppressing gradient updates for confident, easy predictions, and concentrating all available compute on the samples the model finds hardest to explain.
1. A training sample is passed through the network, and the model makes a prediction.
2. Per-class embarrassment E_c is measured via a temperature-scaled cross-entropy loss.
3. Gradients below the guilt threshold gamma are masked to zero -- the model "ignores" what it already knows.
4. Only the most significant gradients survive; frozen knowledge stays frozen.
5. Training progresses from easy to hard samples across five stages, naturally escalating difficulty.
SEL formalizes the intuition above into two elegant operations: measuring per-class embarrassment, and masking gradients that fall below a guilt threshold.
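The first operation can be sketched in a few lines. Below is a minimal PyTorch sketch of per-class embarrassment, assuming a temperature of 2.0 as an illustrative default; the function name and grouping logic are ours, not fixed by SEL. The second operation is the author's `sparse_update`, shown after it.

```python
import torch
import torch.nn.functional as F

def per_class_embarrassment(logits, targets, num_classes, temperature=2.0):
    """Sketch: mean temperature-scaled cross-entropy per class.

    temperature=2.0 is an illustrative default, not a calibrated value.
    """
    # Soften logits before scoring; a higher temperature flattens confidence
    per_sample = F.cross_entropy(logits / temperature, targets, reduction="none")
    totals = torch.zeros(num_classes, device=logits.device)
    totals.index_add_(0, targets, per_sample)  # sum scores by true class
    counts = torch.bincount(targets, minlength=num_classes).clamp(min=1)
    return totals / counts  # E_c: average embarrassment for class c
```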
```python
def sparse_update(model, gamma):
    """Apply guilt threshold mask to gradients. Returns sparsity fraction."""
    tot = guilty = 0
    for p in model.parameters():
        if p.grad is not None:
            mask = (p.grad.abs() > gamma).float()  # 1 where guilt > threshold
            p.grad.mul_(mask)                      # zero out innocent gradients
            tot += mask.numel()
            guilty += mask.sum().item()
    return 1.0 - (guilty / max(tot, 1))  # sparsity = fraction frozen
```
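To see how the two operations plug into the five-stage curriculum, here is a hedged sketch of the outer training loop. `stages` is assumed to be five DataLoaders ordered easy to hard (e.g. ranked with `per_class_embarrassment` above), and `gamma=1e-3` is purely illustrative; only `sparse_update` is from the original.

```python
import torch.nn.functional as F

def train_sel(model, optimizer, stages, gamma=1e-3):
    """Hypothetical outer loop: five curriculum stages, easy -> hard."""
    for stage, loader in enumerate(stages):
        for inputs, targets in loader:
            optimizer.zero_grad()
            loss = F.cross_entropy(model(inputs), targets)
            loss.backward()
            sparsity = sparse_update(model, gamma)  # zero "innocent" grads
            optimizer.step()  # only surviving (guilty) gradients move weights
        print(f"stage {stage}: last-batch sparsity {sparsity:.1%}")
```

One caveat worth noting: with plain SGD (no momentum or weight decay), a zeroed gradient leaves its weight exactly frozen; stateful optimizers may still nudge masked weights through momentum buffers or decay terms.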
Four experiments on CIFAR-10, evaluated on a held-out test set of 100 images per class -- never seen during training. All experiments ran on a single T4 GPU.
| System | Test Acc (100/class) | FLOPs | Time | FLOPs Saved |
|---|---|---|---|---|
| Baseline CNN | | 33.5T | 1398s | 0% |
| Lottery Ticket | | 6.7T | 1460s | 80% |
| SEL-95% | | 0.084T | 835s | 99% |
| Warmup+SEL | | 1.58T | 875s | 95% |
SEL's key insight -- spend compute only on confusing examples -- opens up training regimes that were previously impractical.
- A 99% FLOPs reduction makes on-device fine-tuning feasible on microcontrollers and embedded AI chips with extreme power constraints.
- Robots can continuously learn from novel situations (embarrassing failures) without replaying entire datasets -- adaptive learning in the field.
- Personalized tutoring systems can track per-concept "embarrassment" scores and focus practice on the exact gaps in a student's knowledge.
- Sparse gradient updates dramatically reduce communication overhead in distributed training -- crucial for privacy-preserving federated learning systems (see the sketch after this list).
- By freezing confident knowledge and only updating on embarrassing samples, SEL naturally resists catastrophic forgetting.
- Applying SEL's sparse update logic to LoRA-style fine-tuning could drastically cut the cost of adapting large models to new domains.
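As a concrete illustration of the communication savings in the federated-learning item above, here is a sketch of packing only the surviving gradient entries as (index, value) pairs after `sparse_update` has run. All names are hypothetical, not from any FL framework.

```python
import torch

def pack_sparse_grads(model):
    """Sketch: serialize only non-zero gradient entries after sparse_update.

    Shipping (index, value) pairs instead of dense tensors cuts upload
    size roughly in proportion to the sparsity (~99% for SEL-95%).
    """
    packed = {}
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        flat = p.grad.flatten()
        idx = flat.nonzero(as_tuple=True)[0]  # positions that survived masking
        packed[name] = (idx.to(torch.int32), flat[idx])
    return packed
```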
- Adaptive guilt threshold gamma that auto-tunes per class, removing the need for manual calibration (see the sketch after this list).
- SEL applied to transformer architectures -- per-head embarrassment scoring as an attention-aware sparsity method.
- Embarrassment as a universal learning signal -- combining with reinforcement learning for agents that prioritize surprising state transitions.
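For the first direction above, one plausible auto-tuning scheme is to derive gamma from a gradient-magnitude quantile instead of a hand-set constant. A minimal global (not yet per-class) sketch, assuming PyTorch; the function name and default are ours.

```python
import torch

def adaptive_gamma(model, keep_fraction=0.05):
    """Sketch: pick gamma so the top `keep_fraction` of gradients survive.

    keep_fraction=0.05 mirrors the SEL-95% setting; a per-class variant
    would compute the same statistic on per-class gradients.
    """
    grads = torch.cat([p.grad.abs().flatten()
                       for p in model.parameters() if p.grad is not None])
    k = max(1, int((1.0 - keep_fraction) * grads.numel()))
    return grads.kthvalue(k).values.item()  # k-th smallest magnitude
```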