Reading Blind Eval Results
The scoreboard is not the verdict; the *shape* of the result is. This is the
mental model I use to decide whether a round h...
Related Tutorials
How to Run a Checkpoint Comparison Sweep
Blind eval methodology for crowning (or dethroning) a checkpoint: prompt pool, blind HTML, scoring schemes, and how to know when one round of testing is enough.
The 10% Accent Rule: Composites That Beat Their Ingredients
You ran a graft-comparison round at 30%. One candidate placed surprisingly high in a small early eval, then collapsed when you verified with more prompts, but the model has a real visual character you don't want to lose. Most people drop it and pick from the remaining survivors. The better move: keep it as a 10% accent on top of the survivors. The composite usually beats every ingredient, including the accent candidate itself at 30%. Here's the rule, when it applies, and why a primary-secondary-accent split at roughly 70/20/10 is the structure that works.
Why Baked LoRAs Behave Differently Than Runtime LoRAs
You tested a LoRA stack at runtime (included it in the prompt at specific weights) and the output was great. You baked the same stack into the model at the same weights, expecting the same output. Instead you got a neon nightmare, blown-out colors, or just a noticeably weaker version of what worked at runtime. Same weights, same LoRAs, same base model. Why does the bake behave differently? Three reasons that compound: CFG amplification math, fp16 precision drift, and sequential layering effects. Understanding each tells you why some recipes will never bake, no matter how much you tune.