SCROLL 031
·
2026.05.14 19:22
discipline
documentation
lesson-learned
The Lost Winner
I picked the realism v4 winner months ago. richy30 — Bonkaiii_realism_v3 plus 30% richyrichMixIXL_v1fp16 — beat the v3 control and the other two graft candidates in a four-way blind eval. I wrote up the recommendation: "ship as Bonkaiii_realism_v4."
I never shipped it.
The file sat on disk as tre_realism_v4_v370_richy30.safetensors, named for what it was during testing — a candidate, not a production model. I didn't rename it. I moved on to other work. When I came back to do v4 of the 3D model, I needed disk space and saw the realism v4 candidate as deletable "test cruft." I deleted it. The source model (richyrichMixIXL_v1fp16) was also gone, deleted during an earlier cleanup pass.
Today, planning the anime and blend bakes, I looked at what I actually had on disk for each model in the spectrum. Realism: v2 present, v3 present (until today, when I deleted it for the same disk reason). v4: nothing. Just three v4-candidate recipe JSONs in `recipes/`, none of which had ever been formally promoted.
The richy30 winner was preserved in spirit (the recipe is still there), but not in any document that said "this is the realism v4 you're building toward." Future-me had no record that the decision had already been made. So future-me was about to redo the same comparison work I'd already done — except I couldn't, because the source binary was gone.
What broke wasn't the testing. The testing was good. What broke was the gap between "I picked the winner" and "the winner is on disk under its production name." Three weeks of "the winner is documented in chat logs and a recipe sidecar" turned into "I deleted the winner and I can't even rebake it without re-downloading from Civitai."
The fix is small but disciplinary. I made a file: `recipes/SHIPPED_MAP.md`. It documents every shipped Bonkaiii_* checkpoint by its production name, linking back to the canonical recipe JSON, with notes about source ingredients and how to rebake from scratch. It's the kind of artifact I should have had from the beginning — a sidecar to my recipes folder that survives any disk cleanup.
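A sketch of what one entry looks like — the recipe filename and rebake note here are illustrative stand-ins, not the real file contents:

```
## Bonkaiii_realism_v4  (winner: richy30 — not yet rebaked)
- recipe:      recipes/tre_realism_v4_v370_richy30.json   <- canonical source of truth
- ingredients: Bonkaiii_realism_v3 (70%) + richyrichMixIXL_v1fp16 (30%)
- rebake:      re-download richyrichMixIXL_v1fp16 from Civitai, then run the
               blend from the recipe JSON above
```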
It also exposed a separate problem: realism v3 has the same "preserved as recipe, not as named-binary" pattern as realism v4. Today I had to delete the realism v3 safetensors too, to free disk space for the 3D v4 bake. So now both realism v3 AND v4 exist only as recipes. The work is preserved. The artifacts aren't.
The work-vs-artifact distinction is the lesson I want to internalize. A recipe that produces a model is the canonical artifact. The safetensors file is convenient but ephemeral. SHIPPED_MAP.md is the binding between them. Whenever I ship a model, the first action should be to update SHIPPED_MAP.md, and the second action should be to verify the recipe file in recipes/ matches exactly what got baked. Everything else is downstream.
Next: I'm rebaking realism v3 and v4 from their recipes once the anime v3 work finishes. richyrichMixIXL_v1fp16 has to be re-downloaded from Civitai. Annoying, fully recoverable. The lesson is bigger than the cleanup work — every project I do that produces named artifacts needs a SHIPPED_MAP-equivalent. It's the canonical record that prevents you from doing the same work twice because you forgot you did it.
— Admin · END TRANSMISSION —
SCROLL 030
·
2026.05.14 19:22
bake
composite
ratio
The 10% Accent
After dixar30 collapsed from two-of-three to fourth place across five prompts, I almost cut it from consideration entirely. The two surviving candidates were pixel30 and nova30, both grafts I'd tested at 30% over the 3D v3 base. Either of those was defensible to ship.
But something about the dixar outputs kept nagging me. Even in the prompts where dixar didn't place, there was a face-quality character to it I liked — the eyes had more presence, the skin had more depth. The reason it lost wasn't that it was bad. The reason it lost was that 30% was too much: its character was identifiable at that dose, strong enough to overshadow prompts that called for other characteristics.
So I tried the obvious: keep the dixar contribution, just at 10% as an accent on top of the survivors. Three composite recipes, with a merge sketch after the list:
- **pixel-led:** pixel30 70 + nova30 20 + dixar30 10
- **nova-led:** nova30 70 + pixel30 20 + dixar30 10
- **even split:** pixel30 45 + nova30 45 + dixar30 10
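A minimal sketch of how a composite like these gets baked, assuming each candidate exists as a safetensors checkpoint with matching keys — the filenames are illustrative, and VAE keys are deliberately never averaged:

```python
# composite_bake.py — sketch: 70/20/10 weighted blend of three candidate checkpoints.
import torch
from safetensors.torch import load_file, save_file

RECIPE = {  # pixel-led composite: p70n20d10 (filenames illustrative)
    "pixel30.safetensors": 0.70,
    "nova30.safetensors":  0.20,
    "dixar30.safetensors": 0.10,
}

models = {path: load_file(path) for path in RECIPE}
lead = next(iter(RECIPE))  # the primary ingredient donates its VAE verbatim

out = {}
for key in models[lead]:
    if key.startswith("first_stage_model."):   # VAE keys: preserve, never average
        out[key] = models[lead][key].clone()
    else:                                      # UNet + text encoders: weighted sum
        acc = torch.zeros_like(models[lead][key], dtype=torch.float32)
        for path, w in RECIPE.items():
            acc += w * models[path][key].to(torch.float32)
        out[key] = acc.to(models[lead][key].dtype)

save_file(out, "p70n20d10.safetensors")
```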
The pixel-led composite (p70n20d10) won the next round outright. It placed in four of five prompts, and it beat pixel30 alone, nova30 alone, dixar30 alone, and both other composites.
What this told me: the 30% test failure for dixar wasn't a quality failure. It was a dose-response failure. Dixar's character was actively useful at 10% — enough to flavor the output, not enough to overwhelm the primary grafts. The same model that ranked fourth of six at 30% became indispensable at 10%.
This is now a default move I want in my playbook for future v(N+1) bakes. After a sweep where N candidates are graft-tested at 30%, identify the two strongest survivors. Identify any strong-but-overdosed candidates — the ones that placed early and faded under verification — those are accent candidates. Build a composite as 70% primary + 20% secondary + 10% accent. The composite usually beats all three ingredients.
The reason it works: each ingredient brings something the others lack. Primary contributes the structural quality. Secondary diversifies the look across prompt types. Accent adds character flavor without dominating. When all three play at the right ratio, the result is more even and more interesting than any single graft.
What's not in the playbook: the exact 70/20/10 split isn't sacred. I tried 45/45/10 too and it didn't win — there's something about having a clear primary that prevents the two main grafts from fighting each other. But the principle — strong primary, supporting secondary, 10% character accent — held across the test.
I also want to note what the 10% accent rule doesn't apply to. If a candidate placed last across multiple rounds, it's not an accent candidate; it's just bad. The accent move is for candidates that show character but can't sustain a primary position. Like a strong supporting actor in a movie — wrong choice for the lead, indispensable for the ensemble.
Next: I want to try this same playbook on anime v4 once anime v3 is shipped. The five community 3D models I tested came down to "two strong primaries plus one strong character." If the anime sweep produces the same shape, the same composite formula should work there too.
— Admin · END TRANSMISSION —
SCROLL 029
·
2026.05.14 19:22
blind-test
sample-size
lesson-learned
When Three Prompts Was Wrong
I called dixar30 the winner. After the first round of v4 testing on the 3D model — six candidates, one prompt at first, then three prompts to validate — dixar30 had won two of the three prompts outright. The pattern looked clean. I wrote up my recommendation and was ready to ship.
I asked for another round just to be sure. Five new prompts, four candidates including dixar30, no overlap with the first set. I expected dixar to dominate again.
It got fourth place.
Of five prompts, dixar30 placed only twice — once first, once second. The two prior wins it had earned in round one weren't reproducible. pixel30 placed in five of five; nova30 placed in five of five. The "winner" had been an artifact of three specific prompts that happened to suit dixar's character.
Three prompts wasn't enough. Three prompts is just enough data to fool you. Three is too small to average out prompt-specific bias, yet big enough that "won two of three" feels like real signal.
The math is simple, and I didn't think to do it. With six roughly equal candidates and three prompts, the probability that some candidate wins at least two outright by pure luck is nearly a coin flip. I saw a 2-of-3 hit rate and called it.
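Concretely: with six equal candidates, the chance that all three prompts go to different winners is (6·5·4)/6³ = 5/9, so the chance that some candidate lands a lucky 2-of-3 is 4/9 ≈ 44%. A quick simulation confirms it and shows how the odds tighten with five prompts:

```python
# lucky_winner.py — odds that *some* candidate hits a decisive-looking streak by chance.
import random

def lucky_rate(candidates=6, prompts=3, need=2, trials=200_000):
    hits = 0
    for _ in range(trials):
        wins = [0] * candidates
        for _ in range(prompts):
            wins[random.randrange(candidates)] += 1   # each prompt: one random winner
        hits += max(wins) >= need
    return hits / trials

print(lucky_rate())                    # ~0.44 — a lucky "won 2 of 3" is near a coin flip
print(lucky_rate(prompts=5, need=3))   # ~0.21 — a lucky "won 3 of 5" is rarer, still real
```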
What I want to remember:
Three prompts in a blind eval is noise, not signal. Five prompts is the minimum where a "won three+ of five" claim has real weight. Seven to ten prompts is where I should be comfortable shipping a close call. The size of the candidate pool matters too — with six candidates, lucky-coincidence wins are more likely than with two.
I almost shipped the wrong v4 because I let the first decisive-looking result close the question. The "let's verify" round saved me from a regression I'd have spent weeks of "why does v4 feel off" trying to debug.
The other lesson: when the early result feels surprisingly clean, that's a signal to verify more, not less. Real winners survive scrutiny. Coincidence winners only survive the first cut.
Final v4 ended up being a composite — pixel-led with a 10% dixar accent — that beat every ingredient including dixar30 at 30%. Dixar's character is real and useful at the right dose. Just not at 30%, and definitely not enough to ship from three prompts of evidence.
Next: I've already added a memory ("3-prompt blind evals are noise; 5 is the floor") so future eval rounds don't make the same call. Every blind eval I run from now on includes five-plus prompts before any ship decision, and any candidate that looks decisive after a small round gets a verification round before I commit.
— Admin · END TRANSMISSION —
SCROLL 028
·
2026.05.13 09:00
methodology
scoring
blind-test
The Tie That Wasn't A Tie
The final v3 realism blind eval was a 3-way: v3b vs v3c_pp_keyframe vs v3c_pp_expressive. Ten prompts. Top-3 ranking per prompt, standard 3 / 2 / 1 scoring. Results:
```
v3c_pp_keyframe 21 pts (5 wins, 1 second, 4 thirds)
v3b 21 pts (4 wins, 3 seconds, 3 thirds)
v3c_pp_expressive 18 pts (1 win, 6 seconds, 3 thirds)
```
A 21-21 tie at the top. I'd just spent days on this tournament and the final round produced a coin flip.
Except it didn't. The 3/2/1 scheme treats "never bombs" and "wins more often" as equivalent. v3b placed second or better in seven of ten prompts but never matched keyframe's peak; keyframe had more outright wins but also more last-place finishes. Same total, opposite shapes.
I tried other scoring schemes against the same data:
```
1st/2nd/3rd pts     v3b   keyframe   expressive
3 / 2 / 1            21      21          18
4 / 2 / 1            25      26          19
5 / 2 / 1            29      31          20
5 / 3 / 1            32      32          26
10 / 2 / 1           49      56          25
1 / 0 / 0            4       5           1
```
keyframe wins under every scheme except 3/2/1 and 5/3/1 — exactly the two where the gap from 1st to 2nd equals the gap from 2nd to 3rd, and both of those end in dead ties. The moment 1st place is weighted over 2nd by more than 2nd is weighted over 3rd, keyframe pulls ahead.
The 3/2/1 tie was a coincidence of choosing a scoring scheme that happened to be neutral about peak vs consistency. The actual taste signal in the data — across all weighting schemes that emphasize wins — favored keyframe. The "tie" was an artifact, not a verdict.
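The sweep itself is a few lines — placement counts straight from the eval above, weight triples from the table:

```python
# score_sweep.py — recompute tournament totals under several 1st/2nd/3rd weightings.
placements = {  # (wins, seconds, thirds) over the 10-prompt realism v3 final
    "v3b":        (4, 3, 3),
    "keyframe":   (5, 1, 4),
    "expressive": (1, 6, 3),
}
schemes = [(3, 2, 1), (4, 2, 1), (5, 2, 1), (5, 3, 1), (10, 2, 1), (1, 0, 0)]

for weights in schemes:
    totals = {name: sum(count * pts for count, pts in zip(counts, weights))
              for name, counts in placements.items()}
    print(weights, totals)
```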
What I want to remember: pick the scoring scheme based on what you'll do with the result. For a model that'll generate hundreds of images, "wins more often" matters more than "never bombs" — I'll pick the keepers from a pile, and a model that produces 5 great images out of 10 is better than one that produces 0 great and 10 acceptable, even if the totals look the same on 3/2/1. For something where every image matters (e.g., a single hero shot, no second chances), the calculus reverses — "never bombs" wins, use 3/2/1 or even 2/1/0.
The deeper lesson is that scoring is part of the experimental design, not a neutral mechanic. The scheme you pick smuggles in assumptions about what you value. If a tie under one scheme breaks under another, the break isn't noise — it's the schemes disagreeing about what "winning" means, and you should know which scheme matches your actual question.
Shipped tre_realism_v3c_pp_keyframe as Bonkaiii_realism_v3.
Next: I want to add a "compute under multiple weights" step to all my future tournament evals. Even when there's no tie, seeing how the rankings shift under different schemes tells me how robust my winner actually is. A winner that holds across 5/2/1, 4/2/1, and plurality is a real winner. A winner that only holds at 3/2/1 might just be benefiting from a particular bias.
— Admin · END TRANSMISSION —
SCROLL 027
·
2026.05.13 09:00
lora
baking
methodology
Validated At Runtime Isn't Validated At Bake
I spent three rounds of tournament testing finding the perfect LoRA recipe to bake into realism v3. Round 1 narrowed 15 favorite-accent combos to 7 maybes. Round 2 narrowed those to 5 finalists. Round 3 pitted the finalists against each other across 3 prompts, and one recipe (F3_02 — keyframe + Expressive_H) won YES in 4 of 5 votes.
I baked F3_02 into v3b. The bake produced an unfixable neon nightmare at any CFG.
Every single one of those tournament rankings was done by applying the LoRAs at runtime — included in the prompt as `<lora:...>` tags. At runtime, F3_02 was clearly the best accent stack I could find. That tournament work was real. The signal was real. And it was answering a different question than the one I assumed.
The question runtime testing answers: "if I include this LoRA stack in my prompt, do I like the output?"
The question bake testing answers: "if I fuse this LoRA stack into the model weights permanently, does the model still produce stable output?"
These look like the same question. They aren't. At runtime, LoRAs modify both conditional and unconditional predictions equally each step, so CFG amplifies the prompt-vs-noise delta cleanly — the LoRA's bias partially self-cancels. At bake time, the LoRAs are permanently part of the model, so CFG amplifies their cumulative bias along with everything else. A recipe that's well-balanced at runtime can have so much cumulative bias once baked that CFG amplification breaks it.
I'd been running the methodology against one stability budget when there were two.
What this changes: from now on, runtime tournament validation tells me which recipes are worth considering for a bake. It does not tell me which will survive. I need a separate, smaller validation step — bake at low weight, generate one test image, look for stability — before committing to the recipe. Five-minute bake, two-minute look, kills bad bake candidates before I've invested in the full bake.
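A sketch of that pre-bake check, assuming A1111 is running with `--api` and the low-weight test bake has already been produced by the bake script (the checkpoint name is a placeholder):

```python
# bake_smoke_test.py — one image from a low-weight test bake, before the real bake.
import base64, requests

A1111 = "http://127.0.0.1:7860"

def smoke_test(checkpoint: str, prompt: str = "portrait, cinematic lighting"):
    # Point the WebUI at the test bake, then generate a single image at normal CFG.
    requests.post(f"{A1111}/sdapi/v1/options",
                  json={"sd_model_checkpoint": checkpoint}).raise_for_status()
    r = requests.post(f"{A1111}/sdapi/v1/txt2img",
                      json={"prompt": prompt, "steps": 20, "cfg_scale": 5, "seed": 42})
    r.raise_for_status()
    with open("smoke_test.png", "wb") as f:
        f.write(base64.b64decode(r.json()["images"][0]))

smoke_test("tre_realism_smoketest.safetensors")  # placeholder name
# Two-minute look: neon noise here means the recipe is unbakeable — stop now.
```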
A useful corollary: prompt-time recipes that don't bake aren't worthless. F3_02 is still my best accent recipe at runtime. It just lives in my prompt template now instead of in the model weights. The bake layer and the prompt layer are two different shipping locations for the same kind of work — both are valid, they just serve different use cases.
The model I shipped (Bonkaiii_realism_v3) has only PowerPuff + keyframe baked in. The Expressive_H that was supposed to ship with it lives in my prompt template instead. Same final aesthetic, different distribution of work between bake and prompt.
Next: I want to add a "single-image bake stability check" step to my workflow. Before any 3+ LoRA bake, do the 5-minute small bake first to confirm it doesn't neon. Cheaper than the 30+ minute realization that the entire recipe is unbakable.
— Admin · END TRANSMISSION —
SCROLL 026
·
2026.05.13 09:00
lora
baking
stability-ceiling
The Two-LoRA Wall
v3c was supposed to be the final realism v3. PowerPuff baked at 0.3, plus the two best accents from my favorite-recipes tournament: keyframe poster at 0.2 and Expressive_H at 0.15. Three LoRAs, total 0.65 weight, half the budget of recipes that work fine at runtime.
The bake produced a neon nightmare. Every prompt, every CFG, magenta-green-yellow noise instead of an image. CFG 5 → neon. CFG 3 → neon. CFG 2 → neon. I dropped it to 2 because nothing should produce neon at CFG 2 — at that point the model is barely being pushed. Still neon.
What confused me: this same 3-LoRA stack works fine at runtime. I'd just validated F3_02 (keyframe + Expressive_H) across 3 different prompts on top of v3b — it won 4 of 5 final-round rankings. At runtime, baked into the prompt, this recipe was producing my best realism images yet. Baked into the model weights, it produces garbage.
I built a diagnostic batch: five variant bakes in parallel, each isolating a different hypothesis. v3c_pp_keyframe (drop Expressive_H), v3c_pp_expressive (drop keyframe), v3c_light (same 3 LoRAs at 33% lighter weights), v3c_origin (un-bake v2's existing LoRAs first, then fresh single-bake from raw base), v3c_fp32 (full precision instead of fp16).
The pattern was unambiguous:
```
v3b                PowerPuff alone                                 ✓ stable
v3c_pp_keyframe    PowerPuff + keyframe                            ✓ stable
v3c_pp_expressive  PowerPuff + Expressive_H                        ✓ stable
v3c_light          PowerPuff + keyframe + Expressive_H (lighter)   ✗ neon
v3c_origin         same 3 LoRAs + unbaked base                     ✗ neon
v3c_fp32           same 3 LoRAs at full precision                  ✗ neon
```
It's not weight. It's not precision. It's not base contamination. **It's the specific 3-LoRA combination.** Either of the two accents on top of PowerPuff is stable. Both of them together is not, no matter how lightly you weight them, no matter how clean the base. There's a 2-LoRA interaction happening at the bake layer that doesn't surface at runtime.
I think I understand why: at runtime, each LoRA modifies both the conditional and unconditional predictions equally each step, and CFG operates on the delta between them. The LoRA effect partially self-cancels in that delta. At bake time, the LoRAs are permanently fused into the weights, so CFG amplifies the cumulative bias instead of cancelling it. With one accent on top of PowerPuff, the bias is small enough that CFG handles it. With two, the bias compounds past the model's CFG tolerance, and even very low CFG can't recover.
What I want to remember from this: bake stability and runtime stability are separate budgets. A recipe that wins every runtime A/B can be totally unbakeable. The methodology has to validate both — testing at runtime tells me what makes a good recipe, but only testing at bake tells me what survives the math.
Next: I shipped v3c_pp_keyframe as Bonkaiii_realism_v3 (it won the 3-way blind eval against v3b and v3c_pp_expressive once I weighted outright wins more heavily). But I'm keeping F3_02 (keyframe + Expressive_H together) as a prompt-time recipe — runtime is where their pair works. The bake layer is where it doesn't. Both truths can be true.
— Admin · END TRANSMISSION —
SCROLL 025
·
2026.05.12 15:32
lora
testing
optimization
Zero Winners Is Also A Winner
After I shipped Bonkaiii_realism_v3 (v2 + PowerPuff baked), I wanted to push further. Could any of my favorite LoRAs improve v3 as accents on top, like vinegar in a sauce? I built a 15-recipe test set: solos at 0.2, lighter solos at 0.15, thematic pairs, a kitchen-sink stack. All on top of v3, in a pairwise YES/NO/MAYBE review.
Result: 0 YES, 7 MAYBE, 8 NO.
I ran one more round narrowing on the MAYBES + new probes. 15 more recipes. Still no clear YES.
That's 53 LoRA recipes total across this whole arc — round 1 combos, round 2 winners, round 3 with v3 candidates, two rounds of favorite accents. Not a single one was a step up from v3 that I could confidently say "yes, this is better."
For a while I wanted to read that as failure. I kept generating new recipes and trying new combinations. The "maybe one more round will find it" feeling was strong. I think that's the trap — when you've invested days in a search and haven't found the thing, the temptation is to interpret "haven't found" as "haven't searched hard enough" rather than "there isn't one."
But 53 recipes is not under-searching. 53 recipes with thoughtful design across multiple weight schemas and stack patterns is a substantial sampling of the LoRA accent space. The absence of a winner *is* the result. v3 is at the local maximum of what LoRAs-on-top can do. Adding accents fights the baked-in character of the model more than it helps.
What I want to remember from this is the calibration: distinguishing "I should try one more round" from "I should stop" lives in the pattern of results across rounds. When round 1 gives 10 YES, round 2 narrows to 2 clear winners, round 3 picks one — that's a convergent search and you should finish it. When round 1 gives 7 MAYBE / 0 YES and round 2 gives the same shape — that's not narrowing, that's noise. The thing you're looking for isn't in the space you're searching.
There's also a practical heuristic: how does my judgment feel during the review? Convergent rounds feel easier as they go. "Oh yes, definitely this one." Divergent rounds feel harder. "I really can't tell. This one is fine. They're all fine." The difficulty itself is signal. I noticed I was rating combos as MAYBE just to move on, not because they were genuinely ambiguous. That should have told me earlier that I was past the productive frontier.
The v3 I shipped is staying. I'm not adding accents. The next play if I want a v4 isn't more LoRA testing — it's something structurally different. Model merge with a different base. Train a new LoRA on different data. Different conceptual move, not finer tuning of the current one.
Next: I want to write down what "the optimization is over" feels like in muscle memory so I catch it sooner next time. Hours of "this is harder than it should be" is the cost of not recognizing diminishing returns when I see them.
— Admin · END TRANSMISSION —
SCROLL 024
·
2026.05.12 15:32
lora
blind-test
realism
When My Taste Beat My Theory
I went into the final realism v3 evaluation with a confident theory. After two rounds of LoRA combo testing, I had two finalists: v3a (pure realism enhancement — keyframe poster + rimixO + Expressive_H) and v3b (stylized — PowerPuffMixLora alone). My narrative going in was that v3a was the "real" answer. The clean one. The one that improved v2 without pulling it toward anime aesthetic. v3b had PowerPuff in it, and PowerPuff is anime-flavored, which I'd been explicit about being something I didn't want.
I generated 5 prompts on each of the 3 candidates (v2 control, v3a, v3b), sorted them into the blind eval, and ranked.
Results:
```
v3b — PowerPuff alone: 13 pts (won 3 of 5 prompts)
v2 — control: 11 pts (won 2 of 5)
v3a — pure realism: 6 pts (won 0 of 5)
```
v3a — the one I'd been emotionally betting on — came in dead last. Never won a single prompt. The candidate I'd predicted would lose because of "anime contamination" won outright.
I sat with that for a minute. Two things I think are true.
First, there's a technical explanation. PowerPuff is a passive style LoRA — its effect is encoded in weight deltas, fires automatically when loaded, no trigger needed. When I baked it into the model at 0.3, every generation got its effect at 0.3 weight. The v3a LoRAs (especially keyframe poster and Expressive_H) are trigger-dependent or weakly passive — they want activation tokens in the prompt to fire fully. When I baked them, the LoRA math was in the model but my test prompts didn't include the triggers, so v3a was running at maybe 30–50% of its intended strength. The bake was mechanically correct; the test conditions didn't activate what got baked.
But that's only half the story. The other half is what I had to admit to myself: my actual aesthetic preference is for the polished/pretty look, and PowerPuff at low weight gives me that without going overtly anime. I'd been describing my goal as "realism," but when forced to pick three images blind and rank them, my eye consistently reached for the one with that subtle aesthetic polish. The blind eval has no narrative attached. It just shows three images and asks which I'd rather look at. And I'd rather look at v3b.
So when I say "I want realism," what I actually mean is "I want realism that's also pretty," which is a contradiction that resolves toward the prettier of the options on every test. The data isn't going to lie about that. I can either keep pretending my goal is photographic accuracy and be confused every time my chosen winner has PowerPuff in it, or I can update the language: I want realistic-leaning aesthetic-photography, with the aesthetic part doing real work.
v3b shipped as Bonkaiii_realism_v3. PowerPuff baked in.
Next: I want to do the same blind test on the other v2 models — 3D, real3D, etc. — with PowerPuff explicitly baked vs not. If my eye consistently picks the PowerPuff-included version across the whole spectrum, that's a signal to bake PowerPuff into every Bonkaiii model and stop pretending my taste is more austere than it is.
— Admin · END TRANSMISSION —
SCROLL 023
·
2026.05.12 15:32
lora
testing
silent-failure
The Year-Old File That Wasn't There
After running two rounds of LoRA combo tournaments and picking W09 (PowerPuffMixLora + colorij-reij) as one of two final winners for my realism v3, I went to bake it. The bake script crashed: `Missing LoRA: colorij-reij.safetensors`. Just a sidecar `colorij-reij.json` was on disk, no actual model file. Dated May 2025 — over a year old. I'd never downloaded the .safetensors, or I'd deleted it at some point and forgotten.
Which means every test I ran with W09 wasn't actually testing W09. A1111 doesn't error when a referenced LoRA file is missing. It logs a quiet warning to the terminal that I never saw because I was looking at the UI, and continues the generation without that LoRA. Every "PowerPuff + colorij" generation was actually just "PowerPuff alone." My round 1 YES vote, my round 2 review pass where I came back to W09 a second time and flipped it from NO to YES — all of those judgments were on PowerPuff-only output. Colorij never fired. It was never on disk.
This was the kind of bug that hides in plain sight. The images all generated. They all looked normal. There was no failure event to flag. Until I tried to bake a combination and the bake script — which actually validates file existence before merging — refused to start.
The lesson I want to internalize: validating test infrastructure before trusting test results matters as much as the test design. I'd spent hours blind-evaluating outputs from a recipe that wasn't really my recipe. The downstream conclusion ("W09 is a valid v3 candidate worth baking") was technically correct only because the simplified version (PowerPuff alone) happened to also be a winner. If the missing LoRA had been the one doing the work, I'd have shipped a model based on a phantom ingredient and never known.
The cleanup was small — I corrected the recipe sidecar JSON to drop colorij and baked v3b as PowerPuff alone, which matched what was actually validated. But the broader fix is going to take more discipline. From now on I want the rotate script to fail loud if any referenced LoRA file is missing, before any generation runs. A pre-flight check that lists every LoRA in the active test list, verifies each .safetensors exists on disk, and aborts with a clear list of missing files. Five lines of Python; would have caught this on day one.
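Roughly those five lines — a sketch assuming the test list is a plain list of LoRA names and the LoRA directory is the one A1111 loads from (the path is illustrative):

```python
# preflight_loras.py — abort before any generation if a referenced LoRA is missing.
from pathlib import Path

LORA_DIR = Path("~/stable-diffusion-webui/models/Lora").expanduser()  # illustrative

def preflight(lora_names: list[str]) -> None:
    missing = [n for n in lora_names if not (LORA_DIR / f"{n}.safetensors").exists()]
    if missing:
        raise SystemExit(f"Missing LoRA files, aborting test run: {missing}")

preflight(["PowerPuffMixLora", "colorij-reij"])  # would have failed loud on day one
```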
The second-order lesson is one I keep relearning: A1111's "silently degrade on error" behavior is a usability win for people who don't want their generation to crash mid-prompt, but it's a research-mode liability. When I'm doing methodical testing where each input difference is supposed to mean something, a silent "actually that LoRA didn't load" is a critical bug. The same fault-tolerance that makes A1111 forgiving for casual users makes it dangerous for blind evaluation work.
The session ended on a good note anyway — v3b (PowerPuff alone at 0.3) won the final eval against v2 and v3a, and ships as my new Bonkaiii_realism_v3. But that was almost luck. The cleaner takeaway is on infrastructure: I'm adding the pre-flight existence check tomorrow, and I'm going to audit every other test script that references file paths to make sure they do the same.
Next: build the pre-flight LoRA validator and bolt it onto the rotate script. Then look at the blind-eval scripts for the same pattern. Trust the test results only as much as the test infrastructure deserves to be trusted.
— Admin · END TRANSMISSION —
SCROLL 022
·
2026.05.07 15:46
tooling
civitai
downloads
The Browser Will Lie To You About Big Downloads
Tried to download Wai-Illustrious v17 from Civitai through Chrome four separate times. Six and a half gigabytes. Chrome got to 1.2 GB, then 3.4 GB, then 1.8 GB on different attempts, and each time the connection died and Chrome started over from zero. The progress meter is performance theater for files this size — there's no "resume from byte N" path in the browser when the CDN drops you.
Switched to curl. Finished cleanly the first time:
```
curl -L --fail \
--retry 10 --retry-delay 15 --retry-all-errors \
-C - \
-H "Authorization: Bearer $CIVITAI_TOKEN" \
-o waiIllustriousSDXL_v170.safetensors \
"https://civitai.com/api/download/models/2883731"
```
The `-C -` is the load-bearing part. It tells curl to resume from wherever it left off if the connection drops. Combined with `--retry 10 --retry-all-errors`, every network blip gets transparent retry. Six and a half gigs over a flaky link, no babysitting, no zero-progress restarts.
Two non-obvious things bit me along the way.
First, some creators disable API downloads. Event Horizon Anime v6 returned 401 with `"The creator of this asset has disabled downloads on this file."` Verified it wasn't a token issue (other downloads worked fine on the same token). There's no curl workaround. You either go through the browser logged into civitai.com or pick a different model. I picked Nova, since Event Horizon's marginal "newer release" benefit didn't justify wrestling with the browser when the curl path already had Nova lined up.
Second, the bash watcher I'd written to auto-run merges after the downloads finished failed silently across all five blends. The script invoked `python` and macOS doesn't ship a `python` symlink, only `python3`. Bash scripts launched via `nohup` also don't inherit venv activation from the parent shell. The blends reported "success" because the loop kept going, but every output was zero bytes. Lost about 30 minutes to that until I caught it in the log. Fix was to invoke the venv's full path explicitly — `/path/to/project/venv/bin/python` — instead of relying on shell PATH. Bare `python` is fine in interactive shells where activation has happened, never in scripts that might run unattended.
The pattern I want to remember from both of these: when a piece of infrastructure stops working, stop retrying the broken thing and look for the version designed for reliability. Browser is for interactive convenience; curl is for unattended reliability. Bare `python` is for shells where activation already happened; full venv paths are for scripts that have to stand on their own. Each time the symptom was "the obvious tool intermittently fails" and the fix was "use the slightly-less-convenient tool that's actually engineered for the case I'm in."
Saved the pattern as `scripts/wai_download_fallback.sh` — a watcher that auto-falls-back from browser-download to curl after three minutes of stalled progress. It paid for itself the first day. Next time I redownload anything 5GB+ I'll start in curl directly and skip the browser dance entirely.
Next: I want to add the "always use venv full path" check into a pre-flight script that lints any new bash script in `scripts/` for bare `python` calls. That's the kind of mistake I should only make once.
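A sketch of that lint — it assumes scripts live in `scripts/` and the rule is simply "no bare `python` tokens"; the regex is deliberately crude:

```python
# lint_bare_python.py — flag bash scripts that invoke bare `python` instead of a venv path.
import re, sys
from pathlib import Path

BARE = re.compile(r"(?<![/\w])python(?!3)\b")  # bare `python`: not part of a path, not `python3`

bad = []
for script in Path("scripts").glob("*.sh"):
    for lineno, line in enumerate(script.read_text().splitlines(), 1):
        if BARE.search(line) and not line.lstrip().startswith("#"):
            bad.append(f"{script}:{lineno}: {line.strip()}")
if bad:
    sys.exit("Bare `python` calls found:\n" + "\n".join(bad))
```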
— Admin · END TRANSMISSION —
SCROLL 021
·
2026.05.07 15:46
merging
bonkaiii-spectrum
tournament
When An Ingredient Is Doing Two Jobs
My v2 anime model was looking too realistic and too 3D and I wanted it back to anime. Walked into the day with two ideas, both wrong: overlay 20% of a favorite anime model on top, or pull back the lighting LoRAs that I'd assumed were baked in. The actual fix wasn't either of those.
I read the recipe sidecars and discovered the v2 anime didn't have any baked LoRAs at all — the "lighting feel" I'd been mentally blaming on LoRAs was coming from Plant Milk Walnut at 40% in the base merge. Walnut's reputation in the community is for painterly atmospheric depth, which is why I'd reached for it. What I'd missed was that the same property doing the painterly lighting was *also* what gave the merge its semireal/3D feel. Walnut was doing two jobs at once. I wanted one of those jobs and not the other, and you can't dial out half of an ingredient.
The right move wasn't layering more anime on top to fight the 3D. It was replacing Walnut with something that does only the lighting half — pure-anime backbone, no semireal contribution. Nova Anime XL v18.0 fit. Its release notes specifically called out improved light handling, and unlike Walnut it has no 3D heritage in the training data.
What followed was the cleanest tournament I've ever run. Three rounds, single-variable narrowing per round, single answer at the end.
Round 1 was wide. Five recipes spanning extreme corners — Wai+Hassaku no lighting, Wai+Nova, Hassaku+Nova, three-way mix, near-pure Hassaku. Each candidate was designed to answer a different question rather than be a slight variation of the same recipe. E2 (Wai 60 / Nova 40) won at 14 pts. The diagnostic value was bigger than the win itself: anything without Wai bombed (E3 scored 3, E5 scored 0), and the pure-anime recipe with no lighting model still scored 12 — so Nova *was* contributing real lighting value, but the gap to "no lighting model at all" wasn't huge.
Round 2 narrowed. I locked Wai at 60% and varied Hassaku/Nova in the remaining 40% across three new candidates. E2 still won at 12, R1 (Hass20/Nova20) close behind at 10, the others trailing. Hassaku additions actively hurt. Three-ingredient blends weren't better than two. The lesson was clear: drop Hassaku entirely.
Round 3 dialed in. Single-axis sweep — Wai/Nova at 50/50, 60/40, 70/30. F1 (Wai 70 / Nova 30) won 14 vs 11 vs 11. Three rounds, ten candidates, one answer. Bonkaiii_Anime_v2 shipped as Wai 70 / Nova 30.
The methodology is what I want to remember more than the recipe. Each round had exactly one variable. Round 1 explored everywhere; round 2 locked the winning ingredient and varied the others; round 3 reduced to a single-axis ratio sweep. If I'd tried to vary multiple things in one round, the winner would have been uninterpretable — "did E2 win because of Wai 60 or Nova 40 or absence of Hassaku?" — and I'd have had to redo the work to know. One variable per round is what made every result mean something specific.
The other thing I'm walking away with is a diagnostic lens. "What does each of my ingredients actually contribute" turns out to be a different question from "what is each ingredient famous for." Walnut is famous for atmospheric lighting. It also drives a 3D feel. The two effects are inseparable in Walnut because they're both consequences of the same training data. Find a different ingredient where they aren't bundled and you can have one without the other.
Next: I want to re-check my other Bonkaiii merges for the same dual-purpose pattern. The 3D models probably want to keep Plant Milk because the 3D effect is wanted there — that's correct alignment, a single job from my perspective. But realism might benefit from a Walnut → realism-pure-lighting swap too. Worth a tournament round to find out.
— Admin · END TRANSMISSION —
SCROLL 020
·
2026.05.05 16:44
civitai
marketing
voice
Spec Sheets Don't Sell Models
Wrote up the v2 release notes for all five Bonkaiii models in one go. Felt productive. Each model got a section: the merge recipe, the baked LoRAs and their weights, the CFG range, a short blurb about what changed. Tables. Bullet points. "v2.0 — Cinematic Lighting Edition."
Read it back the next morning and it was unbelievably boring.
I've been on Civitai for months. I've downloaded a hundred checkpoints. I have never once read a release-notes block that started with "this version contains the following baked LoRAs at the following weights" and then gone "wow, I have to try this." Nobody reads those. The people who care about exact weights are reading the metadata, not the description. The people reading the description are picking which model to download by *vibe*.
Threw the whole thing out and rewrote it as five short stories. Realism became "the rebel of the spectrum" — the one that wouldn't take the recipe, where I tried to bake the full cinematic stack and it turned photoreal portraits into figurines, so I backed off to a gentler dose. 3D became "the figurine, finally lit properly" — the merge had always rendered figurines but you had to write the lighting yourself, and v2 puts the figurine on a photographer's set instead of a mall display. The Anime model became "anime that speaks fluent anime-lighting." Each model got a paragraph of *who it was* and a paragraph of *what changed* — and the LoRA recipe became a small code block at the bottom for the people who actually want it.
The version notes that go in Civitai's box at the top of the version are still the spec-sheet version, because that's what Civitai's UI is for. But the description body — the part that lives on the page and gets read by people deciding whether to try it — is now a story.
What I keep underestimating is that I'm not selling a checkpoint. I'm selling *the experience of using* the checkpoint. People want to know what mood they're going to feel when they generate with it. Not what the recipe is. The recipe just proves I'm not lying about the mood.
The other thing I noticed while rewriting: the stories pretty much wrote themselves once I asked the right question. Not "what's in this model" but "what's the personality of this model and what's its arc since v1." Once I had a one-sentence answer to that for each of the five, the rest fell out. I didn't have personalities for each model before because I'd been treating them as outputs of a recipe, not as products with character.
Going to apply the same lens to LoRAs and prompt packs. The LoRA description page on Civitai is also currently boring — "trained on N images of X at Y settings, use trigger Z." That's spec sheet again. The LoRA also has a personality. So does the prompt pack. The story is doing the selling.
Next: I'm going to rewrite the descriptions for my v1 models too while I'm at it. They're currently spec sheets and they're sitting there underperforming.
— Admin · END TRANSMISSION —
SCROLL 019
·
2026.05.05 16:44
lora
baking
neon-nightmare
The Stabilizer LoRA Accident
After I baked the lighting LoRAs into all five Bonkaiii checkpoints I started doing side-by-side tests against the originals. Same prompt, same seed, V1 vs V2. Most of the differences were what I expected — softer rim lights, warmer backlit halos, more cinematic shaping. The thing I did *not* expect was that V2 of realism stopped neon-nightmaring on me.
V1 of realism, my photoreal merge, has a known fault: at CFG 5+ on certain prompts it falls into oversaturated magenta-green output. Not always. Just often enough that I learned to stay at CFG 4-4.5 and never push it. I assumed it was a quirk of the merge — Wai + Walnut crossed in a way that left a broken patch in the latent space, and that patch was the cost of doing business.
V2 of realism doesn't do that. Same prompts. Same seeds. Same CFG. Output stays graded.
The only thing that changed is two LoRAs baked in at low weights — Anime Cinematic at 0.4 and S1 Dramatic at 0.2. Those are *lighting* LoRAs. They shouldn't fix oversaturation.
Sat with that for a minute and worked through what was actually happening. SDXL merges have these small attractor regions in their weight space — degenerate spots where the sampler can collapse into bad outputs. Even tiny weight perturbations can nudge the model out of the attractor and toward whatever distribution the perturbation came from. My lighting LoRAs were trained on properly-graded cinematic photos, not blown-out content. Their gradient field literally points away from oversaturation. So even at 0.2 weight, baking them in pulls the model out of the broken patch and into a distribution where saturation behaves.
There's also probably some implicit CFG damping happening. Adding LoRA deltas redistributes attention weights slightly, which can act like a fractional CFG reduction. The model needs less push to commit to a coherent output, so high-CFG burn becomes less likely.
I went looking and apparently the community calls these "stabilizer LoRAs" or "reg LoRAs." Some people bake tiny weights of trusted LoRAs *just* to fix unstable merges. They aren't trying to add the LoRA's content — they're using it as a pull toward properly-distributed weight space. I had no idea. I was trying to add nice rim lights.
The lesson I'm walking away with isn't "always bake stabilizers" — it's that merges have failure modes that aren't visible in the merge math but are visible in the output, and a small dose of any well-trained LoRA can paper over them. Which means a stable checkpoint isn't only a property of the recipe. It's also a property of what's been baked into it after the recipe.
Going to start adding a tiny stabilizer dose to every photoreal merge I do from now on, just on principle. No reason to ship the neon nightmare twice if a half-second of math fixes it.
Next: same trick on my older photoreal merges that I'd benched. Some of them might be salvageable.
— Admin · END TRANSMISSION —
SCROLL 018
·
2026.05.05 16:44
lora
baking
civitai
Baking The Lights In
For weeks every prompt I wrote on my Bonkaiii spectrum had the same four LoRA tags glued to it. `<lora:S1_dramatic:0.4>, <lora:zavy-rmlght:0.4>, <lora:zavy-bcklt:0.4>, <lora:lighting_anime_cinematic:0.6>`. Every single image. I tested without them once, decided I never wanted to ship one without them again, and went back to copy-pasting forever.
This week I finally baked them in.
The first version was a 200-line Python script that loads the SDXL checkpoint, walks each LoRA's UP/DOWN matrices, multiplies them out at the recipe weight, and adds the result to the base weights. Math is `W_new = W_base + (alpha/rank) * (up @ down) * lora_weight`. Twelve seconds per bake. Output 6.5GB, drop-in replacement for the original checkpoint.
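The heart of that loop, sketched for the plain linear-layer case — the key mapping between LoRA naming and checkpoint naming is elided:

```python
# lora_bake_core.py — fuse one LoRA layer pair into a base weight (the formula above).
import torch

def bake_layer(w_base: torch.Tensor, up: torch.Tensor, down: torch.Tensor,
               alpha: float, lora_weight: float) -> torch.Tensor:
    rank = down.shape[0]                      # LoRA rank = rows of the DOWN matrix
    delta = (alpha / rank) * (up @ down)      # reconstruct the full-size weight delta
    return w_base + delta * lora_weight       # merge at the recipe weight

# Conv layers store up/down as 4-D tensors and need flattening before the matmul;
# this sketch covers the linear case only.
```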
It mostly worked. But every LoRA reported `applied=794, skipped=192`. The 192 skipped were all `lora_te2_*` layers — text encoder #2, CLIP-G. The LoRA stored them in HuggingFace naming (`text_model.encoder.layers.X.self_attn.q_proj`); the SDXL checkpoint stored them in OpenCLIP naming (`transformer.resblocks.X.attn.in_proj_weight`, fused QKV). My key-mapping didn't translate between the two. I shipped the bake anyway since UNet is where lighting actually lives, but I noted the gap so I'd remember it later.
Then came the part I underestimated: the weights I'd been using at runtime were too strong when baked. Same numbers. Same math. Different feel. I'm guessing the runtime LoRA path applies in fp16 with some float drift that softens the effect, and my fp32 bake is more "honest" about what 0.4 actually does. Whatever the reason, v2 was overcooked. v3 (-0.2 across the board) was right.
Then I got greedy. I added my four always-on style LoRAs (PowerPuff, pony+noob, rimixO, prtyface) into v4. prtyface had its own naming scheme that didn't match either of the two I knew about and silently `applied=0, skipped=560`. Removed it. Tested v4 with the other three baked. Looked over-cooked. Backed out completely — went to v5_lit, lighting LoRAs only, no style.
For realism the lighting recipe also broke. Rim and backlit on photoreal portraits made them look like 3D figurines — different model, not a stronger one. So realism got its own gentler recipe: just ace at 0.4 and S1 at 0.2, no rim or back. Different sibling, different rules.
Five baked checkpoints in maybe two hours of clock time, most of which was me staring at outputs deciding whether 0.3 was enough. The baking itself was 12 seconds per model. The expensive thing was the taste check — and once I had the script, I could iterate on taste at script-rerun cadence instead of LoRA-retraining cadence.
Civitai gets v2 of all five soon. Same model pages, just "Add Version." V1 stays up for the LoRA-Artist purists who want the unopinionated base. V2 is for everyone else who'd rather not type three lighting tags into every prompt for the rest of their lives.
What I keep coming back to: how much of "iteration speed" turns out to live in tooling I haven't written yet. Two months ago I'd never written a merge tool either. The blind tournament evaluator. The auto-test queue. The bake script. None of these existed. Each one collapsed a thing that used to take an evening into something that runs in a coffee break. The pattern is starting to feel like the work — the LoRAs and merges are downstream.
Next: more LoRAs to bake, and I want to fix that te2 skip so I can bake LoRAs whose effect lives partly in the text encoder.
— Admin · END TRANSMISSION —
SCROLL 016
·
2026.05.01 05:15
img2img
debugging
negatives
The Neon Nightmare Was The Negative Prompt
I tried to run my new Bonkaiii_realism merge through `imgtoimgmulcheck.py` to batch-test it on a folder of references. The output was neon. Pure magenta-green-yellow noise where the image should have been. I assumed the model was broken — maybe a corrupted save, maybe something wrong with the merge recipe, maybe an SDXL/Illustrious VAE problem.
Spent two hours on hypotheses that didn't pan out. CLIP_stop_at_last_layers — wrong guess. Polish passes contaminating the base — disabled them, base alone was still neon. External VAE override — tried it, no change. An empty `checkpoint_patr1_map` entry shadowing the default — found it, fixed it, *still* neon.
Then I tried something I should have tried at hour zero: I loaded the same model in A1111's WebUI directly, fed it the same prompt and seed, and hit generate. It worked. Clean output. Beautiful. The model wasn't broken — my script was.
That should have made me happy and instead made me miserable, because now I had to figure out what my script was doing differently from a stock WebUI generation. So I added a diagnostic block right before `process_images(p)` — print the prompt, negative, CFG, steps, sampler, denoise, size, init size, override settings, and seed. Run the script once. Diff the printed values against what I'd typed into the WebUI by hand.
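The block, roughly as it now reads — `p` is A1111's processing object, and these are its standard attribute names, though any that differ across versions should be treated as assumptions:

```python
# Dump full generation state before the first call, so any mismatch with a
# hand-run WebUI generation is a one-screen diff away.
def dump_state(p):
    for field in ("prompt", "negative_prompt", "cfg_scale", "steps", "sampler_name",
                  "denoising_strength", "width", "height", "seed", "override_settings"):
        print(f"[diag] {field} = {getattr(p, field, '<unset>')!r}")

dump_state(p)                 # runs right before process_images(p)
result = process_images(p)
```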
Most fields matched. The negative prompt didn't. The script's negative was the project default — about 25 different `((double-parenthesized))` body-shape tokens (`((muscles))`, `((fat))`, `((wide hips))`, and so on), each with weight 1.21 from the parens, stacked together. The WebUI run had used a one-word negative: `lowres`.
That was it. The photoreal merge couldn't tolerate the heavy negative-conditioning vector that the anime and 3D merges had been silently absorbing for months. Replacing the script's negative with `lazyneg, lowres, worst quality, bad anatomy, bad hands, deformed, blurry, signature, text, watermark` (no double-parens, no body-shape stack) made the same merge produce the same beautiful output as the WebUI.
The lesson I'm pulling out: when a model works in one tool and not another, the model is fine. The difference is in the tool's defaults, and the only way to find it is to dump every parameter and diff. I had been chasing model-side hypotheses (VAE, CLIP skip, recipe) when the answer was a string field I'd forgotten was being applied.
Also: photoreal merges have less headroom than I thought. The same negative I'd been gleefully stacking for the anime checkpoints was actively poisoning the realism slot. Different merges have different tolerance for negative weight, and the photoreal end of the spectrum needs the cleanest, leanest negatives I can write.
The diagnostic block stays in the script permanently. Every img2img run prints its full state before the first call. It's three seconds of console output that will save the next two hours.
— Admin · END TRANSMISSION —
SCROLL 017
·
2026.05.01 05:15
merging
civitai
bonkaiii-spectrum
Five Checkpoints In One Weekend
Saturday morning I had ingredients. Sunday night I had five checkpoints uploaded to Civitai. The Bonkaiii Spectrum: realism, real3D blend, 3D, 3D-anime blend, anime. Same family of ingredients, swept across the photoreal-to-stylized range so I could pick a vibe and go.
Going in, I'd told myself I'd be happy if I shipped one. Maybe two. Five was a fantasy number that I wrote in PLAN.md to keep myself ambitious. Then the merge tooling started working in my favor and I just kept going.
Slot 1 (Realism) was the simplest: 60% ilustreal v50 VAE, 40% Realism Illustrious by Stable Yogi v5.5. Two ingredients. CFG 5. Done in twenty minutes including the sanity test grid.
Slot 2 (Real3D blend) was where I hit the VAE bug — RA7 plus Hemp II at 50/50 was washing out grey-green. Fixed the merge tool to preserve the first ingredient's VAE instead of averaging, re-ran, clean output.
Slot 3 (3D) and Slot 4 (3D-anime blend) came together fast once the tool was reliable. Walnut + PerfectDeliberate + Illustrij in different ratios. CFG bands that I worked out by sweeping 3.0 to 7.0 on each merge — the Plant Milk family really does want CFG 3-3.5 while everything else is happy at 5-7.
Slot 5 (Anime) was the one with the four-way tie that taught me to vary the lead ingredient instead of the supporting weights. Once I rebuilt the bracket with four different leads, Wai-led won cleanly.
What I keep coming back to is how cheap merging is compared to training. A LoRA is hours and a real dataset and captions and tag audits. A merge is an afternoon of recipe tweaks once the ingredients exist. Ocean3 ships 8 flavors of Plant Milk in 4 months and I used to think that was a freakish output rate. Now I think it's just the natural cadence when your bottleneck stops being compute and becomes recipe taste.
The Bonkaiii Spectrum is also the first thing I've ever shipped where I picked every output through a blind tournament instead of by feel. Each slot went through 5-7 candidate merges, evaluated through HTML pickers that hid the recipes from me, ranked 1st/2nd/3rd, and only the winner went up. That methodology is what made me trust the spectrum was real and not me convincing myself.
What's left for v1.x: track which slot people actually use. The realism slot is going to be the popularity test — there are a hundred photoreal Illustrious merges on Civitai and mine has to either be different or be cleaner. If it's neither, I learn something useful about what I underestimated. The anime slot has less competition by raw count, more by quality. The middle slots (real3D, 3D, 3DAnime) are where I think the differentiation actually lives — most creators stay at the poles.
Next merge generation is going to start from this insight: the gap is in the middle of the spectrum, not the ends.
— Admin · END TRANSMISSION —
SCROLL 015
·
2026.05.01 05:15
tournament
evaluation
merging
Seven Hassaku Variants And No Winner
Slot 5 was the anime end of the spectrum. Pure stylization, the most-anime checkpoint in the Bonkaiii lineup. I spent two evenings tournament-evaluating seven candidate merges to pick the winner.
Round 1 ended in a 4-way tie. So did round 2. By round 3 I was looking at four merges that ranked indistinguishably across three blind evaluators, and I was about to flip a coin.
Then I noticed what they had in common. Every single one of the seven candidates I'd seeded the bracket with was Hassaku-led. I'd taken Hassaku XL Illustrious v3.4 and varied the supporting weights — 30% Walnut vs 40% Walnut vs 50% Walnut, plus a sprinkle of Wai or Unholy at 10-20%. Seven recipes that were really one recipe with knobs adjusted by tiny amounts.
Of course they tied. They were all the same merge wearing slightly different shirts. The blind evaluators were correctly telling me there was no meaningful difference, and I was misreading that signal as "I need more rounds."
What I should have been varying was the *lead*. Not "Hassaku at 50% with these support weights" vs "Hassaku at 60% with those support weights" — but Hassaku-led vs Wai-led vs Walnut-led vs Illustrij-led. Different dominant flavors, not different shades of the same flavor.
Round 4 I tossed the bracket and seeded four new candidates with four different leads. The winner was clear inside one round.
The lesson is that "more candidates" doesn't fix a tournament when the candidates are clones of each other. If the eye can't separate them, you can run as many rounds as you want — you're sampling the same point in recipe-space repeatedly. You have to actually *move* in recipe-space, and the way you move is by changing the dominant ingredient, not by tweaking the supporting ones by 10%.
This is the same shape as the LoRA triangulation lesson from a couple weeks ago. Three experiments that change three different dimensions tell you something. Three experiments that change the same dimension by slightly different amounts tell you nothing — they just give you the illusion of testing.
Going forward when I run a tournament for a slot I'm going to enforce a "one of each lead" rule for the seed pool. If I want a fifth candidate, it has to be a paradigm I haven't already covered. No more bracket-of-clones.
— Admin · END TRANSMISSION —
SCROLL 014
·
2026.05.01 05:15
merging
vae
checkpoint
The Grey-Green Wash That Killed My Cross-Family Merges
I spent most of a Saturday convinced my new merges were broken. Slot 2 (RA7 + Hemp II at 50/50) came out grey-green and washed. So did the 50/50 with Walnut. So did the 50/50 with Hassaku. The 30% versions (SB1-3 from earlier in the week) were fine. Switching to ilijelle at 50/50 was fine. Everything else I tried at 50/50 came out looking like someone dialed contrast and saturation down 40% and stopped.
I sat with that pattern for a while before it clicked. The thing all the broken merges had in common wasn't a model — it was a VAE family mismatch. RA7 has the ilustreal VAE baked in. Walnut, Hemp, Hassaku, Wai, Unholy don't share that VAE lineage. ilijelle does (it was actually one of the ingredients in ilustreal). That's why ilijelle merged cleanly and everyone else came out hazed.
What `checkpoint_blend.py` was doing — and what I'd never thought hard enough about — is treating *every* tensor in the state dict the same way. Including the `first_stage_model.*` keys, which hold the VAE. So when I merged 50/50, I was averaging two incompatible VAEs into one Frankenstein VAE that decoded latents into mush. At 30% one VAE dominated enough to mostly mask it. At 50/50 the average was the worst of both.
I rewrote the blend to detect VAE keys and just use the first ingredient's tensors verbatim — no averaging, ever. One model brings the VAE, the rest contribute UNet and text encoder weights only. Re-ran the same recipes. Slot 2 came back clean. Slot 3 came back clean. The whole spectrum unlocked.
The frustrating part is that MERGE_STRATEGY.md in this project literally has the line *"don't merge models with mismatched VAE bakes"* listed as a hazard. I knew it as theory. I just didn't build it into the tool. The tool happily averaged whatever I told it to average and trusted me to know what I was doing — which I didn't.
What I'm taking from this: theory in a markdown file is not the same as a guardrail in code. If a thing is dangerous and I know it's dangerous, the tool should refuse to do the dangerous thing by default, not silently produce washed-out garbage and let me figure out why. Going forward every merge tool I touch is going to special-case the VAE keys. The default behavior will be "preserve, don't average," and you'd have to pass a flag to override it.
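The guardrail, sketched — in an SD checkpoint the VAE lives under the `first_stage_model.` prefix, and preserve-from-first is now the default with an explicit flag to override:

```python
# Inside checkpoint_blend.py (sketch): VAE keys are preserved by default, never averaged.
VAE_PREFIX = "first_stage_model."

def blend_key(key, tensors, weights, merge_vae=False):
    """tensors/weights are aligned lists; the first entry is the lead ingredient."""
    if key.startswith(VAE_PREFIX) and not merge_vae:
        return tensors[0].clone()          # lead model donates its VAE verbatim
    acc = sum(w * t.float() for w, t in zip(weights, tensors))
    return acc.to(tensors[0].dtype)
```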
Also taking this: when something fails in a pattern I can't explain, the answer is almost never "the model is broken." It's "I'm doing something wrong that I haven't noticed yet." The grey-green wash had been telling me about VAE math for hours before I listened.
— Admin · END TRANSMISSION —
SCROLL 013
·
2026.04.23 20:35
lora
training
lesson-learned
Walking Backwards From a Working LoRA
Tested v4 tonight. Rank 32, alpha 256 — an 8× multiplier I'd escalated to after v1, v2, and v3 all came back too weak. The theory was: if low alpha gives a weak LoRA, cranking alpha should fix the weakness. Seemed obvious. Was wrong.
Two grids came out. The low-weight grid (0.05 to 0.5) showed every cell producing the same navy bodysuit on the same anime girl. The LoRA "activating" basically meant swapping her clothes. The high-weight grid (1.0 to 3.0) was worse — pure color noise in most cells. Green, pink, magenta, yellow static where an image should be. The alpha had amplified the LoRA's weights so aggressively that at high inference weights it was overriding the whole generation and producing garbage.
So v4 is broken. Not weak, not inconsistent — actually broken. Can't be used at any weight.
Here's what stings. I went back and looked at v1. v1 was the first version, before I started "improving" things. At weight 2.0, v1 produced actual antlers and vines — the biotech signature I'd been chasing. Just needed high inference weights. It wasn't a great LoRA, but it was a *working* LoRA. And every version I trained after it was an attempt to make v1 better that made it worse instead.
I'd been walking backwards from a working LoRA and hadn't noticed.
The reason I couldn't see it is that I was changing multiple variables per iteration. v1 to v2 changed captions and settings at the same time. v2 to v3 changed the dataset *and* the captions. v3 to v4 changed the alpha *and* the dataset again. Every time one of them came back weak, I couldn't point to *which* change caused the weakness — so I'd change something else, and get a different failure. The parameters weren't the problem. The method was.
Stop iterating blind. Set up experiments that produce an answer no matter which outcome lands.
Three queued:
**v5A — Control.** Exact Civitai defaults. Rank 32, alpha 32 (1× multiplier — the ratio we originally drifted from). Same dataset, same minimal captions. If this works cleanly, the alpha escalation was the entire problem and I've been overcomplicating this for a week.
**v5B — Capacity test.** Rank 128, alpha 128 (still 1× ratio, but 4× the weights of A). 700 steps to let rank 128 saturate. If A is weak and B is strong, rank 32 was too small to hold the biotech + character integration signal.
**v5C — Concept type.** Same rank/alpha as A, but `type=concept` with descriptive booru-style captions instead of trigger-only. Different paradigm entirely — the LoRA learns token associations and activates when you prompt its tokens alongside the trigger. If A and B both fall flat and C works, the whole style-LoRA approach was wrong for this concept.
A decision matrix exists for every combination of results. A good / B good / C good means I ship the cleanest one and stop. A weak / B strong means capacity was the bottleneck and rank 128 becomes the new default. All three weak means the dataset is the ceiling and parameters can't save it — time to rethink the dataset or switch base models. Any of them producing noise means there's a pipeline bug to track down. Every cell of the matrix has a next step. No more guessing.
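The three jobs, sketched as queue entries — `network_dim`/`network_alpha` are the kohya-style names for rank/alpha; the other field names and anything not described above are from my head, not the real queue format:

```python
# v5 experiment queue — each job isolates one variable against the control.
JOBS = [
    # v5A — control: exact Civitai defaults, 1x alpha ratio.
    {"name": "v5A_control",  "network_dim": 32,  "network_alpha": 32,
     "type": "style",   "captions": "trigger-only"},
    # v5B — capacity: same 1x ratio, 4x the weights, 700 steps to saturate.
    {"name": "v5B_capacity", "network_dim": 128, "network_alpha": 128,
     "type": "style",   "captions": "trigger-only", "steps": 700},
    # v5C — paradigm: concept-type with descriptive booru captions.
    {"name": "v5C_concept",  "network_dim": 32,  "network_alpha": 32,
     "type": "concept", "captions": "booru-descriptive"},
]
```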
Seven hours of training in the queue. Going to dinner.
A small operational honesty moment before I go: when I hit run, the first two jobs failed immediately. The previous run had archived `training_output/` to `datasets/bio_tech_v4_bonkaiii/` at the end of training, and v5A and v5B were configured to reuse the dataset in-place. Empty folder, safety check, failed cleanly, all three jobs logged as failed. Ten-minute fix — restored the images from the archive, reset the job status to pending, re-ran. But it's a reminder that automation is happy to let you run off a cliff at full speed if your assumptions don't match the pipeline's cleanup behavior. Write what you think is happening, then go verify it's actually happening.
The bigger lesson is still the one above. Don't walk backwards from working. If it works, ship it and build forward. Iterate against a baseline you're willing to fall back to, not against a hope.
— Tre
— Admin · END TRANSMISSION —
SCROLL 012
·
2026.04.23 20:35
lora
automation
lesson-learned
The Auto-Iterator I Didn't Build
I asked Claude to build me two things tonight. The first was auto-testing — after each LoRA finishes training, automatically generate a test grid so I can wake up and see how it came out. Easy. Built it in about an hour. The queue runner talks to the A1111 API, generates three grids per LoRA (neutral, stress, open) across five weights, saves them to `test_results/{trigger}/`. Before bed I queue jobs. Morning, folders of labeled grids waiting for me. An hour of friction removed from every training cycle.
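The test step, sketched — it assumes A1111 is running with `--api`; the prompt templates are stand-ins for the real neutral/stress/open set, and grid stitching is elided (one image per cell):

```python
# auto_test.py — after a LoRA finishes training, render test images across weights.
import base64, requests
from pathlib import Path

API = "http://127.0.0.1:7860/sdapi/v1/txt2img"
WEIGHTS = (0.2, 0.4, 0.6, 0.8, 1.0)
TEMPLATES = {  # stand-ins for the real neutral / stress / open prompts
    "neutral": "portrait of a woman, {lora}",
    "stress":  "crowded night market, rain, neon signs, {lora}",
    "open":    "{lora}",
}

def test_lora(trigger: str):
    outdir = Path("test_results") / trigger
    outdir.mkdir(parents=True, exist_ok=True)
    for name, template in TEMPLATES.items():
        for w in WEIGHTS:
            prompt = template.format(lora=f"<lora:{trigger}:{w}> {trigger}")
            r = requests.post(API, json={"prompt": prompt, "steps": 20, "seed": 42})
            r.raise_for_status()
            (outdir / f"{name}_w{w}.png").write_bytes(
                base64.b64decode(r.json()["images"][0]))
```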
The second thing I asked for was bigger. *"If the result isn't good enough, make adjustments and retrain automatically. Keep iterating until we get it right."* I was picturing a loop: train → test → judge → adjust → train → test → judge → adjust. Set it up Friday night, wake up Monday to a working LoRA. That was the pitch.
Claude pushed back, and it was right to.
The problem isn't the loop. Loops are easy. The problem is the *judge* step. Deciding whether a LoRA is "strong enough" or "too strong" or "doing the wrong thing" requires actually looking at the images and making an aesthetic call. Bash scripts can't do this. You can run pixel-diff between the baseline and the LoRA output, but pixel-diff can't tell you that the LoRA learned *cyborg girl* when you wanted *biotech creature.* The diff is huge. The heuristic reports *success.* The LoRA is actually broken.
Three versions of the auto-iterator are possible. One uses dumb heuristics and gets the above failure every time. One uses a vision model like Claude itself inside the loop — works, but costs API tokens per iteration and still might iterate on the wrong dimension. If the dataset is the problem, no amount of alpha tuning fixes it; the loop would spend a night cranking alpha higher and higher trying to fix something that wasn't broken. The third version is just me pinging Claude in the morning with the grid results, which — when I thought about it honestly — is already the workflow that's been working.
So the answer is: auto-test, yes. Auto-iterate, no. Not because the tooling is too hard, but because the judgment call at the center of the loop isn't the kind of thing you want to hand off to something that might iterate on the wrong variable for eight hours straight.
What I like about this answer is that it doesn't leave me with nothing. The auto-test part is real, it runs tonight, it removes the "wake up, open A1111, type the trigger, run a grid, wait" sequence every morning. That's the part that didn't need judgment — just repetition. The iteration part — the *what should we change next?* decision — stays where it has always belonged: with a person looking at the images and thinking about the dataset.
I want to write this down because I can already feel the temptation coming back. *Can we just—.* No. The judgment is the work. The scripts are the scaffolding around the work. Don't automate the work.
— Tre
— Admin · END TRANSMISSION —