Stable Diffusion in 2026 is not a single model anymore. It is an ecosystem. SDXL family checkpoints carry most production cover work. Flux.1 raised the ceiling on prompt adherence and complex compositions but at the cost of VRAM and speed. Pony Diffusion dominates stylized illustration. Civitai is full of niche checkpoints that, for a specific genre, beat anything cloud tools produce. Around the models sit ControlNet, LoRA training, inpainting, upscaling, and a half-dozen front ends with different tradeoffs.
This is the working playbook for KDP authors who have outgrown Midjourney. It assumes you already understand basic prompting, that you have a GPU with at least 12 GB of VRAM, and that you want the level of control no subscription tool gives you. If that is not you yet, start with the Midjourney book covers guide instead. Both tools have a place; the wrong one for your stage wastes weeks.
Who this guide is for
- You publish 10+ covers a year and your Midjourney bill is climbing past $360 annually with no end in sight.
- You run a series where character consistency or brand visual style requires training a custom LoRA.
- You have privacy or content sensitivity concerns that rule out cloud tools.
- You want control over composition, layout and typography integration that ControlNet provides.
- You have time to invest. Plan 30-50 hours to reach productivity. If you need a cover this week, hire a designer or use Midjourney.
Local vs cloud Stable Diffusion: the honest tradeoffs
Stable Diffusion runs in three places and the right one depends on your situation, not on which is "best".
| Environment | Hardware required | Cost | Best for |
|---|---|---|---|
| Local on your own GPU | NVIDIA 12-24 GB VRAM | $0 ongoing after hardware | High-volume publishers, custom LoRA training, privacy |
| RunPod / Vast.ai (rented GPU) | Web browser | $0.30-$0.80 per hour of GPU time | Heavy intermittent use, LoRA training without owning hardware |
| Replicate / Fal / Together API | Web browser | $0.005-$0.05 per image | Light use, programmatic integration |
| Civitai / Tensor.Art (browser SD) | Web browser | Freemium | Trying out checkpoints, beginner experimentation |
Production recommendation in 2026: own the GPU if you publish more than 15 covers a year, rent if you publish fewer but train the occasional LoRA, use API services if your workflow is programmatic, and use browser tools only for experimentation. A typical home rig for professional indie publishers costs $1,500-$2,500 (RTX 4070 Ti Super or 4080), and it pays back inside 12-18 months against ongoing Midjourney plus Photoshop subscriptions.
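The payback window is worth checking against your own numbers before you buy. Here is a minimal sketch of that arithmetic; the function name and the example figures are hypothetical placeholders, not quotes from any vendor.

```python
def payback_months(hardware_cost, monthly_spend_replaced, monthly_running_cost=0.0):
    """Return months until the rig has paid for itself, or None if it never does.

    All three inputs are your own estimates in the same currency.
    """
    monthly_saving = monthly_spend_replaced - monthly_running_cost
    if monthly_saving <= 0:
        return None  # running costs eat the whole saving
    return hardware_cost / monthly_saving


# Hypothetical publisher: a $2,000 rig that replaces $140/month of subscriptions
# and rented GPU time, and costs about $15/month in electricity.
months = payback_months(2000, 140, 15)
print(f"Payback in about {months:.0f} months")
```

If the function returns None, or a number well past the useful life of the card, renting or staying on a subscription is the cheaper path.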
Front ends: Forge, ComfyUI, A1111, SwarmUI
The "WebUI" you use is more important than people think. Different front ends have different feature sets, performance characteristics, and ceilings.
| Front end | Strength | Weakness | Best for |
|---|---|---|---|
| Forge | Faster than A1111 on the same hardware, familiar UI, supports SDXL and Flux | Fewer experimental extensions than A1111 | Most KDP authors, day-to-day production |
| ComfyUI | Node-based, infinitely flexible, fastest generation, every advanced workflow available | Steep learning curve, harder to share with non-technical collaborators | Power users, complex pipelines, automation |
| Automatic1111 (A1111) | Largest extension ecosystem, most tutorials, mature | Slower than Forge, less Flux support | Users with existing A1111 workflows |
| SwarmUI | UI front end over ComfyUI backend, native support for SDXL and Flux | Newer, smaller community | Power users who want UI not nodes |
| InvokeAI | Polished commercial UI, strong canvas / inpainting | Less extension ecosystem | Authors who prioritize UX |
| Fooocus | Simplest, opinionated, MJ-like UX | Less control, no Flux | Beginners stepping up from Midjourney |
Production recommendation: install Forge first. Use it for 30-50 covers. If you find yourself wanting batch automation, conditional branching, or multiple ControlNets in series, move to ComfyUI or SwarmUI. If you want a Midjourney-like experience locally, Fooocus is the gentlest start.
Checkpoints: best models for KDP book covers by genre
The "best Stable Diffusion model" is the wrong question. The right question is: which checkpoint for which genre. SDXL-family models dominate production work in 2026, with Flux.1 [dev] as the prompt-adherence specialist for the hardest compositions.
| Genre | Primary checkpoint | Backup checkpoint | Notes |
|---|---|---|---|
| Contemporary romance | RealVisXL v5 | Juggernaut XL v9 | RealVis handles skin tones better |
| Thriller / suspense | Juggernaut XL v9 | RealVisXL v5 | Stronger contrast and shadow |
| Historical romance | DreamShaper XL | Juggernaut XL | Painterly mode for oil-on-canvas feel |
| Epic fantasy | Juggernaut XL v9 | DreamShaper XL Lightning | Flux for hardest compositions |
| Cozy / urban fantasy | AnythingXL or Pony Diffusion v6 | DreamShaper XL | Illustration-friendly |
| Sci-fi / space opera | Juggernaut XL v9 | Flux.1 [dev] | Flux for complex hard-surface scenes |
| Horror / supernatural | RealVisXL v5 | Juggernaut XL | Photographic dread reads better |
| Children's picture books | Pony Diffusion v6 XL | AnythingXL | Pair with a stylistic LoRA |
| Cookbook / lifestyle | RealVisXL v5 | SDXL base 1.0 | Photographic food and table |
| Non-fiction / business | SDXL base 1.0 or RealVisXL | Flux.1 [dev] | Clean minimal aesthetics |
| Comics / manga / anime | Pony Diffusion v6 XL | AnythingXL | Heavy stylistic specialization |
| When prompts are difficult | Flux.1 [dev] | SD 3.5 Large | Best prompt adherence in the ecosystem |
Checkpoint licensing matters
- SDXL 1.0 base: CreativeML Open RAIL++-M. Commercial use allowed.
- Juggernaut XL, DreamShaper XL, RealVisXL: Generally commercial-friendly but check each model card on Civitai before shipping.
- Pony Diffusion v6 XL: Fair AI Public License 1.0 SD. Allows commercial use.
- Flux.1 [schnell]: Apache 2.0. Full commercial use.
- Flux.1 [dev]: Non-commercial only. Use Flux.1 [schnell] or Flux.1 [pro] (paid license) for commercial KDP work.
- SD 3.5 Large / Medium: Stability AI Community License. Free for individuals and businesses under $1M revenue.
Always check the model card on Civitai or Hugging Face before shipping a cover. Some community checkpoints carry "non-commercial" tags that authors miss.
The book cover prompt structure for Stable Diffusion
Stable Diffusion prompts behave differently from Midjourney. SDXL models reward verbose, descriptive prompts. Flux rewards natural language. Pony rewards tag-style structured prompts. The same six-part skeleton works across all three but the syntax shifts.
SDXL prompt skeleton
[Subject], [Style], [Composition], [Lighting], [Color palette], [Mood], [Quality tags]
Negative prompt (always)
text, letters, words, typography, watermark, signature, low quality, blurry, deformed, extra fingers, bad anatomy, oversaturated, plastic skin
Key SDXL parameters to set:
- Sampler: DPM++ 2M Karras for photographic, Euler A for painterly, DPM++ SDE Karras for high detail.
- Steps: 28-40 for SDXL family, 4-8 for Lightning variants, 20-30 for Flux Dev.
- CFG Scale: 5-7 for SDXL, 2-4 for Flux, 7-10 for older SD 1.5 checkpoints.
- Resolution: 1024x1024 native, 1024x1536 for portrait covers (exactly the 2:3 ratio of a 6x9 trim), 1024x1792 for a taller 4:7 frame that leaves room to crop.
- Refiner: SDXL Refiner on for final 20% of steps if your checkpoint supports it.
- Hires Fix: 1.5x or 2x with denoising 0.25-0.4 for sharper detail.
For tag-style prompts on Pony Diffusion specifically:
score_9, score_8_up, score_7_up, [subject tags], [style tags], [color tags], [composition tags], rating_safe
The score and rating tokens are required for Pony. Skipping them produces low-quality output. Pony also requires explicit safety tagging for KDP-compliant covers.
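Both skeletons are easy to assemble in code, so the required parts are never forgotten. Here is a minimal sketch: the function names are illustrative, and the negative prompt and Pony tokens are copied from the text above.

```python
# Negative prompt from the section above, reused on every generation.
SDXL_NEGATIVE = (
    "text, letters, words, typography, watermark, signature, low quality, "
    "blurry, deformed, extra fingers, bad anatomy, oversaturated, plastic skin"
)

# Score tokens that open every Pony prompt, per the skeleton above.
PONY_PREFIX = ["score_9", "score_8_up", "score_7_up"]


def build_sdxl_prompt(subject, style, composition, lighting, palette, mood, quality_tags=""):
    """Join the six parts (plus optional quality tags) in skeleton order."""
    parts = [subject, style, composition, lighting, palette, mood, quality_tags]
    return ", ".join(p.strip() for p in parts if p and p.strip())


def build_pony_prompt(subject_tags, style_tags=(), color_tags=(), composition_tags=()):
    """Wrap tag lists with the score prefix and the rating_safe suffix."""
    tags = (
        PONY_PREFIX
        + list(subject_tags)
        + list(style_tags)
        + list(color_tags)
        + list(composition_tags)
        + ["rating_safe"]
    )
    return ", ".join(tags)


print(build_sdxl_prompt(
    "woman in a red coat on a foggy pier",
    "cinematic photography",
    "subject in lower third, empty sky above",
    "overcast rim light",
    "desaturated teal and crimson",
    "tense, isolated",
    "sharp focus, high detail",
))
print(build_pony_prompt(["1girl", "witch hat", "black cat"], ["flat color", "storybook"]))
```

The example subjects and tags are made up; swap in your own and keep the ordering.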

ControlNet: the single biggest control upgrade
ControlNet is the feature that separates serious Stable Diffusion users from everyone else. It lets you control composition, pose, depth and layout before you control aesthetics. For book covers specifically, three ControlNet types do almost all the work.
Canny edge for layout preservation
Sketch the layout you want, including a rectangle where the title will sit. Run Canny edge preprocessing on the sketch. Feed it to ControlNet at a weight of 0.6-0.9. The model generates around your layout, preserving the title space. This is the workflow that ends "title fighting the artwork" forever.
OpenPose for character pose
Upload or generate a stick-figure pose, run OpenPose preprocessing, feed to ControlNet at weight 0.8-1.0. The character generates in exactly that pose. Useful for romance covers (specific embrace), action (running, drawing a weapon), and series continuity (same character pose across multiple covers).
Depth for 3D layering and title placement
Generate or paint a grayscale depth map (foreground white, background black). Feed to ControlNet at weight 0.5-0.7. The model respects your foreground-background separation, which means you can guarantee a clean background area in a specific zone for typography.
Other ControlNets worth knowing
- Scribble: Rough sketch to finished art. Good for prototyping cover ideas.
- SoftEdge / HED: Like Canny but softer, preserves more of the source aesthetics.
- Lineart: Clean line drawing to colored final, useful for illustration covers.
- IP-Adapter: Style reference from a single image, similar to Midjourney --sref.
- Tile: Used in upscaling pipelines to preserve detail without hallucination.
The professional Canny-edge cover workflow
- Sketch the cover layout in Photoshop, Procreate, or Krita. Mark the title placement as a black rectangle. Mark the author byline space. Mark the hero subject silhouette.
- Export the sketch as a black-and-white PNG.
- In Forge, switch to your chosen checkpoint, enable ControlNet, set Type to Canny, upload the sketch.
- Set ControlNet weight 0.7, Starting Step 0, Ending Step 0.8 (release control in the final 20% of steps so the model can polish the aesthetics).
- Generate. The output preserves your layout while filling in detail.
- Inpaint any problem areas.
- Upscale through the three-stage pipeline.
- Place the title text on the rectangle you reserved. It fits perfectly because you guaranteed the space.
This is the workflow professionals use and it is the workflow Midjourney cannot match. ControlNet is the single biggest reason to learn Stable Diffusion.
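The ControlNet settings from the steps above can also be expressed as a request body for scripted generation. This is a sketch under assumptions. The field names follow the A1111-style `/sdapi/v1/txt2img` endpoint and the ControlNet extension's `alwayson_scripts` block as commonly documented, so confirm them against the `/docs` page of your own install. The ControlNet model name is a hypothetical placeholder, and the step and CFG values are picks from the ranges given earlier. The code only builds the payload and sends nothing.

```python
import base64
import json


def build_canny_payload(prompt, negative_prompt, sketch_png_bytes, controlnet_model):
    """Build a txt2img request body with one Canny ControlNet unit."""
    sketch_b64 = base64.b64encode(sketch_png_bytes).decode("ascii")
    return {
        "prompt": prompt,
        "negative_prompt": negative_prompt,
        "width": 1024,
        "height": 1536,
        "steps": 30,
        "cfg_scale": 6,
        # Newer builds may split this into a sampler plus a separate scheduler.
        "sampler_name": "DPM++ 2M Karras",
        "n_iter": 4,  # batch count
        "alwayson_scripts": {
            "controlnet": {
                "args": [
                    {
                        "enabled": True,
                        "module": "canny",          # preprocessor
                        "model": controlnet_model,  # name as listed in your UI
                        "image": sketch_b64,
                        "weight": 0.7,
                        "guidance_start": 0.0,
                        "guidance_end": 0.8,        # release control for the last 20%
                    }
                ]
            }
        },
    }


# Placeholder bytes stand in for the exported layout sketch.
payload = build_canny_payload(
    "lighthouse on a cliff at dusk, painterly",
    "text, letters, words, watermark",
    b"placeholder-for-sketch-png",
    "hypothetical_canny_sdxl_model",
)
print(json.dumps(payload, indent=2)[:200])
```

In practice you would read the sketch with `open(path, "rb").read()` and POST the payload to your local server.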
LoRA training for author brand consistency
LoRA (Low-Rank Adaptation) is the mechanism by which you train Stable Diffusion to know your specific character, your specific style, or your specific aesthetic. The output is a small file (50-200 MB) that you apply like a filter to any generation. For series authors, LoRA training is the single highest-leverage technique available.
When to train a LoRA
- Character LoRA: You need the same character across 6-30 cover and interior illustrations. Train on 15-30 images of that character.
- Style LoRA: You want a recognizable house style across an entire publishing imprint or 20+ book series. Train on 30-60 images of the target style.
- Subject LoRA: You want consistent rendering of a specific object, location, or aesthetic. Train on 15-30 images of the subject.
- Concept LoRA: You want to invoke a specific mood or composition pattern. Train on 30-60 examples.
LoRA training quick reference (using Kohya_ss or AI Toolkit)
- Collect training images. 15-30 for character, 30-60 for style. High resolution (at least 1024x1024 each). Diverse angles, lighting and contexts for character LoRAs to avoid overfitting.
- Crop and tag. Crop to 1024x1024 (SDXL) or 512x512 (SD 1.5). Tag each image with a unique trigger word (e.g., "ldra_jane") plus descriptive tags. Use a tagger like WD-1.4 to auto-generate tags, then edit.
- Configure training. Base model: matching your target use (SDXL 1.0 base, or specific checkpoint). Learning rate: 1e-4 for character, 5e-5 for style. Network rank: 32-64 for character, 64-128 for style. Steps: 1000-2000 (around 30-50 steps per image).
- Train. RTX 4070 Ti Super or higher trains SDXL LoRA in 30-90 minutes. Rent a RunPod A100 for 1-2 hours if you do not own the hardware.
- Test. Generate 20 test images using the trigger word at weights 0.5, 0.7 and 0.9. Look for: trigger word actually invoking the trained concept, no overfitting (outputs are not copies of training data), correct rendering at multiple weights.
- Iterate. If the LoRA is too weak, increase epochs or rank. If it overfits (outputs look identical to training images), reduce epochs or add regularization images.
The first LoRA you train will be bad. Plan for 2-3 attempts. By the third try the methodology clicks. A trained LoRA is reusable forever and the time investment is the moat that makes a Stable Diffusion workflow pay back.
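The step targets above are easier to hit once the arithmetic is written down. Here is a minimal sketch, assuming the common convention in Kohya-style trainers that one epoch processes every image `repeats` times in batches. The function names are illustrative, not part of any trainer's API.

```python
import math


def lora_total_steps(num_images, repeats, epochs, batch_size=1):
    """Total optimizer steps for a run: steps per epoch times epochs."""
    steps_per_epoch = math.ceil(num_images * repeats / batch_size)
    return steps_per_epoch * epochs


def epochs_for_target(num_images, repeats, target_steps, batch_size=1):
    """Smallest number of epochs that reaches at least target_steps."""
    steps_per_epoch = math.ceil(num_images * repeats / batch_size)
    return math.ceil(target_steps / steps_per_epoch)


# Hypothetical character LoRA: 25 images, 5 repeats, 8 epochs, batch size 1.
total = lora_total_steps(25, 5, 8)
print(total, "steps, or", total / 25, "steps per image")
print("Epochs needed for 1500 steps:", epochs_for_target(25, 5, 1500))
```

Raising the batch size cuts the step count for the same number of epochs, so compare runs by steps per image, not by total steps alone.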
Pair your Stable Diffusion output with print-ready KDP templates
KDPEasy handles the final 30% of the workflow (typography, KDP templates, spine width, CMYK-aware exports), so your local SD rig stays focused on imagery.
Inpainting: fix problems without re-rolling
Inpainting is how professionals handle the inevitable Stable Diffusion problems: extra fingers, weird eyes, garbled jewelry, soft anatomy in one specific area. Instead of regenerating the entire image, you mask the problem area and regenerate only that region.
The basic inpainting workflow
- Send the generated image to Inpaint. In Forge or A1111, click "Send to Inpaint" from the generation view.
- Mask the problem area. Paint over the hand, eye, mouth, or anomalous region. Soft brush, slightly larger than the problem.
- Write a focused prompt. Just describe what should be there. "A clean human left hand with four fingers and a thumb, photographic detail, natural skin." Do not repeat the entire original prompt.
- Set inpainting parameters. Mask mode "Inpaint masked", masked content "Original", inpaint area "Only masked", denoising strength 0.5-0.7 (lower preserves original, higher rebuilds).
- Generate 4-6 variations. Pick the best, send back to Inpaint if there is still a smaller problem.
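For scripted fixes, the same inpainting settings map onto an img2img request body. This is a sketch under assumptions: the field names follow the A1111-style `/sdapi/v1/img2img` endpoint as commonly documented, so check the `/docs` page of your install before relying on them. The placeholder bytes stand in for real PNG files, and nothing is sent over the network.

```python
import base64


def build_inpaint_payload(image_bytes, mask_bytes, prompt, denoising_strength=0.6, variations=4):
    """Build an img2img request body that regenerates only the masked region."""
    if not 0.0 <= denoising_strength <= 1.0:
        raise ValueError("denoising_strength must be between 0 and 1")
    return {
        "init_images": [base64.b64encode(image_bytes).decode("ascii")],
        "mask": base64.b64encode(mask_bytes).decode("ascii"),
        "prompt": prompt,                # focused prompt for the masked area only
        "denoising_strength": denoising_strength,
        "inpainting_mask_invert": 0,     # 0 = "Inpaint masked"
        "inpainting_fill": 1,            # 1 = masked content "Original"
        "inpaint_full_res": True,        # True = inpaint area "Only masked"
        "inpaint_full_res_padding": 32,  # context pixels around the mask (assumed value)
        "mask_blur": 8,                  # soft mask edge (assumed value)
        "n_iter": variations,
    }


fix = build_inpaint_payload(
    b"placeholder-image-png",
    b"placeholder-mask-png",
    "a clean human left hand with four fingers and a thumb, photographic detail, natural skin",
)
print(sorted(fix.keys()))
```

Lower `denoising_strength` toward 0.5 to stay closer to the original pixels, and raise it toward 0.7 when the area needs rebuilding.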
Specific inpainting tools that matter
- ADetailer extension: Automatically detects faces and inpaints them at higher resolution. Solves 80% of weird-eye and soft-face problems in one click.
- FaceDetailer node (ComfyUI): The ComfyUI equivalent of ADetailer, with more control.
- Inpaint Anything: Segment-anything-based mask creation. Useful for precise selections.
- Outpainting: Like inpainting but extends the image beyond its original borders. Used in the full-wrap workflow to extend a front cover into a back cover.
The three-stage upscaling pipeline for KDP print
Stable Diffusion native generation is too small for print. A 1024x1536 SDXL output is roughly 170 DPI on a 6x9 paperback. KDP recommends 300 DPI. You need a three-stage upscaling pipeline.
- Stage 1: Native generation at 1024x1536 (SDXL; exactly the 2:3 ratio of a 6x9 trim) or 1024x1792 (a taller 4:7 frame that leaves room to crop). This is the base image you will refine.
- Stage 2: Latent SD upscale (Hires Fix or SD Upscale). 1.5x or 2x using the same checkpoint with denoising 0.25-0.4. This adds detail rather than just enlarging. At 2x it produces 2048x3072 or 2048x3584; at 1.5x, 1536x2304 or 1536x2688.
- Stage 3: External upscaler. Run the result through Real-ESRGAN x4plus, Topaz Gigapixel AI, or 4x-UltraSharp. This is the final enlargement to 4000-6000 pixels at 300+ DPI on a 6x9 paperback with crop headroom. A 4x model overshoots that target from a 2048-wide input, so scale the result down to the size you need.
For most covers, Real-ESRGAN x4plus (free, runs inside Forge) is adequate. For maximum sharpness on premium covers, Topaz Gigapixel AI ($99 one-time) is currently the gold standard. Plan 30-60 seconds per cover for the external upscale pass.
For the full DPI math and how to spot a cover that will print soft, see the fix blurry KDP covers guide. The basic rule: your final cover file should have at least 300 DPI at the trim size you are printing, with bleed included.
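The DPI rule reduces to dividing pixels by inches on each axis and taking the worse of the two. Here is a minimal sketch with illustrative function names. Pass bleed-inclusive dimensions if you want the check to cover bleed.

```python
import math


def effective_dpi(width_px, height_px, width_in, height_in):
    """DPI when the image fills the given print size, limited by the worse axis."""
    return min(width_px / width_in, height_px / height_in)


def required_pixels(width_in, height_in, dpi=300):
    """Minimum pixel dimensions for a print size at the target DPI."""
    return math.ceil(width_in * dpi), math.ceil(height_in * dpi)


stages = {
    "native SDXL": (1024, 1536),
    "after 2x Hires Fix": (2048, 3072),
}
for label, (w, h) in stages.items():
    print(f"{label}: {effective_dpi(w, h, 6, 9):.0f} DPI on a 6x9 trim")

print("Minimum pixels for 300 DPI at 6x9:", required_pixels(6, 9))
```

Run it against your final file before upload. Anything under 300 on either axis is a candidate for another upscale pass.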
CMYK conversion: the same gap Midjourney has
Stable Diffusion, like Midjourney, outputs sRGB. KDP print runs on CMYK. For maximum color fidelity on paperback covers, do the conversion yourself.
- Open the final upscaled image in Photoshop or Affinity Photo.
- View → Proof Setup → US Web Coated (SWOP) v2. Toggle Proof Colors to see the shift.
- Use View → Gamut Warning to identify out-of-gamut areas.
- Adjust Hue/Saturation or Selective Color on the out-of-gamut zones. Pull saturation down rather than lightness.
- Edit → Convert to Profile → US Web Coated (SWOP) v2. Relative Colorimetric with Black Point Compensation.
- Save the layered PSD for future edits, export the print PDF in CMYK.
For ebook covers, skip the CMYK pass. The sRGB file is correct as-is.
Commercial licensing: what you actually own
Stable Diffusion licensing is more nuanced than Midjourney because there are three license layers: the base model, any community checkpoint or LoRA you use, and the output itself.
Base model licenses (2026 state)
- SDXL 1.0 base: CreativeML Open RAIL++-M. Commercial use allowed. No royalties owed.
- SD 3.5 Large, Medium, Turbo: Stability AI Community License. Free for individuals and businesses with less than $1M annual revenue.
- Flux.1 [schnell]: Apache 2.0. Full unrestricted commercial use.
- Flux.1 [dev]: FLUX.1 [dev] Non-Commercial License. Cannot be used for commercial work.
- Flux.1 [pro]: Commercial license available via Black Forest Labs (paid).
Community checkpoints
Each Civitai or Hugging Face checkpoint carries its own license. Most are explicitly commercial-friendly (Juggernaut, DreamShaper, RealVis are all commercial-OK in their current versions), but always check the model card. Some checkpoints based on leaked or non-open data have ambiguous licensing. When in doubt, choose a checkpoint with explicit commercial language on the model card.
Output ownership
The U.S. Copyright Office has held that purely AI-generated images cannot themselves be copyrighted. You can copyright the combined cover (your typography, your layout decisions, the human-authored arrangement). For KDP's purposes, this is sufficient: you own the cover as a derivative work.
Practical compliance for KDP commercial work
- Use SDXL base, Flux.1 [schnell], or community checkpoints with explicit commercial licenses.
- Avoid Flux.1 [dev] for commercial KDP unless you have purchased the commercial license.
- Avoid generating named copyrighted characters, named living people, or directly imitating a living artist's signature style.
- If you train a LoRA on your own art or licensed training data, the resulting LoRA is yours.
- If you train a LoRA on copyrighted material you do not have rights to, the output carries that infringement risk.
Full Stable Diffusion KDP cover workflow, start to finish
- Genre research. Study top 20 covers in your Amazon category. Note palette, framing, lighting. Pair with the perfect KDP cover guide for conventions.
- Layout sketch. Hand-sketch the cover with title placement marked. Export as a black-and-white PNG.
- Choose checkpoint. Match the genre table above. SDXL family for production, Flux when prompts are difficult.
- Choose LoRAs. Load your character LoRA if applicable, plus any style LoRAs. Stack with care; total LoRA weight should not exceed 1.5-2.0.
- Set ControlNet. Upload the layout sketch, enable Canny ControlNet at weight 0.7. Optionally add Depth ControlNet for foreground-background.
- Write the prompt. Six-part structure: subject, style, composition, lighting, palette, mood, plus quality tags.
- Write the negative prompt. text, letters, words, typography, watermark, signature, low quality, blurry, deformed, extra fingers, bad anatomy.
- Configure generation. Resolution 1024x1536, sampler DPM++ 2M Karras, 30-40 steps, CFG 5-7, batch count 4-8.
- Generate. 8-20 candidates in the first pass.
- Inpaint. Fix hands, eyes, weird details on the best candidates. Use ADetailer for face refinement.
- Hires Fix / latent upscale. 1.5x with denoising 0.3 using the same checkpoint.
- External upscale. Real-ESRGAN x4plus or Topaz Gigapixel to final dimensions.
- CMYK pass. Soft proof, gamut correct, convert profile in Photoshop or Affinity Photo.
- Layout assembly. Place onto KDP cover template for your trim size and page count. Use the spine width calculator to confirm spine dimensions.
- Typography. Title, author, spine text, back cover description, barcode. Real fonts, real layout tool.
- Thumbnail test. View the cover at 100px wide. If the title and subject do not read, redesign.
- Export. PDF/X-1a at 300 DPI in CMYK.
- Upload. See the KDP cover upload guide for the cover review screen and common rejections.
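The layout assembly step depends on the full-wrap canvas size, which follows from trim size, page count, and paper type. This sketch assumes KDP's commonly published paperback figures: 0.002252 in per page for white paper, 0.0025 in per page for cream, and 0.125 in bleed on each outer edge. Treat those constants as assumptions and confirm them against KDP's own cover calculator before you build a template. The function name is illustrative.

```python
import math

# Assumed per-page thickness in inches; verify against KDP's current guidance.
PAPER_THICKNESS_IN = {"white": 0.002252, "cream": 0.0025}
BLEED_IN = 0.125  # assumed bleed on each outer edge


def full_wrap_size(trim_w_in, trim_h_in, page_count, paper="white", dpi=300):
    """Return spine width and full-wrap canvas size in inches and pixels."""
    spine_in = page_count * PAPER_THICKNESS_IN[paper]
    width_in = 2 * trim_w_in + spine_in + 2 * BLEED_IN  # back + spine + front + bleed
    height_in = trim_h_in + 2 * BLEED_IN
    return {
        "spine_in": round(spine_in, 4),
        "width_in": round(width_in, 4),
        "height_in": round(height_in, 4),
        "width_px": math.ceil(width_in * dpi),
        "height_px": math.ceil(height_in * dpi),
    }


# Hypothetical book: 6x9 trim, 300 pages, white paper.
wrap = full_wrap_size(6, 9, 300)
print(wrap)
```

The pixel dimensions tell you how far the outpainted artwork has to extend before typography goes on.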
Automation for high-volume publishers
Once you have a working pipeline, automate it. Stable Diffusion exposes APIs (A1111 API, ComfyUI API, Forge API) that let you script entire workflows.
- Batch cover generation: Define a prompt template with placeholders, iterate through a CSV of book titles, generate 4 variants per book overnight.
- Series consistency pipelines: A ComfyUI workflow with locked Canny ControlNet, locked LoRA stack and locked palette ensures every cover in a series is visually coherent.
- Coloring book interiors: A second ComfyUI workflow specifically for interior page generation at 600 DPI grayscale.
- A/B variations: Generate 6-8 cover variations programmatically, run them through a thumbnail-readability filter, ship the top 2 to KDP for split testing.
Most publishers shipping 20+ covers a year reach this stage within 6-12 months of starting Stable Diffusion. The leverage is real but the activation cost is also real. Do not over-engineer before you have the working baseline.
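The batch cover idea starts with turning a CSV into a job list. Here is a minimal sketch using only the standard library. The column names, the template wording, and the seed scheme are all assumptions for illustration, and each job would then be handed to whichever front-end API you use.

```python
import csv
import io
from string import Template

# Hypothetical template; the placeholders must match the CSV column names.
PROMPT_TEMPLATE = Template(
    "$subject, $style, clear space in the upper third for the title, "
    "$lighting, $palette palette, $mood mood"
)

SAMPLE_CSV = (
    "title,subject,style,lighting,palette,mood\n"
    "The Salt Road,lone rider on a coastal cliff,painterly epic fantasy,"
    "low golden sun,teal and amber,foreboding\n"
    "Glass Harbor,woman at a rain-streaked window,photographic thriller,"
    "cold neon rim light,blue and crimson,tense\n"
)


def build_jobs(csv_text, variants=4, base_seed=1000):
    """Expand each CSV row into `variants` jobs with distinct, repeatable seeds."""
    jobs = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_index, row in enumerate(reader):
        prompt = PROMPT_TEMPLATE.substitute(row)  # raises KeyError on a missing column
        for variant in range(variants):
            jobs.append({
                "title": row["title"],
                "prompt": prompt,
                "seed": base_seed + row_index * variants + variant,
            })
    return jobs


jobs = build_jobs(SAMPLE_CSV)
for job in jobs[:2]:
    print(job["title"], job["seed"], job["prompt"][:60])
```

Fixed seeds make an overnight run reproducible, so a variant you like can be regenerated later at higher resolution.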
Where Stable Diffusion fits versus Midjourney, Leonardo, and Flux APIs
Stable Diffusion is the most powerful tool in the 2026 AI cover stack and also the slowest to learn. The honest comparison:
- Midjourney v6.1 / v7: Best out-of-the-box quality, easiest workflow, $30/month. The right tool for 1-15 covers per year. See the Midjourney book covers guide.
- Leonardo AI (with Leonardo Kino XL and Flux): Strong free tier, fast turnaround, browser-based. Right tool for high-volume coloring book interiors and casual cover work. The Kino XL model in particular handles cinematic photography prompts well.
- Flux.1 [pro] via Fal or Replicate API: Best prompt adherence in the ecosystem, no local hardware required, pay-per-generation. Right tool for publishers who want Flux quality without the GPU.
- Stable Diffusion local: Maximum control, zero ongoing cost, custom training. Right tool for high-volume publishers, series with custom LoRAs, and privacy-sensitive work.
For a full side-by-side, see the AI image generation for KDP guide. The right answer for almost everyone is "Midjourney plus Photoshop until 15 covers a year, then evaluate adding Stable Diffusion".
Common mistakes that waste hours
- Skipping the negative prompt. "text, letters, words, watermark, low quality, deformed, extra fingers" goes in every prompt. Always.
- Generating at 512x512 with SDXL. SDXL is trained for 1024x1024 minimum. Smaller produces visible quality degradation.
- Using SD 1.5 LoRAs with SDXL. Incompatible. Version-match always.
- CFG too high. CFG 12-15 on SDXL produces oversaturated, plasticky output. Stay at 5-7.
- Skipping inpainting. 90% of "Stable Diffusion looks bad" complaints are about details that inpainting fixes in 60 seconds.
- Skipping the three-stage upscale. Native 1024 output is not enough for print. Always upscale.
- Trusting AI text. Same rule as Midjourney. Type the title in Photoshop, Affinity, or KDPEasy.
- Over-stacking LoRAs. More than 3-4 active LoRAs typically degrades the output. Pick the two or three that matter and dial the rest to zero.
- Ignoring checkpoint licensing. Flux.1 [dev] is non-commercial. Some Civitai checkpoints have caveats. Read the model card.
- Trying to generate the full wrap in one image. Generate front, outpaint, assemble in Photoshop.
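Several of these mistakes can be caught before you press Generate. Here is a minimal sketch of a pre-flight check for SDXL settings. The settings dictionary and its keys are hypothetical and tied to no front end's format, and the thresholds come from the list above and the workflow section.

```python
def preflight(settings):
    """Return a list of warnings for an SDXL generation settings dict."""
    warnings = []
    if not settings.get("negative_prompt", "").strip():
        warnings.append("negative prompt is empty")
    if min(settings["width"], settings["height"]) < 1024:
        warnings.append("shorter side is below 1024 px for an SDXL checkpoint")
    if settings["cfg_scale"] > 7:
        warnings.append("CFG above 7 tends to oversaturate SDXL output")
    # Only LoRAs with a weight above zero count as active.
    active = {name: w for name, w in settings.get("loras", {}).items() if w > 0}
    if len(active) > 3:
        warnings.append("more than three active LoRAs")
    if sum(active.values()) > 2.0:
        warnings.append("combined LoRA weight above 2.0")
    if not settings.get("commercial_license_checked", False):
        warnings.append("model card license not confirmed for commercial use")
    return warnings


# Hypothetical settings with illustrative LoRA names.
good = {
    "negative_prompt": "text, letters, words, watermark, low quality",
    "width": 1024,
    "height": 1536,
    "cfg_scale": 6,
    "loras": {"ldra_jane": 0.8, "house_style": 0.6},
    "commercial_license_checked": True,
}
bad = {
    "negative_prompt": "",
    "width": 512,
    "height": 512,
    "cfg_scale": 13,
    "loras": {"a": 0.8, "b": 0.8, "c": 0.6, "d": 0.5},
}
print(preflight(good))
for warning in preflight(bad):
    print("WARNING:", warning)
```

The check is deliberately conservative. Loosen the thresholds for Flux or SD 1.5 checkpoints, which use different CFG ranges.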
Final read
Stable Diffusion in 2026 is professional-grade infrastructure for serious KDP publishers. The activation cost is real: 30-50 hours to reach productivity, $1,500-$2,500 in hardware if you build local, 4-6 hours per LoRA you train. The payoff is also real: unlimited generation, custom character and style LoRAs, ControlNet composition control, zero ongoing software cost, and complete privacy.
The right rule of thumb is the same as it was in 2023, updated for current tooling: stay on Midjourney until you publish 15-20 covers a year, then add Stable Diffusion to the stack for the workflows Midjourney cannot do. ControlNet for layout precision. LoRA training for character and brand consistency. Inpainting for fixes Midjourney cannot do at all. Stable Diffusion does not replace Midjourney for most authors. It joins the stack when you outgrow what subscription tools can offer.
Frequently asked questions
For most authors publishing one to five books a year, no. Pay $30 a month for Midjourney and skip the setup. For authors publishing 10-plus covers a year, building a signature visual style across a series, or needing 100% local privacy, yes. The break-even moment is when you spend more on Midjourney annually than the time-amortized cost of running a local rig, or when you need a custom LoRA trained on a style that does not exist in any cloud tool. That is roughly 15-20 covers per year.
For photographic realism on contemporary romance, thriller, and lifestyle covers, RealVisXL v4 or v5 is the most reliable choice. For epic fantasy and sci-fi with painterly drama, Juggernaut XL or DreamShaper XL outperform. For stylized illustration and anime-adjacent covers, Pony Diffusion v6 XL plus a stylistic LoRA. Flux.1 (specifically Flux.1 [dev] running locally) is currently the strongest general-purpose model for prompt adherence and complex compositions, but it is slower and more VRAM-hungry. Default recommendation: SDXL-family checkpoints (RealVis, Juggernaut, DreamShaper) for production work, Flux when you need difficult prompts to land.
For beginners and most KDP authors, Forge (a faster A1111 fork) gives the best balance of speed, familiar UI, and SDXL or Flux support. For power users running complex pipelines (multiple ControlNets, batch automation, custom node graphs), ComfyUI is the only real choice. For users who want a friendlier ComfyUI with a UI front end on top of a node backend, SwarmUI is currently the strongest hybrid. Automatic1111 itself is still maintained but Forge is faster on the same hardware. Start with Forge, graduate to ComfyUI only when you hit its ceiling.
For SDXL at production quality (1024x1024 to 1536x2304 generation), 12 GB VRAM is the practical minimum and 16-24 GB is comfortable. An NVIDIA RTX 4070 (12 GB), 4070 Ti Super (16 GB), 4080 (16 GB) or 4090 (24 GB) are all viable. For Flux.1 [dev] at native quality, 24 GB is comfortable and 12-16 GB requires quantized GGUF variants. AMD GPUs work via ROCm or DirectML but performance lags. Apple Silicon (M2 Ultra or M3 Max with 64+ GB unified memory) runs SDXL acceptably but Flux is slower. If you do not have 12+ GB VRAM, rent a cloud GPU instead of buying.
Train a LoRA when you need a specific style or character across 10+ images and the variations matter. For a single one-off cover, do not train a LoRA. For a 6-book series with a single recurring character, train a character LoRA (15-30 reference images, 2-4 hours of training time). For a publisher brand visual style across 20+ books, train a style LoRA. A trained LoRA is reusable forever and gives you a level of consistency Midjourney --sref cannot match. The investment is 4-6 hours per LoRA but it pays back across every subsequent generation.
ControlNet lets you control composition before you control aesthetics. Three uses dominate for cover work. First, OpenPose: upload a stick-figure pose, get a character in that exact pose, useful for romance covers and dynamic action shots. Second, Depth: upload a depth map, control 3D layering and where your title text can sit. Third, Canny edge: sketch your cover layout with the title placement marked as a black box, ControlNet preserves your layout while generating around it. The Canny workflow is the most underused and the most useful for ensuring your title has clean space.
Inpainting. In Forge or A1111, send the generation to the Inpaint tab, mask the problem area (the hand, the eye), write a focused prompt for just that region ("a clean human hand with four fingers and a thumb, photographic detail"), and inpaint at 0.5-0.7 denoising strength. For face fixes specifically, the ADetailer extension automates this in Forge and A1111, and the FaceDetailer node does the same in ComfyUI. Most professional Stable Diffusion covers are inpainted at least twice before final upscale. This is not optional, it is the workflow.
Three-stage pipeline. Stage 1: generate at 1024x1024 or 1024x1536 (SDXL native). Stage 2: latent SD upscale to 2x using the same model and a low denoising strength (0.25-0.4) for detail preservation. This is the "SD Upscale" or "Hires Fix" feature in Forge and A1111. Stage 3: run the result through Real-ESRGAN x4 or Topaz Gigapixel for the final upscale to 4000-6000 pixels at 300+ DPI. For a 6x9 paperback you want at least 1800x2700 in your final file but 3600x5400 gives crop headroom. Skipping stages 2 and 3 produces soft printed covers.
Yes, with checkpoint-specific caveats. The base models are licensed for commercial use: SDXL 1.0 under CreativeML Open RAIL++-M, SD 3.5 under the Stability AI Community License (free for individuals and businesses under $1M annual revenue), and Flux.1 [schnell] under Apache 2.0. Flux.1 [dev] is licensed for non-commercial use only; commercial work requires Flux.1 [pro] or a paid commercial license. Many community checkpoints on Civitai (RealVis, Juggernaut, etc.) carry their own license terms. Always check the model card. The output license is generally yours, but the model license dictates whether you can use it commercially in the first place. For paid KDP covers, default to checkpoints with explicit commercial licenses: SDXL base, Juggernaut XL, DreamShaper XL, RealVis XL, Flux.1 [schnell].
Technically yes with the right ControlNet setup, but it is rarely worth the effort. Generate the front cover at 1024x1536 (SDXL; exactly the 2:3 ratio of a 6x9 trim) or 1024x1792 (a taller frame that leaves room to crop), then use outpainting (img2img with the image padded into a wider canvas) to extend leftward across the spine and back. Assemble the final wrap in Photoshop, Affinity Publisher, or KDPEasy against the official KDP template for your trim size and page count. Always do the typography and barcode placement in a real layout tool.
Train a character LoRA. Collect 15-30 high-quality reference images of the character (or generate them with --cref in Midjourney or with --sref locked across a session), tag each with descriptive captions, train in Kohya_ss or AI Toolkit at a low learning rate (1e-4 to 5e-5) for 1000-2000 steps. The resulting LoRA is roughly 50-200 MB and you load it on every generation for that character with the trigger word and a weight of 0.6-0.9. This is the single highest-leverage technique in series cover work and it is the main reason serious indie publishers run Stable Diffusion locally.
Setup: 4-8 hours for Forge or ComfyUI install, model downloads, ControlNet setup. First 5 covers: 4-6 hours each as you learn the parameters. After 15 covers: 60-120 minutes per cover. To reach productivity comparable to a Midjourney workflow, plan on 30-50 hours of practice. The breakeven point versus a Midjourney subscription is roughly 15-20 covers per year if you value your time at $25-$50 per hour. Below that, stay on Midjourney.

Written by Danielle Okonkwo
Marketing & Growth Lead at KDPEasy
Danielle is a published author with 12+ titles on Amazon KDP and a former book blogger. She writes KDPEasy's guides drawing from hands-on publishing experience and years of testing what actually works in the KDP marketplace.