How Stable Diffusion works

Stable Diffusion is a txt2img and img2img AI, based on the principle of noise diffusion.

Textual Encoding (Tokenizing)

When you input a prompt, a part of SD called the text encoder first tokenizes it, meaning it cuts the words into small chunks (tokens, often only a few characters long) which the actual model can understand. The popular version 1.5 of Stable Diffusion can take 75 such tokens as input; however, modern UIs like Automatic1111 can chain multiple 75-token chunks together, drastically increasing the number of tokens that can be used.
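
If you are curious, you can peek at this tokenization yourself. Below is a minimal Python sketch using the transformers library and the CLIP tokenizer SD 1.5 is built on (the model name and prompt are just assumed examples; this is not what Automatic1111 runs verbatim):

```python
# Sketch: how a prompt becomes tokens (SD 1.5 uses the CLIP ViT-L/14 tokenizer).
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

prompt = "masterpiece, best quality, 1girl, long hair, white dress"
pieces = tokenizer.tokenize(prompt)   # the small word chunks
ids = tokenizer(prompt).input_ids     # the numeric IDs the text encoder receives

print(pieces)
print(len(ids))                       # prompt tokens plus a start and an end token
```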

Generating Latent noise

After the textual encoding, SD starts by creating a noisy image based on a random seed (somewhat like Minecraft world generation). This noise is created as a latent image, which is basically a compressed internal representation that SD works with (oversimplified).
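
Conceptually, this step is just seeded random numbers in latent space. A minimal sketch (shapes assumed for SD 1.5, where a 512x512 image corresponds to a 4-channel 64x64 latent):

```python
# Sketch: the "random noise from a seed" step with SD 1.5 latent shapes
# (4 channels, width/8 x height/8, so 64x64 for a 512x512 image).
import torch

seed = 774
generator = torch.Generator("cpu").manual_seed(seed)
latent = torch.randn((1, 4, 64, 64), generator=generator)

print(latent.shape)  # torch.Size([1, 4, 64, 64])
```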

Denoising

Now the actual diffusion model starts to work. Guided by the tokens of the prompt, it denoises the noise a little towards whatever it "recognizes" in it as your prompt. This is called a step. The output is then fed back into the diffusion model, which repeats the process (a second step), and so on. To get decent results, you usually need at least 15 steps, but rarely more than 25.
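
Outside of Automatic1111, the same "repeat the denoising N times" idea is easy to see in the diffusers library. A hedged sketch (the model repo id and prompts are just examples; this is not A1111's internal code):

```python
# Sketch: generating an image with the diffusers library.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "1girl, hatsune miku",
    negative_prompt="lowres, bad anatomy",
    num_inference_steps=20,  # the number of denoising steps
    guidance_scale=7.0,      # the CFG Scale, explained further down
).images[0]
image.save("miku.png")
```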

Sampling Method

The denoising also uses a sampling method (sampler), which is basically a small algorithm that controls how each denoising step is taken. Different samplers can produce noticeably different results.
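
In the diffusers library, samplers are called schedulers, and swapping one is a one-liner. A sketch reusing the pipe from the snippet above (the chosen schedulers are just examples):

```python
# Sketch: swapping the sampler ("scheduler" in diffusers) on the existing pipe.
from diffusers import DPMSolverMultistepScheduler, EulerAncestralDiscreteScheduler

# roughly "Euler a" in the Automatic1111 dropdown
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
# or roughly "DPM++ 2M":
# pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
```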

Decoding

Up until now, the image has lived in latent space, so it still needs to be converted into a standard .png image. This is the decoder's job, and it uses a VAE (Variational Auto-Encoder) to do it. Many models come with a VAE baked in, but others don't. The VAE also influences the details and colors of the resulting image. If your images come out desaturated, SD is most likely not using a VAE.
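
If a model has no VAE baked in, you can load one separately. A diffusers sketch (the VAE and model repo names are just commonly used examples):

```python
# Sketch: attaching a separately downloaded VAE to the pipeline.
import torch
from diffusers import AutoencoderKL, StableDiffusionPipeline

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16)
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", vae=vae, torch_dtype=torch.float16
).to("cuda")
```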

Model, Lora, Embedding, Hypernetwork, Lycoris

These are all trained models which influence the result of your generations.

Model/Checkpoint

The most important one! The model includes the text encoder (the prompts the AI can "understand") and defines the art style of the images. It is usually saved as a .safetensors or sometimes a .ckpt file. There are two axes along which we can categorize each model:

SD-Version

Categorizes which version of Stable Diffusion the model is based on. Most common is SD1.5, which is trained on 512x512 images and therefore excels at that resolution. SD2.0 is trained on higher resolutions, but is unpopular due to the censoring built into the models (e.g. it won't create 18+ content). At the time of writing, SDXL 1.0 is pretty new. It's trained on 1024x1024 images, needs less prompting and generates better hands, but needs more VRAM by default and does not (yet) have decently trained anime models.

Anime vs. Realism

The original SD model trained by StabilityAI is a realistic model. This not only means that it creates realistic-looking images, but also that it needs prompts written in natural language! A prompt for a realistic model could look like this:

complex 3d render ultra detailed of a beautiful porcelain profile woman android face, cyborg, robotic parts, 150 mm, beautiful studio soft light, rim light, vibrant details, luxurious cyberpunk, H. R. Giger style, an extremely delicate and beautiful

On the other hand, we have anime models. They not only differ in art style, but in usage too! When prompting, it is recommended to use tags found on sites like danbooru (warning, NSFW content). A typical prompt would look like this:

masterpiece, best quality, 1girl, long hair, white hair, blurry foreground, white dress, full body, outdoors

LoRA

LoRA (Low-Rank Adaptation) are small trained neural networks which you can "add into" your model to help Stable Diffusion understand new concepts. You can loosely categorize them into four different types (at least, that's how I would do it):

  • Objects/Characters/Clothing, e.g. DanielCraig or Bananas
  • Styles, e.g. the style of a certain artist
  • Actions/Poses, e.g. Eating a large Cheeseburger, or Riding a Chocobo
  • Specials, e.g. A lora for adding/removing details

To use a LoRA, you need to do two things: load it and trigger it. Some LoRA don't need to be triggered, usually style LoRA. They are usually .safetensors files.
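
For reference, loading and triggering a LoRA looks roughly like this in the diffusers library (a sketch reusing the pipe from the earlier snippets; the file name, trigger word and weight are just examples). In Automatic1111, the equivalent is the <lora:name:weight> syntax described further down:

```python
# Sketch: load a LoRA, then trigger it by putting its trigger word into the prompt.
pipe.load_lora_weights("./models/Lora", weight_name="EliaStellaria.safetensors")

image = pipe(
    "masterpiece, best quality, EliaStellaria, 1girl, white dress",  # trigger word included
    num_inference_steps=20,
    cross_attention_kwargs={"scale": 1.0},  # roughly the :1 in <lora:EliaStellaria:1>
).images[0]
```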

Technical explanation

Will follow… I'm not really an expert on this yet, but you don't really need to understand how they work internally in order to use or even create them.

Hypernetworks/LyCoris

For users, these two work the same way as LoRA: similar usage, same effects. As a rule of thumb (based on my opinion): Hypernetworks are worse than LoRA, while LyCORIS are usually better. They are usually .safetensors files.

Embeddings/Textual Inversion

Embeddings, aka Textual Inversions, are similar to LoRA too, but are used a bit differently: they usually only need to be triggered. For concepts like objects, characters or poses I would not recommend embeddings, as they don't work that well there. What they absolutely do excel at is controlling quality. E.g. there are embeddings which drastically improve the look of generated hands. Instead of looking like mutations, they tend to look more like actual hands, sometimes even with 4 fingers + a thumb! They are usually saved as a .pt file.
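
In diffusers terms, an embedding is loaded once and then triggered simply by writing its token into the (negative) prompt. A sketch reusing the earlier pipe (the file name and token are just examples of the kind of hand-fixing negative embedding described above):

```python
# Sketch: load an embedding and trigger it from the negative prompt.
pipe.load_textual_inversion("./embeddings/bad-hands-5.pt", token="bad-hands-5")

image = pipe(
    "1girl, waving",
    negative_prompt="lowres, bad anatomy, bad-hands-5",  # triggered just by naming it
    num_inference_steps=20,
).images[0]
```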

Using Automatic1111

The Automatic1111 UI is the most well-known UI for running SD on your own hardware. It's packed full of features and can even be expanded with extensions. The official wiki of the UI features installation instructions for Win/Mac/Linux and AMD/Nvidia/Apple/Intel GPUs. It also includes a lot of other information and is definitely worth a look. On Linux, installation basically comes down to:

    1. Installing python3.10 and its venv module
    2. Cloning the UI's git repo
    3. Creating a venv called venv in the cloned repo
    4. Running the webui

Regarding step 4: when starting the webui with "./webui.sh", it's worth knowing a few extra arguments.

| Argument | Effect |
| --- | --- |
| --listen | Opens the UI to the local network, so that you can access it from another computer |
| --lowvram | Reduces the amount of VRAM needed for generating images by a lot (also reducing speed) |
| --medvram | Reduces the amount of VRAM needed by a bit (marginal speed loss) |
| --xformers | Enables xformers on Nvidia GPUs, speeding up generations, but making them non-deterministic |
| --enable-insecure-extension-access | See below, in "Extensions" |

Once the terminal says something like "Running on local URL: http://0.0.0.0:7860" you can connect to the UI on that device by following the link. If using --listen, you can also connect from another device in the same network, using the local IP of the device the UI is running on (e.g. http://192.168.0.50:7860 ). You can easily determine your local IP with the command "ip a" on Linux (and probably macOS) or "ipconfig" in Windows CMD/Powershell.

Important Settings

Alright, now let's look at the UI and all its text fields, sliders and buttons, one by one.

txt2img tab

In this tab, you can create images from text, which is usually called text to image, or shortened to txt2img.

Stable Diffusion checkpoint

A list to choose the actual model that gets used during the denoising.
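
Picking a checkpoint here roughly corresponds to loading that .safetensors/.ckpt file as the model. For reference, a diffusers sketch of the same idea (the file path is just an example):

```python
# Sketch: what selecting a checkpoint boils down to.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_single_file(
    "./models/Stable-diffusion/someAnimeModel.safetensors", torch_dtype=torch.float16
).to("cuda")
```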

Prompt and Negative Prompt

Here you can add the instructions which the text encoder will tokenize and pass to the model. Whatever you write in the prompt will (hopefully) end up in your image. Things you add in the negative prompt will be avoided during the diffusion process and should (hopefully) not end up in the image.

Sampling Method

The sampler that will be used during the denoising.

Sampling Steps

How many iterations of denoising should be done. 15-25 is usually a sane range, but this can vary depending on the model and the sampling method.

Highres Fix

A way to generate images at higher resolutions without duplication artifacts. When using this, your image is generated as usual, but after the set number of steps it is upscaled, a bit of noise is added, and the denoising is repeated. "Denoising Strength" is probably the most important setting here, as it determines how much the upscaled image will differ from the original. Below 0.3 will usually result in a washed-out image. Upscale factors higher than 2 tend to create more weirdness, like extra arms or heads (in that case, try reducing the denoising strength).
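
Conceptually, Highres Fix is "generate small, upscale, then denoise again on the upscaled image". A rough diffusers sketch of that idea (not Automatic1111's exact implementation; the file name, resolution and strength are just examples):

```python
# Sketch: upscale a 512x512 result, then run img2img on it with moderate strength.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

img2img = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

base = Image.open("miku.png").resize((1024, 1024))  # the upscaled base image
high_res = img2img(
    "1girl, hatsune miku",
    image=base,
    strength=0.4,            # plays the role of the Denoising Strength slider
    num_inference_steps=20,
).images[0]
```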

Width/Height

The resolution of the final image (or, when using Highres Fix, the resolution before upscaling). Stable Diffusion 1.5 (the most popular version) is trained on 512x512 images and is best at this resolution. 512x768 and 768x768 can still give decent results too. Some models are also good at 1024x1024, but that's a bit rare. Do note that increasing the resolution also increases the needed VRAM, and much faster than linearly!

Batch count/size

You can create images in batches. Batch Size determines how many images are created in parallel, Batch Count how many of those batches are created. They will usually start with the set seed and increase it by one for each image. Useful when testing out prompts, but Batch Size linearly increases the needed VRAM.

CFG Scale

CFG stands for Classifier-Free Guidance, and this scale determines how closely the AI should follow your instructions. Lower values increase its "creativity", higher ones decrease it. 7 is a sane value, but lowering it to something like 3 can create really interesting images too.

Seed

This number determines the random noise image that gets created as the latent (before the denoising starts). Using a fixed seed should result in similar images. E.g. setting the seed to 774 and creating one image with the prompt "1girl, hatsune miku" and one with "1girl, hatsune miku, sunglasses" should result in the same pose and framing, just with and without sunglasses. This can be used for fine-tuning your prompt and settings. A seed of -1 means the seed is chosen randomly.
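
In code terms, the seed is simply the seed of the random generator that produces the initial latent. A sketch reusing the pipe from earlier (the prompts mirror the example above):

```python
# Sketch: a fixed seed gives the same starting noise, so only the prompt change
# should show up in the result.
import torch

gen = torch.Generator("cuda").manual_seed(774)
plain = pipe("1girl, hatsune miku", generator=gen, num_inference_steps=20).images[0]

gen = torch.Generator("cuda").manual_seed(774)  # re-seed to get the identical latent
shades = pipe("1girl, hatsune miku, sunglasses", generator=gen, num_inference_steps=20).images[0]
```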

Script

Scripts are very powerful in a variety of different fields. I've only used the X/Y/Z plot so far, which is a powerful way of generating grids of images that differ from each other in just one or two settings. Here is an example:

  • X type: CFG Scale
  • X values: 1,2.5,4,6,8,10,16
  • Y type: Prompt S/R (Search/Replace)
  • Y values: sunglasses, glasses, round glasses

This will create 7*3 (=21) images in a grid. Prompt S/R will search for the word "sunglasses" in the prompt and replace it with "glasses" and "round glasses" along the Y axis. Using this X/Y plot should give a good understanding of the CFG Scale setting. I recommend creating a few plots varying the Highres Fix upscale factor and denoising strength, the sampling steps, and the sampler to gain a deeper understanding of these values.
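
Under the hood, such a plot is little more than a nested loop over the two axes with a fixed seed. A conceptual sketch, reusing the pipe from earlier (assembling the images into an actual grid picture is omitted):

```python
# Sketch: what the X/Y plot boils down to.
import torch

prompt = "1girl, hatsune miku, sunglasses"
cfg_values = [1, 2.5, 4, 6, 8, 10, 16]                     # X axis: CFG Scale
replacements = ["sunglasses", "glasses", "round glasses"]  # Y axis: Prompt S/R

grid = []
for word in replacements:
    row = []
    for cfg in cfg_values:
        gen = torch.Generator("cuda").manual_seed(774)     # fixed seed for comparability
        row.append(pipe(
            prompt.replace("sunglasses", word),
            guidance_scale=cfg,
            generator=gen,
            num_inference_steps=20,
        ).images[0])
    grid.append(row)                                       # 3 rows x 7 columns = 21 images
```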

Extras Tab

This is usually used for upscaling created images using different img2img AIs trained on upscaling. You can set a scale factor, choose one or two upscalers (which will be blended together) etc.

PNG Info Tab

By default, every image you generate will have your generation settings saved in the .png metadata. In this tab, you can paste an image and see your used generation settings. Images shared on Twitter, Whatsapp, Pixiv, etc. will usually get stripped of their metadata, rendering this tab mostly useless for these images. Images from Civitai usually include metadata. You can also send the selected image and prompt directly to txt2img, img2img, etc, which is super convenient when you want to recreate an image or want to continue working on your settings another day.

Checkpoint Merger, Train

This is used for creating your own models or merging several models together. I have no experience with this tab.

Text2Prompt

Well, the name says what you can do here, but I haven't used it yet, so I can't comment on it.

Settings

Here you can customize pretty much anything. Saving behavior, live previews, UI Reorders, etc. Some settings I recommend:

  • Saving images/grids:

    • Save copy of image before highres fix: true
    • Save text information about generation … to png files: true (ethical and ensures reproducibility, see PNG Info Tab)
    • Create a text file next to image with generation parameters: true
  • User interface:

    • Gradio Theme: choose your favourite colorscheme for the WebUI
    • Quicksettings list: sd_model_checkpoint sd_vae CLIP_stop_at_last_layers (adds quicksettings for CLIP and VAE to the top)
  • Live previews:

    • Live preview display period: 2 (gives some nice visual feedback, but slows generation down by a bit)

Using Models, Lora, Lycoris, Embeddings

Installing

  • Models: Download into: "[WebUIBaseDir]/models/Stable-diffusion/"
  • Lora: Download into: "[WebUIBaseDir]/models/Lora/"
  • Lycoris: Download into: "[WebUIBaseDir]/models/Lora/" (yes, in up-to-date versions, its the same as Lora)
  • Hypernetworks: Download into: "[WebUIBaseDir]/models/hypernetworks/"
  • Embeddings: Download into: "[WebUIBaseDir]/embeddings/" (yes, not in the models directory for some reason…)

Personally, I recommend adding an image and optionally a text file with the same name for each downloaded file. The image should be representative of the effects of the Lora, as it will be used as a thumbnail in Automatic1111. In the text file I would add the trigger word and weight (for Lora/Lycoris/Hypernetworks), additional information and a link to the download page. This will show up in the WebUI too, but there is a newer and better way to do so (see below). For example, if you add my EliaStellaria Lora, you would have these files: "[WebUIBaseDir]/models/Lora/EliaStellaria.safetensors", "[WebUIBaseDir]/models/Lora/EliaStellaria.png/jpg" and "[WebUIBaseDir]/models/Lora/EliaStellaria.txt"

Usage

Inside the WebUI's txt2img and img2img tabs, you can switch from the "Generation" tab to the "Lora" tab and the other network tabs. This is where we will see the images we saved (try the refresh button if you can't find new additions). If you added .txt files, you will be able to see their contents too (hover with the mouse to expand). On click, the network's activation will be added to the prompt, looking like this: <lora:Elia:1>. On hover, you will also get a tool button, try it out! A window will open, allowing you to edit the description (which is the content of the .txt), see the training-data tags the creator used, set a new thumbnail and, most importantly:

Activation Text and weight

You can (and probably should) add the main trigger word here and set the default weight of the network. This will automatically add the trigger word(s) along with the activation to the prompt when you click the network's entry. If the network has optional prompts (e.g. outfit prompts for a character), I would add them to the notes instead, but it's your choice.

Organization

You don't have to put all your Lora into the same folder. You can use subdirectories, which will help you a lot if you have many Lora or want to separate NSFW Lora from SFW Lora. The same goes for Checkpoints, Lycoris, etc. E.g. my Lora directory tree looks like this:

Lora
├── NSFW
│   ├── NClothes
│   ├── NPoses
│   ├── NSpecial
│   └── NStyle
└── Safe
    ├── Characters
    │   └── Elia
    │       ├── New
    │       └── Stanana
    ├── Clothes
    ├── Objects
    ├── Poses
    ├── Special
    └── Style
        └── ObjectStyles

The reason I prefixed the NSFW subfolders with an N is an old bug (maybe fixed by now) that merged Safe/Clothes with NSFW/Clothes in the WebUI. The same reason is why I use "Safe" instead of "SFW". For Checkpoints, I use this structure:

Stable-diffusion
├── Anime
├── AnimeXL
└── Realism

Extensions

Now that we have taken a look at the standard user interface of Automatic1111, we will look at adding even more features bit by bit, using extensions. To install extensions, you first have to start the UI with an extra argument in the terminal/cmd:

```sh
./webui.sh --enable-insecure-extension-access
```

This is only needed for installing extensions, not for running them, so I recommend not using this argument during normal use of the UI, only for installing/updating extensions. The reason being: it's insecure, especially when combined with the --listen argument, because then someone on your Wi-Fi could basically install a virus into your Automatic1111 UI. In the same train of thought: only install extensions you trust. High activity and many stars on the linked git repository can be a good indicator of trustworthiness, but they are not perfect!

So, to proceed with the installation: In the WebUI, click on the Extensions tab, then the "Available" tab. The text field next to the "Load from:" button should already contain a link to the official extensions list. If it doesn't, copy/paste it yourself: https://raw.githubusercontent.com/AUTOMATIC1111/stable-diffusion-webui-extensions/master/index.json Now you can click the button and you will get a list of all these extensions. For your first visits, it's a good idea to sort by stars, which orders them by popularity. To install an extension, click its install button, return from the "Available" tab to the "Installed" tab and click on "Apply and restart UI". Done! So, here are my personal favourites:

a1111-sd-webui-tagcomplete

This adds a small popup while you type your prompt, suggesting booru-style tags for your anime images. Very handy, as anime models are trained to understand exactly these tags.

multidiffusion-upscaler-for-automatic1111

Ever got a "CUDA out of memory" error? This usually happens when your choosen resolution is so large, that you need more VRAM in your Graphics Card. Sadly, you can't add more VRAM without replacing the entire graphics card. Using –verylowvram makes generations abyssmally slow, but this extension is your saviour! In the txt2img and img2img tabs, you will get 2 new sections:

Tiled VAE

If you get the out-of-memory errors at 99%, that is no coincidence: the decoding using the VAE usually needs more VRAM than the generation itself. And since decoding is the very last step, this wastes all the time you waited for your image. With Tiled VAE, you can fix this issue. Just enable it and either disable the Fast Encoder OR enable the "Fast Encoder Color Fix". For very limited VRAM, you might want to reduce the Encoder and Decoder Tile Sizes. It is normal for the Decoder Tile Size to be significantly smaller than the Encoder one; e.g. 1536/96 is usually good for my 8GB graphics card. There is basically no conceivable downside to using Tiled VAE, so I always enable it to avoid possible errors at 99%.

Tiled Diffusion

If you get out-of-memory errors during creation, especially at 50% when using Highres Fix, you can enable Tiled Diffusion. It will cut your image into smaller images (tiles) and work on 1 to 8 tiles at a time (set using "Latent tile batch size"), drastically reducing VRAM usage. As a rule of thumb, you might want to set the tile width/height to about a tenth of your output resolution (if using Highres Fix, use its output resolution). An overlap of 50% is a good idea: if set too low, you will get visible seams in the output image; if set too high, you will waste a lot of time. You can choose the method to your own preference, just try both out and decide for yourself which one you prefer.

If you are in the img2img tab, you also get another section within Tiled Diffusion: Noise Inversion. This is basically an upscaling method which can also add more detail to the image. When you enable Noise Inversion, you should reduce the Denoising Strength (in the settings further up) to below 0.6, but feel free to experiment. Also, set an Upscaler and a Scale Factor within Tiled Diffusion; even something like 4 or higher is fine.