
A text-to-image model is a type of artificial intelligence system that generates images from natural language descriptions. For example, given the text "a blue bird with yellow wings", a text-to-image model can produce an image of a bird that matches the description. Text-to-image models are useful for applications such as content creation, data augmentation, image editing, and visual communication.

Text-to-image models typically consist of two components: an encoder and a decoder. The encoder transforms the input text into a latent representation, such as a vector or a tensor. The decoder then uses the latent representation to generate an image, for example pixel by pixel, patch by patch, or by iteratively denoising a noisy image. The components are trained on large datasets of text-image pairs, such as COCO or Conceptual Captions, either jointly or with a pretrained text encoder. Training optimizes an objective function that measures how well the generated images match the input texts and how realistic the images look.

Text-to-image models have developed rapidly since the mid-2010s, driven by advances in deep neural networks. By 2022, the output of state-of-the-art models had begun to approach the quality of real photographs and human-made artwork. Notable examples include OpenAI's DALL-E 2, Google Brain's Imagen, Stability AI's Stable Diffusion, and Midjourney.


Typically, a text-to-image model combines two main components: a language model that transforms the input text into a latent representation, and a generative image model that produces an image conditioned on that representation. The most capable models have generally been trained on very large quantities of text and image data collected from the web.
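
The two-stage structure described above can be illustrated with a deliberately tiny sketch. Nothing here is a real model: the hashing "encoder", the 8-value latent, the 4x4 grayscale output, and all function names are assumptions chosen only to show how text becomes a latent and how the latent becomes pixels.

```python
import hashlib
from typing import List

LATENT_DIM = 8  # assumed toy size; real latent representations are far larger


def encode_text(prompt: str) -> List[float]:
    """Toy stand-in for the language model: prompt -> fixed-length latent."""
    latent = [0.0] * LATENT_DIM
    tokens = prompt.lower().split()
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        for i in range(LATENT_DIM):
            latent[i] += digest[i] / 255.0  # each byte scaled into [0, 1]
    # Average over tokens so prompt length does not change the scale.
    return [value / max(len(tokens), 1) for value in latent]


def decode_image(latent: List[float], height: int = 4, width: int = 4) -> List[List[int]]:
    """Toy stand-in for the image generator: latent -> grayscale pixel grid."""
    image = []
    for y in range(height):
        row = []
        for x in range(width):
            value = latent[(y * width + x) % len(latent)]
            row.append(round(value * 255))  # map [0, 1] to 0..255
        image.append(row)
    return image


def generate(prompt: str) -> List[List[int]]:
    """Two-stage pipeline: text -> latent -> image."""
    return decode_image(encode_text(prompt))


image = generate("a blue bird with yellow wings")
for row in image:
    print(row)
```

In a real system both stages are learned neural networks rather than fixed functions, but the data flow is the same: the decoder never sees the text itself, only the latent produced from it.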

Prompts and Generation

A model accepts text inputs, called prompts, and generates an image from them. A positive prompt describes what the image should contain; a negative prompt describes what it should not contain.
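
One common way diffusion-based generators combine a positive and a negative prompt is classifier-free guidance: at each denoising step the model predicts noise once for each prompt, and the final prediction is pushed away from the negative one and toward the positive one. The sketch below shows only that combination step, with plain lists standing in for tensors. The function name is illustrative, and the default scale of 7.5 is an assumed, commonly seen value rather than a requirement.

```python
from typing import List


def guided_noise(
    noise_negative: List[float],
    noise_positive: List[float],
    guidance_scale: float = 7.5,  # assumed default; front ends let users change it
) -> List[float]:
    """Combine two noise predictions: negative + scale * (positive - negative)."""
    if len(noise_negative) != len(noise_positive):
        raise ValueError("noise predictions must have the same length")
    return [
        neg + guidance_scale * (pos - neg)
        for neg, pos in zip(noise_negative, noise_positive)
    ]


# With a scale of 1.0 the result is simply the positive prediction;
# larger scales move further away from the negative prediction.
print(guided_noise([0.0, 1.0], [1.0, 1.0], guidance_scale=2.0))  # [2.0, 1.0]
```

When no negative prompt is given, the negative branch is typically computed from an empty prompt, so the same formula applies.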

Tips on how to use positive and negative text prompts for Stable Diffusion

  • Start with a simple positive prompt. This could be something like "a photorealistic painting of a cat" or "a fantasy landscape with a castle."
  • Once you're happy with the basic image, you can start adding negative prompts. These are words or phrases describing what you don't want to see in the image. In interfaces that provide a separate negative prompt field, name the unwanted thing directly: for example, enter "people" or "text" rather than "no people" or "no text," because everything in that field is already treated as something to avoid.
  • Be specific with your negative prompts. The more specific you are, the more likely Stable Diffusion is to avoid the things you don't want to see. For example, instead of just "people," you could use "people in the foreground" or "people in the frame."
  • Experiment with different negative prompts. There's no one-size-fits-all negative prompt. You'll need to experiment to find the ones that work best for you.
  • Use a combination of positive and negative prompts. Sometimes the best way to get the desired result is to use both together. For example, you could pair the positive prompt "a fantasy landscape with a castle" with the negative prompt "people," or pair "a photorealistic painting of a cat" with the negative prompt "background."
  • Be patient. It may take some trial and error to get the desired results. Don't get discouraged if you don't get it right the first time.
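
As a minimal sketch of the workflow in these tips, the helper below assembles a positive and a negative prompt from separate pieces so that each can be adjusted independently between attempts. It assumes a front end with a separate negative prompt field in which unwanted items are named directly; `build_prompts` and its parameters are hypothetical names, not part of Stable Diffusion.

```python
from typing import Iterable, List, Sequence, Tuple


def _clean(terms: Iterable[str]) -> List[str]:
    """Strip whitespace, drop empty terms, and remove case-insensitive repeats."""
    kept = []
    seen = set()
    for term in terms:
        term = term.strip()
        key = term.lower()
        if term and key not in seen:
            seen.add(key)
            kept.append(term)
    return kept


def build_prompts(
    subject: str,
    details: Sequence[str] = (),
    avoid: Sequence[str] = (),
) -> Tuple[str, str]:
    """Return (positive_prompt, negative_prompt) as comma-separated strings."""
    positive = ", ".join(_clean([subject, *details]))
    negative = ", ".join(_clean(avoid))
    return positive, negative


# Start simple, then add details and things to avoid one at a time.
positive, negative = build_prompts(
    "a fantasy landscape with a castle",
    details=["photorealistic"],
    avoid=["people", "text", "people"],
)
print(positive)  # a fantasy landscape with a castle, photorealistic
print(negative)  # people, text
```

The resulting pair can be pasted into the corresponding fields of a user interface. Libraries that expose the model programmatically usually take the two strings as separate arguments; for example, the Hugging Face diffusers Stable Diffusion pipeline accepts `prompt` and `negative_prompt`.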

Additional Tips

  • Use keywords that are specific and descriptive. The more specific you are, the better Stable Diffusion will be able to understand what you want.
  • Use keywords that are relevant to the subject matter of the image. For example, if you're trying to generate an image of a cat, you would use keywords like "cat," "fur," "whiskers," and "tail."
  • Avoid keywords that are too generic. On their own, generic keywords give Stable Diffusion little to distinguish your image from any other, which makes a unique result less likely.
  • Use negative keywords sparingly. Too many negative keywords can over-constrain the generation and make it difficult for Stable Diffusion to produce a usable image.
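
Some of these tips can be checked mechanically before generating. The following is a small, hypothetical prompt checker: it assumes comma-separated keywords, and the threshold of 10 negative terms is an arbitrary illustrative value, not a documented limit of Stable Diffusion.

```python
from typing import List

MAX_NEGATIVE_TERMS = 10  # assumed threshold for "sparingly"; tune to taste


def _split(prompt: str) -> List[str]:
    """Split a comma-separated prompt into lowercase, non-empty keywords."""
    return [term.strip().lower() for term in prompt.split(",") if term.strip()]


def lint_prompts(positive: str, negative: str) -> List[str]:
    """Return human-readable warnings; an empty list means no issues found."""
    warnings = []
    positive_terms = _split(positive)
    negative_terms = _split(negative)

    if not positive_terms:
        warnings.append("positive prompt is empty")
    if len(negative_terms) > MAX_NEGATIVE_TERMS:
        warnings.append(
            f"{len(negative_terms)} negative terms; "
            f"consider trimming to {MAX_NEGATIVE_TERMS} or fewer"
        )
    # A keyword that is both requested and excluded sends mixed signals.
    for term in sorted(set(positive_terms) & set(negative_terms)):
        warnings.append(f"'{term}' appears in both prompts")
    return warnings


print(lint_prompts("a cat, fur, whiskers, tail", "text"))  # []
print(lint_prompts("a cat, text", "text"))
```

A checker like this cannot judge whether a keyword is specific or relevant enough; that still takes experimentation with the generated images.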

Stable Diffusion Expansion

Stable Diffusion's capabilities have expanded beyond text input alone: generation can also be steered by other inputs and parameters, such as an initial image, the random seed, the number of sampling steps, and the guidance scale. Nevertheless, the text prompt remains the cornerstone of the Stable Diffusion model.
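
To make the idea of parameters beyond the text concrete, the sketch below groups a prompt with a few settings that Stable Diffusion front ends commonly expose. The class, its field names, and its default values are assumptions for illustration; actual names, defaults, and limits vary between tools and model versions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GenerationSettings:
    """Hypothetical bundle of a text prompt plus non-text parameters."""

    prompt: str
    negative_prompt: str = ""
    seed: Optional[int] = None   # None means "pick a random seed"
    steps: int = 30              # number of sampling (denoising) steps
    guidance_scale: float = 7.5  # how strongly to follow the prompt
    width: int = 512
    height: int = 512

    def __post_init__(self) -> None:
        if not self.prompt.strip():
            raise ValueError("prompt must not be empty")
        if self.steps < 1:
            raise ValueError("steps must be at least 1")
        # Assumption: image sizes must be multiples of 8, as is typical
        # for latent diffusion models that downsample by a factor of 8.
        if self.width % 8 or self.height % 8:
            raise ValueError("width and height must be multiples of 8")


settings = GenerationSettings(prompt="a fantasy landscape with a castle", seed=42)
print(settings)
```

Fixing the seed while changing one other setting at a time is a practical way to see what each parameter does, since the text prompt and the starting noise stay the same.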