Modern generative AI models can create detailed, accurate images from nothing more than a short text description. To use these tools productively, it helps to understand how they work.
How does an image model process a prompt like "Barbie house in Bauhaus style, pink, in the style of a watercolor architecture drawing by Steven Holl" and turn it into a matching image?
Generative Image AI refers to artificial intelligence models (Midjourney, Stable Diffusion, DALL-E) that can create high-quality imagery from text or image prompts. These models are trained on billions of images (the dataset), curated for the intended use of the model.
Currently, two types of AI models dominate the market:
Diffusion models take their name from thermodynamic diffusion, the process by which a drop of food coloring spreads through water until the color becomes uniform.
Diffusion models apply this principle to images: they take an image and diffuse it, gradually altering its pixels until the image becomes TV static. By watching this forward process, the model learns to reverse it, that is, to take a noisy image and step it backward toward a clean one. You can think of forward diffusion, turning clear images into static, as the training process, and reverse diffusion as the act of generating new images from static noise.
The key idea is that it is easy for computers to generate TV static, and the randomness of that static becomes the starting point for a new image every time. This randomness is also why diffusion models produce different images on each run, even when the same prompt is used.
A high-level diagram of diffusion model processes.
Since diffusion models always start image synthesis from random noise, the result is different each time, which makes them highly effective at producing a diverse range of images. However, they are slower to produce images than their peers, VAEs and GANs.
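To make the forward and reverse processes concrete, here is a minimal Python sketch of the idea. The linear noise schedule and the `denoise_step` function are illustrative placeholders, not the exact components of any particular model; real models use carefully tuned schedules and a trained neural network for denoising.

```python
import numpy as np

def forward_diffuse(image, t, num_steps=1000):
    """Mix an image with Gaussian noise; at t = num_steps it is pure static.

    `image` is a float array scaled to [0, 1]. The linear schedule here is an
    illustrative assumption, not a real model's schedule.
    """
    alpha = 1.0 - t / num_steps            # how much of the original survives
    noise = np.random.randn(*image.shape)  # the "TV static"
    return alpha * image + (1.0 - alpha) * noise

def generate(denoise_step, shape, num_steps=1000):
    """Reverse diffusion: start from pure noise and repeatedly denoise.

    `denoise_step` is a hypothetical stand-in for the trained network that
    predicts a slightly cleaner image from a noisier one.
    """
    x = np.random.randn(*shape)             # a fresh random starting point
    for t in reversed(range(num_steps)):
        x = denoise_step(x, t)              # each call removes a little noise
    return x
```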
A Generative Adversarial Network pits two sub-models, a Generator and a Discriminator, against each other in order to train the Generator to produce images that can pass as originals from the training dataset. The Discriminator judges whether an image is an original or a generated one. Whenever it correctly flags an image as generated, that feedback is used to improve the Generator's outputs, until the Generator produces imagery as convincing as the originals. The training process looks something like this:
Because of this training method, GANs often suffer from limited image diversity. However, they can generate high-quality outputs that closely resemble their training data, which makes them especially useful for style transfer.
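A heavily simplified training loop, sketched here in PyTorch, shows how the Generator and Discriminator push against each other. The tiny fully connected networks and the hyperparameters are placeholder assumptions chosen only to keep the example short; real GANs use much deeper convolutional architectures.

```python
import torch
import torch.nn as nn

# Toy networks: in practice these are deep convolutional models.
generator = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
discriminator = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid())

g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
loss_fn = nn.BCELoss()

def train_step(real_images):
    """One round of the adversarial game; `real_images` is a batch of flattened training images."""
    batch = real_images.size(0)
    real_labels = torch.ones(batch, 1)
    fake_labels = torch.zeros(batch, 1)

    # 1. Train the Discriminator to tell originals from generated images.
    fake_images = generator(torch.randn(batch, 64)).detach()
    d_loss = loss_fn(discriminator(real_images), real_labels) + \
             loss_fn(discriminator(fake_images), fake_labels)
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # 2. Train the Generator to fool the Discriminator (the feedback loop).
    fake_images = generator(torch.randn(batch, 64))
    g_loss = loss_fn(discriminator(fake_images), real_labels)
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```

Each call to `train_step` first teaches the Discriminator to separate originals from fakes, then uses its feedback to nudge the Generator toward more convincing images.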
Now that we know how AI models generate images, we need to understand how they generate specific images. How do they take a text prompt and translate it into meaningful imagery?
Text prompts are passed through a text encoder, which maps the text to vectors. These vectors capture the meaning behind the text and point the model in the right direction to generate the correct images.
Here's an example using the four words below:
In this example, the word "Man" is assigned the vector (0, 1) and the word "Woman" the vector (1, 1). The first digit in the vector refers to sex, and the second digit refers to age.
The next word, "Boy", gets the vector (0, 0), since "Boy" shares its first dimension (sex) with "Man" but differs on the second dimension, age. From this we can infer that "Girl" would be the vector (1, 0).
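Written as code, the same toy scheme looks like this. The two-dimensional (sex, age) encoding is only the illustrative example above; a real text encoder produces vectors with hundreds of dimensions learned from data.

```python
# Toy two-dimensional "embeddings": first value = sex, second value = age.
embeddings = {
    "Man":   (0, 1),
    "Woman": (1, 1),
    "Boy":   (0, 0),
    "Girl":  (1, 0),   # inferred: shares sex with "Woman", age with "Boy"
}

def shared_dimensions(word_a, word_b):
    """Return the indices of the dimensions on which two words agree."""
    return [i for i, (a, b) in enumerate(zip(embeddings[word_a], embeddings[word_b])) if a == b]

print(shared_dimensions("Man", "Boy"))     # [0] -> same sex, different age
print(shared_dimensions("Woman", "Girl"))  # [0] -> same sex, different age
print(shared_dimensions("Boy", "Girl"))    # [1] -> same age, different sex
```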
With these vectors, diffusion models can use the information they contain to steer generation toward specific image outputs. It is important to note that the diffusion process is still stochastic: given the same vectors, different images will be produced each time, because no single image can represent all the information a vector carries. Take the simple example of generating images of a "woman": the model can produce many images that all capture the meaning of the word while differing on any number of factors. Where is the woman? Is she sitting or standing? What color is her hair?
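The sketch below, which assumes a hypothetical `denoise_step` that accepts the prompt embedding as guidance, shows why identical vectors still lead to different images: the embedding stays fixed while the starting noise changes on every call.

```python
import numpy as np

def generate_image(prompt_embedding, denoise_step, shape=(64, 64, 3), num_steps=50, seed=None):
    """Same embedding in, different image out, because the starting noise differs each run."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)                  # a fresh patch of "TV static" each call
    for t in reversed(range(num_steps)):
        x = denoise_step(x, t, prompt_embedding)    # denoising is guided by the prompt vector
    return x

# Two calls with the same embedding but different seeds yield two different images
# that both capture the meaning of the prompt, e.g. "woman".
```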