Last updated August 28, 2022

Introduction

The Image Synthesis Style Studies is a database of publicly collected information documenting the responses of open-source “AI” image synthesis models, such as CLIP-guided diffusion and Stable Diffusion, to specific text-based inputs. These inputs include adjectives, names of artists, popular media, or other descriptors (hereafter “modifiers”) with plausible visual effects on the style of the images these models synthesize.
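As a purely illustrative example of how such modifiers are combined with a base prompt, the hypothetical sketch below appends modifiers to a subject description; the function name and example strings are assumptions for illustration, not part of the database or of any model’s API.

# Hypothetical sketch: composing a text prompt by appending style modifiers
# to a base subject. Names and example strings are illustrative only.
def build_prompt(subject: str, modifiers: list[str]) -> str:
    """Join a subject description with comma-separated style modifiers."""
    return ", ".join([subject] + modifiers)

# e.g. "a seaside town, by Ivan Aivazovsky, at night"
print(build_prompt("a seaside town", ["by Ivan Aivazovsky", "at night"]))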

The database includes the recognition status of each individual modifier for each set of models (i.e., whether the model recognized the modifier, indicated with “Yes”, “No”, or “Unsure”). It also includes examples of images synthesized by the above models (Figure 1) when tested with the specified modifier to determine recognition status (see FAQ below - “How do you determine whether the models recognize a modifier?”).
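For concreteness, a single entry might resemble the hypothetical record sketched below; the field names and values are assumptions chosen for illustration and do not reflect the database’s actual column names.

# Hypothetical shape of one database entry; field names and values are
# illustrative placeholders, not the actual spreadsheet columns.
entry = {
    "modifier": "at night",
    "recognition": {                      # "Yes", "No", or "Unsure" per model
        "Disco Diffusion v4.1": "Yes",
        "Stable Diffusion": "Yes",
    },
    "example_images": ["at night.png"],   # study card image for this modifier
}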


Figure 1. Examples of study “cards” of images synthesized with Disco Diffusion (an implementation of CLIP-guided diffusion) using text inputs that include selected descriptors, artist names, and other media (e.g., “at night”, “Ivan Aivazovsky”, “Ernst Haeckel”, and “amigurumi”, the Japanese art of knitting or crocheting small, stuffed yarn creatures). Note: “by [artist name]” in these prompts is equivalent to “in the style of”. Outputs are not representative of actual work by these artists, nor does the use of these phrasings involve direct sampling or reuse of their actual works by the models.


This is a living database, meaning it is a work in progress and is still regularly updated with modifier recognition information using both Disco Diffusion v4.1 (an implementation of CLIP-guided diffusion that runs in Google Colab) and Stable Diffusion. It is not meant to include all possible modifiers, but rather to serve as public documentation of the recognition status of specific text-based modifiers with plausible presence in the original datasets of image-text pairs used to train the models.

Note that the original datasets used in the training of these models either:

a) are not public, as in the case of CLIP’s training corpus, or

b) do not have modifier recognition information formally aggregated, as in the case of the publicly available LAION-5B dataset, from which the dataset of image-text pairs used in Stable Diffusion’s training was derived.
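Because LAION-5B’s captions are public, one can at least roughly gauge whether a given modifier appears in them, even though no formal recognition information is aggregated there. The sketch below counts case-insensitive caption matches in a hypothetical CSV export; the file name and column name are assumptions, and this is not how the Style Studies determine recognition status (which relies on test prompts, as described in the FAQ).

# Hypothetical sketch: counting captions that contain a modifier in a local
# CSV export of image-text pairs. File name and "caption" column are assumed.
import csv

def count_caption_matches(captions_csv: str, modifier: str) -> int:
    matches = 0
    with open(captions_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if modifier.lower() in row.get("caption", "").lower():
                matches += 1
    return matches

# e.g. count_caption_matches("laion_captions_sample.csv", "Ivan Aivazovsky")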

A layperson’s explanation of CLIP and diffusion models falls outside the scope of this writing, but readers who are unfamiliar with these methods are encouraged to access the following resources:

The AI that creates any picture you want, explained (video by Vox):

https://www.youtube.com/watch?v=SVcsDDABEkM

A Beginner’s Guide to the CLIP Model (article by Matthew Brems):

https://www.kdnuggets.com/2021/03/beginners-guide-clip-model.html

Diffusion models explained in 4-difficulty levels (video by AssemblyAI):

https://www.youtube.com/watch?v=yTAMrHVG1ew

How does DALL-E 2 actually work? (video by AssemblyAI):

https://www.youtube.com/watch?v=F1X4fHzF4mQ