
Image Interpolation with Stable Diffusion

Authored by: Rustam Akimov

This notebook shows how to use Stable Diffusion to interpolate between images. Image interpolation with Stable Diffusion is the process of creating intermediate images that smoothly transition from one given image to another, using a diffusion-based generative model.

Here are some use cases for image interpolation with Stable Diffusion:

  • Data augmentation: Stable Diffusion can augment training data for machine learning models by generating synthetic images that lie between existing data points. This can improve the generalization and robustness of machine learning models, especially in tasks such as image generation, classification, or object detection.
  • Product design and prototyping: Stable Diffusion can aid product design by generating variations of product designs or prototypes with subtle differences. This can be useful for exploring design alternatives, conducting user studies, or visualizing design iterations before committing to physical prototypes.
  • Content generation for media production: In media production, such as film and video editing, Stable Diffusion can be used to generate intermediate frames between keyframes, enabling smoother transitions and enhancing visual storytelling. This saves time and resources compared to manual frame-by-frame editing.

In the context of image interpolation, Stable Diffusion models are typically used to navigate a high-dimensional latent space, where each dimension represents a specific feature learned by the model. By walking through this latent space and interpolating between the latent representations of different images, the model is able to generate a sequence of intermediate images that show a smooth transition between the originals. There are two types of latents in Stable Diffusion: prompt latents and image latents.

Latent space walking involves moving through the latent space along a path defined by two or more points (representing images). By carefully selecting these points and the path between them, it is possible to control the features of the generated images, such as style, content, and other visual aspects.
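
For intuition, here is a minimal, self-contained sketch of what a "walk" means in practice. The vector size and variable names are hypothetical; the actual walks in this notebook operate on prompt embeddings and U-Net latents.

import torch

# A toy point in latent space and a direction to walk along.
start = torch.randn(8)
direction = torch.randn(8)
step_size = 0.1

# Each step moves a little further along the chosen direction,
# producing a sequence of nearby latent points.
path = [start + step_size * i * direction for i in range(5)]
print(len(path))  # 5 points along the walk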

In this notebook, we will explore examples of image interpolation using Stable Diffusion and show how latent space walking can be implemented and used to create smooth transitions between images. We will provide code snippets and visualizations that demonstrate this process in action, giving a deeper understanding of how generative models can manipulate and morph image representations in meaningful ways.

First, let's install all the required modules.

!pip install -q diffusers transformers xformers accelerate
!pip install -q numpy scipy ftfy Pillow

Import modules

import torch
import numpy as np
import os

import time

from PIL import Image
from IPython import display as IPdisplay
from tqdm.auto import tqdm

from diffusers import StableDiffusionPipeline
from diffusers import (
    DDIMScheduler,
    PNDMScheduler,
    LMSDiscreteScheduler,
    DPMSolverMultistepScheduler,
    EulerAncestralDiscreteScheduler,
    EulerDiscreteScheduler,
)
from transformers import logging

logging.set_verbosity_error()

Let's check whether CUDA is available.

print(torch.cuda.is_available())

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

These settings are used to optimize the performance of PyTorch models on CUDA-enabled GPUs, especially when using mixed precision training or inference, which can be beneficial in terms of speed and memory usage.

Source: https://huggingface.co/docs/diffusers/optimization/fp16#memory-efficient-attention

torch.backends.cudnn.benchmark = True
torch.backends.cuda.matmul.allow_tf32 = True

Model

For this project we use the runwayml/stable-diffusion-v1-5 model together with the LMSDiscreteScheduler scheduler to generate images. Although this model is no longer state of the art, it remains popular thanks to its speed, modest memory requirements, and the many community models built on top of this version. Feel free to experiment with other models or schedulers as well.

model_name_or_path = "runwayml/stable-diffusion-v1-5"

scheduler = LMSDiscreteScheduler(
    beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", num_train_timesteps=1000
)


pipe = StableDiffusionPipeline.from_pretrained(
    model_name_or_path,
    scheduler=scheduler,
    torch_dtype=torch.float32,
).to(device)

# Disable image generation progress bar, we'll display our own
pipe.set_progress_bar_config(disable=True)

These methods are designed to reduce the memory consumed by the GPU. If you have enough VRAM, you can skip this cell.

More detailed information can be found here: https://huggingface.co/docs/diffusers/en/optimization/opt_overview
In particular, information about the following methods can be found here: https://huggingface.co/docs/diffusers/optimization/memory

# Offloading the weights to the CPU and only loading them on the GPU when performing the forward pass can reduce memory consumption to less than 3GB.
pipe.enable_model_cpu_offload()

# Tighter ordering of memory tensors.
pipe.unet.to(memory_format=torch.channels_last)

# Decoding large batches of images with limited VRAM or batches with 32 images or more by decoding the batches of latents one image at a time.
pipe.enable_vae_slicing()

# Splitting the image into overlapping tiles, decoding the tiles, and then blending the outputs together to compose the final image.
pipe.enable_vae_tiling()

# Using Flash Attention; If you have PyTorch >= 2.0 installed, you should not expect a speed-up for inference when enabling xformers.
pipe.enable_xformers_memory_efficient_attention()

The display_images function converts a list of image arrays into a GIF, saves it to the specified path, and returns the GIF object for display. It names the GIF file using the current time and prints any errors that occur.

def display_images(images, save_path):
    try:
        # Convert each image in the 'images' list from an array to an Image object.
        images = [Image.fromarray(np.array(image[0], dtype=np.uint8)) for image in images]

        # Generate a file name based on the current time, replacing colons with hyphens
        # to ensure the filename is valid for file systems that don't allow colons.
        filename = time.strftime("%H:%M:%S", time.localtime()).replace(":", "-")
        # Save the first image in the list as a GIF file at the 'save_path' location.
        # The rest of the images in the list are added as subsequent frames to the GIF.
        # The GIF will play each frame for 100 milliseconds and will loop indefinitely.
        images[0].save(
            f"{save_path}/{filename}.gif",
            save_all=True,
            append_images=images[1:],
            duration=100,
            loop=0,
        )
    except Exception as e:
        # If there is an error during the process, print the exception message.
        print(e)

    # Return the saved GIF as an IPython display object so it can be displayed in a notebook.
    return IPdisplay.Image(f"{save_path}/{filename}.gif")

Generation parameters

  • seed: This variable sets a specific random seed so that results can be reproduced.
  • generator: If a seed is provided, this is set to a PyTorch random number generator object; otherwise it is None. It ensures that operations using it produce reproducible results.
  • guidance_scale: This parameter controls how strongly the model follows the prompt in text-to-image generation; higher values mean stronger adherence to the prompt.
  • num_inference_steps: This specifies the number of steps the model takes to generate an image. More steps can produce higher-quality images but increase generation time.
  • num_interpolation_steps: This determines the number of steps used when interpolating between two points in the latent space, which affects how smooth the transitions in the generated animation are.
  • height: The height of the generated images, in pixels.
  • width: The width of the generated images, in pixels.
  • save_path: The file system path where the generated GIFs will be saved.

# The seed is set to "None", because we want different results each time we run the generation.
seed = None

if seed is not None:
    generator = torch.manual_seed(seed)
else:
    generator = None

# The guidance scale is set to its normal range (7 - 10).
guidance_scale = 8

# The number of inference steps was chosen empirically to generate an acceptable picture within an acceptable time.
num_inference_steps = 15

# The higher you set this value, the smoother the interpolations will be. However, the generation time will increase. This value was chosen empirically.
num_interpolation_steps = 30

# I would not recommend less than 512 on either dimension. This is because this model was trained on 512x512 image resolution.
height = 512
width = 512

# The path where the generated GIFs will be saved
save_path = "/output"

if not os.path.exists(save_path):
    os.makedirs(save_path)

Example 1: Prompt interpolation

In this example we interpolate between the embeddings of the positive and negative prompts, exploring the space of concepts defined by the two. This produces a series of images that gradually blend the characteristics described by both prompts. Concretely, we take the embeddings of the original prompts and add small incremental offsets to them, creating a series of new prompt embeddings. These new embeddings are then used to generate images that transition smoothly from the state described by one prompt toward the other.

Example 1

First, we need to tokenize both the positive and negative text prompts and obtain their embeddings. The positive prompt steers the image generation toward the desired features, while the negative prompt steers it away from unwanted ones.

# The text prompt that describes the desired output image.
prompt = "Epic shot of Sweden, ultra detailed lake with an ren dear, nostalgic vintage, ultra cozy and inviting, wonderful light atmosphere, fairy, little photorealistic, digital painting, sharp focus, ultra cozy and inviting, wish to be there. very detailed, arty, should rank high on youtube for a dream trip."
# A negative prompt that can be used to steer the generation away from certain features.
negative_prompt = "poorly drawn,cartoon, 2d, disfigured, bad art, deformed, poorly drawn, extra limbs, close up, b&w, weird colors, blurry"

# The step size for the interpolation in the latent space.
step_size = 0.001

# Tokenizing and encoding the prompt into embeddings.
prompt_tokens = pipe.tokenizer(
    prompt,
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)
prompt_embeds = pipe.text_encoder(prompt_tokens.input_ids.to(device))[0]


# Tokenizing and encoding the negative prompt into embeddings.
if negative_prompt is None:
    negative_prompt = [""]

negative_prompt_tokens = pipe.tokenizer(
    negative_prompt,
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)
negative_prompt_embeds = pipe.text_encoder(negative_prompt_tokens.input_ids.to(device))[0]

Now let's look at the part of the code that generates the initial random latent vector. It is drawn from a normal distribution and shaped to match the dimensions expected by the diffusion model (the U-Net), with an optional random number generator for reproducibility. After creating the initial latent, the code performs a series of interpolations between the two embeddings (positive and negative prompt) by incrementally adding a small step at each iteration. The results are stored in a list called "walked_embeddings".

# Generating initial latent vectors from a random normal distribution, with the option to use a generator for reproducibility.
latents = torch.randn(
    (1, pipe.unet.config.in_channels, height // 8, width // 8),
    generator=generator,
)

walked_embeddings = []

# Interpolating between embeddings for the given number of interpolation steps.
for i in range(num_interpolation_steps):
    walked_embeddings.append([prompt_embeds + step_size * i, negative_prompt_embeds + step_size * i])

Finally, let's generate a series of images from the interpolated embeddings and display them. We iterate over the array of embeddings, using each one to generate an image with the specified characteristics (such as height, width, and the other generation parameters), and collect the images in a list. Once generation is complete, we call the display_images function to save the images as a GIF at the given save path and display it.

# Generating images using the interpolated embeddings.
images = []
for latent in tqdm(walked_embeddings):
    images.append(
        pipe(
            height=height,
            width=width,
            num_images_per_prompt=1,
            prompt_embeds=latent[0],
            negative_prompt_embeds=latent[1],
            num_inference_steps=num_inference_steps,
            guidance_scale=guidance_scale,
            generator=generator,
            latents=latents,
        ).images
    )

# Display of saved generated images.
display_images(images, save_path)

Example 2: Diffusion latent interpolation for a single prompt

Unlike the first example, in this one we interpolate between two latents of the diffusion model itself rather than between prompts. Note that in this case we use the slerp function for interpolation. However, nothing prevents us from adding a constant to one of the latents instead.

Example 2

The function presented below implements spherical linear interpolation (slerp), a method of interpolating along the surface of a sphere. It is commonly used in computer graphics to animate rotations smoothly, and it can also be used to interpolate between high-dimensional data points in machine learning, such as the latent vectors used in generative models.

The function is taken from Andrej Karpathy's gist: https://gist.github.com/karpathy/00103b0037c5aaea32fe1da1af553355.
A more detailed explanation of this method can be found at https://en.wikipedia.org/wiki/Slerp.

def slerp(v0, v1, num, t0=0, t1=1):
    v0 = v0.detach().cpu().numpy()
    v1 = v1.detach().cpu().numpy()

    def interpolation(t, v0, v1, DOT_THRESHOLD=0.9995):
        """helper function to spherically interpolate two arrays v1 v2"""
        dot = np.sum(v0 * v1 / (np.linalg.norm(v0) * np.linalg.norm(v1)))
        if np.abs(dot) > DOT_THRESHOLD:
            v2 = (1 - t) * v0 + t * v1
        else:
            theta_0 = np.arccos(dot)
            sin_theta_0 = np.sin(theta_0)
            theta_t = theta_0 * t
            sin_theta_t = np.sin(theta_t)
            s0 = np.sin(theta_0 - theta_t) / sin_theta_0
            s1 = sin_theta_t / sin_theta_0
            v2 = s0 * v0 + s1 * v1
        return v2

    t = np.linspace(t0, t1, num)

    v3 = torch.tensor(np.array([interpolation(t[i], v0, v1) for i in range(num)]))

    return v3

# The text prompt that describes the desired output image.
prompt = (
    "Sci-fi digital painting of an alien landscape with otherworldly plants, strange creatures, and distant planets."
)
# A negative prompt that can be used to steer the generation away from certain features.
negative_prompt = "poorly drawn,cartoon, 3d, disfigured, bad art, deformed, poorly drawn, extra limbs, close up, b&w, weird colors, blurry"

# Generating initial latent vectors from a random normal distribution. In this example two latent vectors are generated, which will serve as start and end points for the interpolation.
# These vectors are shaped to fit the input requirements of the diffusion model's U-Net architecture.
latents = torch.randn(
    (2, pipe.unet.config.in_channels, height // 8, width // 8),
    generator=generator,
)

# Getting our latent embeddings
interpolated_latents = slerp(latents[0], latents[1], num_interpolation_steps)

# Generating images using the interpolated embeddings.
images = []
for latent_vector in tqdm(interpolated_latents):
    images.append(
        pipe(
            prompt,
            height=height,
            width=width,
            negative_prompt=negative_prompt,
            num_images_per_prompt=1,
            num_inference_steps=num_inference_steps,
            guidance_scale=guidance_scale,
            generator=generator,
            latents=latent_vector[None, ...],
        ).images
    )

# Display of saved generated images.
display_images(images, save_path)

Example 3: Interpolation between multiple prompts

Unlike the first example, where we walked away from a single prompt, in this example we will interpolate between any number of prompts. To do this, we take consecutive pairs of prompts and create smooth transitions between them. We then combine the interpolations of these consecutive pairs and ask the model to generate images based on them. For the interpolation we use the slerp function from the second example.

Example 3

Once again, let's tokenize the multiple positive and negative text prompts and obtain their embeddings.

# Text prompts that describes the desired output image.
prompts = [
    "A cute dog in a beautiful field of lavander colorful flowers everywhere, perfect lighting, leica summicron 35mm f2.0, kodak portra 400, film grain",
    "A cute cat in a beautiful field of lavander colorful flowers everywhere, perfect lighting, leica summicron 35mm f2.0, kodak portra 400, film grain",
]
# Negative prompts that can be used to steer the generation away from certain features.
negative_prompts = [
    "poorly drawn,cartoon, 2d, sketch, cartoon, drawing, anime, disfigured, bad art, deformed, poorly drawn, extra limbs, close up, b&w, weird colors, blurry",
    "poorly drawn,cartoon, 2d, sketch, cartoon, drawing, anime, disfigured, bad art, deformed, poorly drawn, extra limbs, close up, b&w, weird colors, blurry",
]

# NOTE: The number of prompts must match the number of negative prompts

batch_size = len(prompts)

# Tokenizing and encoding prompts into embeddings.
prompts_tokens = pipe.tokenizer(
    prompts,
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)
prompts_embeds = pipe.text_encoder(prompts_tokens.input_ids.to(device))[0]

# Tokenizing and encoding negative prompts into embeddings.
if negative_prompts is None:
    negative_prompts = [""] * batch_size

negative_prompts_tokens = pipe.tokenizer(
    negative_prompts,
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)
negative_prompts_embeds = pipe.text_encoder(negative_prompts_tokens.input_ids.to(device))[0]

As stated earlier, we use the slerp function to create smooth transitions between consecutive pairs of prompts.

# Generating initial U-Net latent vectors from a random normal distribution.
latents = torch.randn(
    (1, pipe.unet.config.in_channels, height // 8, width // 8),
    generator=generator,
)

# Interpolating between embeddings pairs for the given number of interpolation steps.
interpolated_prompt_embeds = []
interpolated_negative_prompts_embeds = []
for i in range(batch_size - 1):
    interpolated_prompt_embeds.append(slerp(prompts_embeds[i], prompts_embeds[i + 1], num_interpolation_steps))
    interpolated_negative_prompts_embeds.append(
        slerp(
            negative_prompts_embeds[i],
            negative_prompts_embeds[i + 1],
            num_interpolation_steps,
        )
    )

interpolated_prompt_embeds = torch.cat(interpolated_prompt_embeds, dim=0).to(device)

interpolated_negative_prompts_embeds = torch.cat(interpolated_negative_prompts_embeds, dim=0).to(device)

Finally, we need to generate images from the embeddings.

# Generating images using the interpolated embeddings.
images = []
for prompt_embeds, negative_prompt_embeds in tqdm(
    zip(interpolated_prompt_embeds, interpolated_negative_prompts_embeds),
    total=len(interpolated_prompt_embeds),
):
    images.append(
        pipe(
            height=height,
            width=width,
            num_images_per_prompt=1,
            prompt_embeds=prompt_embeds[None, ...],
            negative_prompt_embeds=negative_prompt_embeds[None, ...],
            num_inference_steps=num_inference_steps,
            guidance_scale=guidance_scale,
            generator=generator,
            latents=latents,
        ).images
    )

# Display of saved generated images.
display_images(images, save_path)

Example 4: Circular walk through the diffusion latent space for a single prompt

This example was taken from: https://keras.io/examples/generative/random_walks_with_stable_diffusion/

Suppose we have two noise components, which we'll call x and y. We move an angle from 0 to 2π, and at each step we add cos(angle) · x and sin(angle) · y to the result. With this approach, at the end of the walk we arrive back at the same noise values we started with, since cos(2π) = cos(0) and sin(2π) = sin(0). This means the latent vectors transform back into themselves, closing the loop.
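
As a quick sanity check, here is a tiny, self-contained sketch (using hypothetical stand-in tensors, not the pipeline's actual latents) showing that this walk really returns to its starting point after a full turn:

import math
import torch

x = torch.randn(4)  # stand-in for walk_noise_x
y = torch.randn(4)  # stand-in for walk_noise_y

def latent_at(theta):
    # latent(theta) = cos(theta) * x + sin(theta) * y
    return math.cos(theta) * x + math.sin(theta) * y

# The first and last points of the walk coincide (up to floating point error).
print(torch.allclose(latent_at(0.0), latent_at(2 * math.pi), atol=1e-6))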

Example 4

# The text prompt that describes the desired output image.
prompt = "Beautiful sea sunset, warm light, Aivazovsky style"
# A negative prompt that can be used to steer the generation away from certain features
negative_prompt = "picture frames"

# Generating initial latent vectors from a random normal distribution to create a loop interpolation between them.
latents = torch.randn(
    (2, 1, pipe.unet.config.in_channels, height // 8, width // 8),
    generator=generator,
)


# Calculation of looped embeddings
walk_noise_x = latents[0].to(device)
walk_noise_y = latents[1].to(device)

# Walking on a trigonometric circle
walk_scale_x = torch.cos(torch.linspace(0, 2, num_interpolation_steps) * np.pi).to(device)
walk_scale_y = torch.sin(torch.linspace(0, 2, num_interpolation_steps) * np.pi).to(device)

# Applying interpolation to noise
noise_x = torch.tensordot(walk_scale_x, walk_noise_x, dims=0)
noise_y = torch.tensordot(walk_scale_y, walk_noise_y, dims=0)

circular_latents = noise_x + noise_y

# Generating images using the interpolated embeddings.
images = []
for latent_vector in tqdm(circular_latents):
    images.append(
        pipe(
            prompt,
            height=height,
            width=width,
            negative_prompt=negative_prompt,
            num_images_per_prompt=1,
            num_inference_steps=num_inference_steps,
            guidance_scale=guidance_scale,
            generator=generator,
            latents=latent_vector,
        ).images
    )

# Display of saved generated images.
display_images(images, save_path)

Next steps

Next, you can explore various parameters, such as the guidance scale, the seed, and the number of interpolation steps, to see how they affect the generated images. You can also try different prompts and schedulers to further improve your results. Another valuable exercise is to implement linear interpolation (linspace) instead of spherical linear interpolation (slerp) and compare the results to gain deeper insight into the interpolation process; a minimal sketch of such a function is given below.
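
Here is one possible minimal sketch of a linear interpolation function, written to mirror the slerp signature used earlier so it can be swapped in directly. The name lerp and the exact implementation are illustrative rather than part of the original notebook.

def lerp(v0, v1, num, t0=0, t1=1):
    # Straight-line interpolation between two tensors, mirroring slerp's signature.
    v0 = v0.detach().cpu()
    v1 = v1.detach().cpu()
    t = torch.linspace(t0, t1, num)
    # Stack one interpolated tensor per step: (1 - t) * v0 + t * v1.
    return torch.stack([(1 - ti) * v0 + ti * v1 for ti in t])

# Example usage (replacing slerp in Example 2):
# interpolated_latents = lerp(latents[0], latents[1], num_interpolation_steps)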
