diff --git a/app.py b/app.py
index b422d7ba867831e7ccc4b82eddd9a871414b2022..96166a9915aabbfe228c4bd660b61f6f26ff284d 100644
--- a/app.py
+++ b/app.py
@@ -13,10 +13,6 @@
 import threading
 import time
 import os
-os.system("pip uninstall -y diffusers")
-os.system("pip install -e ./diffusers")
-
-
 import cv2
 import tempfile
 import imageio_ffmpeg
diff --git a/diffusers/.github/ISSUE_TEMPLATE/bug-report.yml b/diffusers/.github/ISSUE_TEMPLATE/bug-report.yml
deleted file mode 100644
index a0517725284e6062287e08dad3495d94b0bd0481..0000000000000000000000000000000000000000
--- a/diffusers/.github/ISSUE_TEMPLATE/bug-report.yml
+++ /dev/null
@@ -1,110 +0,0 @@
-name: "\U0001F41B Bug Report"
-description: Report a bug on Diffusers
-labels: [ "bug" ]
-body:
-  - type: markdown
-    attributes:
-      value: |
-        Thanks a lot for taking the time to file this issue 🤗.
-        Issues do not only help to improve the library, but also publicly document common problems, questions, workflows for the whole community!
-        Thus, issues are of the same importance as pull requests when contributing to this library ❤️.
-        In order to make your issue as **useful for the community as possible**, let's try to stick to some simple guidelines:
-        - 1. Please try to be as precise and concise as possible.
-          *Give your issue a fitting title. Assume that someone which very limited knowledge of Diffusers can understand your issue. Add links to the source code, documentation other issues, pull requests etc...*
-        - 2. If your issue is about something not working, **always** provide a reproducible code snippet. The reader should be able to reproduce your issue by **only copy-pasting your code snippet into a Python shell**.
-          *The community cannot solve your issue if it cannot reproduce it. If your bug is related to training, add your training script and make everything needed to train public. Otherwise, just add a simple Python code snippet.*
-        - 3. Add the **minimum** amount of code / context that is needed to understand, reproduce your issue.
-          *Make the life of maintainers easy. `diffusers` is getting many issues every day. Make sure your issue is about one bug and one bug only. Make sure you add only the context, code needed to understand your issues - nothing more. Generally, every issue is a way of documenting this library, try to make it a good documentation entry.*
-        - 4. For issues related to community pipelines (i.e., the pipelines located in the `examples/community` folder), please tag the author of the pipeline in your issue thread as those pipelines are not maintained.
-  - type: markdown
-    attributes:
-      value: |
-        For more in-detail information on how to write good issues you can have a look [here](https://huggingface.co/course/chapter8/5?fw=pt).
-  - type: textarea
-    id: bug-description
-    attributes:
-      label: Describe the bug
-      description: A clear and concise description of what the bug is. If you intend to submit a pull request for this issue, tell us in the description. Thanks!
-      placeholder: Bug description
-    validations:
-      required: true
-  - type: textarea
-    id: reproduction
-    attributes:
-      label: Reproduction
-      description: Please provide a minimal reproducible code which we can copy/paste and reproduce the issue.
-      placeholder: Reproduction
-    validations:
-      required: true
-  - type: textarea
-    id: logs
-    attributes:
-      label: Logs
-      description: "Please include the Python logs if you can."
- render: shell - - type: textarea - id: system-info - attributes: - label: System Info - description: Please share your system info with us. You can run the command `diffusers-cli env` and copy-paste its output below. - placeholder: Diffusers version, platform, Python version, ... - validations: - required: true - - type: textarea - id: who-can-help - attributes: - label: Who can help? - description: | - Your issue will be replied to more quickly if you can figure out the right person to tag with @. - If you know how to use git blame, that is the easiest way, otherwise, here is a rough guide of **who to tag**. - - All issues are read by one of the core maintainers, so if you don't know who to tag, just leave this blank and - a core maintainer will ping the right person. - - Please tag a maximum of 2 people. - - Questions on DiffusionPipeline (Saving, Loading, From pretrained, ...): @sayakpaul @DN6 - - Questions on pipelines: - - Stable Diffusion @yiyixuxu @asomoza - - Stable Diffusion XL @yiyixuxu @sayakpaul @DN6 - - Stable Diffusion 3: @yiyixuxu @sayakpaul @DN6 @asomoza - - Kandinsky @yiyixuxu - - ControlNet @sayakpaul @yiyixuxu @DN6 - - T2I Adapter @sayakpaul @yiyixuxu @DN6 - - IF @DN6 - - Text-to-Video / Video-to-Video @DN6 @a-r-r-o-w - - Wuerstchen @DN6 - - Other: @yiyixuxu @DN6 - - Improving generation quality: @asomoza - - Questions on models: - - UNet @DN6 @yiyixuxu @sayakpaul - - VAE @sayakpaul @DN6 @yiyixuxu - - Transformers/Attention @DN6 @yiyixuxu @sayakpaul - - Questions on single file checkpoints: @DN6 - - Questions on Schedulers: @yiyixuxu - - Questions on LoRA: @sayakpaul - - Questions on Textual Inversion: @sayakpaul - - Questions on Training: - - DreamBooth @sayakpaul - - Text-to-Image Fine-tuning @sayakpaul - - Textual Inversion @sayakpaul - - ControlNet @sayakpaul - - Questions on Tests: @DN6 @sayakpaul @yiyixuxu - - Questions on Documentation: @stevhliu - - Questions on JAX- and MPS-related things: @pcuenca - - Questions on audio pipelines: @sanchit-gandhi - - - - placeholder: "@Username ..." diff --git a/diffusers/.github/ISSUE_TEMPLATE/config.yml b/diffusers/.github/ISSUE_TEMPLATE/config.yml deleted file mode 100644 index e81992fe3c69b65f58f627252ffa6569d1cd67e2..0000000000000000000000000000000000000000 --- a/diffusers/.github/ISSUE_TEMPLATE/config.yml +++ /dev/null @@ -1,4 +0,0 @@ -contact_links: - - name: Questions / Discussions - url: https://github.com/huggingface/diffusers/discussions - about: General usage questions and community discussions diff --git a/diffusers/.github/ISSUE_TEMPLATE/feature_request.md b/diffusers/.github/ISSUE_TEMPLATE/feature_request.md deleted file mode 100644 index 42f93232c1de7c73dcd90cdb6b0733bbb4461508..0000000000000000000000000000000000000000 --- a/diffusers/.github/ISSUE_TEMPLATE/feature_request.md +++ /dev/null @@ -1,20 +0,0 @@ ---- -name: "\U0001F680 Feature Request" -about: Suggest an idea for this project -title: '' -labels: '' -assignees: '' - ---- - -**Is your feature request related to a problem? Please describe.** -A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]. - -**Describe the solution you'd like.** -A clear and concise description of what you want to happen. - -**Describe alternatives you've considered.** -A clear and concise description of any alternative solutions or features you've considered. - -**Additional context.** -Add any other context or screenshots about the feature request here. 
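Note on the `app.py` hunk at the top of this diff: it drops the startup-time `os.system` calls that uninstalled the published `diffusers` package and reinstalled the vendored copy. Purely as a hedged sketch, and not something this patch does, if a runtime editable install were ever needed again, shelling out to pip through the running interpreter is generally more robust than `os.system`, since it targets the interpreter's own environment and raises on failure. The `./diffusers` path and the `install_local_diffusers` helper name below are assumptions for illustration only.

```python
# Hedged sketch (not part of this patch): a safer stand-in for the removed
# os.system pip calls in app.py. Assumes a vendored ./diffusers checkout
# exists next to app.py; the helper name is hypothetical.
import subprocess
import sys


def install_local_diffusers(path: str = "./diffusers") -> None:
    # Run pip with the interpreter executing this script so the install lands
    # in the same environment; check_call raises CalledProcessError on failure.
    subprocess.check_call([sys.executable, "-m", "pip", "uninstall", "-y", "diffusers"])
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-e", path])
```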
diff --git a/diffusers/.github/ISSUE_TEMPLATE/feedback.md b/diffusers/.github/ISSUE_TEMPLATE/feedback.md deleted file mode 100644 index 25808b6575a405694f64dbf1b5a0ece8e0fcd2e2..0000000000000000000000000000000000000000 --- a/diffusers/.github/ISSUE_TEMPLATE/feedback.md +++ /dev/null @@ -1,12 +0,0 @@ ---- -name: "💬 Feedback about API Design" -about: Give feedback about the current API design -title: '' -labels: '' -assignees: '' - ---- - -**What API design would you like to have changed or added to the library? Why?** - -**What use case would this enable or better enable? Can you give us a code example?** diff --git a/diffusers/.github/ISSUE_TEMPLATE/new-model-addition.yml b/diffusers/.github/ISSUE_TEMPLATE/new-model-addition.yml deleted file mode 100644 index 432e287dd3348965466a696ee5e01a187f179ee5..0000000000000000000000000000000000000000 --- a/diffusers/.github/ISSUE_TEMPLATE/new-model-addition.yml +++ /dev/null @@ -1,31 +0,0 @@ -name: "\U0001F31F New Model/Pipeline/Scheduler Addition" -description: Submit a proposal/request to implement a new diffusion model/pipeline/scheduler -labels: [ "New model/pipeline/scheduler" ] - -body: - - type: textarea - id: description-request - validations: - required: true - attributes: - label: Model/Pipeline/Scheduler description - description: | - Put any and all important information relative to the model/pipeline/scheduler - - - type: checkboxes - id: information-tasks - attributes: - label: Open source status - description: | - Please note that if the model implementation isn't available or if the weights aren't open-source, we are less likely to implement it in `diffusers`. - options: - - label: "The model implementation is available." - - label: "The model weights are available (Only relevant if addition is not a scheduler)." - - - type: textarea - id: additional-info - attributes: - label: Provide useful links for the implementation - description: | - Please provide information regarding the implementation, the weights, and the authors. - Please mention the authors by @gh-username if you're aware of their usernames. diff --git a/diffusers/.github/ISSUE_TEMPLATE/translate.md b/diffusers/.github/ISSUE_TEMPLATE/translate.md deleted file mode 100644 index 3471ec9640d727e7cdf223852d2012834660e88a..0000000000000000000000000000000000000000 --- a/diffusers/.github/ISSUE_TEMPLATE/translate.md +++ /dev/null @@ -1,29 +0,0 @@ ---- -name: 🌐 Translating a New Language? -about: Start a new translation effort in your language -title: '[] Translating docs to ' -labels: WIP -assignees: '' - ---- - - - -Hi! - -Let's bring the documentation to all the -speaking community 🌐. - -Who would want to translate? Please follow the 🤗 [TRANSLATING guide](https://github.com/huggingface/diffusers/blob/main/docs/TRANSLATING.md). Here is a list of the files ready for translation. Let us know in this issue if you'd like to translate any, and we'll add your name to the list. - -Some notes: - -* Please translate using an informal tone (imagine you are talking with a friend about Diffusers 🤗). -* Please translate in a gender-neutral way. -* Add your translations to the folder called `` inside the [source folder](https://github.com/huggingface/diffusers/tree/main/docs/source). -* Register your translation in `/_toctree.yml`; please follow the order of the [English version](https://github.com/huggingface/diffusers/blob/main/docs/source/en/_toctree.yml). 
-* Once you're finished, open a pull request and tag this issue by including #issue-number in the description, where issue-number is the number of this issue. Please ping @stevhliu for review. -* 🙋 If you'd like others to help you with the translation, you can also post in the 🤗 [forums](https://discuss.huggingface.co/c/discussion-related-to-httpsgithubcomhuggingfacediffusers/63). - -Thank you so much for your help! 🤗 diff --git a/diffusers/.github/PULL_REQUEST_TEMPLATE.md b/diffusers/.github/PULL_REQUEST_TEMPLATE.md deleted file mode 100644 index e4b2b45a4ecde200e65e3a2e16ac0cdb0b87d7c3..0000000000000000000000000000000000000000 --- a/diffusers/.github/PULL_REQUEST_TEMPLATE.md +++ /dev/null @@ -1,61 +0,0 @@ -# What does this PR do? - - - - - -Fixes # (issue) - - -## Before submitting -- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). -- [ ] Did you read the [contributor guideline](https://github.com/huggingface/diffusers/blob/main/CONTRIBUTING.md)? -- [ ] Did you read our [philosophy doc](https://github.com/huggingface/diffusers/blob/main/PHILOSOPHY.md) (important for complex PRs)? -- [ ] Was this discussed/approved via a GitHub issue or the [forum](https://discuss.huggingface.co/c/discussion-related-to-httpsgithubcomhuggingfacediffusers/63)? Please add a link to it if that's the case. -- [ ] Did you make sure to update the documentation with your changes? Here are the - [documentation guidelines](https://github.com/huggingface/diffusers/tree/main/docs), and - [here are tips on formatting docstrings](https://github.com/huggingface/diffusers/tree/main/docs#writing-source-documentation). -- [ ] Did you write any new necessary tests? - - -## Who can review? - -Anyone in the community is free to review the PR once the tests have passed. Feel free to tag -members/contributors who may be interested in your PR. - - diff --git a/diffusers/.github/actions/setup-miniconda/action.yml b/diffusers/.github/actions/setup-miniconda/action.yml deleted file mode 100644 index b1f4f194bfe1fd14e03239269e466e7978e3d5c5..0000000000000000000000000000000000000000 --- a/diffusers/.github/actions/setup-miniconda/action.yml +++ /dev/null @@ -1,146 +0,0 @@ -name: Set up conda environment for testing - -description: Sets up miniconda in your ${RUNNER_TEMP} environment and gives you the ${CONDA_RUN} environment variable so you don't have to worry about polluting non-empeheral runners anymore - -inputs: - python-version: - description: If set to any value, don't use sudo to clean the workspace - required: false - type: string - default: "3.9" - miniconda-version: - description: Miniconda version to install - required: false - type: string - default: "4.12.0" - environment-file: - description: Environment file to install dependencies from - required: false - type: string - default: "" - -runs: - using: composite - steps: - # Use the same trick from https://github.com/marketplace/actions/setup-miniconda - # to refresh the cache daily. 
This is kind of optional though - - name: Get date - id: get-date - shell: bash - run: echo "today=$(/bin/date -u '+%Y%m%d')d" >> $GITHUB_OUTPUT - - name: Setup miniconda cache - id: miniconda-cache - uses: actions/cache@v2 - with: - path: ${{ runner.temp }}/miniconda - key: miniconda-${{ runner.os }}-${{ runner.arch }}-${{ inputs.python-version }}-${{ steps.get-date.outputs.today }} - - name: Install miniconda (${{ inputs.miniconda-version }}) - if: steps.miniconda-cache.outputs.cache-hit != 'true' - env: - MINICONDA_VERSION: ${{ inputs.miniconda-version }} - shell: bash -l {0} - run: | - MINICONDA_INSTALL_PATH="${RUNNER_TEMP}/miniconda" - mkdir -p "${MINICONDA_INSTALL_PATH}" - case ${RUNNER_OS}-${RUNNER_ARCH} in - Linux-X64) - MINICONDA_ARCH="Linux-x86_64" - ;; - macOS-ARM64) - MINICONDA_ARCH="MacOSX-arm64" - ;; - macOS-X64) - MINICONDA_ARCH="MacOSX-x86_64" - ;; - *) - echo "::error::Platform ${RUNNER_OS}-${RUNNER_ARCH} currently unsupported using this action" - exit 1 - ;; - esac - MINICONDA_URL="https://repo.anaconda.com/miniconda/Miniconda3-py39_${MINICONDA_VERSION}-${MINICONDA_ARCH}.sh" - curl -fsSL "${MINICONDA_URL}" -o "${MINICONDA_INSTALL_PATH}/miniconda.sh" - bash "${MINICONDA_INSTALL_PATH}/miniconda.sh" -b -u -p "${MINICONDA_INSTALL_PATH}" - rm -rf "${MINICONDA_INSTALL_PATH}/miniconda.sh" - - name: Update GitHub path to include miniconda install - shell: bash - run: | - MINICONDA_INSTALL_PATH="${RUNNER_TEMP}/miniconda" - echo "${MINICONDA_INSTALL_PATH}/bin" >> $GITHUB_PATH - - name: Setup miniconda env cache (with env file) - id: miniconda-env-cache-env-file - if: ${{ runner.os }} == 'macOS' && ${{ inputs.environment-file }} != '' - uses: actions/cache@v2 - with: - path: ${{ runner.temp }}/conda-python-${{ inputs.python-version }} - key: miniconda-env-${{ runner.os }}-${{ runner.arch }}-${{ inputs.python-version }}-${{ steps.get-date.outputs.today }}-${{ hashFiles(inputs.environment-file) }} - - name: Setup miniconda env cache (without env file) - id: miniconda-env-cache - if: ${{ runner.os }} == 'macOS' && ${{ inputs.environment-file }} == '' - uses: actions/cache@v2 - with: - path: ${{ runner.temp }}/conda-python-${{ inputs.python-version }} - key: miniconda-env-${{ runner.os }}-${{ runner.arch }}-${{ inputs.python-version }}-${{ steps.get-date.outputs.today }} - - name: Setup conda environment with python (v${{ inputs.python-version }}) - if: steps.miniconda-env-cache-env-file.outputs.cache-hit != 'true' && steps.miniconda-env-cache.outputs.cache-hit != 'true' - shell: bash - env: - PYTHON_VERSION: ${{ inputs.python-version }} - ENV_FILE: ${{ inputs.environment-file }} - run: | - CONDA_BASE_ENV="${RUNNER_TEMP}/conda-python-${PYTHON_VERSION}" - ENV_FILE_FLAG="" - if [[ -f "${ENV_FILE}" ]]; then - ENV_FILE_FLAG="--file ${ENV_FILE}" - elif [[ -n "${ENV_FILE}" ]]; then - echo "::warning::Specified env file (${ENV_FILE}) not found, not going to include it" - fi - conda create \ - --yes \ - --prefix "${CONDA_BASE_ENV}" \ - "python=${PYTHON_VERSION}" \ - ${ENV_FILE_FLAG} \ - cmake=3.22 \ - conda-build=3.21 \ - ninja=1.10 \ - pkg-config=0.29 \ - wheel=0.37 - - name: Clone the base conda environment and update GitHub env - shell: bash - env: - PYTHON_VERSION: ${{ inputs.python-version }} - CONDA_BASE_ENV: ${{ runner.temp }}/conda-python-${{ inputs.python-version }} - run: | - CONDA_ENV="${RUNNER_TEMP}/conda_environment_${GITHUB_RUN_ID}" - conda create \ - --yes \ - --prefix "${CONDA_ENV}" \ - --clone "${CONDA_BASE_ENV}" - # TODO: conda-build could not be cloned because it hardcodes 
the path, so it - # could not be cached - conda install --yes -p ${CONDA_ENV} conda-build=3.21 - echo "CONDA_ENV=${CONDA_ENV}" >> "${GITHUB_ENV}" - echo "CONDA_RUN=conda run -p ${CONDA_ENV} --no-capture-output" >> "${GITHUB_ENV}" - echo "CONDA_BUILD=conda run -p ${CONDA_ENV} conda-build" >> "${GITHUB_ENV}" - echo "CONDA_INSTALL=conda install -p ${CONDA_ENV}" >> "${GITHUB_ENV}" - - name: Get disk space usage and throw an error for low disk space - shell: bash - run: | - echo "Print the available disk space for manual inspection" - df -h - # Set the minimum requirement space to 4GB - MINIMUM_AVAILABLE_SPACE_IN_GB=4 - MINIMUM_AVAILABLE_SPACE_IN_KB=$(($MINIMUM_AVAILABLE_SPACE_IN_GB * 1024 * 1024)) - # Use KB to avoid floating point warning like 3.1GB - df -k | tr -s ' ' | cut -d' ' -f 4,9 | while read -r LINE; - do - AVAIL=$(echo $LINE | cut -f1 -d' ') - MOUNT=$(echo $LINE | cut -f2 -d' ') - if [ "$MOUNT" = "/" ]; then - if [ "$AVAIL" -lt "$MINIMUM_AVAILABLE_SPACE_IN_KB" ]; then - echo "There is only ${AVAIL}KB free space left in $MOUNT, which is less than the minimum requirement of ${MINIMUM_AVAILABLE_SPACE_IN_KB}KB. Please help create an issue to PyTorch Release Engineering via https://github.com/pytorch/test-infra/issues and provide the link to the workflow run." - exit 1; - else - echo "There is ${AVAIL}KB free space left in $MOUNT, continue" - fi - fi - done diff --git a/diffusers/.github/workflows/benchmark.yml b/diffusers/.github/workflows/benchmark.yml deleted file mode 100644 index d311c1c73f1105bfd7d942ba4a80c6d431ebf30a..0000000000000000000000000000000000000000 --- a/diffusers/.github/workflows/benchmark.yml +++ /dev/null @@ -1,67 +0,0 @@ -name: Benchmarking tests - -on: - workflow_dispatch: - schedule: - - cron: "30 1 1,15 * *" # every 2 weeks on the 1st and the 15th of every month at 1:30 AM - -env: - DIFFUSERS_IS_CI: yes - HF_HUB_ENABLE_HF_TRANSFER: 1 - HF_HOME: /mnt/cache - OMP_NUM_THREADS: 8 - MKL_NUM_THREADS: 8 - -jobs: - torch_pipelines_cuda_benchmark_tests: - env: - SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL_BENCHMARK }} - name: Torch Core Pipelines CUDA Benchmarking Tests - strategy: - fail-fast: false - max-parallel: 1 - runs-on: - group: aws-g6-4xlarge-plus - container: - image: diffusers/diffusers-pytorch-compile-cuda - options: --shm-size "16gb" --ipc host --gpus 0 - steps: - - name: Checkout diffusers - uses: actions/checkout@v3 - with: - fetch-depth: 2 - - name: NVIDIA-SMI - run: | - nvidia-smi - - name: Install dependencies - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m uv pip install -e [quality,test] - python -m uv pip install pandas peft - - name: Environment - run: | - python utils/print_env.py - - name: Diffusers Benchmarking - env: - HF_TOKEN: ${{ secrets.DIFFUSERS_BOT_TOKEN }} - BASE_PATH: benchmark_outputs - run: | - export TOTAL_GPU_MEMORY=$(python -c "import torch; print(torch.cuda.get_device_properties(0).total_memory / (1024**3))") - cd benchmarks && mkdir ${BASE_PATH} && python run_all.py && python push_results.py - - - name: Test suite reports artifacts - if: ${{ always() }} - uses: actions/upload-artifact@v4 - with: - name: benchmark_test_reports - path: benchmarks/benchmark_outputs - - - name: Report success status - if: ${{ success() }} - run: | - pip install requests && python utils/notify_benchmarking_status.py --status=success - - - name: Report failure status - if: ${{ failure() }} - run: | - pip install requests && python utils/notify_benchmarking_status.py --status=failure \ No newline at end of file 
diff --git a/diffusers/.github/workflows/build_docker_images.yml b/diffusers/.github/workflows/build_docker_images.yml deleted file mode 100644 index 9f4776db4315d9418a8a5bd07877f0a08f9ec2dd..0000000000000000000000000000000000000000 --- a/diffusers/.github/workflows/build_docker_images.yml +++ /dev/null @@ -1,103 +0,0 @@ -name: Test, build, and push Docker images - -on: - pull_request: # During PRs, we just check if the changes Dockerfiles can be successfully built - branches: - - main - paths: - - "docker/**" - workflow_dispatch: - schedule: - - cron: "0 0 * * *" # every day at midnight - -concurrency: - group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }} - cancel-in-progress: true - -env: - REGISTRY: diffusers - CI_SLACK_CHANNEL: ${{ secrets.CI_DOCKER_CHANNEL }} - -jobs: - test-build-docker-images: - runs-on: - group: aws-general-8-plus - if: github.event_name == 'pull_request' - steps: - - name: Set up Docker Buildx - uses: docker/setup-buildx-action@v1 - - - name: Check out code - uses: actions/checkout@v3 - - - name: Find Changed Dockerfiles - id: file_changes - uses: jitterbit/get-changed-files@v1 - with: - format: 'space-delimited' - token: ${{ secrets.GITHUB_TOKEN }} - - - name: Build Changed Docker Images - run: | - CHANGED_FILES="${{ steps.file_changes.outputs.all }}" - for FILE in $CHANGED_FILES; do - if [[ "$FILE" == docker/*Dockerfile ]]; then - DOCKER_PATH="${FILE%/Dockerfile}" - DOCKER_TAG=$(basename "$DOCKER_PATH") - echo "Building Docker image for $DOCKER_TAG" - docker build -t "$DOCKER_TAG" "$DOCKER_PATH" - fi - done - if: steps.file_changes.outputs.all != '' - - build-and-push-docker-images: - runs-on: - group: aws-general-8-plus - if: github.event_name != 'pull_request' - - permissions: - contents: read - packages: write - - strategy: - fail-fast: false - matrix: - image-name: - - diffusers-pytorch-cpu - - diffusers-pytorch-cuda - - diffusers-pytorch-compile-cuda - - diffusers-pytorch-xformers-cuda - - diffusers-flax-cpu - - diffusers-flax-tpu - - diffusers-onnxruntime-cpu - - diffusers-onnxruntime-cuda - - diffusers-doc-builder - - steps: - - name: Checkout repository - uses: actions/checkout@v3 - - name: Set up Docker Buildx - uses: docker/setup-buildx-action@v1 - - name: Login to Docker Hub - uses: docker/login-action@v2 - with: - username: ${{ env.REGISTRY }} - password: ${{ secrets.DOCKERHUB_TOKEN }} - - name: Build and push - uses: docker/build-push-action@v3 - with: - no-cache: true - context: ./docker/${{ matrix.image-name }} - push: true - tags: ${{ env.REGISTRY }}/${{ matrix.image-name }}:latest - - - name: Post to a Slack channel - id: slack - uses: huggingface/hf-workflows/.github/actions/post-slack@main - with: - # Slack channel id, channel name, or user id to post message. 
- # See also: https://api.slack.com/methods/chat.postMessage#channels - slack_channel: ${{ env.CI_SLACK_CHANNEL }} - title: "🤗 Results of the ${{ matrix.image-name }} Docker Image build" - status: ${{ job.status }} - slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }} diff --git a/diffusers/.github/workflows/build_documentation.yml b/diffusers/.github/workflows/build_documentation.yml deleted file mode 100644 index 6d4193e3cccc42e6824e9f0881a6b3e50bfa7173..0000000000000000000000000000000000000000 --- a/diffusers/.github/workflows/build_documentation.yml +++ /dev/null @@ -1,27 +0,0 @@ -name: Build documentation - -on: - push: - branches: - - main - - doc-builder* - - v*-release - - v*-patch - paths: - - "src/diffusers/**.py" - - "examples/**" - - "docs/**" - -jobs: - build: - uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@main - with: - commit_sha: ${{ github.sha }} - install_libgl1: true - package: diffusers - notebook_folder: diffusers_doc - languages: en ko zh ja pt - custom_container: diffusers/diffusers-doc-builder - secrets: - token: ${{ secrets.HUGGINGFACE_PUSH }} - hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }} diff --git a/diffusers/.github/workflows/build_pr_documentation.yml b/diffusers/.github/workflows/build_pr_documentation.yml deleted file mode 100644 index 52e0757331639c7335132766cfc2afb2d74e0368..0000000000000000000000000000000000000000 --- a/diffusers/.github/workflows/build_pr_documentation.yml +++ /dev/null @@ -1,23 +0,0 @@ -name: Build PR Documentation - -on: - pull_request: - paths: - - "src/diffusers/**.py" - - "examples/**" - - "docs/**" - -concurrency: - group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }} - cancel-in-progress: true - -jobs: - build: - uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@main - with: - commit_sha: ${{ github.event.pull_request.head.sha }} - pr_number: ${{ github.event.number }} - install_libgl1: true - package: diffusers - languages: en ko zh ja pt - custom_container: diffusers/diffusers-doc-builder diff --git a/diffusers/.github/workflows/mirror_community_pipeline.yml b/diffusers/.github/workflows/mirror_community_pipeline.yml deleted file mode 100644 index f6eff1bbd8f05fe937c72453d04b430b6882628b..0000000000000000000000000000000000000000 --- a/diffusers/.github/workflows/mirror_community_pipeline.yml +++ /dev/null @@ -1,102 +0,0 @@ -name: Mirror Community Pipeline - -on: - # Push changes on the main branch - push: - branches: - - main - paths: - - 'examples/community/**.py' - - # And on tag creation (e.g. `v0.28.1`) - tags: - - '*' - - # Manual trigger with ref input - workflow_dispatch: - inputs: - ref: - description: "Either 'main' or a tag ref" - required: true - default: 'main' - -jobs: - mirror_community_pipeline: - env: - SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL_COMMUNITY_MIRROR }} - - runs-on: ubuntu-22.04 - steps: - # Checkout to correct ref - # If workflow dispatch - # If ref is 'main', set: - # CHECKOUT_REF=refs/heads/main - # PATH_IN_REPO=main - # Else it must be a tag. 
Set: - # CHECKOUT_REF=refs/tags/{tag} - # PATH_IN_REPO={tag} - # If not workflow dispatch - # If ref is 'refs/heads/main' => set 'main' - # Else it must be a tag => set {tag} - - name: Set checkout_ref and path_in_repo - run: | - if [ "${{ github.event_name }}" == "workflow_dispatch" ]; then - if [ -z "${{ github.event.inputs.ref }}" ]; then - echo "Error: Missing ref input" - exit 1 - elif [ "${{ github.event.inputs.ref }}" == "main" ]; then - echo "CHECKOUT_REF=refs/heads/main" >> $GITHUB_ENV - echo "PATH_IN_REPO=main" >> $GITHUB_ENV - else - echo "CHECKOUT_REF=refs/tags/${{ github.event.inputs.ref }}" >> $GITHUB_ENV - echo "PATH_IN_REPO=${{ github.event.inputs.ref }}" >> $GITHUB_ENV - fi - elif [ "${{ github.ref }}" == "refs/heads/main" ]; then - echo "CHECKOUT_REF=${{ github.ref }}" >> $GITHUB_ENV - echo "PATH_IN_REPO=main" >> $GITHUB_ENV - else - # e.g. refs/tags/v0.28.1 -> v0.28.1 - echo "CHECKOUT_REF=${{ github.ref }}" >> $GITHUB_ENV - echo "PATH_IN_REPO=$(echo ${{ github.ref }} | sed 's/^refs\/tags\///')" >> $GITHUB_ENV - fi - - name: Print env vars - run: | - echo "CHECKOUT_REF: ${{ env.CHECKOUT_REF }}" - echo "PATH_IN_REPO: ${{ env.PATH_IN_REPO }}" - - uses: actions/checkout@v3 - with: - ref: ${{ env.CHECKOUT_REF }} - - # Setup + install dependencies - - name: Set up Python - uses: actions/setup-python@v4 - with: - python-version: "3.10" - - name: Install dependencies - run: | - python -m pip install --upgrade pip - pip install --upgrade huggingface_hub - - # Check secret is set - - name: whoami - run: huggingface-cli whoami - env: - HF_TOKEN: ${{ secrets.HF_TOKEN_MIRROR_COMMUNITY_PIPELINES }} - - # Push to HF! (under subfolder based on checkout ref) - # https://huggingface.co/datasets/diffusers/community-pipelines-mirror - - name: Mirror community pipeline to HF - run: huggingface-cli upload diffusers/community-pipelines-mirror ./examples/community ${PATH_IN_REPO} --repo-type dataset - env: - PATH_IN_REPO: ${{ env.PATH_IN_REPO }} - HF_TOKEN: ${{ secrets.HF_TOKEN_MIRROR_COMMUNITY_PIPELINES }} - - - name: Report success status - if: ${{ success() }} - run: | - pip install requests && python utils/notify_community_pipelines_mirror.py --status=success - - - name: Report failure status - if: ${{ failure() }} - run: | - pip install requests && python utils/notify_community_pipelines_mirror.py --status=failure \ No newline at end of file diff --git a/diffusers/.github/workflows/nightly_tests.yml b/diffusers/.github/workflows/nightly_tests.yml deleted file mode 100644 index b8e9860aec6366778bddc91cff33d2c46737b24a..0000000000000000000000000000000000000000 --- a/diffusers/.github/workflows/nightly_tests.yml +++ /dev/null @@ -1,464 +0,0 @@ -name: Nightly and release tests on main/release branch - -on: - workflow_dispatch: - schedule: - - cron: "0 0 * * *" # every day at midnight - -env: - DIFFUSERS_IS_CI: yes - HF_HUB_ENABLE_HF_TRANSFER: 1 - OMP_NUM_THREADS: 8 - MKL_NUM_THREADS: 8 - PYTEST_TIMEOUT: 600 - RUN_SLOW: yes - RUN_NIGHTLY: yes - PIPELINE_USAGE_CUTOFF: 5000 - SLACK_API_TOKEN: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }} - -jobs: - setup_torch_cuda_pipeline_matrix: - name: Setup Torch Pipelines CUDA Slow Tests Matrix - runs-on: - group: aws-general-8-plus - container: - image: diffusers/diffusers-pytorch-cpu - outputs: - pipeline_test_matrix: ${{ steps.fetch_pipeline_matrix.outputs.pipeline_test_matrix }} - steps: - - name: Checkout diffusers - uses: actions/checkout@v3 - with: - fetch-depth: 2 - - name: Install dependencies - run: | - pip install -e .[test] - pip install 
huggingface_hub - - name: Fetch Pipeline Matrix - id: fetch_pipeline_matrix - run: | - matrix=$(python utils/fetch_torch_cuda_pipeline_test_matrix.py) - echo $matrix - echo "pipeline_test_matrix=$matrix" >> $GITHUB_OUTPUT - - - name: Pipeline Tests Artifacts - if: ${{ always() }} - uses: actions/upload-artifact@v4 - with: - name: test-pipelines.json - path: reports - - run_nightly_tests_for_torch_pipelines: - name: Nightly Torch Pipelines CUDA Tests - needs: setup_torch_cuda_pipeline_matrix - strategy: - fail-fast: false - max-parallel: 8 - matrix: - module: ${{ fromJson(needs.setup_torch_cuda_pipeline_matrix.outputs.pipeline_test_matrix) }} - runs-on: - group: aws-g4dn-2xlarge - container: - image: diffusers/diffusers-pytorch-cuda - options: --shm-size "16gb" --ipc host --gpus 0 - steps: - - name: Checkout diffusers - uses: actions/checkout@v3 - with: - fetch-depth: 2 - - name: NVIDIA-SMI - run: nvidia-smi - - name: Install dependencies - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m uv pip install -e [quality,test] - pip uninstall accelerate -y && python -m uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git - python -m uv pip install pytest-reportlog - - name: Environment - run: | - python utils/print_env.py - - name: Pipeline CUDA Test - env: - HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} - # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms - CUBLAS_WORKSPACE_CONFIG: :16:8 - run: | - python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \ - -s -v -k "not Flax and not Onnx" \ - --make-reports=tests_pipeline_${{ matrix.module }}_cuda \ - --report-log=tests_pipeline_${{ matrix.module }}_cuda.log \ - tests/pipelines/${{ matrix.module }} - - name: Failure short reports - if: ${{ failure() }} - run: | - cat reports/tests_pipeline_${{ matrix.module }}_cuda_stats.txt - cat reports/tests_pipeline_${{ matrix.module }}_cuda_failures_short.txt - - name: Test suite reports artifacts - if: ${{ always() }} - uses: actions/upload-artifact@v4 - with: - name: pipeline_${{ matrix.module }}_test_reports - path: reports - - name: Generate Report and Notify Channel - if: always() - run: | - pip install slack_sdk tabulate - python utils/log_reports.py >> $GITHUB_STEP_SUMMARY - - run_nightly_tests_for_other_torch_modules: - name: Nightly Torch CUDA Tests - runs-on: - group: aws-g4dn-2xlarge - container: - image: diffusers/diffusers-pytorch-cuda - options: --shm-size "16gb" --ipc host --gpus 0 - defaults: - run: - shell: bash - strategy: - fail-fast: false - max-parallel: 2 - matrix: - module: [models, schedulers, lora, others, single_file, examples] - steps: - - name: Checkout diffusers - uses: actions/checkout@v3 - with: - fetch-depth: 2 - - - name: Install dependencies - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m uv pip install -e [quality,test] - python -m uv pip install peft@git+https://github.com/huggingface/peft.git - pip uninstall accelerate -y && python -m uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git - python -m uv pip install pytest-reportlog - - name: Environment - run: python utils/print_env.py - - - name: Run nightly PyTorch CUDA tests for non-pipeline modules - if: ${{ matrix.module != 'examples'}} - env: - HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} - # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms - CUBLAS_WORKSPACE_CONFIG: :16:8 - run: | - python -m 
pytest -n 1 --max-worker-restart=0 --dist=loadfile \ - -s -v -k "not Flax and not Onnx" \ - --make-reports=tests_torch_${{ matrix.module }}_cuda \ - --report-log=tests_torch_${{ matrix.module }}_cuda.log \ - tests/${{ matrix.module }} - - - name: Run nightly example tests with Torch - if: ${{ matrix.module == 'examples' }} - env: - HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} - # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms - CUBLAS_WORKSPACE_CONFIG: :16:8 - run: | - python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \ - -s -v --make-reports=examples_torch_cuda \ - --report-log=examples_torch_cuda.log \ - examples/ - - - name: Failure short reports - if: ${{ failure() }} - run: | - cat reports/tests_torch_${{ matrix.module }}_cuda_stats.txt - cat reports/tests_torch_${{ matrix.module }}_cuda_failures_short.txt - - - name: Test suite reports artifacts - if: ${{ always() }} - uses: actions/upload-artifact@v4 - with: - name: torch_${{ matrix.module }}_cuda_test_reports - path: reports - - - name: Generate Report and Notify Channel - if: always() - run: | - pip install slack_sdk tabulate - python utils/log_reports.py >> $GITHUB_STEP_SUMMARY - - run_big_gpu_torch_tests: - name: Torch tests on big GPU - strategy: - fail-fast: false - max-parallel: 2 - runs-on: - group: aws-g6e-xlarge-plus - container: - image: diffusers/diffusers-pytorch-cuda - options: --shm-size "16gb" --ipc host --gpus 0 - steps: - - name: Checkout diffusers - uses: actions/checkout@v3 - with: - fetch-depth: 2 - - name: NVIDIA-SMI - run: nvidia-smi - - name: Install dependencies - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m uv pip install -e [quality,test] - python -m uv pip install peft@git+https://github.com/huggingface/peft.git - pip uninstall accelerate -y && python -m uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git - python -m uv pip install pytest-reportlog - - name: Environment - run: | - python utils/print_env.py - - name: Selected Torch CUDA Test on big GPU - env: - HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} - # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms - CUBLAS_WORKSPACE_CONFIG: :16:8 - BIG_GPU_MEMORY: 40 - run: | - python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \ - -m "big_gpu_with_torch_cuda" \ - --make-reports=tests_big_gpu_torch_cuda \ - --report-log=tests_big_gpu_torch_cuda.log \ - tests/ - - name: Failure short reports - if: ${{ failure() }} - run: | - cat reports/tests_big_gpu_torch_cuda_stats.txt - cat reports/tests_big_gpu_torch_cuda_failures_short.txt - - name: Test suite reports artifacts - if: ${{ always() }} - uses: actions/upload-artifact@v4 - with: - name: torch_cuda_big_gpu_test_reports - path: reports - - name: Generate Report and Notify Channel - if: always() - run: | - pip install slack_sdk tabulate - python utils/log_reports.py >> $GITHUB_STEP_SUMMARY - - run_flax_tpu_tests: - name: Nightly Flax TPU Tests - runs-on: docker-tpu - if: github.event_name == 'schedule' - - container: - image: diffusers/diffusers-flax-tpu - options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ --privileged - defaults: - run: - shell: bash - steps: - - name: Checkout diffusers - uses: actions/checkout@v3 - with: - fetch-depth: 2 - - - name: Install dependencies - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m uv pip install -e [quality,test] - pip uninstall accelerate 
-y && python -m uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git - python -m uv pip install pytest-reportlog - - - name: Environment - run: python utils/print_env.py - - - name: Run nightly Flax TPU tests - env: - HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} - run: | - python -m pytest -n 0 \ - -s -v -k "Flax" \ - --make-reports=tests_flax_tpu \ - --report-log=tests_flax_tpu.log \ - tests/ - - - name: Failure short reports - if: ${{ failure() }} - run: | - cat reports/tests_flax_tpu_stats.txt - cat reports/tests_flax_tpu_failures_short.txt - - - name: Test suite reports artifacts - if: ${{ always() }} - uses: actions/upload-artifact@v4 - with: - name: flax_tpu_test_reports - path: reports - - - name: Generate Report and Notify Channel - if: always() - run: | - pip install slack_sdk tabulate - python utils/log_reports.py >> $GITHUB_STEP_SUMMARY - - run_nightly_onnx_tests: - name: Nightly ONNXRuntime CUDA tests on Ubuntu - runs-on: - group: aws-g4dn-2xlarge - container: - image: diffusers/diffusers-onnxruntime-cuda - options: --gpus 0 --shm-size "16gb" --ipc host - - steps: - - name: Checkout diffusers - uses: actions/checkout@v3 - with: - fetch-depth: 2 - - - name: NVIDIA-SMI - run: nvidia-smi - - - name: Install dependencies - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m uv pip install -e [quality,test] - pip uninstall accelerate -y && python -m uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git - python -m uv pip install pytest-reportlog - - name: Environment - run: python utils/print_env.py - - - name: Run Nightly ONNXRuntime CUDA tests - env: - HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} - run: | - python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \ - -s -v -k "Onnx" \ - --make-reports=tests_onnx_cuda \ - --report-log=tests_onnx_cuda.log \ - tests/ - - - name: Failure short reports - if: ${{ failure() }} - run: | - cat reports/tests_onnx_cuda_stats.txt - cat reports/tests_onnx_cuda_failures_short.txt - - - name: Test suite reports artifacts - if: ${{ always() }} - uses: actions/upload-artifact@v4 - with: - name: tests_onnx_cuda_reports - path: reports - - - name: Generate Report and Notify Channel - if: always() - run: | - pip install slack_sdk tabulate - python utils/log_reports.py >> $GITHUB_STEP_SUMMARY - -# M1 runner currently not well supported -# TODO: (Dhruv) add these back when we setup better testing for Apple Silicon -# run_nightly_tests_apple_m1: -# name: Nightly PyTorch MPS tests on MacOS -# runs-on: [ self-hosted, apple-m1 ] -# if: github.event_name == 'schedule' -# -# steps: -# - name: Checkout diffusers -# uses: actions/checkout@v3 -# with: -# fetch-depth: 2 -# -# - name: Clean checkout -# shell: arch -arch arm64 bash {0} -# run: | -# git clean -fxd -# - name: Setup miniconda -# uses: ./.github/actions/setup-miniconda -# with: -# python-version: 3.9 -# -# - name: Install dependencies -# shell: arch -arch arm64 bash {0} -# run: | -# ${CONDA_RUN} python -m pip install --upgrade pip uv -# ${CONDA_RUN} python -m uv pip install -e [quality,test] -# ${CONDA_RUN} python -m uv pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cpu -# ${CONDA_RUN} python -m uv pip install accelerate@git+https://github.com/huggingface/accelerate -# ${CONDA_RUN} python -m uv pip install pytest-reportlog -# - name: Environment -# shell: arch -arch arm64 bash {0} -# run: | -# ${CONDA_RUN} python utils/print_env.py -# - name: Run nightly 
PyTorch tests on M1 (MPS) -# shell: arch -arch arm64 bash {0} -# env: -# HF_HOME: /System/Volumes/Data/mnt/cache -# HF_TOKEN: ${{ secrets.HF_TOKEN }} -# run: | -# ${CONDA_RUN} python -m pytest -n 1 -s -v --make-reports=tests_torch_mps \ -# --report-log=tests_torch_mps.log \ -# tests/ -# - name: Failure short reports -# if: ${{ failure() }} -# run: cat reports/tests_torch_mps_failures_short.txt -# -# - name: Test suite reports artifacts -# if: ${{ always() }} -# uses: actions/upload-artifact@v4 -# with: -# name: torch_mps_test_reports -# path: reports -# -# - name: Generate Report and Notify Channel -# if: always() -# run: | -# pip install slack_sdk tabulate -# python utils/log_reports.py >> $GITHUB_STEP_SUMMARY run_nightly_tests_apple_m1: -# name: Nightly PyTorch MPS tests on MacOS -# runs-on: [ self-hosted, apple-m1 ] -# if: github.event_name == 'schedule' -# -# steps: -# - name: Checkout diffusers -# uses: actions/checkout@v3 -# with: -# fetch-depth: 2 -# -# - name: Clean checkout -# shell: arch -arch arm64 bash {0} -# run: | -# git clean -fxd -# - name: Setup miniconda -# uses: ./.github/actions/setup-miniconda -# with: -# python-version: 3.9 -# -# - name: Install dependencies -# shell: arch -arch arm64 bash {0} -# run: | -# ${CONDA_RUN} python -m pip install --upgrade pip uv -# ${CONDA_RUN} python -m uv pip install -e [quality,test] -# ${CONDA_RUN} python -m uv pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cpu -# ${CONDA_RUN} python -m uv pip install accelerate@git+https://github.com/huggingface/accelerate -# ${CONDA_RUN} python -m uv pip install pytest-reportlog -# - name: Environment -# shell: arch -arch arm64 bash {0} -# run: | -# ${CONDA_RUN} python utils/print_env.py -# - name: Run nightly PyTorch tests on M1 (MPS) -# shell: arch -arch arm64 bash {0} -# env: -# HF_HOME: /System/Volumes/Data/mnt/cache -# HF_TOKEN: ${{ secrets.HF_TOKEN }} -# run: | -# ${CONDA_RUN} python -m pytest -n 1 -s -v --make-reports=tests_torch_mps \ -# --report-log=tests_torch_mps.log \ -# tests/ -# - name: Failure short reports -# if: ${{ failure() }} -# run: cat reports/tests_torch_mps_failures_short.txt -# -# - name: Test suite reports artifacts -# if: ${{ always() }} -# uses: actions/upload-artifact@v4 -# with: -# name: torch_mps_test_reports -# path: reports -# -# - name: Generate Report and Notify Channel -# if: always() -# run: | -# pip install slack_sdk tabulate -# python utils/log_reports.py >> $GITHUB_STEP_SUMMARY \ No newline at end of file diff --git a/diffusers/.github/workflows/notify_slack_about_release.yml b/diffusers/.github/workflows/notify_slack_about_release.yml deleted file mode 100644 index 612ad4e24503a8747d58e5b3273557892782fbe8..0000000000000000000000000000000000000000 --- a/diffusers/.github/workflows/notify_slack_about_release.yml +++ /dev/null @@ -1,23 +0,0 @@ -name: Notify Slack about a release - -on: - workflow_dispatch: - release: - types: [published] - -jobs: - build: - runs-on: ubuntu-22.04 - - steps: - - uses: actions/checkout@v3 - - - name: Setup Python - uses: actions/setup-python@v4 - with: - python-version: '3.8' - - - name: Notify Slack about the release - env: - SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }} - run: pip install requests && python utils/notify_slack_about_release.py diff --git a/diffusers/.github/workflows/pr_dependency_test.yml b/diffusers/.github/workflows/pr_dependency_test.yml deleted file mode 100644 index d9350c09ac427940c9160111fd6ba28a242ec5e3..0000000000000000000000000000000000000000 --- 
a/diffusers/.github/workflows/pr_dependency_test.yml +++ /dev/null @@ -1,35 +0,0 @@ -name: Run dependency tests - -on: - pull_request: - branches: - - main - paths: - - "src/diffusers/**.py" - push: - branches: - - main - -concurrency: - group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }} - cancel-in-progress: true - -jobs: - check_dependencies: - runs-on: ubuntu-22.04 - steps: - - uses: actions/checkout@v3 - - name: Set up Python - uses: actions/setup-python@v4 - with: - python-version: "3.8" - - name: Install dependencies - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m pip install --upgrade pip uv - python -m uv pip install -e . - python -m uv pip install pytest - - name: Check for soft dependencies - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - pytest tests/others/test_dependencies.py diff --git a/diffusers/.github/workflows/pr_flax_dependency_test.yml b/diffusers/.github/workflows/pr_flax_dependency_test.yml deleted file mode 100644 index e091b5f2d7b3722e8108641c28f621c104b8e28f..0000000000000000000000000000000000000000 --- a/diffusers/.github/workflows/pr_flax_dependency_test.yml +++ /dev/null @@ -1,38 +0,0 @@ -name: Run Flax dependency tests - -on: - pull_request: - branches: - - main - paths: - - "src/diffusers/**.py" - push: - branches: - - main - -concurrency: - group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }} - cancel-in-progress: true - -jobs: - check_flax_dependencies: - runs-on: ubuntu-22.04 - steps: - - uses: actions/checkout@v3 - - name: Set up Python - uses: actions/setup-python@v4 - with: - python-version: "3.8" - - name: Install dependencies - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m pip install --upgrade pip uv - python -m uv pip install -e . 
- python -m uv pip install "jax[cpu]>=0.2.16,!=0.3.2" - python -m uv pip install "flax>=0.4.1" - python -m uv pip install "jaxlib>=0.1.65" - python -m uv pip install pytest - - name: Check for soft dependencies - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - pytest tests/others/test_dependencies.py diff --git a/diffusers/.github/workflows/pr_test_fetcher.yml b/diffusers/.github/workflows/pr_test_fetcher.yml deleted file mode 100644 index b032bb842786a73761b9ef73dfc1cc87a4ab0e26..0000000000000000000000000000000000000000 --- a/diffusers/.github/workflows/pr_test_fetcher.yml +++ /dev/null @@ -1,177 +0,0 @@ -name: Fast tests for PRs - Test Fetcher - -on: workflow_dispatch - -env: - DIFFUSERS_IS_CI: yes - OMP_NUM_THREADS: 4 - MKL_NUM_THREADS: 4 - PYTEST_TIMEOUT: 60 - -concurrency: - group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }} - cancel-in-progress: true - -jobs: - setup_pr_tests: - name: Setup PR Tests - runs-on: - group: aws-general-8-plus - container: - image: diffusers/diffusers-pytorch-cpu - options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ - defaults: - run: - shell: bash - outputs: - matrix: ${{ steps.set_matrix.outputs.matrix }} - test_map: ${{ steps.set_matrix.outputs.test_map }} - steps: - - name: Checkout diffusers - uses: actions/checkout@v3 - with: - fetch-depth: 0 - - name: Install dependencies - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m uv pip install -e [quality,test] - - name: Environment - run: | - python utils/print_env.py - echo $(git --version) - - name: Fetch Tests - run: | - python utils/tests_fetcher.py | tee test_preparation.txt - - name: Report fetched tests - uses: actions/upload-artifact@v3 - with: - name: test_fetched - path: test_preparation.txt - - id: set_matrix - name: Create Test Matrix - # The `keys` is used as GitHub actions matrix for jobs, i.e. `models`, `pipelines`, etc. - # The `test_map` is used to get the actual identified test files under each key. 
- # If no test to run (so no `test_map.json` file), create a dummy map (empty matrix will fail) - run: | - if [ -f test_map.json ]; then - keys=$(python3 -c 'import json; fp = open("test_map.json"); test_map = json.load(fp); fp.close(); d = list(test_map.keys()); print(json.dumps(d))') - test_map=$(python3 -c 'import json; fp = open("test_map.json"); test_map = json.load(fp); fp.close(); print(json.dumps(test_map))') - else - keys=$(python3 -c 'keys = ["dummy"]; print(keys)') - test_map=$(python3 -c 'test_map = {"dummy": []}; print(test_map)') - fi - echo $keys - echo $test_map - echo "matrix=$keys" >> $GITHUB_OUTPUT - echo "test_map=$test_map" >> $GITHUB_OUTPUT - - run_pr_tests: - name: Run PR Tests - needs: setup_pr_tests - if: contains(fromJson(needs.setup_pr_tests.outputs.matrix), 'dummy') != true - strategy: - fail-fast: false - max-parallel: 2 - matrix: - modules: ${{ fromJson(needs.setup_pr_tests.outputs.matrix) }} - runs-on: - group: aws-general-8-plus - container: - image: diffusers/diffusers-pytorch-cpu - options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ - defaults: - run: - shell: bash - steps: - - name: Checkout diffusers - uses: actions/checkout@v3 - with: - fetch-depth: 2 - - - name: Install dependencies - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m pip install -e [quality,test] - python -m pip install accelerate - - - name: Environment - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python utils/print_env.py - - - name: Run all selected tests on CPU - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m pytest -n 2 --dist=loadfile -v --make-reports=${{ matrix.modules }}_tests_cpu ${{ fromJson(needs.setup_pr_tests.outputs.test_map)[matrix.modules] }} - - - name: Failure short reports - if: ${{ failure() }} - continue-on-error: true - run: | - cat reports/${{ matrix.modules }}_tests_cpu_stats.txt - cat reports/${{ matrix.modules }}_tests_cpu_failures_short.txt - - - name: Test suite reports artifacts - if: ${{ always() }} - uses: actions/upload-artifact@v3 - with: - name: ${{ matrix.modules }}_test_reports - path: reports - - run_staging_tests: - strategy: - fail-fast: false - matrix: - config: - - name: Hub tests for models, schedulers, and pipelines - framework: hub_tests_pytorch - runner: aws-general-8-plus - image: diffusers/diffusers-pytorch-cpu - report: torch_hub - - name: ${{ matrix.config.name }} - runs-on: - group: ${{ matrix.config.runner }} - container: - image: ${{ matrix.config.image }} - options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ - - defaults: - run: - shell: bash - - steps: - - name: Checkout diffusers - uses: actions/checkout@v3 - with: - fetch-depth: 2 - - - name: Install dependencies - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m pip install -e [quality,test] - - - name: Environment - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python utils/print_env.py - - - name: Run Hub tests for models, schedulers, and pipelines on a staging env - if: ${{ matrix.config.framework == 'hub_tests_pytorch' }} - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - HUGGINGFACE_CO_STAGING=true python -m pytest \ - -m "is_staging_test" \ - --make-reports=tests_${{ matrix.config.report }} \ - tests - - - name: Failure short reports - if: ${{ failure() }} - run: cat reports/tests_${{ matrix.config.report }}_failures_short.txt - - - name: Test suite reports 
artifacts - if: ${{ always() }} - uses: actions/upload-artifact@v4 - with: - name: pr_${{ matrix.config.report }}_test_reports - path: reports diff --git a/diffusers/.github/workflows/pr_test_peft_backend.yml b/diffusers/.github/workflows/pr_test_peft_backend.yml deleted file mode 100644 index 190e5d26e6f32509a56ee3c3fc0590de963b5348..0000000000000000000000000000000000000000 --- a/diffusers/.github/workflows/pr_test_peft_backend.yml +++ /dev/null @@ -1,134 +0,0 @@ -name: Fast tests for PRs - PEFT backend - -on: - pull_request: - branches: - - main - paths: - - "src/diffusers/**.py" - - "tests/**.py" - -concurrency: - group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }} - cancel-in-progress: true - -env: - DIFFUSERS_IS_CI: yes - OMP_NUM_THREADS: 4 - MKL_NUM_THREADS: 4 - PYTEST_TIMEOUT: 60 - -jobs: - check_code_quality: - runs-on: ubuntu-22.04 - steps: - - uses: actions/checkout@v3 - - name: Set up Python - uses: actions/setup-python@v4 - with: - python-version: "3.8" - - name: Install dependencies - run: | - python -m pip install --upgrade pip - pip install .[quality] - - name: Check quality - run: make quality - - name: Check if failure - if: ${{ failure() }} - run: | - echo "Quality check failed. Please ensure the right dependency versions are installed with 'pip install -e .[quality]' and run 'make style && make quality'" >> $GITHUB_STEP_SUMMARY - - check_repository_consistency: - needs: check_code_quality - runs-on: ubuntu-22.04 - steps: - - uses: actions/checkout@v3 - - name: Set up Python - uses: actions/setup-python@v4 - with: - python-version: "3.8" - - name: Install dependencies - run: | - python -m pip install --upgrade pip - pip install .[quality] - - name: Check repo consistency - run: | - python utils/check_copies.py - python utils/check_dummies.py - make deps_table_check_updated - - name: Check if failure - if: ${{ failure() }} - run: | - echo "Repo consistency check failed. 
Please ensure the right dependency versions are installed with 'pip install -e .[quality]' and run 'make fix-copies'" >> $GITHUB_STEP_SUMMARY - - run_fast_tests: - needs: [check_code_quality, check_repository_consistency] - strategy: - fail-fast: false - matrix: - lib-versions: ["main", "latest"] - - - name: LoRA - ${{ matrix.lib-versions }} - - runs-on: - group: aws-general-8-plus - - container: - image: diffusers/diffusers-pytorch-cpu - options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ - - defaults: - run: - shell: bash - - steps: - - name: Checkout diffusers - uses: actions/checkout@v3 - with: - fetch-depth: 2 - - - name: Install dependencies - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m uv pip install -e [quality,test] - # TODO (sayakpaul, DN6): revisit `--no-deps` - if [ "${{ matrix.lib-versions }}" == "main" ]; then - python -m pip install -U peft@git+https://github.com/huggingface/peft.git --no-deps - python -m uv pip install -U transformers@git+https://github.com/huggingface/transformers.git --no-deps - pip uninstall accelerate -y && python -m uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git --no-deps - else - python -m uv pip install -U peft --no-deps - python -m uv pip install -U transformers accelerate --no-deps - fi - - - name: Environment - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python utils/print_env.py - - - name: Run fast PyTorch LoRA CPU tests with PEFT backend - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m pytest -n 4 --max-worker-restart=0 --dist=loadfile \ - -s -v \ - --make-reports=tests_${{ matrix.lib-versions }} \ - tests/lora/ - python -m pytest -n 4 --max-worker-restart=0 --dist=loadfile \ - -s -v \ - --make-reports=tests_models_lora_${{ matrix.lib-versions }} \ - tests/models/ -k "lora" - - - - name: Failure short reports - if: ${{ failure() }} - run: | - cat reports/tests_${{ matrix.lib-versions }}_failures_short.txt - cat reports/tests_models_lora_${{ matrix.lib-versions }}_failures_short.txt - - - name: Test suite reports artifacts - if: ${{ always() }} - uses: actions/upload-artifact@v4 - with: - name: pr_${{ matrix.lib-versions }}_test_reports - path: reports diff --git a/diffusers/.github/workflows/pr_tests.yml b/diffusers/.github/workflows/pr_tests.yml deleted file mode 100644 index ec3e55a5e88252e8f8045479b39e97be900d9b64..0000000000000000000000000000000000000000 --- a/diffusers/.github/workflows/pr_tests.yml +++ /dev/null @@ -1,236 +0,0 @@ -name: Fast tests for PRs - -on: - pull_request: - branches: - - main - paths: - - "src/diffusers/**.py" - - "benchmarks/**.py" - - "examples/**.py" - - "scripts/**.py" - - "tests/**.py" - - ".github/**.yml" - - "utils/**.py" - push: - branches: - - ci-* - -concurrency: - group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }} - cancel-in-progress: true - -env: - DIFFUSERS_IS_CI: yes - HF_HUB_ENABLE_HF_TRANSFER: 1 - OMP_NUM_THREADS: 4 - MKL_NUM_THREADS: 4 - PYTEST_TIMEOUT: 60 - -jobs: - check_code_quality: - runs-on: ubuntu-22.04 - steps: - - uses: actions/checkout@v3 - - name: Set up Python - uses: actions/setup-python@v4 - with: - python-version: "3.8" - - name: Install dependencies - run: | - python -m pip install --upgrade pip - pip install .[quality] - - name: Check quality - run: make quality - - name: Check if failure - if: ${{ failure() }} - run: | - echo "Quality check failed. 
Please ensure the right dependency versions are installed with 'pip install -e .[quality]' and run 'make style && make quality'" >> $GITHUB_STEP_SUMMARY - - check_repository_consistency: - needs: check_code_quality - runs-on: ubuntu-22.04 - steps: - - uses: actions/checkout@v3 - - name: Set up Python - uses: actions/setup-python@v4 - with: - python-version: "3.8" - - name: Install dependencies - run: | - python -m pip install --upgrade pip - pip install .[quality] - - name: Check repo consistency - run: | - python utils/check_copies.py - python utils/check_dummies.py - make deps_table_check_updated - - name: Check if failure - if: ${{ failure() }} - run: | - echo "Repo consistency check failed. Please ensure the right dependency versions are installed with 'pip install -e .[quality]' and run 'make fix-copies'" >> $GITHUB_STEP_SUMMARY - - run_fast_tests: - needs: [check_code_quality, check_repository_consistency] - strategy: - fail-fast: false - matrix: - config: - - name: Fast PyTorch Pipeline CPU tests - framework: pytorch_pipelines - runner: aws-highmemory-32-plus - image: diffusers/diffusers-pytorch-cpu - report: torch_cpu_pipelines - - name: Fast PyTorch Models & Schedulers CPU tests - framework: pytorch_models - runner: aws-general-8-plus - image: diffusers/diffusers-pytorch-cpu - report: torch_cpu_models_schedulers - - name: Fast Flax CPU tests - framework: flax - runner: aws-general-8-plus - image: diffusers/diffusers-flax-cpu - report: flax_cpu - - name: PyTorch Example CPU tests - framework: pytorch_examples - runner: aws-general-8-plus - image: diffusers/diffusers-pytorch-cpu - report: torch_example_cpu - - name: ${{ matrix.config.name }} - - runs-on: - group: ${{ matrix.config.runner }} - - container: - image: ${{ matrix.config.image }} - options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ - - defaults: - run: - shell: bash - - steps: - - name: Checkout diffusers - uses: actions/checkout@v3 - with: - fetch-depth: 2 - - - name: Install dependencies - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m uv pip install -e [quality,test] - python -m uv pip install accelerate - - - name: Environment - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python utils/print_env.py - - - name: Run fast PyTorch Pipeline CPU tests - if: ${{ matrix.config.framework == 'pytorch_pipelines' }} - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m pytest -n 8 --max-worker-restart=0 --dist=loadfile \ - -s -v -k "not Flax and not Onnx" \ - --make-reports=tests_${{ matrix.config.report }} \ - tests/pipelines - - - name: Run fast PyTorch Model Scheduler CPU tests - if: ${{ matrix.config.framework == 'pytorch_models' }} - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m pytest -n 4 --max-worker-restart=0 --dist=loadfile \ - -s -v -k "not Flax and not Onnx and not Dependency" \ - --make-reports=tests_${{ matrix.config.report }} \ - tests/models tests/schedulers tests/others - - - name: Run fast Flax TPU tests - if: ${{ matrix.config.framework == 'flax' }} - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m pytest -n 4 --max-worker-restart=0 --dist=loadfile \ - -s -v -k "Flax" \ - --make-reports=tests_${{ matrix.config.report }} \ - tests - - - name: Run example PyTorch CPU tests - if: ${{ matrix.config.framework == 'pytorch_examples' }} - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m uv pip install peft 
timm - python -m pytest -n 4 --max-worker-restart=0 --dist=loadfile \ - --make-reports=tests_${{ matrix.config.report }} \ - examples - - - name: Failure short reports - if: ${{ failure() }} - run: cat reports/tests_${{ matrix.config.report }}_failures_short.txt - - - name: Test suite reports artifacts - if: ${{ always() }} - uses: actions/upload-artifact@v4 - with: - name: pr_${{ matrix.config.framework }}_${{ matrix.config.report }}_test_reports - path: reports - - run_staging_tests: - needs: [check_code_quality, check_repository_consistency] - strategy: - fail-fast: false - matrix: - config: - - name: Hub tests for models, schedulers, and pipelines - framework: hub_tests_pytorch - runner: - group: aws-general-8-plus - image: diffusers/diffusers-pytorch-cpu - report: torch_hub - - name: ${{ matrix.config.name }} - - runs-on: ${{ matrix.config.runner }} - - container: - image: ${{ matrix.config.image }} - options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ - - defaults: - run: - shell: bash - - steps: - - name: Checkout diffusers - uses: actions/checkout@v3 - with: - fetch-depth: 2 - - - name: Install dependencies - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m uv pip install -e [quality,test] - - - name: Environment - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python utils/print_env.py - - - name: Run Hub tests for models, schedulers, and pipelines on a staging env - if: ${{ matrix.config.framework == 'hub_tests_pytorch' }} - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - HUGGINGFACE_CO_STAGING=true python -m pytest \ - -m "is_staging_test" \ - --make-reports=tests_${{ matrix.config.report }} \ - tests - - - name: Failure short reports - if: ${{ failure() }} - run: cat reports/tests_${{ matrix.config.report }}_failures_short.txt - - - name: Test suite reports artifacts - if: ${{ always() }} - uses: actions/upload-artifact@v4 - with: - name: pr_${{ matrix.config.report }}_test_reports - path: reports diff --git a/diffusers/.github/workflows/pr_torch_dependency_test.yml b/diffusers/.github/workflows/pr_torch_dependency_test.yml deleted file mode 100644 index c39d5eca2d9a60214f348b1ca7ecfd90638e4470..0000000000000000000000000000000000000000 --- a/diffusers/.github/workflows/pr_torch_dependency_test.yml +++ /dev/null @@ -1,36 +0,0 @@ -name: Run Torch dependency tests - -on: - pull_request: - branches: - - main - paths: - - "src/diffusers/**.py" - push: - branches: - - main - -concurrency: - group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }} - cancel-in-progress: true - -jobs: - check_torch_dependencies: - runs-on: ubuntu-22.04 - steps: - - uses: actions/checkout@v3 - - name: Set up Python - uses: actions/setup-python@v4 - with: - python-version: "3.8" - - name: Install dependencies - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m pip install --upgrade pip uv - python -m uv pip install -e . 
- python -m uv pip install torch torchvision torchaudio - python -m uv pip install pytest - - name: Check for soft dependencies - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - pytest tests/others/test_dependencies.py diff --git a/diffusers/.github/workflows/push_tests.yml b/diffusers/.github/workflows/push_tests.yml deleted file mode 100644 index 2289d1b5cad164fecf88380cf43c66b41fef725f..0000000000000000000000000000000000000000 --- a/diffusers/.github/workflows/push_tests.yml +++ /dev/null @@ -1,391 +0,0 @@ -name: Fast GPU Tests on main - -on: - workflow_dispatch: - push: - branches: - - main - paths: - - "src/diffusers/**.py" - - "examples/**.py" - - "tests/**.py" - -env: - DIFFUSERS_IS_CI: yes - OMP_NUM_THREADS: 8 - MKL_NUM_THREADS: 8 - HF_HUB_ENABLE_HF_TRANSFER: 1 - PYTEST_TIMEOUT: 600 - PIPELINE_USAGE_CUTOFF: 50000 - -jobs: - setup_torch_cuda_pipeline_matrix: - name: Setup Torch Pipelines CUDA Slow Tests Matrix - runs-on: - group: aws-general-8-plus - container: - image: diffusers/diffusers-pytorch-cpu - outputs: - pipeline_test_matrix: ${{ steps.fetch_pipeline_matrix.outputs.pipeline_test_matrix }} - steps: - - name: Checkout diffusers - uses: actions/checkout@v3 - with: - fetch-depth: 2 - - name: Install dependencies - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m uv pip install -e [quality,test] - - name: Environment - run: | - python utils/print_env.py - - name: Fetch Pipeline Matrix - id: fetch_pipeline_matrix - run: | - matrix=$(python utils/fetch_torch_cuda_pipeline_test_matrix.py) - echo $matrix - echo "pipeline_test_matrix=$matrix" >> $GITHUB_OUTPUT - - name: Pipeline Tests Artifacts - if: ${{ always() }} - uses: actions/upload-artifact@v4 - with: - name: test-pipelines.json - path: reports - - torch_pipelines_cuda_tests: - name: Torch Pipelines CUDA Tests - needs: setup_torch_cuda_pipeline_matrix - strategy: - fail-fast: false - max-parallel: 8 - matrix: - module: ${{ fromJson(needs.setup_torch_cuda_pipeline_matrix.outputs.pipeline_test_matrix) }} - runs-on: - group: aws-g4dn-2xlarge - container: - image: diffusers/diffusers-pytorch-cuda - options: --shm-size "16gb" --ipc host --gpus 0 - steps: - - name: Checkout diffusers - uses: actions/checkout@v3 - with: - fetch-depth: 2 - - name: NVIDIA-SMI - run: | - nvidia-smi - - name: Install dependencies - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m uv pip install -e [quality,test] - pip uninstall accelerate -y && python -m uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git - - name: Environment - run: | - python utils/print_env.py - - name: PyTorch CUDA checkpoint tests on Ubuntu - env: - HF_TOKEN: ${{ secrets.HF_TOKEN }} - # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms - CUBLAS_WORKSPACE_CONFIG: :16:8 - run: | - python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \ - -s -v -k "not Flax and not Onnx" \ - --make-reports=tests_pipeline_${{ matrix.module }}_cuda \ - tests/pipelines/${{ matrix.module }} - - name: Failure short reports - if: ${{ failure() }} - run: | - cat reports/tests_pipeline_${{ matrix.module }}_cuda_stats.txt - cat reports/tests_pipeline_${{ matrix.module }}_cuda_failures_short.txt - - name: Test suite reports artifacts - if: ${{ always() }} - uses: actions/upload-artifact@v4 - with: - name: pipeline_${{ matrix.module }}_test_reports - path: reports - - torch_cuda_tests: - name: Torch CUDA Tests - runs-on: - group: aws-g4dn-2xlarge 
- container: - image: diffusers/diffusers-pytorch-cuda - options: --shm-size "16gb" --ipc host --gpus 0 - defaults: - run: - shell: bash - strategy: - fail-fast: false - max-parallel: 2 - matrix: - module: [models, schedulers, lora, others, single_file] - steps: - - name: Checkout diffusers - uses: actions/checkout@v3 - with: - fetch-depth: 2 - - - name: Install dependencies - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m uv pip install -e [quality,test] - python -m uv pip install peft@git+https://github.com/huggingface/peft.git - pip uninstall accelerate -y && python -m uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git - - - name: Environment - run: | - python utils/print_env.py - - - name: Run PyTorch CUDA tests - env: - HF_TOKEN: ${{ secrets.HF_TOKEN }} - # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms - CUBLAS_WORKSPACE_CONFIG: :16:8 - run: | - python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \ - -s -v -k "not Flax and not Onnx" \ - --make-reports=tests_torch_cuda_${{ matrix.module }} \ - tests/${{ matrix.module }} - - - name: Failure short reports - if: ${{ failure() }} - run: | - cat reports/tests_torch_cuda_${{ matrix.module }}_stats.txt - cat reports/tests_torch_cuda_${{ matrix.module }}_failures_short.txt - - - name: Test suite reports artifacts - if: ${{ always() }} - uses: actions/upload-artifact@v4 - with: - name: torch_cuda_test_reports_${{ matrix.module }} - path: reports - - flax_tpu_tests: - name: Flax TPU Tests - runs-on: docker-tpu - container: - image: diffusers/diffusers-flax-tpu - options: --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ --privileged - defaults: - run: - shell: bash - steps: - - name: Checkout diffusers - uses: actions/checkout@v3 - with: - fetch-depth: 2 - - - name: Install dependencies - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m uv pip install -e [quality,test] - pip uninstall accelerate -y && python -m uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git - - - name: Environment - run: | - python utils/print_env.py - - - name: Run Flax TPU tests - env: - HF_TOKEN: ${{ secrets.HF_TOKEN }} - run: | - python -m pytest -n 0 \ - -s -v -k "Flax" \ - --make-reports=tests_flax_tpu \ - tests/ - - - name: Failure short reports - if: ${{ failure() }} - run: | - cat reports/tests_flax_tpu_stats.txt - cat reports/tests_flax_tpu_failures_short.txt - - - name: Test suite reports artifacts - if: ${{ always() }} - uses: actions/upload-artifact@v4 - with: - name: flax_tpu_test_reports - path: reports - - onnx_cuda_tests: - name: ONNX CUDA Tests - runs-on: - group: aws-g4dn-2xlarge - container: - image: diffusers/diffusers-onnxruntime-cuda - options: --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ --gpus 0 - defaults: - run: - shell: bash - steps: - - name: Checkout diffusers - uses: actions/checkout@v3 - with: - fetch-depth: 2 - - - name: Install dependencies - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m uv pip install -e [quality,test] - pip uninstall accelerate -y && python -m uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git - - - name: Environment - run: | - python utils/print_env.py - - - name: Run ONNXRuntime CUDA tests - env: - HF_TOKEN: ${{ secrets.HF_TOKEN }} - run: | - python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \ - -s -v -k "Onnx" \ - 
--make-reports=tests_onnx_cuda \ - tests/ - - - name: Failure short reports - if: ${{ failure() }} - run: | - cat reports/tests_onnx_cuda_stats.txt - cat reports/tests_onnx_cuda_failures_short.txt - - - name: Test suite reports artifacts - if: ${{ always() }} - uses: actions/upload-artifact@v4 - with: - name: onnx_cuda_test_reports - path: reports - - run_torch_compile_tests: - name: PyTorch Compile CUDA tests - - runs-on: - group: aws-g4dn-2xlarge - - container: - image: diffusers/diffusers-pytorch-compile-cuda - options: --gpus 0 --shm-size "16gb" --ipc host - - steps: - - name: Checkout diffusers - uses: actions/checkout@v3 - with: - fetch-depth: 2 - - - name: NVIDIA-SMI - run: | - nvidia-smi - - name: Install dependencies - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m uv pip install -e [quality,test,training] - - name: Environment - run: | - python utils/print_env.py - - name: Run example tests on GPU - env: - HF_TOKEN: ${{ secrets.HF_TOKEN }} - RUN_COMPILE: yes - run: | - python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile -s -v -k "compile" --make-reports=tests_torch_compile_cuda tests/ - - name: Failure short reports - if: ${{ failure() }} - run: cat reports/tests_torch_compile_cuda_failures_short.txt - - - name: Test suite reports artifacts - if: ${{ always() }} - uses: actions/upload-artifact@v4 - with: - name: torch_compile_test_reports - path: reports - - run_xformers_tests: - name: PyTorch xformers CUDA tests - - runs-on: - group: aws-g4dn-2xlarge - - container: - image: diffusers/diffusers-pytorch-xformers-cuda - options: --gpus 0 --shm-size "16gb" --ipc host - - steps: - - name: Checkout diffusers - uses: actions/checkout@v3 - with: - fetch-depth: 2 - - - name: NVIDIA-SMI - run: | - nvidia-smi - - name: Install dependencies - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m uv pip install -e [quality,test,training] - - name: Environment - run: | - python utils/print_env.py - - name: Run example tests on GPU - env: - HF_TOKEN: ${{ secrets.HF_TOKEN }} - run: | - python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile -s -v -k "xformers" --make-reports=tests_torch_xformers_cuda tests/ - - name: Failure short reports - if: ${{ failure() }} - run: cat reports/tests_torch_xformers_cuda_failures_short.txt - - - name: Test suite reports artifacts - if: ${{ always() }} - uses: actions/upload-artifact@v4 - with: - name: torch_xformers_test_reports - path: reports - - run_examples_tests: - name: Examples PyTorch CUDA tests on Ubuntu - - runs-on: - group: aws-g4dn-2xlarge - - container: - image: diffusers/diffusers-pytorch-cuda - options: --gpus 0 --shm-size "16gb" --ipc host - - steps: - - name: Checkout diffusers - uses: actions/checkout@v3 - with: - fetch-depth: 2 - - - name: NVIDIA-SMI - run: | - nvidia-smi - - - name: Install dependencies - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m uv pip install -e [quality,test,training] - - - name: Environment - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python utils/print_env.py - - - name: Run example tests on GPU - env: - HF_TOKEN: ${{ secrets.HF_TOKEN }} - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m uv pip install timm - python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile -s -v --make-reports=examples_torch_cuda examples/ - - - name: Failure short reports - if: ${{ failure() }} - run: | - cat reports/examples_torch_cuda_stats.txt - cat 
reports/examples_torch_cuda_failures_short.txt - - - name: Test suite reports artifacts - if: ${{ always() }} - uses: actions/upload-artifact@v4 - with: - name: examples_test_reports - path: reports diff --git a/diffusers/.github/workflows/push_tests_fast.yml b/diffusers/.github/workflows/push_tests_fast.yml deleted file mode 100644 index e8a73446de737999a05b7d3e829e5a4832425bce..0000000000000000000000000000000000000000 --- a/diffusers/.github/workflows/push_tests_fast.yml +++ /dev/null @@ -1,126 +0,0 @@ -name: Fast tests on main - -on: - push: - branches: - - main - paths: - - "src/diffusers/**.py" - - "examples/**.py" - - "tests/**.py" - -concurrency: - group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }} - cancel-in-progress: true - -env: - DIFFUSERS_IS_CI: yes - HF_HOME: /mnt/cache - OMP_NUM_THREADS: 8 - MKL_NUM_THREADS: 8 - HF_HUB_ENABLE_HF_TRANSFER: 1 - PYTEST_TIMEOUT: 600 - RUN_SLOW: no - -jobs: - run_fast_tests: - strategy: - fail-fast: false - matrix: - config: - - name: Fast PyTorch CPU tests on Ubuntu - framework: pytorch - runner: aws-general-8-plus - image: diffusers/diffusers-pytorch-cpu - report: torch_cpu - - name: Fast Flax CPU tests on Ubuntu - framework: flax - runner: aws-general-8-plus - image: diffusers/diffusers-flax-cpu - report: flax_cpu - - name: Fast ONNXRuntime CPU tests on Ubuntu - framework: onnxruntime - runner: aws-general-8-plus - image: diffusers/diffusers-onnxruntime-cpu - report: onnx_cpu - - name: PyTorch Example CPU tests on Ubuntu - framework: pytorch_examples - runner: aws-general-8-plus - image: diffusers/diffusers-pytorch-cpu - report: torch_example_cpu - - name: ${{ matrix.config.name }} - - runs-on: - group: ${{ matrix.config.runner }} - - container: - image: ${{ matrix.config.image }} - options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ - - defaults: - run: - shell: bash - - steps: - - name: Checkout diffusers - uses: actions/checkout@v3 - with: - fetch-depth: 2 - - - name: Install dependencies - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m uv pip install -e [quality,test] - - - name: Environment - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python utils/print_env.py - - - name: Run fast PyTorch CPU tests - if: ${{ matrix.config.framework == 'pytorch' }} - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m pytest -n 4 --max-worker-restart=0 --dist=loadfile \ - -s -v -k "not Flax and not Onnx" \ - --make-reports=tests_${{ matrix.config.report }} \ - tests/ - - - name: Run fast Flax TPU tests - if: ${{ matrix.config.framework == 'flax' }} - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m pytest -n 4 --max-worker-restart=0 --dist=loadfile \ - -s -v -k "Flax" \ - --make-reports=tests_${{ matrix.config.report }} \ - tests/ - - - name: Run fast ONNXRuntime CPU tests - if: ${{ matrix.config.framework == 'onnxruntime' }} - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m pytest -n 4 --max-worker-restart=0 --dist=loadfile \ - -s -v -k "Onnx" \ - --make-reports=tests_${{ matrix.config.report }} \ - tests/ - - - name: Run example PyTorch CPU tests - if: ${{ matrix.config.framework == 'pytorch_examples' }} - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m uv pip install peft timm - python -m pytest -n 4 --max-worker-restart=0 --dist=loadfile \ - --make-reports=tests_${{ matrix.config.report }} \ - examples - - - name: Failure 
short reports - if: ${{ failure() }} - run: cat reports/tests_${{ matrix.config.report }}_failures_short.txt - - - name: Test suite reports artifacts - if: ${{ always() }} - uses: actions/upload-artifact@v4 - with: - name: pr_${{ matrix.config.report }}_test_reports - path: reports diff --git a/diffusers/.github/workflows/push_tests_mps.yml b/diffusers/.github/workflows/push_tests_mps.yml deleted file mode 100644 index 8d521074a08fdf05c457479daa06f5f99bad9faa..0000000000000000000000000000000000000000 --- a/diffusers/.github/workflows/push_tests_mps.yml +++ /dev/null @@ -1,76 +0,0 @@ -name: Fast mps tests on main - -on: - push: - branches: - - main - paths: - - "src/diffusers/**.py" - - "tests/**.py" - -env: - DIFFUSERS_IS_CI: yes - HF_HOME: /mnt/cache - OMP_NUM_THREADS: 8 - MKL_NUM_THREADS: 8 - HF_HUB_ENABLE_HF_TRANSFER: 1 - PYTEST_TIMEOUT: 600 - RUN_SLOW: no - -concurrency: - group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }} - cancel-in-progress: true - -jobs: - run_fast_tests_apple_m1: - name: Fast PyTorch MPS tests on MacOS - runs-on: macos-13-xlarge - - steps: - - name: Checkout diffusers - uses: actions/checkout@v3 - with: - fetch-depth: 2 - - - name: Clean checkout - shell: arch -arch arm64 bash {0} - run: | - git clean -fxd - - - name: Setup miniconda - uses: ./.github/actions/setup-miniconda - with: - python-version: 3.9 - - - name: Install dependencies - shell: arch -arch arm64 bash {0} - run: | - ${CONDA_RUN} python -m pip install --upgrade pip uv - ${CONDA_RUN} python -m uv pip install -e [quality,test] - ${CONDA_RUN} python -m uv pip install torch torchvision torchaudio - ${CONDA_RUN} python -m uv pip install accelerate@git+https://github.com/huggingface/accelerate.git - ${CONDA_RUN} python -m uv pip install transformers --upgrade - - - name: Environment - shell: arch -arch arm64 bash {0} - run: | - ${CONDA_RUN} python utils/print_env.py - - - name: Run fast PyTorch tests on M1 (MPS) - shell: arch -arch arm64 bash {0} - env: - HF_HOME: /System/Volumes/Data/mnt/cache - HF_TOKEN: ${{ secrets.HF_TOKEN }} - run: | - ${CONDA_RUN} python -m pytest -n 0 -s -v --make-reports=tests_torch_mps tests/ - - - name: Failure short reports - if: ${{ failure() }} - run: cat reports/tests_torch_mps_failures_short.txt - - - name: Test suite reports artifacts - if: ${{ always() }} - uses: actions/upload-artifact@v4 - with: - name: pr_torch_mps_test_reports - path: reports diff --git a/diffusers/.github/workflows/pypi_publish.yaml b/diffusers/.github/workflows/pypi_publish.yaml deleted file mode 100644 index 33a5bb5640f28f9c12ea4bac2df632214a714005..0000000000000000000000000000000000000000 --- a/diffusers/.github/workflows/pypi_publish.yaml +++ /dev/null @@ -1,81 +0,0 @@ -# Adapted from https://blog.deepjyoti30.dev/pypi-release-github-action - -name: PyPI release - -on: - workflow_dispatch: - push: - tags: - - "*" - -jobs: - find-and-checkout-latest-branch: - runs-on: ubuntu-22.04 - outputs: - latest_branch: ${{ steps.set_latest_branch.outputs.latest_branch }} - steps: - - name: Checkout Repo - uses: actions/checkout@v3 - - - name: Set up Python - uses: actions/setup-python@v4 - with: - python-version: '3.8' - - - name: Fetch latest branch - id: fetch_latest_branch - run: | - pip install -U requests packaging - LATEST_BRANCH=$(python utils/fetch_latest_release_branch.py) - echo "Latest branch: $LATEST_BRANCH" - echo "latest_branch=$LATEST_BRANCH" >> $GITHUB_ENV - - - name: Set latest branch output - id: set_latest_branch - run: echo "::set-output name=latest_branch::${{ 
env.latest_branch }}" - - release: - needs: find-and-checkout-latest-branch - runs-on: ubuntu-22.04 - - steps: - - name: Checkout Repo - uses: actions/checkout@v3 - with: - ref: ${{ needs.find-and-checkout-latest-branch.outputs.latest_branch }} - - - name: Setup Python - uses: actions/setup-python@v4 - with: - python-version: "3.8" - - - name: Install dependencies - run: | - python -m pip install --upgrade pip - pip install -U setuptools wheel twine - pip install -U torch --index-url https://download.pytorch.org/whl/cpu - pip install -U transformers - - - name: Build the dist files - run: python setup.py bdist_wheel && python setup.py sdist - - - name: Publish to the test PyPI - env: - TWINE_USERNAME: ${{ secrets.TEST_PYPI_USERNAME }} - TWINE_PASSWORD: ${{ secrets.TEST_PYPI_PASSWORD }} - run: twine upload dist/* -r pypitest --repository-url=https://test.pypi.org/legacy/ - - - name: Test installing diffusers and importing - run: | - pip install diffusers && pip uninstall diffusers -y - pip install -i https://testpypi.python.org/pypi diffusers - python -c "from diffusers import __version__; print(__version__)" - python -c "from diffusers import DiffusionPipeline; pipe = DiffusionPipeline.from_pretrained('fusing/unet-ldm-dummy-update'); pipe()" - python -c "from diffusers import DiffusionPipeline; pipe = DiffusionPipeline.from_pretrained('hf-internal-testing/tiny-stable-diffusion-pipe', safety_checker=None); pipe('ah suh du')" - python -c "from diffusers import *" - - - name: Publish to PyPI - env: - TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }} - TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }} - run: twine upload dist/* -r pypi diff --git a/diffusers/.github/workflows/release_tests_fast.yml b/diffusers/.github/workflows/release_tests_fast.yml deleted file mode 100644 index a8a6f2699dca8a2ae02dd5c8dc9546efdc1e0f5c..0000000000000000000000000000000000000000 --- a/diffusers/.github/workflows/release_tests_fast.yml +++ /dev/null @@ -1,389 +0,0 @@ -# Duplicate workflow to push_tests.yml that is meant to run on release/patch branches as a final check -# Creating a duplicate workflow here is simpler than adding complex path/branch parsing logic to push_tests.yml -# Needs to be updated if push_tests.yml updated -name: (Release) Fast GPU Tests on main - -on: - push: - branches: - - "v*.*.*-release" - - "v*.*.*-patch" - -env: - DIFFUSERS_IS_CI: yes - OMP_NUM_THREADS: 8 - MKL_NUM_THREADS: 8 - PYTEST_TIMEOUT: 600 - PIPELINE_USAGE_CUTOFF: 50000 - -jobs: - setup_torch_cuda_pipeline_matrix: - name: Setup Torch Pipelines CUDA Slow Tests Matrix - runs-on: - group: aws-general-8-plus - container: - image: diffusers/diffusers-pytorch-cpu - outputs: - pipeline_test_matrix: ${{ steps.fetch_pipeline_matrix.outputs.pipeline_test_matrix }} - steps: - - name: Checkout diffusers - uses: actions/checkout@v3 - with: - fetch-depth: 2 - - name: Install dependencies - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m uv pip install -e [quality,test] - - name: Environment - run: | - python utils/print_env.py - - name: Fetch Pipeline Matrix - id: fetch_pipeline_matrix - run: | - matrix=$(python utils/fetch_torch_cuda_pipeline_test_matrix.py) - echo $matrix - echo "pipeline_test_matrix=$matrix" >> $GITHUB_OUTPUT - - name: Pipeline Tests Artifacts - if: ${{ always() }} - uses: actions/upload-artifact@v4 - with: - name: test-pipelines.json - path: reports - - torch_pipelines_cuda_tests: - name: Torch Pipelines CUDA Tests - needs: setup_torch_cuda_pipeline_matrix - strategy: - fail-fast: false - 
max-parallel: 8 - matrix: - module: ${{ fromJson(needs.setup_torch_cuda_pipeline_matrix.outputs.pipeline_test_matrix) }} - runs-on: - group: aws-g4dn-2xlarge - container: - image: diffusers/diffusers-pytorch-cuda - options: --shm-size "16gb" --ipc host --gpus 0 - steps: - - name: Checkout diffusers - uses: actions/checkout@v3 - with: - fetch-depth: 2 - - name: NVIDIA-SMI - run: | - nvidia-smi - - name: Install dependencies - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m uv pip install -e [quality,test] - pip uninstall accelerate -y && python -m uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git - - name: Environment - run: | - python utils/print_env.py - - name: Slow PyTorch CUDA checkpoint tests on Ubuntu - env: - HF_TOKEN: ${{ secrets.HF_TOKEN }} - # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms - CUBLAS_WORKSPACE_CONFIG: :16:8 - run: | - python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \ - -s -v -k "not Flax and not Onnx" \ - --make-reports=tests_pipeline_${{ matrix.module }}_cuda \ - tests/pipelines/${{ matrix.module }} - - name: Failure short reports - if: ${{ failure() }} - run: | - cat reports/tests_pipeline_${{ matrix.module }}_cuda_stats.txt - cat reports/tests_pipeline_${{ matrix.module }}_cuda_failures_short.txt - - name: Test suite reports artifacts - if: ${{ always() }} - uses: actions/upload-artifact@v4 - with: - name: pipeline_${{ matrix.module }}_test_reports - path: reports - - torch_cuda_tests: - name: Torch CUDA Tests - runs-on: - group: aws-g4dn-2xlarge - container: - image: diffusers/diffusers-pytorch-cuda - options: --shm-size "16gb" --ipc host --gpus 0 - defaults: - run: - shell: bash - strategy: - fail-fast: false - max-parallel: 2 - matrix: - module: [models, schedulers, lora, others, single_file] - steps: - - name: Checkout diffusers - uses: actions/checkout@v3 - with: - fetch-depth: 2 - - - name: Install dependencies - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m uv pip install -e [quality,test] - python -m uv pip install peft@git+https://github.com/huggingface/peft.git - pip uninstall accelerate -y && python -m uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git - - - name: Environment - run: | - python utils/print_env.py - - - name: Run PyTorch CUDA tests - env: - HF_TOKEN: ${{ secrets.HF_TOKEN }} - # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms - CUBLAS_WORKSPACE_CONFIG: :16:8 - run: | - python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \ - -s -v -k "not Flax and not Onnx" \ - --make-reports=tests_torch_${{ matrix.module }}_cuda \ - tests/${{ matrix.module }} - - - name: Failure short reports - if: ${{ failure() }} - run: | - cat reports/tests_torch_${{ matrix.module }}_cuda_stats.txt - cat reports/tests_torch_${{ matrix.module }}_cuda_failures_short.txt - - - name: Test suite reports artifacts - if: ${{ always() }} - uses: actions/upload-artifact@v4 - with: - name: torch_cuda_${{ matrix.module }}_test_reports - path: reports - - flax_tpu_tests: - name: Flax TPU Tests - runs-on: docker-tpu - container: - image: diffusers/diffusers-flax-tpu - options: --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ --privileged - defaults: - run: - shell: bash - steps: - - name: Checkout diffusers - uses: actions/checkout@v3 - with: - fetch-depth: 2 - - - name: Install dependencies - run: | - python -m venv 
/opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m uv pip install -e [quality,test] - pip uninstall accelerate -y && python -m uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git - - - name: Environment - run: | - python utils/print_env.py - - - name: Run slow Flax TPU tests - env: - HF_TOKEN: ${{ secrets.HF_TOKEN }} - run: | - python -m pytest -n 0 \ - -s -v -k "Flax" \ - --make-reports=tests_flax_tpu \ - tests/ - - - name: Failure short reports - if: ${{ failure() }} - run: | - cat reports/tests_flax_tpu_stats.txt - cat reports/tests_flax_tpu_failures_short.txt - - - name: Test suite reports artifacts - if: ${{ always() }} - uses: actions/upload-artifact@v4 - with: - name: flax_tpu_test_reports - path: reports - - onnx_cuda_tests: - name: ONNX CUDA Tests - runs-on: - group: aws-g4dn-2xlarge - container: - image: diffusers/diffusers-onnxruntime-cuda - options: --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ --gpus 0 - defaults: - run: - shell: bash - steps: - - name: Checkout diffusers - uses: actions/checkout@v3 - with: - fetch-depth: 2 - - - name: Install dependencies - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m uv pip install -e [quality,test] - pip uninstall accelerate -y && python -m uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git - - - name: Environment - run: | - python utils/print_env.py - - - name: Run slow ONNXRuntime CUDA tests - env: - HF_TOKEN: ${{ secrets.HF_TOKEN }} - run: | - python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \ - -s -v -k "Onnx" \ - --make-reports=tests_onnx_cuda \ - tests/ - - - name: Failure short reports - if: ${{ failure() }} - run: | - cat reports/tests_onnx_cuda_stats.txt - cat reports/tests_onnx_cuda_failures_short.txt - - - name: Test suite reports artifacts - if: ${{ always() }} - uses: actions/upload-artifact@v4 - with: - name: onnx_cuda_test_reports - path: reports - - run_torch_compile_tests: - name: PyTorch Compile CUDA tests - - runs-on: - group: aws-g4dn-2xlarge - - container: - image: diffusers/diffusers-pytorch-compile-cuda - options: --gpus 0 --shm-size "16gb" --ipc host - - steps: - - name: Checkout diffusers - uses: actions/checkout@v3 - with: - fetch-depth: 2 - - - name: NVIDIA-SMI - run: | - nvidia-smi - - name: Install dependencies - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m uv pip install -e [quality,test,training] - - name: Environment - run: | - python utils/print_env.py - - name: Run example tests on GPU - env: - HF_TOKEN: ${{ secrets.HF_TOKEN }} - RUN_COMPILE: yes - run: | - python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile -s -v -k "compile" --make-reports=tests_torch_compile_cuda tests/ - - name: Failure short reports - if: ${{ failure() }} - run: cat reports/tests_torch_compile_cuda_failures_short.txt - - - name: Test suite reports artifacts - if: ${{ always() }} - uses: actions/upload-artifact@v4 - with: - name: torch_compile_test_reports - path: reports - - run_xformers_tests: - name: PyTorch xformers CUDA tests - - runs-on: - group: aws-g4dn-2xlarge - - container: - image: diffusers/diffusers-pytorch-xformers-cuda - options: --gpus 0 --shm-size "16gb" --ipc host - - steps: - - name: Checkout diffusers - uses: actions/checkout@v3 - with: - fetch-depth: 2 - - - name: NVIDIA-SMI - run: | - nvidia-smi - - name: Install dependencies - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m uv pip install -e 
[quality,test,training] - - name: Environment - run: | - python utils/print_env.py - - name: Run example tests on GPU - env: - HF_TOKEN: ${{ secrets.HF_TOKEN }} - run: | - python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile -s -v -k "xformers" --make-reports=tests_torch_xformers_cuda tests/ - - name: Failure short reports - if: ${{ failure() }} - run: cat reports/tests_torch_xformers_cuda_failures_short.txt - - - name: Test suite reports artifacts - if: ${{ always() }} - uses: actions/upload-artifact@v4 - with: - name: torch_xformers_test_reports - path: reports - - run_examples_tests: - name: Examples PyTorch CUDA tests on Ubuntu - - runs-on: - group: aws-g4dn-2xlarge - - container: - image: diffusers/diffusers-pytorch-cuda - options: --gpus 0 --shm-size "16gb" --ipc host - - steps: - - name: Checkout diffusers - uses: actions/checkout@v3 - with: - fetch-depth: 2 - - - name: NVIDIA-SMI - run: | - nvidia-smi - - - name: Install dependencies - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m uv pip install -e [quality,test,training] - - - name: Environment - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python utils/print_env.py - - - name: Run example tests on GPU - env: - HF_TOKEN: ${{ secrets.HF_TOKEN }} - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m uv pip install timm - python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile -s -v --make-reports=examples_torch_cuda examples/ - - - name: Failure short reports - if: ${{ failure() }} - run: | - cat reports/examples_torch_cuda_stats.txt - cat reports/examples_torch_cuda_failures_short.txt - - - name: Test suite reports artifacts - if: ${{ always() }} - uses: actions/upload-artifact@v4 - with: - name: examples_test_reports - path: reports diff --git a/diffusers/.github/workflows/run_tests_from_a_pr.yml b/diffusers/.github/workflows/run_tests_from_a_pr.yml deleted file mode 100644 index 1e736e5430891ec5b1b7843ba2deadddadb9dca1..0000000000000000000000000000000000000000 --- a/diffusers/.github/workflows/run_tests_from_a_pr.yml +++ /dev/null @@ -1,74 +0,0 @@ -name: Check running SLOW tests from a PR (only GPU) - -on: - workflow_dispatch: - inputs: - docker_image: - default: 'diffusers/diffusers-pytorch-cuda' - description: 'Name of the Docker image' - required: true - branch: - description: 'PR Branch to test on' - required: true - test: - description: 'Tests to run (e.g.: `tests/models`).' - required: true - -env: - DIFFUSERS_IS_CI: yes - IS_GITHUB_CI: "1" - HF_HOME: /mnt/cache - OMP_NUM_THREADS: 8 - MKL_NUM_THREADS: 8 - PYTEST_TIMEOUT: 600 - RUN_SLOW: yes - -jobs: - run_tests: - name: "Run a test on our runner from a PR" - runs-on: - group: aws-g4dn-2xlarge - container: - image: ${{ github.event.inputs.docker_image }} - options: --gpus 0 --privileged --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ - - steps: - - name: Validate test files input - id: validate_test_files - env: - PY_TEST: ${{ github.event.inputs.test }} - run: | - if [[ ! "$PY_TEST" =~ ^tests/ ]]; then - echo "Error: The input string must start with 'tests/'." - exit 1 - fi - - if [[ ! "$PY_TEST" =~ ^tests/(models|pipelines) ]]; then - echo "Error: The input string must contain either 'models' or 'pipelines' after 'tests/'." - exit 1 - fi - - if [[ "$PY_TEST" == *";"* ]]; then - echo "Error: The input string must not contain ';'." 
- exit 1 - fi - echo "$PY_TEST" - - - name: Checkout PR branch - uses: actions/checkout@v4 - with: - ref: ${{ github.event.inputs.branch }} - repository: ${{ github.event.pull_request.head.repo.full_name }} - - - - name: Install pytest - run: | - python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH" - python -m uv pip install -e [quality,test] - python -m uv pip install peft - - - name: Run tests - env: - PY_TEST: ${{ github.event.inputs.test }} - run: | - pytest "$PY_TEST" diff --git a/diffusers/.github/workflows/ssh-pr-runner.yml b/diffusers/.github/workflows/ssh-pr-runner.yml deleted file mode 100644 index 49fa9c0ad24d65634a891496ac93bc93e9e24378..0000000000000000000000000000000000000000 --- a/diffusers/.github/workflows/ssh-pr-runner.yml +++ /dev/null @@ -1,40 +0,0 @@ -name: SSH into PR runners - -on: - workflow_dispatch: - inputs: - docker_image: - description: 'Name of the Docker image' - required: true - -env: - IS_GITHUB_CI: "1" - HF_HUB_READ_TOKEN: ${{ secrets.HF_HUB_READ_TOKEN }} - HF_HOME: /mnt/cache - DIFFUSERS_IS_CI: yes - OMP_NUM_THREADS: 8 - MKL_NUM_THREADS: 8 - RUN_SLOW: yes - -jobs: - ssh_runner: - name: "SSH" - runs-on: - group: aws-highmemory-32-plus - container: - image: ${{ github.event.inputs.docker_image }} - options: --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface/diffusers:/mnt/cache/ --privileged - - steps: - - name: Checkout diffusers - uses: actions/checkout@v3 - with: - fetch-depth: 2 - - - name: Tailscale # In order to be able to SSH when a test fails - uses: huggingface/tailscale-action@main - with: - authkey: ${{ secrets.TAILSCALE_SSH_AUTHKEY }} - slackChannel: ${{ secrets.SLACK_CIFEEDBACK_CHANNEL }} - slackToken: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }} - waitForSSH: true diff --git a/diffusers/.github/workflows/ssh-runner.yml b/diffusers/.github/workflows/ssh-runner.yml deleted file mode 100644 index fd65598a53a78de498589547ad75c14518b2d0cc..0000000000000000000000000000000000000000 --- a/diffusers/.github/workflows/ssh-runner.yml +++ /dev/null @@ -1,52 +0,0 @@ -name: SSH into GPU runners - -on: - workflow_dispatch: - inputs: - runner_type: - description: 'Type of runner to test (aws-g6-4xlarge-plus: a10, aws-g4dn-2xlarge: t4, aws-g6e-xlarge-plus: L40)' - type: choice - required: true - options: - - aws-g6-4xlarge-plus - - aws-g4dn-2xlarge - - aws-g6e-xlarge-plus - docker_image: - description: 'Name of the Docker image' - required: true - -env: - IS_GITHUB_CI: "1" - HF_HUB_READ_TOKEN: ${{ secrets.HF_HUB_READ_TOKEN }} - HF_HOME: /mnt/cache - DIFFUSERS_IS_CI: yes - OMP_NUM_THREADS: 8 - MKL_NUM_THREADS: 8 - RUN_SLOW: yes - -jobs: - ssh_runner: - name: "SSH" - runs-on: - group: "${{ github.event.inputs.runner_type }}" - container: - image: ${{ github.event.inputs.docker_image }} - options: --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface/diffusers:/mnt/cache/ --gpus 0 --privileged - - steps: - - name: Checkout diffusers - uses: actions/checkout@v3 - with: - fetch-depth: 2 - - - name: NVIDIA-SMI - run: | - nvidia-smi - - - name: Tailscale # In order to be able to SSH when a test fails - uses: huggingface/tailscale-action@main - with: - authkey: ${{ secrets.TAILSCALE_SSH_AUTHKEY }} - slackChannel: ${{ secrets.SLACK_CIFEEDBACK_CHANNEL }} - slackToken: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }} - waitForSSH: true diff --git a/diffusers/.github/workflows/stale.yml b/diffusers/.github/workflows/stale.yml deleted file mode 100644 index 27450ed4c7f20d4534b4c4597ca019267e12dd81..0000000000000000000000000000000000000000 --- 
a/diffusers/.github/workflows/stale.yml +++ /dev/null @@ -1,30 +0,0 @@ -name: Stale Bot - -on: - schedule: - - cron: "0 15 * * *" - -jobs: - close_stale_issues: - name: Close Stale Issues - if: github.repository == 'huggingface/diffusers' - runs-on: ubuntu-22.04 - permissions: - issues: write - pull-requests: write - env: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - steps: - - uses: actions/checkout@v2 - - - name: Setup Python - uses: actions/setup-python@v1 - with: - python-version: 3.8 - - - name: Install requirements - run: | - pip install PyGithub - - name: Close stale issues - run: | - python utils/stale.py diff --git a/diffusers/.github/workflows/trufflehog.yml b/diffusers/.github/workflows/trufflehog.yml deleted file mode 100644 index 44f821ea84edcb65e127251d7ab790e21e023f10..0000000000000000000000000000000000000000 --- a/diffusers/.github/workflows/trufflehog.yml +++ /dev/null @@ -1,15 +0,0 @@ -on: - push: - -name: Secret Leaks - -jobs: - trufflehog: - runs-on: ubuntu-22.04 - steps: - - name: Checkout code - uses: actions/checkout@v4 - with: - fetch-depth: 0 - - name: Secret Scanning - uses: trufflesecurity/trufflehog@main diff --git a/diffusers/.github/workflows/typos.yml b/diffusers/.github/workflows/typos.yml deleted file mode 100644 index 6d2f2fc8dd9a69a626bd9852e2add65cb32dfdc9..0000000000000000000000000000000000000000 --- a/diffusers/.github/workflows/typos.yml +++ /dev/null @@ -1,14 +0,0 @@ -name: Check typos - -on: - workflow_dispatch: - -jobs: - build: - runs-on: ubuntu-22.04 - - steps: - - uses: actions/checkout@v3 - - - name: typos-action - uses: crate-ci/typos@v1.12.4 diff --git a/diffusers/.github/workflows/update_metadata.yml b/diffusers/.github/workflows/update_metadata.yml deleted file mode 100644 index 92aea0369ba855237b39d4dee9e8daee0c81010d..0000000000000000000000000000000000000000 --- a/diffusers/.github/workflows/update_metadata.yml +++ /dev/null @@ -1,30 +0,0 @@ -name: Update Diffusers metadata - -on: - workflow_dispatch: - push: - branches: - - main - - update_diffusers_metadata* - -jobs: - update_metadata: - runs-on: ubuntu-22.04 - defaults: - run: - shell: bash -l {0} - - steps: - - uses: actions/checkout@v3 - - - name: Setup environment - run: | - pip install --upgrade pip - pip install datasets pandas - pip install .[torch] - - - name: Update metadata - env: - HF_TOKEN: ${{ secrets.SAYAK_HF_TOKEN }} - run: | - python utils/update_metadata.py --commit_sha ${{ github.sha }} diff --git a/diffusers/.github/workflows/upload_pr_documentation.yml b/diffusers/.github/workflows/upload_pr_documentation.yml deleted file mode 100644 index fc102df8103e48fb139a8bd47be05fc257d992c5..0000000000000000000000000000000000000000 --- a/diffusers/.github/workflows/upload_pr_documentation.yml +++ /dev/null @@ -1,16 +0,0 @@ -name: Upload PR Documentation - -on: - workflow_run: - workflows: ["Build PR Documentation"] - types: - - completed - -jobs: - build: - uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@main - with: - package_name: diffusers - secrets: - hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }} - comment_bot_token: ${{ secrets.COMMENT_BOT_TOKEN }} diff --git a/diffusers/.gitignore b/diffusers/.gitignore deleted file mode 100644 index 15617d5fdc745f57271db05b5fe7fd83f440c3a7..0000000000000000000000000000000000000000 --- a/diffusers/.gitignore +++ /dev/null @@ -1,178 +0,0 @@ -# Initially taken from GitHub's Python gitignore file - -# Byte-compiled / optimized / DLL files -__pycache__/ -*.py[cod] -*$py.class - -# C extensions -*.so - -# tests and 
logs -tests/fixtures/cached_*_text.txt -logs/ -lightning_logs/ -lang_code_data/ - -# Distribution / packaging -.Python -build/ -develop-eggs/ -dist/ -downloads/ -eggs/ -.eggs/ -lib/ -lib64/ -parts/ -sdist/ -var/ -wheels/ -*.egg-info/ -.installed.cfg -*.egg -MANIFEST - -# PyInstaller -# Usually these files are written by a Python script from a template -# before PyInstaller builds the exe, so as to inject date/other infos into it. -*.manifest -*.spec - -# Installer logs -pip-log.txt -pip-delete-this-directory.txt - -# Unit test / coverage reports -htmlcov/ -.tox/ -.nox/ -.coverage -.coverage.* -.cache -nosetests.xml -coverage.xml -*.cover -.hypothesis/ -.pytest_cache/ - -# Translations -*.mo -*.pot - -# Django stuff: -*.log -local_settings.py -db.sqlite3 - -# Flask stuff: -instance/ -.webassets-cache - -# Scrapy stuff: -.scrapy - -# Sphinx documentation -docs/_build/ - -# PyBuilder -target/ - -# Jupyter Notebook -.ipynb_checkpoints - -# IPython -profile_default/ -ipython_config.py - -# pyenv -.python-version - -# celery beat schedule file -celerybeat-schedule - -# SageMath parsed files -*.sage.py - -# Environments -.env -.venv -env/ -venv/ -ENV/ -env.bak/ -venv.bak/ - -# Spyder project settings -.spyderproject -.spyproject - -# Rope project settings -.ropeproject - -# mkdocs documentation -/site - -# mypy -.mypy_cache/ -.dmypy.json -dmypy.json - -# Pyre type checker -.pyre/ - -# vscode -.vs -.vscode - -# Pycharm -.idea - -# TF code -tensorflow_code - -# Models -proc_data - -# examples -runs -/runs_old -/wandb -/examples/runs -/examples/**/*.args -/examples/rag/sweep - -# data -/data -serialization_dir - -# emacs -*.*~ -debug.env - -# vim -.*.swp - -# ctags -tags - -# pre-commit -.pre-commit* - -# .lock -*.lock - -# DS_Store (MacOS) -.DS_Store - -# RL pipelines may produce mp4 outputs -*.mp4 - -# dependencies -/transformers - -# ruff -.ruff_cache - -# wandb -wandb \ No newline at end of file diff --git a/diffusers/CITATION.cff b/diffusers/CITATION.cff deleted file mode 100644 index 09fc6c744d06407fa6b4707af1afe3cc9529b4db..0000000000000000000000000000000000000000 --- a/diffusers/CITATION.cff +++ /dev/null @@ -1,52 +0,0 @@ -cff-version: 1.2.0 -title: 'Diffusers: State-of-the-art diffusion models' -message: >- - If you use this software, please cite it using the - metadata from this file. -type: software -authors: - - given-names: Patrick - family-names: von Platen - - given-names: Suraj - family-names: Patil - - given-names: Anton - family-names: Lozhkov - - given-names: Pedro - family-names: Cuenca - - given-names: Nathan - family-names: Lambert - - given-names: Kashif - family-names: Rasul - - given-names: Mishig - family-names: Davaadorj - - given-names: Dhruv - family-names: Nair - - given-names: Sayak - family-names: Paul - - given-names: Steven - family-names: Liu - - given-names: William - family-names: Berman - - given-names: Yiyi - family-names: Xu - - given-names: Thomas - family-names: Wolf -repository-code: 'https://github.com/huggingface/diffusers' -abstract: >- - Diffusers provides pretrained diffusion models across - multiple modalities, such as vision and audio, and serves - as a modular toolbox for inference and training of - diffusion models. 
-keywords: - - deep-learning - - pytorch - - image-generation - - hacktoberfest - - diffusion - - text2image - - image2image - - score-based-generative-modeling - - stable-diffusion - - stable-diffusion-diffusers -license: Apache-2.0 -version: 0.12.1 diff --git a/diffusers/CODE_OF_CONDUCT.md b/diffusers/CODE_OF_CONDUCT.md deleted file mode 100644 index 2139079964fbd53692380985e60ef90e2fa05dad..0000000000000000000000000000000000000000 --- a/diffusers/CODE_OF_CONDUCT.md +++ /dev/null @@ -1,130 +0,0 @@ - -# Contributor Covenant Code of Conduct - -## Our Pledge - -We as members, contributors, and leaders pledge to make participation in our -community a harassment-free experience for everyone, regardless of age, body -size, visible or invisible disability, ethnicity, sex characteristics, gender -identity and expression, level of experience, education, socio-economic status, -nationality, personal appearance, race, caste, color, religion, or sexual identity -and orientation. - -We pledge to act and interact in ways that contribute to an open, welcoming, -diverse, inclusive, and healthy community. - -## Our Standards - -Examples of behavior that contributes to a positive environment for our -community include: - -* Demonstrating empathy and kindness toward other people -* Being respectful of differing opinions, viewpoints, and experiences -* Giving and gracefully accepting constructive feedback -* Accepting responsibility and apologizing to those affected by our mistakes, - and learning from the experience -* Focusing on what is best not just for us as individuals, but for the - overall Diffusers community - -Examples of unacceptable behavior include: - -* The use of sexualized language or imagery, and sexual attention or - advances of any kind -* Trolling, insulting or derogatory comments, and personal or political attacks -* Public or private harassment -* Publishing others' private information, such as a physical or email - address, without their explicit permission -* Spamming issues or PRs with links to projects unrelated to this library -* Other conduct which could reasonably be considered inappropriate in a - professional setting - -## Enforcement Responsibilities - -Community leaders are responsible for clarifying and enforcing our standards of -acceptable behavior and will take appropriate and fair corrective action in -response to any behavior that they deem inappropriate, threatening, offensive, -or harmful. - -Community leaders have the right and responsibility to remove, edit, or reject -comments, commits, code, wiki edits, issues, and other contributions that are -not aligned to this Code of Conduct, and will communicate reasons for moderation -decisions when appropriate. - -## Scope - -This Code of Conduct applies within all community spaces, and also applies when -an individual is officially representing the community in public spaces. -Examples of representing our community include using an official e-mail address, -posting via an official social media account, or acting as an appointed -representative at an online or offline event. - -## Enforcement - -Instances of abusive, harassing, or otherwise unacceptable behavior may be -reported to the community leaders responsible for enforcement at -feedback@huggingface.co. -All complaints will be reviewed and investigated promptly and fairly. - -All community leaders are obligated to respect the privacy and security of the -reporter of any incident. 
- -## Enforcement Guidelines - -Community leaders will follow these Community Impact Guidelines in determining -the consequences for any action they deem in violation of this Code of Conduct: - -### 1. Correction - -**Community Impact**: Use of inappropriate language or other behavior deemed -unprofessional or unwelcome in the community. - -**Consequence**: A private, written warning from community leaders, providing -clarity around the nature of the violation and an explanation of why the -behavior was inappropriate. A public apology may be requested. - -### 2. Warning - -**Community Impact**: A violation through a single incident or series -of actions. - -**Consequence**: A warning with consequences for continued behavior. No -interaction with the people involved, including unsolicited interaction with -those enforcing the Code of Conduct, for a specified period of time. This -includes avoiding interactions in community spaces as well as external channels -like social media. Violating these terms may lead to a temporary or -permanent ban. - -### 3. Temporary Ban - -**Community Impact**: A serious violation of community standards, including -sustained inappropriate behavior. - -**Consequence**: A temporary ban from any sort of interaction or public -communication with the community for a specified period of time. No public or -private interaction with the people involved, including unsolicited interaction -with those enforcing the Code of Conduct, is allowed during this period. -Violating these terms may lead to a permanent ban. - -### 4. Permanent Ban - -**Community Impact**: Demonstrating a pattern of violation of community -standards, including sustained inappropriate behavior, harassment of an -individual, or aggression toward or disparagement of classes of individuals. - -**Consequence**: A permanent ban from any sort of public interaction within -the community. - -## Attribution - -This Code of Conduct is adapted from the [Contributor Covenant][homepage], -version 2.1, available at -https://www.contributor-covenant.org/version/2/1/code_of_conduct.html. - -Community Impact Guidelines were inspired by [Mozilla's code of conduct -enforcement ladder](https://github.com/mozilla/diversity). - -[homepage]: https://www.contributor-covenant.org - -For answers to common questions about this code of conduct, see the FAQ at -https://www.contributor-covenant.org/faq. Translations are available at -https://www.contributor-covenant.org/translations. diff --git a/diffusers/CONTRIBUTING.md b/diffusers/CONTRIBUTING.md deleted file mode 100644 index 049d317599ad2292504acf3d139b332bc64cbe74..0000000000000000000000000000000000000000 --- a/diffusers/CONTRIBUTING.md +++ /dev/null @@ -1,506 +0,0 @@ - - -# How to contribute to Diffusers 🧨 - -We ❤️ contributions from the open-source community! Everyone is welcome, and all types of participation –not just code– are valued and appreciated. Answering questions, helping others, reaching out, and improving the documentation are all immensely valuable to the community, so don't be afraid and get involved if you're up for it! - -Everyone is encouraged to start by saying 👋 in our public Discord channel. We discuss the latest trends in diffusion models, ask questions, show off personal projects, help each other with contributions, or just hang out ☕. Join us on Discord - -Whichever way you choose to contribute, we strive to be part of an open, welcoming, and kind community. 
Please, read our [code of conduct](https://github.com/huggingface/diffusers/blob/main/CODE_OF_CONDUCT.md) and be mindful to respect it during your interactions. We also recommend you become familiar with the [ethical guidelines](https://huggingface.co/docs/diffusers/conceptual/ethical_guidelines) that guide our project and ask you to adhere to the same principles of transparency and responsibility. - -We enormously value feedback from the community, so please do not be afraid to speak up if you believe you have valuable feedback that can help improve the library - every message, comment, issue, and pull request (PR) is read and considered. - -## Overview - -You can contribute in many ways ranging from answering questions on issues to adding new diffusion models to -the core library. - -In the following, we give an overview of different ways to contribute, ranked by difficulty in ascending order. All of them are valuable to the community. - -* 1. Asking and answering questions on [the Diffusers discussion forum](https://discuss.huggingface.co/c/discussion-related-to-httpsgithubcomhuggingfacediffusers) or on [Discord](https://discord.gg/G7tWnz98XR). -* 2. Opening new issues on [the GitHub Issues tab](https://github.com/huggingface/diffusers/issues/new/choose). -* 3. Answering issues on [the GitHub Issues tab](https://github.com/huggingface/diffusers/issues). -* 4. Fix a simple issue, marked by the "Good first issue" label, see [here](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22). -* 5. Contribute to the [documentation](https://github.com/huggingface/diffusers/tree/main/docs/source). -* 6. Contribute a [Community Pipeline](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3Acommunity-examples). -* 7. Contribute to the [examples](https://github.com/huggingface/diffusers/tree/main/examples). -* 8. Fix a more difficult issue, marked by the "Good second issue" label, see [here](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22Good+second+issue%22). -* 9. Add a new pipeline, model, or scheduler, see ["New Pipeline/Model"](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22New+pipeline%2Fmodel%22) and ["New scheduler"](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22New+scheduler%22) issues. For this contribution, please have a look at [Design Philosophy](https://github.com/huggingface/diffusers/blob/main/PHILOSOPHY.md). - -As said before, **all contributions are valuable to the community**. -In the following, we will explain each contribution a bit more in detail. - -For all contributions 4-9, you will need to open a PR. It is explained in detail how to do so in [Opening a pull request](#how-to-open-a-pr). - -### 1. Asking and answering questions on the Diffusers discussion forum or on the Diffusers Discord - -Any question or comment related to the Diffusers library can be asked on the [discussion forum](https://discuss.huggingface.co/c/discussion-related-to-httpsgithubcomhuggingfacediffusers/) or on [Discord](https://discord.gg/G7tWnz98XR). 
Such questions and comments include (but are not limited to): -- Reports of training or inference experiments in an attempt to share knowledge -- Presentation of personal projects -- Questions to non-official training examples -- Project proposals -- General feedback -- Paper summaries -- Asking for help on personal projects that build on top of the Diffusers library -- General questions -- Ethical questions regarding diffusion models -- ... - -Every question that is asked on the forum or on Discord actively encourages the community to publicly -share knowledge and might very well help a beginner in the future who has the same question you're -having. Please do pose any questions you might have. -In the same spirit, you are of immense help to the community by answering such questions because this way you are publicly documenting knowledge for everybody to learn from. - -**Please** keep in mind that the more effort you put into asking or answering a question, the higher -the quality of the publicly documented knowledge. In the same way, well-posed and well-answered questions create a high-quality knowledge database accessible to everybody, while badly posed questions or answers reduce the overall quality of the public knowledge database. -In short, a high quality question or answer is *precise*, *concise*, *relevant*, *easy-to-understand*, *accessible*, and *well-formatted/well-posed*. For more information, please have a look through the [How to write a good issue](#how-to-write-a-good-issue) section. - -**NOTE about channels**: -[*The forum*](https://discuss.huggingface.co/c/discussion-related-to-httpsgithubcomhuggingfacediffusers/63) is much better indexed by search engines, such as Google. Posts are ranked by popularity rather than chronologically. Hence, it's easier to look up questions and answers that we posted some time ago. -In addition, questions and answers posted in the forum can easily be linked to. -In contrast, *Discord* has a chat-like format that invites fast back-and-forth communication. -While it will most likely take less time for you to get an answer to your question on Discord, your -question won't be visible anymore over time. Also, it's much harder to find information that was posted a while back on Discord. We therefore strongly recommend using the forum for high-quality questions and answers in an attempt to create long-lasting knowledge for the community. If discussions on Discord lead to very interesting answers and conclusions, we recommend posting the results on the forum to make the information more available for future readers. - -### 2. Opening new issues on the GitHub issues tab - -The 🧨 Diffusers library is robust and reliable thanks to the users who notify us of -the problems they encounter. So thank you for reporting an issue. - -Remember, GitHub issues are reserved for technical questions directly related to the Diffusers library, bug reports, feature requests, or feedback on the library design. - -In a nutshell, this means that everything that is **not** related to the **code of the Diffusers library** (including the documentation) should **not** be asked on GitHub, but rather on either the [forum](https://discuss.huggingface.co/c/discussion-related-to-httpsgithubcomhuggingfacediffusers/63) or [Discord](https://discord.gg/G7tWnz98XR). - -**Please consider the following guidelines when opening a new issue**: -- Make sure you have searched whether your issue has already been asked before (use the search bar on GitHub under Issues). 
-- Please never report a new issue on another (related) issue. If another issue is highly related, please -open a new issue nevertheless and link to the related issue. -- Make sure your issue is written in English. Please use one of the great, free online translation services, such as [DeepL](https://www.deepl.com/translator) to translate from your native language to English if you are not comfortable in English. -- Check whether your issue might be solved by updating to the newest Diffusers version. Before posting your issue, please make sure that `python -c "import diffusers; print(diffusers.__version__)"` is higher or matches the latest Diffusers version. -- Remember that the more effort you put into opening a new issue, the higher the quality of your answer will be and the better the overall quality of the Diffusers issues. - -New issues usually include the following. - -#### 2.1. Reproducible, minimal bug reports - -A bug report should always have a reproducible code snippet and be as minimal and concise as possible. -This means in more detail: -- Narrow the bug down as much as you can, **do not just dump your whole code file**. -- Format your code. -- Do not include any external libraries except for Diffusers depending on them. -- **Always** provide all necessary information about your environment; for this, you can run: `diffusers-cli env` in your shell and copy-paste the displayed information to the issue. -- Explain the issue. If the reader doesn't know what the issue is and why it is an issue, she cannot solve it. -- **Always** make sure the reader can reproduce your issue with as little effort as possible. If your code snippet cannot be run because of missing libraries or undefined variables, the reader cannot help you. Make sure your reproducible code snippet is as minimal as possible and can be copy-pasted into a simple Python shell. -- If in order to reproduce your issue a model and/or dataset is required, make sure the reader has access to that model or dataset. You can always upload your model or dataset to the [Hub](https://huggingface.co) to make it easily downloadable. Try to keep your model and dataset as small as possible, to make the reproduction of your issue as effortless as possible. - -For more information, please have a look through the [How to write a good issue](#how-to-write-a-good-issue) section. - -You can open a bug report [here](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=bug&projects=&template=bug-report.yml). - -#### 2.2. Feature requests - -A world-class feature request addresses the following points: - -1. Motivation first: -* Is it related to a problem/frustration with the library? If so, please explain -why. Providing a code snippet that demonstrates the problem is best. -* Is it related to something you would need for a project? We'd love to hear -about it! -* Is it something you worked on and think could benefit the community? -Awesome! Tell us what problem it solved for you. -2. Write a *full paragraph* describing the feature; -3. Provide a **code snippet** that demonstrates its future use; -4. In case this is related to a paper, please attach a link; -5. Attach any additional information (drawings, screenshots, etc.) you think may help. - -You can open a feature request [here](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=&template=feature_request.md&title=). 
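To make point 3 above concrete, a feature request usually includes a short snippet that shows the API you wish existed. Here is a minimal sketch of that format; the `enable_fancy_offloading` method is purely hypothetical and the checkpoint name is only an example:

```python
import torch
from diffusers import DiffusionPipeline

# Hypothetical usage sketch for a feature request. `enable_fancy_offloading`
# does NOT exist in Diffusers; it only illustrates the API being proposed.
pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe.enable_fancy_offloading(keep_on_gpu=["unet"])  # proposed new API
image = pipe("An astronaut riding a horse").images[0]
```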
- -#### 2.3 Feedback - -Feedback about the library design and why it is good or not good helps the core maintainers immensely to build a user-friendly library. To understand the philosophy behind the current design philosophy, please have a look [here](https://huggingface.co/docs/diffusers/conceptual/philosophy). If you feel like a certain design choice does not fit with the current design philosophy, please explain why and how it should be changed. If a certain design choice follows the design philosophy too much, hence restricting use cases, explain why and how it should be changed. -If a certain design choice is very useful for you, please also leave a note as this is great feedback for future design decisions. - -You can open an issue about feedback [here](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=&template=feedback.md&title=). - -#### 2.4 Technical questions - -Technical questions are mainly about why certain code of the library was written in a certain way, or what a certain part of the code does. Please make sure to link to the code in question and please provide detail on -why this part of the code is difficult to understand. - -You can open an issue about a technical question [here](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=bug&template=bug-report.yml). - -#### 2.5 Proposal to add a new model, scheduler, or pipeline - -If the diffusion model community released a new model, pipeline, or scheduler that you would like to see in the Diffusers library, please provide the following information: - -* Short description of the diffusion pipeline, model, or scheduler and link to the paper or public release. -* Link to any of its open-source implementation. -* Link to the model weights if they are available. - -If you are willing to contribute to the model yourself, let us know so we can best guide you. Also, don't forget -to tag the original author of the component (model, scheduler, pipeline, etc.) by GitHub handle if you can find it. - -You can open a request for a model/pipeline/scheduler [here](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=New+model%2Fpipeline%2Fscheduler&template=new-model-addition.yml). - -### 3. Answering issues on the GitHub issues tab - -Answering issues on GitHub might require some technical knowledge of Diffusers, but we encourage everybody to give it a try even if you are not 100% certain that your answer is correct. -Some tips to give a high-quality answer to an issue: -- Be as concise and minimal as possible. -- Stay on topic. An answer to the issue should concern the issue and only the issue. -- Provide links to code, papers, or other sources that prove or encourage your point. -- Answer in code. If a simple code snippet is the answer to the issue or shows how the issue can be solved, please provide a fully reproducible code snippet. - -Also, many issues tend to be simply off-topic, duplicates of other issues, or irrelevant. It is of great -help to the maintainers if you can answer such issues, encouraging the author of the issue to be -more precise, provide the link to a duplicated issue or redirect them to [the forum](https://discuss.huggingface.co/c/discussion-related-to-httpsgithubcomhuggingfacediffusers/63) or [Discord](https://discord.gg/G7tWnz98XR). - -If you have verified that the issued bug report is correct and requires a correction in the source code, -please have a look at the next sections. - -For all of the following contributions, you will need to open a PR. 
It is explained in detail how to do so in the [Opening a pull request](#how-to-open-a-pr) section. - -### 4. Fixing a "Good first issue" - -*Good first issues* are marked by the [Good first issue](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22) label. Usually, the issue already -explains how a potential solution should look so that it is easier to fix. -If the issue hasn't been closed and you would like to try to fix it, just leave a message saying "I would like to try this issue". There are usually three scenarios: -- a.) The issue description already proposes a fix. In this case, if the solution makes sense to you, you can open a PR or draft PR to fix it. -- b.) The issue description does not propose a fix. In this case, you can ask what a proposed fix could look like and someone from the Diffusers team should answer shortly. If you have a good idea of how to fix it, feel free to directly open a PR. -- c.) There is already an open PR to fix the issue, but the issue hasn't been closed yet. If the PR has gone stale, you can simply open a new PR and link to the stale PR. PRs often go stale if the original contributor who wanted to fix the issue suddenly cannot find the time anymore to proceed. This often happens in open-source and is very normal. In this case, the community will be very happy if you give it a new try and leverage the knowledge of the existing PR. If there is already a PR and it is active, you can help the author by giving suggestions, reviewing the PR or even asking whether you can contribute to the PR. - - -### 5. Contribute to the documentation - -A good library **always** has good documentation! The official documentation is often one of the first points of contact for new users of the library, and therefore contributing to the documentation is a **highly -valuable contribution**. - -Contributing to the documentation can take many forms: - -- Correcting spelling or grammatical errors. -- Correcting incorrect docstring formatting. If you see that the official documentation is weirdly displayed or a link is broken, we are very happy if you take some time to correct it. -- Correcting the shape or dimensions of a docstring input or output tensor. -- Clarifying documentation that is hard to understand or incorrect. -- Updating outdated code examples. -- Translating the documentation to another language. - -Anything displayed on [the official Diffusers doc page](https://huggingface.co/docs/diffusers/index) is part of the official documentation and can be corrected and adjusted in the respective [documentation source](https://github.com/huggingface/diffusers/tree/main/docs/source). - -Please have a look at [this page](https://github.com/huggingface/diffusers/tree/main/docs) on how to verify changes made to the documentation locally. - - -### 6. Contribute a community pipeline - -[Pipelines](https://huggingface.co/docs/diffusers/api/pipelines/overview) are usually the first point of contact between the Diffusers library and the user. -Pipelines are examples of how to use Diffusers [models](https://huggingface.co/docs/diffusers/api/models/overview) and [schedulers](https://huggingface.co/docs/diffusers/api/schedulers/overview). -We support two types of pipelines: - -- Official Pipelines -- Community Pipelines - -Both official and community pipelines follow the same design and consist of the same type of components. - -Official pipelines are tested and maintained by the core maintainers of Diffusers.
Their code -resides in [src/diffusers/pipelines](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines). -In contrast, community pipelines are contributed and maintained purely by the **community** and are **not** tested. -They reside in [examples/community](https://github.com/huggingface/diffusers/tree/main/examples/community) and while they can be accessed via the [PyPI diffusers package](https://pypi.org/project/diffusers/), their code is not part of the PyPI distribution. - -The reason for the distinction is that the core maintainers of the Diffusers library cannot maintain and test all -possible ways diffusion models can be used for inference, but some of them may be of interest to the community. -Officially released diffusion pipelines, -such as Stable Diffusion are added to the core src/diffusers/pipelines package which ensures -high quality of maintenance, no backward-breaking code changes, and testing. -More bleeding edge pipelines should be added as community pipelines. If usage for a community pipeline is high, the pipeline can be moved to the official pipelines upon request from the community. This is one of the ways we strive to be a community-driven library. - -To add a community pipeline, one should add a .py file to [examples/community](https://github.com/huggingface/diffusers/tree/main/examples/community) and adapt the [examples/community/README.md](https://github.com/huggingface/diffusers/tree/main/examples/community/README.md) to include an example of the new pipeline. - -An example can be seen [here](https://github.com/huggingface/diffusers/pull/2400). - -Community pipeline PRs are only checked at a superficial level and ideally they should be maintained by their original authors. - -Contributing a community pipeline is a great way to understand how Diffusers models and schedulers work. Having contributed a community pipeline is usually the first stepping stone to contributing an official pipeline to the -core package. - -### 7. Contribute to training examples - -Diffusers examples are a collection of training scripts that reside in [examples](https://github.com/huggingface/diffusers/tree/main/examples). - -We support two types of training examples: - -- Official training examples -- Research training examples - -Research training examples are located in [examples/research_projects](https://github.com/huggingface/diffusers/tree/main/examples/research_projects) whereas official training examples include all folders under [examples](https://github.com/huggingface/diffusers/tree/main/examples) except the `research_projects` and `community` folders. -The official training examples are maintained by the Diffusers' core maintainers whereas the research training examples are maintained by the community. -This is because of the same reasons put forward in [6. Contribute a community pipeline](#6-contribute-a-community-pipeline) for official pipelines vs. community pipelines: It is not feasible for the core maintainers to maintain all possible training methods for diffusion models. -If the Diffusers core maintainers and the community consider a certain training paradigm to be too experimental or not popular enough, the corresponding training code should be put in the `research_projects` folder and maintained by the author. - -Both official training and research examples consist of a directory that contains one or more training scripts, a `requirements.txt` file, and a `README.md` file. 
In order for the user to make use of the -training examples, it is required to clone the repository: - -```bash -git clone https://github.com/huggingface/diffusers -``` - -as well as to install all additional dependencies required for training: - -```bash -cd diffusers -pip install -r examples/<your-example-folder>/requirements.txt -``` - -Therefore, when adding an example, the `requirements.txt` file shall define all pip dependencies required for your training example so that once all those are installed, the user can run the example's training script. See, for example, the [DreamBooth `requirements.txt` file](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/requirements.txt). - -Training examples of the Diffusers library should adhere to the following philosophy: -- All the code necessary to run the examples should be found in a single Python file. -- One should be able to run the example from the command line with `python <your-example>.py --args`. -- Examples should be kept simple and serve as **an example** of how to use Diffusers for training. The purpose of example scripts is **not** to create state-of-the-art diffusion models, but rather to reproduce known training schemes without adding too much custom logic. As a byproduct of this point, our examples also strive to serve as good educational materials. - -To contribute an example, it is highly recommended to look at already existing examples such as [dreambooth](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth.py) to get an idea of what they should look like. -We strongly advise contributors to make use of the [Accelerate library](https://github.com/huggingface/accelerate) as it's tightly integrated -with Diffusers. -Once an example script works, please make sure to add a comprehensive `README.md` that states how to use the example exactly. This README should include: -- An example command on how to run the example script as shown [here e.g.](https://github.com/huggingface/diffusers/tree/main/examples/dreambooth#running-locally-with-pytorch). -- A link to some training results (logs, models, ...) that show what the user can expect as shown [here e.g.](https://api.wandb.ai/report/patrickvonplaten/xm6cd5q5). -- If you are adding a non-official/research training example, **please don't forget** to add a sentence stating that you are maintaining this training example, including your git handle, as shown [here](https://github.com/huggingface/diffusers/tree/main/examples/research_projects/intel_opts#diffusers-examples-with-intel-optimizations). - -If you are contributing to the official training examples, please also make sure to add a test to [examples/test_examples.py](https://github.com/huggingface/diffusers/blob/main/examples/test_examples.py). This is not necessary for non-official training examples. - -### 8. Fixing a "Good second issue" - -*Good second issues* are marked by the [Good second issue](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22Good+second+issue%22) label. Good second issues are -usually more complicated to solve than [Good first issues](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22). -The issue description usually gives less guidance on how to fix the issue and requires -a decent understanding of the library by the interested contributor. -If you are interested in tackling a good second issue, feel free to open a PR to fix it and link the PR to the issue.
If you see that a PR has already been opened for this issue but did not get merged, have a look to understand why it wasn't merged and try to open an improved PR. -Good second issues are usually more difficult to get merged compared to good first issues, so don't hesitate to ask for help from the core maintainers. If your PR is almost finished the core maintainers can also jump into your PR and commit to it in order to get it merged. - -### 9. Adding pipelines, models, schedulers - -Pipelines, models, and schedulers are the most important pieces of the Diffusers library. -They provide easy access to state-of-the-art diffusion technologies and thus allow the community to -build powerful generative AI applications. - -By adding a new model, pipeline, or scheduler you might enable a new powerful use case for any of the user interfaces relying on Diffusers which can be of immense value for the whole generative AI ecosystem. - -Diffusers has a couple of open feature requests for all three components - feel free to gloss over them -if you don't know yet what specific component you would like to add: -- [Model or pipeline](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22New+pipeline%2Fmodel%22) -- [Scheduler](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22New+scheduler%22) - -Before adding any of the three components, it is strongly recommended that you give the [Philosophy guide](https://github.com/huggingface/diffusers/blob/main/PHILOSOPHY.md) a read to better understand the design of any of the three components. Please be aware that -we cannot merge model, scheduler, or pipeline additions that strongly diverge from our design philosophy -as it will lead to API inconsistencies. If you fundamentally disagree with a design choice, please -open a [Feedback issue](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=&template=feedback.md&title=) instead so that it can be discussed whether a certain design -pattern/design choice shall be changed everywhere in the library and whether we shall update our design philosophy. Consistency across the library is very important for us. - -Please make sure to add links to the original codebase/paper to the PR and ideally also ping the -original author directly on the PR so that they can follow the progress and potentially help with questions. - -If you are unsure or stuck in the PR, don't hesitate to leave a message to ask for a first review or help. - -## How to write a good issue - -**The better your issue is written, the higher the chances that it will be quickly resolved.** - -1. Make sure that you've used the correct template for your issue. You can pick between *Bug Report*, *Feature Request*, *Feedback about API Design*, *New model/pipeline/scheduler addition*, *Forum*, or a blank issue. Make sure to pick the correct one when opening [a new issue](https://github.com/huggingface/diffusers/issues/new/choose). -2. **Be precise**: Give your issue a fitting title. Try to formulate your issue description as simple as possible. The more precise you are when submitting an issue, the less time it takes to understand the issue and potentially solve it. Make sure to open an issue for one issue only and not for multiple issues. If you found multiple issues, simply open multiple issues. If your issue is a bug, try to be as precise as possible about what bug it is - you should not just write "Error in diffusers". -3. **Reproducibility**: No reproducible code snippet == no solution. 
If you encounter a bug, maintainers **have to be able to reproduce** it. Make sure that you include a code snippet that can be copy-pasted into a Python interpreter to reproduce the issue. Make sure that your code snippet works, *i.e.* that there are no missing imports or missing links to images, ... Your issue should contain an error message **and** a code snippet that can be copy-pasted without any changes to reproduce the exact same error message. If your issue is using local model weights or local data that cannot be accessed by the reader, the issue cannot be solved. If you cannot share your data or model, try to make a dummy model or dummy data. -4. **Minimalistic**: Try to help the reader as much as you can to understand the issue as quickly as possible by staying as concise as possible. Remove all code / all information that is irrelevant to the issue. If you have found a bug, try to create the easiest code example you can to demonstrate your issue, do not just dump your whole workflow into the issue as soon as you have found a bug. E.g., if you train a model and get an error at some point during the training, you should first try to understand what part of the training code is responsible for the error and try to reproduce it with a couple of lines. Try to use dummy data instead of full datasets. -5. Add links. If you are referring to a certain naming, method, or model make sure to provide a link so that the reader can better understand what you mean. If you are referring to a specific PR or issue, make sure to link it to your issue. Do not assume that the reader knows what you are talking about. The more links you add to your issue the better. -6. Formatting. Make sure to nicely format your issue by formatting code into Python code syntax, and error messages into normal code syntax. See the [official GitHub formatting docs](https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax) for more information. -7. Think of your issue not as a ticket to be solved, but rather as a beautiful entry to a well-written encyclopedia. Every added issue is a contribution to publicly available knowledge. By adding a nicely written issue you not only make it easier for maintainers to solve your issue, but you are helping the whole community to better understand a certain aspect of the library. - -## How to write a good PR - -1. Be a chameleon. Understand existing design patterns and syntax and make sure your code additions flow seamlessly into the existing code base. Pull requests that significantly diverge from existing design patterns or user interfaces will not be merged. -2. Be laser focused. A pull request should solve one problem and one problem only. Make sure to not fall into the trap of "also fixing another problem while we're adding it". It is much more difficult to review pull requests that solve multiple, unrelated problems at once. -3. If helpful, try to add a code snippet that displays an example of how your addition can be used. -4. The title of your pull request should be a summary of its contribution. -5. If your pull request addresses an issue, please mention the issue number in -the pull request description to make sure they are linked (and people -consulting the issue know you are working on it); -6. To indicate a work in progress please prefix the title with `[WIP]`. These -are useful to avoid duplicated work, and to differentiate it from PRs ready -to be merged; -7. 
Try to formulate and format your text as explained in [How to write a good issue](#how-to-write-a-good-issue). -8. Make sure existing tests pass; -9. Add high-coverage tests. No quality testing = no merge. -- If you are adding new `@slow` tests, make sure they pass using -`RUN_SLOW=1 python -m pytest tests/test_my_new_model.py`. -CircleCI does not run the slow tests, but GitHub Actions does every night! -10. All public methods must have informative docstrings that work nicely with markdown. See [`pipeline_latent_diffusion.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion.py) for an example. -11. Due to the rapidly growing repository, it is important to make sure that no files that would significantly weigh down the repository are added. This includes images, videos, and other non-text files. We prefer to leverage a hf.co hosted `dataset` like -[`hf-internal-testing`](https://huggingface.co/hf-internal-testing) or [huggingface/documentation-images](https://huggingface.co/datasets/huggingface/documentation-images) to place these files. -If your contribution is external, feel free to add the images to your PR and ask a Hugging Face member to migrate your images -to this dataset. - -## How to open a PR - -Before writing code, we strongly advise you to search through the existing PRs or -issues to make sure that nobody is already working on the same thing. If you are -unsure, it is always a good idea to open an issue to get some feedback. - -You will need basic `git` proficiency to be able to contribute to -🧨 Diffusers. `git` is not the easiest tool to use but it has the greatest -manual. Type `git --help` in a shell and enjoy. If you prefer books, [Pro -Git](https://git-scm.com/book/en/v2) is a very good reference. - -Follow these steps to start contributing ([supported Python versions](https://github.com/huggingface/diffusers/blob/42f25d601a910dceadaee6c44345896b4cfa9928/setup.py#L270)): - -1. Fork the [repository](https://github.com/huggingface/diffusers) by -clicking on the 'Fork' button on the repository's page. This creates a copy of the code -under your GitHub user account. - -2. Clone your fork to your local disk, and add the base repository as a remote: - - ```bash - $ git clone git@github.com:<your GitHub handle>/diffusers.git - $ cd diffusers - $ git remote add upstream https://github.com/huggingface/diffusers.git - ``` - -3. Create a new branch to hold your development changes: - - ```bash - $ git checkout -b a-descriptive-name-for-my-changes - ``` - -**Do not** work on the `main` branch. - -4. Set up a development environment by running the following command in a virtual environment: - - ```bash - $ pip install -e ".[dev]" - ``` - -If you have already cloned the repo, you might need to `git pull` to get the most recent changes in the -library. - -5. Develop the features on your branch. - -As you work on the features, you should make sure that the test suite -passes. You should run the tests impacted by your changes like this: - - ```bash - $ pytest tests/<TEST_TO_RUN>.py - ``` - -Before you run the tests, please make sure you install the dependencies required for testing. You can do so -with this command: - - ```bash - $ pip install -e ".[test]" - ``` - -You can also run the full test suite with the following command, but it takes -a beefy machine to produce a result in a decent amount of time now that -Diffusers has grown a lot.
Here is the command for it: - - ```bash - $ make test - ``` - -🧨 Diffusers relies on `ruff` and `isort` to format its source code -consistently. After you make changes, apply automatic style corrections and code verifications -that can't be automated in one go with: - - ```bash - $ make style - ``` - -🧨 Diffusers also uses `ruff` and a few custom scripts to check for coding mistakes. Quality -control runs in CI, however, you can also run the same checks with: - - ```bash - $ make quality - ``` - -Once you're happy with your changes, add the changed files using `git add` and -make a commit with `git commit` to record your changes locally: - - ```bash - $ git add modified_file.py - $ git commit -m "A descriptive message about your changes." - ``` - -It is a good idea to sync your copy of the code with the original -repository regularly. This way you can quickly account for changes: - - ```bash - $ git pull upstream main - ``` - -Push the changes to your account using: - - ```bash - $ git push -u origin a-descriptive-name-for-my-changes - ``` - -6. Once you are satisfied, go to the -webpage of your fork on GitHub. Click on 'Pull request' to send your changes -to the project maintainers for review. - -7. It's ok if maintainers ask you for changes. It happens to core contributors -too! So everyone can see the changes in the Pull request, work in your local -branch and push the changes to your fork. They will automatically appear in -the pull request. - -### Tests - -An extensive test suite is included to test the library behavior and several examples. Library tests can be found in -the [tests folder](https://github.com/huggingface/diffusers/tree/main/tests). - -We like `pytest` and `pytest-xdist` because the latter makes test runs faster. From the root of the -repository, here's how to run tests with `pytest` for the library: - -```bash -$ python -m pytest -n auto --dist=loadfile -s -v ./tests/ -``` - -In fact, that's how `make test` is implemented! - -You can specify a smaller set of tests in order to test only the feature -you're working on. - -By default, slow tests are skipped. Set the `RUN_SLOW` environment variable to -`yes` to run them. This will download many gigabytes of models — make sure you -have enough disk space and a good Internet connection, or a lot of patience! - -```bash -$ RUN_SLOW=yes python -m pytest -n auto --dist=loadfile -s -v ./tests/ -``` - -`unittest` is fully supported, here's how to run tests with it: - -```bash -$ python -m unittest discover -s tests -t . -v -$ python -m unittest discover -s examples -t examples -v -``` - -### Syncing forked main with upstream (HuggingFace) main - -To avoid pinging the upstream repository, which adds reference notes to each upstream PR and sends unnecessary notifications to the developers involved in these PRs, -please follow these steps when syncing the main branch of a forked repository: -1. When possible, avoid syncing with the upstream using a branch and PR on the forked repository. Instead, merge directly into the forked main. -2. If a PR is absolutely necessary, use the following steps after checking out your branch: -```bash -$ git checkout -b your-branch-for-syncing -$ git pull --squash --no-commit upstream main -$ git commit -m '<your message without GitHub references>' -$ git push --set-upstream origin your-branch-for-syncing -``` - -### Style guide - -For documentation strings, 🧨 Diffusers follows the [Google style](https://google.github.io/styleguide/pyguide.html).
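To illustrate that style, here is a toy function (invented purely for this example) documented with a Google-style docstring:

```python
def scale_noise(sample, noise, strength):
    """Blend noise into a sample at a given strength.

    This toy function only illustrates the Google docstring style used in Diffusers.

    Args:
        sample: The clean input tensor.
        noise: A tensor with the same shape as `sample`.
        strength: A float in `[0, 1]`; `0` returns `sample` unchanged, `1` returns `noise`.

    Returns:
        The blended tensor `(1 - strength) * sample + strength * noise`.

    Raises:
        ValueError: If `strength` is outside `[0, 1]`.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError(f"`strength` must be in [0, 1], got {strength}.")
    return (1 - strength) * sample + strength * noise
```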
diff --git a/diffusers/LICENSE b/diffusers/LICENSE deleted file mode 100644 index 261eeb9e9f8b2b4b0d119366dda99c6fd7d35c64..0000000000000000000000000000000000000000 --- a/diffusers/LICENSE +++ /dev/null @@ -1,201 +0,0 @@ - Apache License - Version 2.0, January 2004 - http://www.apache.org/licenses/ - - TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION - - 1. Definitions. - - "License" shall mean the terms and conditions for use, reproduction, - and distribution as defined by Sections 1 through 9 of this document. - - "Licensor" shall mean the copyright owner or entity authorized by - the copyright owner that is granting the License. - - "Legal Entity" shall mean the union of the acting entity and all - other entities that control, are controlled by, or are under common - control with that entity. For the purposes of this definition, - "control" means (i) the power, direct or indirect, to cause the - direction or management of such entity, whether by contract or - otherwise, or (ii) ownership of fifty percent (50%) or more of the - outstanding shares, or (iii) beneficial ownership of such entity. - - "You" (or "Your") shall mean an individual or Legal Entity - exercising permissions granted by this License. - - "Source" form shall mean the preferred form for making modifications, - including but not limited to software source code, documentation - source, and configuration files. - - "Object" form shall mean any form resulting from mechanical - transformation or translation of a Source form, including but - not limited to compiled object code, generated documentation, - and conversions to other media types. - - "Work" shall mean the work of authorship, whether in Source or - Object form, made available under the License, as indicated by a - copyright notice that is included in or attached to the work - (an example is provided in the Appendix below). - - "Derivative Works" shall mean any work, whether in Source or Object - form, that is based on (or derived from) the Work and for which the - editorial revisions, annotations, elaborations, or other modifications - represent, as a whole, an original work of authorship. For the purposes - of this License, Derivative Works shall not include works that remain - separable from, or merely link (or bind by name) to the interfaces of, - the Work and Derivative Works thereof. - - "Contribution" shall mean any work of authorship, including - the original version of the Work and any modifications or additions - to that Work or Derivative Works thereof, that is intentionally - submitted to Licensor for inclusion in the Work by the copyright owner - or by an individual or Legal Entity authorized to submit on behalf of - the copyright owner. For the purposes of this definition, "submitted" - means any form of electronic, verbal, or written communication sent - to the Licensor or its representatives, including but not limited to - communication on electronic mailing lists, source code control systems, - and issue tracking systems that are managed by, or on behalf of, the - Licensor for the purpose of discussing and improving the Work, but - excluding communication that is conspicuously marked or otherwise - designated in writing by the copyright owner as "Not a Contribution." - - "Contributor" shall mean Licensor and any individual or Legal Entity - on behalf of whom a Contribution has been received by Licensor and - subsequently incorporated within the Work. - - 2. Grant of Copyright License. 
Subject to the terms and conditions of - this License, each Contributor hereby grants to You a perpetual, - worldwide, non-exclusive, no-charge, royalty-free, irrevocable - copyright license to reproduce, prepare Derivative Works of, - publicly display, publicly perform, sublicense, and distribute the - Work and such Derivative Works in Source or Object form. - - 3. Grant of Patent License. Subject to the terms and conditions of - this License, each Contributor hereby grants to You a perpetual, - worldwide, non-exclusive, no-charge, royalty-free, irrevocable - (except as stated in this section) patent license to make, have made, - use, offer to sell, sell, import, and otherwise transfer the Work, - where such license applies only to those patent claims licensable - by such Contributor that are necessarily infringed by their - Contribution(s) alone or by combination of their Contribution(s) - with the Work to which such Contribution(s) was submitted. If You - institute patent litigation against any entity (including a - cross-claim or counterclaim in a lawsuit) alleging that the Work - or a Contribution incorporated within the Work constitutes direct - or contributory patent infringement, then any patent licenses - granted to You under this License for that Work shall terminate - as of the date such litigation is filed. - - 4. Redistribution. You may reproduce and distribute copies of the - Work or Derivative Works thereof in any medium, with or without - modifications, and in Source or Object form, provided that You - meet the following conditions: - - (a) You must give any other recipients of the Work or - Derivative Works a copy of this License; and - - (b) You must cause any modified files to carry prominent notices - stating that You changed the files; and - - (c) You must retain, in the Source form of any Derivative Works - that You distribute, all copyright, patent, trademark, and - attribution notices from the Source form of the Work, - excluding those notices that do not pertain to any part of - the Derivative Works; and - - (d) If the Work includes a "NOTICE" text file as part of its - distribution, then any Derivative Works that You distribute must - include a readable copy of the attribution notices contained - within such NOTICE file, excluding those notices that do not - pertain to any part of the Derivative Works, in at least one - of the following places: within a NOTICE text file distributed - as part of the Derivative Works; within the Source form or - documentation, if provided along with the Derivative Works; or, - within a display generated by the Derivative Works, if and - wherever such third-party notices normally appear. The contents - of the NOTICE file are for informational purposes only and - do not modify the License. You may add Your own attribution - notices within Derivative Works that You distribute, alongside - or as an addendum to the NOTICE text from the Work, provided - that such additional attribution notices cannot be construed - as modifying the License. - - You may add Your own copyright statement to Your modifications and - may provide additional or different license terms and conditions - for use, reproduction, or distribution of Your modifications, or - for any such Derivative Works as a whole, provided Your use, - reproduction, and distribution of the Work otherwise complies with - the conditions stated in this License. - - 5. Submission of Contributions. 
Unless You explicitly state otherwise, - any Contribution intentionally submitted for inclusion in the Work - by You to the Licensor shall be under the terms and conditions of - this License, without any additional terms or conditions. - Notwithstanding the above, nothing herein shall supersede or modify - the terms of any separate license agreement you may have executed - with Licensor regarding such Contributions. - - 6. Trademarks. This License does not grant permission to use the trade - names, trademarks, service marks, or product names of the Licensor, - except as required for reasonable and customary use in describing the - origin of the Work and reproducing the content of the NOTICE file. - - 7. Disclaimer of Warranty. Unless required by applicable law or - agreed to in writing, Licensor provides the Work (and each - Contributor provides its Contributions) on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or - implied, including, without limitation, any warranties or conditions - of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A - PARTICULAR PURPOSE. You are solely responsible for determining the - appropriateness of using or redistributing the Work and assume any - risks associated with Your exercise of permissions under this License. - - 8. Limitation of Liability. In no event and under no legal theory, - whether in tort (including negligence), contract, or otherwise, - unless required by applicable law (such as deliberate and grossly - negligent acts) or agreed to in writing, shall any Contributor be - liable to You for damages, including any direct, indirect, special, - incidental, or consequential damages of any character arising as a - result of this License or out of the use or inability to use the - Work (including but not limited to damages for loss of goodwill, - work stoppage, computer failure or malfunction, or any and all - other commercial damages or losses), even if such Contributor - has been advised of the possibility of such damages. - - 9. Accepting Warranty or Additional Liability. While redistributing - the Work or Derivative Works thereof, You may choose to offer, - and charge a fee for, acceptance of support, warranty, indemnity, - or other liability obligations and/or rights consistent with this - License. However, in accepting such obligations, You may act only - on Your own behalf and on Your sole responsibility, not on behalf - of any other Contributor, and only if You agree to indemnify, - defend, and hold each Contributor harmless for any liability - incurred by, or claims asserted against, such Contributor by reason - of your accepting any such warranty or additional liability. - - END OF TERMS AND CONDITIONS - - APPENDIX: How to apply the Apache License to your work. - - To apply the Apache License to your work, attach the following - boilerplate notice, with the fields enclosed by brackets "[]" - replaced with your own identifying information. (Don't include - the brackets!) The text should be enclosed in the appropriate - comment syntax for the file format. We also recommend that a - file or class name and description of purpose be included on the - same "printed page" as the copyright notice for easier - identification within third-party archives. - - Copyright [yyyy] [name of copyright owner] - - Licensed under the Apache License, Version 2.0 (the "License"); - you may not use this file except in compliance with the License. 
- You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. diff --git a/diffusers/MANIFEST.in b/diffusers/MANIFEST.in deleted file mode 100644 index b22fe1a28a1ef881fdb36af3c30b14c0a5d10aa5..0000000000000000000000000000000000000000 --- a/diffusers/MANIFEST.in +++ /dev/null @@ -1,2 +0,0 @@ -include LICENSE -include src/diffusers/utils/model_card_template.md diff --git a/diffusers/Makefile b/diffusers/Makefile deleted file mode 100644 index 9af2e8b1a5c9993411ffa06e7e48f9cfec3bd164..0000000000000000000000000000000000000000 --- a/diffusers/Makefile +++ /dev/null @@ -1,96 +0,0 @@ -.PHONY: deps_table_update modified_only_fixup extra_style_checks quality style fixup fix-copies test test-examples - -# make sure to test the local checkout in scripts and not the pre-installed one (don't use quotes!) -export PYTHONPATH = src - -check_dirs := examples scripts src tests utils benchmarks - -modified_only_fixup: - $(eval modified_py_files := $(shell python utils/get_modified_files.py $(check_dirs))) - @if test -n "$(modified_py_files)"; then \ - echo "Checking/fixing $(modified_py_files)"; \ - ruff check $(modified_py_files) --fix; \ - ruff format $(modified_py_files);\ - else \ - echo "No library .py files were modified"; \ - fi - -# Update src/diffusers/dependency_versions_table.py - -deps_table_update: - @python setup.py deps_table_update - -deps_table_check_updated: - @md5sum src/diffusers/dependency_versions_table.py > md5sum.saved - @python setup.py deps_table_update - @md5sum -c --quiet md5sum.saved || (printf "\nError: the version dependency table is outdated.\nPlease run 'make fixup' or 'make style' and commit the changes.\n\n" && exit 1) - @rm md5sum.saved - -# autogenerating code - -autogenerate_code: deps_table_update - -# Check that the repo is in a good state - -repo-consistency: - python utils/check_dummies.py - python utils/check_repo.py - python utils/check_inits.py - -# this target runs checks on all files - -quality: - ruff check $(check_dirs) setup.py - ruff format --check $(check_dirs) setup.py - doc-builder style src/diffusers docs/source --max_len 119 --check_only - python utils/check_doc_toc.py - -# Format source code automatically and check is there are any problems left that need manual fixing - -extra_style_checks: - python utils/custom_init_isort.py - python utils/check_doc_toc.py --fix_and_overwrite - -# this target runs checks on all files and potentially modifies some of them - -style: - ruff check $(check_dirs) setup.py --fix - ruff format $(check_dirs) setup.py - doc-builder style src/diffusers docs/source --max_len 119 - ${MAKE} autogenerate_code - ${MAKE} extra_style_checks - -# Super fast fix and check target that only works on relevant modified files since the branch was made - -fixup: modified_only_fixup extra_style_checks autogenerate_code repo-consistency - -# Make marked copies of snippets of codes conform to the original - -fix-copies: - python utils/check_copies.py --fix_and_overwrite - python utils/check_dummies.py --fix_and_overwrite - -# Run tests for the library - -test: - python -m pytest -n auto --dist=loadfile -s -v ./tests/ - -# Run tests for examples - -test-examples: - python -m pytest -n auto --dist=loadfile 
-s -v ./examples/ - - -# Release stuff - -pre-release: - python utils/release.py - -pre-patch: - python utils/release.py --patch - -post-release: - python utils/release.py --post_release - -post-patch: - python utils/release.py --post_release --patch diff --git a/diffusers/PHILOSOPHY.md b/diffusers/PHILOSOPHY.md deleted file mode 100644 index c646c61ec429020032cd51f0d17ac007b17edc48..0000000000000000000000000000000000000000 --- a/diffusers/PHILOSOPHY.md +++ /dev/null @@ -1,110 +0,0 @@ - - -# Philosophy - -🧨 Diffusers provides **state-of-the-art** pretrained diffusion models across multiple modalities. -Its purpose is to serve as a **modular toolbox** for both inference and training. - -We aim to build a library that stands the test of time and therefore take API design very seriously. - -In a nutshell, Diffusers is built to be a natural extension of PyTorch. Therefore, most of our design choices are based on [PyTorch's Design Principles](https://pytorch.org/docs/stable/community/design.html#pytorch-design-philosophy). Let's go over the most important ones: - -## Usability over Performance - -- While Diffusers has many built-in performance-enhancing features (see [Memory and Speed](https://huggingface.co/docs/diffusers/optimization/fp16)), models are always loaded with the highest precision and lowest optimization. Therefore, by default diffusion pipelines are always instantiated on CPU with float32 precision if not otherwise defined by the user. This ensures usability across different platforms and accelerators and means that no complex installations are required to run the library. -- Diffusers aims to be a **light-weight** package and therefore has very few required dependencies, but many soft dependencies that can improve performance (such as `accelerate`, `safetensors`, `onnx`, etc...). We strive to keep the library as lightweight as possible so that it can be added without much concern as a dependency on other packages. -- Diffusers prefers simple, self-explainable code over condensed, magic code. This means that short-hand code syntaxes such as lambda functions, and advanced PyTorch operators are often not desired. - -## Simple over easy - -As PyTorch states, **explicit is better than implicit** and **simple is better than complex**. This design philosophy is reflected in multiple parts of the library: -- We follow PyTorch's API with methods like [`DiffusionPipeline.to`](https://huggingface.co/docs/diffusers/main/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.to) to let the user handle device management. -- Raising concise error messages is preferred to silently correct erroneous input. Diffusers aims at teaching the user, rather than making the library as easy to use as possible. -- Complex model vs. scheduler logic is exposed instead of magically handled inside. Schedulers/Samplers are separated from diffusion models with minimal dependencies on each other. This forces the user to write the unrolled denoising loop. However, the separation allows for easier debugging and gives the user more control over adapting the denoising process or switching out diffusion models or schedulers. -- Separately trained components of the diffusion pipeline, *e.g.* the text encoder, the UNet, and the variational autoencoder, each has their own model class. This forces the user to handle the interaction between the different model components, and the serialization format separates the model components into different files. However, this allows for easier debugging and customization. 
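As a small illustration of this separation (the checkpoint name below is only an example), the components of a pipeline can be loaded and inspected individually:

```python
from diffusers import AutoencoderKL, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer  # text components live in Transformers

repo = "stable-diffusion-v1-5/stable-diffusion-v1-5"  # example checkpoint
tokenizer = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder")
vae = AutoencoderKL.from_pretrained(repo, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")
```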
DreamBooth or Textual Inversion training -is very simple thanks to Diffusers' ability to separate single components of the diffusion pipeline. - -## Tweakable, contributor-friendly over abstraction - -For large parts of the library, Diffusers adopts an important design principle of the [Transformers library](https://github.com/huggingface/transformers), which is to prefer copy-pasted code over hasty abstractions. This design principle is very opinionated and stands in stark contrast to popular design principles such as [Don't repeat yourself (DRY)](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself). -In short, just like Transformers does for modeling files, Diffusers prefers to keep an extremely low level of abstraction and very self-contained code for pipelines and schedulers. -Functions, long code blocks, and even classes can be copied across multiple files which at first can look like a bad, sloppy design choice that makes the library unmaintainable. -**However**, this design has proven to be extremely successful for Transformers and makes a lot of sense for community-driven, open-source machine learning libraries because: -- Machine Learning is an extremely fast-moving field in which paradigms, model architectures, and algorithms are changing rapidly, which therefore makes it very difficult to define long-lasting code abstractions. -- Machine Learning practitioners like to be able to quickly tweak existing code for ideation and research and therefore prefer self-contained code over one that contains many abstractions. -- Open-source libraries rely on community contributions and therefore must build a library that is easy to contribute to. The more abstract the code, the more dependencies, the harder to read, and the harder to contribute to. Contributors simply stop contributing to very abstract libraries out of fear of breaking vital functionality. If contributing to a library cannot break other fundamental code, not only is it more inviting for potential new contributors, but it is also easier to review and contribute to multiple parts in parallel. - -At Hugging Face, we call this design the **single-file policy** which means that almost all of the code of a certain class should be written in a single, self-contained file. To read more about the philosophy, you can have a look -at [this blog post](https://huggingface.co/blog/transformers-design-philosophy). - -In Diffusers, we follow this philosophy for both pipelines and schedulers, but only partly for diffusion models. The reason we don't follow this design fully for diffusion models is because almost all diffusion pipelines, such -as [DDPM](https://huggingface.co/docs/diffusers/api/pipelines/ddpm), [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview#stable-diffusion-pipelines), [unCLIP (DALL·E 2)](https://huggingface.co/docs/diffusers/api/pipelines/unclip) and [Imagen](https://imagen.research.google/) all rely on the same diffusion model, the [UNet](https://huggingface.co/docs/diffusers/api/models/unet2d-cond). - -Great, now you should have generally understood why 🧨 Diffusers is designed the way it is 🤗. -We try to apply these design principles consistently across the library. Nevertheless, there are some minor exceptions to the philosophy or some unlucky design choices. 
If you have feedback regarding the design, we would ❤️ to hear it [directly on GitHub](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=&template=feedback.md&title=). - -## Design Philosophy in Details - -Now, let's look a bit into the nitty-gritty details of the design philosophy. Diffusers essentially consists of three major classes: [pipelines](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines), [models](https://github.com/huggingface/diffusers/tree/main/src/diffusers/models), and [schedulers](https://github.com/huggingface/diffusers/tree/main/src/diffusers/schedulers). -Let's walk through more detailed design decisions for each class. - -### Pipelines - -Pipelines are designed to be easy to use (therefore do not follow [*Simple over easy*](#simple-over-easy) 100%), are not feature complete, and should loosely be seen as examples of how to use [models](#models) and [schedulers](#schedulers) for inference. - -The following design principles are followed: -- Pipelines follow the single-file policy. All pipelines can be found in individual directories under src/diffusers/pipelines. One pipeline folder corresponds to one diffusion paper/project/release. Multiple pipeline files can be gathered in one pipeline folder, as it’s done for [`src/diffusers/pipelines/stable-diffusion`](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines/stable_diffusion). If pipelines share similar functionality, one can make use of the [# Copied from mechanism](https://github.com/huggingface/diffusers/blob/125d783076e5bd9785beb05367a2d2566843a271/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_img2img.py#L251). -- Pipelines all inherit from [`DiffusionPipeline`]. -- Every pipeline consists of different model and scheduler components, that are documented in the [`model_index.json` file](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5/blob/main/model_index.json), are accessible under the same name as attributes of the pipeline and can be shared between pipelines with [`DiffusionPipeline.components`](https://huggingface.co/docs/diffusers/main/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.components) function. -- Every pipeline should be loadable via the [`DiffusionPipeline.from_pretrained`](https://huggingface.co/docs/diffusers/main/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.from_pretrained) function. -- Pipelines should be used **only** for inference. -- Pipelines should be very readable, self-explanatory, and easy to tweak. -- Pipelines should be designed to build on top of each other and be easy to integrate into higher-level APIs. -- Pipelines are **not** intended to be feature-complete user interfaces. For feature-complete user interfaces one should rather have a look at [InvokeAI](https://github.com/invoke-ai/InvokeAI), [Diffuzers](https://github.com/abhishekkrthakur/diffuzers), and [lama-cleaner](https://github.com/Sanster/lama-cleaner). -- Every pipeline should have one and only one way to run it via a `__call__` method. The naming of the `__call__` arguments should be shared across all pipelines. -- Pipelines should be named after the task they are intended to solve. -- In almost all cases, novel diffusion pipelines shall be implemented in a new pipeline folder/file. - -### Models - -Models are designed as configurable toolboxes that are natural extensions of [PyTorch's Module class](https://pytorch.org/docs/stable/generated/torch.nn.Module.html). 
They only partly follow the **single-file policy**. - -The following design principles are followed: -- Models correspond to **a type of model architecture**. *E.g.* the [`UNet2DConditionModel`] class is used for all UNet variations that expect 2D image inputs and are conditioned on some context. -- All models can be found in [`src/diffusers/models`](https://github.com/huggingface/diffusers/tree/main/src/diffusers/models) and every model architecture shall be defined in its file, e.g. [`unets/unet_2d_condition.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/unets/unet_2d_condition.py), [`transformers/transformer_2d.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/transformers/transformer_2d.py), etc... -- Models **do not** follow the single-file policy and should make use of smaller model building blocks, such as [`attention.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention.py), [`resnet.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/resnet.py), [`embeddings.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/embeddings.py), etc... **Note**: This is in stark contrast to Transformers' modeling files and shows that models do not really follow the single-file policy. -- Models intend to expose complexity, just like PyTorch's `Module` class, and give clear error messages. -- Models all inherit from `ModelMixin` and `ConfigMixin`. -- Models can be optimized for performance when it doesn’t demand major code changes, keep backward compatibility, and give significant memory or compute gain. -- Models should by default have the highest precision and lowest performance setting. -- To integrate new model checkpoints whose general architecture can be classified as an architecture that already exists in Diffusers, the existing model architecture shall be adapted to make it work with the new checkpoint. One should only create a new file if the model architecture is fundamentally different. -- Models should be designed to be easily extendable to future changes. This can be achieved by limiting public function arguments, configuration arguments, and "foreseeing" future changes, *e.g.* it is usually better to add `string` "...type" arguments that can easily be extended to new future types instead of boolean `is_..._type` arguments. Only the minimum amount of changes shall be made to existing architectures to make a new model checkpoint work. -- The model design is a difficult trade-off between keeping code readable and concise and supporting many model checkpoints. For most parts of the modeling code, classes shall be adapted for new model checkpoints, while there are some exceptions where it is preferred to add new classes to make sure the code is kept concise and -readable long-term, such as [UNet blocks](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/unets/unet_2d_blocks.py) and [Attention processors](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py). - -### Schedulers - -Schedulers are responsible to guide the denoising process for inference as well as to define a noise schedule for training. They are designed as individual classes with loadable configuration files and strongly follow the **single-file policy**. 
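To make the scheduler's role concrete, here is a minimal, illustrative sketch of a denoising loop; the model call is replaced by random noise so that the snippet stays self-contained:

```python
import torch
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=1000)
scheduler.set_timesteps(50)  # must be called before `step(...)`

sample = torch.randn(1, 3, 64, 64)  # start from pure noise (x_T)
for t in scheduler.timesteps:
    # In a real pipeline this would be the model's prediction, e.g. `unet(sample, t).sample`.
    model_output = torch.randn_like(sample)
    sample = scheduler.step(model_output, t, sample).prev_sample  # x_t -> x_(t-1)
```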
- -The following design principles are followed: -- All schedulers are found in [`src/diffusers/schedulers`](https://github.com/huggingface/diffusers/tree/main/src/diffusers/schedulers). -- Schedulers are **not** allowed to import from large utils files and shall be kept very self-contained. -- One scheduler Python file corresponds to one scheduler algorithm (as might be defined in a paper). -- If schedulers share similar functionalities, we can make use of the `# Copied from` mechanism. -- Schedulers all inherit from `SchedulerMixin` and `ConfigMixin`. -- Schedulers can be easily swapped out with the [`ConfigMixin.from_config`](https://huggingface.co/docs/diffusers/main/en/api/configuration#diffusers.ConfigMixin.from_config) method as explained in detail [here](./docs/source/en/using-diffusers/schedulers.md). -- Every scheduler has to have a `set_num_inference_steps`, and a `step` function. `set_num_inference_steps(...)` has to be called before every denoising process, *i.e.* before `step(...)` is called. -- Every scheduler exposes the timesteps to be "looped over" via a `timesteps` attribute, which is an array of timesteps the model will be called upon. -- The `step(...)` function takes a predicted model output and the "current" sample (x_t) and returns the "previous", slightly more denoised sample (x_t-1). -- Given the complexity of diffusion schedulers, the `step` function does not expose all the complexity and can be a bit of a "black box". -- In almost all cases, novel schedulers shall be implemented in a new scheduling file. diff --git a/diffusers/README.md b/diffusers/README.md deleted file mode 100644 index b99ca828e4d0dcf44a6954b9ae686435fc3127e9..0000000000000000000000000000000000000000 --- a/diffusers/README.md +++ /dev/null @@ -1,239 +0,0 @@ - - -
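Returning to the scheduler design principles above, the sketch below illustrates the step-count/`timesteps`/`step` contract with a dummy noise prediction standing in for a real model. It assumes `torch` and `diffusers` are installed and the `google/ddpm-cat-256` checkpoint is reachable; note that the released API spells the step-count setter `set_timesteps(...)`, as the README quickstart below also does.

```python
import torch
from diffusers import DDPMScheduler

scheduler = DDPMScheduler.from_pretrained("google/ddpm-cat-256")
scheduler.set_timesteps(10)          # must be called before any call to step(...)

sample = torch.randn(1, 3, 32, 32)   # "current" noisy sample x_t (shape is arbitrary here)
for t in scheduler.timesteps:        # the timesteps the model would be called on
    fake_model_output = torch.randn_like(sample)  # placeholder for model(sample, t).sample
    # step(...) takes the predicted model output and x_t and returns the slightly
    # more denoised x_{t-1} as `prev_sample`.
    sample = scheduler.step(fake_model_output, t, sample).prev_sample
```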

[README header: 🧨 Diffusers logo and badges — GitHub license, GitHub release, Contributor Covenant, X account]
- -🤗 Diffusers is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. Whether you're looking for a simple inference solution or training your own diffusion models, 🤗 Diffusers is a modular toolbox that supports both. Our library is designed with a focus on [usability over performance](https://huggingface.co/docs/diffusers/conceptual/philosophy#usability-over-performance), [simple over easy](https://huggingface.co/docs/diffusers/conceptual/philosophy#simple-over-easy), and [customizability over abstractions](https://huggingface.co/docs/diffusers/conceptual/philosophy#tweakable-contributorfriendly-over-abstraction). - -🤗 Diffusers offers three core components: - -- State-of-the-art [diffusion pipelines](https://huggingface.co/docs/diffusers/api/pipelines/overview) that can be run in inference with just a few lines of code. -- Interchangeable noise [schedulers](https://huggingface.co/docs/diffusers/api/schedulers/overview) for different diffusion speeds and output quality. -- Pretrained [models](https://huggingface.co/docs/diffusers/api/models/overview) that can be used as building blocks, and combined with schedulers, for creating your own end-to-end diffusion systems. - -## Installation - -We recommend installing 🤗 Diffusers in a virtual environment from PyPI or Conda. For more details about installing [PyTorch](https://pytorch.org/get-started/locally/) and [Flax](https://flax.readthedocs.io/en/latest/#installation), please refer to their official documentation. - -### PyTorch - -With `pip` (official package): - -```bash -pip install --upgrade diffusers[torch] -``` - -With `conda` (maintained by the community): - -```sh -conda install -c conda-forge diffusers -``` - -### Flax - -With `pip` (official package): - -```bash -pip install --upgrade diffusers[flax] -``` - -### Apple Silicon (M1/M2) support - -Please refer to the [How to use Stable Diffusion in Apple Silicon](https://huggingface.co/docs/diffusers/optimization/mps) guide. - -## Quickstart - -Generating outputs is super easy with 🤗 Diffusers. 
To generate an image from text, use the `from_pretrained` method to load any pretrained diffusion model (browse the [Hub](https://huggingface.co/models?library=diffusers&sort=downloads) for 30,000+ checkpoints): - -```python -from diffusers import DiffusionPipeline -import torch - -pipeline = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16) -pipeline.to("cuda") -pipeline("An image of a squirrel in Picasso style").images[0] -``` - -You can also dig into the models and schedulers toolbox to build your own diffusion system: - -```python -from diffusers import DDPMScheduler, UNet2DModel -from PIL import Image -import torch - -scheduler = DDPMScheduler.from_pretrained("google/ddpm-cat-256") -model = UNet2DModel.from_pretrained("google/ddpm-cat-256").to("cuda") -scheduler.set_timesteps(50) - -sample_size = model.config.sample_size -noise = torch.randn((1, 3, sample_size, sample_size), device="cuda") -input = noise - -for t in scheduler.timesteps: - with torch.no_grad(): - noisy_residual = model(input, t).sample - prev_noisy_sample = scheduler.step(noisy_residual, t, input).prev_sample - input = prev_noisy_sample - -image = (input / 2 + 0.5).clamp(0, 1) -image = image.cpu().permute(0, 2, 3, 1).numpy()[0] -image = Image.fromarray((image * 255).round().astype("uint8")) -image -``` - -Check out the [Quickstart](https://huggingface.co/docs/diffusers/quicktour) to launch your diffusion journey today! - -## How to navigate the documentation - -| **Documentation** | **What can I learn?** | -|---------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| [Tutorial](https://huggingface.co/docs/diffusers/tutorials/tutorial_overview) | A basic crash course for learning how to use the library's most important features like using models and schedulers to build your own diffusion system, and training your own diffusion model. | -| [Loading](https://huggingface.co/docs/diffusers/using-diffusers/loading_overview) | Guides for how to load and configure all the components (pipelines, models, and schedulers) of the library, as well as how to use different schedulers. | -| [Pipelines for inference](https://huggingface.co/docs/diffusers/using-diffusers/pipeline_overview) | Guides for how to use pipelines for different inference tasks, batched generation, controlling generated outputs and randomness, and how to contribute a pipeline to the library. | -| [Optimization](https://huggingface.co/docs/diffusers/optimization/opt_overview) | Guides for how to optimize your diffusion model to run faster and consume less memory. | -| [Training](https://huggingface.co/docs/diffusers/training/overview) | Guides for how to train a diffusion model for different tasks with different training techniques. | -## Contribution - -We ❤️ contributions from the open-source community! -If you want to contribute to this library, please check out our [Contribution guide](https://github.com/huggingface/diffusers/blob/main/CONTRIBUTING.md). -You can look out for [issues](https://github.com/huggingface/diffusers/issues) you'd like to tackle to contribute to the library. 
-- See [Good first issues](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22) for general opportunities to contribute
-- See [New model/pipeline](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22New+pipeline%2Fmodel%22) to contribute exciting new diffusion models / diffusion pipelines
-- See [New scheduler](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22New+scheduler%22) to contribute new schedulers
-
-Also, say 👋 in our public Discord channel. We discuss the hottest trends about diffusion models, help each other with contributions and personal projects, or just hang out ☕.
-
-## Popular Tasks & Pipelines
-
| Task | Pipeline | 🤗 Hub |
|---|---|---|
| Unconditional Image Generation | DDPM | google/ddpm-ema-church-256 |
| Text-to-Image | Stable Diffusion Text-to-Image | stable-diffusion-v1-5/stable-diffusion-v1-5 |
| Text-to-Image | unCLIP | kakaobrain/karlo-v1-alpha |
| Text-to-Image | DeepFloyd IF | DeepFloyd/IF-I-XL-v1.0 |
| Text-to-Image | Kandinsky | kandinsky-community/kandinsky-2-2-decoder |
| Text-guided Image-to-Image | ControlNet | lllyasviel/sd-controlnet-canny |
| Text-guided Image-to-Image | InstructPix2Pix | timbrooks/instruct-pix2pix |
| Text-guided Image-to-Image | Stable Diffusion Image-to-Image | stable-diffusion-v1-5/stable-diffusion-v1-5 |
| Text-guided Image Inpainting | Stable Diffusion Inpainting | runwayml/stable-diffusion-inpainting |
| Image Variation | Stable Diffusion Image Variation | lambdalabs/sd-image-variations-diffusers |
| Super Resolution | Stable Diffusion Upscale | stabilityai/stable-diffusion-x4-upscaler |
| Super Resolution | Stable Diffusion Latent Upscale | stabilityai/sd-x2-latent-upscaler |
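As a hedged, minimal illustration of one row in the table above (text-guided image-to-image with Stable Diffusion), the sketch below assumes a CUDA GPU with `diffusers` and `torch` installed, and uses a flat gray PIL image as a stand-in for a real input photo; the prompt is borrowed from the benchmarking scripts further down.

```python
import torch
from PIL import Image
from diffusers import AutoPipelineForImage2Image

# Task-oriented auto-pipeline: resolves the right image-to-image pipeline for the checkpoint.
pipe = AutoPipelineForImage2Image.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init_image = Image.new("RGB", (512, 512), color=(128, 128, 128))  # placeholder input image

image = pipe(
    prompt="ghibli style, a fantasy landscape with castles",
    image=init_image,
    strength=0.75,           # fraction of the denoising schedule applied to the input
    num_inference_steps=50,
).images[0]
image.save("img2img_sketch.png")
```

Lower `strength` values preserve more of the input image, while values close to 1.0 let the prompt dominate.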
- -## Popular libraries using 🧨 Diffusers - -- https://github.com/microsoft/TaskMatrix -- https://github.com/invoke-ai/InvokeAI -- https://github.com/InstantID/InstantID -- https://github.com/apple/ml-stable-diffusion -- https://github.com/Sanster/lama-cleaner -- https://github.com/IDEA-Research/Grounded-Segment-Anything -- https://github.com/ashawkey/stable-dreamfusion -- https://github.com/deep-floyd/IF -- https://github.com/bentoml/BentoML -- https://github.com/bmaltais/kohya_ss -- +14,000 other amazing GitHub repositories 💪 - -Thank you for using us ❤️. - -## Credits - -This library concretizes previous work by many different authors and would not have been possible without their great research and implementations. We'd like to thank, in particular, the following implementations which have helped us in our development and without which the API could not have been as polished today: - -- @CompVis' latent diffusion models library, available [here](https://github.com/CompVis/latent-diffusion) -- @hojonathanho original DDPM implementation, available [here](https://github.com/hojonathanho/diffusion) as well as the extremely useful translation into PyTorch by @pesser, available [here](https://github.com/pesser/pytorch_diffusion) -- @ermongroup's DDIM implementation, available [here](https://github.com/ermongroup/ddim) -- @yang-song's Score-VE and Score-VP implementations, available [here](https://github.com/yang-song/score_sde_pytorch) - -We also want to thank @heejkoo for the very helpful overview of papers, code and resources on diffusion models, available [here](https://github.com/heejkoo/Awesome-Diffusion-Models) as well as @crowsonkb and @rromb for useful discussions and insights. - -## Citation - -```bibtex -@misc{von-platen-etal-2022-diffusers, - author = {Patrick von Platen and Suraj Patil and Anton Lozhkov and Pedro Cuenca and Nathan Lambert and Kashif Rasul and Mishig Davaadorj and Dhruv Nair and Sayak Paul and William Berman and Yiyi Xu and Steven Liu and Thomas Wolf}, - title = {Diffusers: State-of-the-art diffusion models}, - year = {2022}, - publisher = {GitHub}, - journal = {GitHub repository}, - howpublished = {\url{https://github.com/huggingface/diffusers}} -} -``` diff --git a/diffusers/_typos.toml b/diffusers/_typos.toml deleted file mode 100644 index 551099f981e7885fbda9ed28e297bace0e13407b..0000000000000000000000000000000000000000 --- a/diffusers/_typos.toml +++ /dev/null @@ -1,13 +0,0 @@ -# Files for typos -# Instruction: https://github.com/marketplace/actions/typos-action#getting-started - -[default.extend-identifiers] - -[default.extend-words] -NIN="NIN" # NIN is used in scripts/convert_ncsnpp_original_checkpoint_to_diffusers.py -nd="np" # nd may be np (numpy) -parms="parms" # parms is used in scripts/convert_original_stable_diffusion_to_diffusers.py - - -[files] -extend-exclude = ["_typos.toml"] diff --git a/diffusers/benchmarks/base_classes.py b/diffusers/benchmarks/base_classes.py deleted file mode 100644 index 45bf65c93c93a3a77d5b2e4b321b19a4cf9cd63f..0000000000000000000000000000000000000000 --- a/diffusers/benchmarks/base_classes.py +++ /dev/null @@ -1,346 +0,0 @@ -import os -import sys - -import torch - -from diffusers import ( - AutoPipelineForImage2Image, - AutoPipelineForInpainting, - AutoPipelineForText2Image, - ControlNetModel, - LCMScheduler, - StableDiffusionAdapterPipeline, - StableDiffusionControlNetPipeline, - StableDiffusionXLAdapterPipeline, - StableDiffusionXLControlNetPipeline, - T2IAdapter, - WuerstchenCombinedPipeline, -) -from diffusers.utils 
import load_image - - -sys.path.append(".") - -from utils import ( # noqa: E402 - BASE_PATH, - PROMPT, - BenchmarkInfo, - benchmark_fn, - bytes_to_giga_bytes, - flush, - generate_csv_dict, - write_to_csv, -) - - -RESOLUTION_MAPPING = { - "Lykon/DreamShaper": (512, 512), - "lllyasviel/sd-controlnet-canny": (512, 512), - "diffusers/controlnet-canny-sdxl-1.0": (1024, 1024), - "TencentARC/t2iadapter_canny_sd14v1": (512, 512), - "TencentARC/t2i-adapter-canny-sdxl-1.0": (1024, 1024), - "stabilityai/stable-diffusion-2-1": (768, 768), - "stabilityai/stable-diffusion-xl-base-1.0": (1024, 1024), - "stabilityai/stable-diffusion-xl-refiner-1.0": (1024, 1024), - "stabilityai/sdxl-turbo": (512, 512), -} - - -class BaseBenchmak: - pipeline_class = None - - def __init__(self, args): - super().__init__() - - def run_inference(self, args): - raise NotImplementedError - - def benchmark(self, args): - raise NotImplementedError - - def get_result_filepath(self, args): - pipeline_class_name = str(self.pipe.__class__.__name__) - name = ( - args.ckpt.replace("/", "_") - + "_" - + pipeline_class_name - + f"-bs@{args.batch_size}-steps@{args.num_inference_steps}-mco@{args.model_cpu_offload}-compile@{args.run_compile}.csv" - ) - filepath = os.path.join(BASE_PATH, name) - return filepath - - -class TextToImageBenchmark(BaseBenchmak): - pipeline_class = AutoPipelineForText2Image - - def __init__(self, args): - pipe = self.pipeline_class.from_pretrained(args.ckpt, torch_dtype=torch.float16) - pipe = pipe.to("cuda") - - if args.run_compile: - if not isinstance(pipe, WuerstchenCombinedPipeline): - pipe.unet.to(memory_format=torch.channels_last) - print("Run torch compile") - pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) - - if hasattr(pipe, "movq") and getattr(pipe, "movq", None) is not None: - pipe.movq.to(memory_format=torch.channels_last) - pipe.movq = torch.compile(pipe.movq, mode="reduce-overhead", fullgraph=True) - else: - print("Run torch compile") - pipe.decoder = torch.compile(pipe.decoder, mode="reduce-overhead", fullgraph=True) - pipe.vqgan = torch.compile(pipe.vqgan, mode="reduce-overhead", fullgraph=True) - - pipe.set_progress_bar_config(disable=True) - self.pipe = pipe - - def run_inference(self, pipe, args): - _ = pipe( - prompt=PROMPT, - num_inference_steps=args.num_inference_steps, - num_images_per_prompt=args.batch_size, - ) - - def benchmark(self, args): - flush() - - print(f"[INFO] {self.pipe.__class__.__name__}: Running benchmark with: {vars(args)}\n") - - time = benchmark_fn(self.run_inference, self.pipe, args) # in seconds. - memory = bytes_to_giga_bytes(torch.cuda.max_memory_allocated()) # in GBs. 
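-        # Note: `torch.cuda.max_memory_allocated()` returns the peak GPU memory since the
-        # counters were last reset; `flush()` (called at the start of `benchmark`) empties the
-        # CUDA cache and resets the peak-memory statistics, so each run is measured in isolation.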
- benchmark_info = BenchmarkInfo(time=time, memory=memory) - - pipeline_class_name = str(self.pipe.__class__.__name__) - flush() - csv_dict = generate_csv_dict( - pipeline_cls=pipeline_class_name, ckpt=args.ckpt, args=args, benchmark_info=benchmark_info - ) - filepath = self.get_result_filepath(args) - write_to_csv(filepath, csv_dict) - print(f"Logs written to: {filepath}") - flush() - - -class TurboTextToImageBenchmark(TextToImageBenchmark): - def __init__(self, args): - super().__init__(args) - - def run_inference(self, pipe, args): - _ = pipe( - prompt=PROMPT, - num_inference_steps=args.num_inference_steps, - num_images_per_prompt=args.batch_size, - guidance_scale=0.0, - ) - - -class LCMLoRATextToImageBenchmark(TextToImageBenchmark): - lora_id = "latent-consistency/lcm-lora-sdxl" - - def __init__(self, args): - super().__init__(args) - self.pipe.load_lora_weights(self.lora_id) - self.pipe.fuse_lora() - self.pipe.unload_lora_weights() - self.pipe.scheduler = LCMScheduler.from_config(self.pipe.scheduler.config) - - def get_result_filepath(self, args): - pipeline_class_name = str(self.pipe.__class__.__name__) - name = ( - self.lora_id.replace("/", "_") - + "_" - + pipeline_class_name - + f"-bs@{args.batch_size}-steps@{args.num_inference_steps}-mco@{args.model_cpu_offload}-compile@{args.run_compile}.csv" - ) - filepath = os.path.join(BASE_PATH, name) - return filepath - - def run_inference(self, pipe, args): - _ = pipe( - prompt=PROMPT, - num_inference_steps=args.num_inference_steps, - num_images_per_prompt=args.batch_size, - guidance_scale=1.0, - ) - - def benchmark(self, args): - flush() - - print(f"[INFO] {self.pipe.__class__.__name__}: Running benchmark with: {vars(args)}\n") - - time = benchmark_fn(self.run_inference, self.pipe, args) # in seconds. - memory = bytes_to_giga_bytes(torch.cuda.max_memory_allocated()) # in GBs. 
- benchmark_info = BenchmarkInfo(time=time, memory=memory) - - pipeline_class_name = str(self.pipe.__class__.__name__) - flush() - csv_dict = generate_csv_dict( - pipeline_cls=pipeline_class_name, ckpt=self.lora_id, args=args, benchmark_info=benchmark_info - ) - filepath = self.get_result_filepath(args) - write_to_csv(filepath, csv_dict) - print(f"Logs written to: {filepath}") - flush() - - -class ImageToImageBenchmark(TextToImageBenchmark): - pipeline_class = AutoPipelineForImage2Image - url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/benchmarking/1665_Girl_with_a_Pearl_Earring.jpg" - image = load_image(url).convert("RGB") - - def __init__(self, args): - super().__init__(args) - self.image = self.image.resize(RESOLUTION_MAPPING[args.ckpt]) - - def run_inference(self, pipe, args): - _ = pipe( - prompt=PROMPT, - image=self.image, - num_inference_steps=args.num_inference_steps, - num_images_per_prompt=args.batch_size, - ) - - -class TurboImageToImageBenchmark(ImageToImageBenchmark): - def __init__(self, args): - super().__init__(args) - - def run_inference(self, pipe, args): - _ = pipe( - prompt=PROMPT, - image=self.image, - num_inference_steps=args.num_inference_steps, - num_images_per_prompt=args.batch_size, - guidance_scale=0.0, - strength=0.5, - ) - - -class InpaintingBenchmark(ImageToImageBenchmark): - pipeline_class = AutoPipelineForInpainting - mask_url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/benchmarking/overture-creations-5sI6fQgYIuo_mask.png" - mask = load_image(mask_url).convert("RGB") - - def __init__(self, args): - super().__init__(args) - self.image = self.image.resize(RESOLUTION_MAPPING[args.ckpt]) - self.mask = self.mask.resize(RESOLUTION_MAPPING[args.ckpt]) - - def run_inference(self, pipe, args): - _ = pipe( - prompt=PROMPT, - image=self.image, - mask_image=self.mask, - num_inference_steps=args.num_inference_steps, - num_images_per_prompt=args.batch_size, - ) - - -class IPAdapterTextToImageBenchmark(TextToImageBenchmark): - url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/load_neg_embed.png" - image = load_image(url) - - def __init__(self, args): - pipe = self.pipeline_class.from_pretrained(args.ckpt, torch_dtype=torch.float16).to("cuda") - pipe.load_ip_adapter( - args.ip_adapter_id[0], - subfolder="models" if "sdxl" not in args.ip_adapter_id[1] else "sdxl_models", - weight_name=args.ip_adapter_id[1], - ) - - if args.run_compile: - pipe.unet.to(memory_format=torch.channels_last) - print("Run torch compile") - pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) - - pipe.set_progress_bar_config(disable=True) - self.pipe = pipe - - def run_inference(self, pipe, args): - _ = pipe( - prompt=PROMPT, - ip_adapter_image=self.image, - num_inference_steps=args.num_inference_steps, - num_images_per_prompt=args.batch_size, - ) - - -class ControlNetBenchmark(TextToImageBenchmark): - pipeline_class = StableDiffusionControlNetPipeline - aux_network_class = ControlNetModel - root_ckpt = "Lykon/DreamShaper" - - url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/benchmarking/canny_image_condition.png" - image = load_image(url).convert("RGB") - - def __init__(self, args): - aux_network = self.aux_network_class.from_pretrained(args.ckpt, torch_dtype=torch.float16) - pipe = self.pipeline_class.from_pretrained(self.root_ckpt, controlnet=aux_network, torch_dtype=torch.float16) - pipe = pipe.to("cuda") - - 
pipe.set_progress_bar_config(disable=True) - self.pipe = pipe - - if args.run_compile: - pipe.unet.to(memory_format=torch.channels_last) - pipe.controlnet.to(memory_format=torch.channels_last) - - print("Run torch compile") - pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) - pipe.controlnet = torch.compile(pipe.controlnet, mode="reduce-overhead", fullgraph=True) - - self.image = self.image.resize(RESOLUTION_MAPPING[args.ckpt]) - - def run_inference(self, pipe, args): - _ = pipe( - prompt=PROMPT, - image=self.image, - num_inference_steps=args.num_inference_steps, - num_images_per_prompt=args.batch_size, - ) - - -class ControlNetSDXLBenchmark(ControlNetBenchmark): - pipeline_class = StableDiffusionXLControlNetPipeline - root_ckpt = "stabilityai/stable-diffusion-xl-base-1.0" - - def __init__(self, args): - super().__init__(args) - - -class T2IAdapterBenchmark(ControlNetBenchmark): - pipeline_class = StableDiffusionAdapterPipeline - aux_network_class = T2IAdapter - root_ckpt = "Lykon/DreamShaper" - - url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/benchmarking/canny_for_adapter.png" - image = load_image(url).convert("L") - - def __init__(self, args): - aux_network = self.aux_network_class.from_pretrained(args.ckpt, torch_dtype=torch.float16) - pipe = self.pipeline_class.from_pretrained(self.root_ckpt, adapter=aux_network, torch_dtype=torch.float16) - pipe = pipe.to("cuda") - - pipe.set_progress_bar_config(disable=True) - self.pipe = pipe - - if args.run_compile: - pipe.unet.to(memory_format=torch.channels_last) - pipe.adapter.to(memory_format=torch.channels_last) - - print("Run torch compile") - pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) - pipe.adapter = torch.compile(pipe.adapter, mode="reduce-overhead", fullgraph=True) - - self.image = self.image.resize(RESOLUTION_MAPPING[args.ckpt]) - - -class T2IAdapterSDXLBenchmark(T2IAdapterBenchmark): - pipeline_class = StableDiffusionXLAdapterPipeline - root_ckpt = "stabilityai/stable-diffusion-xl-base-1.0" - - url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/benchmarking/canny_for_adapter_sdxl.png" - image = load_image(url) - - def __init__(self, args): - super().__init__(args) diff --git a/diffusers/benchmarks/benchmark_controlnet.py b/diffusers/benchmarks/benchmark_controlnet.py deleted file mode 100644 index 9217004461dc9352b1b9e6cda698dd866177eb67..0000000000000000000000000000000000000000 --- a/diffusers/benchmarks/benchmark_controlnet.py +++ /dev/null @@ -1,26 +0,0 @@ -import argparse -import sys - - -sys.path.append(".") -from base_classes import ControlNetBenchmark, ControlNetSDXLBenchmark # noqa: E402 - - -if __name__ == "__main__": - parser = argparse.ArgumentParser() - parser.add_argument( - "--ckpt", - type=str, - default="lllyasviel/sd-controlnet-canny", - choices=["lllyasviel/sd-controlnet-canny", "diffusers/controlnet-canny-sdxl-1.0"], - ) - parser.add_argument("--batch_size", type=int, default=1) - parser.add_argument("--num_inference_steps", type=int, default=50) - parser.add_argument("--model_cpu_offload", action="store_true") - parser.add_argument("--run_compile", action="store_true") - args = parser.parse_args() - - benchmark_pipe = ( - ControlNetBenchmark(args) if args.ckpt == "lllyasviel/sd-controlnet-canny" else ControlNetSDXLBenchmark(args) - ) - benchmark_pipe.benchmark(args) diff --git a/diffusers/benchmarks/benchmark_ip_adapters.py b/diffusers/benchmarks/benchmark_ip_adapters.py deleted file mode 
100644 index 9a31a21fc60ddd2d041d7896ee4144d1766fcc01..0000000000000000000000000000000000000000 --- a/diffusers/benchmarks/benchmark_ip_adapters.py +++ /dev/null @@ -1,33 +0,0 @@ -import argparse -import sys - - -sys.path.append(".") -from base_classes import IPAdapterTextToImageBenchmark # noqa: E402 - - -IP_ADAPTER_CKPTS = { - # because original SD v1.5 has been taken down. - "Lykon/DreamShaper": ("h94/IP-Adapter", "ip-adapter_sd15.bin"), - "stabilityai/stable-diffusion-xl-base-1.0": ("h94/IP-Adapter", "ip-adapter_sdxl.bin"), -} - - -if __name__ == "__main__": - parser = argparse.ArgumentParser() - parser.add_argument( - "--ckpt", - type=str, - default="rstabilityai/stable-diffusion-xl-base-1.0", - choices=list(IP_ADAPTER_CKPTS.keys()), - ) - parser.add_argument("--batch_size", type=int, default=1) - parser.add_argument("--num_inference_steps", type=int, default=50) - parser.add_argument("--model_cpu_offload", action="store_true") - parser.add_argument("--run_compile", action="store_true") - args = parser.parse_args() - - args.ip_adapter_id = IP_ADAPTER_CKPTS[args.ckpt] - benchmark_pipe = IPAdapterTextToImageBenchmark(args) - args.ckpt = f"{args.ckpt} (IP-Adapter)" - benchmark_pipe.benchmark(args) diff --git a/diffusers/benchmarks/benchmark_sd_img.py b/diffusers/benchmarks/benchmark_sd_img.py deleted file mode 100644 index 772befe8795fb933fdb7831565eb28848e642af2..0000000000000000000000000000000000000000 --- a/diffusers/benchmarks/benchmark_sd_img.py +++ /dev/null @@ -1,29 +0,0 @@ -import argparse -import sys - - -sys.path.append(".") -from base_classes import ImageToImageBenchmark, TurboImageToImageBenchmark # noqa: E402 - - -if __name__ == "__main__": - parser = argparse.ArgumentParser() - parser.add_argument( - "--ckpt", - type=str, - default="Lykon/DreamShaper", - choices=[ - "Lykon/DreamShaper", - "stabilityai/stable-diffusion-2-1", - "stabilityai/stable-diffusion-xl-refiner-1.0", - "stabilityai/sdxl-turbo", - ], - ) - parser.add_argument("--batch_size", type=int, default=1) - parser.add_argument("--num_inference_steps", type=int, default=50) - parser.add_argument("--model_cpu_offload", action="store_true") - parser.add_argument("--run_compile", action="store_true") - args = parser.parse_args() - - benchmark_pipe = ImageToImageBenchmark(args) if "turbo" not in args.ckpt else TurboImageToImageBenchmark(args) - benchmark_pipe.benchmark(args) diff --git a/diffusers/benchmarks/benchmark_sd_inpainting.py b/diffusers/benchmarks/benchmark_sd_inpainting.py deleted file mode 100644 index 143adcb0d87c7a9cf9c8ae0d19581fd48f9ac4c8..0000000000000000000000000000000000000000 --- a/diffusers/benchmarks/benchmark_sd_inpainting.py +++ /dev/null @@ -1,28 +0,0 @@ -import argparse -import sys - - -sys.path.append(".") -from base_classes import InpaintingBenchmark # noqa: E402 - - -if __name__ == "__main__": - parser = argparse.ArgumentParser() - parser.add_argument( - "--ckpt", - type=str, - default="Lykon/DreamShaper", - choices=[ - "Lykon/DreamShaper", - "stabilityai/stable-diffusion-2-1", - "stabilityai/stable-diffusion-xl-base-1.0", - ], - ) - parser.add_argument("--batch_size", type=int, default=1) - parser.add_argument("--num_inference_steps", type=int, default=50) - parser.add_argument("--model_cpu_offload", action="store_true") - parser.add_argument("--run_compile", action="store_true") - args = parser.parse_args() - - benchmark_pipe = InpaintingBenchmark(args) - benchmark_pipe.benchmark(args) diff --git a/diffusers/benchmarks/benchmark_t2i_adapter.py 
b/diffusers/benchmarks/benchmark_t2i_adapter.py deleted file mode 100644 index 44b04b470ea65d5f3318bee21bb107c7b4b2b2f9..0000000000000000000000000000000000000000 --- a/diffusers/benchmarks/benchmark_t2i_adapter.py +++ /dev/null @@ -1,28 +0,0 @@ -import argparse -import sys - - -sys.path.append(".") -from base_classes import T2IAdapterBenchmark, T2IAdapterSDXLBenchmark # noqa: E402 - - -if __name__ == "__main__": - parser = argparse.ArgumentParser() - parser.add_argument( - "--ckpt", - type=str, - default="TencentARC/t2iadapter_canny_sd14v1", - choices=["TencentARC/t2iadapter_canny_sd14v1", "TencentARC/t2i-adapter-canny-sdxl-1.0"], - ) - parser.add_argument("--batch_size", type=int, default=1) - parser.add_argument("--num_inference_steps", type=int, default=50) - parser.add_argument("--model_cpu_offload", action="store_true") - parser.add_argument("--run_compile", action="store_true") - args = parser.parse_args() - - benchmark_pipe = ( - T2IAdapterBenchmark(args) - if args.ckpt == "TencentARC/t2iadapter_canny_sd14v1" - else T2IAdapterSDXLBenchmark(args) - ) - benchmark_pipe.benchmark(args) diff --git a/diffusers/benchmarks/benchmark_t2i_lcm_lora.py b/diffusers/benchmarks/benchmark_t2i_lcm_lora.py deleted file mode 100644 index 957e0a463e28fccc51fe32cd975f3d5234cfd1f2..0000000000000000000000000000000000000000 --- a/diffusers/benchmarks/benchmark_t2i_lcm_lora.py +++ /dev/null @@ -1,23 +0,0 @@ -import argparse -import sys - - -sys.path.append(".") -from base_classes import LCMLoRATextToImageBenchmark # noqa: E402 - - -if __name__ == "__main__": - parser = argparse.ArgumentParser() - parser.add_argument( - "--ckpt", - type=str, - default="stabilityai/stable-diffusion-xl-base-1.0", - ) - parser.add_argument("--batch_size", type=int, default=1) - parser.add_argument("--num_inference_steps", type=int, default=4) - parser.add_argument("--model_cpu_offload", action="store_true") - parser.add_argument("--run_compile", action="store_true") - args = parser.parse_args() - - benchmark_pipe = LCMLoRATextToImageBenchmark(args) - benchmark_pipe.benchmark(args) diff --git a/diffusers/benchmarks/benchmark_text_to_image.py b/diffusers/benchmarks/benchmark_text_to_image.py deleted file mode 100644 index ddc7fb2676a5c6090ce08e4488f5ca697c4329aa..0000000000000000000000000000000000000000 --- a/diffusers/benchmarks/benchmark_text_to_image.py +++ /dev/null @@ -1,40 +0,0 @@ -import argparse -import sys - - -sys.path.append(".") -from base_classes import TextToImageBenchmark, TurboTextToImageBenchmark # noqa: E402 - - -ALL_T2I_CKPTS = [ - "Lykon/DreamShaper", - "segmind/SSD-1B", - "stabilityai/stable-diffusion-xl-base-1.0", - "kandinsky-community/kandinsky-2-2-decoder", - "warp-ai/wuerstchen", - "stabilityai/sdxl-turbo", -] - - -if __name__ == "__main__": - parser = argparse.ArgumentParser() - parser.add_argument( - "--ckpt", - type=str, - default="Lykon/DreamShaper", - choices=ALL_T2I_CKPTS, - ) - parser.add_argument("--batch_size", type=int, default=1) - parser.add_argument("--num_inference_steps", type=int, default=50) - parser.add_argument("--model_cpu_offload", action="store_true") - parser.add_argument("--run_compile", action="store_true") - args = parser.parse_args() - - benchmark_cls = None - if "turbo" in args.ckpt: - benchmark_cls = TurboTextToImageBenchmark - else: - benchmark_cls = TextToImageBenchmark - - benchmark_pipe = benchmark_cls(args) - benchmark_pipe.benchmark(args) diff --git a/diffusers/benchmarks/push_results.py b/diffusers/benchmarks/push_results.py deleted file mode 100644 index 
71cd60f32c0fa9ddbf10f73dc3c9ee3e1531e07c..0000000000000000000000000000000000000000 --- a/diffusers/benchmarks/push_results.py +++ /dev/null @@ -1,72 +0,0 @@ -import glob -import sys - -import pandas as pd -from huggingface_hub import hf_hub_download, upload_file -from huggingface_hub.utils import EntryNotFoundError - - -sys.path.append(".") -from utils import BASE_PATH, FINAL_CSV_FILE, GITHUB_SHA, REPO_ID, collate_csv # noqa: E402 - - -def has_previous_benchmark() -> str: - csv_path = None - try: - csv_path = hf_hub_download(repo_id=REPO_ID, repo_type="dataset", filename=FINAL_CSV_FILE) - except EntryNotFoundError: - csv_path = None - return csv_path - - -def filter_float(value): - if isinstance(value, str): - return float(value.split()[0]) - return value - - -def push_to_hf_dataset(): - all_csvs = sorted(glob.glob(f"{BASE_PATH}/*.csv")) - collate_csv(all_csvs, FINAL_CSV_FILE) - - # If there's an existing benchmark file, we should report the changes. - csv_path = has_previous_benchmark() - if csv_path is not None: - current_results = pd.read_csv(FINAL_CSV_FILE) - previous_results = pd.read_csv(csv_path) - - numeric_columns = current_results.select_dtypes(include=["float64", "int64"]).columns - numeric_columns = [ - c for c in numeric_columns if c not in ["batch_size", "num_inference_steps", "actual_gpu_memory (gbs)"] - ] - - for column in numeric_columns: - previous_results[column] = previous_results[column].map(lambda x: filter_float(x)) - - # Calculate the percentage change - current_results[column] = current_results[column].astype(float) - previous_results[column] = previous_results[column].astype(float) - percent_change = ((current_results[column] - previous_results[column]) / previous_results[column]) * 100 - - # Format the values with '+' or '-' sign and append to original values - current_results[column] = current_results[column].map(str) + percent_change.map( - lambda x: f" ({'+' if x > 0 else ''}{x:.2f}%)" - ) - # There might be newly added rows. So, filter out the NaNs. - current_results[column] = current_results[column].map(lambda x: x.replace(" (nan%)", "")) - - # Overwrite the current result file. - current_results.to_csv(FINAL_CSV_FILE, index=False) - - commit_message = f"upload from sha: {GITHUB_SHA}" if GITHUB_SHA is not None else "upload benchmark results" - upload_file( - repo_id=REPO_ID, - path_in_repo=FINAL_CSV_FILE, - path_or_fileobj=FINAL_CSV_FILE, - repo_type="dataset", - commit_message=commit_message, - ) - - -if __name__ == "__main__": - push_to_hf_dataset() diff --git a/diffusers/benchmarks/run_all.py b/diffusers/benchmarks/run_all.py deleted file mode 100644 index c9932cc71c38513a301854a5ad227a2e3ec28d30..0000000000000000000000000000000000000000 --- a/diffusers/benchmarks/run_all.py +++ /dev/null @@ -1,101 +0,0 @@ -import glob -import subprocess -import sys -from typing import List - - -sys.path.append(".") -from benchmark_text_to_image import ALL_T2I_CKPTS # noqa: E402 - - -PATTERN = "benchmark_*.py" - - -class SubprocessCallException(Exception): - pass - - -# Taken from `test_examples_utils.py` -def run_command(command: List[str], return_stdout=False): - """ - Runs `command` with `subprocess.check_output` and will potentially return the `stdout`. 
Will also properly capture - if an error occurred while running `command` - """ - try: - output = subprocess.check_output(command, stderr=subprocess.STDOUT) - if return_stdout: - if hasattr(output, "decode"): - output = output.decode("utf-8") - return output - except subprocess.CalledProcessError as e: - raise SubprocessCallException( - f"Command `{' '.join(command)}` failed with the following error:\n\n{e.output.decode()}" - ) from e - - -def main(): - python_files = glob.glob(PATTERN) - - for file in python_files: - print(f"****** Running file: {file} ******") - - # Run with canonical settings. - if file != "benchmark_text_to_image.py" and file != "benchmark_ip_adapters.py": - command = f"python {file}" - run_command(command.split()) - - command += " --run_compile" - run_command(command.split()) - - # Run variants. - for file in python_files: - # See: https://github.com/pytorch/pytorch/issues/129637 - if file == "benchmark_ip_adapters.py": - continue - - if file == "benchmark_text_to_image.py": - for ckpt in ALL_T2I_CKPTS: - command = f"python {file} --ckpt {ckpt}" - - if "turbo" in ckpt: - command += " --num_inference_steps 1" - - run_command(command.split()) - - command += " --run_compile" - run_command(command.split()) - - elif file == "benchmark_sd_img.py": - for ckpt in ["stabilityai/stable-diffusion-xl-refiner-1.0", "stabilityai/sdxl-turbo"]: - command = f"python {file} --ckpt {ckpt}" - - if ckpt == "stabilityai/sdxl-turbo": - command += " --num_inference_steps 2" - - run_command(command.split()) - command += " --run_compile" - run_command(command.split()) - - elif file in ["benchmark_sd_inpainting.py", "benchmark_ip_adapters.py"]: - sdxl_ckpt = "stabilityai/stable-diffusion-xl-base-1.0" - command = f"python {file} --ckpt {sdxl_ckpt}" - run_command(command.split()) - - command += " --run_compile" - run_command(command.split()) - - elif file in ["benchmark_controlnet.py", "benchmark_t2i_adapter.py"]: - sdxl_ckpt = ( - "diffusers/controlnet-canny-sdxl-1.0" - if "controlnet" in file - else "TencentARC/t2i-adapter-canny-sdxl-1.0" - ) - command = f"python {file} --ckpt {sdxl_ckpt}" - run_command(command.split()) - - command += " --run_compile" - run_command(command.split()) - - -if __name__ == "__main__": - main() diff --git a/diffusers/benchmarks/utils.py b/diffusers/benchmarks/utils.py deleted file mode 100644 index 5fce920ac6c3549e3654b1cfb2f0e79096aa019d..0000000000000000000000000000000000000000 --- a/diffusers/benchmarks/utils.py +++ /dev/null @@ -1,98 +0,0 @@ -import argparse -import csv -import gc -import os -from dataclasses import dataclass -from typing import Dict, List, Union - -import torch -import torch.utils.benchmark as benchmark - - -GITHUB_SHA = os.getenv("GITHUB_SHA", None) -BENCHMARK_FIELDS = [ - "pipeline_cls", - "ckpt_id", - "batch_size", - "num_inference_steps", - "model_cpu_offload", - "run_compile", - "time (secs)", - "memory (gbs)", - "actual_gpu_memory (gbs)", - "github_sha", -] - -PROMPT = "ghibli style, a fantasy landscape with castles" -BASE_PATH = os.getenv("BASE_PATH", ".") -TOTAL_GPU_MEMORY = float(os.getenv("TOTAL_GPU_MEMORY", torch.cuda.get_device_properties(0).total_memory / (1024**3))) - -REPO_ID = "diffusers/benchmarks" -FINAL_CSV_FILE = "collated_results.csv" - - -@dataclass -class BenchmarkInfo: - time: float - memory: float - - -def flush(): - """Wipes off memory.""" - gc.collect() - torch.cuda.empty_cache() - torch.cuda.reset_max_memory_allocated() - torch.cuda.reset_peak_memory_stats() - - -def bytes_to_giga_bytes(bytes): - return f"{(bytes / 
1024 / 1024 / 1024):.3f}" - - -def benchmark_fn(f, *args, **kwargs): - t0 = benchmark.Timer( - stmt="f(*args, **kwargs)", - globals={"args": args, "kwargs": kwargs, "f": f}, - num_threads=torch.get_num_threads(), - ) - return f"{(t0.blocked_autorange().mean):.3f}" - - -def generate_csv_dict( - pipeline_cls: str, ckpt: str, args: argparse.Namespace, benchmark_info: BenchmarkInfo -) -> Dict[str, Union[str, bool, float]]: - """Packs benchmarking data into a dictionary for latter serialization.""" - data_dict = { - "pipeline_cls": pipeline_cls, - "ckpt_id": ckpt, - "batch_size": args.batch_size, - "num_inference_steps": args.num_inference_steps, - "model_cpu_offload": args.model_cpu_offload, - "run_compile": args.run_compile, - "time (secs)": benchmark_info.time, - "memory (gbs)": benchmark_info.memory, - "actual_gpu_memory (gbs)": f"{(TOTAL_GPU_MEMORY):.3f}", - "github_sha": GITHUB_SHA, - } - return data_dict - - -def write_to_csv(file_name: str, data_dict: Dict[str, Union[str, bool, float]]): - """Serializes a dictionary into a CSV file.""" - with open(file_name, mode="w", newline="") as csvfile: - writer = csv.DictWriter(csvfile, fieldnames=BENCHMARK_FIELDS) - writer.writeheader() - writer.writerow(data_dict) - - -def collate_csv(input_files: List[str], output_file: str): - """Collates multiple identically structured CSVs into a single CSV file.""" - with open(output_file, mode="w", newline="") as outfile: - writer = csv.DictWriter(outfile, fieldnames=BENCHMARK_FIELDS) - writer.writeheader() - - for file in input_files: - with open(file, mode="r") as infile: - reader = csv.DictReader(infile) - for row in reader: - writer.writerow(row) diff --git a/diffusers/docker/diffusers-doc-builder/Dockerfile b/diffusers/docker/diffusers-doc-builder/Dockerfile deleted file mode 100644 index c9fc62707cb0dac3126c06ebc22420c1353146ff..0000000000000000000000000000000000000000 --- a/diffusers/docker/diffusers-doc-builder/Dockerfile +++ /dev/null @@ -1,52 +0,0 @@ -FROM ubuntu:20.04 -LABEL maintainer="Hugging Face" -LABEL repository="diffusers" - -ENV DEBIAN_FRONTEND=noninteractive - -RUN apt-get -y update \ - && apt-get install -y software-properties-common \ - && add-apt-repository ppa:deadsnakes/ppa - -RUN apt install -y bash \ - build-essential \ - git \ - git-lfs \ - curl \ - ca-certificates \ - libsndfile1-dev \ - python3.10 \ - python3-pip \ - libgl1 \ - zip \ - wget \ - python3.10-venv && \ - rm -rf /var/lib/apt/lists - -# make sure to use venv -RUN python3.10 -m venv /opt/venv -ENV PATH="/opt/venv/bin:$PATH" - -# pre-install the heavy dependencies (these can later be overridden by the deps from setup.py) -RUN python3.10 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \ - python3.10 -m uv pip install --no-cache-dir \ - torch \ - torchvision \ - torchaudio \ - invisible_watermark \ - --extra-index-url https://download.pytorch.org/whl/cpu && \ - python3.10 -m uv pip install --no-cache-dir \ - accelerate \ - datasets \ - hf-doc-builder \ - huggingface-hub \ - Jinja2 \ - librosa \ - numpy==1.26.4 \ - scipy \ - tensorboard \ - transformers \ - matplotlib \ - setuptools==69.5.1 - -CMD ["/bin/bash"] diff --git a/diffusers/docker/diffusers-flax-cpu/Dockerfile b/diffusers/docker/diffusers-flax-cpu/Dockerfile deleted file mode 100644 index 051008aa9a2ee057a56bfe55b7af34595c26c413..0000000000000000000000000000000000000000 --- a/diffusers/docker/diffusers-flax-cpu/Dockerfile +++ /dev/null @@ -1,49 +0,0 @@ -FROM ubuntu:20.04 -LABEL maintainer="Hugging Face" -LABEL repository="diffusers" - -ENV 
DEBIAN_FRONTEND=noninteractive - -RUN apt-get -y update \ - && apt-get install -y software-properties-common \ - && add-apt-repository ppa:deadsnakes/ppa - -RUN apt install -y bash \ - build-essential \ - git \ - git-lfs \ - curl \ - ca-certificates \ - libsndfile1-dev \ - libgl1 \ - python3.10 \ - python3-pip \ - python3.10-venv && \ - rm -rf /var/lib/apt/lists - -# make sure to use venv -RUN python3.10 -m venv /opt/venv -ENV PATH="/opt/venv/bin:$PATH" - -# pre-install the heavy dependencies (these can later be overridden by the deps from setup.py) -# follow the instructions here: https://cloud.google.com/tpu/docs/run-in-container#train_a_jax_model_in_a_docker_container -RUN python3 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \ - python3 -m uv pip install --upgrade --no-cache-dir \ - clu \ - "jax[cpu]>=0.2.16,!=0.3.2" \ - "flax>=0.4.1" \ - "jaxlib>=0.1.65" && \ - python3 -m uv pip install --no-cache-dir \ - accelerate \ - datasets \ - hf-doc-builder \ - huggingface-hub \ - Jinja2 \ - librosa \ - numpy==1.26.4 \ - scipy \ - tensorboard \ - transformers \ - hf_transfer - -CMD ["/bin/bash"] \ No newline at end of file diff --git a/diffusers/docker/diffusers-flax-tpu/Dockerfile b/diffusers/docker/diffusers-flax-tpu/Dockerfile deleted file mode 100644 index 405f068923b79136197daa8089340734d1435c60..0000000000000000000000000000000000000000 --- a/diffusers/docker/diffusers-flax-tpu/Dockerfile +++ /dev/null @@ -1,51 +0,0 @@ -FROM ubuntu:20.04 -LABEL maintainer="Hugging Face" -LABEL repository="diffusers" - -ENV DEBIAN_FRONTEND=noninteractive - -RUN apt-get -y update \ - && apt-get install -y software-properties-common \ - && add-apt-repository ppa:deadsnakes/ppa - -RUN apt install -y bash \ - build-essential \ - git \ - git-lfs \ - curl \ - ca-certificates \ - libsndfile1-dev \ - libgl1 \ - python3.10 \ - python3-pip \ - python3.10-venv && \ - rm -rf /var/lib/apt/lists - -# make sure to use venv -RUN python3.10 -m venv /opt/venv -ENV PATH="/opt/venv/bin:$PATH" - -# pre-install the heavy dependencies (these can later be overridden by the deps from setup.py) -# follow the instructions here: https://cloud.google.com/tpu/docs/run-in-container#train_a_jax_model_in_a_docker_container -RUN python3 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \ - python3 -m pip install --no-cache-dir \ - "jax[tpu]>=0.2.16,!=0.3.2" \ - -f https://storage.googleapis.com/jax-releases/libtpu_releases.html && \ - python3 -m uv pip install --upgrade --no-cache-dir \ - clu \ - "flax>=0.4.1" \ - "jaxlib>=0.1.65" && \ - python3 -m uv pip install --no-cache-dir \ - accelerate \ - datasets \ - hf-doc-builder \ - huggingface-hub \ - Jinja2 \ - librosa \ - numpy==1.26.4 \ - scipy \ - tensorboard \ - transformers \ - hf_transfer - -CMD ["/bin/bash"] \ No newline at end of file diff --git a/diffusers/docker/diffusers-onnxruntime-cpu/Dockerfile b/diffusers/docker/diffusers-onnxruntime-cpu/Dockerfile deleted file mode 100644 index 6f4b13e8a9ba0368bcc55336e7574f326b9cef80..0000000000000000000000000000000000000000 --- a/diffusers/docker/diffusers-onnxruntime-cpu/Dockerfile +++ /dev/null @@ -1,49 +0,0 @@ -FROM ubuntu:20.04 -LABEL maintainer="Hugging Face" -LABEL repository="diffusers" - -ENV DEBIAN_FRONTEND=noninteractive - -RUN apt-get -y update \ - && apt-get install -y software-properties-common \ - && add-apt-repository ppa:deadsnakes/ppa - -RUN apt install -y bash \ - build-essential \ - git \ - git-lfs \ - curl \ - ca-certificates \ - libsndfile1-dev \ - libgl1 \ - python3.10 \ - python3-pip \ - 
python3.10-venv && \ - rm -rf /var/lib/apt/lists - -# make sure to use venv -RUN python3.10 -m venv /opt/venv -ENV PATH="/opt/venv/bin:$PATH" - -# pre-install the heavy dependencies (these can later be overridden by the deps from setup.py) -RUN python3 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \ - python3 -m uv pip install --no-cache-dir \ - torch==2.1.2 \ - torchvision==0.16.2 \ - torchaudio==2.1.2 \ - onnxruntime \ - --extra-index-url https://download.pytorch.org/whl/cpu && \ - python3 -m uv pip install --no-cache-dir \ - accelerate \ - datasets \ - hf-doc-builder \ - huggingface-hub \ - Jinja2 \ - librosa \ - numpy==1.26.4 \ - scipy \ - tensorboard \ - transformers \ - hf_transfer - -CMD ["/bin/bash"] \ No newline at end of file diff --git a/diffusers/docker/diffusers-onnxruntime-cuda/Dockerfile b/diffusers/docker/diffusers-onnxruntime-cuda/Dockerfile deleted file mode 100644 index 6124172e109eda37b627bf8bdc7759f4e4ca563c..0000000000000000000000000000000000000000 --- a/diffusers/docker/diffusers-onnxruntime-cuda/Dockerfile +++ /dev/null @@ -1,50 +0,0 @@ -FROM nvidia/cuda:12.1.0-runtime-ubuntu20.04 -LABEL maintainer="Hugging Face" -LABEL repository="diffusers" - -ENV DEBIAN_FRONTEND=noninteractive - -RUN apt-get -y update \ - && apt-get install -y software-properties-common \ - && add-apt-repository ppa:deadsnakes/ppa - -RUN apt install -y bash \ - build-essential \ - git \ - git-lfs \ - curl \ - ca-certificates \ - libsndfile1-dev \ - libgl1 \ - python3.10 \ - python3-pip \ - python3.10-venv && \ - rm -rf /var/lib/apt/lists - -# make sure to use venv -RUN python3.10 -m venv /opt/venv -ENV PATH="/opt/venv/bin:$PATH" - -# pre-install the heavy dependencies (these can later be overridden by the deps from setup.py) -RUN python3.10 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \ - python3.10 -m uv pip install --no-cache-dir \ - "torch<2.5.0" \ - torchvision \ - torchaudio \ - "onnxruntime-gpu>=1.13.1" \ - --extra-index-url https://download.pytorch.org/whl/cu117 && \ - python3.10 -m uv pip install --no-cache-dir \ - accelerate \ - datasets \ - hf-doc-builder \ - huggingface-hub \ - hf_transfer \ - Jinja2 \ - librosa \ - numpy==1.26.4 \ - scipy \ - tensorboard \ - transformers \ - hf_transfer - -CMD ["/bin/bash"] \ No newline at end of file diff --git a/diffusers/docker/diffusers-pytorch-compile-cuda/Dockerfile b/diffusers/docker/diffusers-pytorch-compile-cuda/Dockerfile deleted file mode 100644 index 9d7578f5a4dc6b8a7f3ecfb02361e0ef188df5d6..0000000000000000000000000000000000000000 --- a/diffusers/docker/diffusers-pytorch-compile-cuda/Dockerfile +++ /dev/null @@ -1,50 +0,0 @@ -FROM nvidia/cuda:12.1.0-runtime-ubuntu20.04 -LABEL maintainer="Hugging Face" -LABEL repository="diffusers" - -ENV DEBIAN_FRONTEND=noninteractive - -RUN apt-get -y update \ - && apt-get install -y software-properties-common \ - && add-apt-repository ppa:deadsnakes/ppa - -RUN apt install -y bash \ - build-essential \ - git \ - git-lfs \ - curl \ - ca-certificates \ - libsndfile1-dev \ - libgl1 \ - python3.10 \ - python3.10-dev \ - python3-pip \ - python3.10-venv && \ - rm -rf /var/lib/apt/lists - -# make sure to use venv -RUN python3.10 -m venv /opt/venv -ENV PATH="/opt/venv/bin:$PATH" - -# pre-install the heavy dependencies (these can later be overridden by the deps from setup.py) -RUN python3.10 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \ - python3.10 -m uv pip install --no-cache-dir \ - "torch<2.5.0" \ - torchvision \ - torchaudio \ - invisible_watermark && \ - python3.10 -m 
pip install --no-cache-dir \ - accelerate \ - datasets \ - hf-doc-builder \ - huggingface-hub \ - hf_transfer \ - Jinja2 \ - librosa \ - numpy==1.26.4 \ - scipy \ - tensorboard \ - transformers \ - hf_transfer - -CMD ["/bin/bash"] diff --git a/diffusers/docker/diffusers-pytorch-cpu/Dockerfile b/diffusers/docker/diffusers-pytorch-cpu/Dockerfile deleted file mode 100644 index 1b39e58ca273b14cfd2c4ec6f3ce743f9b3a3854..0000000000000000000000000000000000000000 --- a/diffusers/docker/diffusers-pytorch-cpu/Dockerfile +++ /dev/null @@ -1,50 +0,0 @@ -FROM ubuntu:20.04 -LABEL maintainer="Hugging Face" -LABEL repository="diffusers" - -ENV DEBIAN_FRONTEND=noninteractive - -RUN apt-get -y update \ - && apt-get install -y software-properties-common \ - && add-apt-repository ppa:deadsnakes/ppa - -RUN apt install -y bash \ - build-essential \ - git \ - git-lfs \ - curl \ - ca-certificates \ - libsndfile1-dev \ - python3.10 \ - python3.10-dev \ - python3-pip \ - libgl1 \ - python3.10-venv && \ - rm -rf /var/lib/apt/lists - -# make sure to use venv -RUN python3.10 -m venv /opt/venv -ENV PATH="/opt/venv/bin:$PATH" - -# pre-install the heavy dependencies (these can later be overridden by the deps from setup.py) -RUN python3.10 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \ - python3.10 -m uv pip install --no-cache-dir \ - "torch<2.5.0" \ - torchvision \ - torchaudio \ - invisible_watermark \ - --extra-index-url https://download.pytorch.org/whl/cpu && \ - python3.10 -m uv pip install --no-cache-dir \ - accelerate \ - datasets \ - hf-doc-builder \ - huggingface-hub \ - Jinja2 \ - librosa \ - numpy==1.26.4 \ - scipy \ - tensorboard \ - transformers matplotlib \ - hf_transfer - -CMD ["/bin/bash"] diff --git a/diffusers/docker/diffusers-pytorch-cuda/Dockerfile b/diffusers/docker/diffusers-pytorch-cuda/Dockerfile deleted file mode 100644 index 7317ef642aa5749f53c9dc3a8458b0e4664de601..0000000000000000000000000000000000000000 --- a/diffusers/docker/diffusers-pytorch-cuda/Dockerfile +++ /dev/null @@ -1,51 +0,0 @@ -FROM nvidia/cuda:12.1.0-runtime-ubuntu20.04 -LABEL maintainer="Hugging Face" -LABEL repository="diffusers" - -ENV DEBIAN_FRONTEND=noninteractive - -RUN apt-get -y update \ - && apt-get install -y software-properties-common \ - && add-apt-repository ppa:deadsnakes/ppa - -RUN apt install -y bash \ - build-essential \ - git \ - git-lfs \ - curl \ - ca-certificates \ - libsndfile1-dev \ - libgl1 \ - python3.10 \ - python3.10-dev \ - python3-pip \ - python3.10-venv && \ - rm -rf /var/lib/apt/lists - -# make sure to use venv -RUN python3.10 -m venv /opt/venv -ENV PATH="/opt/venv/bin:$PATH" - -# pre-install the heavy dependencies (these can later be overridden by the deps from setup.py) -RUN python3.10 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \ - python3.10 -m uv pip install --no-cache-dir \ - "torch<2.5.0" \ - torchvision \ - torchaudio \ - invisible_watermark && \ - python3.10 -m pip install --no-cache-dir \ - accelerate \ - datasets \ - hf-doc-builder \ - huggingface-hub \ - hf_transfer \ - Jinja2 \ - librosa \ - numpy==1.26.4 \ - scipy \ - tensorboard \ - transformers \ - pytorch-lightning \ - hf_transfer - -CMD ["/bin/bash"] diff --git a/diffusers/docker/diffusers-pytorch-xformers-cuda/Dockerfile b/diffusers/docker/diffusers-pytorch-xformers-cuda/Dockerfile deleted file mode 100644 index 356445a6d173375f2956920fd8c55bac48c5f213..0000000000000000000000000000000000000000 --- a/diffusers/docker/diffusers-pytorch-xformers-cuda/Dockerfile +++ /dev/null @@ -1,51 +0,0 @@ -FROM 
nvidia/cuda:12.1.0-runtime-ubuntu20.04 -LABEL maintainer="Hugging Face" -LABEL repository="diffusers" - -ENV DEBIAN_FRONTEND=noninteractive - -RUN apt-get -y update \ - && apt-get install -y software-properties-common \ - && add-apt-repository ppa:deadsnakes/ppa - -RUN apt install -y bash \ - build-essential \ - git \ - git-lfs \ - curl \ - ca-certificates \ - libsndfile1-dev \ - libgl1 \ - python3.10 \ - python3.10-dev \ - python3-pip \ - python3.10-venv && \ - rm -rf /var/lib/apt/lists - -# make sure to use venv -RUN python3.10 -m venv /opt/venv -ENV PATH="/opt/venv/bin:$PATH" - -# pre-install the heavy dependencies (these can later be overridden by the deps from setup.py) -RUN python3.10 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \ - python3.10 -m pip install --no-cache-dir \ - "torch<2.5.0" \ - torchvision \ - torchaudio \ - invisible_watermark && \ - python3.10 -m uv pip install --no-cache-dir \ - accelerate \ - datasets \ - hf-doc-builder \ - huggingface-hub \ - hf_transfer \ - Jinja2 \ - librosa \ - numpy==1.26.4 \ - scipy \ - tensorboard \ - transformers \ - xformers \ - hf_transfer - -CMD ["/bin/bash"] diff --git a/diffusers/docs/README.md b/diffusers/docs/README.md deleted file mode 100644 index f36b76fb07891cc556db8ac30633abcea01c4a41..0000000000000000000000000000000000000000 --- a/diffusers/docs/README.md +++ /dev/null @@ -1,268 +0,0 @@ - - -# Generating the documentation - -To generate the documentation, you first have to build it. Several packages are necessary to build the doc, -you can install them with the following command, at the root of the code repository: - -```bash -pip install -e ".[docs]" -``` - -Then you need to install our open source documentation builder tool: - -```bash -pip install git+https://github.com/huggingface/doc-builder -``` - ---- -**NOTE** - -You only need to generate the documentation to inspect it locally (if you're planning changes and want to -check how they look before committing for instance). You don't have to commit the built documentation. - ---- - -## Previewing the documentation - -To preview the docs, first install the `watchdog` module with: - -```bash -pip install watchdog -``` - -Then run the following command: - -```bash -doc-builder preview {package_name} {path_to_docs} -``` - -For example: - -```bash -doc-builder preview diffusers docs/source/en -``` - -The docs will be viewable at [http://localhost:3000](http://localhost:3000). You can also preview the docs once you have opened a PR. You will see a bot add a comment to a link where the documentation with your changes lives. - ---- -**NOTE** - -The `preview` command only works with existing doc files. When you add a completely new file, you need to update `_toctree.yml` & restart `preview` command (`ctrl-c` to stop it & call `doc-builder preview ...` again). - ---- - -## Adding a new element to the navigation bar - -Accepted files are Markdown (.md). - -Create a file with its extension and put it in the source directory. You can then link it to the toc-tree by putting -the filename without the extension in the [`_toctree.yml`](https://github.com/huggingface/diffusers/blob/main/docs/source/en/_toctree.yml) file. - -## Renaming section headers and moving sections - -It helps to keep the old links working when renaming the section header and/or moving sections from one document to another. 
This is because the old links are likely to be used in Issues, Forums, and Social media and it'd make for a much more superior user experience if users reading those months later could still easily navigate to the originally intended information. - -Therefore, we simply keep a little map of moved sections at the end of the document where the original section was. The key is to preserve the original anchor. - -So if you renamed a section from: "Section A" to "Section B", then you can add at the end of the file: - -```md -Sections that were moved: - -[ Section A ] -``` -and of course, if you moved it to another file, then: - -```md -Sections that were moved: - -[ Section A ] -``` - -Use the relative style to link to the new file so that the versioned docs continue to work. - -For an example of a rich moved section set please see the very end of [the transformers Trainer doc](https://github.com/huggingface/transformers/blob/main/docs/source/en/main_classes/trainer.md). - - -## Writing Documentation - Specification - -The `huggingface/diffusers` documentation follows the -[Google documentation](https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html) style for docstrings, -although we can write them directly in Markdown. - -### Adding a new tutorial - -Adding a new tutorial or section is done in two steps: - -- Add a new Markdown (.md) file under `docs/source/`. -- Link that file in `docs/source//_toctree.yml` on the correct toc-tree. - -Make sure to put your new file under the proper section. It's unlikely to go in the first section (*Get Started*), so -depending on the intended targets (beginners, more advanced users, or researchers) it should go in sections two, three, or four. - -### Adding a new pipeline/scheduler - -When adding a new pipeline: - -- Create a file `xxx.md` under `docs/source//api/pipelines` (don't hesitate to copy an existing file as template). -- Link that file in (*Diffusers Summary*) section in `docs/source/api/pipelines/overview.md`, along with the link to the paper, and a colab notebook (if available). -- Write a short overview of the diffusion model: - - Overview with paper & authors - - Paper abstract - - Tips and tricks and how to use it best - - Possible an end-to-end example of how to use it -- Add all the pipeline classes that should be linked in the diffusion model. These classes should be added using our Markdown syntax. By default as follows: - -``` -[[autodoc]] XXXPipeline - - all - - __call__ -``` - -This will include every public method of the pipeline that is documented, as well as the `__call__` method that is not documented by default. If you just want to add additional methods that are not documented, you can put the list of all methods to add in a list that contains `all`. - -``` -[[autodoc]] XXXPipeline - - all - - __call__ - - enable_attention_slicing - - disable_attention_slicing - - enable_xformers_memory_efficient_attention - - disable_xformers_memory_efficient_attention -``` - -You can follow the same process to create a new scheduler under the `docs/source//api/schedulers` folder. - -### Writing source documentation - -Values that should be put in `code` should either be surrounded by backticks: \`like so\`. Note that argument names -and objects like True, None, or any strings should usually be put in `code`. - -When mentioning a class, function, or method, it is recommended to use our syntax for internal links so that our tool -adds a link to its documentation with this syntax: \[\`XXXClass\`\] or \[\`function\`\]. 
This requires the class or -function to be in the main package. - -If you want to create a link to some internal class or function, you need to -provide its path. For instance: \[\`pipelines.ImagePipelineOutput\`\]. This will be converted into a link with -`pipelines.ImagePipelineOutput` in the description. To get rid of the path and only keep the name of the object you are -linking to in the description, add a ~: \[\`~pipelines.ImagePipelineOutput\`\] will generate a link with `ImagePipelineOutput` in the description. - -The same works for methods so you can either use \[\`XXXClass.method\`\] or \[\`~XXXClass.method\`\]. - -#### Defining arguments in a method - -Arguments should be defined with the `Args:` (or `Arguments:` or `Parameters:`) prefix, followed by a line return and -an indentation. The argument should be followed by its type, with its shape if it is a tensor, a colon, and its -description: - -``` - Args: - n_layers (`int`): The number of layers of the model. -``` - -If the description is too long to fit in one line, another indentation is necessary before writing the description -after the argument. - -Here's an example showcasing everything so far: - -``` - Args: - input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`): - Indices of input sequence tokens in the vocabulary. - - Indices can be obtained using [`AlbertTokenizer`]. See [`~PreTrainedTokenizer.encode`] and - [`~PreTrainedTokenizer.__call__`] for details. - - [What are input IDs?](../glossary#input-ids) -``` - -For optional arguments or arguments with defaults we follow the following syntax: imagine we have a function with the -following signature: - -```py -def my_function(x: str=None, a: float=3.14): -``` - -then its documentation should look like this: - -``` - Args: - x (`str`, *optional*): - This argument controls ... - a (`float`, *optional*, defaults to `3.14`): - This argument is used to ... -``` - -Note that we always omit the "defaults to \`None\`" when None is the default for any argument. Also note that even -if the first line describing your argument type and its default gets long, you can't break it on several lines. You can -however write as many lines as you want in the indented description (see the example above with `input_ids`). - -#### Writing a multi-line code block - -Multi-line code blocks can be useful for displaying examples. They are done between two lines of three backticks as usual in Markdown: - - -```` -``` -# first line of code -# second line -# etc -``` -```` - -#### Writing a return block - -The return block should be introduced with the `Returns:` prefix, followed by a line return and an indentation. -The first line should be the type of the return, followed by a line return. No need to indent further for the elements -building the return. - -Here's an example of a single value return: - -``` - Returns: - `List[int]`: A list of integers in the range [0, 1] --- 1 for a special token, 0 for a sequence token. -``` - -Here's an example of a tuple return, comprising several objects: - -``` - Returns: - `tuple(torch.Tensor)` comprising various elements depending on the configuration ([`BertConfig`]) and inputs: - - ** loss** (*optional*, returned when `masked_lm_labels` is provided) `torch.Tensor` of shape `(1,)` -- - Total loss is the sum of the masked language modeling loss and the next sequence prediction (classification) loss. 
- - **prediction_scores** (`torch.Tensor` of shape `(batch_size, sequence_length, config.vocab_size)`) -- - Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). -``` - -#### Adding an image - -Due to the rapidly growing repository, it is important to make sure that no files that would significantly weigh down the repository are added. This includes images, videos, and other non-text files. We prefer to leverage a hf.co hosted `dataset` like -the ones hosted on [`hf-internal-testing`](https://huggingface.co/hf-internal-testing) in which to place these files and reference -them by URL. We recommend putting them in the following dataset: [huggingface/documentation-images](https://huggingface.co/datasets/huggingface/documentation-images). -If an external contribution, feel free to add the images to your PR and ask a Hugging Face member to migrate your images -to this dataset. - -## Styling the docstring - -We have an automatic script running with the `make style` command that will make sure that: -- the docstrings fully take advantage of the line width -- all code examples are formatted using black, like the code of the Transformers library - -This script may have some weird failures if you made a syntax mistake or if you uncover a bug. Therefore, it's -recommended to commit your changes before running `make style`, so you can revert the changes done by that script -easily. diff --git a/diffusers/docs/TRANSLATING.md b/diffusers/docs/TRANSLATING.md deleted file mode 100644 index f88bec8595c87cc329fea49c922ccdf592d2a3bb..0000000000000000000000000000000000000000 --- a/diffusers/docs/TRANSLATING.md +++ /dev/null @@ -1,69 +0,0 @@ - - -### Translating the Diffusers documentation into your language - -As part of our mission to democratize machine learning, we'd love to make the Diffusers library available in many more languages! Follow the steps below if you want to help translate the documentation into your language 🙏. - -**🗞️ Open an issue** - -To get started, navigate to the [Issues](https://github.com/huggingface/diffusers/issues) page of this repo and check if anyone else has opened an issue for your language. If not, open a new issue by selecting the "🌐 Translating a New Language?" from the "New issue" button. - -Once an issue exists, post a comment to indicate which chapters you'd like to work on, and we'll add your name to the list. - - -**🍴 Fork the repository** - -First, you'll need to [fork the Diffusers repo](https://docs.github.com/en/get-started/quickstart/fork-a-repo). You can do this by clicking on the **Fork** button on the top-right corner of this repo's page. - -Once you've forked the repo, you'll want to get the files on your local machine for editing. You can do that by cloning the fork with Git as follows: - -```bash -git clone https://github.com//diffusers.git -``` - -**📋 Copy-paste the English version with a new language code** - -The documentation files are in one leading directory: - -- [`docs/source`](https://github.com/huggingface/diffusers/tree/main/docs/source): All the documentation materials are organized here by language. 
- -You'll only need to copy the files in the [`docs/source/en`](https://github.com/huggingface/diffusers/tree/main/docs/source/en) directory, so first navigate to your fork of the repo and run the following: - -```bash -cd ~/path/to/diffusers/docs -cp -r source/en source/ -``` - -Here, `` should be one of the ISO 639-1 or ISO 639-2 language codes -- see [here](https://www.loc.gov/standards/iso639-2/php/code_list.php) for a handy table. - -**✍️ Start translating** - -The fun part comes - translating the text! - -The first thing we recommend is translating the part of the `_toctree.yml` file that corresponds to your doc chapter. This file is used to render the table of contents on the website. - -> 🙋 If the `_toctree.yml` file doesn't yet exist for your language, you can create one by copy-pasting from the English version and deleting the sections unrelated to your chapter. Just make sure it exists in the `docs/source//` directory! - -The fields you should add are `local` (with the name of the file containing the translation; e.g. `autoclass_tutorial`), and `title` (with the title of the doc in your language; e.g. `Load pretrained instances with an AutoClass`) -- as a reference, here is the `_toctree.yml` for [English](https://github.com/huggingface/diffusers/blob/main/docs/source/en/_toctree.yml): - -```yaml -- sections: - - local: pipeline_tutorial # Do not change this! Use the same name for your .md file - title: Pipelines for inference # Translate this! - ... - title: Tutorials # Translate this! -``` - -Once you have translated the `_toctree.yml` file, you can start translating the [MDX](https://mdxjs.com/) files associated with your docs chapter. - -> 🙋 If you'd like others to help you with the translation, you should [open an issue](https://github.com/huggingface/diffusers/issues) and tag @patrickvonplaten. diff --git a/diffusers/docs/source/_config.py b/diffusers/docs/source/_config.py deleted file mode 100644 index 3d0d73dcb951ea5b8b91e255d79b893a2a103ed3..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/_config.py +++ /dev/null @@ -1,9 +0,0 @@ -# docstyle-ignore -INSTALL_CONTENT = """ -# Diffusers installation -! pip install diffusers transformers datasets accelerate -# To install from source instead of the last release, comment the command above and uncomment the following one. -# ! 
pip install git+https://github.com/huggingface/diffusers.git -""" - -notebook_first_cells = [{"type": "code", "content": INSTALL_CONTENT}] diff --git a/diffusers/docs/source/en/_toctree.yml b/diffusers/docs/source/en/_toctree.yml deleted file mode 100644 index de6cd2981b96dc56e06908b111d11a59e832ffa1..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/_toctree.yml +++ /dev/null @@ -1,570 +0,0 @@ -- sections: - - local: index - title: 🧨 Diffusers - - local: quicktour - title: Quicktour - - local: stable_diffusion - title: Effective and efficient diffusion - - local: installation - title: Installation - title: Get started -- sections: - - local: tutorials/tutorial_overview - title: Overview - - local: using-diffusers/write_own_pipeline - title: Understanding pipelines, models and schedulers - - local: tutorials/autopipeline - title: AutoPipeline - - local: tutorials/basic_training - title: Train a diffusion model - - local: tutorials/using_peft_for_inference - title: Load LoRAs for inference - - local: tutorials/fast_diffusion - title: Accelerate inference of text-to-image diffusion models - - local: tutorials/inference_with_big_models - title: Working with big models - title: Tutorials -- sections: - - local: using-diffusers/loading - title: Load pipelines - - local: using-diffusers/custom_pipeline_overview - title: Load community pipelines and components - - local: using-diffusers/schedulers - title: Load schedulers and models - - local: using-diffusers/other-formats - title: Model files and layouts - - local: using-diffusers/loading_adapters - title: Load adapters - - local: using-diffusers/push_to_hub - title: Push files to the Hub - title: Load pipelines and adapters -- sections: - - local: using-diffusers/unconditional_image_generation - title: Unconditional image generation - - local: using-diffusers/conditional_image_generation - title: Text-to-image - - local: using-diffusers/img2img - title: Image-to-image - - local: using-diffusers/inpaint - title: Inpainting - - local: using-diffusers/text-img2vid - title: Text or image-to-video - - local: using-diffusers/depth2img - title: Depth-to-image - title: Generative tasks -- sections: - - local: using-diffusers/overview_techniques - title: Overview - - local: training/distributed_inference - title: Distributed inference - - local: using-diffusers/merge_loras - title: Merge LoRAs - - local: using-diffusers/scheduler_features - title: Scheduler features - - local: using-diffusers/callback - title: Pipeline callbacks - - local: using-diffusers/reusing_seeds - title: Reproducible pipelines - - local: using-diffusers/image_quality - title: Controlling image quality - - local: using-diffusers/weighted_prompts - title: Prompt techniques - title: Inference techniques -- sections: - - local: advanced_inference/outpaint - title: Outpainting - title: Advanced inference -- sections: - - local: using-diffusers/cogvideox - title: CogVideoX - - local: using-diffusers/sdxl - title: Stable Diffusion XL - - local: using-diffusers/sdxl_turbo - title: SDXL Turbo - - local: using-diffusers/kandinsky - title: Kandinsky - - local: using-diffusers/ip_adapter - title: IP-Adapter - - local: using-diffusers/pag - title: PAG - - local: using-diffusers/controlnet - title: ControlNet - - local: using-diffusers/t2i_adapter - title: T2I-Adapter - - local: using-diffusers/inference_with_lcm - title: Latent Consistency Model - - local: using-diffusers/textual_inversion_inference - title: Textual inversion - - local: using-diffusers/shap-e - title: 
Shap-E - - local: using-diffusers/diffedit - title: DiffEdit - - local: using-diffusers/inference_with_tcd_lora - title: Trajectory Consistency Distillation-LoRA - - local: using-diffusers/svd - title: Stable Video Diffusion - - local: using-diffusers/marigold_usage - title: Marigold Computer Vision - title: Specific pipeline examples -- sections: - - local: training/overview - title: Overview - - local: training/create_dataset - title: Create a dataset for training - - local: training/adapt_a_model - title: Adapt a model to a new task - - isExpanded: false - sections: - - local: training/unconditional_training - title: Unconditional image generation - - local: training/text2image - title: Text-to-image - - local: training/sdxl - title: Stable Diffusion XL - - local: training/kandinsky - title: Kandinsky 2.2 - - local: training/wuerstchen - title: Wuerstchen - - local: training/controlnet - title: ControlNet - - local: training/t2i_adapters - title: T2I-Adapters - - local: training/instructpix2pix - title: InstructPix2Pix - - local: training/cogvideox - title: CogVideoX - title: Models - - isExpanded: false - sections: - - local: training/text_inversion - title: Textual Inversion - - local: training/dreambooth - title: DreamBooth - - local: training/lora - title: LoRA - - local: training/custom_diffusion - title: Custom Diffusion - - local: training/lcm_distill - title: Latent Consistency Distillation - - local: training/ddpo - title: Reinforcement learning training with DDPO - title: Methods - title: Training -- sections: - - local: quantization/overview - title: Getting Started - - local: quantization/bitsandbytes - title: bitsandbytes - title: Quantization Methods -- sections: - - local: optimization/fp16 - title: Speed up inference - - local: optimization/memory - title: Reduce memory usage - - local: optimization/torch2.0 - title: PyTorch 2.0 - - local: optimization/xformers - title: xFormers - - local: optimization/tome - title: Token merging - - local: optimization/deepcache - title: DeepCache - - local: optimization/tgate - title: TGATE - - local: optimization/xdit - title: xDiT - - sections: - - local: using-diffusers/stable_diffusion_jax_how_to - title: JAX/Flax - - local: optimization/onnx - title: ONNX - - local: optimization/open_vino - title: OpenVINO - - local: optimization/coreml - title: Core ML - title: Optimized model formats - - sections: - - local: optimization/mps - title: Metal Performance Shaders (MPS) - - local: optimization/habana - title: Habana Gaudi - - local: optimization/neuron - title: AWS Neuron - title: Optimized hardware - title: Accelerate inference and reduce memory -- sections: - - local: conceptual/philosophy - title: Philosophy - - local: using-diffusers/controlling_generation - title: Controlled generation - - local: conceptual/contribution - title: How to contribute? 
- - local: conceptual/ethical_guidelines - title: Diffusers' Ethical Guidelines - - local: conceptual/evaluation - title: Evaluating Diffusion Models - title: Conceptual Guides -- sections: - - local: community_projects - title: Projects built with Diffusers - title: Community Projects -- sections: - - isExpanded: false - sections: - - local: api/configuration - title: Configuration - - local: api/logging - title: Logging - - local: api/outputs - title: Outputs - - local: api/quantization - title: Quantization - title: Main Classes - - isExpanded: false - sections: - - local: api/loaders/ip_adapter - title: IP-Adapter - - local: api/loaders/lora - title: LoRA - - local: api/loaders/single_file - title: Single files - - local: api/loaders/textual_inversion - title: Textual Inversion - - local: api/loaders/unet - title: UNet - - local: api/loaders/peft - title: PEFT - title: Loaders - - isExpanded: false - sections: - - local: api/models/overview - title: Overview - - sections: - - local: api/models/controlnet - title: ControlNetModel - - local: api/models/controlnet_flux - title: FluxControlNetModel - - local: api/models/controlnet_hunyuandit - title: HunyuanDiT2DControlNetModel - - local: api/models/controlnet_sd3 - title: SD3ControlNetModel - - local: api/models/controlnet_sparsectrl - title: SparseControlNetModel - title: ControlNets - - sections: - - local: api/models/allegro_transformer3d - title: AllegroTransformer3DModel - - local: api/models/aura_flow_transformer2d - title: AuraFlowTransformer2DModel - - local: api/models/cogvideox_transformer3d - title: CogVideoXTransformer3DModel - - local: api/models/cogview3plus_transformer2d - title: CogView3PlusTransformer2DModel - - local: api/models/dit_transformer2d - title: DiTTransformer2DModel - - local: api/models/flux_transformer - title: FluxTransformer2DModel - - local: api/models/hunyuan_transformer2d - title: HunyuanDiT2DModel - - local: api/models/latte_transformer3d - title: LatteTransformer3DModel - - local: api/models/lumina_nextdit2d - title: LuminaNextDiT2DModel - - local: api/models/mochi_transformer3d - title: MochiTransformer3DModel - - local: api/models/pixart_transformer2d - title: PixArtTransformer2DModel - - local: api/models/prior_transformer - title: PriorTransformer - - local: api/models/sd3_transformer2d - title: SD3Transformer2DModel - - local: api/models/stable_audio_transformer - title: StableAudioDiTModel - - local: api/models/transformer2d - title: Transformer2DModel - - local: api/models/transformer_temporal - title: TransformerTemporalModel - title: Transformers - - sections: - - local: api/models/stable_cascade_unet - title: StableCascadeUNet - - local: api/models/unet - title: UNet1DModel - - local: api/models/unet2d - title: UNet2DModel - - local: api/models/unet2d-cond - title: UNet2DConditionModel - - local: api/models/unet3d-cond - title: UNet3DConditionModel - - local: api/models/unet-motion - title: UNetMotionModel - - local: api/models/uvit2d - title: UViT2DModel - title: UNets - - sections: - - local: api/models/autoencoderkl - title: AutoencoderKL - - local: api/models/autoencoderkl_allegro - title: AutoencoderKLAllegro - - local: api/models/autoencoderkl_cogvideox - title: AutoencoderKLCogVideoX - - local: api/models/autoencoderkl_mochi - title: AutoencoderKLMochi - - local: api/models/asymmetricautoencoderkl - title: AsymmetricAutoencoderKL - - local: api/models/consistency_decoder_vae - title: ConsistencyDecoderVAE - - local: api/models/autoencoder_oobleck - title: Oobleck AutoEncoder - - 
local: api/models/autoencoder_tiny - title: Tiny AutoEncoder - - local: api/models/vq - title: VQModel - title: VAEs - title: Models - - isExpanded: false - sections: - - local: api/pipelines/overview - title: Overview - - local: api/pipelines/allegro - title: Allegro - - local: api/pipelines/amused - title: aMUSEd - - local: api/pipelines/animatediff - title: AnimateDiff - - local: api/pipelines/attend_and_excite - title: Attend-and-Excite - - local: api/pipelines/audioldm - title: AudioLDM - - local: api/pipelines/audioldm2 - title: AudioLDM 2 - - local: api/pipelines/aura_flow - title: AuraFlow - - local: api/pipelines/auto_pipeline - title: AutoPipeline - - local: api/pipelines/blip_diffusion - title: BLIP-Diffusion - - local: api/pipelines/cogvideox - title: CogVideoX - - local: api/pipelines/cogview3 - title: CogView3 - - local: api/pipelines/consistency_models - title: Consistency Models - - local: api/pipelines/controlnet - title: ControlNet - - local: api/pipelines/controlnet_flux - title: ControlNet with Flux.1 - - local: api/pipelines/controlnet_hunyuandit - title: ControlNet with Hunyuan-DiT - - local: api/pipelines/controlnet_sd3 - title: ControlNet with Stable Diffusion 3 - - local: api/pipelines/controlnet_sdxl - title: ControlNet with Stable Diffusion XL - - local: api/pipelines/controlnetxs - title: ControlNet-XS - - local: api/pipelines/controlnetxs_sdxl - title: ControlNet-XS with Stable Diffusion XL - - local: api/pipelines/dance_diffusion - title: Dance Diffusion - - local: api/pipelines/ddim - title: DDIM - - local: api/pipelines/ddpm - title: DDPM - - local: api/pipelines/deepfloyd_if - title: DeepFloyd IF - - local: api/pipelines/diffedit - title: DiffEdit - - local: api/pipelines/dit - title: DiT - - local: api/pipelines/flux - title: Flux - - local: api/pipelines/hunyuandit - title: Hunyuan-DiT - - local: api/pipelines/i2vgenxl - title: I2VGen-XL - - local: api/pipelines/pix2pix - title: InstructPix2Pix - - local: api/pipelines/kandinsky - title: Kandinsky 2.1 - - local: api/pipelines/kandinsky_v22 - title: Kandinsky 2.2 - - local: api/pipelines/kandinsky3 - title: Kandinsky 3 - - local: api/pipelines/kolors - title: Kolors - - local: api/pipelines/latent_consistency_models - title: Latent Consistency Models - - local: api/pipelines/latent_diffusion - title: Latent Diffusion - - local: api/pipelines/latte - title: Latte - - local: api/pipelines/ledits_pp - title: LEDITS++ - - local: api/pipelines/lumina - title: Lumina-T2X - - local: api/pipelines/marigold - title: Marigold - - local: api/pipelines/mochi - title: Mochi - - local: api/pipelines/panorama - title: MultiDiffusion - - local: api/pipelines/musicldm - title: MusicLDM - - local: api/pipelines/pag - title: PAG - - local: api/pipelines/paint_by_example - title: Paint by Example - - local: api/pipelines/pia - title: Personalized Image Animator (PIA) - - local: api/pipelines/pixart - title: PixArt-α - - local: api/pipelines/pixart_sigma - title: PixArt-Σ - - local: api/pipelines/self_attention_guidance - title: Self-Attention Guidance - - local: api/pipelines/semantic_stable_diffusion - title: Semantic Guidance - - local: api/pipelines/shap_e - title: Shap-E - - local: api/pipelines/stable_audio - title: Stable Audio - - local: api/pipelines/stable_cascade - title: Stable Cascade - - sections: - - local: api/pipelines/stable_diffusion/overview - title: Overview - - local: api/pipelines/stable_diffusion/text2img - title: Text-to-image - - local: api/pipelines/stable_diffusion/img2img - title: Image-to-image - 
- local: api/pipelines/stable_diffusion/svd - title: Image-to-video - - local: api/pipelines/stable_diffusion/inpaint - title: Inpainting - - local: api/pipelines/stable_diffusion/depth2img - title: Depth-to-image - - local: api/pipelines/stable_diffusion/image_variation - title: Image variation - - local: api/pipelines/stable_diffusion/stable_diffusion_safe - title: Safe Stable Diffusion - - local: api/pipelines/stable_diffusion/stable_diffusion_2 - title: Stable Diffusion 2 - - local: api/pipelines/stable_diffusion/stable_diffusion_3 - title: Stable Diffusion 3 - - local: api/pipelines/stable_diffusion/stable_diffusion_xl - title: Stable Diffusion XL - - local: api/pipelines/stable_diffusion/sdxl_turbo - title: SDXL Turbo - - local: api/pipelines/stable_diffusion/latent_upscale - title: Latent upscaler - - local: api/pipelines/stable_diffusion/upscale - title: Super-resolution - - local: api/pipelines/stable_diffusion/k_diffusion - title: K-Diffusion - - local: api/pipelines/stable_diffusion/ldm3d_diffusion - title: LDM3D Text-to-(RGB, Depth), Text-to-(RGB-pano, Depth-pano), LDM3D Upscaler - - local: api/pipelines/stable_diffusion/adapter - title: T2I-Adapter - - local: api/pipelines/stable_diffusion/gligen - title: GLIGEN (Grounded Language-to-Image Generation) - title: Stable Diffusion - - local: api/pipelines/stable_unclip - title: Stable unCLIP - - local: api/pipelines/text_to_video - title: Text-to-video - - local: api/pipelines/text_to_video_zero - title: Text2Video-Zero - - local: api/pipelines/unclip - title: unCLIP - - local: api/pipelines/unidiffuser - title: UniDiffuser - - local: api/pipelines/value_guided_sampling - title: Value-guided sampling - - local: api/pipelines/wuerstchen - title: Wuerstchen - title: Pipelines - - isExpanded: false - sections: - - local: api/schedulers/overview - title: Overview - - local: api/schedulers/cm_stochastic_iterative - title: CMStochasticIterativeScheduler - - local: api/schedulers/consistency_decoder - title: ConsistencyDecoderScheduler - - local: api/schedulers/cosine_dpm - title: CosineDPMSolverMultistepScheduler - - local: api/schedulers/ddim_inverse - title: DDIMInverseScheduler - - local: api/schedulers/ddim - title: DDIMScheduler - - local: api/schedulers/ddpm - title: DDPMScheduler - - local: api/schedulers/deis - title: DEISMultistepScheduler - - local: api/schedulers/multistep_dpm_solver_inverse - title: DPMSolverMultistepInverse - - local: api/schedulers/multistep_dpm_solver - title: DPMSolverMultistepScheduler - - local: api/schedulers/dpm_sde - title: DPMSolverSDEScheduler - - local: api/schedulers/singlestep_dpm_solver - title: DPMSolverSinglestepScheduler - - local: api/schedulers/edm_multistep_dpm_solver - title: EDMDPMSolverMultistepScheduler - - local: api/schedulers/edm_euler - title: EDMEulerScheduler - - local: api/schedulers/euler_ancestral - title: EulerAncestralDiscreteScheduler - - local: api/schedulers/euler - title: EulerDiscreteScheduler - - local: api/schedulers/flow_match_euler_discrete - title: FlowMatchEulerDiscreteScheduler - - local: api/schedulers/flow_match_heun_discrete - title: FlowMatchHeunDiscreteScheduler - - local: api/schedulers/heun - title: HeunDiscreteScheduler - - local: api/schedulers/ipndm - title: IPNDMScheduler - - local: api/schedulers/stochastic_karras_ve - title: KarrasVeScheduler - - local: api/schedulers/dpm_discrete_ancestral - title: KDPM2AncestralDiscreteScheduler - - local: api/schedulers/dpm_discrete - title: KDPM2DiscreteScheduler - - local: api/schedulers/lcm - title: 
LCMScheduler - - local: api/schedulers/lms_discrete - title: LMSDiscreteScheduler - - local: api/schedulers/pndm - title: PNDMScheduler - - local: api/schedulers/repaint - title: RePaintScheduler - - local: api/schedulers/score_sde_ve - title: ScoreSdeVeScheduler - - local: api/schedulers/score_sde_vp - title: ScoreSdeVpScheduler - - local: api/schedulers/tcd - title: TCDScheduler - - local: api/schedulers/unipc - title: UniPCMultistepScheduler - - local: api/schedulers/vq_diffusion - title: VQDiffusionScheduler - title: Schedulers - - isExpanded: false - sections: - - local: api/internal_classes_overview - title: Overview - - local: api/attnprocessor - title: Attention Processor - - local: api/activations - title: Custom activation functions - - local: api/normalization - title: Custom normalization layers - - local: api/utilities - title: Utilities - - local: api/image_processor - title: VAE Image Processor - - local: api/video_processor - title: Video Processor - title: Internal classes - title: API diff --git a/diffusers/docs/source/en/advanced_inference/outpaint.md b/diffusers/docs/source/en/advanced_inference/outpaint.md deleted file mode 100644 index f3a7bd99d8fadbe8cd3920e5dd9fab1862a304c9..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/advanced_inference/outpaint.md +++ /dev/null @@ -1,231 +0,0 @@ - - -# Outpainting - -Outpainting extends an image beyond its original boundaries, allowing you to add, replace, or modify visual elements in an image while preserving the original image. Like [inpainting](../using-diffusers/inpaint), you want to fill the white area (in this case, the area outside of the original image) with new visual elements while keeping the original image (represented by a mask of black pixels). There are a couple of ways to outpaint, such as with a [ControlNet](https://hf.co/blog/OzzyGT/outpainting-controlnet) or with [Differential Diffusion](https://hf.co/blog/OzzyGT/outpainting-differential-diffusion). - -This guide will show you how to outpaint with an inpainting model, ControlNet, and a ZoeDepth estimator. - -Before you begin, make sure you have the [controlnet_aux](https://github.com/huggingface/controlnet_aux) library installed so you can use the ZoeDepth estimator. - -```py -!pip install -q controlnet_aux -``` - -## Image preparation - -Start by picking an image to outpaint with and remove the background with a Space like [BRIA-RMBG-1.4](https://hf.co/spaces/briaai/BRIA-RMBG-1.4). - - - -For example, remove the background from this image of a pair of shoes. - -
-[figure: side-by-side comparison of "original image" and "background removed"]
- -[Stable Diffusion XL (SDXL)](../using-diffusers/sdxl) models work best with 1024x1024 images, but you can resize the image to any size as long as your hardware has enough memory to support it. The transparent background in the image should also be replaced with a white background. Create a function (like the one below) that scales and pastes the image onto a white background. - -```py -import random - -import requests -import torch -from controlnet_aux import ZoeDetector -from PIL import Image, ImageOps - -from diffusers import ( - AutoencoderKL, - ControlNetModel, - StableDiffusionXLControlNetPipeline, - StableDiffusionXLInpaintPipeline, -) - -def scale_and_paste(original_image): - aspect_ratio = original_image.width / original_image.height - - if original_image.width > original_image.height: - new_width = 1024 - new_height = round(new_width / aspect_ratio) - else: - new_height = 1024 - new_width = round(new_height * aspect_ratio) - - resized_original = original_image.resize((new_width, new_height), Image.LANCZOS) - white_background = Image.new("RGBA", (1024, 1024), "white") - x = (1024 - new_width) // 2 - y = (1024 - new_height) // 2 - white_background.paste(resized_original, (x, y), resized_original) - - return resized_original, white_background - -original_image = Image.open( - requests.get( - "https://huggingface.co/datasets/stevhliu/testing-images/resolve/main/no-background-jordan.png", - stream=True, - ).raw -).convert("RGBA") -resized_img, white_bg_image = scale_and_paste(original_image) -``` - -To avoid adding unwanted extra details, use the ZoeDepth estimator to provide additional guidance during generation and to ensure the shoes remain consistent with the original image. - -```py -zoe = ZoeDetector.from_pretrained("lllyasviel/Annotators") -image_zoe = zoe(white_bg_image, detect_resolution=512, image_resolution=1024) -image_zoe -``` - -
- -## Outpaint - -Once your image is ready, you can generate content in the white area around the shoes with [controlnet-inpaint-dreamer-sdxl](https://hf.co/destitech/controlnet-inpaint-dreamer-sdxl), a SDXL ControlNet trained for inpainting. - -Load the inpainting ControlNet, ZoeDepth model, VAE and pass them to the [`StableDiffusionXLControlNetPipeline`]. Then you can create an optional `generate_image` function (for convenience) to outpaint an initial image. - -```py -controlnets = [ - ControlNetModel.from_pretrained( - "destitech/controlnet-inpaint-dreamer-sdxl", torch_dtype=torch.float16, variant="fp16" - ), - ControlNetModel.from_pretrained( - "diffusers/controlnet-zoe-depth-sdxl-1.0", torch_dtype=torch.float16 - ), -] -vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16).to("cuda") -pipeline = StableDiffusionXLControlNetPipeline.from_pretrained( - "SG161222/RealVisXL_V4.0", torch_dtype=torch.float16, variant="fp16", controlnet=controlnets, vae=vae -).to("cuda") - -def generate_image(prompt, negative_prompt, inpaint_image, zoe_image, seed: int = None): - if seed is None: - seed = random.randint(0, 2**32 - 1) - - generator = torch.Generator(device="cpu").manual_seed(seed) - - image = pipeline( - prompt, - negative_prompt=negative_prompt, - image=[inpaint_image, zoe_image], - guidance_scale=6.5, - num_inference_steps=25, - generator=generator, - controlnet_conditioning_scale=[0.5, 0.8], - control_guidance_end=[0.9, 0.6], - ).images[0] - - return image - -prompt = "nike air jordans on a basketball court" -negative_prompt = "" - -temp_image = generate_image(prompt, negative_prompt, white_bg_image, image_zoe, 908097) -``` - -Paste the original image over the initial outpainted image. You'll improve the outpainted background in a later step. - -```py -x = (1024 - resized_img.width) // 2 -y = (1024 - resized_img.height) // 2 -temp_image.paste(resized_img, (x, y), resized_img) -temp_image -``` - -
- -> [!TIP] -> Now is a good time to free up some memory if you're running low! -> -> ```py -> pipeline=None -> torch.cuda.empty_cache() -> ``` - -Now that you have an initial outpainted image, load the [`StableDiffusionXLInpaintPipeline`] with the [RealVisXL](https://hf.co/SG161222/RealVisXL_V4.0) model to generate the final outpainted image with better quality. - -```py -pipeline = StableDiffusionXLInpaintPipeline.from_pretrained( - "OzzyGT/RealVisXL_V4.0_inpainting", - torch_dtype=torch.float16, - variant="fp16", - vae=vae, -).to("cuda") -``` - -Prepare a mask for the final outpainted image. To create a more natural transition between the original image and the outpainted background, blur the mask to help it blend better. - -```py -mask = Image.new("L", temp_image.size) -mask.paste(resized_img.split()[3], (x, y)) -mask = ImageOps.invert(mask) -final_mask = mask.point(lambda p: p > 128 and 255) -mask_blurred = pipeline.mask_processor.blur(final_mask, blur_factor=20) -mask_blurred -``` - -
- -Create a better prompt and pass it to the `generate_outpaint` function to generate the final outpainted image. Again, paste the original image over the final outpainted background. - -```py -def generate_outpaint(prompt, negative_prompt, image, mask, seed: int = None): - if seed is None: - seed = random.randint(0, 2**32 - 1) - - generator = torch.Generator(device="cpu").manual_seed(seed) - - image = pipeline( - prompt, - negative_prompt=negative_prompt, - image=image, - mask_image=mask, - guidance_scale=10.0, - strength=0.8, - num_inference_steps=30, - generator=generator, - ).images[0] - - return image - -prompt = "high quality photo of nike air jordans on a basketball court, highly detailed" -negative_prompt = "" - -final_image = generate_outpaint(prompt, negative_prompt, temp_image, mask_blurred, 7688778) -x = (1024 - resized_img.width) // 2 -y = (1024 - resized_img.height) // 2 -final_image.paste(resized_img, (x, y), resized_img) -final_image -``` - -
diff --git a/diffusers/docs/source/en/api/activations.md b/diffusers/docs/source/en/api/activations.md deleted file mode 100644 index 3bef28a5ab0db570f00c0a24388ce4e9ba90f5a9..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/activations.md +++ /dev/null @@ -1,27 +0,0 @@ - - -# Activation functions - -Customized activation functions for supporting various models in 🤗 Diffusers. - -## GELU - -[[autodoc]] models.activations.GELU - -## GEGLU - -[[autodoc]] models.activations.GEGLU - -## ApproximateGELU - -[[autodoc]] models.activations.ApproximateGELU diff --git a/diffusers/docs/source/en/api/attnprocessor.md b/diffusers/docs/source/en/api/attnprocessor.md deleted file mode 100644 index 5b1f0be72ae6efa1cf11860257c232bae1ecad7f..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/attnprocessor.md +++ /dev/null @@ -1,54 +0,0 @@ - - -# Attention Processor - -An attention processor is a class for applying different types of attention mechanisms. - -## AttnProcessor -[[autodoc]] models.attention_processor.AttnProcessor - -## AttnProcessor2_0 -[[autodoc]] models.attention_processor.AttnProcessor2_0 - -## AttnAddedKVProcessor -[[autodoc]] models.attention_processor.AttnAddedKVProcessor - -## AttnAddedKVProcessor2_0 -[[autodoc]] models.attention_processor.AttnAddedKVProcessor2_0 - -## CrossFrameAttnProcessor -[[autodoc]] pipelines.text_to_video_synthesis.pipeline_text_to_video_zero.CrossFrameAttnProcessor - -## CustomDiffusionAttnProcessor -[[autodoc]] models.attention_processor.CustomDiffusionAttnProcessor - -## CustomDiffusionAttnProcessor2_0 -[[autodoc]] models.attention_processor.CustomDiffusionAttnProcessor2_0 - -## CustomDiffusionXFormersAttnProcessor -[[autodoc]] models.attention_processor.CustomDiffusionXFormersAttnProcessor - -## FusedAttnProcessor2_0 -[[autodoc]] models.attention_processor.FusedAttnProcessor2_0 - -## SlicedAttnProcessor -[[autodoc]] models.attention_processor.SlicedAttnProcessor - -## SlicedAttnAddedKVProcessor -[[autodoc]] models.attention_processor.SlicedAttnAddedKVProcessor - -## XFormersAttnProcessor -[[autodoc]] models.attention_processor.XFormersAttnProcessor - -## AttnProcessorNPU -[[autodoc]] models.attention_processor.AttnProcessorNPU diff --git a/diffusers/docs/source/en/api/configuration.md b/diffusers/docs/source/en/api/configuration.md deleted file mode 100644 index 31d70232a95c1b610030c983e75045e286280327..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/configuration.md +++ /dev/null @@ -1,30 +0,0 @@ - - -# Configuration - -Schedulers from [`~schedulers.scheduling_utils.SchedulerMixin`] and models from [`ModelMixin`] inherit from [`ConfigMixin`] which stores all the parameters that are passed to their respective `__init__` methods in a JSON-configuration file. - - - -To use private or [gated](https://huggingface.co/docs/hub/models-gated#gated-models) models, log-in with `huggingface-cli login`. 
- - - -## ConfigMixin - -[[autodoc]] ConfigMixin - - load_config - - from_config - - save_config - - to_json_file - - to_json_string diff --git a/diffusers/docs/source/en/api/image_processor.md b/diffusers/docs/source/en/api/image_processor.md deleted file mode 100644 index e633a936103da470b8c5767d14da6926af5fb88d..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/image_processor.md +++ /dev/null @@ -1,35 +0,0 @@ - - -# VAE Image Processor - -The [`VaeImageProcessor`] provides a unified API for [`StableDiffusionPipeline`]s to prepare image inputs for VAE encoding and post-processing outputs once they're decoded. This includes transformations such as resizing, normalization, and conversion between PIL Image, PyTorch, and NumPy arrays. - -All pipelines with [`VaeImageProcessor`] accept PIL Image, PyTorch tensor, or NumPy arrays as image inputs and return outputs based on the `output_type` argument by the user. You can pass encoded image latents directly to the pipeline and return latents from the pipeline as a specific output with the `output_type` argument (for example `output_type="latent"`). This allows you to take the generated latents from one pipeline and pass it to another pipeline as input without leaving the latent space. It also makes it much easier to use multiple pipelines together by passing PyTorch tensors directly between different pipelines. - -## VaeImageProcessor - -[[autodoc]] image_processor.VaeImageProcessor - -## VaeImageProcessorLDM3D - -The [`VaeImageProcessorLDM3D`] accepts RGB and depth inputs and returns RGB and depth outputs. - -[[autodoc]] image_processor.VaeImageProcessorLDM3D - -## PixArtImageProcessor - -[[autodoc]] image_processor.PixArtImageProcessor - -## IPAdapterMaskProcessor - -[[autodoc]] image_processor.IPAdapterMaskProcessor diff --git a/diffusers/docs/source/en/api/internal_classes_overview.md b/diffusers/docs/source/en/api/internal_classes_overview.md deleted file mode 100644 index 38e8124cd4a00609b29470b705e30b3ca4791bc2..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/internal_classes_overview.md +++ /dev/null @@ -1,15 +0,0 @@ - - -# Overview - -The APIs in this section are more experimental and prone to breaking changes. Most of them are used internally for development, but they may also be useful to you if you're interested in building a diffusion model with some custom parts or if you're interested in some of our helper utilities for working with 🤗 Diffusers. diff --git a/diffusers/docs/source/en/api/loaders/ip_adapter.md b/diffusers/docs/source/en/api/loaders/ip_adapter.md deleted file mode 100644 index a10f30ef8e5bd56d70ee63820ca886f486e81915..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/loaders/ip_adapter.md +++ /dev/null @@ -1,29 +0,0 @@ - - -# IP-Adapter - -[IP-Adapter](https://hf.co/papers/2308.06721) is a lightweight adapter that enables prompting a diffusion model with an image. This method decouples the cross-attention layers of the image and text features. The image features are generated from an image encoder. - - - -Learn how to load an IP-Adapter checkpoint and image in the IP-Adapter [loading](../../using-diffusers/loading_adapters#ip-adapter) guide, and you can see how to use it in the [usage](../../using-diffusers/ip_adapter) guide. 
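As a minimal sketch of prompting with an image (the checkpoint names, adapter scale, and reference image URL below are illustrative assumptions):

```py
import torch
from diffusers import AutoPipelineForText2Image
from diffusers.utils import load_image

pipeline = AutoPipelineForText2Image.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load the IP-Adapter weights and set how strongly the image prompt is applied.
pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
pipeline.set_ip_adapter_scale(0.6)

ip_image = load_image("https://example.com/reference.png")  # hypothetical reference image
image = pipeline(
    prompt="a polar bear sitting in a chair drinking a milkshake",
    ip_adapter_image=ip_image,
    num_inference_steps=30,
).images[0]
```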
- - - -## IPAdapterMixin - -[[autodoc]] loaders.ip_adapter.IPAdapterMixin - -## IPAdapterMaskProcessor - -[[autodoc]] image_processor.IPAdapterMaskProcessor \ No newline at end of file diff --git a/diffusers/docs/source/en/api/loaders/lora.md b/diffusers/docs/source/en/api/loaders/lora.md deleted file mode 100644 index 2060a1eefd52ebfb98a070c7a086af1be1340577..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/loaders/lora.md +++ /dev/null @@ -1,47 +0,0 @@ - - -# LoRA - -LoRA is a fast and lightweight training method that inserts and trains a significantly smaller number of parameters instead of all the model parameters. This produces a smaller file (~100 MBs) and makes it easier to quickly train a model to learn a new concept. LoRA weights are typically loaded into the denoiser, text encoder or both. The denoiser usually corresponds to a UNet ([`UNet2DConditionModel`], for example) or a Transformer ([`SD3Transformer2DModel`], for example). There are several classes for loading LoRA weights: - -- [`StableDiffusionLoraLoaderMixin`] provides functions for loading and unloading, fusing and unfusing, enabling and disabling, and more functions for managing LoRA weights. This class can be used with any model. -- [`StableDiffusionXLLoraLoaderMixin`] is a [Stable Diffusion (SDXL)](../../api/pipelines/stable_diffusion/stable_diffusion_xl) version of the [`StableDiffusionLoraLoaderMixin`] class for loading and saving LoRA weights. It can only be used with the SDXL model. -- [`SD3LoraLoaderMixin`] provides similar functions for [Stable Diffusion 3](https://huggingface.co/blog/sd3). -- [`AmusedLoraLoaderMixin`] is for the [`AmusedPipeline`]. -- [`LoraBaseMixin`] provides a base class with several utility methods to fuse, unfuse, unload, LoRAs and more. - - - -To learn more about how to load LoRA weights, see the [LoRA](../../using-diffusers/loading_adapters#lora) loading guide. - - - -## StableDiffusionLoraLoaderMixin - -[[autodoc]] loaders.lora_pipeline.StableDiffusionLoraLoaderMixin - -## StableDiffusionXLLoraLoaderMixin - -[[autodoc]] loaders.lora_pipeline.StableDiffusionXLLoraLoaderMixin - -## SD3LoraLoaderMixin - -[[autodoc]] loaders.lora_pipeline.SD3LoraLoaderMixin - -## AmusedLoraLoaderMixin - -[[autodoc]] loaders.lora_pipeline.AmusedLoraLoaderMixin - -## LoraBaseMixin - -[[autodoc]] loaders.lora_base.LoraBaseMixin \ No newline at end of file diff --git a/diffusers/docs/source/en/api/loaders/peft.md b/diffusers/docs/source/en/api/loaders/peft.md deleted file mode 100644 index 67a4a7f2a490b17cd8a8eeea74a50cb45958e4c3..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/loaders/peft.md +++ /dev/null @@ -1,25 +0,0 @@ - - -# PEFT - -Diffusers supports loading adapters such as [LoRA](../../using-diffusers/loading_adapters) with the [PEFT](https://huggingface.co/docs/peft/index) library with the [`~loaders.peft.PeftAdapterMixin`] class. This allows modeling classes in Diffusers like [`UNet2DConditionModel`], [`SD3Transformer2DModel`] to operate with an adapter. - - - -Refer to the [Inference with PEFT](../../tutorials/using_peft_for_inference.md) tutorial for an overview of how to use PEFT in Diffusers for inference. 
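As a minimal sketch of that integration (the rank, target modules, and adapter name below are illustrative assumptions), a PEFT `LoraConfig` can be attached directly to a Diffusers model:

```py
from diffusers import UNet2DConditionModel
from peft import LoraConfig

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

# Describe the adapter with PEFT, then attach it with add_adapter.
lora_config = LoraConfig(
    r=4,
    lora_alpha=4,
    init_lora_weights="gaussian",
    target_modules=["to_k", "to_q", "to_v", "to_out.0"],
)
unet.add_adapter(lora_config, adapter_name="my_lora")

# Adapters can be toggled without unloading their weights.
unet.disable_adapters()
unet.enable_adapters()
```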
- - - -## PeftAdapterMixin - -[[autodoc]] loaders.peft.PeftAdapterMixin diff --git a/diffusers/docs/source/en/api/loaders/single_file.md b/diffusers/docs/source/en/api/loaders/single_file.md deleted file mode 100644 index 64ca02fd83870f41d12a2c5adbd79648c80dd46a..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/loaders/single_file.md +++ /dev/null @@ -1,62 +0,0 @@ - - -# Single files - -The [`~loaders.FromSingleFileMixin.from_single_file`] method allows you to load: - -* a model stored in a single file, which is useful if you're working with models from the diffusion ecosystem, like Automatic1111, and commonly rely on a single-file layout to store and share models -* a model stored in their originally distributed layout, which is useful if you're working with models finetuned with other services, and want to load it directly into Diffusers model objects and pipelines - -> [!TIP] -> Read the [Model files and layouts](../../using-diffusers/other-formats) guide to learn more about the Diffusers-multifolder layout versus the single-file layout, and how to load models stored in these different layouts. - -## Supported pipelines - -- [`StableDiffusionPipeline`] -- [`StableDiffusionImg2ImgPipeline`] -- [`StableDiffusionInpaintPipeline`] -- [`StableDiffusionControlNetPipeline`] -- [`StableDiffusionControlNetImg2ImgPipeline`] -- [`StableDiffusionControlNetInpaintPipeline`] -- [`StableDiffusionUpscalePipeline`] -- [`StableDiffusionXLPipeline`] -- [`StableDiffusionXLImg2ImgPipeline`] -- [`StableDiffusionXLInpaintPipeline`] -- [`StableDiffusionXLInstructPix2PixPipeline`] -- [`StableDiffusionXLControlNetPipeline`] -- [`StableDiffusionXLKDiffusionPipeline`] -- [`StableDiffusion3Pipeline`] -- [`LatentConsistencyModelPipeline`] -- [`LatentConsistencyModelImg2ImgPipeline`] -- [`StableDiffusionControlNetXSPipeline`] -- [`StableDiffusionXLControlNetXSPipeline`] -- [`LEditsPPPipelineStableDiffusion`] -- [`LEditsPPPipelineStableDiffusionXL`] -- [`PIAPipeline`] - -## Supported models - -- [`UNet2DConditionModel`] -- [`StableCascadeUNet`] -- [`AutoencoderKL`] -- [`ControlNetModel`] -- [`SD3Transformer2DModel`] -- [`FluxTransformer2DModel`] - -## FromSingleFileMixin - -[[autodoc]] loaders.single_file.FromSingleFileMixin - -## FromOriginalModelMixin - -[[autodoc]] loaders.single_file_model.FromOriginalModelMixin diff --git a/diffusers/docs/source/en/api/loaders/textual_inversion.md b/diffusers/docs/source/en/api/loaders/textual_inversion.md deleted file mode 100644 index c900e22af847167a3738eb0ba9aec766919c630e..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/loaders/textual_inversion.md +++ /dev/null @@ -1,27 +0,0 @@ - - -# Textual Inversion - -Textual Inversion is a training method for personalizing models by learning new text embeddings from a few example images. The file produced from training is extremely small (a few KBs) and the new embeddings can be loaded into the text encoder. - -[`TextualInversionLoaderMixin`] provides a function for loading Textual Inversion embeddings from Diffusers and Automatic1111 into the text encoder and loading a special token to activate the embeddings. - - - -To learn more about how to load Textual Inversion embeddings, see the [Textual Inversion](../../using-diffusers/loading_adapters#textual-inversion) loading guide. 
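As a minimal sketch (the embedding repository and its `<cat-toy>` trigger token are a public Hub concept used here purely for illustration):

```py
import torch
from diffusers import StableDiffusionPipeline

pipeline = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load the learned embedding; its trigger token can then be used in prompts.
pipeline.load_textual_inversion("sd-concepts-library/cat-toy")
image = pipeline("a photo of a <cat-toy> on a beach", num_inference_steps=30).images[0]
```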
- - - -## TextualInversionLoaderMixin - -[[autodoc]] loaders.textual_inversion.TextualInversionLoaderMixin \ No newline at end of file diff --git a/diffusers/docs/source/en/api/loaders/unet.md b/diffusers/docs/source/en/api/loaders/unet.md deleted file mode 100644 index 16cc319b4ed02a46e287fac88d81e1e21799938a..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/loaders/unet.md +++ /dev/null @@ -1,27 +0,0 @@ - - -# UNet - -Some training methods - like LoRA and Custom Diffusion - typically target the UNet's attention layers, but these training methods can also target other non-attention layers. Instead of training all of a model's parameters, only a subset of the parameters are trained, which is faster and more efficient. This class is useful if you're *only* loading weights into a UNet. If you need to load weights into the text encoder or a text encoder and UNet, try using the [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] function instead. - -The [`UNet2DConditionLoadersMixin`] class provides functions for loading and saving weights, fusing and unfusing LoRAs, disabling and enabling LoRAs, and setting and deleting adapters. - - - -To learn more about how to load LoRA weights, see the [LoRA](../../using-diffusers/loading_adapters#lora) loading guide. - - - -## UNet2DConditionLoadersMixin - -[[autodoc]] loaders.unet.UNet2DConditionLoadersMixin \ No newline at end of file diff --git a/diffusers/docs/source/en/api/logging.md b/diffusers/docs/source/en/api/logging.md deleted file mode 100644 index 1b219645da6b904d1bbb4c18d35f5f3796ed326e..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/logging.md +++ /dev/null @@ -1,96 +0,0 @@ - - -# Logging - -🤗 Diffusers has a centralized logging system to easily manage the verbosity of the library. The default verbosity is set to `WARNING`. - -To change the verbosity level, use one of the direct setters. For instance, to change the verbosity to the `INFO` level. - -```python -import diffusers - -diffusers.logging.set_verbosity_info() -``` - -You can also use the environment variable `DIFFUSERS_VERBOSITY` to override the default verbosity. You can set it -to one of the following: `debug`, `info`, `warning`, `error`, `critical`. For example: - -```bash -DIFFUSERS_VERBOSITY=error ./myprogram.py -``` - -Additionally, some `warnings` can be disabled by setting the environment variable -`DIFFUSERS_NO_ADVISORY_WARNINGS` to a true value, like `1`. This disables any warning logged by -[`logger.warning_advice`]. For example: - -```bash -DIFFUSERS_NO_ADVISORY_WARNINGS=1 ./myprogram.py -``` - -Here is an example of how to use the same logger as the library in your own module or script: - -```python -from diffusers.utils import logging - -logging.set_verbosity_info() -logger = logging.get_logger("diffusers") -logger.info("INFO") -logger.warning("WARN") -``` - - -All methods of the logging module are documented below. The main methods are -[`logging.get_verbosity`] to get the current level of verbosity in the logger and -[`logging.set_verbosity`] to set the verbosity to the level of your choice. 
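A brief illustrative snippet showing both methods together, using one of the level constants listed below:

```py
import diffusers

current_level = diffusers.logging.get_verbosity()  # 30 (WARNING) unless overridden
diffusers.logging.set_verbosity(diffusers.logging.ERROR)
```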
- -In order from the least verbose to the most verbose: - -| Method | Integer value | Description | -|----------------------------------------------------------:|--------------:|----------------------------------------------------:| -| `diffusers.logging.CRITICAL` or `diffusers.logging.FATAL` | 50 | only report the most critical errors | -| `diffusers.logging.ERROR` | 40 | only report errors | -| `diffusers.logging.WARNING` or `diffusers.logging.WARN` | 30 | only report errors and warnings (default) | -| `diffusers.logging.INFO` | 20 | only report errors, warnings, and basic information | -| `diffusers.logging.DEBUG` | 10 | report all information | - -By default, `tqdm` progress bars are displayed during model download. [`logging.disable_progress_bar`] and [`logging.enable_progress_bar`] are used to enable or disable this behavior. - -## Base setters - -[[autodoc]] utils.logging.set_verbosity_error - -[[autodoc]] utils.logging.set_verbosity_warning - -[[autodoc]] utils.logging.set_verbosity_info - -[[autodoc]] utils.logging.set_verbosity_debug - -## Other functions - -[[autodoc]] utils.logging.get_verbosity - -[[autodoc]] utils.logging.set_verbosity - -[[autodoc]] utils.logging.get_logger - -[[autodoc]] utils.logging.enable_default_handler - -[[autodoc]] utils.logging.disable_default_handler - -[[autodoc]] utils.logging.enable_explicit_format - -[[autodoc]] utils.logging.reset_format - -[[autodoc]] utils.logging.enable_progress_bar - -[[autodoc]] utils.logging.disable_progress_bar diff --git a/diffusers/docs/source/en/api/models/allegro_transformer3d.md b/diffusers/docs/source/en/api/models/allegro_transformer3d.md deleted file mode 100644 index e70026fe4bfc5c6689f40afdf0bd4730c71f6db4..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/models/allegro_transformer3d.md +++ /dev/null @@ -1,30 +0,0 @@ - - -# AllegroTransformer3DModel - -A Diffusion Transformer model for 3D data from [Allegro](https://github.com/rhymes-ai/Allegro) was introduced in [Allegro: Open the Black Box of Commercial-Level Video Generation Model](https://huggingface.co/papers/2410.15458) by RhymesAI. - -The model can be loaded with the following code snippet. - -```python -from diffusers import AllegroTransformer3DModel - -vae = AllegroTransformer3DModel.from_pretrained("rhymes-ai/Allegro", subfolder="transformer", torch_dtype=torch.bfloat16).to("cuda") -``` - -## AllegroTransformer3DModel - -[[autodoc]] AllegroTransformer3DModel - -## Transformer2DModelOutput - -[[autodoc]] models.modeling_outputs.Transformer2DModelOutput diff --git a/diffusers/docs/source/en/api/models/asymmetricautoencoderkl.md b/diffusers/docs/source/en/api/models/asymmetricautoencoderkl.md deleted file mode 100644 index 2023dcf97f9d6c6d009bbd7389f78057eee0215e..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/models/asymmetricautoencoderkl.md +++ /dev/null @@ -1,60 +0,0 @@ - - -# AsymmetricAutoencoderKL - -Improved larger variational autoencoder (VAE) model with KL loss for inpainting task: [Designing a Better Asymmetric VQGAN for StableDiffusion](https://arxiv.org/abs/2306.04632) by Zixin Zhu, Xuelu Feng, Dongdong Chen, Jianmin Bao, Le Wang, Yinpeng Chen, Lu Yuan, Gang Hua. - -The abstract from the paper is: - -*StableDiffusion is a revolutionary text-to-image generator that is causing a stir in the world of image generation and editing. 
Unlike traditional methods that learn a diffusion model in pixel space, StableDiffusion learns a diffusion model in the latent space via a VQGAN, ensuring both efficiency and quality. It not only supports image generation tasks, but also enables image editing for real images, such as image inpainting and local editing. However, we have observed that the vanilla VQGAN used in StableDiffusion leads to significant information loss, causing distortion artifacts even in non-edited image regions. To this end, we propose a new asymmetric VQGAN with two simple designs. Firstly, in addition to the input from the encoder, the decoder contains a conditional branch that incorporates information from task-specific priors, such as the unmasked image region in inpainting. Secondly, the decoder is much heavier than the encoder, allowing for more detailed recovery while only slightly increasing the total inference cost. The training cost of our asymmetric VQGAN is cheap, and we only need to retrain a new asymmetric decoder while keeping the vanilla VQGAN encoder and StableDiffusion unchanged. Our asymmetric VQGAN can be widely used in StableDiffusion-based inpainting and local editing methods. Extensive experiments demonstrate that it can significantly improve the inpainting and editing performance, while maintaining the original text-to-image capability. The code is available at https://github.com/buxiangzhiren/Asymmetric_VQGAN* - -Evaluation results can be found in section 4.1 of the original paper. - -## Available checkpoints - -* [https://huggingface.co/cross-attention/asymmetric-autoencoder-kl-x-1-5](https://huggingface.co/cross-attention/asymmetric-autoencoder-kl-x-1-5) -* [https://huggingface.co/cross-attention/asymmetric-autoencoder-kl-x-2](https://huggingface.co/cross-attention/asymmetric-autoencoder-kl-x-2) - -## Example Usage - -```python -from diffusers import AsymmetricAutoencoderKL, StableDiffusionInpaintPipeline -from diffusers.utils import load_image, make_image_grid - - -prompt = "a photo of a person with beard" -img_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/repaint/celeba_hq_256.png" -mask_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/repaint/mask_256.png" - -original_image = load_image(img_url).resize((512, 512)) -mask_image = load_image(mask_url).resize((512, 512)) - -pipe = StableDiffusionInpaintPipeline.from_pretrained("runwayml/stable-diffusion-inpainting") -pipe.vae = AsymmetricAutoencoderKL.from_pretrained("cross-attention/asymmetric-autoencoder-kl-x-1-5") -pipe.to("cuda") - -image = pipe(prompt=prompt, image=original_image, mask_image=mask_image).images[0] -make_image_grid([original_image, mask_image, image], rows=1, cols=3) -``` - -## AsymmetricAutoencoderKL - -[[autodoc]] models.autoencoders.autoencoder_asym_kl.AsymmetricAutoencoderKL - -## AutoencoderKLOutput - -[[autodoc]] models.autoencoders.autoencoder_kl.AutoencoderKLOutput - -## DecoderOutput - -[[autodoc]] models.autoencoders.vae.DecoderOutput diff --git a/diffusers/docs/source/en/api/models/aura_flow_transformer2d.md b/diffusers/docs/source/en/api/models/aura_flow_transformer2d.md deleted file mode 100644 index d07806bcc2155362dfa9e09a164e021969b488e0..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/models/aura_flow_transformer2d.md +++ /dev/null @@ -1,19 +0,0 @@ - - -# AuraFlowTransformer2DModel - -A Transformer model for image-like data from 
[AuraFlow](https://blog.fal.ai/auraflow/). - -## AuraFlowTransformer2DModel - -[[autodoc]] AuraFlowTransformer2DModel diff --git a/diffusers/docs/source/en/api/models/autoencoder_oobleck.md b/diffusers/docs/source/en/api/models/autoencoder_oobleck.md deleted file mode 100644 index bbc00e048b64e19bc4654bd450fd4076e24db10f..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/models/autoencoder_oobleck.md +++ /dev/null @@ -1,38 +0,0 @@ - - -# AutoencoderOobleck - -The Oobleck variational autoencoder (VAE) model with KL loss was introduced in [Stability-AI/stable-audio-tools](https://github.com/Stability-AI/stable-audio-tools) and [Stable Audio Open](https://huggingface.co/papers/2407.14358) by Stability AI. The model is used in 🤗 Diffusers to encode audio waveforms into latents and to decode latent representations into audio waveforms. - -The abstract from the paper is: - -*Open generative models are vitally important for the community, allowing for fine-tunes and serving as baselines when presenting new models. However, most current text-to-audio models are private and not accessible for artists and researchers to build upon. Here we describe the architecture and training process of a new open-weights text-to-audio model trained with Creative Commons data. Our evaluation shows that the model's performance is competitive with the state-of-the-art across various metrics. Notably, the reported FDopenl3 results (measuring the realism of the generations) showcase its potential for high-quality stereo sound synthesis at 44.1kHz.* - -## AutoencoderOobleck - -[[autodoc]] AutoencoderOobleck - - decode - - encode - - all - -## OobleckDecoderOutput - -[[autodoc]] models.autoencoders.autoencoder_oobleck.OobleckDecoderOutput - -## OobleckDecoderOutput - -[[autodoc]] models.autoencoders.autoencoder_oobleck.OobleckDecoderOutput - -## AutoencoderOobleckOutput - -[[autodoc]] models.autoencoders.autoencoder_oobleck.AutoencoderOobleckOutput diff --git a/diffusers/docs/source/en/api/models/autoencoder_tiny.md b/diffusers/docs/source/en/api/models/autoencoder_tiny.md deleted file mode 100644 index 25fe2b7a8ab9be7f37e585e27c13365c369878b2..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/models/autoencoder_tiny.md +++ /dev/null @@ -1,57 +0,0 @@ - - -# Tiny AutoEncoder - -Tiny AutoEncoder for Stable Diffusion (TAESD) was introduced in [madebyollin/taesd](https://github.com/madebyollin/taesd) by Ollin Boer Bohan. It is a tiny distilled version of Stable Diffusion's VAE that can quickly decode the latents in a [`StableDiffusionPipeline`] or [`StableDiffusionXLPipeline`] almost instantly. 
- -To use with Stable Diffusion v-2.1: - -```python -import torch -from diffusers import DiffusionPipeline, AutoencoderTiny - -pipe = DiffusionPipeline.from_pretrained( - "stabilityai/stable-diffusion-2-1-base", torch_dtype=torch.float16 -) -pipe.vae = AutoencoderTiny.from_pretrained("madebyollin/taesd", torch_dtype=torch.float16) -pipe = pipe.to("cuda") - -prompt = "slice of delicious New York-style berry cheesecake" -image = pipe(prompt, num_inference_steps=25).images[0] -image -``` - -To use with Stable Diffusion XL 1.0 - -```python -import torch -from diffusers import DiffusionPipeline, AutoencoderTiny - -pipe = DiffusionPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16 -) -pipe.vae = AutoencoderTiny.from_pretrained("madebyollin/taesdxl", torch_dtype=torch.float16) -pipe = pipe.to("cuda") - -prompt = "slice of delicious New York-style berry cheesecake" -image = pipe(prompt, num_inference_steps=25).images[0] -image -``` - -## AutoencoderTiny - -[[autodoc]] AutoencoderTiny - -## AutoencoderTinyOutput - -[[autodoc]] models.autoencoders.autoencoder_tiny.AutoencoderTinyOutput diff --git a/diffusers/docs/source/en/api/models/autoencoderkl.md b/diffusers/docs/source/en/api/models/autoencoderkl.md deleted file mode 100644 index dd881089ad00334fcb5de196b0c2e8be650206d7..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/models/autoencoderkl.md +++ /dev/null @@ -1,58 +0,0 @@ - - -# AutoencoderKL - -The variational autoencoder (VAE) model with KL loss was introduced in [Auto-Encoding Variational Bayes](https://arxiv.org/abs/1312.6114v11) by Diederik P. Kingma and Max Welling. The model is used in 🤗 Diffusers to encode images into latents and to decode latent representations into images. - -The abstract from the paper is: - -*How can we perform efficient inference and learning in directed probabilistic models, in the presence of continuous latent variables with intractable posterior distributions, and large datasets? We introduce a stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case. Our contributions are two-fold. First, we show that a reparameterization of the variational lower bound yields a lower bound estimator that can be straightforwardly optimized using standard stochastic gradient methods. Second, we show that for i.i.d. datasets with continuous latent variables per datapoint, posterior inference can be made especially efficient by fitting an approximate inference model (also called a recognition model) to the intractable posterior using the proposed lower bound estimator. 
Theoretical advantages are reflected in experimental results.* - -## Loading from the original format - -By default the [`AutoencoderKL`] should be loaded with [`~ModelMixin.from_pretrained`], but it can also be loaded -from the original format using [`FromOriginalModelMixin.from_single_file`] as follows: - -```py -from diffusers import AutoencoderKL - -url = "https://huggingface.co/stabilityai/sd-vae-ft-mse-original/blob/main/vae-ft-mse-840000-ema-pruned.safetensors" # can also be a local file -model = AutoencoderKL.from_single_file(url) -``` - -## AutoencoderKL - -[[autodoc]] AutoencoderKL - - decode - - encode - - all - -## AutoencoderKLOutput - -[[autodoc]] models.autoencoders.autoencoder_kl.AutoencoderKLOutput - -## DecoderOutput - -[[autodoc]] models.autoencoders.vae.DecoderOutput - -## FlaxAutoencoderKL - -[[autodoc]] FlaxAutoencoderKL - -## FlaxAutoencoderKLOutput - -[[autodoc]] models.vae_flax.FlaxAutoencoderKLOutput - -## FlaxDecoderOutput - -[[autodoc]] models.vae_flax.FlaxDecoderOutput diff --git a/diffusers/docs/source/en/api/models/autoencoderkl_allegro.md b/diffusers/docs/source/en/api/models/autoencoderkl_allegro.md deleted file mode 100644 index fd9d10d5724bea312abf1918702899a0f7a77ff9..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/models/autoencoderkl_allegro.md +++ /dev/null @@ -1,37 +0,0 @@ - - -# AutoencoderKLAllegro - -The 3D variational autoencoder (VAE) model with KL loss used in [Allegro](https://github.com/rhymes-ai/Allegro) was introduced in [Allegro: Open the Black Box of Commercial-Level Video Generation Model](https://huggingface.co/papers/2410.15458) by RhymesAI. - -The model can be loaded with the following code snippet. - -```python -import torch -from diffusers import AutoencoderKLAllegro - -vae = AutoencoderKLAllegro.from_pretrained("rhymes-ai/Allegro", subfolder="vae", torch_dtype=torch.float32).to("cuda") -``` - -## AutoencoderKLAllegro - -[[autodoc]] AutoencoderKLAllegro - - decode - - encode - - all - -## AutoencoderKLOutput - -[[autodoc]] models.autoencoders.autoencoder_kl.AutoencoderKLOutput - -## DecoderOutput - -[[autodoc]] models.autoencoders.vae.DecoderOutput diff --git a/diffusers/docs/source/en/api/models/autoencoderkl_cogvideox.md b/diffusers/docs/source/en/api/models/autoencoderkl_cogvideox.md deleted file mode 100644 index 122812b31d2e52c763a4f9ad4431d3be00beb98b..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/models/autoencoderkl_cogvideox.md +++ /dev/null @@ -1,37 +0,0 @@ - - -# AutoencoderKLCogVideoX - -The 3D variational autoencoder (VAE) model with KL loss used in [CogVideoX](https://github.com/THUDM/CogVideo) was introduced in [CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer](https://github.com/THUDM/CogVideo/blob/main/resources/CogVideoX.pdf) by Tsinghua University & ZhipuAI. - -The model can be loaded with the following code snippet.
 - -```python -import torch -from diffusers import AutoencoderKLCogVideoX - -vae = AutoencoderKLCogVideoX.from_pretrained("THUDM/CogVideoX-2b", subfolder="vae", torch_dtype=torch.float16).to("cuda") -``` - -## AutoencoderKLCogVideoX - -[[autodoc]] AutoencoderKLCogVideoX - - decode - - encode - - all - -## AutoencoderKLOutput - -[[autodoc]] models.autoencoders.autoencoder_kl.AutoencoderKLOutput - -## DecoderOutput - -[[autodoc]] models.autoencoders.vae.DecoderOutput diff --git a/diffusers/docs/source/en/api/models/autoencoderkl_mochi.md b/diffusers/docs/source/en/api/models/autoencoderkl_mochi.md deleted file mode 100644 index 9747de4af93700f12a1a0d10760430ee123558c1..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/models/autoencoderkl_mochi.md +++ /dev/null @@ -1,32 +0,0 @@ - - -# AutoencoderKLMochi - -The 3D variational autoencoder (VAE) model with KL loss used in [Mochi](https://github.com/genmoai/models) was introduced in [Mochi 1 Preview](https://huggingface.co/genmo/mochi-1-preview) by Genmo. - -The model can be loaded with the following code snippet. - -```python -import torch -from diffusers import AutoencoderKLMochi - -vae = AutoencoderKLMochi.from_pretrained("genmo/mochi-1-preview", subfolder="vae", torch_dtype=torch.float32).to("cuda") -``` - -## AutoencoderKLMochi - -[[autodoc]] AutoencoderKLMochi - - decode - - all - -## DecoderOutput - -[[autodoc]] models.autoencoders.vae.DecoderOutput diff --git a/diffusers/docs/source/en/api/models/cogvideox_transformer3d.md b/diffusers/docs/source/en/api/models/cogvideox_transformer3d.md deleted file mode 100644 index 8c8baae7b537115fbcba99c213ceafd1c253652c..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/models/cogvideox_transformer3d.md +++ /dev/null @@ -1,30 +0,0 @@ - - -# CogVideoXTransformer3DModel - -A Diffusion Transformer model for 3D data from [CogVideoX](https://github.com/THUDM/CogVideo) was introduced in [CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer](https://github.com/THUDM/CogVideo/blob/main/resources/CogVideoX.pdf) by Tsinghua University & ZhipuAI. - -The model can be loaded with the following code snippet. - -```python -import torch -from diffusers import CogVideoXTransformer3DModel - -transformer = CogVideoXTransformer3DModel.from_pretrained("THUDM/CogVideoX-2b", subfolder="transformer", torch_dtype=torch.float16).to("cuda") -``` - -## CogVideoXTransformer3DModel - -[[autodoc]] CogVideoXTransformer3DModel - -## Transformer2DModelOutput - -[[autodoc]] models.modeling_outputs.Transformer2DModelOutput diff --git a/diffusers/docs/source/en/api/models/cogview3plus_transformer2d.md b/diffusers/docs/source/en/api/models/cogview3plus_transformer2d.md deleted file mode 100644 index 16f71a58cfb44eaab401103d13407650d757abf5..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/models/cogview3plus_transformer2d.md +++ /dev/null @@ -1,30 +0,0 @@ - - -# CogView3PlusTransformer2DModel - -A Diffusion Transformer model for 2D data from [CogView3Plus](https://github.com/THUDM/CogView3) was introduced in [CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion](https://huggingface.co/papers/2403.05121) by Tsinghua University & ZhipuAI. - -The model can be loaded with the following code snippet.
- -```python -from diffusers import CogView3PlusTransformer2DModel - -vae = CogView3PlusTransformer2DModel.from_pretrained("THUDM/CogView3Plus-3b", subfolder="transformer", torch_dtype=torch.bfloat16).to("cuda") -``` - -## CogView3PlusTransformer2DModel - -[[autodoc]] CogView3PlusTransformer2DModel - -## Transformer2DModelOutput - -[[autodoc]] models.modeling_outputs.Transformer2DModelOutput diff --git a/diffusers/docs/source/en/api/models/consistency_decoder_vae.md b/diffusers/docs/source/en/api/models/consistency_decoder_vae.md deleted file mode 100644 index 94a64820ebb19bc83aad1fbd49914ce9aaad3f1d..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/models/consistency_decoder_vae.md +++ /dev/null @@ -1,30 +0,0 @@ - - -# Consistency Decoder - -Consistency decoder can be used to decode the latents from the denoising UNet in the [`StableDiffusionPipeline`]. This decoder was introduced in the [DALL-E 3 technical report](https://openai.com/dall-e-3). - -The original codebase can be found at [openai/consistencydecoder](https://github.com/openai/consistencydecoder). - - - -Inference is only supported for 2 iterations as of now. - - - -The pipeline could not have been contributed without the help of [madebyollin](https://github.com/madebyollin) and [mrsteyk](https://github.com/mrsteyk) from [this issue](https://github.com/openai/consistencydecoder/issues/1). - -## ConsistencyDecoderVAE -[[autodoc]] ConsistencyDecoderVAE - - all - - decode diff --git a/diffusers/docs/source/en/api/models/controlnet.md b/diffusers/docs/source/en/api/models/controlnet.md deleted file mode 100644 index 5d4cac6658ccfbe3a36e2420ded010a9abe8f290..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/models/controlnet.md +++ /dev/null @@ -1,50 +0,0 @@ - - -# ControlNetModel - -The ControlNet model was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, Maneesh Agrawala. It provides a greater degree of control over text-to-image generation by conditioning the model on additional inputs such as edge maps, depth maps, segmentation maps, and keypoints for pose detection. - -The abstract from the paper is: - -*We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. 
Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.* - -## Loading from the original format - -By default the [`ControlNetModel`] should be loaded with [`~ModelMixin.from_pretrained`], but it can also be loaded -from the original format using [`FromOriginalModelMixin.from_single_file`] as follows: - -```py -from diffusers import StableDiffusionControlNetPipeline, ControlNetModel - -url = "https://huggingface.co/lllyasviel/ControlNet-v1-1/blob/main/control_v11p_sd15_canny.pth" # can also be a local path -controlnet = ControlNetModel.from_single_file(url) - -url = "https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5/blob/main/v1-5-pruned.safetensors" # can also be a local path -pipe = StableDiffusionControlNetPipeline.from_single_file(url, controlnet=controlnet) -``` - -## ControlNetModel - -[[autodoc]] ControlNetModel - -## ControlNetOutput - -[[autodoc]] models.controlnets.controlnet.ControlNetOutput - -## FlaxControlNetModel - -[[autodoc]] FlaxControlNetModel - -## FlaxControlNetOutput - -[[autodoc]] models.controlnets.controlnet_flax.FlaxControlNetOutput diff --git a/diffusers/docs/source/en/api/models/controlnet_flux.md b/diffusers/docs/source/en/api/models/controlnet_flux.md deleted file mode 100644 index 422d066d95ff2c6e1792c8026cfb2cd3ae9d4ec9..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/models/controlnet_flux.md +++ /dev/null @@ -1,45 +0,0 @@ - - -# FluxControlNetModel - -FluxControlNetModel is an implementation of ControlNet for Flux.1. - -The ControlNet model was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, Maneesh Agrawala. It provides a greater degree of control over text-to-image generation by conditioning the model on additional inputs such as edge maps, depth maps, segmentation maps, and keypoints for pose detection. - -The abstract from the paper is: - -*We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.* - -## Loading from the original format - -By default the [`FluxControlNetModel`] should be loaded with [`~ModelMixin.from_pretrained`]. 
- -```py -from diffusers import FluxControlNetPipeline -from diffusers.models import FluxControlNetModel, FluxMultiControlNetModel - -controlnet = FluxControlNetModel.from_pretrained("InstantX/FLUX.1-dev-Controlnet-Canny") -pipe = FluxControlNetPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", controlnet=controlnet) - -controlnet = FluxControlNetModel.from_pretrained("InstantX/FLUX.1-dev-Controlnet-Canny") -controlnet = FluxMultiControlNetModel([controlnet]) -pipe = FluxControlNetPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", controlnet=controlnet) -``` - -## FluxControlNetModel - -[[autodoc]] FluxControlNetModel - -## FluxControlNetOutput - -[[autodoc]] models.controlnet_flux.FluxControlNetOutput \ No newline at end of file diff --git a/diffusers/docs/source/en/api/models/controlnet_hunyuandit.md b/diffusers/docs/source/en/api/models/controlnet_hunyuandit.md deleted file mode 100644 index b73a893cce85c95217af934633017b0003c30c79..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/models/controlnet_hunyuandit.md +++ /dev/null @@ -1,37 +0,0 @@ - - -# HunyuanDiT2DControlNetModel - -HunyuanDiT2DControlNetModel is an implementation of ControlNet for [Hunyuan-DiT](https://arxiv.org/abs/2405.08748). - -ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. - -With a ControlNet model, you can provide an additional control image to condition and control Hunyuan-DiT generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process. - -The abstract from the paper is: - -*We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.* - -This code is implemented by Tencent Hunyuan Team. You can find pre-trained checkpoints for Hunyuan-DiT ControlNets on [Tencent Hunyuan](https://huggingface.co/Tencent-Hunyuan). 
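Beyond loading the ControlNet weights (see the loading example below), a hedged sketch of attaching them to the matching pipeline might look as follows; the base-model repository ID and the control image path are assumptions rather than the only valid choices.

```py
import torch
from diffusers import HunyuanDiT2DControlNetModel, HunyuanDiTControlNetPipeline
from diffusers.utils import load_image

# Load a pose ControlNet and attach it to the Hunyuan-DiT pipeline (base repo ID assumed).
controlnet = HunyuanDiT2DControlNetModel.from_pretrained(
    "Tencent-Hunyuan/HunyuanDiT-v1.1-ControlNet-Diffusers-Pose", torch_dtype=torch.float16
)
pipe = HunyuanDiTControlNetPipeline.from_pretrained(
    "Tencent-Hunyuan/HunyuanDiT-v1.1-Diffusers", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

control_image = load_image("pose.png")  # placeholder pose conditioning image
image = pipe("a dancer on a stage, studio lighting", control_image=control_image).images[0]
```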
- -## Example For Loading HunyuanDiT2DControlNetModel - -```py -from diffusers import HunyuanDiT2DControlNetModel -import torch -controlnet = HunyuanDiT2DControlNetModel.from_pretrained("Tencent-Hunyuan/HunyuanDiT-v1.1-ControlNet-Diffusers-Pose", torch_dtype=torch.float16) -``` - -## HunyuanDiT2DControlNetModel - -[[autodoc]] HunyuanDiT2DControlNetModel \ No newline at end of file diff --git a/diffusers/docs/source/en/api/models/controlnet_sd3.md b/diffusers/docs/source/en/api/models/controlnet_sd3.md deleted file mode 100644 index 78564d238eea1be753f21b1c5a41f615a4e872dc..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/models/controlnet_sd3.md +++ /dev/null @@ -1,42 +0,0 @@ - - -# SD3ControlNetModel - -SD3ControlNetModel is an implementation of ControlNet for Stable Diffusion 3. - -The ControlNet model was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, Maneesh Agrawala. It provides a greater degree of control over text-to-image generation by conditioning the model on additional inputs such as edge maps, depth maps, segmentation maps, and keypoints for pose detection. - -The abstract from the paper is: - -*We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.* - -## Loading from the original format - -By default the [`SD3ControlNetModel`] should be loaded with [`~ModelMixin.from_pretrained`]. - -```py -from diffusers import StableDiffusion3ControlNetPipeline -from diffusers.models import SD3ControlNetModel, SD3MultiControlNetModel - -controlnet = SD3ControlNetModel.from_pretrained("InstantX/SD3-Controlnet-Canny") -pipe = StableDiffusion3ControlNetPipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", controlnet=controlnet) -``` - -## SD3ControlNetModel - -[[autodoc]] SD3ControlNetModel - -## SD3ControlNetOutput - -[[autodoc]] models.controlnets.controlnet_sd3.SD3ControlNetOutput - diff --git a/diffusers/docs/source/en/api/models/controlnet_sparsectrl.md b/diffusers/docs/source/en/api/models/controlnet_sparsectrl.md deleted file mode 100644 index d5d7d358c4d27ea10085fe368286c55dc6f9b57c..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/models/controlnet_sparsectrl.md +++ /dev/null @@ -1,46 +0,0 @@ - - -# SparseControlNetModel - -SparseControlNetModel is an implementation of ControlNet for [AnimateDiff](https://arxiv.org/abs/2307.04725). - -ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. 
- -The SparseCtrl version of ControlNet was introduced in [SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models](https://arxiv.org/abs/2311.16933) for achieving controlled generation in text-to-video diffusion models by Yuwei Guo, Ceyuan Yang, Anyi Rao, Maneesh Agrawala, Dahua Lin, and Bo Dai. - -The abstract from the paper is: - -*The development of text-to-video (T2V), i.e., generating videos with a given text prompt, has been significantly advanced in recent years. However, relying solely on text prompts often results in ambiguous frame composition due to spatial uncertainty. The research community thus leverages the dense structure signals, e.g., per-frame depth/edge sequences, to enhance controllability, whose collection accordingly increases the burden of inference. In this work, we present SparseCtrl to enable flexible structure control with temporally sparse signals, requiring only one or a few inputs, as shown in Figure 1. It incorporates an additional condition encoder to process these sparse signals while leaving the pre-trained T2V model untouched. The proposed approach is compatible with various modalities, including sketches, depth maps, and RGB images, providing more practical control for video generation and promoting applications such as storyboarding, depth rendering, keyframe animation, and interpolation. Extensive experiments demonstrate the generalization of SparseCtrl on both original and personalized T2V generators. Codes and models will be publicly available at [this https URL](https://guoyww.github.io/projects/SparseCtrl).* - -## Example for loading SparseControlNetModel - -```python -import torch -from diffusers import SparseControlNetModel - -# fp32 variant in float16 -# 1. Scribble checkpoint -controlnet = SparseControlNetModel.from_pretrained("guoyww/animatediff-sparsectrl-scribble", torch_dtype=torch.float16) - -# 2. RGB checkpoint -controlnet = SparseControlNetModel.from_pretrained("guoyww/animatediff-sparsectrl-rgb", torch_dtype=torch.float16) - -# For loading fp16 variant, pass `variant="fp16"` as an additional parameter -``` - -## SparseControlNetModel - -[[autodoc]] SparseControlNetModel - -## SparseControlNetOutput - -[[autodoc]] models.controlnet_sparsectrl.SparseControlNetOutput diff --git a/diffusers/docs/source/en/api/models/dit_transformer2d.md b/diffusers/docs/source/en/api/models/dit_transformer2d.md deleted file mode 100644 index afac62d53cb453202465fbf62fc1d07080009505..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/models/dit_transformer2d.md +++ /dev/null @@ -1,19 +0,0 @@ - - -# DiTTransformer2DModel - -A Transformer model for image-like data from [DiT](https://huggingface.co/papers/2212.09748). - -## DiTTransformer2DModel - -[[autodoc]] DiTTransformer2DModel diff --git a/diffusers/docs/source/en/api/models/flux_transformer.md b/diffusers/docs/source/en/api/models/flux_transformer.md deleted file mode 100644 index 381593f1bfe6c121bc6a4ee51f84a04fb8e4a8c4..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/models/flux_transformer.md +++ /dev/null @@ -1,19 +0,0 @@ - - -# FluxTransformer2DModel - -A Transformer model for image-like data from [Flux](https://blackforestlabs.ai/announcing-black-forest-labs/). 
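Like the other transformer model pages, the module can be loaded with [`~ModelMixin.from_pretrained`]; a minimal sketch, assuming the `black-forest-labs/FLUX.1-dev` checkpoint exposes a `transformer` subfolder:

```python
import torch
from diffusers import FluxTransformer2DModel

# Load only the transformer component of a Flux checkpoint (subfolder layout assumed).
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev", subfolder="transformer", torch_dtype=torch.bfloat16
)
```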
- -## FluxTransformer2DModel - -[[autodoc]] FluxTransformer2DModel diff --git a/diffusers/docs/source/en/api/models/hunyuan_transformer2d.md b/diffusers/docs/source/en/api/models/hunyuan_transformer2d.md deleted file mode 100644 index fe137236d18e02e379bddfda4ce3b938b5a57832..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/models/hunyuan_transformer2d.md +++ /dev/null @@ -1,20 +0,0 @@ - - -# HunyuanDiT2DModel - -A Diffusion Transformer model for 2D data from [Hunyuan-DiT](https://github.com/Tencent/HunyuanDiT). - -## HunyuanDiT2DModel - -[[autodoc]] HunyuanDiT2DModel - diff --git a/diffusers/docs/source/en/api/models/latte_transformer3d.md b/diffusers/docs/source/en/api/models/latte_transformer3d.md deleted file mode 100644 index f87926aefc9fca098b5838c59ad54fbddbd3e11c..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/models/latte_transformer3d.md +++ /dev/null @@ -1,19 +0,0 @@ - - -## LatteTransformer3DModel - -A Diffusion Transformer model for 3D data from [Latte](https://github.com/Vchitect/Latte). - -## LatteTransformer3DModel - -[[autodoc]] LatteTransformer3DModel diff --git a/diffusers/docs/source/en/api/models/lumina_nextdit2d.md b/diffusers/docs/source/en/api/models/lumina_nextdit2d.md deleted file mode 100644 index fe28918e2b580b3f7a02e43073b6b9bae6a7f6ab..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/models/lumina_nextdit2d.md +++ /dev/null @@ -1,20 +0,0 @@ - - -# LuminaNextDiT2DModel - -A Next Version of Diffusion Transformer model for 2D data from [Lumina-T2X](https://github.com/Alpha-VLLM/Lumina-T2X). - -## LuminaNextDiT2DModel - -[[autodoc]] LuminaNextDiT2DModel - diff --git a/diffusers/docs/source/en/api/models/mochi_transformer3d.md b/diffusers/docs/source/en/api/models/mochi_transformer3d.md deleted file mode 100644 index 05e28654d58c352c034b0b078c33232cb4b24911..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/models/mochi_transformer3d.md +++ /dev/null @@ -1,30 +0,0 @@ - - -# MochiTransformer3DModel - -A Diffusion Transformer model for 3D video-like data was introduced in [Mochi-1 Preview](https://huggingface.co/genmo/mochi-1-preview) by Genmo. - -The model can be loaded with the following code snippet. - -```python -from diffusers import MochiTransformer3DModel - -vae = MochiTransformer3DModel.from_pretrained("genmo/mochi-1-preview", subfolder="transformer", torch_dtype=torch.float16).to("cuda") -``` - -## MochiTransformer3DModel - -[[autodoc]] MochiTransformer3DModel - -## Transformer2DModelOutput - -[[autodoc]] models.modeling_outputs.Transformer2DModelOutput diff --git a/diffusers/docs/source/en/api/models/overview.md b/diffusers/docs/source/en/api/models/overview.md deleted file mode 100644 index 62e75f26b5b09adedfbb14e9dd39731059443b11..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/models/overview.md +++ /dev/null @@ -1,28 +0,0 @@ - - -# Models - -🤗 Diffusers provides pretrained models for popular algorithms and modules to create custom diffusion systems. The primary function of models is to denoise an input sample as modeled by the distribution \\(p_{\theta}(x_{t-1}|x_{t})\\). - -All models are built from the base [`ModelMixin`] class which is a [`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html) providing basic functionality for saving and loading models, locally and from the Hugging Face Hub. 
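As a minimal sketch of that save/load workflow (the model configuration, local path, and dtype below are illustrative, not tied to any particular checkpoint):

```python
import torch
from diffusers import UNet2DModel

# Build a small model from its config, save it locally, and reload it with the ModelMixin helpers.
model = UNet2DModel(sample_size=32, in_channels=3, out_channels=3)
model.save_pretrained("./my-tiny-unet")

reloaded = UNet2DModel.from_pretrained("./my-tiny-unet", torch_dtype=torch.float16)
# reloaded.push_to_hub("user/my-tiny-unet") would upload the same weights to the Hub.
```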
- -## ModelMixin -[[autodoc]] ModelMixin - -## FlaxModelMixin - -[[autodoc]] FlaxModelMixin - -## PushToHubMixin - -[[autodoc]] utils.PushToHubMixin diff --git a/diffusers/docs/source/en/api/models/pixart_transformer2d.md b/diffusers/docs/source/en/api/models/pixart_transformer2d.md deleted file mode 100644 index 1d392f4e7c2c3cf37878d38f6fd3e6797cb44db7..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/models/pixart_transformer2d.md +++ /dev/null @@ -1,19 +0,0 @@ - - -# PixArtTransformer2DModel - -A Transformer model for image-like data from [PixArt-Alpha](https://huggingface.co/papers/2310.00426) and [PixArt-Sigma](https://huggingface.co/papers/2403.04692). - -## PixArtTransformer2DModel - -[[autodoc]] PixArtTransformer2DModel diff --git a/diffusers/docs/source/en/api/models/prior_transformer.md b/diffusers/docs/source/en/api/models/prior_transformer.md deleted file mode 100644 index 3d4e3a81782c78c62932403fefe58f2c3a5bad28..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/models/prior_transformer.md +++ /dev/null @@ -1,27 +0,0 @@ - - -# PriorTransformer - -The Prior Transformer was originally introduced in [Hierarchical Text-Conditional Image Generation with CLIP Latents](https://huggingface.co/papers/2204.06125) by Ramesh et al. It is used to predict CLIP image embeddings from CLIP text embeddings; image embeddings are predicted through a denoising diffusion process. - -The abstract from the paper is: - -*Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.* - -## PriorTransformer - -[[autodoc]] PriorTransformer - -## PriorTransformerOutput - -[[autodoc]] models.transformers.prior_transformer.PriorTransformerOutput diff --git a/diffusers/docs/source/en/api/models/sd3_transformer2d.md b/diffusers/docs/source/en/api/models/sd3_transformer2d.md deleted file mode 100644 index feef87db3a635e5de489e62bd7fbc5dbe3ecffcf..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/models/sd3_transformer2d.md +++ /dev/null @@ -1,19 +0,0 @@ - - -# SD3 Transformer Model - -The Transformer model introduced in [Stable Diffusion 3](https://hf.co/papers/2403.03206). Its novelty lies in the MMDiT transformer block. 
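A hedged loading sketch, mirroring the pattern used on the other transformer model pages (the `stabilityai/stable-diffusion-3-medium-diffusers` repository and its `transformer` subfolder are assumptions):

```python
import torch
from diffusers import SD3Transformer2DModel

# Load just the MMDiT transformer from an SD3 checkpoint (subfolder layout assumed).
transformer = SD3Transformer2DModel.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", subfolder="transformer", torch_dtype=torch.float16
)
```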
- -## SD3Transformer2DModel - -[[autodoc]] SD3Transformer2DModel \ No newline at end of file diff --git a/diffusers/docs/source/en/api/models/stable_audio_transformer.md b/diffusers/docs/source/en/api/models/stable_audio_transformer.md deleted file mode 100644 index 396b96c8c710cab9b0cd577614b5c36df0f0f41d..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/models/stable_audio_transformer.md +++ /dev/null @@ -1,19 +0,0 @@ - - -# StableAudioDiTModel - -A Transformer model for audio waveforms from [Stable Audio Open](https://huggingface.co/papers/2407.14358). - -## StableAudioDiTModel - -[[autodoc]] StableAudioDiTModel diff --git a/diffusers/docs/source/en/api/models/stable_cascade_unet.md b/diffusers/docs/source/en/api/models/stable_cascade_unet.md deleted file mode 100644 index 31b780079d76359836ec36f38255499995b9edaa..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/models/stable_cascade_unet.md +++ /dev/null @@ -1,19 +0,0 @@ - - -# StableCascadeUNet - -A UNet model from the [Stable Cascade pipeline](../pipelines/stable_cascade.md). - -## StableCascadeUNet - -[[autodoc]] models.unets.unet_stable_cascade.StableCascadeUNet diff --git a/diffusers/docs/source/en/api/models/transformer2d.md b/diffusers/docs/source/en/api/models/transformer2d.md deleted file mode 100644 index 077ccbb6b235f25fec4ee0c9f40875b41932a3dd..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/models/transformer2d.md +++ /dev/null @@ -1,41 +0,0 @@ - - -# Transformer2DModel - -A Transformer model for image-like data from [CompVis](https://huggingface.co/CompVis) that is based on the [Vision Transformer](https://huggingface.co/papers/2010.11929) introduced by Dosovitskiy et al. The [`Transformer2DModel`] accepts discrete (classes of vector embeddings) or continuous (actual embeddings) inputs. - -When the input is **continuous**: - -1. Project the input and reshape it to `(batch_size, sequence_length, feature_dimension)`. -2. Apply the Transformer blocks in the standard way. -3. Reshape to image. - -When the input is **discrete**: - - - -It is assumed one of the input classes is the masked latent pixel. The predicted classes of the unnoised image don't contain a prediction for the masked pixel because the unnoised image cannot be masked. - - - -1. Convert input (classes of latent pixels) to embeddings and apply positional embeddings. -2. Apply the Transformer blocks in the standard way. -3. Predict classes of unnoised image. - -## Transformer2DModel - -[[autodoc]] Transformer2DModel - -## Transformer2DModelOutput - -[[autodoc]] models.modeling_outputs.Transformer2DModelOutput diff --git a/diffusers/docs/source/en/api/models/transformer_temporal.md b/diffusers/docs/source/en/api/models/transformer_temporal.md deleted file mode 100644 index 02d075dea3f39a753bf8d6e44b2de624418d8fd5..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/models/transformer_temporal.md +++ /dev/null @@ -1,23 +0,0 @@ - - -# TransformerTemporalModel - -A Transformer model for video-like data. 
- -## TransformerTemporalModel - -[[autodoc]] models.transformers.transformer_temporal.TransformerTemporalModel - -## TransformerTemporalModelOutput - -[[autodoc]] models.transformers.transformer_temporal.TransformerTemporalModelOutput diff --git a/diffusers/docs/source/en/api/models/unet-motion.md b/diffusers/docs/source/en/api/models/unet-motion.md deleted file mode 100644 index 9396f6477bf1756960d82672bdb115b133223476..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/models/unet-motion.md +++ /dev/null @@ -1,25 +0,0 @@ - - -# UNetMotionModel - -The [UNet](https://huggingface.co/papers/1505.04597) model was originally introduced by Ronneberger et al for biomedical image segmentation, but it is also commonly used in 🤗 Diffusers because it outputs images that are the same size as the input. It is one of the most important components of a diffusion system because it facilitates the actual diffusion process. There are several variants of the UNet model in 🤗 Diffusers, depending on it's number of dimensions and whether it is a conditional model or not. This is a 2D UNet model. - -The abstract from the paper is: - -*There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net.* - -## UNetMotionModel -[[autodoc]] UNetMotionModel - -## UNet3DConditionOutput -[[autodoc]] models.unets.unet_3d_condition.UNet3DConditionOutput diff --git a/diffusers/docs/source/en/api/models/unet.md b/diffusers/docs/source/en/api/models/unet.md deleted file mode 100644 index bf36aae1f6d91c87b2cfe4872ccfdb251bd5c62e..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/models/unet.md +++ /dev/null @@ -1,25 +0,0 @@ - - -# UNet1DModel - -The [UNet](https://huggingface.co/papers/1505.04597) model was originally introduced by Ronneberger et al. for biomedical image segmentation, but it is also commonly used in 🤗 Diffusers because it outputs images that are the same size as the input. It is one of the most important components of a diffusion system because it facilitates the actual diffusion process. There are several variants of the UNet model in 🤗 Diffusers, depending on it's number of dimensions and whether it is a conditional model or not. This is a 1D UNet model. - -The abstract from the paper is: - -*There is large consent that successful training of deep networks requires many thousand annotated training samples. 
In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net.* - -## UNet1DModel -[[autodoc]] UNet1DModel - -## UNet1DOutput -[[autodoc]] models.unets.unet_1d.UNet1DOutput diff --git a/diffusers/docs/source/en/api/models/unet2d-cond.md b/diffusers/docs/source/en/api/models/unet2d-cond.md deleted file mode 100644 index a3cc5da674c353347618ca127e999f4c03bcf7f0..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/models/unet2d-cond.md +++ /dev/null @@ -1,31 +0,0 @@ - - -# UNet2DConditionModel - -The [UNet](https://huggingface.co/papers/1505.04597) model was originally introduced by Ronneberger et al. for biomedical image segmentation, but it is also commonly used in 🤗 Diffusers because it outputs images that are the same size as the input. It is one of the most important components of a diffusion system because it facilitates the actual diffusion process. There are several variants of the UNet model in 🤗 Diffusers, depending on it's number of dimensions and whether it is a conditional model or not. This is a 2D UNet conditional model. - -The abstract from the paper is: - -*There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. 
The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net.* - -## UNet2DConditionModel -[[autodoc]] UNet2DConditionModel - -## UNet2DConditionOutput -[[autodoc]] models.unets.unet_2d_condition.UNet2DConditionOutput - -## FlaxUNet2DConditionModel -[[autodoc]] models.unets.unet_2d_condition_flax.FlaxUNet2DConditionModel - -## FlaxUNet2DConditionOutput -[[autodoc]] models.unets.unet_2d_condition_flax.FlaxUNet2DConditionOutput diff --git a/diffusers/docs/source/en/api/models/unet2d.md b/diffusers/docs/source/en/api/models/unet2d.md deleted file mode 100644 index fe88b8d8ac506308f532f5578d95ea73f912c687..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/models/unet2d.md +++ /dev/null @@ -1,25 +0,0 @@ - - -# UNet2DModel - -The [UNet](https://huggingface.co/papers/1505.04597) model was originally introduced by Ronneberger et al. for biomedical image segmentation, but it is also commonly used in 🤗 Diffusers because it outputs images that are the same size as the input. It is one of the most important components of a diffusion system because it facilitates the actual diffusion process. There are several variants of the UNet model in 🤗 Diffusers, depending on it's number of dimensions and whether it is a conditional model or not. This is a 2D UNet model. - -The abstract from the paper is: - -*There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net.* - -## UNet2DModel -[[autodoc]] UNet2DModel - -## UNet2DOutput -[[autodoc]] models.unets.unet_2d.UNet2DOutput diff --git a/diffusers/docs/source/en/api/models/unet3d-cond.md b/diffusers/docs/source/en/api/models/unet3d-cond.md deleted file mode 100644 index 52e3086166ac7d566e37fed5dce521d685cefab8..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/models/unet3d-cond.md +++ /dev/null @@ -1,25 +0,0 @@ - - -# UNet3DConditionModel - -The [UNet](https://huggingface.co/papers/1505.04597) model was originally introduced by Ronneberger et al. for biomedical image segmentation, but it is also commonly used in 🤗 Diffusers because it outputs images that are the same size as the input. It is one of the most important components of a diffusion system because it facilitates the actual diffusion process. There are several variants of the UNet model in 🤗 Diffusers, depending on it's number of dimensions and whether it is a conditional model or not. This is a 3D UNet conditional model. 
- -The abstract from the paper is: - -*There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net.* - -## UNet3DConditionModel -[[autodoc]] UNet3DConditionModel - -## UNet3DConditionOutput -[[autodoc]] models.unets.unet_3d_condition.UNet3DConditionOutput diff --git a/diffusers/docs/source/en/api/models/uvit2d.md b/diffusers/docs/source/en/api/models/uvit2d.md deleted file mode 100644 index abea0fdc38c3aa631ae6b19673b07579e1277569..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/models/uvit2d.md +++ /dev/null @@ -1,39 +0,0 @@ - - -# UVit2DModel - -The [U-ViT](https://hf.co/papers/2301.11093) model is a vision transformer (ViT) based UNet. This model incorporates elements from ViT (considers all inputs such as time, conditions and noisy image patches as tokens) and a UNet (long skip connections between the shallow and deep layers). The skip connection is important for predicting pixel-level features. An additional 3x3 convolutional block is applied prior to the final output to improve image quality. - -The abstract from the paper is: - -*Currently, applying diffusion models in pixel space of high resolution images is difficult. Instead, existing approaches focus on diffusion in lower dimensional spaces (latent diffusion), or have multiple super-resolution levels of generation referred to as cascades. The downside is that these approaches add additional complexity to the diffusion framework. This paper aims to improve denoising diffusion for high resolution images while keeping the model as simple as possible. The paper is centered around the research question: How can one train a standard denoising diffusion models on high resolution images, and still obtain performance comparable to these alternate approaches? The four main findings are: 1) the noise schedule should be adjusted for high resolution images, 2) It is sufficient to scale only a particular part of the architecture, 3) dropout should be added at specific locations in the architecture, and 4) downsampling is an effective strategy to avoid high resolution feature maps. 
Combining these simple yet effective techniques, we achieve state-of-the-art on image generation among diffusion models without sampling modifiers on ImageNet.* - -## UVit2DModel - -[[autodoc]] UVit2DModel - -## UVit2DConvEmbed - -[[autodoc]] models.unets.uvit_2d.UVit2DConvEmbed - -## UVitBlock - -[[autodoc]] models.unets.uvit_2d.UVitBlock - -## ConvNextBlock - -[[autodoc]] models.unets.uvit_2d.ConvNextBlock - -## ConvMlmLayer - -[[autodoc]] models.unets.uvit_2d.ConvMlmLayer diff --git a/diffusers/docs/source/en/api/models/vq.md b/diffusers/docs/source/en/api/models/vq.md deleted file mode 100644 index fa0631e6fe0bae581a6d3e33382b9b9d3babac9f..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/models/vq.md +++ /dev/null @@ -1,27 +0,0 @@ - - -# VQModel - -The VQ-VAE model was introduced in [Neural Discrete Representation Learning](https://huggingface.co/papers/1711.00937) by Aaron van den Oord, Oriol Vinyals and Koray Kavukcuoglu. The model is used in 🤗 Diffusers to decode latent representations into images. Unlike [`AutoencoderKL`], the [`VQModel`] works in a quantized latent space. - -The abstract from the paper is: - -*Learning useful representations without supervision remains a key challenge in machine learning. In this paper, we propose a simple yet powerful generative model that learns such discrete representations. Our model, the Vector Quantised-Variational AutoEncoder (VQ-VAE), differs from VAEs in two key ways: the encoder network outputs discrete, rather than continuous, codes; and the prior is learnt rather than static. In order to learn a discrete latent representation, we incorporate ideas from vector quantisation (VQ). Using the VQ method allows the model to circumvent issues of "posterior collapse" -- where the latents are ignored when they are paired with a powerful autoregressive decoder -- typically observed in the VAE framework. Pairing these representations with an autoregressive prior, the model can generate high quality images, videos, and speech as well as doing high quality speaker conversion and unsupervised learning of phonemes, providing further evidence of the utility of the learnt representations.* - -## VQModel - -[[autodoc]] VQModel - -## VQEncoderOutput - -[[autodoc]] models.autoencoders.vq_model.VQEncoderOutput diff --git a/diffusers/docs/source/en/api/normalization.md b/diffusers/docs/source/en/api/normalization.md deleted file mode 100644 index ef4b694a4d8533717b698e07810741a5f665a141..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/normalization.md +++ /dev/null @@ -1,31 +0,0 @@ - - -# Normalization layers - -Customized normalization layers for supporting various models in 🤗 Diffusers. - -## AdaLayerNorm - -[[autodoc]] models.normalization.AdaLayerNorm - -## AdaLayerNormZero - -[[autodoc]] models.normalization.AdaLayerNormZero - -## AdaLayerNormSingle - -[[autodoc]] models.normalization.AdaLayerNormSingle - -## AdaGroupNorm - -[[autodoc]] models.normalization.AdaGroupNorm diff --git a/diffusers/docs/source/en/api/outputs.md b/diffusers/docs/source/en/api/outputs.md deleted file mode 100644 index 759444852ba08c0da64bbb3c294458bec176e991..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/outputs.md +++ /dev/null @@ -1,67 +0,0 @@ - - -# Outputs - -All model outputs are subclasses of [`~utils.BaseOutput`], data structures containing all the information returned by the model. The outputs can also be used as tuples or dictionaries. 
- -For example: - -```python -from diffusers import DDIMPipeline - -pipeline = DDIMPipeline.from_pretrained("google/ddpm-cifar10-32") -outputs = pipeline() -``` - -The `outputs` object is a [`~pipelines.ImagePipelineOutput`] which means it has an image attribute. - -You can access each attribute as you normally would or with a keyword lookup, and if that attribute is not returned by the model, you will get `None`: - -```python -outputs.images -outputs["images"] -``` - -When considering the `outputs` object as a tuple, it only considers the attributes that don't have `None` values. -For instance, retrieving an image by indexing into it returns the tuple `(outputs.images)`: - -```python -outputs[:1] -``` - - - -To check a specific pipeline or model output, refer to its corresponding API documentation. - - - -## BaseOutput - -[[autodoc]] utils.BaseOutput - - to_tuple - -## ImagePipelineOutput - -[[autodoc]] pipelines.ImagePipelineOutput - -## FlaxImagePipelineOutput - -[[autodoc]] pipelines.pipeline_flax_utils.FlaxImagePipelineOutput - -## AudioPipelineOutput - -[[autodoc]] pipelines.AudioPipelineOutput - -## ImageTextPipelineOutput - -[[autodoc]] ImageTextPipelineOutput diff --git a/diffusers/docs/source/en/api/pipelines/allegro.md b/diffusers/docs/source/en/api/pipelines/allegro.md deleted file mode 100644 index e13e339944e5183aad0081cd8e227508ab42883c..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/allegro.md +++ /dev/null @@ -1,34 +0,0 @@ - - -# Allegro - -[Allegro: Open the Black Box of Commercial-Level Video Generation Model](https://huggingface.co/papers/2410.15458) from RhymesAI, by Yuan Zhou, Qiuyue Wang, Yuxuan Cai, Huan Yang. - -The abstract from the paper is: - -*Significant advancements have been made in the field of video generation, with the open-source community contributing a wealth of research papers and tools for training high-quality models. However, despite these efforts, the available information and resources remain insufficient for achieving commercial-level performance. In this report, we open the black box and introduce Allegro, an advanced video generation model that excels in both quality and temporal consistency. We also highlight the current limitations in the field and present a comprehensive methodology for training high-performance, commercial-level video generation models, addressing key aspects such as data, model architecture, training pipeline, and evaluation. Our user study shows that Allegro surpasses existing open-source models and most commercial models, ranking just behind Hailuo and Kling. Code: https://github.com/rhymes-ai/Allegro , Model: https://huggingface.co/rhymes-ai/Allegro , Gallery: https://rhymes.ai/allegro_gallery .* - - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. 
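A hedged end-to-end sketch (the prompt, sampling settings, and export parameters below are illustrative):

```python
import torch
from diffusers import AllegroPipeline
from diffusers.utils import export_to_video

# Text-to-video generation with the full pipeline.
pipe = AllegroPipeline.from_pretrained("rhymes-ai/Allegro", torch_dtype=torch.bfloat16).to("cuda")

prompt = "A seaside carnival at dusk, slow camera pan"
video = pipe(prompt, guidance_scale=7.5, num_inference_steps=100).frames[0]
export_to_video(video, "allegro.mp4", fps=15)
```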
- - - -## AllegroPipeline - -[[autodoc]] AllegroPipeline - - all - - __call__ - -## AllegroPipelineOutput - -[[autodoc]] pipelines.allegro.pipeline_output.AllegroPipelineOutput diff --git a/diffusers/docs/source/en/api/pipelines/amused.md b/diffusers/docs/source/en/api/pipelines/amused.md deleted file mode 100644 index af20fcea177387fac6d9b1acbb2bbec9a09cf28c..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/amused.md +++ /dev/null @@ -1,48 +0,0 @@ - - -# aMUSEd - -aMUSEd was introduced in [aMUSEd: An Open MUSE Reproduction](https://huggingface.co/papers/2401.01808) by Suraj Patil, William Berman, Robin Rombach, and Patrick von Platen. - -Amused is a lightweight text to image model based off of the [MUSE](https://arxiv.org/abs/2301.00704) architecture. Amused is particularly useful in applications that require a lightweight and fast model such as generating many images quickly at once. - -Amused is a vqvae token based transformer that can generate an image in fewer forward passes than many diffusion models. In contrast with muse, it uses the smaller text encoder CLIP-L/14 instead of t5-xxl. Due to its small parameter count and few forward pass generation process, amused can generate many images quickly. This benefit is seen particularly at larger batch sizes. - -The abstract from the paper is: - -*We present aMUSEd, an open-source, lightweight masked image model (MIM) for text-to-image generation based on MUSE. With 10 percent of MUSE's parameters, aMUSEd is focused on fast image generation. We believe MIM is under-explored compared to latent diffusion, the prevailing approach for text-to-image generation. Compared to latent diffusion, MIM requires fewer inference steps and is more interpretable. Additionally, MIM can be fine-tuned to learn additional styles with only a single image. We hope to encourage further exploration of MIM by demonstrating its effectiveness on large-scale text-to-image generation and releasing reproducible training code. We also release checkpoints for two models which directly produce images at 256x256 and 512x512 resolutions.* - -| Model | Params | -|-------|--------| -| [amused-256](https://huggingface.co/amused/amused-256) | 603M | -| [amused-512](https://huggingface.co/amused/amused-512) | 608M | - -## AmusedPipeline - -[[autodoc]] AmusedPipeline - - __call__ - - all - - enable_xformers_memory_efficient_attention - - disable_xformers_memory_efficient_attention - -[[autodoc]] AmusedImg2ImgPipeline - - __call__ - - all - - enable_xformers_memory_efficient_attention - - disable_xformers_memory_efficient_attention - -[[autodoc]] AmusedInpaintPipeline - - __call__ - - all - - enable_xformers_memory_efficient_attention - - disable_xformers_memory_efficient_attention \ No newline at end of file diff --git a/diffusers/docs/source/en/api/pipelines/animatediff.md b/diffusers/docs/source/en/api/pipelines/animatediff.md deleted file mode 100644 index 7359012803624eca36fa279f42fcbf5e087df8b3..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/animatediff.md +++ /dev/null @@ -1,1052 +0,0 @@ - - -# Text-to-Video Generation with AnimateDiff - -## Overview - -[AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning](https://arxiv.org/abs/2307.04725) by Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, Bo Dai. 
- -The abstract of the paper is the following: - -*With the advance of text-to-image models (e.g., Stable Diffusion) and corresponding personalization techniques such as DreamBooth and LoRA, everyone can manifest their imagination into high-quality images at an affordable cost. Subsequently, there is a great demand for image animation techniques to further combine generated static images with motion dynamics. In this report, we propose a practical framework to animate most of the existing personalized text-to-image models once and for all, saving efforts in model-specific tuning. At the core of the proposed framework is to insert a newly initialized motion modeling module into the frozen text-to-image model and train it on video clips to distill reasonable motion priors. Once trained, by simply injecting this motion modeling module, all personalized versions derived from the same base T2I readily become text-driven models that produce diverse and personalized animated images. We conduct our evaluation on several public representative personalized text-to-image models across anime pictures and realistic photographs, and demonstrate that our proposed framework helps these models generate temporally smooth animation clips while preserving the domain and diversity of their outputs. Code and pre-trained weights will be publicly available at [this https URL](https://animatediff.github.io/).* - -## Available Pipelines - -| Pipeline | Tasks | Demo -|---|---|:---:| -| [AnimateDiffPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/animatediff/pipeline_animatediff.py) | *Text-to-Video Generation with AnimateDiff* | -| [AnimateDiffControlNetPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/animatediff/pipeline_animatediff_controlnet.py) | *Controlled Video-to-Video Generation with AnimateDiff using ControlNet* | -| [AnimateDiffSparseControlNetPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/animatediff/pipeline_animatediff_sparsectrl.py) | *Controlled Video-to-Video Generation with AnimateDiff using SparseCtrl* | -| [AnimateDiffSDXLPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/animatediff/pipeline_animatediff_sdxl.py) | *Video-to-Video Generation with AnimateDiff* | -| [AnimateDiffVideoToVideoPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/animatediff/pipeline_animatediff_video2video.py) | *Video-to-Video Generation with AnimateDiff* | -| [AnimateDiffVideoToVideoControlNetPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/animatediff/pipeline_animatediff_video2video_controlnet.py) | *Video-to-Video Generation with AnimateDiff using ControlNet* | - -## Available checkpoints - -Motion Adapter checkpoints can be found under [guoyww](https://huggingface.co/guoyww/). These checkpoints are meant to work with any model based on Stable Diffusion 1.4/1.5. - -## Usage example - -### AnimateDiffPipeline - -AnimateDiff works with a MotionAdapter checkpoint and a Stable Diffusion model checkpoint. The MotionAdapter is a collection of Motion Modules that are responsible for adding coherent motion across image frames. These modules are applied after the Resnet and Attention blocks in Stable Diffusion UNet. - -The following example demonstrates how to use a *MotionAdapter* checkpoint with Diffusers for inference based on StableDiffusion-1.4/1.5. 
- -```python -import torch -from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter -from diffusers.utils import export_to_gif - -# Load the motion adapter -adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16) -# load SD 1.5 based finetuned model -model_id = "SG161222/Realistic_Vision_V5.1_noVAE" -pipe = AnimateDiffPipeline.from_pretrained(model_id, motion_adapter=adapter, torch_dtype=torch.float16) -scheduler = DDIMScheduler.from_pretrained( - model_id, - subfolder="scheduler", - clip_sample=False, - timestep_spacing="linspace", - beta_schedule="linear", - steps_offset=1, -) -pipe.scheduler = scheduler - -# enable memory savings -pipe.enable_vae_slicing() -pipe.enable_model_cpu_offload() - -output = pipe( - prompt=( - "masterpiece, bestquality, highlydetailed, ultradetailed, sunset, " - "orange sky, warm lighting, fishing boats, ocean waves seagulls, " - "rippling water, wharf, silhouette, serene atmosphere, dusk, evening glow, " - "golden hour, coastal landscape, seaside scenery" - ), - negative_prompt="bad quality, worse quality", - num_frames=16, - guidance_scale=7.5, - num_inference_steps=25, - generator=torch.Generator("cpu").manual_seed(42), -) -frames = output.frames[0] -export_to_gif(frames, "animation.gif") -``` - -Here are some sample outputs: - - - - - -
- masterpiece, bestquality, sunset. -
- masterpiece, bestquality, sunset -
- - - -AnimateDiff tends to work better with finetuned Stable Diffusion models. If you plan on using a scheduler that can clip samples, make sure to disable it by setting `clip_sample=False` in the scheduler as this can also have an adverse effect on generated samples. Additionally, the AnimateDiff checkpoints can be sensitive to the beta schedule of the scheduler. We recommend setting this to `linear`. - - - -### AnimateDiffControlNetPipeline - -AnimateDiff can also be used with ControlNets ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. With a ControlNet model, you can provide an additional control image to condition and control Stable Diffusion generation. For example, if you provide depth maps, the ControlNet model generates a video that'll preserve the spatial information from the depth maps. It is a more flexible and accurate way to control the video generation process. - -```python -import torch -from diffusers import AnimateDiffControlNetPipeline, AutoencoderKL, ControlNetModel, MotionAdapter, LCMScheduler -from diffusers.utils import export_to_gif, load_video - -# Additionally, you will need a preprocess videos before they can be used with the ControlNet -# HF maintains just the right package for it: `pip install controlnet_aux` -from controlnet_aux.processor import ZoeDetector - -# Download controlnets from https://huggingface.co/lllyasviel/ControlNet-v1-1 to use .from_single_file -# Download Diffusers-format controlnets, such as https://huggingface.co/lllyasviel/sd-controlnet-depth, to use .from_pretrained() -controlnet = ControlNetModel.from_single_file("control_v11f1p_sd15_depth.pth", torch_dtype=torch.float16) - -# We use AnimateLCM for this example but one can use the original motion adapters as well (for example, https://huggingface.co/guoyww/animatediff-motion-adapter-v1-5-3) -motion_adapter = MotionAdapter.from_pretrained("wangfuyun/AnimateLCM") - -vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16) -pipe: AnimateDiffControlNetPipeline = AnimateDiffControlNetPipeline.from_pretrained( - "SG161222/Realistic_Vision_V5.1_noVAE", - motion_adapter=motion_adapter, - controlnet=controlnet, - vae=vae, -).to(device="cuda", dtype=torch.float16) -pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config, beta_schedule="linear") -pipe.load_lora_weights("wangfuyun/AnimateLCM", weight_name="AnimateLCM_sd15_t2v_lora.safetensors", adapter_name="lcm-lora") -pipe.set_adapters(["lcm-lora"], [0.8]) - -depth_detector = ZoeDetector.from_pretrained("lllyasviel/Annotators").to("cuda") -video = load_video("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-vid2vid-input-1.gif") -conditioning_frames = [] - -with pipe.progress_bar(total=len(video)) as progress_bar: - for frame in video: - conditioning_frames.append(depth_detector(frame)) - progress_bar.update() - -prompt = "a panda, playing a guitar, sitting in a pink boat, in the ocean, mountains in background, realistic, high quality" -negative_prompt = "bad quality, worst quality" - -video = pipe( - prompt=prompt, - negative_prompt=negative_prompt, - num_frames=len(video), - num_inference_steps=10, - guidance_scale=2.0, - conditioning_frames=conditioning_frames, - generator=torch.Generator().manual_seed(42), -).frames[0] - -export_to_gif(video, "animatediff_controlnet.gif", fps=8) -``` - 
-Here are some sample outputs: - - - - - - - - - - -
Source VideoOutput Video
- raccoon playing a guitar -
- racoon playing a guitar -
- a panda, playing a guitar, sitting in a pink boat, in the ocean, mountains in background, realistic, high quality -
- a panda, playing a guitar, sitting in a pink boat, in the ocean, mountains in background, realistic, high quality -
- -### AnimateDiffSparseControlNetPipeline - -[SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models](https://arxiv.org/abs/2311.16933) for achieving controlled generation in text-to-video diffusion models by Yuwei Guo, Ceyuan Yang, Anyi Rao, Maneesh Agrawala, Dahua Lin, and Bo Dai. - -The abstract from the paper is: - -*The development of text-to-video (T2V), i.e., generating videos with a given text prompt, has been significantly advanced in recent years. However, relying solely on text prompts often results in ambiguous frame composition due to spatial uncertainty. The research community thus leverages the dense structure signals, e.g., per-frame depth/edge sequences, to enhance controllability, whose collection accordingly increases the burden of inference. In this work, we present SparseCtrl to enable flexible structure control with temporally sparse signals, requiring only one or a few inputs, as shown in Figure 1. It incorporates an additional condition encoder to process these sparse signals while leaving the pre-trained T2V model untouched. The proposed approach is compatible with various modalities, including sketches, depth maps, and RGB images, providing more practical control for video generation and promoting applications such as storyboarding, depth rendering, keyframe animation, and interpolation. Extensive experiments demonstrate the generalization of SparseCtrl on both original and personalized T2V generators. Codes and models will be publicly available at [this https URL](https://guoyww.github.io/projects/SparseCtrl).* - -SparseCtrl introduces the following checkpoints for controlled text-to-video generation: - -- [SparseCtrl Scribble](https://huggingface.co/guoyww/animatediff-sparsectrl-scribble) -- [SparseCtrl RGB](https://huggingface.co/guoyww/animatediff-sparsectrl-rgb) - -#### Using SparseCtrl Scribble - -```python -import torch - -from diffusers import AnimateDiffSparseControlNetPipeline -from diffusers.models import AutoencoderKL, MotionAdapter, SparseControlNetModel -from diffusers.schedulers import DPMSolverMultistepScheduler -from diffusers.utils import export_to_gif, load_image - - -model_id = "SG161222/Realistic_Vision_V5.1_noVAE" -motion_adapter_id = "guoyww/animatediff-motion-adapter-v1-5-3" -controlnet_id = "guoyww/animatediff-sparsectrl-scribble" -lora_adapter_id = "guoyww/animatediff-motion-lora-v1-5-3" -vae_id = "stabilityai/sd-vae-ft-mse" -device = "cuda" - -motion_adapter = MotionAdapter.from_pretrained(motion_adapter_id, torch_dtype=torch.float16).to(device) -controlnet = SparseControlNetModel.from_pretrained(controlnet_id, torch_dtype=torch.float16).to(device) -vae = AutoencoderKL.from_pretrained(vae_id, torch_dtype=torch.float16).to(device) -scheduler = DPMSolverMultistepScheduler.from_pretrained( - model_id, - subfolder="scheduler", - beta_schedule="linear", - algorithm_type="dpmsolver++", - use_karras_sigmas=True, -) -pipe = AnimateDiffSparseControlNetPipeline.from_pretrained( - model_id, - motion_adapter=motion_adapter, - controlnet=controlnet, - vae=vae, - scheduler=scheduler, - torch_dtype=torch.float16, -).to(device) -pipe.load_lora_weights(lora_adapter_id, adapter_name="motion_lora") -pipe.fuse_lora(lora_scale=1.0) - -prompt = "an aerial view of a cyberpunk city, night time, neon lights, masterpiece, high quality" -negative_prompt = "low quality, worst quality, letterboxed" - -image_files = [ - 
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-scribble-1.png", - "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-scribble-2.png", - "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-scribble-3.png" -] -condition_frame_indices = [0, 8, 15] -conditioning_frames = [load_image(img_file) for img_file in image_files] - -video = pipe( - prompt=prompt, - negative_prompt=negative_prompt, - num_inference_steps=25, - conditioning_frames=conditioning_frames, - controlnet_conditioning_scale=1.0, - controlnet_frame_indices=condition_frame_indices, - generator=torch.Generator().manual_seed(1337), -).frames[0] -export_to_gif(video, "output.gif") -``` - -Here are some sample outputs: - - - -
- an aerial view of a cyberpunk city, night time, neon lights, masterpiece, high quality -
- - - - - - - - - -
-
- scribble-1 -
-
-
- scribble-2 -
-
-
- scribble-3 -
-
-
- an aerial view of a cyberpunk city, night time, neon lights, masterpiece, high quality -
-
- -#### Using SparseCtrl RGB - -```python -import torch - -from diffusers import AnimateDiffSparseControlNetPipeline -from diffusers.models import AutoencoderKL, MotionAdapter, SparseControlNetModel -from diffusers.schedulers import DPMSolverMultistepScheduler -from diffusers.utils import export_to_gif, load_image - - -model_id = "SG161222/Realistic_Vision_V5.1_noVAE" -motion_adapter_id = "guoyww/animatediff-motion-adapter-v1-5-3" -controlnet_id = "guoyww/animatediff-sparsectrl-rgb" -lora_adapter_id = "guoyww/animatediff-motion-lora-v1-5-3" -vae_id = "stabilityai/sd-vae-ft-mse" -device = "cuda" - -motion_adapter = MotionAdapter.from_pretrained(motion_adapter_id, torch_dtype=torch.float16).to(device) -controlnet = SparseControlNetModel.from_pretrained(controlnet_id, torch_dtype=torch.float16).to(device) -vae = AutoencoderKL.from_pretrained(vae_id, torch_dtype=torch.float16).to(device) -scheduler = DPMSolverMultistepScheduler.from_pretrained( - model_id, - subfolder="scheduler", - beta_schedule="linear", - algorithm_type="dpmsolver++", - use_karras_sigmas=True, -) -pipe = AnimateDiffSparseControlNetPipeline.from_pretrained( - model_id, - motion_adapter=motion_adapter, - controlnet=controlnet, - vae=vae, - scheduler=scheduler, - torch_dtype=torch.float16, -).to(device) -pipe.load_lora_weights(lora_adapter_id, adapter_name="motion_lora") - -image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-firework.png") - -video = pipe( - prompt="closeup face photo of man in black clothes, night city street, bokeh, fireworks in background", - negative_prompt="low quality, worst quality", - num_inference_steps=25, - conditioning_frames=image, - controlnet_frame_indices=[0], - controlnet_conditioning_scale=1.0, - generator=torch.Generator().manual_seed(42), -).frames[0] -export_to_gif(video, "output.gif") -``` - -Here are some sample outputs: - - - -
- closeup face photo of man in black clothes, night city street, bokeh, fireworks in background -
- - - - - -
-
- closeup face photo of man in black clothes, night city street, bokeh, fireworks in background -
-
-
- closeup face photo of man in black clothes, night city street, bokeh, fireworks in background -
-
- -### AnimateDiffSDXLPipeline - -AnimateDiff can also be used with SDXL models. This is currently an experimental feature as only a beta release of the motion adapter checkpoint is available. - -```python -import torch -from diffusers.models import MotionAdapter -from diffusers import AnimateDiffSDXLPipeline, DDIMScheduler -from diffusers.utils import export_to_gif - -adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-sdxl-beta", torch_dtype=torch.float16) - -model_id = "stabilityai/stable-diffusion-xl-base-1.0" -scheduler = DDIMScheduler.from_pretrained( - model_id, - subfolder="scheduler", - clip_sample=False, - timestep_spacing="linspace", - beta_schedule="linear", - steps_offset=1, -) -pipe = AnimateDiffSDXLPipeline.from_pretrained( - model_id, - motion_adapter=adapter, - scheduler=scheduler, - torch_dtype=torch.float16, - variant="fp16", -).to("cuda") - -# enable memory savings -pipe.enable_vae_slicing() -pipe.enable_vae_tiling() - -output = pipe( - prompt="a panda surfing in the ocean, realistic, high quality", - negative_prompt="low quality, worst quality", - num_inference_steps=20, - guidance_scale=8, - width=1024, - height=1024, - num_frames=16, -) - -frames = output.frames[0] -export_to_gif(frames, "animation.gif") -``` - -### AnimateDiffVideoToVideoPipeline - -AnimateDiff can also be used to generate visually similar videos or enable style/character/background or other edits starting from an initial video, allowing you to seamlessly explore creative possibilities. - -```python -import imageio -import requests -import torch -from diffusers import AnimateDiffVideoToVideoPipeline, DDIMScheduler, MotionAdapter -from diffusers.utils import export_to_gif -from io import BytesIO -from PIL import Image - -# Load the motion adapter -adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16) -# load SD 1.5 based finetuned model -model_id = "SG161222/Realistic_Vision_V5.1_noVAE" -pipe = AnimateDiffVideoToVideoPipeline.from_pretrained(model_id, motion_adapter=adapter, torch_dtype=torch.float16) -scheduler = DDIMScheduler.from_pretrained( - model_id, - subfolder="scheduler", - clip_sample=False, - timestep_spacing="linspace", - beta_schedule="linear", - steps_offset=1, -) -pipe.scheduler = scheduler - -# enable memory savings -pipe.enable_vae_slicing() -pipe.enable_model_cpu_offload() - -# helper function to load videos -def load_video(file_path: str): - images = [] - - if file_path.startswith(('http://', 'https://')): - # If the file_path is a URL - response = requests.get(file_path) - response.raise_for_status() - content = BytesIO(response.content) - vid = imageio.get_reader(content) - else: - # Assuming it's a local file path - vid = imageio.get_reader(file_path) - - for frame in vid: - pil_image = Image.fromarray(frame) - images.append(pil_image) - - return images - -video = load_video("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-vid2vid-input-1.gif") - -output = pipe( - video = video, - prompt="panda playing a guitar, on a boat, in the ocean, high quality", - negative_prompt="bad quality, worse quality", - guidance_scale=7.5, - num_inference_steps=25, - strength=0.5, - generator=torch.Generator("cpu").manual_seed(42), -) -frames = output.frames[0] -export_to_gif(frames, "animation.gif") -``` - -Here are some sample outputs: - - - - - - - - - - - - - - -
Source VideoOutput Video
- raccoon playing a guitar -
- racoon playing a guitar -
- panda playing a guitar -
- panda playing a guitar -
- closeup of margot robbie, fireworks in the background, high quality -
- closeup of margot robbie, fireworks in the background, high quality -
- closeup of tony stark, robert downey jr, fireworks -
- closeup of tony stark, robert downey jr, fireworks -
- - - -### AnimateDiffVideoToVideoControlNetPipeline - -AnimateDiff can be used together with ControlNets to enhance video-to-video generation by allowing for precise control over the output. ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala, and allows you to condition Stable Diffusion with an additional control image to ensure that the spatial information is preserved throughout the video. - -This pipeline allows you to condition your generation both on the original video and on a sequence of control images. - -```python -import torch -from PIL import Image -from tqdm.auto import tqdm - -from controlnet_aux.processor import OpenposeDetector -from diffusers import AnimateDiffVideoToVideoControlNetPipeline -from diffusers.utils import export_to_gif, load_video -from diffusers import AutoencoderKL, ControlNetModel, MotionAdapter, LCMScheduler - -# Load the ControlNet -controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16) -# Load the motion adapter -motion_adapter = MotionAdapter.from_pretrained("wangfuyun/AnimateLCM") -# Load SD 1.5 based finetuned model -vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16) -pipe = AnimateDiffVideoToVideoControlNetPipeline.from_pretrained( - "SG161222/Realistic_Vision_V5.1_noVAE", - motion_adapter=motion_adapter, - controlnet=controlnet, - vae=vae, -).to(device="cuda", dtype=torch.float16) - -# Enable LCM to speed up inference -pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config, beta_schedule="linear") -pipe.load_lora_weights("wangfuyun/AnimateLCM", weight_name="AnimateLCM_sd15_t2v_lora.safetensors", adapter_name="lcm-lora") -pipe.set_adapters(["lcm-lora"], [0.8]) - -video = load_video("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/dance.gif") -video = [frame.convert("RGB") for frame in video] - -prompt = "astronaut in space, dancing" -negative_prompt = "bad quality, worst quality, jpeg artifacts, ugly" - -# Create controlnet preprocessor -open_pose = OpenposeDetector.from_pretrained("lllyasviel/Annotators").to("cuda") - -# Preprocess controlnet images -conditioning_frames = [] -for frame in tqdm(video): - conditioning_frames.append(open_pose(frame)) - -strength = 0.8 -with torch.inference_mode(): - video = pipe( - video=video, - prompt=prompt, - negative_prompt=negative_prompt, - num_inference_steps=10, - guidance_scale=2.0, - controlnet_conditioning_scale=0.75, - conditioning_frames=conditioning_frames, - strength=strength, - generator=torch.Generator().manual_seed(42), - ).frames[0] - -video = [frame.resize(conditioning_frames[0].size) for frame in video] -export_to_gif(video, f"animatediff_vid2vid_controlnet.gif", fps=8) -``` - -Here are some sample outputs: - - - - - - - - - - -
Source VideoOutput Video
- anime girl, dancing -
- anime girl, dancing -
- astronaut in space, dancing -
- astronaut in space, dancing -
- -**The lights and composition were transferred from the Source Video.** - -## Using Motion LoRAs - -Motion LoRAs are a collection of LoRAs that work with the `guoyww/animatediff-motion-adapter-v1-5-2` checkpoint. These LoRAs are responsible for adding specific types of motion to the animations. - -```python -import torch -from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter -from diffusers.utils import export_to_gif - -# Load the motion adapter -adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16) -# load SD 1.5 based finetuned model -model_id = "SG161222/Realistic_Vision_V5.1_noVAE" -pipe = AnimateDiffPipeline.from_pretrained(model_id, motion_adapter=adapter, torch_dtype=torch.float16) -pipe.load_lora_weights( - "guoyww/animatediff-motion-lora-zoom-out", adapter_name="zoom-out" -) - -scheduler = DDIMScheduler.from_pretrained( - model_id, - subfolder="scheduler", - clip_sample=False, - beta_schedule="linear", - timestep_spacing="linspace", - steps_offset=1, -) -pipe.scheduler = scheduler - -# enable memory savings -pipe.enable_vae_slicing() -pipe.enable_model_cpu_offload() - -output = pipe( - prompt=( - "masterpiece, bestquality, highlydetailed, ultradetailed, sunset, " - "orange sky, warm lighting, fishing boats, ocean waves seagulls, " - "rippling water, wharf, silhouette, serene atmosphere, dusk, evening glow, " - "golden hour, coastal landscape, seaside scenery" - ), - negative_prompt="bad quality, worse quality", - num_frames=16, - guidance_scale=7.5, - num_inference_steps=25, - generator=torch.Generator("cpu").manual_seed(42), -) -frames = output.frames[0] -export_to_gif(frames, "animation.gif") -``` - - - - - -
- masterpiece, bestquality, sunset. -
- masterpiece, bestquality, sunset -
- -## Using Motion LoRAs with PEFT - -You can also leverage the [PEFT](https://github.com/huggingface/peft) backend to combine Motion LoRA's and create more complex animations. - -First install PEFT with - -```shell -pip install peft -``` - -Then you can use the following code to combine Motion LoRAs. - -```python -import torch -from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter -from diffusers.utils import export_to_gif - -# Load the motion adapter -adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16) -# load SD 1.5 based finetuned model -model_id = "SG161222/Realistic_Vision_V5.1_noVAE" -pipe = AnimateDiffPipeline.from_pretrained(model_id, motion_adapter=adapter, torch_dtype=torch.float16) - -pipe.load_lora_weights( - "diffusers/animatediff-motion-lora-zoom-out", adapter_name="zoom-out", -) -pipe.load_lora_weights( - "diffusers/animatediff-motion-lora-pan-left", adapter_name="pan-left", -) -pipe.set_adapters(["zoom-out", "pan-left"], adapter_weights=[1.0, 1.0]) - -scheduler = DDIMScheduler.from_pretrained( - model_id, - subfolder="scheduler", - clip_sample=False, - timestep_spacing="linspace", - beta_schedule="linear", - steps_offset=1, -) -pipe.scheduler = scheduler - -# enable memory savings -pipe.enable_vae_slicing() -pipe.enable_model_cpu_offload() - -output = pipe( - prompt=( - "masterpiece, bestquality, highlydetailed, ultradetailed, sunset, " - "orange sky, warm lighting, fishing boats, ocean waves seagulls, " - "rippling water, wharf, silhouette, serene atmosphere, dusk, evening glow, " - "golden hour, coastal landscape, seaside scenery" - ), - negative_prompt="bad quality, worse quality", - num_frames=16, - guidance_scale=7.5, - num_inference_steps=25, - generator=torch.Generator("cpu").manual_seed(42), -) -frames = output.frames[0] -export_to_gif(frames, "animation.gif") -``` - - - - - -
- masterpiece, bestquality, sunset. -
- masterpiece, bestquality, sunset -
- -## Using FreeInit - -[FreeInit: Bridging Initialization Gap in Video Diffusion Models](https://arxiv.org/abs/2312.07537) by Tianxing Wu, Chenyang Si, Yuming Jiang, Ziqi Huang, Ziwei Liu. - -FreeInit is an effective method that improves temporal consistency and overall quality of videos generated using video-diffusion-models without any addition training. It can be applied to AnimateDiff, ModelScope, VideoCrafter and various other video generation models seamlessly at inference time, and works by iteratively refining the latent-initialization noise. More details can be found it the paper. - -The following example demonstrates the usage of FreeInit. - -```python -import torch -from diffusers import MotionAdapter, AnimateDiffPipeline, DDIMScheduler -from diffusers.utils import export_to_gif - -adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2") -model_id = "SG161222/Realistic_Vision_V5.1_noVAE" -pipe = AnimateDiffPipeline.from_pretrained(model_id, motion_adapter=adapter, torch_dtype=torch.float16).to("cuda") -pipe.scheduler = DDIMScheduler.from_pretrained( - model_id, - subfolder="scheduler", - beta_schedule="linear", - clip_sample=False, - timestep_spacing="linspace", - steps_offset=1 -) - -# enable memory savings -pipe.enable_vae_slicing() -pipe.enable_vae_tiling() - -# enable FreeInit -# Refer to the enable_free_init documentation for a full list of configurable parameters -pipe.enable_free_init(method="butterworth", use_fast_sampling=True) - -# run inference -output = pipe( - prompt="a panda playing a guitar, on a boat, in the ocean, high quality", - negative_prompt="bad quality, worse quality", - num_frames=16, - guidance_scale=7.5, - num_inference_steps=20, - generator=torch.Generator("cpu").manual_seed(666), -) - -# disable FreeInit -pipe.disable_free_init() - -frames = output.frames[0] -export_to_gif(frames, "animation.gif") -``` - - - -FreeInit is not really free - the improved quality comes at the cost of extra computation. It requires sampling a few extra times depending on the `num_iters` parameter that is set when enabling it. Setting the `use_fast_sampling` parameter to `True` can improve the overall performance (at the cost of lower quality compared to when `use_fast_sampling=False` but still better results than vanilla video generation models). - - - - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. - - - - - - - - - - - - -
Without FreeInit enabledWith FreeInit enabled
- panda playing a guitar -
- panda playing a guitar -
- panda playing a guitar -
- panda playing a guitar -
- -## Using AnimateLCM - -[AnimateLCM](https://animatelcm.github.io/) is a motion module checkpoint and an [LCM LoRA](https://huggingface.co/docs/diffusers/using-diffusers/inference_with_lcm_lora) that have been created using a consistency learning strategy that decouples the distillation of the image generation priors and the motion generation priors. - -```python -import torch -from diffusers import AnimateDiffPipeline, LCMScheduler, MotionAdapter -from diffusers.utils import export_to_gif - -adapter = MotionAdapter.from_pretrained("wangfuyun/AnimateLCM") -pipe = AnimateDiffPipeline.from_pretrained("emilianJR/epiCRealism", motion_adapter=adapter) -pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config, beta_schedule="linear") - -pipe.load_lora_weights("wangfuyun/AnimateLCM", weight_name="sd15_lora_beta.safetensors", adapter_name="lcm-lora") - -pipe.enable_vae_slicing() -pipe.enable_model_cpu_offload() - -output = pipe( - prompt="A space rocket with trails of smoke behind it launching into space from the desert, 4k, high resolution", - negative_prompt="bad quality, worse quality, low resolution", - num_frames=16, - guidance_scale=1.5, - num_inference_steps=6, - generator=torch.Generator("cpu").manual_seed(0), -) -frames = output.frames[0] -export_to_gif(frames, "animatelcm.gif") -``` - - - - - -
- A space rocket, 4K. -
- A space rocket, 4K -
- -AnimateLCM is also compatible with existing [Motion LoRAs](https://huggingface.co/collections/dn6/animatediff-motion-loras-654cb8ad732b9e3cf4d3c17e). - -```python -import torch -from diffusers import AnimateDiffPipeline, LCMScheduler, MotionAdapter -from diffusers.utils import export_to_gif - -adapter = MotionAdapter.from_pretrained("wangfuyun/AnimateLCM") -pipe = AnimateDiffPipeline.from_pretrained("emilianJR/epiCRealism", motion_adapter=adapter) -pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config, beta_schedule="linear") - -pipe.load_lora_weights("wangfuyun/AnimateLCM", weight_name="sd15_lora_beta.safetensors", adapter_name="lcm-lora") -pipe.load_lora_weights("guoyww/animatediff-motion-lora-tilt-up", adapter_name="tilt-up") - -pipe.set_adapters(["lcm-lora", "tilt-up"], [1.0, 0.8]) -pipe.enable_vae_slicing() -pipe.enable_model_cpu_offload() - -output = pipe( - prompt="A space rocket with trails of smoke behind it launching into space from the desert, 4k, high resolution", - negative_prompt="bad quality, worse quality, low resolution", - num_frames=16, - guidance_scale=1.5, - num_inference_steps=6, - generator=torch.Generator("cpu").manual_seed(0), -) -frames = output.frames[0] -export_to_gif(frames, "animatelcm-motion-lora.gif") -``` - - - - - -
- A space rocket, 4K. -
- A space rocket, 4K -
- -## Using FreeNoise - -[FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling](https://arxiv.org/abs/2310.15169) by Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xintao Wang, Ying Shan, Ziwei Liu. - -FreeNoise is a sampling mechanism that can generate longer videos with short-video generation models by employing noise-rescheduling, temporal attention over sliding windows, and weighted averaging of latent frames. It also can be used with multiple prompts to allow for interpolated video generations. More details are available in the paper. - -The currently supported AnimateDiff pipelines that can be used with FreeNoise are: -- [`AnimateDiffPipeline`] -- [`AnimateDiffControlNetPipeline`] -- [`AnimateDiffVideoToVideoPipeline`] -- [`AnimateDiffVideoToVideoControlNetPipeline`] - -In order to use FreeNoise, a single line needs to be added to the inference code after loading your pipelines. - -```diff -+ pipe.enable_free_noise() -``` - -After this, either a single prompt could be used, or multiple prompts can be passed as a dictionary of integer-string pairs. The integer keys of the dictionary correspond to the frame index at which the influence of that prompt would be maximum. Each frame index should map to a single string prompt. The prompts for intermediate frame indices, that are not passed in the dictionary, are created by interpolating between the frame prompts that are passed. By default, simple linear interpolation is used. However, you can customize this behaviour with a callback to the `prompt_interpolation_callback` parameter when enabling FreeNoise. - -Full example: - -```python -import torch -from diffusers import AutoencoderKL, AnimateDiffPipeline, LCMScheduler, MotionAdapter -from diffusers.utils import export_to_video, load_image - -# Load pipeline -dtype = torch.float16 -motion_adapter = MotionAdapter.from_pretrained("wangfuyun/AnimateLCM", torch_dtype=dtype) -vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=dtype) - -pipe = AnimateDiffPipeline.from_pretrained("emilianJR/epiCRealism", motion_adapter=motion_adapter, vae=vae, torch_dtype=dtype) -pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config, beta_schedule="linear") - -pipe.load_lora_weights( - "wangfuyun/AnimateLCM", weight_name="AnimateLCM_sd15_t2v_lora.safetensors", adapter_name="lcm_lora" -) -pipe.set_adapters(["lcm_lora"], [0.8]) - -# Enable FreeNoise for long prompt generation -pipe.enable_free_noise(context_length=16, context_stride=4) -pipe.to("cuda") - -# Can be a single prompt, or a dictionary with frame timesteps -prompt = { - 0: "A caterpillar on a leaf, high quality, photorealistic", - 40: "A caterpillar transforming into a cocoon, on a leaf, near flowers, photorealistic", - 80: "A cocoon on a leaf, flowers in the backgrond, photorealistic", - 120: "A cocoon maturing and a butterfly being born, flowers and leaves visible in the background, photorealistic", - 160: "A beautiful butterfly, vibrant colors, sitting on a leaf, flowers in the background, photorealistic", - 200: "A beautiful butterfly, flying away in a forest, photorealistic", - 240: "A cyberpunk butterfly, neon lights, glowing", -} -negative_prompt = "bad quality, worst quality, jpeg artifacts" - -# Run inference -output = pipe( - prompt=prompt, - negative_prompt=negative_prompt, - num_frames=256, - guidance_scale=2.5, - num_inference_steps=10, - generator=torch.Generator("cpu").manual_seed(0), -) - -# Save video -frames = output.frames[0] -export_to_video(frames, "output.mp4", fps=16) -``` - 
-### FreeNoise memory savings - -Since FreeNoise processes multiple frames together, there are parts in the modeling where the memory required exceeds that available on normal consumer GPUs. The main memory bottlenecks that we identified are spatial and temporal attention blocks, upsampling and downsampling blocks, resnet blocks and feed-forward layers. Since most of these blocks operate effectively only on the channel/embedding dimension, one can perform chunked inference across the batch dimensions. The batch dimension in AnimateDiff are either spatial (`[B x F, H x W, C]`) or temporal (`B x H x W, F, C`) in nature (note that it may seem counter-intuitive, but the batch dimension here are correct, because spatial blocks process across the `B x F` dimension while the temporal blocks process across the `B x H x W` dimension). We introduce a `SplitInferenceModule` that makes it easier to chunk across any dimension and perform inference. This saves a lot of memory but comes at the cost of requiring more time for inference. - -```diff -# Load pipeline and adapters -# ... -+ pipe.enable_free_noise_split_inference() -+ pipe.unet.enable_forward_chunking(16) -``` - -The call to `pipe.enable_free_noise_split_inference` method accepts two parameters: `spatial_split_size` (defaults to `256`) and `temporal_split_size` (defaults to `16`). These can be configured based on how much VRAM you have available. A lower split size results in lower memory usage but slower inference, whereas a larger split size results in faster inference at the cost of more memory. - -## Using `from_single_file` with the MotionAdapter - -`diffusers>=0.30.0` supports loading the AnimateDiff checkpoints into the `MotionAdapter` in their original format via `from_single_file` - -```python -from diffusers import MotionAdapter - -ckpt_path = "https://huggingface.co/Lightricks/LongAnimateDiff/blob/main/lt_long_mm_32_frames.ckpt" - -adapter = MotionAdapter.from_single_file(ckpt_path, torch_dtype=torch.float16) -pipe = AnimateDiffPipeline.from_pretrained("emilianJR/epiCRealism", motion_adapter=adapter) -``` - -## AnimateDiffPipeline - -[[autodoc]] AnimateDiffPipeline - - all - - __call__ - -## AnimateDiffControlNetPipeline - -[[autodoc]] AnimateDiffControlNetPipeline - - all - - __call__ - -## AnimateDiffSparseControlNetPipeline - -[[autodoc]] AnimateDiffSparseControlNetPipeline - - all - - __call__ - -## AnimateDiffSDXLPipeline - -[[autodoc]] AnimateDiffSDXLPipeline - - all - - __call__ - -## AnimateDiffVideoToVideoPipeline - -[[autodoc]] AnimateDiffVideoToVideoPipeline - - all - - __call__ - -## AnimateDiffVideoToVideoControlNetPipeline - -[[autodoc]] AnimateDiffVideoToVideoControlNetPipeline - - all - - __call__ - -## AnimateDiffPipelineOutput - -[[autodoc]] pipelines.animatediff.AnimateDiffPipelineOutput diff --git a/diffusers/docs/source/en/api/pipelines/attend_and_excite.md b/diffusers/docs/source/en/api/pipelines/attend_and_excite.md deleted file mode 100644 index fd8dd95fa1c3f95e2fb822e6ed2c61e82538374c..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/attend_and_excite.md +++ /dev/null @@ -1,37 +0,0 @@ - - -# Attend-and-Excite - -Attend-and-Excite for Stable Diffusion was proposed in [Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models](https://attendandexcite.github.io/Attend-and-Excite/) and provides textual attention control over image generation. 
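
In practice, the guidance is driven through `token_indices`: the positions of the subject tokens in the tokenized prompt whose cross-attention maps should be strengthened. The sketch below is illustrative only; the Stable Diffusion v1-4 checkpoint, the `get_indices` helper, and the `max_iter_to_alter` value are assumptions rather than verified settings.

```python
import torch
from diffusers import StableDiffusionAttendAndExcitePipeline

pipe = StableDiffusionAttendAndExcitePipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

prompt = "a cat and a frog"
# Inspect the tokenized prompt to find the subject tokens to "excite"
# (get_indices is assumed here as a convenience helper; the printed mapping
# shows which index corresponds to "cat" and which to "frog").
print(pipe.get_indices(prompt))
token_indices = [2, 5]

image = pipe(
    prompt=prompt,
    token_indices=token_indices,
    guidance_scale=7.5,
    num_inference_steps=50,
    max_iter_to_alter=25,  # number of denoising steps that apply the attention update (assumed)
    generator=torch.Generator("cuda").manual_seed(6141),
).images[0]
image.save("cat_and_frog.png")
```
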
- -The abstract from the paper is: - -*Recent text-to-image generative models have demonstrated an unparalleled ability to generate diverse and creative imagery guided by a target text prompt. While revolutionary, current state-of-the-art diffusion models may still fail in generating images that fully convey the semantics in the given text prompt. We analyze the publicly available Stable Diffusion model and assess the existence of catastrophic neglect, where the model fails to generate one or more of the subjects from the input prompt. Moreover, we find that in some cases the model also fails to correctly bind attributes (e.g., colors) to their corresponding subjects. To help mitigate these failure cases, we introduce the concept of Generative Semantic Nursing (GSN), where we seek to intervene in the generative process on the fly during inference time to improve the faithfulness of the generated images. Using an attention-based formulation of GSN, dubbed Attend-and-Excite, we guide the model to refine the cross-attention units to attend to all subject tokens in the text prompt and strengthen - or excite - their activations, encouraging the model to generate all subjects described in the text prompt. We compare our approach to alternative approaches and demonstrate that it conveys the desired concepts more faithfully across a range of text prompts.* - -You can find additional information about Attend-and-Excite on the [project page](https://attendandexcite.github.io/Attend-and-Excite/), the [original codebase](https://github.com/AttendAndExcite/Attend-and-Excite), or try it out in a [demo](https://huggingface.co/spaces/AttendAndExcite/Attend-and-Excite). - - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. - - - -## StableDiffusionAttendAndExcitePipeline - -[[autodoc]] StableDiffusionAttendAndExcitePipeline - - all - - __call__ - -## StableDiffusionPipelineOutput - -[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput diff --git a/diffusers/docs/source/en/api/pipelines/audioldm.md b/diffusers/docs/source/en/api/pipelines/audioldm.md deleted file mode 100644 index 95d41b9569f54dde053de02af8f0f17a81f61298..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/audioldm.md +++ /dev/null @@ -1,50 +0,0 @@ - - -# AudioLDM - -AudioLDM was proposed in [AudioLDM: Text-to-Audio Generation with Latent Diffusion Models](https://huggingface.co/papers/2301.12503) by Haohe Liu et al. Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview), AudioLDM -is a text-to-audio _latent diffusion model (LDM)_ that learns continuous audio representations from [CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap) -latents. AudioLDM takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional -sound effects, human speech and music. - -The abstract from the paper is: - -*Text-to-audio (TTA) system has recently gained attention for its ability to synthesize general audio based on text descriptions. However, previous studies in TTA have limited generation quality with high computational costs. 
In this study, we propose AudioLDM, a TTA system that is built on a latent space to learn the continuous audio representations from contrastive language-audio pretraining (CLAP) latents. The pretrained CLAP models enable us to train LDMs with audio embedding while providing text embedding as a condition during sampling. By learning the latent representations of audio signals and their compositions without modeling the cross-modal relationship, AudioLDM is advantageous in both generation quality and computational efficiency. Trained on AudioCaps with a single GPU, AudioLDM achieves state-of-the-art TTA performance measured by both objective and subjective metrics (e.g., frechet distance). Moreover, AudioLDM is the first TTA system that enables various text-guided audio manipulations (e.g., style transfer) in a zero-shot fashion. Our implementation and demos are available at [this https URL](https://audioldm.github.io/).* - -The original codebase can be found at [haoheliu/AudioLDM](https://github.com/haoheliu/AudioLDM). - -## Tips - -When constructing a prompt, keep in mind: - -* Descriptive prompt inputs work best; you can use adjectives to describe the sound (for example, "high quality" or "clear") and make the prompt context specific (for example, "water stream in a forest" instead of "stream"). -* It's best to use general terms like "cat" or "dog" instead of specific names or abstract objects the model may not be familiar with. - -During inference: - -* The _quality_ of the predicted audio sample can be controlled by the `num_inference_steps` argument; higher steps give higher quality audio at the expense of slower inference. -* The _length_ of the predicted audio sample can be controlled by varying the `audio_length_in_s` argument. - - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. - - - -## AudioLDMPipeline -[[autodoc]] AudioLDMPipeline - - all - - __call__ - -## AudioPipelineOutput -[[autodoc]] pipelines.AudioPipelineOutput diff --git a/diffusers/docs/source/en/api/pipelines/audioldm2.md b/diffusers/docs/source/en/api/pipelines/audioldm2.md deleted file mode 100644 index 9f2b7529d4bc27943448bf37c545dfe8b6d60cee..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/audioldm2.md +++ /dev/null @@ -1,81 +0,0 @@ - - -# AudioLDM 2 - -AudioLDM 2 was proposed in [AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining](https://arxiv.org/abs/2308.05734) by Haohe Liu et al. AudioLDM 2 takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional sound effects, human speech and music. - -Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview), AudioLDM 2 is a text-to-audio _latent diffusion model (LDM)_ that learns continuous audio representations from text embeddings. Two text encoder models are used to compute the text embeddings from a prompt input: the text-branch of [CLAP](https://huggingface.co/docs/transformers/main/en/model_doc/clap) and the encoder of [Flan-T5](https://huggingface.co/docs/transformers/main/en/model_doc/flan-t5). 
These text embeddings are then projected to a shared embedding space by an [AudioLDM2ProjectionModel](https://huggingface.co/docs/diffusers/main/api/pipelines/audioldm2#diffusers.AudioLDM2ProjectionModel). A [GPT2](https://huggingface.co/docs/transformers/main/en/model_doc/gpt2) _language model (LM)_ is used to auto-regressively predict eight new embedding vectors, conditional on the projected CLAP and Flan-T5 embeddings. The generated embedding vectors and Flan-T5 text embeddings are used as cross-attention conditioning in the LDM. The [UNet](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2#diffusers.AudioLDM2UNet2DConditionModel) of AudioLDM 2 is unique in the sense that it takes **two** cross-attention embeddings, as opposed to one cross-attention conditioning, as in most other LDMs. - -The abstract of the paper is the following: - -*Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To bring us closer to a unified perspective of audio generation, this paper proposes a framework that utilizes the same learning method for speech, music, and sound effect generation. Our framework introduces a general representation of audio, called "language of audio" (LOA). Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model. In the generation process, we translate any modalities into LOA by using a GPT-2 model, and we perform self-supervised audio generation learning with a latent diffusion model conditioned on LOA. The proposed framework naturally brings advantages such as in-context learning abilities and reusable self-supervised pretrained AudioMAE and latent diffusion models. Experiments on the major benchmarks of text-to-audio, text-to-music, and text-to-speech demonstrate state-of-the-art or competitive performance against previous approaches. Our code, pretrained model, and demo are available at [this https URL](https://audioldm.github.io/audioldm2).* - -This pipeline was contributed by [sanchit-gandhi](https://huggingface.co/sanchit-gandhi) and [Nguyễn Công Tú Anh](https://github.com/tuanh123789). The original codebase can be -found at [haoheliu/audioldm2](https://github.com/haoheliu/audioldm2). - -## Tips - -### Choosing a checkpoint - -AudioLDM2 comes in three variants. Two of these checkpoints are applicable to the general task of text-to-audio generation. The third checkpoint is trained exclusively on text-to-music generation. - -All checkpoints share the same model size for the text encoders and VAE. They differ in the size and depth of the UNet. 
-See table below for details on the three checkpoints: - -| Checkpoint | Task | UNet Model Size | Total Model Size | Training Data / h | -|-----------------------------------------------------------------|---------------|-----------------|------------------|-------------------| -| [audioldm2](https://huggingface.co/cvssp/audioldm2) | Text-to-audio | 350M | 1.1B | 1150k | -| [audioldm2-large](https://huggingface.co/cvssp/audioldm2-large) | Text-to-audio | 750M | 1.5B | 1150k | -| [audioldm2-music](https://huggingface.co/cvssp/audioldm2-music) | Text-to-music | 350M | 1.1B | 665k | -| [audioldm2-gigaspeech](https://huggingface.co/anhnct/audioldm2_gigaspeech) | Text-to-speech | 350M | 1.1B |10k | -| [audioldm2-ljspeech](https://huggingface.co/anhnct/audioldm2_ljspeech) | Text-to-speech | 350M | 1.1B | | - -### Constructing a prompt - -* Descriptive prompt inputs work best: use adjectives to describe the sound (e.g. "high quality" or "clear") and make the prompt context specific (e.g. "water stream in a forest" instead of "stream"). -* It's best to use general terms like "cat" or "dog" instead of specific names or abstract objects the model may not be familiar with. -* Using a **negative prompt** can significantly improve the quality of the generated waveform, by guiding the generation away from terms that correspond to poor quality audio. Try using a negative prompt of "Low quality." - -### Controlling inference - -* The _quality_ of the predicted audio sample can be controlled by the `num_inference_steps` argument; higher steps give higher quality audio at the expense of slower inference. -* The _length_ of the predicted audio sample can be controlled by varying the `audio_length_in_s` argument. - -### Evaluating generated waveforms: - -* The quality of the generated waveforms can vary significantly based on the seed. Try generating with different seeds until you find a satisfactory generation. -* Multiple waveforms can be generated in one go: set `num_waveforms_per_prompt` to a value greater than 1. Automatic scoring will be performed between the generated waveforms and prompt text, and the audios ranked from best to worst accordingly. - -The following example demonstrates how to construct good music and speech generation using the aforementioned tips: [example](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2#diffusers.AudioLDM2Pipeline.__call__.example). - - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. 
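
Putting these tips together, a typical text-to-audio call looks roughly like the sketch below. The `cvssp/audioldm2` checkpoint comes from the table above; the 200 inference steps, 10 second duration, and 16 kHz output rate are illustrative assumptions to adjust for your use case.

```python
import torch
from scipy.io import wavfile
from diffusers import AudioLDM2Pipeline

pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2", torch_dtype=torch.float16).to("cuda")

prompt = "The sound of a hammer hitting a wooden surface."
negative_prompt = "Low quality."

# num_waveforms_per_prompt > 1 generates several candidates, returned ranked best-to-worst
audio = pipe(
    prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=200,
    audio_length_in_s=10.0,
    num_waveforms_per_prompt=3,
    generator=torch.Generator("cuda").manual_seed(0),
).audios

# the top-ranked waveform comes first; a 16 kHz sampling rate is assumed here
wavfile.write("hammer.wav", rate=16000, data=audio[0])
```
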
- - - -## AudioLDM2Pipeline -[[autodoc]] AudioLDM2Pipeline - - all - - __call__ - -## AudioLDM2ProjectionModel -[[autodoc]] AudioLDM2ProjectionModel - - forward - -## AudioLDM2UNet2DConditionModel -[[autodoc]] AudioLDM2UNet2DConditionModel - - forward - -## AudioPipelineOutput -[[autodoc]] pipelines.AudioPipelineOutput diff --git a/diffusers/docs/source/en/api/pipelines/aura_flow.md b/diffusers/docs/source/en/api/pipelines/aura_flow.md deleted file mode 100644 index aa5a04800e6fcdf241e60fc22470246151759868..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/aura_flow.md +++ /dev/null @@ -1,29 +0,0 @@ - - -# AuraFlow - -AuraFlow is inspired by [Stable Diffusion 3](../pipelines/stable_diffusion/stable_diffusion_3.md) and is by far the largest text-to-image generation model that comes with an Apache 2.0 license. This model achieves state-of-the-art results on the [GenEval](https://github.com/djghosh13/geneval) benchmark. - -It was developed by the Fal team and more details about it can be found in [this blog post](https://blog.fal.ai/auraflow/). - - - -AuraFlow can be quite expensive to run on consumer hardware devices. However, you can perform a suite of optimizations to run it faster and in a more memory-friendly manner. Check out [this section](https://huggingface.co/blog/sd3#memory-optimizations-for-sd3) for more details. - - - -## AuraFlowPipeline - -[[autodoc]] AuraFlowPipeline - - all - - __call__ \ No newline at end of file diff --git a/diffusers/docs/source/en/api/pipelines/auto_pipeline.md b/diffusers/docs/source/en/api/pipelines/auto_pipeline.md deleted file mode 100644 index f30bf7873c146aa1327ec73dac39eac1cde67c7d..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/auto_pipeline.md +++ /dev/null @@ -1,39 +0,0 @@ - - -# AutoPipeline - -The `AutoPipeline` is designed to make it easy to load a checkpoint for a task without needing to know the specific pipeline class. Based on the task, the `AutoPipeline` automatically retrieves the correct pipeline class from the checkpoint `model_index.json` file. - -> [!TIP] -> Check out the [AutoPipeline](../../tutorials/autopipeline) tutorial to learn how to use this API! - -## AutoPipelineForText2Image - -[[autodoc]] AutoPipelineForText2Image - - all - - from_pretrained - - from_pipe - -## AutoPipelineForImage2Image - -[[autodoc]] AutoPipelineForImage2Image - - all - - from_pretrained - - from_pipe - -## AutoPipelineForInpainting - -[[autodoc]] AutoPipelineForInpainting - - all - - from_pretrained - - from_pipe diff --git a/diffusers/docs/source/en/api/pipelines/blip_diffusion.md b/diffusers/docs/source/en/api/pipelines/blip_diffusion.md deleted file mode 100644 index b4504f6d6b195f4d0bd7f08e86a2b05dcd63e027..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/blip_diffusion.md +++ /dev/null @@ -1,41 +0,0 @@ - - -# BLIP-Diffusion - -BLIP-Diffusion was proposed in [BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing](https://arxiv.org/abs/2305.14720). It enables zero-shot subject-driven generation and control-guided zero-shot generation. - - -The abstract from the paper is: - -*Subject-driven text-to-image generation models create novel renditions of an input subject based on text prompts. Existing models suffer from lengthy fine-tuning and difficulties preserving the subject fidelity. 
To overcome these limitations, we introduce BLIP-Diffusion, a new subject-driven image generation model that supports multimodal control which consumes inputs of subject images and text prompts. Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation. We first pre-train the multimodal encoder following BLIP-2 to produce visual representation aligned with the text. Then we design a subject representation learning task which enables a diffusion model to leverage such visual representation and generates new subject renditions. Compared with previous methods such as DreamBooth, our model enables zero-shot subject-driven generation, and efficient fine-tuning for customized subject with up to 20x speedup. We also demonstrate that BLIP-Diffusion can be flexibly combined with existing techniques such as ControlNet and prompt-to-prompt to enable novel subject-driven generation and editing applications. Project page at [this https URL](https://dxli94.github.io/BLIP-Diffusion-website/).* -
-The original codebase can be found at [salesforce/LAVIS](https://github.com/salesforce/LAVIS/tree/main/projects/blip-diffusion). You can find the official BLIP-Diffusion checkpoints under the [hf.co/SalesForce](https://hf.co/SalesForce) organization. -
-`BlipDiffusionPipeline` and `BlipDiffusionControlNetPipeline` were contributed by [`ayushtues`](https://github.com/ayushtues/). - - -
-Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. - - - -
-## BlipDiffusionPipeline -[[autodoc]] BlipDiffusionPipeline - - all - - __call__ -
-## BlipDiffusionControlNetPipeline -[[autodoc]] BlipDiffusionControlNetPipeline - - all - - __call__ diff --git a/diffusers/docs/source/en/api/pipelines/cogvideox.md b/diffusers/docs/source/en/api/pipelines/cogvideox.md deleted file mode 100644 index f0f4fd37e6d52cb1b6cad7873ebc0e885368c955..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/cogvideox.md +++ /dev/null @@ -1,133 +0,0 @@ - -
-# CogVideoX -
-[CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer](https://arxiv.org/abs/2408.06072) from Tsinghua University & ZhipuAI, by Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Xiaotao Gu, Yuxuan Zhang, Weihan Wang, Yean Cheng, Ting Liu, Bin Xu, Yuxiao Dong, Jie Tang. -
-The abstract from the paper is: -
-*We introduce CogVideoX, a large-scale diffusion transformer model designed for generating videos based on text prompts. To efficiently model video data, we propose to leverage a 3D Variational Autoencoder (VAE) to compress videos along both spatial and temporal dimensions. To improve the text-video alignment, we propose an expert transformer with the expert adaptive LayerNorm to facilitate the deep fusion between the two modalities. By employing a progressive training technique, CogVideoX is adept at producing coherent, long-duration videos characterized by significant motion. In addition, we develop an effective text-video data processing pipeline that includes various data preprocessing strategies and a video captioning method.
It significantly helps enhance the performance of CogVideoX, improving both generation quality and semantic alignment. Results show that CogVideoX demonstrates state-of-the-art performance across both multiple machine metrics and human evaluations. The model weight of CogVideoX-2B is publicly available at https://github.com/THUDM/CogVideo.* - - -
-Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. - - -
-This pipeline was contributed by [zRzRzRzRzRzRzR](https://github.com/zRzRzRzRzRzRzR). The original codebase can be found [here](https://huggingface.co/THUDM). The original weights can be found under [hf.co/THUDM](https://huggingface.co/THUDM). -
-There are two models available that can be used with the text-to-video and video-to-video CogVideoX pipelines: -- [`THUDM/CogVideoX-2b`](https://huggingface.co/THUDM/CogVideoX-2b): The recommended dtype for running this model is `fp16`. -- [`THUDM/CogVideoX-5b`](https://huggingface.co/THUDM/CogVideoX-5b): The recommended dtype for running this model is `bf16`. -
-There is one model available that can be used with the image-to-video CogVideoX pipeline: -- [`THUDM/CogVideoX-5b-I2V`](https://huggingface.co/THUDM/CogVideoX-5b-I2V): The recommended dtype for running this model is `bf16`. -
-There are two models that support pose controllable generation (by the [Alibaba-PAI](https://huggingface.co/alibaba-pai) team): -- [`alibaba-pai/CogVideoX-Fun-V1.1-2b-Pose`](https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.1-2b-Pose): The recommended dtype for running this model is `bf16`. -- [`alibaba-pai/CogVideoX-Fun-V1.1-5b-Pose`](https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.1-5b-Pose): The recommended dtype for running this model is `bf16`. -
-## Inference -
-Use [`torch.compile`](https://huggingface.co/docs/diffusers/main/en/tutorials/fast_diffusion#torchcompile) to reduce the inference latency. -
-First, load the pipeline: -
-```python -import torch -from diffusers import CogVideoXPipeline, CogVideoXImageToVideoPipeline -from diffusers.utils import export_to_video, load_image -pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b").to("cuda") # or "THUDM/CogVideoX-2b" -``` -
-If you are using the image-to-video pipeline, load it as follows: -
-```python -pipe = CogVideoXImageToVideoPipeline.from_pretrained("THUDM/CogVideoX-5b-I2V").to("cuda") -``` -
-Then change the memory layout of the pipeline's `transformer` component to `torch.channels_last`: -
-```python -pipe.transformer.to(memory_format=torch.channels_last) -``` -
-Compile the components and run inference: -
-```python -pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True) -
-# CogVideoX works well with long and well-described prompts -prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays.
The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance." -video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0] -``` - -The [T2V benchmark](https://gist.github.com/a-r-r-o-w/5183d75e452a368fd17448fcc810bd3f) results on an 80GB A100 machine are: - -``` -Without torch.compile(): Average inference time: 96.89 seconds. -With torch.compile(): Average inference time: 76.27 seconds. -``` - -### Memory optimization - -CogVideoX-2b requires about 19 GB of GPU memory to decode 49 frames (6 seconds of video at 8 FPS) with output resolution 720x480 (W x H), which makes it not possible to run on consumer GPUs or free-tier T4 Colab. The following memory optimizations could be used to reduce the memory footprint. For replication, you can refer to [this](https://gist.github.com/a-r-r-o-w/3959a03f15be5c9bd1fe545b09dfcc93) script. - -- `pipe.enable_model_cpu_offload()`: - - Without enabling cpu offloading, memory usage is `33 GB` - - With enabling cpu offloading, memory usage is `19 GB` -- `pipe.enable_sequential_cpu_offload()`: - - Similar to `enable_model_cpu_offload` but can significantly reduce memory usage at the cost of slow inference - - When enabled, memory usage is under `4 GB` -- `pipe.vae.enable_tiling()`: - - With enabling cpu offloading and tiling, memory usage is `11 GB` -- `pipe.vae.enable_slicing()` - -### Quantized inference - -[torchao](https://github.com/pytorch/ao) and [optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be used to quantize the text encoder, transformer and VAE modules to lower the memory requirements. This makes it possible to run the model on a free-tier T4 Colab or lower VRAM GPUs! - -It is also worth noting that torchao quantization is fully compatible with [torch.compile](/optimization/torch2.0#torchcompile), which allows for much faster inference speed. Additionally, models can be serialized and stored in a quantized datatype to save disk space with torchao. Find examples and benchmarks in the gists below. -- [torchao](https://gist.github.com/a-r-r-o-w/4d9732d17412888c885480c6521a9897) -- [quanto](https://gist.github.com/a-r-r-o-w/31be62828b00a9292821b85c1017effa) - -## CogVideoXPipeline - -[[autodoc]] CogVideoXPipeline - - all - - __call__ - -## CogVideoXImageToVideoPipeline - -[[autodoc]] CogVideoXImageToVideoPipeline - - all - - __call__ - -## CogVideoXVideoToVideoPipeline - -[[autodoc]] CogVideoXVideoToVideoPipeline - - all - - __call__ - -## CogVideoXFunControlPipeline - -[[autodoc]] CogVideoXFunControlPipeline - - all - - __call__ - -## CogVideoXPipelineOutput - -[[autodoc]] pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput diff --git a/diffusers/docs/source/en/api/pipelines/cogview3.md b/diffusers/docs/source/en/api/pipelines/cogview3.md deleted file mode 100644 index 85a9cf91736f68065098fdc6ece9305e3edb14b8..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/cogview3.md +++ /dev/null @@ -1,40 +0,0 @@ - - -# CogView3Plus - -[CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion](https://huggingface.co/papers/2403.05121) from Tsinghua University & ZhipuAI, by Wendi Zheng, Jiayan Teng, Zhuoyi Yang, Weihan Wang, Jidong Chen, Xiaotao Gu, Yuxiao Dong, Ming Ding, Jie Tang. - -The abstract from the paper is: - -*Recent advancements in text-to-image generative systems have been largely driven by diffusion models. 
However, single-stage text-to-image diffusion models still face challenges, in terms of computational efficiency and the refinement of image details. To tackle the issue, we propose CogView3, an innovative cascaded framework that enhances the performance of text-to-image diffusion. CogView3 is the first model implementing relay diffusion in the realm of text-to-image generation, executing the task by first creating low-resolution images and subsequently applying relay-based super-resolution. This methodology not only results in competitive text-to-image outputs but also greatly reduces both training and inference costs. Our experimental results demonstrate that CogView3 outperforms SDXL, the current state-of-the-art open-source text-to-image diffusion model, by 77.0% in human evaluations, all while requiring only about 1/2 of the inference time. The distilled variant of CogView3 achieves comparable performance while only utilizing 1/10 of the inference time by SDXL.* - - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. - - - -This pipeline was contributed by [zRzRzRzRzRzRzR](https://github.com/zRzRzRzRzRzRzR). The original codebase can be found [here](https://huggingface.co/THUDM). The original weights can be found under [hf.co/THUDM](https://huggingface.co/THUDM). - -## CogView3PlusPipeline - -[[autodoc]] CogView3PlusPipeline - - all - - __call__ - -## CogView3PipelineOutput - -[[autodoc]] pipelines.cogview3.pipeline_output.CogView3PipelineOutput diff --git a/diffusers/docs/source/en/api/pipelines/consistency_models.md b/diffusers/docs/source/en/api/pipelines/consistency_models.md deleted file mode 100644 index 680abaad420be71886e6784643db2627a6bfacbf..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/consistency_models.md +++ /dev/null @@ -1,56 +0,0 @@ - - -# Consistency Models - -Consistency Models were proposed in [Consistency Models](https://huggingface.co/papers/2303.01469) by Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. - -The abstract from the paper is: - -*Diffusion models have significantly advanced the fields of image, audio, and video generation, but they depend on an iterative sampling process that causes slow generation. To overcome this limitation, we propose consistency models, a new family of models that generate high quality samples by directly mapping noise to data. They support fast one-step generation by design, while still allowing multistep sampling to trade compute for sample quality. They also support zero-shot data editing, such as image inpainting, colorization, and super-resolution, without requiring explicit training on these tasks. Consistency models can be trained either by distilling pre-trained diffusion models, or as standalone generative models altogether. Through extensive experiments, we demonstrate that they outperform existing distillation techniques for diffusion models in one- and few-step sampling, achieving the new state-of-the-art FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 for one-step generation. 
When trained in isolation, consistency models become a new family of generative models that can outperform existing one-step, non-adversarial generative models on standard benchmarks such as CIFAR-10, ImageNet 64x64 and LSUN 256x256.* - -The original codebase can be found at [openai/consistency_models](https://github.com/openai/consistency_models), and additional checkpoints are available at [openai](https://huggingface.co/openai). - -The pipeline was contributed by [dg845](https://github.com/dg845) and [ayushtues](https://huggingface.co/ayushtues). ❤️ - -## Tips - -For an additional speed-up, use `torch.compile` to generate multiple images in <1 second: - -```diff - import torch - from diffusers import ConsistencyModelPipeline - - device = "cuda" - # Load the cd_bedroom256_lpips checkpoint. - model_id_or_path = "openai/diffusers-cd_bedroom256_lpips" - pipe = ConsistencyModelPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16) - pipe.to(device) - -+ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) - - # Multistep sampling - # Timesteps can be explicitly specified; the particular timesteps below are from the original GitHub repo: - # https://github.com/openai/consistency_models/blob/main/scripts/launch.sh#L83 - for _ in range(10): - image = pipe(timesteps=[17, 0]).images[0] - image.show() -``` - - -## ConsistencyModelPipeline -[[autodoc]] ConsistencyModelPipeline - - all - - __call__ - -## ImagePipelineOutput -[[autodoc]] pipelines.ImagePipelineOutput diff --git a/diffusers/docs/source/en/api/pipelines/controlnet.md b/diffusers/docs/source/en/api/pipelines/controlnet.md deleted file mode 100644 index 6b00902cf296a174be68c7f9614434e51eb29cd0..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/controlnet.md +++ /dev/null @@ -1,78 +0,0 @@ - - -# ControlNet - -ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. - -With a ControlNet model, you can provide an additional control image to condition and control Stable Diffusion generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process. - -The abstract from the paper is: - -*We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.* - -This model was contributed by [takuma104](https://huggingface.co/takuma104). 
❤️ - -The original codebase can be found at [lllyasviel/ControlNet](https://github.com/lllyasviel/ControlNet), and you can find official ControlNet checkpoints on [lllyasviel's](https://huggingface.co/lllyasviel) Hub profile. - - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. - - - -## StableDiffusionControlNetPipeline -[[autodoc]] StableDiffusionControlNetPipeline - - all - - __call__ - - enable_attention_slicing - - disable_attention_slicing - - enable_vae_slicing - - disable_vae_slicing - - enable_xformers_memory_efficient_attention - - disable_xformers_memory_efficient_attention - - load_textual_inversion - -## StableDiffusionControlNetImg2ImgPipeline -[[autodoc]] StableDiffusionControlNetImg2ImgPipeline - - all - - __call__ - - enable_attention_slicing - - disable_attention_slicing - - enable_vae_slicing - - disable_vae_slicing - - enable_xformers_memory_efficient_attention - - disable_xformers_memory_efficient_attention - - load_textual_inversion - -## StableDiffusionControlNetInpaintPipeline -[[autodoc]] StableDiffusionControlNetInpaintPipeline - - all - - __call__ - - enable_attention_slicing - - disable_attention_slicing - - enable_vae_slicing - - disable_vae_slicing - - enable_xformers_memory_efficient_attention - - disable_xformers_memory_efficient_attention - - load_textual_inversion - -## StableDiffusionPipelineOutput -[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput - -## FlaxStableDiffusionControlNetPipeline -[[autodoc]] FlaxStableDiffusionControlNetPipeline - - all - - __call__ - -## FlaxStableDiffusionControlNetPipelineOutput -[[autodoc]] pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput diff --git a/diffusers/docs/source/en/api/pipelines/controlnet_flux.md b/diffusers/docs/source/en/api/pipelines/controlnet_flux.md deleted file mode 100644 index 82454ae5e930efa67b43b7e82549d6478ece005a..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/controlnet_flux.md +++ /dev/null @@ -1,56 +0,0 @@ - - -# ControlNet with Flux.1 - -FluxControlNetPipeline is an implementation of ControlNet for Flux.1. - -ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. - -With a ControlNet model, you can provide an additional control image to condition and control Stable Diffusion generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process. - -The abstract from the paper is: - -*We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. 
The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.* - -This controlnet code is implemented by [The InstantX Team](https://huggingface.co/InstantX). You can find pre-trained checkpoints for Flux-ControlNet in the table below: - - -| ControlNet type | Developer | Link | -| -------- | ---------- | ---- | -| Canny | [The InstantX Team](https://huggingface.co/InstantX) | [Link](https://huggingface.co/InstantX/FLUX.1-dev-Controlnet-Canny) | -| Depth | [The InstantX Team](https://huggingface.co/InstantX) | [Link](https://huggingface.co/Shakker-Labs/FLUX.1-dev-ControlNet-Depth) | -| Union | [The InstantX Team](https://huggingface.co/InstantX) | [Link](https://huggingface.co/InstantX/FLUX.1-dev-Controlnet-Union) | - -XLabs ControlNets are also supported, which was contributed by the [XLabs team](https://huggingface.co/XLabs-AI). - -| ControlNet type | Developer | Link | -| -------- | ---------- | ---- | -| Canny | [The XLabs Team](https://huggingface.co/XLabs-AI) | [Link](https://huggingface.co/XLabs-AI/flux-controlnet-canny-diffusers) | -| Depth | [The XLabs Team](https://huggingface.co/XLabs-AI) | [Link](https://huggingface.co/XLabs-AI/flux-controlnet-depth-diffusers) | -| HED | [The XLabs Team](https://huggingface.co/XLabs-AI) | [Link](https://huggingface.co/XLabs-AI/flux-controlnet-hed-diffusers) | - - - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. - - - -## FluxControlNetPipeline -[[autodoc]] FluxControlNetPipeline - - all - - __call__ - - -## FluxPipelineOutput -[[autodoc]] pipelines.flux.pipeline_output.FluxPipelineOutput \ No newline at end of file diff --git a/diffusers/docs/source/en/api/pipelines/controlnet_hunyuandit.md b/diffusers/docs/source/en/api/pipelines/controlnet_hunyuandit.md deleted file mode 100644 index e702eb30b8b0d1b3672dcec135ec1abf035b024d..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/controlnet_hunyuandit.md +++ /dev/null @@ -1,36 +0,0 @@ - - -# ControlNet with Hunyuan-DiT - -HunyuanDiTControlNetPipeline is an implementation of ControlNet for [Hunyuan-DiT](https://arxiv.org/abs/2405.08748). - -ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. - -With a ControlNet model, you can provide an additional control image to condition and control Hunyuan-DiT generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process. 
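In practice, the workflow mirrors the other ControlNet pipelines: load a Hunyuan-DiT ControlNet, plug it into the pipeline, and pass the conditioning image alongside the prompt. The sketch below assumes a Canny-edge ControlNet; the checkpoint names and the control-image URL are illustrative placeholders, so substitute the ones you actually want to use:

```python
import torch
from diffusers import HunyuanDiT2DControlNetModel, HunyuanDiTControlNetPipeline
from diffusers.utils import load_image

# Checkpoint names are placeholders; pick a ControlNet matching your conditioning type.
controlnet = HunyuanDiT2DControlNetModel.from_pretrained(
    "Tencent-Hunyuan/HunyuanDiT-v1.1-ControlNet-Diffusers-Canny", torch_dtype=torch.float16
)
pipe = HunyuanDiTControlNetPipeline.from_pretrained(
    "Tencent-Hunyuan/HunyuanDiT-v1.1-Diffusers", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# A precomputed Canny edge map serves as the control image (URL is a placeholder).
control_image = load_image("https://example.com/canny_edge_map.png")

image = pipe(
    "a futuristic city at night, neon lights, highly detailed",
    control_image=control_image,   # spatial conditioning extracted from the reference image
    num_inference_steps=50,
).images[0]
image.save("hunyuandit_controlnet.png")
```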
- -The abstract from the paper is: - -*We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.* - -This code is implemented by Tencent Hunyuan Team. You can find pre-trained checkpoints for Hunyuan-DiT ControlNets on [Tencent Hunyuan](https://huggingface.co/Tencent-Hunyuan). - - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. - - - -## HunyuanDiTControlNetPipeline -[[autodoc]] HunyuanDiTControlNetPipeline - - all - - __call__ diff --git a/diffusers/docs/source/en/api/pipelines/controlnet_sd3.md b/diffusers/docs/source/en/api/pipelines/controlnet_sd3.md deleted file mode 100644 index bb91a43cbaef145319b68f91285a85169cc42d3c..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/controlnet_sd3.md +++ /dev/null @@ -1,53 +0,0 @@ - - -# ControlNet with Stable Diffusion 3 - -StableDiffusion3ControlNetPipeline is an implementation of ControlNet for Stable Diffusion 3. - -ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. - -With a ControlNet model, you can provide an additional control image to condition and control Stable Diffusion generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process. - -The abstract from the paper is: - -*We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. 
Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.* - -This controlnet code is mainly implemented by [The InstantX Team](https://huggingface.co/InstantX). The inpainting-related code was developed by [The Alimama Creative Team](https://huggingface.co/alimama-creative). You can find pre-trained checkpoints for SD3-ControlNet in the table below: - - -| ControlNet type | Developer | Link | -| -------- | ---------- | ---- | -| Canny | [The InstantX Team](https://huggingface.co/InstantX) | [Link](https://huggingface.co/InstantX/SD3-Controlnet-Canny) | -| Pose | [The InstantX Team](https://huggingface.co/InstantX) | [Link](https://huggingface.co/InstantX/SD3-Controlnet-Pose) | -| Tile | [The InstantX Team](https://huggingface.co/InstantX) | [Link](https://huggingface.co/InstantX/SD3-Controlnet-Tile) | -| Inpainting | [The AlimamaCreative Team](https://huggingface.co/alimama-creative) | [link](https://huggingface.co/alimama-creative/SD3-Controlnet-Inpainting) | - - - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. - - - -## StableDiffusion3ControlNetPipeline -[[autodoc]] StableDiffusion3ControlNetPipeline - - all - - __call__ - -## StableDiffusion3ControlNetInpaintingPipeline -[[autodoc]] pipelines.controlnet_sd3.pipeline_stable_diffusion_3_controlnet_inpainting.StableDiffusion3ControlNetInpaintingPipeline - - all - - __call__ - -## StableDiffusion3PipelineOutput -[[autodoc]] pipelines.stable_diffusion_3.pipeline_output.StableDiffusion3PipelineOutput diff --git a/diffusers/docs/source/en/api/pipelines/controlnet_sdxl.md b/diffusers/docs/source/en/api/pipelines/controlnet_sdxl.md deleted file mode 100644 index 2de7cbff6ebc13dd6000a8f75ca1f62e6b68c6fe..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/controlnet_sdxl.md +++ /dev/null @@ -1,55 +0,0 @@ - - -# ControlNet with Stable Diffusion XL - -ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. - -With a ControlNet model, you can provide an additional control image to condition and control Stable Diffusion generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process. - -The abstract from the paper is: - -*We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. 
We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.* - -You can find additional smaller Stable Diffusion XL (SDXL) ControlNet checkpoints from the 🤗 [Diffusers](https://huggingface.co/diffusers) Hub organization, and browse [community-trained](https://huggingface.co/models?other=stable-diffusion-xl&other=controlnet) checkpoints on the Hub. - - - -🧪 Many of the SDXL ControlNet checkpoints are experimental, and there is a lot of room for improvement. Feel free to open an [Issue](https://github.com/huggingface/diffusers/issues/new/choose) and leave us feedback on how we can improve! - - - -If you don't see a checkpoint you're interested in, you can train your own SDXL ControlNet with our [training script](../../../../../examples/controlnet/README_sdxl). - - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. - - - -## StableDiffusionXLControlNetPipeline -[[autodoc]] StableDiffusionXLControlNetPipeline - - all - - __call__ - -## StableDiffusionXLControlNetImg2ImgPipeline -[[autodoc]] StableDiffusionXLControlNetImg2ImgPipeline - - all - - __call__ - -## StableDiffusionXLControlNetInpaintPipeline -[[autodoc]] StableDiffusionXLControlNetInpaintPipeline - - all - - __call__ - -## StableDiffusionPipelineOutput -[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput diff --git a/diffusers/docs/source/en/api/pipelines/controlnetxs.md b/diffusers/docs/source/en/api/pipelines/controlnetxs.md deleted file mode 100644 index 2d4ae7b8ce46c9bb59ecf5703dc40a4b21207066..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/controlnetxs.md +++ /dev/null @@ -1,39 +0,0 @@ - - -# ControlNet-XS - -ControlNet-XS was introduced in [ControlNet-XS](https://vislearn.github.io/ControlNet-XS/) by Denis Zavadski and Carsten Rother. It is based on the observation that the control model in the [original ControlNet](https://huggingface.co/papers/2302.05543) can be made much smaller and still produce good results. - -Like the original ControlNet model, you can provide an additional control image to condition and control Stable Diffusion generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process. - -ControlNet-XS generates images with comparable quality to a regular ControlNet, but it is 20-25% faster ([see benchmark](https://github.com/UmerHA/controlnet-xs-benchmark/blob/main/Speed%20Benchmark.ipynb) with StableDiffusion-XL) and uses ~45% less memory. - -Here's the overview from the [project page](https://vislearn.github.io/ControlNet-XS/): - -*With increasing computing capabilities, current model architectures appear to follow the trend of simply upscaling all components without validating the necessity for doing so. 
In this project we investigate the size and architectural design of ControlNet [Zhang et al., 2023] for controlling the image generation process with stable diffusion-based models. We show that a new architecture with as little as 1% of the parameters of the base model achieves state-of-the art results, considerably better than ControlNet in terms of FID score. Hence we call it ControlNet-XS. We provide the code for controlling StableDiffusion-XL [Podell et al., 2023] (Model B, 48M Parameters) and StableDiffusion 2.1 [Rombach et al. 2022] (Model B, 14M Parameters), all under openrail license.* - -This model was contributed by [UmerHA](https://twitter.com/UmerHAdil). ❤️ - - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. - - - -## StableDiffusionControlNetXSPipeline -[[autodoc]] StableDiffusionControlNetXSPipeline - - all - - __call__ - -## StableDiffusionPipelineOutput -[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput diff --git a/diffusers/docs/source/en/api/pipelines/controlnetxs_sdxl.md b/diffusers/docs/source/en/api/pipelines/controlnetxs_sdxl.md deleted file mode 100644 index 31075c0ef96aa8e521ebc6c7b09b8848b72e18bd..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/controlnetxs_sdxl.md +++ /dev/null @@ -1,45 +0,0 @@ - - -# ControlNet-XS with Stable Diffusion XL - -ControlNet-XS was introduced in [ControlNet-XS](https://vislearn.github.io/ControlNet-XS/) by Denis Zavadski and Carsten Rother. It is based on the observation that the control model in the [original ControlNet](https://huggingface.co/papers/2302.05543) can be made much smaller and still produce good results. - -Like the original ControlNet model, you can provide an additional control image to condition and control Stable Diffusion generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process. - -ControlNet-XS generates images with comparable quality to a regular ControlNet, but it is 20-25% faster ([see benchmark](https://github.com/UmerHA/controlnet-xs-benchmark/blob/main/Speed%20Benchmark.ipynb)) and uses ~45% less memory. - -Here's the overview from the [project page](https://vislearn.github.io/ControlNet-XS/): - -*With increasing computing capabilities, current model architectures appear to follow the trend of simply upscaling all components without validating the necessity for doing so. In this project we investigate the size and architectural design of ControlNet [Zhang et al., 2023] for controlling the image generation process with stable diffusion-based models. We show that a new architecture with as little as 1% of the parameters of the base model achieves state-of-the art results, considerably better than ControlNet in terms of FID score. Hence we call it ControlNet-XS. We provide the code for controlling StableDiffusion-XL [Podell et al., 2023] (Model B, 48M Parameters) and StableDiffusion 2.1 [Rombach et al. 2022] (Model B, 14M Parameters), all under openrail license.* - -This model was contributed by [UmerHA](https://twitter.com/UmerHAdil). 
❤️ - - - -🧪 Many of the SDXL ControlNet checkpoints are experimental, and there is a lot of room for improvement. Feel free to open an [Issue](https://github.com/huggingface/diffusers/issues/new/choose) and leave us feedback on how we can improve! - - - - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. - - - -## StableDiffusionXLControlNetXSPipeline -[[autodoc]] StableDiffusionXLControlNetXSPipeline - - all - - __call__ - -## StableDiffusionPipelineOutput -[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput diff --git a/diffusers/docs/source/en/api/pipelines/dance_diffusion.md b/diffusers/docs/source/en/api/pipelines/dance_diffusion.md deleted file mode 100644 index efba3c3763a43862504a195d0b3ab9ab653f1e67..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/dance_diffusion.md +++ /dev/null @@ -1,32 +0,0 @@ - - -# Dance Diffusion - -[Dance Diffusion](https://github.com/Harmonai-org/sample-generator) is by Zach Evans. - -Dance Diffusion is the first in a suite of generative audio tools for producers and musicians released by [Harmonai](https://github.com/Harmonai-org). - - - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. - - - -## DanceDiffusionPipeline -[[autodoc]] DanceDiffusionPipeline - - all - - __call__ - -## AudioPipelineOutput -[[autodoc]] pipelines.AudioPipelineOutput diff --git a/diffusers/docs/source/en/api/pipelines/ddim.md b/diffusers/docs/source/en/api/pipelines/ddim.md deleted file mode 100644 index 6802da739cd5db62ca4cb2ca328937809ff7cba1..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/ddim.md +++ /dev/null @@ -1,29 +0,0 @@ - - -# DDIM - -[Denoising Diffusion Implicit Models](https://huggingface.co/papers/2010.02502) (DDIM) by Jiaming Song, Chenlin Meng and Stefano Ermon. - -The abstract from the paper is: - -*Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, yet they require simulating a Markov chain for many steps to produce a sample. To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a Markovian diffusion process. We construct a class of non-Markovian diffusion processes that lead to the same training objective, but whose reverse process can be much faster to sample from. We empirically demonstrate that DDIMs can produce high quality samples 10× to 50× faster in terms of wall-clock time compared to DDPMs, allow us to trade off computation for sample quality, and can perform semantically meaningful image interpolation directly in the latent space.* - -The original codebase can be found at [ermongroup/ddim](https://github.com/ermongroup/ddim). 
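For unconditional sampling, the pipeline can be pointed at an existing DDPM checkpoint and paired with a [`DDIMScheduler`] to get the faster, deterministic DDIM sampler. The snippet below is a minimal sketch; the `google/ddpm-cifar10-32` checkpoint is just one convenient example:

```python
import torch
from diffusers import DDIMPipeline, DDIMScheduler

# Load an unconditional DDPM checkpoint and swap in a DDIM scheduler built from the same config.
pipe = DDIMPipeline.from_pretrained("google/ddpm-cifar10-32")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
pipe.to("cuda")

generator = torch.Generator("cuda").manual_seed(0)

# eta=0.0 gives the deterministic DDIM sampler; 50 steps instead of DDPM's usual 1000.
image = pipe(num_inference_steps=50, eta=0.0, generator=generator).images[0]
image.save("ddim_sample.png")
```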
- -## DDIMPipeline -[[autodoc]] DDIMPipeline - - all - - __call__ -
-## ImagePipelineOutput -[[autodoc]] pipelines.ImagePipelineOutput diff --git a/diffusers/docs/source/en/api/pipelines/ddpm.md b/diffusers/docs/source/en/api/pipelines/ddpm.md deleted file mode 100644 index 81ddb5e0c0518d30dfbdf47a08e478b02344ca50..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/ddpm.md +++ /dev/null @@ -1,35 +0,0 @@ - -
-# DDPM -
-[Denoising Diffusion Probabilistic Models](https://huggingface.co/papers/2006.11239) (DDPM) by Jonathan Ho, Ajay Jain and Pieter Abbeel proposes a diffusion-based model of the same name. In the 🤗 Diffusers library, DDPM refers to the *discrete denoising scheduler* from the paper as well as the pipeline. -
-The abstract from the paper is: -
-*We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and our models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding. On the unconditional CIFAR10 dataset, we obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. On 256x256 LSUN, we obtain sample quality similar to ProgressiveGAN.* -
-The original codebase can be found at [hojonathanho/diffusion](https://github.com/hojonathanho/diffusion). - - -
-Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. - - -
-## DDPMPipeline -[[autodoc]] DDPMPipeline - - all - - __call__ -
-## ImagePipelineOutput -[[autodoc]] pipelines.ImagePipelineOutput diff --git a/diffusers/docs/source/en/api/pipelines/deepfloyd_if.md b/diffusers/docs/source/en/api/pipelines/deepfloyd_if.md deleted file mode 100644 index 00441980d80248c695ce81e2e1fb138303381aeb..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/deepfloyd_if.md +++ /dev/null @@ -1,506 +0,0 @@ - -
-# DeepFloyd IF -
-## Overview -
-DeepFloyd IF is a novel state-of-the-art open-source text-to-image model with a high degree of photorealism and language understanding. -The model is modular, composed of a frozen text encoder and three cascaded pixel diffusion modules: -- Stage 1: a base model that generates a 64x64 px image based on a text prompt, -- Stage 2: a 64x64 px => 256x256 px super-resolution model, and -- Stage 3: a 256x256 px => 1024x1024 px super-resolution model. -Stage 1 and Stage 2 utilize a frozen text encoder based on the T5 transformer to extract text embeddings, which are then fed into a UNet architecture enhanced with cross-attention and attention pooling. -Stage 3 is [Stability AI's x4 Upscaling model](https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler). -The result is a highly efficient model that outperforms current state-of-the-art models, achieving a zero-shot FID score of 6.66 on the COCO dataset.
-Our work underscores the potential of larger UNet architectures in the first stage of cascaded diffusion models and depicts a promising future for text-to-image synthesis. - -## Usage - -Before you can use IF, you need to accept its usage conditions. To do so: -1. Make sure to have a [Hugging Face account](https://huggingface.co/join) and be logged in. -2. Accept the license on the model card of [DeepFloyd/IF-I-XL-v1.0](https://huggingface.co/DeepFloyd/IF-I-XL-v1.0). Accepting the license on the stage I model card will auto accept for the other IF models. -3. Make sure to login locally. Install `huggingface_hub`: -```sh -pip install huggingface_hub --upgrade -``` - -run the login function in a Python shell: - -```py -from huggingface_hub import login - -login() -``` - -and enter your [Hugging Face Hub access token](https://huggingface.co/docs/hub/security-tokens#what-are-user-access-tokens). - -Next we install `diffusers` and dependencies: - -```sh -pip install -q diffusers accelerate transformers -``` - -The following sections give more in-detail examples of how to use IF. Specifically: - -- [Text-to-Image Generation](#text-to-image-generation) -- [Image-to-Image Generation](#text-guided-image-to-image-generation) -- [Inpainting](#text-guided-inpainting-generation) -- [Reusing model weights](#converting-between-different-pipelines) -- [Speed optimization](#optimizing-for-speed) -- [Memory optimization](#optimizing-for-memory) - -**Available checkpoints** -- *Stage-1* - - [DeepFloyd/IF-I-XL-v1.0](https://huggingface.co/DeepFloyd/IF-I-XL-v1.0) - - [DeepFloyd/IF-I-L-v1.0](https://huggingface.co/DeepFloyd/IF-I-L-v1.0) - - [DeepFloyd/IF-I-M-v1.0](https://huggingface.co/DeepFloyd/IF-I-M-v1.0) - -- *Stage-2* - - [DeepFloyd/IF-II-L-v1.0](https://huggingface.co/DeepFloyd/IF-II-L-v1.0) - - [DeepFloyd/IF-II-M-v1.0](https://huggingface.co/DeepFloyd/IF-II-M-v1.0) - -- *Stage-3* - - [stabilityai/stable-diffusion-x4-upscaler](https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler) - - -**Google Colab** -[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/deepfloyd_if_free_tier_google_colab.ipynb) - -### Text-to-Image Generation - -By default diffusers makes use of [model cpu offloading](../../optimization/memory#model-offloading) to run the whole IF pipeline with as little as 14 GB of VRAM. 
- -```python -from diffusers import DiffusionPipeline -from diffusers.utils import pt_to_pil, make_image_grid -import torch - -# stage 1 -stage_1 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16) -stage_1.enable_model_cpu_offload() - -# stage 2 -stage_2 = DiffusionPipeline.from_pretrained( - "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16 -) -stage_2.enable_model_cpu_offload() - -# stage 3 -safety_modules = { - "feature_extractor": stage_1.feature_extractor, - "safety_checker": stage_1.safety_checker, - "watermarker": stage_1.watermarker, -} -stage_3 = DiffusionPipeline.from_pretrained( - "stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16 -) -stage_3.enable_model_cpu_offload() - -prompt = 'a photo of a kangaroo wearing an orange hoodie and blue sunglasses standing in front of the eiffel tower holding a sign that says "very deep learning"' -generator = torch.manual_seed(1) - -# text embeds -prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt) - -# stage 1 -stage_1_output = stage_1( - prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, generator=generator, output_type="pt" -).images -#pt_to_pil(stage_1_output)[0].save("./if_stage_I.png") - -# stage 2 -stage_2_output = stage_2( - image=stage_1_output, - prompt_embeds=prompt_embeds, - negative_prompt_embeds=negative_embeds, - generator=generator, - output_type="pt", -).images -#pt_to_pil(stage_2_output)[0].save("./if_stage_II.png") - -# stage 3 -stage_3_output = stage_3(prompt=prompt, image=stage_2_output, noise_level=100, generator=generator).images -#stage_3_output[0].save("./if_stage_III.png") -make_image_grid([pt_to_pil(stage_1_output)[0], pt_to_pil(stage_2_output)[0], stage_3_output[0]], rows=1, rows=3) -``` - -### Text Guided Image-to-Image Generation - -The same IF model weights can be used for text-guided image-to-image translation or image variation. -In this case just make sure to load the weights using the [`IFImg2ImgPipeline`] and [`IFImg2ImgSuperResolutionPipeline`] pipelines. - -**Note**: You can also directly move the weights of the text-to-image pipelines to the image-to-image pipelines -without loading them twice by making use of the [`~DiffusionPipeline.components`] argument as explained [here](#converting-between-different-pipelines). 
- -```python -from diffusers import IFImg2ImgPipeline, IFImg2ImgSuperResolutionPipeline, DiffusionPipeline -from diffusers.utils import pt_to_pil, load_image, make_image_grid -import torch - -# download image -url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg" -original_image = load_image(url) -original_image = original_image.resize((768, 512)) - -# stage 1 -stage_1 = IFImg2ImgPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16) -stage_1.enable_model_cpu_offload() - -# stage 2 -stage_2 = IFImg2ImgSuperResolutionPipeline.from_pretrained( - "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16 -) -stage_2.enable_model_cpu_offload() - -# stage 3 -safety_modules = { - "feature_extractor": stage_1.feature_extractor, - "safety_checker": stage_1.safety_checker, - "watermarker": stage_1.watermarker, -} -stage_3 = DiffusionPipeline.from_pretrained( - "stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16 -) -stage_3.enable_model_cpu_offload() - -prompt = "A fantasy landscape in style minecraft" -generator = torch.manual_seed(1) - -# text embeds -prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt) - -# stage 1 -stage_1_output = stage_1( - image=original_image, - prompt_embeds=prompt_embeds, - negative_prompt_embeds=negative_embeds, - generator=generator, - output_type="pt", -).images -#pt_to_pil(stage_1_output)[0].save("./if_stage_I.png") - -# stage 2 -stage_2_output = stage_2( - image=stage_1_output, - original_image=original_image, - prompt_embeds=prompt_embeds, - negative_prompt_embeds=negative_embeds, - generator=generator, - output_type="pt", -).images -#pt_to_pil(stage_2_output)[0].save("./if_stage_II.png") - -# stage 3 -stage_3_output = stage_3(prompt=prompt, image=stage_2_output, generator=generator, noise_level=100).images -#stage_3_output[0].save("./if_stage_III.png") -make_image_grid([original_image, pt_to_pil(stage_1_output)[0], pt_to_pil(stage_2_output)[0], stage_3_output[0]], rows=1, rows=4) -``` - -### Text Guided Inpainting Generation - -The same IF model weights can be used for text-guided image-to-image translation or image variation. -In this case just make sure to load the weights using the [`IFInpaintingPipeline`] and [`IFInpaintingSuperResolutionPipeline`] pipelines. - -**Note**: You can also directly move the weights of the text-to-image pipelines to the image-to-image pipelines -without loading them twice by making use of the [`~DiffusionPipeline.components()`] function as explained [here](#converting-between-different-pipelines). 
- -```python -from diffusers import IFInpaintingPipeline, IFInpaintingSuperResolutionPipeline, DiffusionPipeline -from diffusers.utils import pt_to_pil, load_image, make_image_grid -import torch - -# download image -url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/if/person.png" -original_image = load_image(url) - -# download mask -url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/if/glasses_mask.png" -mask_image = load_image(url) - -# stage 1 -stage_1 = IFInpaintingPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16) -stage_1.enable_model_cpu_offload() - -# stage 2 -stage_2 = IFInpaintingSuperResolutionPipeline.from_pretrained( - "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16 -) -stage_2.enable_model_cpu_offload() - -# stage 3 -safety_modules = { - "feature_extractor": stage_1.feature_extractor, - "safety_checker": stage_1.safety_checker, - "watermarker": stage_1.watermarker, -} -stage_3 = DiffusionPipeline.from_pretrained( - "stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16 -) -stage_3.enable_model_cpu_offload() - -prompt = "blue sunglasses" -generator = torch.manual_seed(1) - -# text embeds -prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt) - -# stage 1 -stage_1_output = stage_1( - image=original_image, - mask_image=mask_image, - prompt_embeds=prompt_embeds, - negative_prompt_embeds=negative_embeds, - generator=generator, - output_type="pt", -).images -#pt_to_pil(stage_1_output)[0].save("./if_stage_I.png") - -# stage 2 -stage_2_output = stage_2( - image=stage_1_output, - original_image=original_image, - mask_image=mask_image, - prompt_embeds=prompt_embeds, - negative_prompt_embeds=negative_embeds, - generator=generator, - output_type="pt", -).images -#pt_to_pil(stage_1_output)[0].save("./if_stage_II.png") - -# stage 3 -stage_3_output = stage_3(prompt=prompt, image=stage_2_output, generator=generator, noise_level=100).images -#stage_3_output[0].save("./if_stage_III.png") -make_image_grid([original_image, mask_image, pt_to_pil(stage_1_output)[0], pt_to_pil(stage_2_output)[0], stage_3_output[0]], rows=1, rows=5) -``` - -### Converting between different pipelines - -In addition to being loaded with `from_pretrained`, Pipelines can also be loaded directly from each other. - -```python -from diffusers import IFPipeline, IFSuperResolutionPipeline - -pipe_1 = IFPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0") -pipe_2 = IFSuperResolutionPipeline.from_pretrained("DeepFloyd/IF-II-L-v1.0") - - -from diffusers import IFImg2ImgPipeline, IFImg2ImgSuperResolutionPipeline - -pipe_1 = IFImg2ImgPipeline(**pipe_1.components) -pipe_2 = IFImg2ImgSuperResolutionPipeline(**pipe_2.components) - - -from diffusers import IFInpaintingPipeline, IFInpaintingSuperResolutionPipeline - -pipe_1 = IFInpaintingPipeline(**pipe_1.components) -pipe_2 = IFInpaintingSuperResolutionPipeline(**pipe_2.components) -``` - -### Optimizing for speed - -The simplest optimization to run IF faster is to move all model components to the GPU. - -```py -pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16) -pipe.to("cuda") -``` - -You can also run the diffusion process for a shorter number of timesteps. 
- -This can either be done with the `num_inference_steps` argument: - -```py -pipe("", num_inference_steps=30) -``` - -Or with the `timesteps` argument: - -```py -from diffusers.pipelines.deepfloyd_if import fast27_timesteps - -pipe("", timesteps=fast27_timesteps) -``` - -When doing image variation or inpainting, you can also decrease the number of timesteps -with the strength argument. The strength argument is the amount of noise to add to the input image which also determines how many steps to run in the denoising process. -A smaller number will vary the image less but run faster. - -```py -pipe = IFImg2ImgPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16) -pipe.to("cuda") - -image = pipe(image=image, prompt="", strength=0.3).images -``` - -You can also use [`torch.compile`](../../optimization/torch2.0). Note that we have not exhaustively tested `torch.compile` -with IF and it might not give expected results. - -```py -from diffusers import DiffusionPipeline -import torch - -pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16) -pipe.to("cuda") - -pipe.text_encoder = torch.compile(pipe.text_encoder, mode="reduce-overhead", fullgraph=True) -pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) -``` - -### Optimizing for memory - -When optimizing for GPU memory, we can use the standard diffusers CPU offloading APIs. - -Either the model based CPU offloading, - -```py -pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16) -pipe.enable_model_cpu_offload() -``` - -or the more aggressive layer based CPU offloading. - -```py -pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16) -pipe.enable_sequential_cpu_offload() -``` - -Additionally, T5 can be loaded in 8bit precision - -```py -from transformers import T5EncoderModel - -text_encoder = T5EncoderModel.from_pretrained( - "DeepFloyd/IF-I-XL-v1.0", subfolder="text_encoder", device_map="auto", load_in_8bit=True, variant="8bit" -) - -from diffusers import DiffusionPipeline - -pipe = DiffusionPipeline.from_pretrained( - "DeepFloyd/IF-I-XL-v1.0", - text_encoder=text_encoder, # pass the previously instantiated 8bit text encoder - unet=None, - device_map="auto", -) - -prompt_embeds, negative_embeds = pipe.encode_prompt("") -``` - -For CPU RAM constrained machines like Google Colab free tier where we can't load all model components to the CPU at once, we can manually only load the pipeline with -the text encoder or UNet when the respective model components are needed. 
-
-```py
-from diffusers import DiffusionPipeline, IFPipeline, IFSuperResolutionPipeline
-import torch
-import gc
-from transformers import T5EncoderModel
-from diffusers.utils import pt_to_pil, make_image_grid
-
-text_encoder = T5EncoderModel.from_pretrained(
-    "DeepFloyd/IF-I-XL-v1.0", subfolder="text_encoder", device_map="auto", load_in_8bit=True, variant="8bit"
-)
-
-# text to image
-pipe = DiffusionPipeline.from_pretrained(
-    "DeepFloyd/IF-I-XL-v1.0",
-    text_encoder=text_encoder,  # pass the previously instantiated 8bit text encoder
-    unet=None,
-    device_map="auto",
-)
-
-prompt = 'a photo of a kangaroo wearing an orange hoodie and blue sunglasses standing in front of the eiffel tower holding a sign that says "very deep learning"'
-prompt_embeds, negative_embeds = pipe.encode_prompt(prompt)
-
-# Remove the text encoder and pipeline so we can reload the pipeline with the UNet
-del text_encoder
-del pipe
-gc.collect()
-torch.cuda.empty_cache()
-
-pipe = IFPipeline.from_pretrained(
-    "DeepFloyd/IF-I-XL-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16, device_map="auto"
-)
-
-generator = torch.Generator().manual_seed(0)
-stage_1_output = pipe(
-    prompt_embeds=prompt_embeds,
-    negative_prompt_embeds=negative_embeds,
-    output_type="pt",
-    generator=generator,
-).images
-
-#pt_to_pil(stage_1_output)[0].save("./if_stage_I.png")
-
-# Remove the pipeline so we can load the super-resolution pipeline
-del pipe
-gc.collect()
-torch.cuda.empty_cache()
-
-# First super resolution
-
-pipe = IFSuperResolutionPipeline.from_pretrained(
-    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16, device_map="auto"
-)
-
-generator = torch.Generator().manual_seed(0)
-stage_2_output = pipe(
-    image=stage_1_output,
-    prompt_embeds=prompt_embeds,
-    negative_prompt_embeds=negative_embeds,
-    output_type="pt",
-    generator=generator,
-).images
-
-#pt_to_pil(stage_2_output)[0].save("./if_stage_II.png")
-make_image_grid([pt_to_pil(stage_1_output)[0], pt_to_pil(stage_2_output)[0]], rows=1, cols=2)
-```
-
-## Available Pipelines:
-
-| Pipeline | Tasks | Colab |
-|---|---|:---:|
-| [pipeline_if.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/deepfloyd_if/pipeline_if.py) | *Text-to-Image Generation* | - |
-| [pipeline_if_superresolution.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/deepfloyd_if/pipeline_if_superresolution.py) | *Text-to-Image Generation* | - |
-| [pipeline_if_img2img.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/deepfloyd_if/pipeline_if_img2img.py) | *Image-to-Image Generation* | - |
-| [pipeline_if_img2img_superresolution.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/deepfloyd_if/pipeline_if_img2img_superresolution.py) | *Image-to-Image Generation* | - |
-| [pipeline_if_inpainting.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/deepfloyd_if/pipeline_if_inpainting.py) | *Image-to-Image Generation* | - |
-| [pipeline_if_inpainting_superresolution.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/deepfloyd_if/pipeline_if_inpainting_superresolution.py) | *Image-to-Image Generation* | - |
-
-## IFPipeline
-[[autodoc]] IFPipeline
-  - all
-  - __call__
-
-## IFSuperResolutionPipeline
-[[autodoc]] IFSuperResolutionPipeline
-  - all
-  - __call__
-
-## IFImg2ImgPipeline
-[[autodoc]] IFImg2ImgPipeline
-  - all
-  - __call__
-
-## IFImg2ImgSuperResolutionPipeline
-[[autodoc]] IFImg2ImgSuperResolutionPipeline
-  - all
-  -
__call__ - -## IFInpaintingPipeline -[[autodoc]] IFInpaintingPipeline - - all - - __call__ - -## IFInpaintingSuperResolutionPipeline -[[autodoc]] IFInpaintingSuperResolutionPipeline - - all - - __call__ diff --git a/diffusers/docs/source/en/api/pipelines/diffedit.md b/diffusers/docs/source/en/api/pipelines/diffedit.md deleted file mode 100644 index 97cbdcb0c066320283eb68a849eced9940253c58..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/diffedit.md +++ /dev/null @@ -1,55 +0,0 @@ - - -# DiffEdit - -[DiffEdit: Diffusion-based semantic image editing with mask guidance](https://huggingface.co/papers/2210.11427) is by Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord. - -The abstract from the paper is: - -*Image generation has recently seen tremendous advances, with diffusion models allowing to synthesize convincing images for a large variety of text prompts. In this article, we propose DiffEdit, a method to take advantage of text-conditioned diffusion models for the task of semantic image editing, where the goal is to edit an image based on a text query. Semantic image editing is an extension of image generation, with the additional constraint that the generated image should be as similar as possible to a given input image. Current editing methods based on diffusion models usually require to provide a mask, making the task much easier by treating it as a conditional inpainting task. In contrast, our main contribution is able to automatically generate a mask highlighting regions of the input image that need to be edited, by contrasting predictions of a diffusion model conditioned on different text prompts. Moreover, we rely on latent inference to preserve content in those regions of interest and show excellent synergies with mask-based diffusion. DiffEdit achieves state-of-the-art editing performance on ImageNet. In addition, we evaluate semantic image editing in more challenging settings, using images from the COCO dataset as well as text-based generated images.* - -The original codebase can be found at [Xiang-cd/DiffEdit-stable-diffusion](https://github.com/Xiang-cd/DiffEdit-stable-diffusion), and you can try it out in this [demo](https://blog.problemsolversguild.com/technical/research/2022/11/02/DiffEdit-Implementation.html). - -This pipeline was contributed by [clarencechen](https://github.com/clarencechen). ❤️ - -## Tips - -* The pipeline can generate masks that can be fed into other inpainting pipelines. -* In order to generate an image using this pipeline, both an image mask (source and target prompts can be manually specified or generated, and passed to [`~StableDiffusionDiffEditPipeline.generate_mask`]) -and a set of partially inverted latents (generated using [`~StableDiffusionDiffEditPipeline.invert`]) _must_ be provided as arguments when calling the pipeline to generate the final edited image. -* The function [`~StableDiffusionDiffEditPipeline.generate_mask`] exposes two prompt arguments, `source_prompt` and `target_prompt` -that let you control the locations of the semantic edits in the final image to be generated. Let's say, -you wanted to translate from "cat" to "dog". In this case, the edit direction will be "cat -> dog". To reflect -this in the generated mask, you simply have to set the embeddings related to the phrases including "cat" to -`source_prompt` and "dog" to `target_prompt`. 
-* When generating partially inverted latents using `invert`, assign a caption or text embedding describing the -overall image to the `prompt` argument to help guide the inverse latent sampling process. In most cases, the -source concept is sufficiently descriptive to yield good results, but feel free to explore alternatives. -* When calling the pipeline to generate the final edited image, assign the source concept to `negative_prompt` -and the target concept to `prompt`. Taking the above example, you simply have to set the embeddings related to -the phrases including "cat" to `negative_prompt` and "dog" to `prompt`. -* If you wanted to reverse the direction in the example above, i.e., "dog -> cat", then it's recommended to: - * Swap the `source_prompt` and `target_prompt` in the arguments to `generate_mask`. - * Change the input prompt in [`~StableDiffusionDiffEditPipeline.invert`] to include "dog". - * Swap the `prompt` and `negative_prompt` in the arguments to call the pipeline to generate the final edited image. -* The source and target prompts, or their corresponding embeddings, can also be automatically generated. Please refer to the [DiffEdit](../../using-diffusers/diffedit) guide for more details. - -## StableDiffusionDiffEditPipeline -[[autodoc]] StableDiffusionDiffEditPipeline - - all - - generate_mask - - invert - - __call__ - -## StableDiffusionPipelineOutput -[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput diff --git a/diffusers/docs/source/en/api/pipelines/dit.md b/diffusers/docs/source/en/api/pipelines/dit.md deleted file mode 100644 index 1d04458d9cb950925b5bc2b9e75076c9bf2b7f5e..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/dit.md +++ /dev/null @@ -1,35 +0,0 @@ - - -# DiT - -[Scalable Diffusion Models with Transformers](https://huggingface.co/papers/2212.09748) (DiT) is by William Peebles and Saining Xie. - -The abstract from the paper is: - -*We explore a new class of diffusion models based on the transformer architecture. We train latent diffusion models of images, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches. We analyze the scalability of our Diffusion Transformers (DiTs) through the lens of forward pass complexity as measured by Gflops. We find that DiTs with higher Gflops -- through increased transformer depth/width or increased number of input tokens -- consistently have lower FID. In addition to possessing good scalability properties, our largest DiT-XL/2 models outperform all prior diffusion models on the class-conditional ImageNet 512x512 and 256x256 benchmarks, achieving a state-of-the-art FID of 2.27 on the latter.* - -The original codebase can be found at [facebookresearch/dit](https://github.com/facebookresearch/dit). - - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. 
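-
-As a quick reference before the API section, here is a minimal, illustrative sketch of class-conditional sampling with [`DiTPipeline`]. The `facebook/DiT-XL-2-256` checkpoint name, the scheduler swap, and the class labels are assumptions based on the public DiT release rather than part of this page, so adapt them to your setup.
-
-```python
-import torch
-from diffusers import DiTPipeline, DPMSolverMultistepScheduler
-
-# load the class-conditional DiT checkpoint (assumed repo id) in half precision
-pipe = DiTPipeline.from_pretrained("facebook/DiT-XL-2-256", torch_dtype=torch.float16)
-pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
-pipe = pipe.to("cuda")
-
-# DiT is conditioned on ImageNet class ids rather than free-form text
-class_ids = pipe.get_label_ids(["white shark", "umbrella"])
-
-generator = torch.manual_seed(33)
-images = pipe(class_labels=class_ids, num_inference_steps=25, generator=generator).images
-images[0].save("dit_sample.png")
-```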
- - - -## DiTPipeline -[[autodoc]] DiTPipeline - - all - - __call__ - -## ImagePipelineOutput -[[autodoc]] pipelines.ImagePipelineOutput diff --git a/diffusers/docs/source/en/api/pipelines/flux.md b/diffusers/docs/source/en/api/pipelines/flux.md deleted file mode 100644 index 255c69c854bcf2c3e1c07088d7a617bca80ac50e..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/flux.md +++ /dev/null @@ -1,190 +0,0 @@ - - -# Flux - -Flux is a series of text-to-image generation models based on diffusion transformers. To know more about Flux, check out the original [blog post](https://blackforestlabs.ai/announcing-black-forest-labs/) by the creators of Flux, Black Forest Labs. - -Original model checkpoints for Flux can be found [here](https://huggingface.co/black-forest-labs). Original inference code can be found [here](https://github.com/black-forest-labs/flux). - - - -Flux can be quite expensive to run on consumer hardware devices. However, you can perform a suite of optimizations to run it faster and in a more memory-friendly manner. Check out [this section](https://huggingface.co/blog/sd3#memory-optimizations-for-sd3) for more details. Additionally, Flux can benefit from quantization for memory efficiency with a trade-off in inference latency. Refer to [this blog post](https://huggingface.co/blog/quanto-diffusers) to learn more. For an exhaustive list of resources, check out [this gist](https://gist.github.com/sayakpaul/b664605caf0aa3bf8585ab109dd5ac9c). - - - -Flux comes in two variants: - -* Timestep-distilled (`black-forest-labs/FLUX.1-schnell`) -* Guidance-distilled (`black-forest-labs/FLUX.1-dev`) - -Both checkpoints have slightly difference usage which we detail below. - -### Timestep-distilled - -* `max_sequence_length` cannot be more than 256. -* `guidance_scale` needs to be 0. -* As this is a timestep-distilled model, it benefits from fewer sampling steps. - -```python -import torch -from diffusers import FluxPipeline - -pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16) -pipe.enable_model_cpu_offload() - -prompt = "A cat holding a sign that says hello world" -out = pipe( - prompt=prompt, - guidance_scale=0., - height=768, - width=1360, - num_inference_steps=4, - max_sequence_length=256, -).images[0] -out.save("image.png") -``` - -### Guidance-distilled - -* The guidance-distilled variant takes about 50 sampling steps for good-quality generation. -* It doesn't have any limitations around the `max_sequence_length`. - -```python -import torch -from diffusers import FluxPipeline - -pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16) -pipe.enable_model_cpu_offload() - -prompt = "a tiny astronaut hatching from an egg on the moon" -out = pipe( - prompt=prompt, - guidance_scale=3.5, - height=768, - width=1360, - num_inference_steps=50, -).images[0] -out.save("image.png") -``` - -## Running FP16 inference -Flux can generate high-quality images with FP16 (i.e. to accelerate inference on Turing/Volta GPUs) but produces different outputs compared to FP32/BF16. The issue is that some activations in the text encoders have to be clipped when running in FP16, which affects the overall image. Forcing text encoders to run with FP32 inference thus removes this output difference. See [here](https://github.com/huggingface/diffusers/pull/9097#issuecomment-2272292516) for details. 
- -FP16 inference code: -```python -import torch -from diffusers import FluxPipeline - -pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16) # can replace schnell with dev -# to run on low vram GPUs (i.e. between 4 and 32 GB VRAM) -pipe.enable_sequential_cpu_offload() -pipe.vae.enable_slicing() -pipe.vae.enable_tiling() - -pipe.to(torch.float16) # casting here instead of in the pipeline constructor because doing so in the constructor loads all models into CPU memory at once - -prompt = "A cat holding a sign that says hello world" -out = pipe( - prompt=prompt, - guidance_scale=0., - height=768, - width=1360, - num_inference_steps=4, - max_sequence_length=256, -).images[0] -out.save("image.png") -``` - -## Single File Loading for the `FluxTransformer2DModel` - -The `FluxTransformer2DModel` supports loading checkpoints in the original format shipped by Black Forest Labs. This is also useful when trying to load finetunes or quantized versions of the models that have been published by the community. - - -`FP8` inference can be brittle depending on the GPU type, CUDA version, and `torch` version that you are using. It is recommended that you use the `optimum-quanto` library in order to run FP8 inference on your machine. - - -The following example demonstrates how to run Flux with less than 16GB of VRAM. - -First install `optimum-quanto` - -```shell -pip install optimum-quanto -``` - -Then run the following example - -```python -import torch -from diffusers import FluxTransformer2DModel, FluxPipeline -from transformers import T5EncoderModel, CLIPTextModel -from optimum.quanto import freeze, qfloat8, quantize - -bfl_repo = "black-forest-labs/FLUX.1-dev" -dtype = torch.bfloat16 - -transformer = FluxTransformer2DModel.from_single_file("https://huggingface.co/Kijai/flux-fp8/blob/main/flux1-dev-fp8.safetensors", torch_dtype=dtype) -quantize(transformer, weights=qfloat8) -freeze(transformer) - -text_encoder_2 = T5EncoderModel.from_pretrained(bfl_repo, subfolder="text_encoder_2", torch_dtype=dtype) -quantize(text_encoder_2, weights=qfloat8) -freeze(text_encoder_2) - -pipe = FluxPipeline.from_pretrained(bfl_repo, transformer=None, text_encoder_2=None, torch_dtype=dtype) -pipe.transformer = transformer -pipe.text_encoder_2 = text_encoder_2 - -pipe.enable_model_cpu_offload() - -prompt = "A cat holding a sign that says hello world" -image = pipe( - prompt, - guidance_scale=3.5, - output_type="pil", - num_inference_steps=20, - generator=torch.Generator("cpu").manual_seed(0) -).images[0] - -image.save("flux-fp8-dev.png") -``` - -## FluxPipeline - -[[autodoc]] FluxPipeline - - all - - __call__ - -## FluxImg2ImgPipeline - -[[autodoc]] FluxImg2ImgPipeline - - all - - __call__ - -## FluxInpaintPipeline - -[[autodoc]] FluxInpaintPipeline - - all - - __call__ - - -## FluxControlNetInpaintPipeline - -[[autodoc]] FluxControlNetInpaintPipeline - - all - - __call__ - -## FluxControlNetImg2ImgPipeline - -[[autodoc]] FluxControlNetImg2ImgPipeline - - all - - __call__ diff --git a/diffusers/docs/source/en/api/pipelines/hunyuandit.md b/diffusers/docs/source/en/api/pipelines/hunyuandit.md deleted file mode 100644 index 250533837ed0b21d54503df49617b1150b929b47..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/hunyuandit.md +++ /dev/null @@ -1,101 +0,0 @@ - - -# Hunyuan-DiT -![chinese elements understanding](https://github.com/gnobitab/diffusers-hunyuan/assets/1157982/39b99036-c3cb-4f16-bb1a-40ec25eda573) - -[Hunyuan-DiT : A Powerful 
Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding](https://arxiv.org/abs/2405.08748) from Tencent Hunyuan. - -The abstract from the paper is: - -*We present Hunyuan-DiT, a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese. To construct Hunyuan-DiT, we carefully design the transformer structure, text encoder, and positional encoding. We also build from scratch a whole data pipeline to update and evaluate data for iterative model optimization. For fine-grained language understanding, we train a Multimodal Large Language Model to refine the captions of the images. Finally, Hunyuan-DiT can perform multi-turn multimodal dialogue with users, generating and refining images according to the context. Through our holistic human evaluation protocol with more than 50 professional human evaluators, Hunyuan-DiT sets a new state-of-the-art in Chinese-to-image generation compared with other open-source models.* - - -You can find the original codebase at [Tencent/HunyuanDiT](https://github.com/Tencent/HunyuanDiT) and all the available checkpoints at [Tencent-Hunyuan](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT). - -**Highlights**: HunyuanDiT supports Chinese/English-to-image, multi-resolution generation. - -HunyuanDiT has the following components: -* It uses a diffusion transformer as the backbone -* It combines two text encoders, a bilingual CLIP and a multilingual T5 encoder - - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. - - - - - -You can further improve generation quality by passing the generated image from [`HungyuanDiTPipeline`] to the [SDXL refiner](../../using-diffusers/sdxl#base-to-refiner-model) model. - - - -## Optimization - -You can optimize the pipeline's runtime and memory consumption with torch.compile and feed-forward chunking. To learn about other optimization methods, check out the [Speed up inference](../../optimization/fp16) and [Reduce memory usage](../../optimization/memory) guides. - -### Inference - -Use [`torch.compile`](https://huggingface.co/docs/diffusers/main/en/tutorials/fast_diffusion#torchcompile) to reduce the inference latency. - -First, load the pipeline: - -```python -from diffusers import HunyuanDiTPipeline -import torch - -pipeline = HunyuanDiTPipeline.from_pretrained( - "Tencent-Hunyuan/HunyuanDiT-Diffusers", torch_dtype=torch.float16 -).to("cuda") -``` - -Then change the memory layout of the pipelines `transformer` and `vae` components to `torch.channels-last`: - -```python -pipeline.transformer.to(memory_format=torch.channels_last) -pipeline.vae.to(memory_format=torch.channels_last) -``` - -Finally, compile the components and run inference: - -```python -pipeline.transformer = torch.compile(pipeline.transformer, mode="max-autotune", fullgraph=True) -pipeline.vae.decode = torch.compile(pipeline.vae.decode, mode="max-autotune", fullgraph=True) - -image = pipeline(prompt="一个宇航员在骑马").images[0] -``` - -The [benchmark](https://gist.github.com/sayakpaul/29d3a14905cfcbf611fe71ebd22e9b23) results on a 80GB A100 machine are: - -```bash -With torch.compile(): Average inference time: 12.470 seconds. -Without torch.compile(): Average inference time: 20.570 seconds. 
-``` - -### Memory optimization - -By loading the T5 text encoder in 8 bits, you can run the pipeline in just under 6 GBs of GPU VRAM. Refer to [this script](https://gist.github.com/sayakpaul/3154605f6af05b98a41081aaba5ca43e) for details. - -Furthermore, you can use the [`~HunyuanDiT2DModel.enable_forward_chunking`] method to reduce memory usage. Feed-forward chunking runs the feed-forward layers in a transformer block in a loop instead of all at once. This gives you a trade-off between memory consumption and inference runtime. - -```diff -+ pipeline.transformer.enable_forward_chunking(chunk_size=1, dim=1) -``` - - -## HunyuanDiTPipeline - -[[autodoc]] HunyuanDiTPipeline - - all - - __call__ - diff --git a/diffusers/docs/source/en/api/pipelines/i2vgenxl.md b/diffusers/docs/source/en/api/pipelines/i2vgenxl.md deleted file mode 100644 index cbb6be1176fdff8d4871f66a05b81c29d5bf632c..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/i2vgenxl.md +++ /dev/null @@ -1,58 +0,0 @@ - - -# I2VGen-XL - -[I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models](https://hf.co/papers/2311.04145.pdf) by Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu Qin, Xiang Wang, Deli Zhao, and Jingren Zhou. - -The abstract from the paper is: - -*Video synthesis has recently made remarkable strides benefiting from the rapid development of diffusion models. However, it still encounters challenges in terms of semantic accuracy, clarity and spatio-temporal continuity. They primarily arise from the scarcity of well-aligned text-video data and the complex inherent structure of videos, making it difficult for the model to simultaneously ensure semantic and qualitative excellence. In this report, we propose a cascaded I2VGen-XL approach that enhances model performance by decoupling these two factors and ensures the alignment of the input data by utilizing static images as a form of crucial guidance. I2VGen-XL consists of two stages: i) the base stage guarantees coherent semantics and preserves content from input images by using two hierarchical encoders, and ii) the refinement stage enhances the video's details by incorporating an additional brief text and improves the resolution to 1280×720. To improve the diversity, we collect around 35 million single-shot text-video pairs and 6 billion text-image pairs to optimize the model. By this means, I2VGen-XL can simultaneously enhance the semantic accuracy, continuity of details and clarity of generated videos. Through extensive experiments, we have investigated the underlying principles of I2VGen-XL and compared it with current top methods, which can demonstrate its effectiveness on diverse data. The source code and models will be publicly available at [this https URL](https://i2vgen-xl.github.io/).* - -The original codebase can be found [here](https://github.com/ali-vilab/i2vgen-xl/). The model checkpoints can be found [here](https://huggingface.co/ali-vilab/). - - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. Also, to know more about reducing the memory usage of this pipeline, refer to the ["Reduce memory usage"] section [here](../../using-diffusers/svd#reduce-memory-usage). 
- - - -Sample output with I2VGenXL: - - - - - -
-<!-- Sample output video omitted; only its caption ("library.") was recoverable. -->
- -## Notes - -* I2VGenXL always uses a `clip_skip` value of 1. This means it leverages the penultimate layer representations from the text encoder of CLIP. -* It can generate videos of quality that is often on par with [Stable Video Diffusion](../../using-diffusers/svd) (SVD). -* Unlike SVD, it additionally accepts text prompts as inputs. -* It can generate higher resolution videos. -* When using the [`DDIMScheduler`] (which is default for this pipeline), less than 50 steps for inference leads to bad results. -* This implementation is 1-stage variant of I2VGenXL. The main figure in the [I2VGen-XL](https://arxiv.org/abs/2311.04145) paper shows a 2-stage variant, however, 1-stage variant works well. See [this discussion](https://github.com/huggingface/diffusers/discussions/7952) for more details. - -## I2VGenXLPipeline -[[autodoc]] I2VGenXLPipeline - - all - - __call__ - -## I2VGenXLPipelineOutput -[[autodoc]] pipelines.i2vgen_xl.pipeline_i2vgen_xl.I2VGenXLPipelineOutput \ No newline at end of file diff --git a/diffusers/docs/source/en/api/pipelines/kandinsky.md b/diffusers/docs/source/en/api/pipelines/kandinsky.md deleted file mode 100644 index 9ea3cd4a17182062f16d8878db9424b49176fa1f..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/kandinsky.md +++ /dev/null @@ -1,73 +0,0 @@ - - -# Kandinsky 2.1 - -Kandinsky 2.1 is created by [Arseniy Shakhmatov](https://github.com/cene555), [Anton Razzhigaev](https://github.com/razzant), [Aleksandr Nikolich](https://github.com/AlexWortega), [Vladimir Arkhipkin](https://github.com/oriBetelgeuse), [Igor Pavlov](https://github.com/boomb0om), [Andrey Kuznetsov](https://github.com/kuznetsoffandrey), and [Denis Dimitrov](https://github.com/denndimitrov). - -The description from it's GitHub page is: - -*Kandinsky 2.1 inherits best practicies from Dall-E 2 and Latent diffusion, while introducing some new ideas. As text and image encoder it uses CLIP model and diffusion image prior (mapping) between latent spaces of CLIP modalities. This approach increases the visual performance of the model and unveils new horizons in blending images and text-guided image manipulation.* - -The original codebase can be found at [ai-forever/Kandinsky-2](https://github.com/ai-forever/Kandinsky-2). - - - -Check out the [Kandinsky Community](https://huggingface.co/kandinsky-community) organization on the Hub for the official model checkpoints for tasks like text-to-image, image-to-image, and inpainting. - - - - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. 
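-
-For a quick start, here is a minimal, illustrative text-to-image sketch that chains the prior and decoder pipelines documented below. The `kandinsky-community` checkpoint names and the sampling settings are assumptions for demonstration purposes.
-
-```python
-import torch
-from diffusers import KandinskyPriorPipeline, KandinskyPipeline
-
-# stage 1: the prior maps the text prompt to CLIP image embeddings
-prior = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16).to("cuda")
-prompt = "A lion sitting in a field of flowers, 4k photo"
-negative_prompt = "low quality, bad quality"
-image_embeds, negative_image_embeds = prior(prompt, negative_prompt, guidance_scale=1.0).to_tuple()
-
-# stage 2: the decoder turns the image embeddings into an image
-pipe = KandinskyPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16).to("cuda")
-image = pipe(
-    prompt,
-    image_embeds=image_embeds,
-    negative_image_embeds=negative_image_embeds,
-    height=768,
-    width=768,
-    num_inference_steps=50,
-).images[0]
-image.save("kandinsky_2_1.png")
-```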
- - - -## KandinskyPriorPipeline - -[[autodoc]] KandinskyPriorPipeline - - all - - __call__ - - interpolate - -## KandinskyPipeline - -[[autodoc]] KandinskyPipeline - - all - - __call__ - -## KandinskyCombinedPipeline - -[[autodoc]] KandinskyCombinedPipeline - - all - - __call__ - -## KandinskyImg2ImgPipeline - -[[autodoc]] KandinskyImg2ImgPipeline - - all - - __call__ - -## KandinskyImg2ImgCombinedPipeline - -[[autodoc]] KandinskyImg2ImgCombinedPipeline - - all - - __call__ - -## KandinskyInpaintPipeline - -[[autodoc]] KandinskyInpaintPipeline - - all - - __call__ - -## KandinskyInpaintCombinedPipeline - -[[autodoc]] KandinskyInpaintCombinedPipeline - - all - - __call__ diff --git a/diffusers/docs/source/en/api/pipelines/kandinsky3.md b/diffusers/docs/source/en/api/pipelines/kandinsky3.md deleted file mode 100644 index 96123846af32589b3bf14a5f339a0470997ef947..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/kandinsky3.md +++ /dev/null @@ -1,49 +0,0 @@ - - -# Kandinsky 3 - -Kandinsky 3 is created by [Vladimir Arkhipkin](https://github.com/oriBetelgeuse),[Anastasia Maltseva](https://github.com/NastyaMittseva),[Igor Pavlov](https://github.com/boomb0om),[Andrei Filatov](https://github.com/anvilarth),[Arseniy Shakhmatov](https://github.com/cene555),[Andrey Kuznetsov](https://github.com/kuznetsoffandrey),[Denis Dimitrov](https://github.com/denndimitrov), [Zein Shaheen](https://github.com/zeinsh) - -The description from it's GitHub page: - -*Kandinsky 3.0 is an open-source text-to-image diffusion model built upon the Kandinsky2-x model family. In comparison to its predecessors, enhancements have been made to the text understanding and visual quality of the model, achieved by increasing the size of the text encoder and Diffusion U-Net models, respectively.* - -Its architecture includes 3 main components: -1. [FLAN-UL2](https://huggingface.co/google/flan-ul2), which is an encoder decoder model based on the T5 architecture. -2. New U-Net architecture featuring BigGAN-deep blocks doubles depth while maintaining the same number of parameters. -3. Sber-MoVQGAN is a decoder proven to have superior results in image restoration. - - - -The original codebase can be found at [ai-forever/Kandinsky-3](https://github.com/ai-forever/Kandinsky-3). - - - -Check out the [Kandinsky Community](https://huggingface.co/kandinsky-community) organization on the Hub for the official model checkpoints for tasks like text-to-image, image-to-image, and inpainting. - - - - - -Make sure to check out the schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. 
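-
-As an illustrative example of the image-to-image pipeline documented below, the following sketch uses the `AutoPipelineForImage2Image` entry point. The `kandinsky-community/kandinsky-3` repo id, the input image path, and the `strength` value are assumptions; adjust them for your use case.
-
-```python
-import torch
-from diffusers import AutoPipelineForImage2Image
-from diffusers.utils import load_image
-
-# resolves to Kandinsky3Img2ImgPipeline under the hood
-pipe = AutoPipelineForImage2Image.from_pretrained(
-    "kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16
-)
-pipe.enable_model_cpu_offload()
-
-prompt = "A painting of the inside of a subway train with tiny raccoons"
-image = load_image("path/or/url/to/input_image.png")  # replace with your own image
-
-generator = torch.Generator(device="cpu").manual_seed(0)
-image = pipe(prompt, image=image, strength=0.75, num_inference_steps=25, generator=generator).images[0]
-image.save("kandinsky_3_img2img.png")
-```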
- - - -## Kandinsky3Pipeline - -[[autodoc]] Kandinsky3Pipeline - - all - - __call__ - -## Kandinsky3Img2ImgPipeline - -[[autodoc]] Kandinsky3Img2ImgPipeline - - all - - __call__ diff --git a/diffusers/docs/source/en/api/pipelines/kandinsky_v22.md b/diffusers/docs/source/en/api/pipelines/kandinsky_v22.md deleted file mode 100644 index 13a6ca81d4a5dd14cec640ac2371c77d52d24296..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/kandinsky_v22.md +++ /dev/null @@ -1,92 +0,0 @@ - - -# Kandinsky 2.2 - -Kandinsky 2.2 is created by [Arseniy Shakhmatov](https://github.com/cene555), [Anton Razzhigaev](https://github.com/razzant), [Aleksandr Nikolich](https://github.com/AlexWortega), [Vladimir Arkhipkin](https://github.com/oriBetelgeuse), [Igor Pavlov](https://github.com/boomb0om), [Andrey Kuznetsov](https://github.com/kuznetsoffandrey), and [Denis Dimitrov](https://github.com/denndimitrov). - -The description from it's GitHub page is: - -*Kandinsky 2.2 brings substantial improvements upon its predecessor, Kandinsky 2.1, by introducing a new, more powerful image encoder - CLIP-ViT-G and the ControlNet support. The switch to CLIP-ViT-G as the image encoder significantly increases the model's capability to generate more aesthetic pictures and better understand text, thus enhancing the model's overall performance. The addition of the ControlNet mechanism allows the model to effectively control the process of generating images. This leads to more accurate and visually appealing outputs and opens new possibilities for text-guided image manipulation.* - -The original codebase can be found at [ai-forever/Kandinsky-2](https://github.com/ai-forever/Kandinsky-2). - - - -Check out the [Kandinsky Community](https://huggingface.co/kandinsky-community) organization on the Hub for the official model checkpoints for tasks like text-to-image, image-to-image, and inpainting. - - - - - -Make sure to check out the schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. 
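-
-For orientation, here is a minimal, illustrative sketch of the two-stage Kandinsky 2.2 workflow (prior followed by decoder) using the pipelines documented below. The `kandinsky-community` checkpoint names and settings are assumptions for demonstration; note that, unlike Kandinsky 2.1, the 2.2 decoder is conditioned only on the image embeddings and takes no text prompt.
-
-```python
-import torch
-from diffusers import KandinskyV22PriorPipeline, KandinskyV22Pipeline
-
-# stage 1: the prior turns the prompt into CLIP image embeddings
-prior = KandinskyV22PriorPipeline.from_pretrained(
-    "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16
-).to("cuda")
-image_embeds, negative_image_embeds = prior(
-    "A portrait of a red fox wearing a scarf, watercolor", negative_prompt="low quality, bad quality", guidance_scale=1.0
-).to_tuple()
-
-# stage 2: the decoder generates the image from the embeddings
-decoder = KandinskyV22Pipeline.from_pretrained(
-    "kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16
-).to("cuda")
-image = decoder(
-    image_embeds=image_embeds,
-    negative_image_embeds=negative_image_embeds,
-    height=768,
-    width=768,
-    num_inference_steps=50,
-).images[0]
-image.save("kandinsky_2_2.png")
-```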
- - - -## KandinskyV22PriorPipeline - -[[autodoc]] KandinskyV22PriorPipeline - - all - - __call__ - - interpolate - -## KandinskyV22Pipeline - -[[autodoc]] KandinskyV22Pipeline - - all - - __call__ - -## KandinskyV22CombinedPipeline - -[[autodoc]] KandinskyV22CombinedPipeline - - all - - __call__ - -## KandinskyV22ControlnetPipeline - -[[autodoc]] KandinskyV22ControlnetPipeline - - all - - __call__ - -## KandinskyV22PriorEmb2EmbPipeline - -[[autodoc]] KandinskyV22PriorEmb2EmbPipeline - - all - - __call__ - - interpolate - -## KandinskyV22Img2ImgPipeline - -[[autodoc]] KandinskyV22Img2ImgPipeline - - all - - __call__ - -## KandinskyV22Img2ImgCombinedPipeline - -[[autodoc]] KandinskyV22Img2ImgCombinedPipeline - - all - - __call__ - -## KandinskyV22ControlnetImg2ImgPipeline - -[[autodoc]] KandinskyV22ControlnetImg2ImgPipeline - - all - - __call__ - -## KandinskyV22InpaintPipeline - -[[autodoc]] KandinskyV22InpaintPipeline - - all - - __call__ - -## KandinskyV22InpaintCombinedPipeline - -[[autodoc]] KandinskyV22InpaintCombinedPipeline - - all - - __call__ diff --git a/diffusers/docs/source/en/api/pipelines/kolors.md b/diffusers/docs/source/en/api/pipelines/kolors.md deleted file mode 100644 index 367eb4a4854812c86e7d3eca42281409a2af3b95..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/kolors.md +++ /dev/null @@ -1,115 +0,0 @@ - - -# Kolors: Effective Training of Diffusion Model for Photorealistic Text-to-Image Synthesis - -![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/kolors/kolors_header_collage.png) - -Kolors is a large-scale text-to-image generation model based on latent diffusion, developed by [the Kuaishou Kolors team](https://github.com/Kwai-Kolors/Kolors). Trained on billions of text-image pairs, Kolors exhibits significant advantages over both open-source and closed-source models in visual quality, complex semantic accuracy, and text rendering for both Chinese and English characters. Furthermore, Kolors supports both Chinese and English inputs, demonstrating strong performance in understanding and generating Chinese-specific content. For more details, please refer to this [technical report](https://github.com/Kwai-Kolors/Kolors/blob/master/imgs/Kolors_paper.pdf). - -The abstract from the technical report is: - -*We present Kolors, a latent diffusion model for text-to-image synthesis, characterized by its profound understanding of both English and Chinese, as well as an impressive degree of photorealism. There are three key insights contributing to the development of Kolors. Firstly, unlike large language model T5 used in Imagen and Stable Diffusion 3, Kolors is built upon the General Language Model (GLM), which enhances its comprehension capabilities in both English and Chinese. Moreover, we employ a multimodal large language model to recaption the extensive training dataset for fine-grained text understanding. These strategies significantly improve Kolors’ ability to comprehend intricate semantics, particularly those involving multiple entities, and enable its advanced text rendering capabilities. Secondly, we divide the training of Kolors into two phases: the concept learning phase with broad knowledge and the quality improvement phase with specifically curated high-aesthetic data. Furthermore, we investigate the critical role of the noise schedule and introduce a novel schedule to optimize high-resolution image generation. 
These strategies collectively enhance the visual appeal of the generated high-resolution images. Lastly, we propose a category-balanced benchmark KolorsPrompts, which serves as a guide for the training and evaluation of Kolors. Consequently, even when employing the commonly used U-Net backbone, Kolors has demonstrated remarkable performance in human evaluations, surpassing the existing open-source models and achieving Midjourney-v6 level performance, especially in terms of visual appeal. We will release the code and weights of Kolors at , and hope that it will benefit future research and applications in the visual generation community.* - -## Usage Example - -```python -import torch - -from diffusers import DPMSolverMultistepScheduler, KolorsPipeline - -pipe = KolorsPipeline.from_pretrained("Kwai-Kolors/Kolors-diffusers", torch_dtype=torch.float16, variant="fp16") -pipe.to("cuda") -pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config, use_karras_sigmas=True) - -image = pipe( - prompt='一张瓢虫的照片,微距,变焦,高质量,电影,拿着一个牌子,写着"可图"', - negative_prompt="", - guidance_scale=6.5, - num_inference_steps=25, -).images[0] - -image.save("kolors_sample.png") -``` - -### IP Adapter - -Kolors needs a different IP Adapter to work, and it uses [Openai-CLIP-336](https://huggingface.co/openai/clip-vit-large-patch14-336) as an image encoder. - - - -Using an IP Adapter with Kolors requires more than 24GB of VRAM. To use it, we recommend using [`~DiffusionPipeline.enable_model_cpu_offload`] on consumer GPUs. - - - - - -While Kolors is integrated in Diffusers, you need to load the image encoder from a revision to use the safetensor files. You can still use the main branch of the original repository if you're comfortable loading pickle checkpoints. - - - -```python -import torch -from transformers import CLIPVisionModelWithProjection - -from diffusers import DPMSolverMultistepScheduler, KolorsPipeline -from diffusers.utils import load_image - -image_encoder = CLIPVisionModelWithProjection.from_pretrained( - "Kwai-Kolors/Kolors-IP-Adapter-Plus", - subfolder="image_encoder", - low_cpu_mem_usage=True, - torch_dtype=torch.float16, - revision="refs/pr/4", -) - -pipe = KolorsPipeline.from_pretrained( - "Kwai-Kolors/Kolors-diffusers", image_encoder=image_encoder, torch_dtype=torch.float16, variant="fp16" -) -pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config, use_karras_sigmas=True) - -pipe.load_ip_adapter( - "Kwai-Kolors/Kolors-IP-Adapter-Plus", - subfolder="", - weight_name="ip_adapter_plus_general.safetensors", - revision="refs/pr/4", - image_encoder_folder=None, -) -pipe.enable_model_cpu_offload() - -ipa_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/kolors/cat_square.png") - -image = pipe( - prompt="best quality, high quality", - negative_prompt="", - guidance_scale=6.5, - num_inference_steps=25, - ip_adapter_image=ipa_image, -).images[0] - -image.save("kolors_ipa_sample.png") -``` - -## KolorsPipeline - -[[autodoc]] KolorsPipeline - -- all -- __call__ - -## KolorsImg2ImgPipeline - -[[autodoc]] KolorsImg2ImgPipeline - -- all -- __call__ - diff --git a/diffusers/docs/source/en/api/pipelines/latent_consistency_models.md b/diffusers/docs/source/en/api/pipelines/latent_consistency_models.md deleted file mode 100644 index 4d944510445c0234905d55160188074e69ec85f9..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/latent_consistency_models.md +++ /dev/null @@ -1,52 +0,0 @@ - - -# 
Latent Consistency Models - -Latent Consistency Models (LCMs) were proposed in [Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference](https://huggingface.co/papers/2310.04378) by Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. - -The abstract of the paper is as follows: - -*Latent Diffusion models (LDMs) have achieved remarkable results in synthesizing high-resolution images. However, the iterative sampling process is computationally intensive and leads to slow generation. Inspired by Consistency Models (song et al.), we propose Latent Consistency Models (LCMs), enabling swift inference with minimal steps on any pre-trained LDMs, including Stable Diffusion (rombach et al). Viewing the guided reverse diffusion process as solving an augmented probability flow ODE (PF-ODE), LCMs are designed to directly predict the solution of such ODE in latent space, mitigating the need for numerous iterations and allowing rapid, high-fidelity sampling. Efficiently distilled from pre-trained classifier-free guided diffusion models, a high-quality 768 x 768 2~4-step LCM takes only 32 A100 GPU hours for training. Furthermore, we introduce Latent Consistency Fine-tuning (LCF), a novel method that is tailored for fine-tuning LCMs on customized image datasets. Evaluation on the LAION-5B-Aesthetics dataset demonstrates that LCMs achieve state-of-the-art text-to-image generation performance with few-step inference. Project Page: [this https URL](https://latent-consistency-models.github.io/).* - -A demo for the [SimianLuo/LCM_Dreamshaper_v7](https://huggingface.co/SimianLuo/LCM_Dreamshaper_v7) checkpoint can be found [here](https://huggingface.co/spaces/SimianLuo/Latent_Consistency_Model). - -The pipelines were contributed by [luosiallen](https://luosiallen.github.io/), [nagolinc](https://github.com/nagolinc), and [dg845](https://github.com/dg845). - - -## LatentConsistencyModelPipeline - -[[autodoc]] LatentConsistencyModelPipeline - - all - - __call__ - - enable_freeu - - disable_freeu - - enable_vae_slicing - - disable_vae_slicing - - enable_vae_tiling - - disable_vae_tiling - -## LatentConsistencyModelImg2ImgPipeline - -[[autodoc]] LatentConsistencyModelImg2ImgPipeline - - all - - __call__ - - enable_freeu - - disable_freeu - - enable_vae_slicing - - disable_vae_slicing - - enable_vae_tiling - - disable_vae_tiling - -## StableDiffusionPipelineOutput - -[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput diff --git a/diffusers/docs/source/en/api/pipelines/latent_diffusion.md b/diffusers/docs/source/en/api/pipelines/latent_diffusion.md deleted file mode 100644 index ab50faebbfbafc12f91363b10407fe8a3f88c7aa..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/latent_diffusion.md +++ /dev/null @@ -1,40 +0,0 @@ - - -# Latent Diffusion - -Latent Diffusion was proposed in [High-Resolution Image Synthesis with Latent Diffusion Models](https://huggingface.co/papers/2112.10752) by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer. - -The abstract from the paper is: - -*By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. 
However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs.* - -The original codebase can be found at [CompVis/latent-diffusion](https://github.com/CompVis/latent-diffusion). - - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. - - - -## LDMTextToImagePipeline -[[autodoc]] LDMTextToImagePipeline - - all - - __call__ - -## LDMSuperResolutionPipeline -[[autodoc]] LDMSuperResolutionPipeline - - all - - __call__ - -## ImagePipelineOutput -[[autodoc]] pipelines.ImagePipelineOutput diff --git a/diffusers/docs/source/en/api/pipelines/latte.md b/diffusers/docs/source/en/api/pipelines/latte.md deleted file mode 100644 index c2154d5d47c157a3edc89e7fc51614c087a0df62..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/latte.md +++ /dev/null @@ -1,77 +0,0 @@ - - -# Latte - -![latte text-to-video](https://github.com/Vchitect/Latte/blob/52bc0029899babbd6e9250384c83d8ed2670ff7a/visuals/latte.gif?raw=true) - -[Latte: Latent Diffusion Transformer for Video Generation](https://arxiv.org/abs/2401.03048) from Monash University, Shanghai AI Lab, Nanjing University, and Nanyang Technological University. - -The abstract from the paper is: - -*We propose a novel Latent Diffusion Transformer, namely Latte, for video generation. Latte first extracts spatio-temporal tokens from input videos and then adopts a series of Transformer blocks to model video distribution in the latent space. In order to model a substantial number of tokens extracted from videos, four efficient variants are introduced from the perspective of decomposing the spatial and temporal dimensions of input videos. To improve the quality of generated videos, we determine the best practices of Latte through rigorous experimental analysis, including video clip patch embedding, model variants, timestep-class information injection, temporal positional embedding, and learning strategies. Our comprehensive evaluation demonstrates that Latte achieves state-of-the-art performance across four standard video generation datasets, i.e., FaceForensics, SkyTimelapse, UCF101, and Taichi-HD. 
In addition, we extend Latte to text-to-video generation (T2V) task, where Latte achieves comparable results compared to recent T2V models. We strongly believe that Latte provides valuable insights for future research on incorporating Transformers into diffusion models for video generation.* - -**Highlights**: Latte is a latent diffusion transformer proposed as a backbone for modeling different modalities (trained for text-to-video generation here). It achieves state-of-the-art performance across four standard video benchmarks - [FaceForensics](https://arxiv.org/abs/1803.09179), [SkyTimelapse](https://arxiv.org/abs/1709.07592), [UCF101](https://arxiv.org/abs/1212.0402) and [Taichi-HD](https://arxiv.org/abs/2003.00196). To prepare and download the datasets for evaluation, please refer to [this https URL](https://github.com/Vchitect/Latte/blob/main/docs/datasets_evaluation.md). - -This pipeline was contributed by [maxin-cn](https://github.com/maxin-cn). The original codebase can be found [here](https://github.com/Vchitect/Latte). The original weights can be found under [hf.co/maxin-cn](https://huggingface.co/maxin-cn). - - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. - - - -### Inference - -Use [`torch.compile`](https://huggingface.co/docs/diffusers/main/en/tutorials/fast_diffusion#torchcompile) to reduce the inference latency. - -First, load the pipeline: - -```python -import torch -from diffusers import LattePipeline - -pipeline = LattePipeline.from_pretrained( - "maxin-cn/Latte-1", torch_dtype=torch.float16 -).to("cuda") -``` - -Then change the memory layout of the pipelines `transformer` and `vae` components to `torch.channels-last`: - -```python -pipeline.transformer.to(memory_format=torch.channels_last) -pipeline.vae.to(memory_format=torch.channels_last) -``` - -Finally, compile the components and run inference: - -```python -pipeline.transformer = torch.compile(pipeline.transformer) -pipeline.vae.decode = torch.compile(pipeline.vae.decode) - -video = pipeline(prompt="A dog wearing sunglasses floating in space, surreal, nebulae in background").frames[0] -``` - -The [benchmark](https://gist.github.com/a-r-r-o-w/4e1694ca46374793c0361d740a99ff19) results on an 80GB A100 machine are: - -``` -Without torch.compile(): Average inference time: 16.246 seconds. -With torch.compile(): Average inference time: 14.573 seconds. -``` - -## LattePipeline - -[[autodoc]] LattePipeline - - all - - __call__ diff --git a/diffusers/docs/source/en/api/pipelines/ledits_pp.md b/diffusers/docs/source/en/api/pipelines/ledits_pp.md deleted file mode 100644 index 4d268a252edfe7fb4182088fb4d106b3d2c3823f..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/ledits_pp.md +++ /dev/null @@ -1,54 +0,0 @@ - - -# LEDITS++ - -LEDITS++ was proposed in [LEDITS++: Limitless Image Editing using Text-to-Image Models](https://huggingface.co/papers/2311.16711) by Manuel Brack, Felix Friedrich, Katharina Kornmeier, Linoy Tsaban, Patrick Schramowski, Kristian Kersting, Apolinário Passos. - -The abstract from the paper is: - -*Text-to-image diffusion models have recently received increasing interest for their astonishing ability to produce high-fidelity images from solely text inputs. 
Subsequent research efforts aim to exploit and apply their capabilities to real image editing. However, existing image-to-image methods are often inefficient, imprecise, and of limited versatility. They either require time-consuming fine-tuning, deviate unnecessarily strongly from the input image, and/or lack support for multiple, simultaneous edits. To address these issues, we introduce LEDITS++, an efficient yet versatile and precise textual image manipulation technique. LEDITS++'s novel inversion approach requires no tuning nor optimization and produces high-fidelity results with a few diffusion steps. Second, our methodology supports multiple simultaneous edits and is architecture-agnostic. Third, we use a novel implicit masking technique that limits changes to relevant image regions. We propose the novel TEdBench++ benchmark as part of our exhaustive evaluation. Our results demonstrate the capabilities of LEDITS++ and its improvements over previous methods. The project page is available at https://leditsplusplus-project.static.hf.space .* - - - -You can find additional information about LEDITS++ on the [project page](https://leditsplusplus-project.static.hf.space/index.html) and try it out in a [demo](https://huggingface.co/spaces/editing-images/leditsplusplus). - - - - -Due to some backward compatability issues with the current diffusers implementation of [`~schedulers.DPMSolverMultistepScheduler`] this implementation of LEdits++ can no longer guarantee perfect inversion. -This issue is unlikely to have any noticeable effects on applied use-cases. However, we provide an alternative implementation that guarantees perfect inversion in a dedicated [GitHub repo](https://github.com/ml-research/ledits_pp). - - -We provide two distinct pipelines based on different pre-trained models. - -## LEditsPPPipelineStableDiffusion -[[autodoc]] pipelines.ledits_pp.LEditsPPPipelineStableDiffusion - - all - - __call__ - - invert - -## LEditsPPPipelineStableDiffusionXL -[[autodoc]] pipelines.ledits_pp.LEditsPPPipelineStableDiffusionXL - - all - - __call__ - - invert - - - -## LEditsPPDiffusionPipelineOutput -[[autodoc]] pipelines.ledits_pp.pipeline_output.LEditsPPDiffusionPipelineOutput - - all - -## LEditsPPInversionPipelineOutput -[[autodoc]] pipelines.ledits_pp.pipeline_output.LEditsPPInversionPipelineOutput - - all diff --git a/diffusers/docs/source/en/api/pipelines/lumina.md b/diffusers/docs/source/en/api/pipelines/lumina.md deleted file mode 100644 index cc8aceefc1b1a96bddf9b77c61d472229c247773..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/lumina.md +++ /dev/null @@ -1,90 +0,0 @@ - - -# Lumina-T2X -![concepts](https://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/9f52eabb-07dc-4881-8257-6d8a5f2a0a5a) - -[Lumina-Next : Making Lumina-T2X Stronger and Faster with Next-DiT](https://github.com/Alpha-VLLM/Lumina-T2X/blob/main/assets/lumina-next.pdf) from Alpha-VLLM, OpenGVLab, Shanghai AI Laboratory. - -The abstract from the paper is: - -*Lumina-T2X is a nascent family of Flow-based Large Diffusion Transformers (Flag-DiT) that establishes a unified framework for transforming noise into various modalities, such as images and videos, conditioned on text instructions. Despite its promising capabilities, Lumina-T2X still encounters challenges including training instability, slow inference, and extrapolation artifacts. 
In this paper, we present Lumina-Next, an improved version of Lumina-T2X, showcasing stronger generation performance with increased training and inference efficiency. We begin with a comprehensive analysis of the Flag-DiT architecture and identify several suboptimal components, which we address by introducing the Next-DiT architecture with 3D RoPE and sandwich normalizations. To enable better resolution extrapolation, we thoroughly compare different context extrapolation methods applied to text-to-image generation with 3D RoPE, and propose Frequency- and Time-Aware Scaled RoPE tailored for diffusion transformers. Additionally, we introduce a sigmoid time discretization schedule to reduce sampling steps in solving the Flow ODE and the Context Drop method to merge redundant visual tokens for faster network evaluation, effectively boosting the overall sampling speed. Thanks to these improvements, Lumina-Next not only improves the quality and efficiency of basic text-to-image generation but also demonstrates superior resolution extrapolation capabilities and multilingual generation using decoder-based LLMs as the text encoder, all in a zero-shot manner. To further validate Lumina-Next as a versatile generative framework, we instantiate it on diverse tasks including visual recognition, multi-view, audio, music, and point cloud generation, showcasing strong performance across these domains. By releasing all codes and model weights at https://github.com/Alpha-VLLM/Lumina-T2X, we aim to advance the development of next-generation generative AI capable of universal modeling.* - -**Highlights**: Lumina-Next is a next-generation Diffusion Transformer that significantly enhances text-to-image generation, multilingual generation, and multitask performance by introducing the Next-DiT architecture, 3D RoPE, and frequency- and time-aware RoPE, among other improvements. - -Lumina-Next has the following components: -* It improves sampling efficiency with fewer and faster Steps. -* It uses a Next-DiT as a transformer backbone with Sandwichnorm 3D RoPE, and Grouped-Query Attention. -* It uses a Frequency- and Time-Aware Scaled RoPE. - ---- - -[Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers](https://arxiv.org/abs/2405.05945) from Alpha-VLLM, OpenGVLab, Shanghai AI Laboratory. - -The abstract from the paper is: - -*Sora unveils the potential of scaling Diffusion Transformer for generating photorealistic images and videos at arbitrary resolutions, aspect ratios, and durations, yet it still lacks sufficient implementation details. In this technical report, we introduce the Lumina-T2X family - a series of Flow-based Large Diffusion Transformers (Flag-DiT) equipped with zero-initialized attention, as a unified framework designed to transform noise into images, videos, multi-view 3D objects, and audio clips conditioned on text instructions. By tokenizing the latent spatial-temporal space and incorporating learnable placeholders such as [nextline] and [nextframe] tokens, Lumina-T2X seamlessly unifies the representations of different modalities across various spatial-temporal resolutions. This unified approach enables training within a single framework for different modalities and allows for flexible generation of multimodal data at any resolution, aspect ratio, and length during inference. 
Advanced techniques like RoPE, RMSNorm, and flow matching enhance the stability, flexibility, and scalability of Flag-DiT, enabling models of Lumina-T2X to scale up to 7 billion parameters and extend the context window to 128K tokens. This is particularly beneficial for creating ultra-high-definition images with our Lumina-T2I model and long 720p videos with our Lumina-T2V model. Remarkably, Lumina-T2I, powered by a 5-billion-parameter Flag-DiT, requires only 35% of the training computational costs of a 600-million-parameter naive DiT. Our further comprehensive analysis underscores Lumina-T2X's preliminary capability in resolution extrapolation, high-resolution editing, generating consistent 3D views, and synthesizing videos with seamless transitions. We expect that the open-sourcing of Lumina-T2X will further foster creativity, transparency, and diversity in the generative AI community.* - - -You can find the original codebase at [Alpha-VLLM](https://github.com/Alpha-VLLM/Lumina-T2X) and all the available checkpoints at [Alpha-VLLM Lumina Family](https://huggingface.co/collections/Alpha-VLLM/lumina-family-66423205bedb81171fd0644b). - -**Highlights**: Lumina-T2X supports Any Modality, Resolution, and Duration. - -Lumina-T2X has the following components: -* It uses a Flow-based Large Diffusion Transformer as the backbone -* It supports different any modalities with one backbone and corresponding encoder, decoder. - -This pipeline was contributed by [PommesPeter](https://github.com/PommesPeter). The original codebase can be found [here](https://github.com/Alpha-VLLM/Lumina-T2X). The original weights can be found under [hf.co/Alpha-VLLM](https://huggingface.co/Alpha-VLLM). - - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. - - - -### Inference (Text-to-Image) - -Use [`torch.compile`](https://huggingface.co/docs/diffusers/main/en/tutorials/fast_diffusion#torchcompile) to reduce the inference latency. - -First, load the pipeline: - -```python -from diffusers import LuminaText2ImgPipeline -import torch - -pipeline = LuminaText2ImgPipeline.from_pretrained( - "Alpha-VLLM/Lumina-Next-SFT-diffusers", torch_dtype=torch.bfloat16 -).to("cuda") -``` - -Then change the memory layout of the pipelines `transformer` and `vae` components to `torch.channels-last`: - -```python -pipeline.transformer.to(memory_format=torch.channels_last) -pipeline.vae.to(memory_format=torch.channels_last) -``` - -Finally, compile the components and run inference: - -```python -pipeline.transformer = torch.compile(pipeline.transformer, mode="max-autotune", fullgraph=True) -pipeline.vae.decode = torch.compile(pipeline.vae.decode, mode="max-autotune", fullgraph=True) - -image = pipeline(prompt="Upper body of a young woman in a Victorian-era outfit with brass goggles and leather straps. 
Background shows an industrial revolution cityscape with smoky skies and tall, metal structures").images[0] -``` - -## LuminaText2ImgPipeline - -[[autodoc]] LuminaText2ImgPipeline - - all - - __call__ - diff --git a/diffusers/docs/source/en/api/pipelines/marigold.md b/diffusers/docs/source/en/api/pipelines/marigold.md deleted file mode 100644 index 374947ce95abe6ea055ccd2e062e91518fbf6d0a..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/marigold.md +++ /dev/null @@ -1,76 +0,0 @@ - - -# Marigold Pipelines for Computer Vision Tasks - -![marigold](https://marigoldmonodepth.github.io/images/teaser_collage_compressed.jpg) - -Marigold was proposed in [Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation](https://huggingface.co/papers/2312.02145), a CVPR 2024 Oral paper by [Bingxin Ke](http://www.kebingxin.com/), [Anton Obukhov](https://www.obukhov.ai/), [Shengyu Huang](https://shengyuh.github.io/), [Nando Metzger](https://nandometzger.github.io/), [Rodrigo Caye Daudt](https://rcdaudt.github.io/), and [Konrad Schindler](https://scholar.google.com/citations?user=FZuNgqIAAAAJ&hl=en). -The idea is to repurpose the rich generative prior of Text-to-Image Latent Diffusion Models (LDMs) for traditional computer vision tasks. -Initially, this idea was explored to fine-tune Stable Diffusion for Monocular Depth Estimation, as shown in the teaser above. -Later, -- [Tianfu Wang](https://tianfwang.github.io/) trained the first Latent Consistency Model (LCM) of Marigold, which unlocked fast single-step inference; -- [Kevin Qu](https://www.linkedin.com/in/kevin-qu-b3417621b/?locale=en_US) extended the approach to Surface Normals Estimation; -- [Anton Obukhov](https://www.obukhov.ai/) contributed the pipelines and documentation into diffusers (enabled and supported by [YiYi Xu](https://yiyixuxu.github.io/) and [Sayak Paul](https://sayak.dev/)). - -The abstract from the paper is: - -*Monocular depth estimation is a fundamental computer vision task. Recovering 3D depth from a single image is geometrically ill-posed and requires scene understanding, so it is not surprising that the rise of deep learning has led to a breakthrough. The impressive progress of monocular depth estimators has mirrored the growth in model capacity, from relatively modest CNNs to large Transformer architectures. Still, monocular depth estimators tend to struggle when presented with images with unfamiliar content and layout, since their knowledge of the visual world is restricted by the data seen during training, and challenged by zero-shot generalization to new domains. This motivates us to explore whether the extensive priors captured in recent generative diffusion models can enable better, more generalizable depth estimation. We introduce Marigold, a method for affine-invariant monocular depth estimation that is derived from Stable Diffusion and retains its rich prior knowledge. The estimator can be fine-tuned in a couple of days on a single GPU using only synthetic training data. It delivers state-of-the-art performance across a wide range of datasets, including over 20% performance gains in specific cases. Project page: https://marigoldmonodepth.github.io.* - -## Available Pipelines - -Each pipeline supports one Computer Vision task, which takes an input RGB image as input and produces a *prediction* of the modality of interest, such as a depth map of the input image. 
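For instance, producing such a depth prediction typically takes only a few lines. The snippet below is a minimal sketch, assuming the LCM depth checkpoint `prs-eth/marigold-depth-lcm-v1-0` and the `MarigoldDepthPipeline` / `visualize_depth` helpers; the table that follows lists the currently implemented tasks.

```python
import torch
import diffusers

# LCM checkpoint of Marigold: fast few-step depth inference.
pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
    "prs-eth/marigold-depth-lcm-v1-0", variant="fp16", torch_dtype=torch.float16
).to("cuda")

# Any RGB image works here; the URL is purely illustrative.
image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")

depth = pipe(image)  # returns an output object with a `prediction` field
vis = pipe.image_processor.visualize_depth(depth.prediction)  # list of PIL images
vis[0].save("einstein_depth.png")
```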
-Currently, the following tasks are implemented: - -| Pipeline | Predicted Modalities | Demos | -|---------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------------------------------------------------------:| -| [MarigoldDepthPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_depth.py) | [Depth](https://en.wikipedia.org/wiki/Depth_map), [Disparity](https://en.wikipedia.org/wiki/Binocular_disparity) | [Fast Demo (LCM)](https://huggingface.co/spaces/prs-eth/marigold-lcm), [Slow Original Demo (DDIM)](https://huggingface.co/spaces/prs-eth/marigold) | -| [MarigoldNormalsPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_normals.py) | [Surface normals](https://en.wikipedia.org/wiki/Normal_mapping) | [Fast Demo (LCM)](https://huggingface.co/spaces/prs-eth/marigold-normals-lcm) | - - -## Available Checkpoints - -The original checkpoints can be found under the [PRS-ETH](https://huggingface.co/prs-eth/) Hugging Face organization. - - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. Also, to know more about reducing the memory usage of this pipeline, refer to the ["Reduce memory usage"] section [here](../../using-diffusers/svd#reduce-memory-usage). - - - - - -Marigold pipelines were designed and tested only with `DDIMScheduler` and `LCMScheduler`. -Depending on the scheduler, the number of inference steps required to get reliable predictions varies, and there is no universal value that works best across schedulers. -Because of that, the default value of `num_inference_steps` in the `__call__` method of the pipeline is set to `None` (see the API reference). -Unless set explicitly, its value will be taken from the checkpoint configuration `model_index.json`. -This is done to ensure high-quality predictions when calling the pipeline with just the `image` argument. - - - -See also Marigold [usage examples](marigold_usage). - -## MarigoldDepthPipeline -[[autodoc]] MarigoldDepthPipeline - - all - - __call__ - -## MarigoldNormalsPipeline -[[autodoc]] MarigoldNormalsPipeline - - all - - __call__ - -## MarigoldDepthOutput -[[autodoc]] pipelines.marigold.pipeline_marigold_depth.MarigoldDepthOutput - -## MarigoldNormalsOutput -[[autodoc]] pipelines.marigold.pipeline_marigold_normals.MarigoldNormalsOutput \ No newline at end of file diff --git a/diffusers/docs/source/en/api/pipelines/mochi.md b/diffusers/docs/source/en/api/pipelines/mochi.md deleted file mode 100644 index f29297e5901c9a5e17e8e5021c24ad0aa865d641..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/mochi.md +++ /dev/null @@ -1,36 +0,0 @@ - - -# Mochi - -[Mochi 1 Preview](https://huggingface.co/genmo/mochi-1-preview) from Genmo. 
- -*Mochi 1 preview is an open state-of-the-art video generation model with high-fidelity motion and strong prompt adherence in preliminary evaluation. This model dramatically closes the gap between closed and open video generation systems. The model is released under a permissive Apache 2.0 license.* - - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. - - - -## MochiPipeline - -[[autodoc]] MochiPipeline - - all - - __call__ - -## MochiPipelineOutput - -[[autodoc]] pipelines.mochi.pipeline_output.MochiPipelineOutput diff --git a/diffusers/docs/source/en/api/pipelines/musicldm.md b/diffusers/docs/source/en/api/pipelines/musicldm.md deleted file mode 100644 index 3ffb6541405da06c5f8c31171c1cf9dde41fb405..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/musicldm.md +++ /dev/null @@ -1,52 +0,0 @@ - - -# MusicLDM - -MusicLDM was proposed in [MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies](https://huggingface.co/papers/2308.01546) by Ke Chen, Yusong Wu, Haohe Liu, Marianna Nezhurina, Taylor Berg-Kirkpatrick, Shlomo Dubnov. -MusicLDM takes a text prompt as input and predicts the corresponding music sample. - -Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview) and [AudioLDM](https://huggingface.co/docs/diffusers/api/pipelines/audioldm), -MusicLDM is a text-to-music _latent diffusion model (LDM)_ that learns continuous audio representations from [CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap) -latents. - -MusicLDM is trained on a corpus of 466 hours of music data. Beat-synchronous data augmentation strategies are applied to the music samples, both in the time domain and in the latent space. Using beat-synchronous data augmentation strategies encourages the model to interpolate between the training samples, but stay within the domain of the training data. The result is generated music that is more diverse while staying faithful to the corresponding style. - -The abstract of the paper is the following: - -*Diffusion models have shown promising results in cross-modal generation tasks, including text-to-image and text-to-audio generation. However, generating music, as a special type of audio, presents unique challenges due to limited availability of music data and sensitive issues related to copyright and plagiarism. In this paper, to tackle these challenges, we first construct a state-of-the-art text-to-music model, MusicLDM, that adapts Stable Diffusion and AudioLDM architectures to the music domain. We achieve this by retraining the contrastive language-audio pretraining model (CLAP) and the Hifi-GAN vocoder, as components of MusicLDM, on a collection of music data samples. Then, to address the limitations of training data and to avoid plagiarism, we leverage a beat tracking model and propose two different mixup strategies for data augmentation: beat-synchronous audio mixup and beat-synchronous latent mixup, which recombine training audio directly or via a latent embeddings space, respectively. 
Such mixup strategies encourage the model to interpolate between musical training samples and generate new music within the convex hull of the training data, making the generated music more diverse while still staying faithful to the corresponding style. In addition to popular evaluation metrics, we design several new evaluation metrics based on CLAP score to demonstrate that our proposed MusicLDM and beat-synchronous mixup strategies improve both the quality and novelty of generated music, as well as the correspondence between input text and generated music.* - -This pipeline was contributed by [sanchit-gandhi](https://huggingface.co/sanchit-gandhi). - -## Tips - -When constructing a prompt, keep in mind: - -* Descriptive prompt inputs work best; use adjectives to describe the sound (for example, "high quality" or "clear") and make the prompt context specific where possible (e.g. "melodic techno with a fast beat and synths" works better than "techno"). -* Using a *negative prompt* can significantly improve the quality of the generated audio. Try using a negative prompt of "low quality, average quality". - -During inference: - -* The _quality_ of the generated audio sample can be controlled by the `num_inference_steps` argument; higher steps give higher quality audio at the expense of slower inference. -* Multiple waveforms can be generated in one go: set `num_waveforms_per_prompt` to a value greater than 1 to enable. Automatic scoring will be performed between the generated waveforms and prompt text, and the audios ranked from best to worst accordingly. -* The _length_ of the generated audio sample can be controlled by varying the `audio_length_in_s` argument. - - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. - - - -## MusicLDMPipeline -[[autodoc]] MusicLDMPipeline - - all - - __call__ diff --git a/diffusers/docs/source/en/api/pipelines/overview.md b/diffusers/docs/source/en/api/pipelines/overview.md deleted file mode 100644 index 02c77d197e34ea0bd61ae086d09c44f88929c8eb..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/overview.md +++ /dev/null @@ -1,113 +0,0 @@ - - -# Pipelines - -Pipelines provide a simple way to run state-of-the-art diffusion models in inference by bundling all of the necessary components (multiple independently-trained models, schedulers, and processors) into a single end-to-end class. Pipelines are flexible and they can be adapted to use different schedulers or even model components. - -All pipelines are built from the base [`DiffusionPipeline`] class which provides basic functionality for loading, downloading, and saving all the components. Specific pipeline types (for example [`StableDiffusionPipeline`]) loaded with [`~DiffusionPipeline.from_pretrained`] are automatically detected and the pipeline components are loaded and passed to the `__init__` function of the pipeline. - - - -You shouldn't use the [`DiffusionPipeline`] class for training. Individual components (for example, [`UNet2DModel`] and [`UNet2DConditionModel`]) of diffusion pipelines are usually trained individually, so we suggest directly working with them instead. - -
- -Pipelines do not offer any training functionality. You'll notice PyTorch's autograd is disabled by decorating the [`~DiffusionPipeline.__call__`] method with a [`torch.no_grad`](https://pytorch.org/docs/stable/generated/torch.no_grad.html) decorator because pipelines should not be used for training. If you're interested in training, please take a look at the [Training](../../training/overview) guides instead! - -
- -The table below lists all the pipelines currently available in 🤗 Diffusers and the tasks they support. Click on a pipeline to view its abstract and published paper. - -| Pipeline | Tasks | -|---|---| -| [aMUSEd](amused) | text2image | -| [AnimateDiff](animatediff) | text2video | -| [Attend-and-Excite](attend_and_excite) | text2image | -| [AudioLDM](audioldm) | text2audio | -| [AudioLDM2](audioldm2) | text2audio | -| [AuraFlow](auraflow) | text2image | -| [BLIP Diffusion](blip_diffusion) | text2image | -| [CogVideoX](cogvideox) | text2video | -| [Consistency Models](consistency_models) | unconditional image generation | -| [ControlNet](controlnet) | text2image, image2image, inpainting | -| [ControlNet with Flux.1](controlnet_flux) | text2image | -| [ControlNet with Hunyuan-DiT](controlnet_hunyuandit) | text2image | -| [ControlNet with Stable Diffusion 3](controlnet_sd3) | text2image | -| [ControlNet with Stable Diffusion XL](controlnet_sdxl) | text2image | -| [ControlNet-XS](controlnetxs) | text2image | -| [ControlNet-XS with Stable Diffusion XL](controlnetxs_sdxl) | text2image | -| [Dance Diffusion](dance_diffusion) | unconditional audio generation | -| [DDIM](ddim) | unconditional image generation | -| [DDPM](ddpm) | unconditional image generation | -| [DeepFloyd IF](deepfloyd_if) | text2image, image2image, inpainting, super-resolution | -| [DiffEdit](diffedit) | inpainting | -| [DiT](dit) | text2image | -| [Flux](flux) | text2image | -| [Hunyuan-DiT](hunyuandit) | text2image | -| [I2VGen-XL](i2vgenxl) | text2video | -| [InstructPix2Pix](pix2pix) | image editing | -| [Kandinsky 2.1](kandinsky) | text2image, image2image, inpainting, interpolation | -| [Kandinsky 2.2](kandinsky_v22) | text2image, image2image, inpainting | -| [Kandinsky 3](kandinsky3) | text2image, image2image | -| [Kolors](kolors) | text2image | -| [Latent Consistency Models](latent_consistency_models) | text2image | -| [Latent Diffusion](latent_diffusion) | text2image, super-resolution | -| [Latte](latte) | text2image | -| [LEDITS++](ledits_pp) | image editing | -| [Lumina-T2X](lumina) | text2image | -| [Marigold](marigold) | depth | -| [MultiDiffusion](panorama) | text2image | -| [MusicLDM](musicldm) | text2audio | -| [PAG](pag) | text2image | -| [Paint by Example](paint_by_example) | inpainting | -| [PIA](pia) | image2video | -| [PixArt-α](pixart) | text2image | -| [PixArt-Σ](pixart_sigma) | text2image | -| [Self-Attention Guidance](self_attention_guidance) | text2image | -| [Semantic Guidance](semantic_stable_diffusion) | text2image | -| [Shap-E](shap_e) | text-to-3D, image-to-3D | -| [Stable Audio](stable_audio) | text2audio | -| [Stable Cascade](stable_cascade) | text2image | -| [Stable Diffusion](stable_diffusion/overview) | text2image, image2image, depth2image, inpainting, image variation, latent upscaler, super-resolution | -| [Stable Diffusion XL](stable_diffusion/stable_diffusion_xl) | text2image, image2image, inpainting | -| [Stable Diffusion XL Turbo](stable_diffusion/sdxl_turbo) | text2image, image2image, inpainting | -| [Stable unCLIP](stable_unclip) | text2image, image variation | -| [T2I-Adapter](stable_diffusion/adapter) | text2image | -| [Text2Video](text_to_video) | text2video, video2video | -| [Text2Video-Zero](text_to_video_zero) | text2video | -| [unCLIP](unclip) | text2image, image variation | -| [UniDiffuser](unidiffuser) | text2image, image2text, image variation, text variation, unconditional image generation, unconditional audio generation | -| [Value-guided planning](value_guided_sampling) | 
value guided sampling | -| [Wuerstchen](wuerstchen) | text2image | - -## DiffusionPipeline - -[[autodoc]] DiffusionPipeline - - all - - __call__ - - device - - to - - components - - -[[autodoc]] pipelines.StableDiffusionMixin.enable_freeu - -[[autodoc]] pipelines.StableDiffusionMixin.disable_freeu - -## FlaxDiffusionPipeline - -[[autodoc]] pipelines.pipeline_flax_utils.FlaxDiffusionPipeline - -## PushToHubMixin - -[[autodoc]] utils.PushToHubMixin diff --git a/diffusers/docs/source/en/api/pipelines/pag.md b/diffusers/docs/source/en/api/pipelines/pag.md deleted file mode 100644 index cc6d075f457f3f52453ed716ed30103584f607d8..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/pag.md +++ /dev/null @@ -1,103 +0,0 @@ - - -# Perturbed-Attention Guidance - -[Perturbed-Attention Guidance (PAG)](https://ku-cvlab.github.io/Perturbed-Attention-Guidance/) is a new diffusion sampling guidance that improves sample quality across both unconditional and conditional settings, achieving this without requiring further training or the integration of external modules. - -PAG was introduced in [Self-Rectifying Diffusion Sampling with Perturbed-Attention Guidance](https://huggingface.co/papers/2403.17377) by Donghoon Ahn, Hyoungwon Cho, Jaewon Min, Wooseok Jang, Jungwoo Kim, SeonHwa Kim, Hyun Hee Park, Kyong Hwan Jin and Seungryong Kim. - -The abstract from the paper is: - -*Recent studies have demonstrated that diffusion models are capable of generating high-quality samples, but their quality heavily depends on sampling guidance techniques, such as classifier guidance (CG) and classifier-free guidance (CFG). These techniques are often not applicable in unconditional generation or in various downstream tasks such as image restoration. In this paper, we propose a novel sampling guidance, called Perturbed-Attention Guidance (PAG), which improves diffusion sample quality across both unconditional and conditional settings, achieving this without requiring additional training or the integration of external modules. PAG is designed to progressively enhance the structure of samples throughout the denoising process. It involves generating intermediate samples with degraded structure by substituting selected self-attention maps in diffusion U-Net with an identity matrix, by considering the self-attention mechanisms' ability to capture structural information, and guiding the denoising process away from these degraded samples. In both ADM and Stable Diffusion, PAG surprisingly improves sample quality in conditional and even unconditional scenarios. Moreover, PAG significantly improves the baseline performance in various downstream tasks where existing guidances such as CG or CFG cannot be fully utilized, including ControlNet with empty prompts and image restoration such as inpainting and deblurring.* - -PAG can be used by specifying the `pag_applied_layers` as a parameter when instantiating a PAG pipeline. It can be a single string or a list of strings. Each string can be a unique layer identifier or a regular expression to identify one or more layers. 
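For example, a PAG-enabled SDXL pipeline might be set up as follows. This is a hedged sketch that assumes the `stabilityai/stable-diffusion-xl-base-1.0` checkpoint and the `AutoPipelineForText2Image` entry point with `enable_pag=True`; the accepted identifier formats for `pag_applied_layers` are listed right below.

```python
import torch
from diffusers import AutoPipelineForText2Image

# `pag_applied_layers` picks the self-attention layers whose maps are perturbed;
# valid identifiers depend on the underlying model (see the formats listed below).
pipeline = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    enable_pag=True,
    pag_applied_layers=["mid"],
    torch_dtype=torch.float16,
).to("cuda")

# `pag_scale` sets the strength of the perturbed-attention guidance term.
image = pipeline(
    "an insect robot preparing a delicious meal",
    guidance_scale=7.0,
    pag_scale=3.0,
).images[0]
image.save("pag_example.png")
```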
- -- Full identifier as a normal string: `down_blocks.2.attentions.0.transformer_blocks.0.attn1.processor` -- Full identifier as a RegEx: `down_blocks.2.(attentions|motion_modules).0.transformer_blocks.0.attn1.processor` -- Partial identifier as a RegEx: `down_blocks.2`, or `attn1` -- List of identifiers (can be combo of strings and ReGex): `["blocks.1", "blocks.(14|20)", r"down_blocks\.(2,3)"]` - - - -Since RegEx is supported as a way for matching layer identifiers, it is crucial to use it correctly otherwise there might be unexpected behaviour. The recommended way to use PAG is by specifying layers as `blocks.{layer_index}` and `blocks.({layer_index_1|layer_index_2|...})`. Using it in any other way, while doable, may bypass our basic validation checks and give you unexpected results. - - - -## AnimateDiffPAGPipeline -[[autodoc]] AnimateDiffPAGPipeline - - all - - __call__ - -## HunyuanDiTPAGPipeline -[[autodoc]] HunyuanDiTPAGPipeline - - all - - __call__ - -## KolorsPAGPipeline -[[autodoc]] KolorsPAGPipeline - - all - - __call__ - -## StableDiffusionPAGPipeline -[[autodoc]] StableDiffusionPAGPipeline - - all - - __call__ - -## StableDiffusionPAGImg2ImgPipeline -[[autodoc]] StableDiffusionPAGImg2ImgPipeline - - all - - __call__ - -## StableDiffusionControlNetPAGPipeline -[[autodoc]] StableDiffusionControlNetPAGPipeline - -## StableDiffusionControlNetPAGInpaintPipeline -[[autodoc]] StableDiffusionControlNetPAGInpaintPipeline - - all - - __call__ - -## StableDiffusionXLPAGPipeline -[[autodoc]] StableDiffusionXLPAGPipeline - - all - - __call__ - -## StableDiffusionXLPAGImg2ImgPipeline -[[autodoc]] StableDiffusionXLPAGImg2ImgPipeline - - all - - __call__ - -## StableDiffusionXLPAGInpaintPipeline -[[autodoc]] StableDiffusionXLPAGInpaintPipeline - - all - - __call__ - -## StableDiffusionXLControlNetPAGPipeline -[[autodoc]] StableDiffusionXLControlNetPAGPipeline - - all - - __call__ - -## StableDiffusionXLControlNetPAGImg2ImgPipeline -[[autodoc]] StableDiffusionXLControlNetPAGImg2ImgPipeline - - all - - __call__ - -## StableDiffusion3PAGPipeline -[[autodoc]] StableDiffusion3PAGPipeline - - all - - __call__ - - -## PixArtSigmaPAGPipeline -[[autodoc]] PixArtSigmaPAGPipeline - - all - - __call__ diff --git a/diffusers/docs/source/en/api/pipelines/paint_by_example.md b/diffusers/docs/source/en/api/pipelines/paint_by_example.md deleted file mode 100644 index effd608873fd5b93cce2a059a8f20ac2f8fd3d42..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/paint_by_example.md +++ /dev/null @@ -1,39 +0,0 @@ - - -# Paint by Example - -[Paint by Example: Exemplar-based Image Editing with Diffusion Models](https://huggingface.co/papers/2211.13227) is by Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, Fang Wen. - -The abstract from the paper is: - -*Language-guided image editing has achieved great success recently. In this paper, for the first time, we investigate exemplar-guided image editing for more precise control. We achieve this goal by leveraging self-supervised training to disentangle and re-organize the source image and the exemplar. However, the naive approach will cause obvious fusing artifacts. We carefully analyze it and propose an information bottleneck and strong augmentations to avoid the trivial solution of directly copying and pasting the exemplar image. 
Meanwhile, to ensure the controllability of the editing process, we design an arbitrary shape mask for the exemplar image and leverage the classifier-free guidance to increase the similarity to the exemplar image. The whole framework involves a single forward of the diffusion model without any iterative optimization. We demonstrate that our method achieves an impressive performance and enables controllable editing on in-the-wild images with high fidelity.* - -The original codebase can be found at [Fantasy-Studio/Paint-by-Example](https://github.com/Fantasy-Studio/Paint-by-Example), and you can try it out in a [demo](https://huggingface.co/spaces/Fantasy-Studio/Paint-by-Example). - -## Tips - -Paint by Example is supported by the official [Fantasy-Studio/Paint-by-Example](https://huggingface.co/Fantasy-Studio/Paint-by-Example) checkpoint. The checkpoint is warm-started from [CompVis/stable-diffusion-v1-4](https://huggingface.co/CompVis/stable-diffusion-v1-4) to inpaint partly masked images conditioned on example and reference images. - - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. - - - -## PaintByExamplePipeline -[[autodoc]] PaintByExamplePipeline - - all - - __call__ - -## StableDiffusionPipelineOutput -[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput diff --git a/diffusers/docs/source/en/api/pipelines/panorama.md b/diffusers/docs/source/en/api/pipelines/panorama.md deleted file mode 100644 index b34008ad830fe090c17da0892d7c032c67634b67..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/panorama.md +++ /dev/null @@ -1,50 +0,0 @@ - - -# MultiDiffusion - -[MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation](https://huggingface.co/papers/2302.08113) is by Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. - -The abstract from the paper is: - -*Recent advances in text-to-image generation with diffusion models present transformative capabilities in image quality. However, user controllability of the generated image, and fast adaptation to new tasks still remains an open challenge, currently mostly addressed by costly and long re-training and fine-tuning or ad-hoc adaptations to specific image generation tasks. In this work, we present MultiDiffusion, a unified framework that enables versatile and controllable image generation, using a pre-trained text-to-image diffusion model, without any further training or finetuning. At the center of our approach is a new generation process, based on an optimization task that binds together multiple diffusion generation processes with a shared set of parameters or constraints. We show that MultiDiffusion can be readily applied to generate high quality and diverse images that adhere to user-provided controls, such as desired aspect ratio (e.g., panorama), and spatial guiding signals, ranging from tight segmentation masks to bounding boxes.* - -You can find additional information about MultiDiffusion on the [project page](https://multidiffusion.github.io/), [original codebase](https://github.com/omerbt/MultiDiffusion), and try it out in a [demo](https://huggingface.co/spaces/weizmannscience/MultiDiffusion). 
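The snippet below is a minimal text-to-panorama sketch; it assumes the `stabilityai/stable-diffusion-2-base` checkpoint and uses the `width` and `circular_padding` arguments discussed in the Tips that follow.

```python
import torch
from diffusers import StableDiffusionPanoramaPipeline, DDIMScheduler

model_ckpt = "stabilityai/stable-diffusion-2-base"
scheduler = DDIMScheduler.from_pretrained(model_ckpt, subfolder="scheduler")
pipeline = StableDiffusionPanoramaPipeline.from_pretrained(
    model_ckpt, scheduler=scheduler, torch_dtype=torch.float16
).to("cuda")

# A wide `width` plus circular padding yields a seamless 360-degree panorama.
image = pipeline(
    "a photo of the dolomites",
    width=2048,
    circular_padding=True,
).images[0]
image.save("panorama.png")
```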
- -## Tips - -While calling [`StableDiffusionPanoramaPipeline`], it's possible to specify the `view_batch_size` parameter to be > 1. -For some GPUs with high performance, this can speedup the generation process and increase VRAM usage. - -To generate panorama-like images make sure you pass the width parameter accordingly. We recommend a width value of 2048 which is the default. - -Circular padding is applied to ensure there are no stitching artifacts when working with panoramas to ensure a seamless transition from the rightmost part to the leftmost part. By enabling circular padding (set `circular_padding=True`), the operation applies additional crops after the rightmost point of the image, allowing the model to "see” the transition from the rightmost part to the leftmost part. This helps maintain visual consistency in a 360-degree sense and creates a proper “panorama” that can be viewed using 360-degree panorama viewers. When decoding latents in Stable Diffusion, circular padding is applied to ensure that the decoded latents match in the RGB space. - -For example, without circular padding, there is a stitching artifact (default): -![img](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/indoor_%20no_circular_padding.png) - -But with circular padding, the right and the left parts are matching (`circular_padding=True`): -![img](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/indoor_%20circular_padding.png) - - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. - - - -## StableDiffusionPanoramaPipeline -[[autodoc]] StableDiffusionPanoramaPipeline - - __call__ - - all - -## StableDiffusionPipelineOutput -[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput diff --git a/diffusers/docs/source/en/api/pipelines/pia.md b/diffusers/docs/source/en/api/pipelines/pia.md deleted file mode 100644 index 8ba78252c99b8df21d2570a9dcb601e31664aa51..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/pia.md +++ /dev/null @@ -1,167 +0,0 @@ - - -# Image-to-Video Generation with PIA (Personalized Image Animator) - -## Overview - -[PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models](https://arxiv.org/abs/2312.13964) by Yiming Zhang, Zhening Xing, Yanhong Zeng, Youqing Fang, Kai Chen - -Recent advancements in personalized text-to-image (T2I) models have revolutionized content creation, empowering non-experts to generate stunning images with unique styles. While promising, adding realistic motions into these personalized images by text poses significant challenges in preserving distinct styles, high-fidelity details, and achieving motion controllability by text. In this paper, we present PIA, a Personalized Image Animator that excels in aligning with condition images, achieving motion controllability by text, and the compatibility with various personalized T2I models without specific tuning. To achieve these goals, PIA builds upon a base T2I model with well-trained temporal alignment layers, allowing for the seamless transformation of any personalized T2I model into an image animation model. 
A key component of PIA is the introduction of the condition module, which utilizes the condition frame and inter-frame affinity as input to transfer appearance information guided by the affinity hint for individual frame synthesis in the latent space. This design mitigates the challenges of appearance-related image alignment within and allows for a stronger focus on aligning with motion-related guidance. - -[Project page](https://pi-animator.github.io/) - -## Available Pipelines - -| Pipeline | Tasks | Demo -|---|---|:---:| -| [PIAPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/pia/pipeline_pia.py) | *Image-to-Video Generation with PIA* | - -## Available checkpoints - -Motion Adapter checkpoints for PIA can be found under the [OpenMMLab org](https://huggingface.co/openmmlab/PIA-condition-adapter). These checkpoints are meant to work with any model based on Stable Diffusion 1.5 - -## Usage example - -PIA works with a MotionAdapter checkpoint and a Stable Diffusion 1.5 model checkpoint. The MotionAdapter is a collection of Motion Modules that are responsible for adding coherent motion across image frames. These modules are applied after the Resnet and Attention blocks in the Stable Diffusion UNet. In addition to the motion modules, PIA also replaces the input convolution layer of the SD 1.5 UNet model with a 9 channel input convolution layer. - -The following example demonstrates how to use PIA to generate a video from a single image. - -```python -import torch -from diffusers import ( - EulerDiscreteScheduler, - MotionAdapter, - PIAPipeline, -) -from diffusers.utils import export_to_gif, load_image - -adapter = MotionAdapter.from_pretrained("openmmlab/PIA-condition-adapter") -pipe = PIAPipeline.from_pretrained("SG161222/Realistic_Vision_V6.0_B1_noVAE", motion_adapter=adapter, torch_dtype=torch.float16) - -pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config) -pipe.enable_model_cpu_offload() -pipe.enable_vae_slicing() - -image = load_image( - "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/pix2pix/cat_6.png?download=true" -) -image = image.resize((512, 512)) -prompt = "cat in a field" -negative_prompt = "wrong white balance, dark, sketches,worst quality,low quality" - -generator = torch.Generator("cpu").manual_seed(0) -output = pipe(image=image, prompt=prompt, generator=generator) -frames = output.frames[0] -export_to_gif(frames, "pia-animation.gif") -``` - -Here are some sample outputs: - - - - - -
- *Sample output: "cat in a field" (animated GIF).*
- - - - -If you plan on using a scheduler that can clip samples, make sure to disable it by setting `clip_sample=False` in the scheduler as this can also have an adverse effect on generated samples. Additionally, the PIA checkpoints can be sensitive to the beta schedule of the scheduler. We recommend setting this to `linear`. - - - -## Using FreeInit - -[FreeInit: Bridging Initialization Gap in Video Diffusion Models](https://arxiv.org/abs/2312.07537) by Tianxing Wu, Chenyang Si, Yuming Jiang, Ziqi Huang, Ziwei Liu. - -FreeInit is an effective method that improves temporal consistency and overall quality of videos generated using video-diffusion-models without any addition training. It can be applied to PIA, AnimateDiff, ModelScope, VideoCrafter and various other video generation models seamlessly at inference time, and works by iteratively refining the latent-initialization noise. More details can be found it the paper. - -The following example demonstrates the usage of FreeInit. - -```python -import torch -from diffusers import ( - DDIMScheduler, - MotionAdapter, - PIAPipeline, -) -from diffusers.utils import export_to_gif, load_image - -adapter = MotionAdapter.from_pretrained("openmmlab/PIA-condition-adapter") -pipe = PIAPipeline.from_pretrained("SG161222/Realistic_Vision_V6.0_B1_noVAE", motion_adapter=adapter) - -# enable FreeInit -# Refer to the enable_free_init documentation for a full list of configurable parameters -pipe.enable_free_init(method="butterworth", use_fast_sampling=True) - -# Memory saving options -pipe.enable_model_cpu_offload() -pipe.enable_vae_slicing() - -pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config) -image = load_image( - "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/pix2pix/cat_6.png?download=true" -) -image = image.resize((512, 512)) -prompt = "cat in a field" -negative_prompt = "wrong white balance, dark, sketches,worst quality,low quality" - -generator = torch.Generator("cpu").manual_seed(0) - -output = pipe(image=image, prompt=prompt, generator=generator) -frames = output.frames[0] -export_to_gif(frames, "pia-freeinit-animation.gif") -``` - - - - - -
- *Sample output with FreeInit: "cat in a field" (animated GIF).*
- - - - -FreeInit is not really free - the improved quality comes at the cost of extra computation. It requires sampling a few extra times depending on the `num_iters` parameter that is set when enabling it. Setting the `use_fast_sampling` parameter to `True` can improve the overall performance (at the cost of lower quality compared to when `use_fast_sampling=False` but still better results than vanilla video generation models). - - - -## PIAPipeline - -[[autodoc]] PIAPipeline - - all - - __call__ - - enable_freeu - - disable_freeu - - enable_free_init - - disable_free_init - - enable_vae_slicing - - disable_vae_slicing - - enable_vae_tiling - - disable_vae_tiling - -## PIAPipelineOutput - -[[autodoc]] pipelines.pia.PIAPipelineOutput \ No newline at end of file diff --git a/diffusers/docs/source/en/api/pipelines/pix2pix.md b/diffusers/docs/source/en/api/pipelines/pix2pix.md deleted file mode 100644 index 52767a90b2144721e6fe70ab0970bf735f0e90ee..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/pix2pix.md +++ /dev/null @@ -1,40 +0,0 @@ - - -# InstructPix2Pix - -[InstructPix2Pix: Learning to Follow Image Editing Instructions](https://huggingface.co/papers/2211.09800) is by Tim Brooks, Aleksander Holynski and Alexei A. Efros. - -The abstract from the paper is: - -*We propose a method for editing images from human instructions: given an input image and a written instruction that tells the model what to do, our model follows these instructions to edit the image. To obtain training data for this problem, we combine the knowledge of two large pretrained models -- a language model (GPT-3) and a text-to-image model (Stable Diffusion) -- to generate a large dataset of image editing examples. Our conditional diffusion model, InstructPix2Pix, is trained on our generated data, and generalizes to real images and user-written instructions at inference time. Since it performs edits in the forward pass and does not require per example fine-tuning or inversion, our model edits images quickly, in a matter of seconds. We show compelling editing results for a diverse collection of input images and written instructions.* - -You can find additional information about InstructPix2Pix on the [project page](https://www.timothybrooks.com/instruct-pix2pix), [original codebase](https://github.com/timothybrooks/instruct-pix2pix), and try it out in a [demo](https://huggingface.co/spaces/timbrooks/instruct-pix2pix). - - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. 
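Before the API reference, here is a minimal editing sketch. It assumes the official `timbrooks/instruct-pix2pix` checkpoint and reuses an example image linked elsewhere in these docs; `image_guidance_scale` trades off faithfulness to the input image against adherence to the instruction.

```python
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from diffusers.utils import load_image

pipeline = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/pix2pix/cat_6.png?download=true"
)

# The prompt is the written edit instruction, not a description of the target image.
edited = pipeline(
    "turn the cat into a tiger",
    image=image,
    num_inference_steps=20,
    image_guidance_scale=1.5,
    guidance_scale=7.5,
).images[0]
edited.save("edited_cat.png")
```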
- - - -## StableDiffusionInstructPix2PixPipeline -[[autodoc]] StableDiffusionInstructPix2PixPipeline - - __call__ - - all - - load_textual_inversion - - load_lora_weights - - save_lora_weights - -## StableDiffusionXLInstructPix2PixPipeline -[[autodoc]] StableDiffusionXLInstructPix2PixPipeline - - __call__ - - all diff --git a/diffusers/docs/source/en/api/pipelines/pixart.md b/diffusers/docs/source/en/api/pipelines/pixart.md deleted file mode 100644 index b2bef501b237f6bfe9e0d8324261219930dc4c74..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/pixart.md +++ /dev/null @@ -1,148 +0,0 @@ - - -# PixArt-α - -![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/pixart/header_collage.png) - -[PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis](https://huggingface.co/papers/2310.00426) is Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. - -The abstract from the paper is: - -*The most advanced text-to-image (T2I) models require significant training costs (e.g., millions of GPU hours), seriously hindering the fundamental innovation for the AIGC community while increasing CO2 emissions. This paper introduces PIXART-α, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators (e.g., Imagen, SDXL, and even Midjourney), reaching near-commercial application standards. Additionally, it supports high-resolution image synthesis up to 1024px resolution with low training cost, as shown in Figure 1 and 2. To achieve this goal, three core designs are proposed: (1) Training strategy decomposition: We devise three distinct training steps that separately optimize pixel dependency, text-image alignment, and image aesthetic quality; (2) Efficient T2I Transformer: We incorporate cross-attention modules into Diffusion Transformer (DiT) to inject text conditions and streamline the computation-intensive class-condition branch; (3) High-informative data: We emphasize the significance of concept density in text-image pairs and leverage a large Vision-Language model to auto-label dense pseudo-captions to assist text-image alignment learning. As a result, PIXART-α's training speed markedly surpasses existing large-scale T2I models, e.g., PIXART-α only takes 10.8% of Stable Diffusion v1.5's training time (675 vs. 6,250 A100 GPU days), saving nearly $300,000 ($26,000 vs. $320,000) and reducing 90% CO2 emissions. Moreover, compared with a larger SOTA model, RAPHAEL, our training cost is merely 1%. Extensive experiments demonstrate that PIXART-α excels in image quality, artistry, and semantic control. We hope PIXART-α will provide new insights to the AIGC community and startups to accelerate building their own high-quality yet low-cost generative models from scratch.* - -You can find the original codebase at [PixArt-alpha/PixArt-alpha](https://github.com/PixArt-alpha/PixArt-alpha) and all the available checkpoints at [PixArt-alpha](https://huggingface.co/PixArt-alpha). - -Some notes about this pipeline: - -* It uses a Transformer backbone (instead of a UNet) for denoising. As such it has a similar architecture as [DiT](./dit). -* It was trained using text conditions computed from T5. This aspect makes the pipeline better at following complex text prompts with intricate details. 
-* It is good at producing high-resolution images at different aspect ratios. To get the best results, the authors recommend some size brackets which can be found [here](https://github.com/PixArt-alpha/PixArt-alpha/blob/08fbbd281ec96866109bdd2cdb75f2f58fb17610/diffusion/data/datasets/utils.py). -* It rivals the quality of state-of-the-art text-to-image generation systems (as of this writing) such as Stable Diffusion XL, Imagen, and DALL-E 2, while being more efficient than them. - - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. - - - -## Inference with under 8GB GPU VRAM - -Run the [`PixArtAlphaPipeline`] with under 8GB GPU VRAM by loading the text encoder in 8-bit precision. Let's walk through a full-fledged example. - -First, install the [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) library: - -```bash -pip install -U bitsandbytes -``` - -Then load the text encoder in 8-bit: - -```python -from transformers import T5EncoderModel -from diffusers import PixArtAlphaPipeline -import torch - -text_encoder = T5EncoderModel.from_pretrained( - "PixArt-alpha/PixArt-XL-2-1024-MS", - subfolder="text_encoder", - load_in_8bit=True, - device_map="auto", - -) -pipe = PixArtAlphaPipeline.from_pretrained( - "PixArt-alpha/PixArt-XL-2-1024-MS", - text_encoder=text_encoder, - transformer=None, - device_map="auto" -) -``` - -Now, use the `pipe` to encode a prompt: - -```python -with torch.no_grad(): - prompt = "cute cat" - prompt_embeds, prompt_attention_mask, negative_embeds, negative_prompt_attention_mask = pipe.encode_prompt(prompt) -``` - -Since text embeddings have been computed, remove the `text_encoder` and `pipe` from the memory, and free up some GPU VRAM: - -```python -import gc - -def flush(): - gc.collect() - torch.cuda.empty_cache() - -del text_encoder -del pipe -flush() -``` - -Then compute the latents with the prompt embeddings as inputs: - -```python -pipe = PixArtAlphaPipeline.from_pretrained( - "PixArt-alpha/PixArt-XL-2-1024-MS", - text_encoder=None, - torch_dtype=torch.float16, -).to("cuda") - -latents = pipe( - negative_prompt=None, - prompt_embeds=prompt_embeds, - negative_prompt_embeds=negative_embeds, - prompt_attention_mask=prompt_attention_mask, - negative_prompt_attention_mask=negative_prompt_attention_mask, - num_images_per_prompt=1, - output_type="latent", -).images - -del pipe.transformer -flush() -``` - - - -Notice that while initializing `pipe`, you're setting `text_encoder` to `None` so that it's not loaded. - - - -Once the latents are computed, pass it off to the VAE to decode into a real image: - -```python -with torch.no_grad(): - image = pipe.vae.decode(latents / pipe.vae.config.scaling_factor, return_dict=False)[0] -image = pipe.image_processor.postprocess(image, output_type="pil")[0] -image.save("cat.png") -``` - -By deleting components you aren't using and flushing the GPU VRAM, you should be able to run [`PixArtAlphaPipeline`] with under 8GB GPU VRAM. - -![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/pixart/8bits_cat.png) - -If you want a report of your memory-usage, run this [script](https://gist.github.com/sayakpaul/3ae0f847001d342af27018a96f467e4e). 
- - - -Text embeddings computed in 8-bit can impact the quality of the generated images because of the information loss in the representation space caused by the reduced precision. It's recommended to compare the outputs with and without 8-bit. - - - -While loading the `text_encoder`, you set `load_in_8bit` to `True`. You could also specify `load_in_4bit` to bring your memory requirements down even further to under 7GB. - -## PixArtAlphaPipeline - -[[autodoc]] PixArtAlphaPipeline - - all - - __call__ diff --git a/diffusers/docs/source/en/api/pipelines/pixart_sigma.md b/diffusers/docs/source/en/api/pipelines/pixart_sigma.md deleted file mode 100644 index 592ba0f374bea6c57e23d2829154397ce4b1f701..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/pixart_sigma.md +++ /dev/null @@ -1,155 +0,0 @@ - - -# PixArt-Σ - -![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/pixart/header_collage_sigma.jpg) - -[PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation](https://huggingface.co/papers/2403.04692) is Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. - -The abstract from the paper is: - -*In this paper, we introduce PixArt-Σ, a Diffusion Transformer model (DiT) capable of directly generating images at 4K resolution. PixArt-Σ represents a significant advancement over its predecessor, PixArt-α, offering images of markedly higher fidelity and improved alignment with text prompts. A key feature of PixArt-Σ is its training efficiency. Leveraging the foundational pre-training of PixArt-α, it evolves from the ‘weaker’ baseline to a ‘stronger’ model via incorporating higher quality data, a process we term “weak-to-strong training”. The advancements in PixArt-Σ are twofold: (1) High-Quality Training Data: PixArt-Σ incorporates superior-quality image data, paired with more precise and detailed image captions. (2) Efficient Token Compression: we propose a novel attention module within the DiT framework that compresses both keys and values, significantly improving efficiency and facilitating ultra-high-resolution image generation. Thanks to these improvements, PixArt-Σ achieves superior image quality and user prompt adherence capabilities with significantly smaller model size (0.6B parameters) than existing text-to-image diffusion models, such as SDXL (2.6B parameters) and SD Cascade (5.1B parameters). Moreover, PixArt-Σ’s capability to generate 4K images supports the creation of high-resolution posters and wallpapers, efficiently bolstering the production of highquality visual content in industries such as film and gaming.* - -You can find the original codebase at [PixArt-alpha/PixArt-sigma](https://github.com/PixArt-alpha/PixArt-sigma) and all the available checkpoints at [PixArt-alpha](https://huggingface.co/PixArt-alpha). - -Some notes about this pipeline: - -* It uses a Transformer backbone (instead of a UNet) for denoising. As such it has a similar architecture as [DiT](https://hf.co/docs/transformers/model_doc/dit). -* It was trained using text conditions computed from T5. This aspect makes the pipeline better at following complex text prompts with intricate details. -* It is good at producing high-resolution images at different aspect ratios. 
To get the best results, the authors recommend some size brackets which can be found [here](https://github.com/PixArt-alpha/PixArt-sigma/blob/master/diffusion/data/datasets/utils.py). -* It rivals the quality of state-of-the-art text-to-image generation systems (as of this writing) such as PixArt-α, Stable Diffusion XL, Playground V2.0 and DALL-E 3, while being more efficient than them. -* It shows the ability of generating super high resolution images, such as 2048px or even 4K. -* It shows that text-to-image models can grow from a weak model to a stronger one through several improvements (VAEs, datasets, and so on.) - - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. - - - - - -You can further improve generation quality by passing the generated image from [`PixArtSigmaPipeline`] to the [SDXL refiner](../../using-diffusers/sdxl#base-to-refiner-model) model. - - - -## Inference with under 8GB GPU VRAM - -Run the [`PixArtSigmaPipeline`] with under 8GB GPU VRAM by loading the text encoder in 8-bit precision. Let's walk through a full-fledged example. - -First, install the [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) library: - -```bash -pip install -U bitsandbytes -``` - -Then load the text encoder in 8-bit: - -```python -from transformers import T5EncoderModel -from diffusers import PixArtSigmaPipeline -import torch - -text_encoder = T5EncoderModel.from_pretrained( - "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS", - subfolder="text_encoder", - load_in_8bit=True, - device_map="auto", -) -pipe = PixArtSigmaPipeline.from_pretrained( - "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS", - text_encoder=text_encoder, - transformer=None, - device_map="balanced" -) -``` - -Now, use the `pipe` to encode a prompt: - -```python -with torch.no_grad(): - prompt = "cute cat" - prompt_embeds, prompt_attention_mask, negative_embeds, negative_prompt_attention_mask = pipe.encode_prompt(prompt) -``` - -Since text embeddings have been computed, remove the `text_encoder` and `pipe` from the memory, and free up some GPU VRAM: - -```python -import gc - -def flush(): - gc.collect() - torch.cuda.empty_cache() - -del text_encoder -del pipe -flush() -``` - -Then compute the latents with the prompt embeddings as inputs: - -```python -pipe = PixArtSigmaPipeline.from_pretrained( - "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS", - text_encoder=None, - torch_dtype=torch.float16, -).to("cuda") - -latents = pipe( - negative_prompt=None, - prompt_embeds=prompt_embeds, - negative_prompt_embeds=negative_embeds, - prompt_attention_mask=prompt_attention_mask, - negative_prompt_attention_mask=negative_prompt_attention_mask, - num_images_per_prompt=1, - output_type="latent", -).images - -del pipe.transformer -flush() -``` - - - -Notice that while initializing `pipe`, you're setting `text_encoder` to `None` so that it's not loaded. 
- - - -Once the latents are computed, pass it off to the VAE to decode into a real image: - -```python -with torch.no_grad(): - image = pipe.vae.decode(latents / pipe.vae.config.scaling_factor, return_dict=False)[0] -image = pipe.image_processor.postprocess(image, output_type="pil")[0] -image.save("cat.png") -``` - -By deleting components you aren't using and flushing the GPU VRAM, you should be able to run [`PixArtSigmaPipeline`] with under 8GB GPU VRAM. - -![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/pixart/8bits_cat.png) - -If you want a report of your memory-usage, run this [script](https://gist.github.com/sayakpaul/3ae0f847001d342af27018a96f467e4e). - - - -Text embeddings computed in 8-bit can impact the quality of the generated images because of the information loss in the representation space caused by the reduced precision. It's recommended to compare the outputs with and without 8-bit. - - - -While loading the `text_encoder`, you set `load_in_8bit` to `True`. You could also specify `load_in_4bit` to bring your memory requirements down even further to under 7GB. - -## PixArtSigmaPipeline - -[[autodoc]] PixArtSigmaPipeline - - all - - __call__ diff --git a/diffusers/docs/source/en/api/pipelines/self_attention_guidance.md b/diffusers/docs/source/en/api/pipelines/self_attention_guidance.md deleted file mode 100644 index e56aae2a775b29dbe31a63bd00c3fc5e9333e95b..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/self_attention_guidance.md +++ /dev/null @@ -1,35 +0,0 @@ - - -# Self-Attention Guidance - -[Improving Sample Quality of Diffusion Models Using Self-Attention Guidance](https://huggingface.co/papers/2210.00939) is by Susung Hong et al. - -The abstract from the paper is: - -*Denoising diffusion models (DDMs) have attracted attention for their exceptional generation quality and diversity. This success is largely attributed to the use of class- or text-conditional diffusion guidance methods, such as classifier and classifier-free guidance. In this paper, we present a more comprehensive perspective that goes beyond the traditional guidance methods. From this generalized perspective, we introduce novel condition- and training-free strategies to enhance the quality of generated images. As a simple solution, blur guidance improves the suitability of intermediate samples for their fine-scale information and structures, enabling diffusion models to generate higher quality samples with a moderate guidance scale. Improving upon this, Self-Attention Guidance (SAG) uses the intermediate self-attention maps of diffusion models to enhance their stability and efficacy. Specifically, SAG adversarially blurs only the regions that diffusion models attend to at each iteration and guides them accordingly. Our experimental results show that our SAG improves the performance of various diffusion models, including ADM, IDDPM, Stable Diffusion, and DiT. Moreover, combining SAG with conventional guidance methods leads to further improvement.* - -You can find additional information about Self-Attention Guidance on the [project page](https://ku-cvlab.github.io/Self-Attention-Guidance), [original codebase](https://github.com/KU-CVLAB/Self-Attention-Guidance), and try it out in a [demo](https://huggingface.co/spaces/susunghong/Self-Attention-Guidance) or [notebook](https://colab.research.google.com/github/SusungHong/Self-Attention-Guidance/blob/main/SAG_Stable.ipynb). 
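A minimal sketch of SAG on top of Stable Diffusion might look like the following; the checkpoint is illustrative, and `sag_scale` sets the strength of the self-attention guidance, which can be combined with regular classifier-free guidance via `guidance_scale`.

```python
import torch
from diffusers import StableDiffusionSAGPipeline

pipeline = StableDiffusionSAGPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# sag_scale > 0 enables Self-Attention Guidance alongside classifier-free guidance.
image = pipeline(
    "a photo of an astronaut riding a horse on mars",
    sag_scale=0.75,
    guidance_scale=7.5,
).images[0]
image.save("astronaut_sag.png")
```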
- - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. - - - -## StableDiffusionSAGPipeline -[[autodoc]] StableDiffusionSAGPipeline - - __call__ - - all - -## StableDiffusionOutput -[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput diff --git a/diffusers/docs/source/en/api/pipelines/semantic_stable_diffusion.md b/diffusers/docs/source/en/api/pipelines/semantic_stable_diffusion.md deleted file mode 100644 index 19a0a8116989590da623a0ee1abee0fcdb629016..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/semantic_stable_diffusion.md +++ /dev/null @@ -1,35 +0,0 @@ - - -# Semantic Guidance - -Semantic Guidance for Diffusion Models was proposed in [SEGA: Instructing Text-to-Image Models using Semantic Guidance](https://huggingface.co/papers/2301.12247) and provides strong semantic control over image generation. -Small changes to the text prompt usually result in entirely different output images. However, with SEGA a variety of changes to the image are enabled that can be controlled easily and intuitively, while staying true to the original image composition. - -The abstract from the paper is: - -*Text-to-image diffusion models have recently received a lot of interest for their astonishing ability to produce high-fidelity images from text only. However, achieving one-shot generation that aligns with the user's intent is nearly impossible, yet small changes to the input prompt often result in very different images. This leaves the user with little semantic control. To put the user in control, we show how to interact with the diffusion process to flexibly steer it along semantic directions. This semantic guidance (SEGA) generalizes to any generative architecture using classifier-free guidance. More importantly, it allows for subtle and extensive edits, changes in composition and style, as well as optimizing the overall artistic conception. We demonstrate SEGA's effectiveness on both latent and pixel-based diffusion models such as Stable Diffusion, Paella, and DeepFloyd-IF using a variety of tasks, thus providing strong evidence for its versatility, flexibility, and improvements over existing methods.* - - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. 
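To give a feel for the API before the reference below, here is a hedged sketch of a semantic edit; the checkpoint and the concrete edit parameters are illustrative. `editing_prompt` supplies the semantic direction(s), and `reverse_editing_direction` flips a direction to remove a concept instead of adding it.

```python
import torch
from diffusers import SemanticStableDiffusionPipeline

pipeline = SemanticStableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Steer the generation along a semantic direction while keeping the composition.
image = pipeline(
    "a photo of the face of a woman",
    editing_prompt=["smiling, smile"],      # concept to add
    reverse_editing_direction=[False],      # set to True to remove the concept
    edit_guidance_scale=[5.0],
    edit_warmup_steps=[10],
    edit_threshold=[0.99],
).images[0]
image.save("sega_edit.png")
```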
- - - -## SemanticStableDiffusionPipeline -[[autodoc]] SemanticStableDiffusionPipeline - - all - - __call__ - -## SemanticStableDiffusionPipelineOutput -[[autodoc]] pipelines.semantic_stable_diffusion.pipeline_output.SemanticStableDiffusionPipelineOutput - - all diff --git a/diffusers/docs/source/en/api/pipelines/shap_e.md b/diffusers/docs/source/en/api/pipelines/shap_e.md deleted file mode 100644 index 9f9155c79e895c1d0215fac37ac32ce89dce6f96..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/shap_e.md +++ /dev/null @@ -1,37 +0,0 @@ - - -# Shap-E - -The Shap-E model was proposed in [Shap-E: Generating Conditional 3D Implicit Functions](https://huggingface.co/papers/2305.02463) by Alex Nichol and Heewoo Jun from [OpenAI](https://github.com/openai). - -The abstract from the paper is: - -*We present Shap-E, a conditional generative model for 3D assets. Unlike recent work on 3D generative models which produce a single output representation, Shap-E directly generates the parameters of implicit functions that can be rendered as both textured meshes and neural radiance fields. We train Shap-E in two stages: first, we train an encoder that deterministically maps 3D assets into the parameters of an implicit function; second, we train a conditional diffusion model on outputs of the encoder. When trained on a large dataset of paired 3D and text data, our resulting models are capable of generating complex and diverse 3D assets in a matter of seconds. When compared to Point-E, an explicit generative model over point clouds, Shap-E converges faster and reaches comparable or better sample quality despite modeling a higher-dimensional, multi-representation output space.* - -The original codebase can be found at [openai/shap-e](https://github.com/openai/shap-e). - - - -See the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. - - - -## ShapEPipeline -[[autodoc]] ShapEPipeline - - all - - __call__ - -## ShapEImg2ImgPipeline -[[autodoc]] ShapEImg2ImgPipeline - - all - - __call__ - -## ShapEPipelineOutput -[[autodoc]] pipelines.shap_e.pipeline_shap_e.ShapEPipelineOutput diff --git a/diffusers/docs/source/en/api/pipelines/stable_audio.md b/diffusers/docs/source/en/api/pipelines/stable_audio.md deleted file mode 100644 index a6d34a0697d5da0b03095cda14b5e19aa1b8b6ef..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/stable_audio.md +++ /dev/null @@ -1,42 +0,0 @@ - - -# Stable Audio - -Stable Audio was proposed in [Stable Audio Open](https://arxiv.org/abs/2407.14358) by Zach Evans et al. . it takes a text prompt as input and predicts the corresponding sound or music sample. - -Stable Audio Open generates variable-length (up to 47s) stereo audio at 44.1kHz from text prompts. It comprises three components: an autoencoder that compresses waveforms into a manageable sequence length, a T5-based text embedding for text conditioning, and a transformer-based diffusion (DiT) model that operates in the latent space of the autoencoder. - -Stable Audio is trained on a corpus of around 48k audio recordings, where around 47k are from Freesound and the rest are from the Free Music Archive (FMA). All audio files are licensed under CC0, CC BY, or CC Sampling+. This data is used to train the autoencoder and the DiT. 
- -The abstract of the paper is the following: -*Open generative models are vitally important for the community, allowing for fine-tunes and serving as baselines when presenting new models. However, most current text-to-audio models are private and not accessible for artists and researchers to build upon. Here we describe the architecture and training process of a new open-weights text-to-audio model trained with Creative Commons data. Our evaluation shows that the model's performance is competitive with the state-of-the-art across various metrics. Notably, the reported FDopenl3 results (measuring the realism of the generations) showcase its potential for high-quality stereo sound synthesis at 44.1kHz.* - -This pipeline was contributed by [Yoach Lacombe](https://huggingface.co/ylacombe). The original codebase can be found at [Stability-AI/stable-audio-tools](https://github.com/Stability-AI/stable-audio-tools). - -## Tips - -When constructing a prompt, keep in mind: - -* Descriptive prompt inputs work best; use adjectives to describe the sound (for example, "high quality" or "clear") and make the prompt context specific where possible (e.g. "melodic techno with a fast beat and synths" works better than "techno"). -* Using a *negative prompt* can significantly improve the quality of the generated audio. Try using a negative prompt of "low quality, average quality". - -During inference: - -* The _quality_ of the generated audio sample can be controlled by the `num_inference_steps` argument; higher steps give higher quality audio at the expense of slower inference. -* Multiple waveforms can be generated in one go: set `num_waveforms_per_prompt` to a value greater than 1 to enable. Automatic scoring will be performed between the generated waveforms and prompt text, and the audios ranked from best to worst accordingly. - - -## StableAudioPipeline -[[autodoc]] StableAudioPipeline - - all - - __call__ diff --git a/diffusers/docs/source/en/api/pipelines/stable_cascade.md b/diffusers/docs/source/en/api/pipelines/stable_cascade.md deleted file mode 100644 index 93a94d66c1093bbf7efd295249d70acb6cb58624..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/stable_cascade.md +++ /dev/null @@ -1,229 +0,0 @@ - - -# Stable Cascade - -This model is built upon the [Würstchen](https://openreview.net/forum?id=gU58d5QeGv) architecture and its main -difference to other models like Stable Diffusion is that it is working at a much smaller latent space. Why is this -important? The smaller the latent space, the **faster** you can run inference and the **cheaper** the training becomes. -How small is the latent space? Stable Diffusion uses a compression factor of 8, resulting in a 1024x1024 image being -encoded to 128x128. Stable Cascade achieves a compression factor of 42, meaning that it is possible to encode a -1024x1024 image to 24x24, while maintaining crisp reconstructions. The text-conditional model is then trained in the -highly compressed latent space. Previous versions of this architecture, achieved a 16x cost reduction over Stable -Diffusion 1.5. - -Therefore, this kind of model is well suited for usages where efficiency is important. Furthermore, all known extensions -like finetuning, LoRA, ControlNet, IP-Adapter, LCM etc. are possible with this method as well. - -The original codebase can be found at [Stability-AI/StableCascade](https://github.com/Stability-AI/StableCascade). 
- -## Model Overview -Stable Cascade consists of three models: Stage A, Stage B and Stage C, representing a cascade to generate images, -hence the name "Stable Cascade". - -Stage A & B are used to compress images, similar to what the job of the VAE is in Stable Diffusion. -However, with this setup, a much higher compression of images can be achieved. While the Stable Diffusion models use a -spatial compression factor of 8, encoding an image with resolution of 1024 x 1024 to 128 x 128, Stable Cascade achieves -a compression factor of 42. This encodes a 1024 x 1024 image to 24 x 24, while being able to accurately decode the -image. This comes with the great benefit of cheaper training and inference. Furthermore, Stage C is responsible -for generating the small 24 x 24 latents given a text prompt. - -The Stage C model operates on the small 24 x 24 latents and denoises the latents conditioned on text prompts. The model is also the largest component in the Cascade pipeline and is meant to be used with the `StableCascadePriorPipeline` - -The Stage B and Stage A models are used with the `StableCascadeDecoderPipeline` and are responsible for generating the final image given the small 24 x 24 latents. - - - -There are some restrictions on data types that can be used with the Stable Cascade models. The official checkpoints for the `StableCascadePriorPipeline` do not support the `torch.float16` data type. Please use `torch.bfloat16` instead. - -In order to use the `torch.bfloat16` data type with the `StableCascadeDecoderPipeline` you need to have PyTorch 2.2.0 or higher installed. This also means that using the `StableCascadeCombinedPipeline` with `torch.bfloat16` requires PyTorch 2.2.0 or higher, since it calls the `StableCascadeDecoderPipeline` internally. - -If it is not possible to install PyTorch 2.2.0 or higher in your environment, the `StableCascadeDecoderPipeline` can be used on its own with the `torch.float16` data type. You can download the full precision or `bf16` variant weights for the pipeline and cast the weights to `torch.float16`. 
- - - -## Usage example - -```python -import torch -from diffusers import StableCascadeDecoderPipeline, StableCascadePriorPipeline - -prompt = "an image of a shiba inu, donning a spacesuit and helmet" -negative_prompt = "" - -prior = StableCascadePriorPipeline.from_pretrained("stabilityai/stable-cascade-prior", variant="bf16", torch_dtype=torch.bfloat16) -decoder = StableCascadeDecoderPipeline.from_pretrained("stabilityai/stable-cascade", variant="bf16", torch_dtype=torch.float16) - -prior.enable_model_cpu_offload() -prior_output = prior( - prompt=prompt, - height=1024, - width=1024, - negative_prompt=negative_prompt, - guidance_scale=4.0, - num_images_per_prompt=1, - num_inference_steps=20 -) - -decoder.enable_model_cpu_offload() -decoder_output = decoder( - image_embeddings=prior_output.image_embeddings.to(torch.float16), - prompt=prompt, - negative_prompt=negative_prompt, - guidance_scale=0.0, - output_type="pil", - num_inference_steps=10 -).images[0] -decoder_output.save("cascade.png") -``` - -## Using the Lite Versions of the Stage B and Stage C models - -```python -import torch -from diffusers import ( - StableCascadeDecoderPipeline, - StableCascadePriorPipeline, - StableCascadeUNet, -) - -prompt = "an image of a shiba inu, donning a spacesuit and helmet" -negative_prompt = "" - -prior_unet = StableCascadeUNet.from_pretrained("stabilityai/stable-cascade-prior", subfolder="prior_lite") -decoder_unet = StableCascadeUNet.from_pretrained("stabilityai/stable-cascade", subfolder="decoder_lite") - -prior = StableCascadePriorPipeline.from_pretrained("stabilityai/stable-cascade-prior", prior=prior_unet) -decoder = StableCascadeDecoderPipeline.from_pretrained("stabilityai/stable-cascade", decoder=decoder_unet) - -prior.enable_model_cpu_offload() -prior_output = prior( - prompt=prompt, - height=1024, - width=1024, - negative_prompt=negative_prompt, - guidance_scale=4.0, - num_images_per_prompt=1, - num_inference_steps=20 -) - -decoder.enable_model_cpu_offload() -decoder_output = decoder( - image_embeddings=prior_output.image_embeddings, - prompt=prompt, - negative_prompt=negative_prompt, - guidance_scale=0.0, - output_type="pil", - num_inference_steps=10 -).images[0] -decoder_output.save("cascade.png") -``` - -## Loading original checkpoints with `from_single_file` - -Loading the original format checkpoints is supported via `from_single_file` method in the StableCascadeUNet. 
- -```python -import torch -from diffusers import ( - StableCascadeDecoderPipeline, - StableCascadePriorPipeline, - StableCascadeUNet, -) - -prompt = "an image of a shiba inu, donning a spacesuit and helmet" -negative_prompt = "" - -prior_unet = StableCascadeUNet.from_single_file( - "https://huggingface.co/stabilityai/stable-cascade/resolve/main/stage_c_bf16.safetensors", - torch_dtype=torch.bfloat16 -) -decoder_unet = StableCascadeUNet.from_single_file( - "https://huggingface.co/stabilityai/stable-cascade/blob/main/stage_b_bf16.safetensors", - torch_dtype=torch.bfloat16 -) - -prior = StableCascadePriorPipeline.from_pretrained("stabilityai/stable-cascade-prior", prior=prior_unet, torch_dtype=torch.bfloat16) -decoder = StableCascadeDecoderPipeline.from_pretrained("stabilityai/stable-cascade", decoder=decoder_unet, torch_dtype=torch.bfloat16) - -prior.enable_model_cpu_offload() -prior_output = prior( - prompt=prompt, - height=1024, - width=1024, - negative_prompt=negative_prompt, - guidance_scale=4.0, - num_images_per_prompt=1, - num_inference_steps=20 -) - -decoder.enable_model_cpu_offload() -decoder_output = decoder( - image_embeddings=prior_output.image_embeddings, - prompt=prompt, - negative_prompt=negative_prompt, - guidance_scale=0.0, - output_type="pil", - num_inference_steps=10 -).images[0] -decoder_output.save("cascade-single-file.png") -``` - -## Uses - -### Direct Use - -The model is intended for research purposes for now. Possible research areas and tasks include - -- Research on generative models. -- Safe deployment of models which have the potential to generate harmful content. -- Probing and understanding the limitations and biases of generative models. -- Generation of artworks and use in design and other artistic processes. -- Applications in educational or creative tools. - -Excluded uses are described below. - -### Out-of-Scope Use - -The model was not trained to be factual or true representations of people or events, -and therefore using the model to generate such content is out-of-scope for the abilities of this model. -The model should not be used in any way that violates Stability AI's [Acceptable Use Policy](https://stability.ai/use-policy). - -## Limitations and Bias - -### Limitations -- Faces and people in general may not be generated properly. -- The autoencoding part of the model is lossy. - - -## StableCascadeCombinedPipeline - -[[autodoc]] StableCascadeCombinedPipeline - - all - - __call__ - -## StableCascadePriorPipeline - -[[autodoc]] StableCascadePriorPipeline - - all - - __call__ - -## StableCascadePriorPipelineOutput - -[[autodoc]] pipelines.stable_cascade.pipeline_stable_cascade_prior.StableCascadePriorPipelineOutput - -## StableCascadeDecoderPipeline - -[[autodoc]] StableCascadeDecoderPipeline - - all - - __call__ - diff --git a/diffusers/docs/source/en/api/pipelines/stable_diffusion/adapter.md b/diffusers/docs/source/en/api/pipelines/stable_diffusion/adapter.md deleted file mode 100644 index ca42fdc83984fb6bd4bb226747479a1936fcaa79..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/stable_diffusion/adapter.md +++ /dev/null @@ -1,47 +0,0 @@ - - -# T2I-Adapter - -[T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models](https://arxiv.org/abs/2302.08453) by Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, Xiaohu Qie. 
- -Using the pretrained models we can provide control images (for example, a depth map) to control Stable Diffusion text-to-image generation so that it follows the structure of the depth image and fills in the details. - -The abstract of the paper is the following: - -*The incredible generative ability of large-scale text-to-image (T2I) models has demonstrated strong power of learning complex structures and meaningful semantics. However, relying solely on text prompts cannot fully take advantage of the knowledge learned by the model, especially when flexible and accurate controlling (e.g., color and structure) is needed. In this paper, we aim to ``dig out" the capabilities that T2I models have implicitly learned, and then explicitly use them to control the generation more granularly. Specifically, we propose to learn simple and lightweight T2I-Adapters to align internal knowledge in T2I models with external control signals, while freezing the original large T2I models. In this way, we can train various adapters according to different conditions, achieving rich control and editing effects in the color and structure of the generation results. Further, the proposed T2I-Adapters have attractive properties of practical value, such as composability and generalization ability. Extensive experiments demonstrate that our T2I-Adapter has promising generation quality and a wide range of applications.* - -This model was contributed by the community contributor [HimariO](https://github.com/HimariO) ❤️ . - -## StableDiffusionAdapterPipeline - -[[autodoc]] StableDiffusionAdapterPipeline - - all - - __call__ - - enable_attention_slicing - - disable_attention_slicing - - enable_vae_slicing - - disable_vae_slicing - - enable_xformers_memory_efficient_attention - - disable_xformers_memory_efficient_attention - -## StableDiffusionXLAdapterPipeline - -[[autodoc]] StableDiffusionXLAdapterPipeline - - all - - __call__ - - enable_attention_slicing - - disable_attention_slicing - - enable_vae_slicing - - disable_vae_slicing - - enable_xformers_memory_efficient_attention - - disable_xformers_memory_efficient_attention diff --git a/diffusers/docs/source/en/api/pipelines/stable_diffusion/depth2img.md b/diffusers/docs/source/en/api/pipelines/stable_diffusion/depth2img.md deleted file mode 100644 index 84dae80498a3828a7ab1cd803ac055ea18313bd0..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/stable_diffusion/depth2img.md +++ /dev/null @@ -1,40 +0,0 @@ - - -# Depth-to-image - -The Stable Diffusion model can also infer depth based on an image using [MiDaS](https://github.com/isl-org/MiDaS). This allows you to pass a text prompt and an initial image to condition the generation of new images as well as a `depth_map` to preserve the image structure. - - - -Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! - -If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations! 
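A minimal sketch of a typical call is shown below (the image URL and prompt are illustrative). If `depth_map` is not passed, the pipeline estimates one from the input image with MiDaS; you can also pass your own depth map to control the structure explicitly.

```python
import torch
from diffusers import StableDiffusionDepth2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16
).to("cuda")

init_image = load_image("http://images.cocodataset.org/val2017/000000039769.jpg")

# depth_map is optional: omit it to let the pipeline predict depth from init_image
image = pipe(prompt="two tigers", image=init_image, negative_prompt="bad, deformed, ugly", strength=0.7).images[0]
image.save("depth2img.png")
```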
- - - -## StableDiffusionDepth2ImgPipeline - -[[autodoc]] StableDiffusionDepth2ImgPipeline - - all - - __call__ - - enable_attention_slicing - - disable_attention_slicing - - enable_xformers_memory_efficient_attention - - disable_xformers_memory_efficient_attention - - load_textual_inversion - - load_lora_weights - - save_lora_weights - -## StableDiffusionPipelineOutput - -[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput diff --git a/diffusers/docs/source/en/api/pipelines/stable_diffusion/gligen.md b/diffusers/docs/source/en/api/pipelines/stable_diffusion/gligen.md deleted file mode 100644 index c67544472ead6fd89f67f013a0e92bbfb07d125b..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/stable_diffusion/gligen.md +++ /dev/null @@ -1,59 +0,0 @@ - - -# GLIGEN (Grounded Language-to-Image Generation) - -The GLIGEN model was created by researchers and engineers from [University of Wisconsin-Madison, Columbia University, and Microsoft](https://github.com/gligen/GLIGEN). The [`StableDiffusionGLIGENPipeline`] and [`StableDiffusionGLIGENTextImagePipeline`] can generate photorealistic images conditioned on grounding inputs. Along with text and bounding boxes with [`StableDiffusionGLIGENPipeline`], if input images are given, [`StableDiffusionGLIGENTextImagePipeline`] can insert objects described by text at the region defined by bounding boxes. Otherwise, it'll generate an image described by the caption/prompt and insert objects described by text at the region defined by bounding boxes. It's trained on COCO2014D and COCO2014CD datasets, and the model uses a frozen CLIP ViT-L/14 text encoder to condition itself on grounding inputs. - -The abstract from the [paper](https://huggingface.co/papers/2301.07093) is: - -*Large-scale text-to-image diffusion models have made amazing advances. However, the status quo is to use text input alone, which can impede controllability. In this work, we propose GLIGEN, Grounded-Language-to-Image Generation, a novel approach that builds upon and extends the functionality of existing pre-trained text-to-image diffusion models by enabling them to also be conditioned on grounding inputs. To preserve the vast concept knowledge of the pre-trained model, we freeze all of its weights and inject the grounding information into new trainable layers via a gated mechanism. Our model achieves open-world grounded text2img generation with caption and bounding box condition inputs, and the grounding ability generalizes well to novel spatial configurations and concepts. GLIGEN’s zeroshot performance on COCO and LVIS outperforms existing supervised layout-to-image baselines by a large margin.* - - - -Make sure to check out the Stable Diffusion [Tips](https://huggingface.co/docs/diffusers/en/api/pipelines/stable_diffusion/overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality and how to reuse pipeline components efficiently! - -If you want to use one of the official checkpoints for a task, explore the [gligen](https://huggingface.co/gligen) Hub organizations! - - - -[`StableDiffusionGLIGENPipeline`] was contributed by [Nikhil Gajendrakumar](https://github.com/nikhil-masterful) and [`StableDiffusionGLIGENTextImagePipeline`] was contributed by [Nguyễn Công Tú Anh](https://github.com/tuanh123789). 
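As an example of the grounded text-to-image API, the sketch below inserts an object described by a grounding phrase into a bounding box given in normalized `[xmin, ymin, xmax, ymax]` coordinates. The checkpoint name, phrases, and box coordinates are illustrative.

```python
import torch
from diffusers import StableDiffusionGLIGENPipeline

pipe = StableDiffusionGLIGENPipeline.from_pretrained(
    "masterful/gligen-1-4-generation-text-box", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="a birthday cake on a wooden table in a sunlit kitchen",
    gligen_phrases=["a birthday cake"],      # what to ground
    gligen_boxes=[[0.25, 0.45, 0.75, 0.9]],  # where to place it (normalized coordinates)
    gligen_scheduled_sampling_beta=1.0,      # scheduled sampling: how much of denoising uses the grounding tokens
    num_inference_steps=50,
).images[0]
image.save("gligen_cake.png")
```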
- -## StableDiffusionGLIGENPipeline - -[[autodoc]] StableDiffusionGLIGENPipeline - - all - - __call__ - - enable_vae_slicing - - disable_vae_slicing - - enable_vae_tiling - - disable_vae_tiling - - enable_model_cpu_offload - - prepare_latents - - enable_fuser - -## StableDiffusionGLIGENTextImagePipeline - -[[autodoc]] StableDiffusionGLIGENTextImagePipeline - - all - - __call__ - - enable_vae_slicing - - disable_vae_slicing - - enable_vae_tiling - - disable_vae_tiling - - enable_model_cpu_offload - - prepare_latents - - enable_fuser - -## StableDiffusionPipelineOutput - -[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput diff --git a/diffusers/docs/source/en/api/pipelines/stable_diffusion/image_variation.md b/diffusers/docs/source/en/api/pipelines/stable_diffusion/image_variation.md deleted file mode 100644 index 57dd2f0d5b396d271652805d1d86a49c58e3efe9..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/stable_diffusion/image_variation.md +++ /dev/null @@ -1,37 +0,0 @@ - - -# Image variation - -The Stable Diffusion model can also generate variations from an input image. It uses a fine-tuned version of a Stable Diffusion model by [Justin Pinkney](https://www.justinpinkney.com/) from [Lambda](https://lambdalabs.com/). - -The original codebase can be found at [LambdaLabsML/lambda-diffusers](https://github.com/LambdaLabsML/lambda-diffusers#stable-diffusion-image-variations) and additional official checkpoints for image variation can be found at [lambdalabs/sd-image-variations-diffusers](https://huggingface.co/lambdalabs/sd-image-variations-diffusers). - - - -Make sure to check out the Stable Diffusion [Tips](./overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! - - - -## StableDiffusionImageVariationPipeline - -[[autodoc]] StableDiffusionImageVariationPipeline - - all - - __call__ - - enable_attention_slicing - - disable_attention_slicing - - enable_xformers_memory_efficient_attention - - disable_xformers_memory_efficient_attention - -## StableDiffusionPipelineOutput - -[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput diff --git a/diffusers/docs/source/en/api/pipelines/stable_diffusion/img2img.md b/diffusers/docs/source/en/api/pipelines/stable_diffusion/img2img.md deleted file mode 100644 index 1a62a5a48ff0eb7eeed43a15f76509a620fd2692..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/stable_diffusion/img2img.md +++ /dev/null @@ -1,55 +0,0 @@ - - -# Image-to-image - -The Stable Diffusion model can also be applied to image-to-image generation by passing a text prompt and an initial image to condition the generation of new images. - -The [`StableDiffusionImg2ImgPipeline`] uses the diffusion-denoising mechanism proposed in [SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations](https://huggingface.co/papers/2108.01073) by Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, Stefano Ermon. - -The abstract from the paper is: - -*Guided image synthesis enables everyday users to create and edit photo-realistic images with minimum effort. The key challenge is balancing faithfulness to the user input (e.g., hand-drawn colored strokes) and realism of the synthesized image. 
Existing GAN-based methods attempt to achieve such balance using either conditional GANs or GAN inversions, which are challenging and often require additional training data or loss functions for individual applications. To address these issues, we introduce a new image synthesis and editing method, Stochastic Differential Editing (SDEdit), based on a diffusion model generative prior, which synthesizes realistic images by iteratively denoising through a stochastic differential equation (SDE). Given an input image with user guide of any type, SDEdit first adds noise to the input, then subsequently denoises the resulting image through the SDE prior to increase its realism. SDEdit does not require task-specific training or inversions and can naturally achieve the balance between realism and faithfulness. SDEdit significantly outperforms state-of-the-art GAN-based methods by up to 98.09% on realism and 91.72% on overall satisfaction scores, according to a human perception study, on multiple tasks, including stroke-based image synthesis and editing as well as image compositing.* - - - -Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! - - - -## StableDiffusionImg2ImgPipeline - -[[autodoc]] StableDiffusionImg2ImgPipeline - - all - - __call__ - - enable_attention_slicing - - disable_attention_slicing - - enable_xformers_memory_efficient_attention - - disable_xformers_memory_efficient_attention - - load_textual_inversion - - from_single_file - - load_lora_weights - - save_lora_weights - -## StableDiffusionPipelineOutput - -[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput - -## FlaxStableDiffusionImg2ImgPipeline - -[[autodoc]] FlaxStableDiffusionImg2ImgPipeline - - all - - __call__ - -## FlaxStableDiffusionPipelineOutput - -[[autodoc]] pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput diff --git a/diffusers/docs/source/en/api/pipelines/stable_diffusion/inpaint.md b/diffusers/docs/source/en/api/pipelines/stable_diffusion/inpaint.md deleted file mode 100644 index ef605cfe8b9001c62d1734e40c8c9c94cf469cc8..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/stable_diffusion/inpaint.md +++ /dev/null @@ -1,57 +0,0 @@ - - -# Inpainting - -The Stable Diffusion model can also be applied to inpainting which lets you edit specific parts of an image by providing a mask and a text prompt using Stable Diffusion. - -## Tips - -It is recommended to use this pipeline with checkpoints that have been specifically fine-tuned for inpainting, such -as [runwayml/stable-diffusion-inpainting](https://huggingface.co/runwayml/stable-diffusion-inpainting). Default -text-to-image Stable Diffusion checkpoints, such as -[stable-diffusion-v1-5/stable-diffusion-v1-5](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5) are also compatible but they might be less performant. - - - -Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! - -If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations! 
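A minimal sketch of a typical inpainting call, using the inpainting-specific checkpoint mentioned above (the example image and mask URLs are the same ones used elsewhere in these docs):

```python
import torch
from diffusers import StableDiffusionInpaintPipeline
from diffusers.utils import load_image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"

init_image = load_image(img_url).resize((512, 512))
mask_image = load_image(mask_url).resize((512, 512))

# White pixels in the mask are repainted, black pixels are preserved
prompt = "Face of a yellow cat, high resolution, sitting on a park bench"
image = pipe(prompt=prompt, image=init_image, mask_image=mask_image).images[0]
image.save("inpaint_cat.png")
```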
- - - -## StableDiffusionInpaintPipeline - -[[autodoc]] StableDiffusionInpaintPipeline - - all - - __call__ - - enable_attention_slicing - - disable_attention_slicing - - enable_xformers_memory_efficient_attention - - disable_xformers_memory_efficient_attention - - load_textual_inversion - - load_lora_weights - - save_lora_weights - -## StableDiffusionPipelineOutput - -[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput - -## FlaxStableDiffusionInpaintPipeline - -[[autodoc]] FlaxStableDiffusionInpaintPipeline - - all - - __call__ - -## FlaxStableDiffusionPipelineOutput - -[[autodoc]] pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput diff --git a/diffusers/docs/source/en/api/pipelines/stable_diffusion/k_diffusion.md b/diffusers/docs/source/en/api/pipelines/stable_diffusion/k_diffusion.md deleted file mode 100644 index 77e77f8eded8ea4059f22c1d01cce9fa62a96752..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/stable_diffusion/k_diffusion.md +++ /dev/null @@ -1,27 +0,0 @@ - - -# K-Diffusion - -[k-diffusion](https://github.com/crowsonkb/k-diffusion) is a popular library created by [Katherine Crowson](https://github.com/crowsonkb/). We provide `StableDiffusionKDiffusionPipeline` and `StableDiffusionXLKDiffusionPipeline` that allow you to run Stable Diffusion with samplers from k-diffusion. - -Note that most of the samplers from k-diffusion are implemented in Diffusers and we recommend using existing schedulers. You can find a mapping between k-diffusion samplers and schedulers in Diffusers [here](https://huggingface.co/docs/diffusers/api/schedulers/overview). - - -## StableDiffusionKDiffusionPipeline - -[[autodoc]] StableDiffusionKDiffusionPipeline - - -## StableDiffusionXLKDiffusionPipeline - -[[autodoc]] StableDiffusionXLKDiffusionPipeline \ No newline at end of file diff --git a/diffusers/docs/source/en/api/pipelines/stable_diffusion/latent_upscale.md b/diffusers/docs/source/en/api/pipelines/stable_diffusion/latent_upscale.md deleted file mode 100644 index 9abccd6e134713125914ca8080ee9195696fb8ed..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/stable_diffusion/latent_upscale.md +++ /dev/null @@ -1,38 +0,0 @@ - - -# Latent upscaler - -The Stable Diffusion latent upscaler model was created by [Katherine Crowson](https://github.com/crowsonkb/k-diffusion) in collaboration with [Stability AI](https://stability.ai/). It is used to enhance the output image resolution by a factor of 2 (see this demo [notebook](https://colab.research.google.com/drive/1o1qYJcFeywzCIdkfKJy7cTpgZTCM2EI4) for a demonstration of the original implementation). - - - -Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! - -If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations!
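Because the upscaler works on latents rather than pixels, it is usually chained with a text-to-image pipeline that returns its output in latent space. The following is a rough sketch of that pattern; the checkpoints, seed, and step counts are illustrative.

```python
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionLatentUpscalePipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
upscaler = StableDiffusionLatentUpscalePipeline.from_pretrained(
    "stabilityai/sd-x2-latent-upscaler", torch_dtype=torch.float16
).to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
generator = torch.manual_seed(33)

# Keep the output in latent space so it can be fed directly to the upscaler
low_res_latents = pipe(prompt, generator=generator, output_type="latent").images

upscaled_image = upscaler(
    prompt=prompt,
    image=low_res_latents,
    num_inference_steps=20,
    guidance_scale=0,
    generator=generator,
).images[0]
upscaled_image.save("astronaut_2x.png")
```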
- - - -## StableDiffusionLatentUpscalePipeline - -[[autodoc]] StableDiffusionLatentUpscalePipeline - - all - - __call__ - - enable_sequential_cpu_offload - - enable_attention_slicing - - disable_attention_slicing - - enable_xformers_memory_efficient_attention - - disable_xformers_memory_efficient_attention - -## StableDiffusionPipelineOutput - -[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput diff --git a/diffusers/docs/source/en/api/pipelines/stable_diffusion/ldm3d_diffusion.md b/diffusers/docs/source/en/api/pipelines/stable_diffusion/ldm3d_diffusion.md deleted file mode 100644 index 23830462c20badd897b50091eda0345cb99e65c8..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/stable_diffusion/ldm3d_diffusion.md +++ /dev/null @@ -1,55 +0,0 @@ - - -# Text-to-(RGB, depth) - -LDM3D was proposed in [LDM3D: Latent Diffusion Model for 3D](https://huggingface.co/papers/2305.10853) by Gabriela Ben Melech Stan, Diana Wofk, Scottie Fox, Alex Redden, Will Saxton, Jean Yu, Estelle Aflalo, Shao-Yen Tseng, Fabio Nonato, Matthias Muller, and Vasudev Lal. LDM3D generates an image and a depth map from a given text prompt unlike the existing text-to-image diffusion models such as [Stable Diffusion](./overview) which only generates an image. With almost the same number of parameters, LDM3D achieves to create a latent space that can compress both the RGB images and the depth maps. - -Two checkpoints are available for use: -- [ldm3d-original](https://huggingface.co/Intel/ldm3d). The original checkpoint used in the [paper](https://arxiv.org/pdf/2305.10853.pdf) -- [ldm3d-4c](https://huggingface.co/Intel/ldm3d-4c). The new version of LDM3D using 4 channels inputs instead of 6-channels inputs and finetuned on higher resolution images. - - -The abstract from the paper is: - -*This research paper proposes a Latent Diffusion Model for 3D (LDM3D) that generates both image and depth map data from a given text prompt, allowing users to generate RGBD images from text prompts. The LDM3D model is fine-tuned on a dataset of tuples containing an RGB image, depth map and caption, and validated through extensive experiments. We also develop an application called DepthFusion, which uses the generated RGB images and depth maps to create immersive and interactive 360-degree-view experiences using TouchDesigner. This technology has the potential to transform a wide range of industries, from entertainment and gaming to architecture and design. Overall, this paper presents a significant contribution to the field of generative AI and computer vision, and showcases the potential of LDM3D and DepthFusion to revolutionize content creation and digital experiences. A short video summarizing the approach can be found at [this url](https://t.ly/tdi2).* - - - -Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! - - - -## StableDiffusionLDM3DPipeline - -[[autodoc]] pipelines.stable_diffusion_ldm3d.pipeline_stable_diffusion_ldm3d.StableDiffusionLDM3DPipeline - - all - - __call__ - - -## LDM3DPipelineOutput - -[[autodoc]] pipelines.stable_diffusion_ldm3d.pipeline_stable_diffusion_ldm3d.LDM3DPipelineOutput - - all - - __call__ - -# Upscaler - -[LDM3D-VR](https://arxiv.org/pdf/2311.03226.pdf) is an extended version of LDM3D. 
- -The abstract from the paper is: -*Latent diffusion models have proven to be state-of-the-art in the creation and manipulation of visual outputs. However, as far as we know, the generation of depth maps jointly with RGB is still limited. We introduce LDM3D-VR, a suite of diffusion models targeting virtual reality development that includes LDM3D-pano and LDM3D-SR. These models enable the generation of panoramic RGBD based on textual prompts and the upscaling of low-resolution inputs to high-resolution RGBD, respectively. Our models are fine-tuned from existing pretrained models on datasets containing panoramic/high-resolution RGB images, depth maps and captions. Both models are evaluated in comparison to existing related methods* - -Two checkpoints are available for use: -- [ldm3d-pano](https://huggingface.co/Intel/ldm3d-pano). This checkpoint enables the generation of panoramic images and requires the StableDiffusionLDM3DPipeline pipeline to be used. -- [ldm3d-sr](https://huggingface.co/Intel/ldm3d-sr). This checkpoint enables the upscaling of RGB and depth images. Can be used in cascade after the original LDM3D pipeline using the StableDiffusionUpscaleLDM3DPipeline from communauty pipeline. - diff --git a/diffusers/docs/source/en/api/pipelines/stable_diffusion/overview.md b/diffusers/docs/source/en/api/pipelines/stable_diffusion/overview.md deleted file mode 100644 index 5087d1fdd43ac6a5fbeed7548c61b5ef58de6e73..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/stable_diffusion/overview.md +++ /dev/null @@ -1,212 +0,0 @@ - - -# Stable Diffusion pipelines - -Stable Diffusion is a text-to-image latent diffusion model created by the researchers and engineers from [CompVis](https://github.com/CompVis), [Stability AI](https://stability.ai/) and [LAION](https://laion.ai/). Latent diffusion applies the diffusion process over a lower dimensional latent space to reduce memory and compute complexity. This specific type of diffusion model was proposed in [High-Resolution Image Synthesis with Latent Diffusion Models](https://huggingface.co/papers/2112.10752) by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer. - -Stable Diffusion is trained on 512x512 images from a subset of the LAION-5B dataset. This model uses a frozen CLIP ViT-L/14 text encoder to condition the model on text prompts. With its 860M UNet and 123M text encoder, the model is relatively lightweight and can run on consumer GPUs. - -For more details about how Stable Diffusion works and how it differs from the base latent diffusion model, take a look at the Stability AI [announcement](https://stability.ai/blog/stable-diffusion-announcement) and our own [blog post](https://huggingface.co/blog/stable_diffusion#how-does-stable-diffusion-work) for more technical details. - -You can find the original codebase for Stable Diffusion v1.0 at [CompVis/stable-diffusion](https://github.com/CompVis/stable-diffusion) and Stable Diffusion v2.0 at [Stability-AI/stablediffusion](https://github.com/Stability-AI/stablediffusion) as well as their original scripts for various tasks. Additional official checkpoints for the different Stable Diffusion versions and tasks can be found on the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations. Explore these organizations to find the best checkpoint for your use-case! 
-
-The table below summarizes the available Stable Diffusion pipelines and their supported tasks:
-
| Pipeline | Supported tasks |
|---|---|
| StableDiffusion | text-to-image |
| StableDiffusionImg2Img | image-to-image |
| StableDiffusionInpaint | inpainting |
| StableDiffusionDepth2Img | depth-to-image |
| StableDiffusionImageVariation | image variation |
| StableDiffusionPipelineSafe | filtered text-to-image |
| StableDiffusion2 | text-to-image, inpainting, depth-to-image, super-resolution |
| StableDiffusionXL | text-to-image, image-to-image |
| StableDiffusionLatentUpscale | super-resolution |
| StableDiffusionUpscale | super-resolution |
| StableDiffusionLDM3D | text-to-rgb, text-to-depth, text-to-pano |
| StableDiffusionUpscaleLDM3D | ldm3d super-resolution |
- -## Tips - -To help you get the most out of the Stable Diffusion pipelines, here are a few tips for improving performance and usability. These tips are applicable to all Stable Diffusion pipelines. - -### Explore tradeoff between speed and quality - -[`StableDiffusionPipeline`] uses the [`PNDMScheduler`] by default, but 🤗 Diffusers provides many other schedulers (some of which are faster or output better quality) that are compatible. For example, if you want to use the [`EulerDiscreteScheduler`] instead of the default: - -```py -from diffusers import StableDiffusionPipeline, EulerDiscreteScheduler - -pipeline = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4") -pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config) - -# or -euler_scheduler = EulerDiscreteScheduler.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="scheduler") -pipeline = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", scheduler=euler_scheduler) -``` - -### Reuse pipeline components to save memory - -To save memory and use the same components across multiple pipelines, use the `.components` method to avoid loading weights into RAM more than once. - -```py -from diffusers import ( - StableDiffusionPipeline, - StableDiffusionImg2ImgPipeline, - StableDiffusionInpaintPipeline, -) - -text2img = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4") -img2img = StableDiffusionImg2ImgPipeline(**text2img.components) -inpaint = StableDiffusionInpaintPipeline(**text2img.components) - -# now you can use text2img(...), img2img(...), inpaint(...) just like the call methods of each respective pipeline -``` - -### Create web demos using `gradio` - -The Stable Diffusion pipelines are automatically supported in [Gradio](https://github.com/gradio-app/gradio/), a library that makes creating beautiful and user-friendly machine learning apps on the web a breeze. First, make sure you have Gradio installed: - -```sh -pip install -U gradio -``` - -Then, create a web demo around any Stable Diffusion-based pipeline. For example, you can create an image generation pipeline in a single line of code with Gradio's [`Interface.from_pipeline`](https://www.gradio.app/docs/interface#interface-from-pipeline) function: - -```py -from diffusers import StableDiffusionPipeline -import gradio as gr - -pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4") - -gr.Interface.from_pipeline(pipe).launch() -``` - -which opens an intuitive drag-and-drop interface in your browser: - -![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/gradio-panda.png) - -Similarly, you could create a demo for an image-to-image pipeline with: - -```py -from diffusers import StableDiffusionImg2ImgPipeline -import gradio as gr - - -pipe = StableDiffusionImg2ImgPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5") - -gr.Interface.from_pipeline(pipe).launch() -``` - -By default, the web demo runs on a local server. If you'd like to share it with others, you can generate a temporary public -link by setting `share=True` in `launch()`. Or, you can host your demo on [Hugging Face Spaces](https://huggingface.co/spaces)https://huggingface.co/spaces for a permanent link. 
\ No newline at end of file diff --git a/diffusers/docs/source/en/api/pipelines/stable_diffusion/sdxl_turbo.md b/diffusers/docs/source/en/api/pipelines/stable_diffusion/sdxl_turbo.md deleted file mode 100644 index 764685a73cfb3672f523a37263417defe2413848..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/stable_diffusion/sdxl_turbo.md +++ /dev/null @@ -1,35 +0,0 @@ - - -# SDXL Turbo - -Stable Diffusion XL (SDXL) Turbo was proposed in [Adversarial Diffusion Distillation](https://stability.ai/research/adversarial-diffusion-distillation) by Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. - -The abstract from the paper is: - -*We introduce Adversarial Diffusion Distillation (ADD), a novel training approach that efficiently samples large-scale foundational image diffusion models in just 1–4 steps while maintaining high image quality. We use score distillation to leverage large-scale off-the-shelf image diffusion models as a teacher signal in combination with an adversarial loss to ensure high image fidelity even in the low-step regime of one or two sampling steps. Our analyses show that our model clearly outperforms existing few-step methods (GANs,Latent Consistency Models) in a single step and reaches the performance of state-of-the-art diffusion models (SDXL) in only four steps. ADD is the first method to unlock single-step, real-time image synthesis with foundation models.* - -## Tips - -- SDXL Turbo uses the exact same architecture as [SDXL](./stable_diffusion_xl), which means it also has the same API. Please refer to the [SDXL](./stable_diffusion_xl) API reference for more details. -- SDXL Turbo should disable guidance scale by setting `guidance_scale=0.0`. -- SDXL Turbo should use `timestep_spacing='trailing'` for the scheduler and use between 1 and 4 steps. -- SDXL Turbo has been trained to generate images of size 512x512. -- SDXL Turbo is open-access, but not open-source meaning that one might have to buy a model license in order to use it for commercial applications. Make sure to read the [official model card](https://huggingface.co/stabilityai/sdxl-turbo) to learn more. - - - -To learn how to use SDXL Turbo for various tasks, how to optimize performance, and other usage examples, take a look at the [SDXL Turbo](../../../using-diffusers/sdxl_turbo) guide. - -Check out the [Stability AI](https://huggingface.co/stabilityai) Hub organization for the official base and refiner model checkpoints! - - diff --git a/diffusers/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_2.md b/diffusers/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_2.md deleted file mode 100644 index a6bb50cc83d5805e567c82017281806cadd8ec98..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_2.md +++ /dev/null @@ -1,125 +0,0 @@ - - -# Stable Diffusion 2 - -Stable Diffusion 2 is a text-to-image _latent diffusion_ model built upon the work of the original [Stable Diffusion](https://stability.ai/blog/stable-diffusion-public-release), and it was led by Robin Rombach and Katherine Crowson from [Stability AI](https://stability.ai/) and [LAION](https://laion.ai/). - -*The Stable Diffusion 2.0 release includes robust text-to-image models trained using a brand new text encoder (OpenCLIP), developed by LAION with support from Stability AI, which greatly improves the quality of the generated images compared to earlier V1 releases. 
The text-to-image models in this release can generate images with default resolutions of both 512x512 pixels and 768x768 pixels. -These models are trained on an aesthetic subset of the [LAION-5B dataset](https://laion.ai/blog/laion-5b/) created by the DeepFloyd team at Stability AI, which is then further filtered to remove adult content using [LAION’s NSFW filter](https://openreview.net/forum?id=M3Y74vmsMcY).* - -For more details about how Stable Diffusion 2 works and how it differs from the original Stable Diffusion, please refer to the official [announcement post](https://stability.ai/blog/stable-diffusion-v2-release). - -The architecture of Stable Diffusion 2 is more or less identical to the original [Stable Diffusion model](./text2img) so check out it's API documentation for how to use Stable Diffusion 2. We recommend using the [`DPMSolverMultistepScheduler`] as it gives a reasonable speed/quality trade-off and can be run with as little as 20 steps. - -Stable Diffusion 2 is available for tasks like text-to-image, inpainting, super-resolution, and depth-to-image: - -| Task | Repository | -|-------------------------|---------------------------------------------------------------------------------------------------------------| -| text-to-image (512x512) | [stabilityai/stable-diffusion-2-base](https://huggingface.co/stabilityai/stable-diffusion-2-base) | -| text-to-image (768x768) | [stabilityai/stable-diffusion-2](https://huggingface.co/stabilityai/stable-diffusion-2) | -| inpainting | [stabilityai/stable-diffusion-2-inpainting](https://huggingface.co/stabilityai/stable-diffusion-2-inpainting) | -| super-resolution | [stable-diffusion-x4-upscaler](https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler) | -| depth-to-image | [stabilityai/stable-diffusion-2-depth](https://huggingface.co/stabilityai/stable-diffusion-2-depth) | - -Here are some examples for how to use Stable Diffusion 2 for each task: - - - -Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! - -If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations! 
- - - -## Text-to-image - -```py -from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler -import torch - -repo_id = "stabilityai/stable-diffusion-2-base" -pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, variant="fp16") - -pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config) -pipe = pipe.to("cuda") - -prompt = "High quality photo of an astronaut riding a horse in space" -image = pipe(prompt, num_inference_steps=25).images[0] -image -``` - -## Inpainting - -```py -import torch -from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler -from diffusers.utils import load_image, make_image_grid - -img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png" -mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png" - -init_image = load_image(img_url).resize((512, 512)) -mask_image = load_image(mask_url).resize((512, 512)) - -repo_id = "stabilityai/stable-diffusion-2-inpainting" -pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, variant="fp16") - -pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config) -pipe = pipe.to("cuda") - -prompt = "Face of a yellow cat, high resolution, sitting on a park bench" -image = pipe(prompt=prompt, image=init_image, mask_image=mask_image, num_inference_steps=25).images[0] -make_image_grid([init_image, mask_image, image], rows=1, cols=3) -``` - -## Super-resolution - -```py -from diffusers import StableDiffusionUpscalePipeline -from diffusers.utils import load_image, make_image_grid -import torch - -# load model and scheduler -model_id = "stabilityai/stable-diffusion-x4-upscaler" -pipeline = StableDiffusionUpscalePipeline.from_pretrained(model_id, torch_dtype=torch.float16) -pipeline = pipeline.to("cuda") - -# let's download an image -url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd2-upscale/low_res_cat.png" -low_res_img = load_image(url) -low_res_img = low_res_img.resize((128, 128)) -prompt = "a white cat" -upscaled_image = pipeline(prompt=prompt, image=low_res_img).images[0] -make_image_grid([low_res_img.resize((512, 512)), upscaled_image.resize((512, 512))], rows=1, cols=2) -``` - -## Depth-to-image - -```py -import torch -from diffusers import StableDiffusionDepth2ImgPipeline -from diffusers.utils import load_image, make_image_grid - -pipe = StableDiffusionDepth2ImgPipeline.from_pretrained( - "stabilityai/stable-diffusion-2-depth", - torch_dtype=torch.float16, -).to("cuda") - - -url = "http://images.cocodataset.org/val2017/000000039769.jpg" -init_image = load_image(url) -prompt = "two tigers" -negative_prompt = "bad, deformed, ugly, bad anotomy" -image = pipe(prompt=prompt, image=init_image, negative_prompt=negative_prompt, strength=0.7).images[0] -make_image_grid([init_image, image], rows=1, cols=2) -``` diff --git a/diffusers/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_3.md b/diffusers/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_3.md deleted file mode 100644 index 8170c5280d3823ce87ddc7b87afc91d07b32ea5d..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_3.md +++ /dev/null @@ -1,340 +0,0 @@ - - -# Stable Diffusion 3 - -Stable Diffusion 3 (SD3) was proposed in [Scaling Rectified Flow Transformers for High-Resolution Image 
Synthesis](https://arxiv.org/pdf/2403.03206.pdf) by Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Muller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. - -The abstract from the paper is: - -*Diffusion models create data from noise by inverting the forward paths of data towards noise and have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos. Rectified flow is a recent generative model formulation that connects data and noise in a straight line. Despite its better theoretical properties and conceptual simplicity, it is not yet decisively established as standard practice. In this work, we improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales. Through a large-scale study, we demonstrate the superior performance of this approach compared to established diffusion formulations for high-resolution text-to-image synthesis. Additionally, we present a novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens, improving text comprehension typography, and human preference ratings. We demonstrate that this architecture follows predictable scaling trends and correlates lower validation loss to improved text-to-image synthesis as measured by various metrics and human evaluations.* - - -## Usage Example - -_As the model is gated, before using it with diffusers you first need to go to the [Stable Diffusion 3 Medium Hugging Face page](https://huggingface.co/stabilityai/stable-diffusion-3-medium-diffusers), fill in the form and accept the gate. Once you are in, you need to login so that your system knows you’ve accepted the gate._ - -Use the command below to log in: - -```bash -huggingface-cli login -``` - - - -The SD3 pipeline uses three text encoders to generate an image. Model offloading is necessary in order for it to run on most commodity hardware. Please use the `torch.float16` data type for additional memory savings. - - - -```python -import torch -from diffusers import StableDiffusion3Pipeline - -pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16) -pipe.to("cuda") - -image = pipe( - prompt="a photo of a cat holding a sign that says hello world", - negative_prompt="", - num_inference_steps=28, - height=1024, - width=1024, - guidance_scale=7.0, -).images[0] - -image.save("sd3_hello_world.png") -``` - -**Note:** Stable Diffusion 3.5 can also be run using the SD3 pipeline, and all mentioned optimizations and techniques apply to it as well. In total there are three official models in the SD3 family: -- [`stabilityai/stable-diffusion-3-medium-diffusers`](https://huggingface.co/stabilityai/stable-diffusion-3-medium-diffusers) -- [`stabilityai/stable-diffusion-3.5-large`](https://huggingface.co/stabilityai/stable-diffusion-3-5-large) -- [`stabilityai/stable-diffusion-3.5-large-turbo`](https://huggingface.co/stabilityai/stable-diffusion-3-5-large-turbo) - -## Memory Optimisations for SD3 - -SD3 uses three text encoders, one if which is the very large T5-XXL model. This makes it challenging to run the model on GPUs with less than 24GB of VRAM, even when using `fp16` precision. 
The following section outlines a few memory optimizations in Diffusers that make it easier to run SD3 on low resource hardware. - -### Running Inference with Model Offloading - -The most basic memory optimization available in Diffusers allows you to offload the components of the model to CPU during inference in order to save memory, while seeing a slight increase in inference latency. Model offloading will only move a model component onto the GPU when it needs to be executed, while keeping the remaining components on the CPU. - -```python -import torch -from diffusers import StableDiffusion3Pipeline - -pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16) -pipe.enable_model_cpu_offload() - -image = pipe( - prompt="a photo of a cat holding a sign that says hello world", - negative_prompt="", - num_inference_steps=28, - height=1024, - width=1024, - guidance_scale=7.0, -).images[0] - -image.save("sd3_hello_world.png") -``` - -### Dropping the T5 Text Encoder during Inference - -Removing the memory-intensive 4.7B parameter T5-XXL text encoder during inference can significantly decrease the memory requirements for SD3 with only a slight loss in performance. - -```python -import torch -from diffusers import StableDiffusion3Pipeline - -pipe = StableDiffusion3Pipeline.from_pretrained( - "stabilityai/stable-diffusion-3-medium-diffusers", - text_encoder_3=None, - tokenizer_3=None, - torch_dtype=torch.float16 -) -pipe.to("cuda") - -image = pipe( - prompt="a photo of a cat holding a sign that says hello world", - negative_prompt="", - num_inference_steps=28, - height=1024, - width=1024, - guidance_scale=7.0, -).images[0] - -image.save("sd3_hello_world-no-T5.png") -``` - -### Using a Quantized Version of the T5 Text Encoder - -We can leverage the `bitsandbytes` library to load and quantize the T5-XXL text encoder to 8-bit precision. This allows you to keep using all three text encoders while only slightly impacting performance. - -First install the `bitsandbytes` library. - -```shell -pip install bitsandbytes -``` - -Then load the T5-XXL model using the `BitsAndBytesConfig`. - -```python -import torch -from diffusers import StableDiffusion3Pipeline -from transformers import T5EncoderModel, BitsAndBytesConfig - -quantization_config = BitsAndBytesConfig(load_in_8bit=True) - -model_id = "stabilityai/stable-diffusion-3-medium-diffusers" -text_encoder = T5EncoderModel.from_pretrained( - model_id, - subfolder="text_encoder_3", - quantization_config=quantization_config, -) -pipe = StableDiffusion3Pipeline.from_pretrained( - model_id, - text_encoder_3=text_encoder, - device_map="balanced", - torch_dtype=torch.float16 -) - -image = pipe( - prompt="a photo of a cat holding a sign that says hello world", - negative_prompt="", - num_inference_steps=28, - height=1024, - width=1024, - guidance_scale=7.0, -).images[0] - -image.save("sd3_hello_world-8bit-T5.png") -``` - -You can find the end-to-end script [here](https://gist.github.com/sayakpaul/82acb5976509851f2db1a83456e504f1). - -## Performance Optimizations for SD3 - -### Using Torch Compile to Speed Up Inference - -Using compiled components in the SD3 pipeline can speed up inference by as much as 4X. The following code snippet demonstrates how to compile the Transformer and VAE components of the SD3 pipeline. 
- -```python -import torch -from diffusers import StableDiffusion3Pipeline - -torch.set_float32_matmul_precision("high") - -torch._inductor.config.conv_1x1_as_mm = True -torch._inductor.config.coordinate_descent_tuning = True -torch._inductor.config.epilogue_fusion = False -torch._inductor.config.coordinate_descent_check_all_directions = True - -pipe = StableDiffusion3Pipeline.from_pretrained( - "stabilityai/stable-diffusion-3-medium-diffusers", - torch_dtype=torch.float16 -).to("cuda") -pipe.set_progress_bar_config(disable=True) - -pipe.transformer.to(memory_format=torch.channels_last) -pipe.vae.to(memory_format=torch.channels_last) - -pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True) -pipe.vae.decode = torch.compile(pipe.vae.decode, mode="max-autotune", fullgraph=True) - -# Warm Up -prompt = "a photo of a cat holding a sign that says hello world" -for _ in range(3): - _ = pipe(prompt=prompt, generator=torch.manual_seed(1)) - -# Run Inference -image = pipe(prompt=prompt, generator=torch.manual_seed(1)).images[0] -image.save("sd3_hello_world.png") -``` - -Check out the full script [here](https://gist.github.com/sayakpaul/508d89d7aad4f454900813da5d42ca97). - -## Using Long Prompts with the T5 Text Encoder - -By default, the T5 Text Encoder prompt uses a maximum sequence length of `256`. This can be adjusted by setting the `max_sequence_length` to accept fewer or more tokens. Keep in mind that longer sequences require additional resources and result in longer generation times, such as during batch inference. - -```python -prompt = "A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus, basking in a river of melted butter amidst a breakfast-themed landscape. It features the distinctive, bulky body shape of a hippo. However, instead of the usual grey skin, the creature’s body resembles a golden-brown, crispy waffle fresh off the griddle. The skin is textured with the familiar grid pattern of a waffle, each square filled with a glistening sheen of syrup. The environment combines the natural habitat of a hippo with elements of a breakfast table setting, a river of warm, melted butter, with oversized utensils or plates peeking out from the lush, pancake-like foliage in the background, a towering pepper mill standing in for a tree. As the sun rises in this fantastical world, it casts a warm, buttery glow over the scene. The creature, content in its butter river, lets out a yawn. Nearby, a flock of birds take flight" - -image = pipe( - prompt=prompt, - negative_prompt="", - num_inference_steps=28, - guidance_scale=4.5, - max_sequence_length=512, -).images[0] -``` - -### Sending a different prompt to the T5 Text Encoder - -You can send a different prompt to the CLIP Text Encoders and the T5 Text Encoder to prevent the prompt from being truncated by the CLIP Text Encoders and to improve generation. - - - -The prompt with the CLIP Text Encoders is still truncated to the 77 token limit. - - - -```python -prompt = "A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus, basking in a river of melted butter amidst a breakfast-themed landscape. A river of warm, melted butter, pancake-like foliage in the background, a towering pepper mill standing in for a tree." - -prompt_3 = "A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus, basking in a river of melted butter amidst a breakfast-themed landscape. 
It features the distinctive, bulky body shape of a hippo. However, instead of the usual grey skin, the creature’s body resembles a golden-brown, crispy waffle fresh off the griddle. The skin is textured with the familiar grid pattern of a waffle, each square filled with a glistening sheen of syrup. The environment combines the natural habitat of a hippo with elements of a breakfast table setting, a river of warm, melted butter, with oversized utensils or plates peeking out from the lush, pancake-like foliage in the background, a towering pepper mill standing in for a tree. As the sun rises in this fantastical world, it casts a warm, buttery glow over the scene. The creature, content in its butter river, lets out a yawn. Nearby, a flock of birds take flight" - -image = pipe( - prompt=prompt, - prompt_3=prompt_3, - negative_prompt="", - num_inference_steps=28, - guidance_scale=4.5, - max_sequence_length=512, -).images[0] -``` - -## Tiny AutoEncoder for Stable Diffusion 3 - -Tiny AutoEncoder for Stable Diffusion (TAESD3) is a tiny distilled version of Stable Diffusion 3's VAE by [Ollin Boer Bohan](https://github.com/madebyollin/taesd) that can decode [`StableDiffusion3Pipeline`] latents almost instantly. - -To use with Stable Diffusion 3: - -```python -import torch -from diffusers import StableDiffusion3Pipeline, AutoencoderTiny - -pipe = StableDiffusion3Pipeline.from_pretrained( - "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16 -) -pipe.vae = AutoencoderTiny.from_pretrained("madebyollin/taesd3", torch_dtype=torch.float16) -pipe = pipe.to("cuda") - -prompt = "slice of delicious New York-style berry cheesecake" -image = pipe(prompt, num_inference_steps=25).images[0] -image.save("cheesecake.png") -``` - -## Loading the original checkpoints via `from_single_file` - -The `SD3Transformer2DModel` and `StableDiffusion3Pipeline` classes support loading the original checkpoints via the `from_single_file` method. This method allows you to load the original checkpoint files that were used to train the models. - -## Loading the original checkpoints for the `SD3Transformer2DModel` - -```python -from diffusers import SD3Transformer2DModel - -model = SD3Transformer2DModel.from_single_file("https://huggingface.co/stabilityai/stable-diffusion-3-medium/blob/main/sd3_medium.safetensors") -``` - -## Loading the single checkpoint for the `StableDiffusion3Pipeline` - -### Loading the single file checkpoint without T5 - -```python -import torch -from diffusers import StableDiffusion3Pipeline - -pipe = StableDiffusion3Pipeline.from_single_file( - "https://huggingface.co/stabilityai/stable-diffusion-3-medium/blob/main/sd3_medium_incl_clips.safetensors", - torch_dtype=torch.float16, - text_encoder_3=None -) -pipe.enable_model_cpu_offload() - -image = pipe("a picture of a cat holding a sign that says hello world").images[0] -image.save('sd3-single-file.png') -``` - -### Loading the single file checkpoint with T5 - -> [!TIP] -> The following example loads a checkpoint stored in a 8-bit floating point format which requires PyTorch 2.3 or later. 
- -```python -import torch -from diffusers import StableDiffusion3Pipeline - -pipe = StableDiffusion3Pipeline.from_single_file( - "https://huggingface.co/stabilityai/stable-diffusion-3-medium/blob/main/sd3_medium_incl_clips_t5xxlfp8.safetensors", - torch_dtype=torch.float16, -) -pipe.enable_model_cpu_offload() - -image = pipe("a picture of a cat holding a sign that says hello world").images[0] -image.save('sd3-single-file-t5-fp8.png') -``` - -### Loading the single file checkpoint for the Stable Diffusion 3.5 Transformer Model - -```python -import torch -from diffusers import SD3Transformer2DModel, StableDiffusion3Pipeline - -transformer = SD3Transformer2DModel.from_single_file( - "https://huggingface.co/stabilityai/stable-diffusion-3.5-large-turbo/blob/main/sd3.5_large.safetensors", - torch_dtype=torch.bfloat16, -) -pipe = StableDiffusion3Pipeline.from_pretrained( - "stabilityai/stable-diffusion-3.5-large", - transformer=transformer, - torch_dtype=torch.bfloat16, -) -pipe.enable_model_cpu_offload() -image = pipe("a cat holding a sign that says hello world").images[0] -image.save("sd35.png") -``` - -## StableDiffusion3Pipeline - -[[autodoc]] StableDiffusion3Pipeline - - all - - __call__ diff --git a/diffusers/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_safe.md b/diffusers/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_safe.md deleted file mode 100644 index 97c11bfe23bb66c8733d18abb40850a6a3de0c2d..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_safe.md +++ /dev/null @@ -1,61 +0,0 @@ - - -# Safe Stable Diffusion - -Safe Stable Diffusion was proposed in [Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models](https://huggingface.co/papers/2211.05105) and mitigates inappropriate degeneration from Stable Diffusion models because they're trained on unfiltered web-crawled datasets. For instance Stable Diffusion may unexpectedly generate nudity, violence, images depicting self-harm, and otherwise offensive content. Safe Stable Diffusion is an extension of Stable Diffusion that drastically reduces this type of content. - -The abstract from the paper is: - -*Text-conditioned image generation models have recently achieved astonishing results in image quality and text alignment and are consequently employed in a fast-growing number of applications. Since they are highly data-driven, relying on billion-sized datasets randomly scraped from the internet, they also suffer, as we demonstrate, from degenerated and biased human behavior. In turn, they may even reinforce such biases. To help combat these undesired side effects, we present safe latent diffusion (SLD). Specifically, to measure the inappropriate degeneration due to unfiltered and imbalanced training sets, we establish a novel image generation test bed-inappropriate image prompts (I2P)-containing dedicated, real-world image-to-text prompts covering concepts such as nudity and violence. 
As our exhaustive empirical evaluation demonstrates, the introduced SLD removes and suppresses inappropriate image parts during the diffusion process, with no additional training required and no adverse effect on overall image quality or text alignment.* - -## Tips - -Use the `safety_concept` property of [`StableDiffusionPipelineSafe`] to check and edit the current safety concept: - -```python ->>> from diffusers import StableDiffusionPipelineSafe - ->>> pipeline = StableDiffusionPipelineSafe.from_pretrained("AIML-TUDA/stable-diffusion-safe") ->>> pipeline.safety_concept -'an image showing hate, harassment, violence, suffering, humiliation, harm, suicide, sexual, nudity, bodily fluids, blood, obscene gestures, illegal activity, drug use, theft, vandalism, weapons, child abuse, brutality, cruelty' -``` -For each image generation the active concept is also contained in [`StableDiffusionSafePipelineOutput`]. - -There are 4 configurations (`SafetyConfig.WEAK`, `SafetyConfig.MEDIUM`, `SafetyConfig.STRONG`, and `SafetyConfig.MAX`) that can be applied: - -```python ->>> from diffusers import StableDiffusionPipelineSafe ->>> from diffusers.pipelines.stable_diffusion_safe import SafetyConfig - ->>> pipeline = StableDiffusionPipelineSafe.from_pretrained("AIML-TUDA/stable-diffusion-safe") ->>> prompt = "the four horsewomen of the apocalypse, painting by tom of finland, gaston bussiere, craig mullins, j. c. leyendecker" ->>> out = pipeline(prompt=prompt, **SafetyConfig.MAX) -``` - - - -Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! - - - -## StableDiffusionPipelineSafe - -[[autodoc]] StableDiffusionPipelineSafe - - all - - __call__ - -## StableDiffusionSafePipelineOutput - -[[autodoc]] pipelines.stable_diffusion_safe.StableDiffusionSafePipelineOutput - - all - - __call__ diff --git a/diffusers/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md b/diffusers/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md deleted file mode 100644 index c5433c0783ba50aecde4d87a4924656b7a3bd865..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md +++ /dev/null @@ -1,55 +0,0 @@ - - -# Stable Diffusion XL - -Stable Diffusion XL (SDXL) was proposed in [SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis](https://huggingface.co/papers/2307.01952) by Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. - -The abstract from the paper is: - -*We present SDXL, a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone: The increase of model parameters is mainly due to more attention blocks and a larger cross-attention context as SDXL uses a second text encoder. We design multiple novel conditioning schemes and train SDXL on multiple aspect ratios. We also introduce a refinement model which is used to improve the visual fidelity of samples generated by SDXL using a post-hoc image-to-image technique. 
We demonstrate that SDXL shows drastically improved performance compared the previous versions of Stable Diffusion and achieves results competitive with those of black-box state-of-the-art image generators.* - -## Tips - -- Using SDXL with a DPM++ scheduler for less than 50 steps is known to produce [visual artifacts](https://github.com/huggingface/diffusers/issues/5433) because the solver becomes numerically unstable. To fix this issue, take a look at this [PR](https://github.com/huggingface/diffusers/pull/5541) which recommends for ODE/SDE solvers: - - set `use_karras_sigmas=True` or `lu_lambdas=True` to improve image quality - - set `euler_at_final=True` if you're using a solver with uniform step sizes (DPM++2M or DPM++2M SDE) -- Most SDXL checkpoints work best with an image size of 1024x1024. Image sizes of 768x768 and 512x512 are also supported, but the results aren't as good. Anything below 512x512 is not recommended and likely won't be for default checkpoints like [stabilityai/stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0). -- SDXL can pass a different prompt for each of the text encoders it was trained on. We can even pass different parts of the same prompt to the text encoders. -- SDXL output images can be improved by making use of a refiner model in an image-to-image setting. -- SDXL offers `negative_original_size`, `negative_crops_coords_top_left`, and `negative_target_size` to negatively condition the model on image resolution and cropping parameters. - - - -To learn how to use SDXL for various tasks, how to optimize performance, and other usage examples, take a look at the [Stable Diffusion XL](../../../using-diffusers/sdxl) guide. - -Check out the [Stability AI](https://huggingface.co/stabilityai) Hub organization for the official base and refiner model checkpoints! - - - -## StableDiffusionXLPipeline - -[[autodoc]] StableDiffusionXLPipeline - - all - - __call__ - -## StableDiffusionXLImg2ImgPipeline - -[[autodoc]] StableDiffusionXLImg2ImgPipeline - - all - - __call__ - -## StableDiffusionXLInpaintPipeline - -[[autodoc]] StableDiffusionXLInpaintPipeline - - all - - __call__ diff --git a/diffusers/docs/source/en/api/pipelines/stable_diffusion/svd.md b/diffusers/docs/source/en/api/pipelines/stable_diffusion/svd.md deleted file mode 100644 index 87a9c2a5be869acc0820a30f99b62f8855daa2bb..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/stable_diffusion/svd.md +++ /dev/null @@ -1,43 +0,0 @@ - - -# Stable Video Diffusion - -Stable Video Diffusion was proposed in [Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets](https://hf.co/papers/2311.15127) by Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, Robin Rombach. - -The abstract from the paper is: - -*We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation. Recently, latent diffusion models trained for 2D image synthesis have been turned into generative video models by inserting temporal layers and finetuning them on small, high-quality video datasets. However, training methods in the literature vary widely, and the field has yet to agree on a unified strategy for curating video data. 
In this paper, we identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning. Furthermore, we demonstrate the necessity of a well-curated pretraining dataset for generating high-quality videos and present a systematic curation process to train a strong base model, including captioning and filtering strategies. We then explore the impact of finetuning our base model on high-quality data and train a text-to-video model that is competitive with closed-source video generation. We also show that our base model provides a powerful motion representation for downstream tasks such as image-to-video generation and adaptability to camera motion-specific LoRA modules. Finally, we demonstrate that our model provides a strong multi-view 3D-prior and can serve as a base to finetune a multi-view diffusion model that jointly generates multiple views of objects in a feedforward fashion, outperforming image-based methods at a fraction of their compute budget. We release code and model weights at this https URL.* - - - -To learn how to use Stable Video Diffusion, take a look at the [Stable Video Diffusion](../../../using-diffusers/svd) guide. - -
- -Check out the [Stability AI](https://huggingface.co/stabilityai) Hub organization for the [base](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid) and [extended frame](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt) checkpoints! - -
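For quick reference, here is a minimal image-to-video sketch with [`StableVideoDiffusionPipeline`]. The conditioning image path is a placeholder, and the `decode_chunk_size` and `fps` values are reasonable starting points rather than settings prescribed by this page:

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
)
pipe.enable_model_cpu_offload()
# Optionally chunk the UNet feedforward layers to reduce peak memory (see the Tips below).
# pipe.unet.enable_forward_chunking()

# Conditioning frame (placeholder path); the SVD checkpoints expect roughly 1024x576 inputs.
image = load_image("path/to/conditioning_frame.png").resize((1024, 576))

# Decoding the latents in chunks keeps VRAM usage down.
frames = pipe(image, decode_chunk_size=8, generator=torch.manual_seed(42)).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```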
- -## Tips - -Video generation is memory-intensive and one way to reduce your memory usage is to set `enable_forward_chunking` on the pipeline's UNet so you don't run the entire feedforward layer at once. Breaking it up into chunks in a loop is more efficient. - -Check out the [Text or image-to-video](text-img2vid) guide for more details about how certain parameters can affect video generation and how to optimize inference by reducing memory usage. - -## StableVideoDiffusionPipeline - -[[autodoc]] StableVideoDiffusionPipeline - -## StableVideoDiffusionPipelineOutput - -[[autodoc]] pipelines.stable_video_diffusion.StableVideoDiffusionPipelineOutput diff --git a/diffusers/docs/source/en/api/pipelines/stable_diffusion/text2img.md b/diffusers/docs/source/en/api/pipelines/stable_diffusion/text2img.md deleted file mode 100644 index 86f3090fe9fd13314a338e6806912efb35ed93f5..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/stable_diffusion/text2img.md +++ /dev/null @@ -1,59 +0,0 @@ - - -# Text-to-image - -The Stable Diffusion model was created by researchers and engineers from [CompVis](https://github.com/CompVis), [Stability AI](https://stability.ai/), [Runway](https://github.com/runwayml), and [LAION](https://laion.ai/). The [`StableDiffusionPipeline`] is capable of generating photorealistic images given any text input. It's trained on 512x512 images from a subset of the LAION-5B dataset. This model uses a frozen CLIP ViT-L/14 text encoder to condition the model on text prompts. With its 860M UNet and 123M text encoder, the model is relatively lightweight and can run on consumer GPUs. Latent diffusion is the research on top of which Stable Diffusion was built. It was proposed in [High-Resolution Image Synthesis with Latent Diffusion Models](https://huggingface.co/papers/2112.10752) by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer. - -The abstract from the paper is: - -*By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs. 
Code is available at https://github.com/CompVis/latent-diffusion.* - - - -Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! - -If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations! - - - -## StableDiffusionPipeline - -[[autodoc]] StableDiffusionPipeline - - all - - __call__ - - enable_attention_slicing - - disable_attention_slicing - - enable_vae_slicing - - disable_vae_slicing - - enable_xformers_memory_efficient_attention - - disable_xformers_memory_efficient_attention - - enable_vae_tiling - - disable_vae_tiling - - load_textual_inversion - - from_single_file - - load_lora_weights - - save_lora_weights - -## StableDiffusionPipelineOutput - -[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput - -## FlaxStableDiffusionPipeline - -[[autodoc]] FlaxStableDiffusionPipeline - - all - - __call__ - -## FlaxStableDiffusionPipelineOutput - -[[autodoc]] pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput diff --git a/diffusers/docs/source/en/api/pipelines/stable_diffusion/upscale.md b/diffusers/docs/source/en/api/pipelines/stable_diffusion/upscale.md deleted file mode 100644 index b188c29bff6ba3a7b792d05b5c5095a9bbb36f69..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/stable_diffusion/upscale.md +++ /dev/null @@ -1,37 +0,0 @@ - - -# Super-resolution - -The Stable Diffusion upscaler diffusion model was created by the researchers and engineers from [CompVis](https://github.com/CompVis), [Stability AI](https://stability.ai/), and [LAION](https://laion.ai/). It is used to enhance the resolution of input images by a factor of 4. - - - -Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! - -If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations! - - - -## StableDiffusionUpscalePipeline - -[[autodoc]] StableDiffusionUpscalePipeline - - all - - __call__ - - enable_attention_slicing - - disable_attention_slicing - - enable_xformers_memory_efficient_attention - - disable_xformers_memory_efficient_attention - -## StableDiffusionPipelineOutput - -[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput diff --git a/diffusers/docs/source/en/api/pipelines/stable_unclip.md b/diffusers/docs/source/en/api/pipelines/stable_unclip.md deleted file mode 100644 index 3067ba91f752cf60cc2f6755be3cb15b45d8257c..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/stable_unclip.md +++ /dev/null @@ -1,129 +0,0 @@ - - -# Stable unCLIP - -Stable unCLIP checkpoints are finetuned from [Stable Diffusion 2.1](./stable_diffusion/stable_diffusion_2) checkpoints to condition on CLIP image embeddings. -Stable unCLIP still conditions on text embeddings. Given the two separate conditionings, stable unCLIP can be used -for text guided image variation. 
When combined with an unCLIP prior, it can also be used for full text to image generation. - -The abstract from the paper is: - -*Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.* - -## Tips - -Stable unCLIP takes `noise_level` as input during inference which determines how much noise is added to the image embeddings. A higher `noise_level` increases variation in the final un-noised images. By default, we do not add any additional noise to the image embeddings (`noise_level = 0`). - -### Text-to-Image Generation -Stable unCLIP can be leveraged for text-to-image generation by pipelining it with the prior model of KakaoBrain's open source DALL-E 2 replication [Karlo](https://huggingface.co/kakaobrain/karlo-v1-alpha): - -```python -import torch -from diffusers import UnCLIPScheduler, DDPMScheduler, StableUnCLIPPipeline -from diffusers.models import PriorTransformer -from transformers import CLIPTokenizer, CLIPTextModelWithProjection - -prior_model_id = "kakaobrain/karlo-v1-alpha" -data_type = torch.float16 -prior = PriorTransformer.from_pretrained(prior_model_id, subfolder="prior", torch_dtype=data_type) - -prior_text_model_id = "openai/clip-vit-large-patch14" -prior_tokenizer = CLIPTokenizer.from_pretrained(prior_text_model_id) -prior_text_model = CLIPTextModelWithProjection.from_pretrained(prior_text_model_id, torch_dtype=data_type) -prior_scheduler = UnCLIPScheduler.from_pretrained(prior_model_id, subfolder="prior_scheduler") -prior_scheduler = DDPMScheduler.from_config(prior_scheduler.config) - -stable_unclip_model_id = "stabilityai/stable-diffusion-2-1-unclip-small" - -pipe = StableUnCLIPPipeline.from_pretrained( - stable_unclip_model_id, - torch_dtype=data_type, - variant="fp16", - prior_tokenizer=prior_tokenizer, - prior_text_encoder=prior_text_model, - prior=prior, - prior_scheduler=prior_scheduler, -) - -pipe = pipe.to("cuda") -wave_prompt = "dramatic wave, the Oceans roar, Strong wave spiral across the oceans as the waves unfurl into roaring crests; perfect wave form; perfect wave shape; dramatic wave shape; wave shape unbelievable; wave; wave shape spectacular" - -image = pipe(prompt=wave_prompt).images[0] -image -``` - - -For text-to-image we use `stabilityai/stable-diffusion-2-1-unclip-small` as it was trained on CLIP ViT-L/14 embedding, the same as the Karlo model prior. [stabilityai/stable-diffusion-2-1-unclip](https://hf.co/stabilityai/stable-diffusion-2-1-unclip) was trained on OpenCLIP ViT-H, so we don't recommend its use. 
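The `noise_level` argument described in the Tips above is passed at call time. Continuing the text-to-image example (this reuses `pipe` and `wave_prompt` from the snippet above, and the value `100` is purely illustrative):

```python
# Add extra noise to the CLIP image embedding to increase variation in the result.
# The default is noise_level=0, i.e. no additional noise.
image = pipe(prompt=wave_prompt, noise_level=100).images[0]
image
```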
- - - -### Text guided Image-to-Image Variation - -```python -from diffusers import StableUnCLIPImg2ImgPipeline -from diffusers.utils import load_image -import torch - -pipe = StableUnCLIPImg2ImgPipeline.from_pretrained( - "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16, variation="fp16" -) -pipe = pipe.to("cuda") - -url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/stable_unclip/tarsila_do_amaral.png" -init_image = load_image(url) - -images = pipe(init_image).images -images[0].save("variation_image.png") -``` - -Optionally, you can also pass a prompt to `pipe` such as: - -```python -prompt = "A fantasy landscape, trending on artstation" - -image = pipe(init_image, prompt=prompt).images[0] -image -``` - - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. - - - -## StableUnCLIPPipeline - -[[autodoc]] StableUnCLIPPipeline - - all - - __call__ - - enable_attention_slicing - - disable_attention_slicing - - enable_vae_slicing - - disable_vae_slicing - - enable_xformers_memory_efficient_attention - - disable_xformers_memory_efficient_attention - -## StableUnCLIPImg2ImgPipeline - -[[autodoc]] StableUnCLIPImg2ImgPipeline - - all - - __call__ - - enable_attention_slicing - - disable_attention_slicing - - enable_vae_slicing - - disable_vae_slicing - - enable_xformers_memory_efficient_attention - - disable_xformers_memory_efficient_attention - -## ImagePipelineOutput -[[autodoc]] pipelines.ImagePipelineOutput diff --git a/diffusers/docs/source/en/api/pipelines/text_to_video.md b/diffusers/docs/source/en/api/pipelines/text_to_video.md deleted file mode 100644 index 7522264e0b58d419986a9722699d2555ed7eaf27..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/text_to_video.md +++ /dev/null @@ -1,193 +0,0 @@ - - - - -🧪 This pipeline is for research purposes only. - - - -# Text-to-video - -[ModelScope Text-to-Video Technical Report](https://arxiv.org/abs/2308.06571) is by Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, Shiwei Zhang. - -The abstract from the paper is: - -*This paper introduces ModelScopeT2V, a text-to-video synthesis model that evolves from a text-to-image synthesis model (i.e., Stable Diffusion). ModelScopeT2V incorporates spatio-temporal blocks to ensure consistent frame generation and smooth movement transitions. The model could adapt to varying frame numbers during training and inference, rendering it suitable for both image-text and video-text datasets. ModelScopeT2V brings together three components (i.e., VQGAN, a text encoder, and a denoising UNet), totally comprising 1.7 billion parameters, in which 0.5 billion parameters are dedicated to temporal capabilities. The model demonstrates superior performance over state-of-the-art methods across three evaluation metrics. 
The code and an online demo are available at https://modelscope.cn/models/damo/text-to-video-synthesis/summary.* - -You can find additional information about Text-to-Video on the [project page](https://modelscope.cn/models/damo/text-to-video-synthesis/summary), [original codebase](https://github.com/modelscope/modelscope/), and try it out in a [demo](https://huggingface.co/spaces/damo-vilab/modelscope-text-to-video-synthesis). Official checkpoints can be found at [damo-vilab](https://huggingface.co/damo-vilab) and [cerspense](https://huggingface.co/cerspense). - -## Usage example - -### `text-to-video-ms-1.7b` - -Let's start by generating a short video with the default length of 16 frames (2s at 8 fps): - -```python -import torch -from diffusers import DiffusionPipeline -from diffusers.utils import export_to_video - -pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16") -pipe = pipe.to("cuda") - -prompt = "Spiderman is surfing" -video_frames = pipe(prompt).frames[0] -video_path = export_to_video(video_frames) -video_path -``` - -Diffusers supports different optimization techniques to improve the latency -and memory footprint of a pipeline. Since videos are often more memory-heavy than images, -we can enable CPU offloading and VAE slicing to keep the memory footprint at bay. - -Let's generate a video of 8 seconds (64 frames) on the same GPU using CPU offloading and VAE slicing: - -```python -import torch -from diffusers import DiffusionPipeline -from diffusers.utils import export_to_video - -pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16") -pipe.enable_model_cpu_offload() - -# memory optimization -pipe.enable_vae_slicing() - -prompt = "Darth Vader surfing a wave" -video_frames = pipe(prompt, num_frames=64).frames[0] -video_path = export_to_video(video_frames) -video_path -``` - -It just takes **7 GBs of GPU memory** to generate the 64 video frames using PyTorch 2.0, "fp16" precision and the techniques mentioned above. - -We can also use a different scheduler easily, using the same method we'd use for Stable Diffusion: - -```python -import torch -from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler -from diffusers.utils import export_to_video - -pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16") -pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config) -pipe.enable_model_cpu_offload() - -prompt = "Spiderman is surfing" -video_frames = pipe(prompt, num_inference_steps=25).frames[0] -video_path = export_to_video(video_frames) -video_path -``` - -Here are some sample outputs: - - - - - - -
- Sample outputs: "An astronaut riding a horse." and "Darth vader surfing in waves." (embedded media omitted)
-
-### `cerspense/zeroscope_v2_576w` & `cerspense/zeroscope_v2_XL`
-
-The Zeroscope checkpoints are watermark-free and have been trained on specific sizes such as `576x320` and `1024x576`.
-One should first generate a video using the lower resolution checkpoint [`cerspense/zeroscope_v2_576w`](https://huggingface.co/cerspense/zeroscope_v2_576w) with [`TextToVideoSDPipeline`],
-which can then be upscaled using [`VideoToVideoSDPipeline`] and [`cerspense/zeroscope_v2_XL`](https://huggingface.co/cerspense/zeroscope_v2_XL).
-
-
-```py
-import torch
-from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
-from diffusers.utils import export_to_video
-from PIL import Image
-
-pipe = DiffusionPipeline.from_pretrained("cerspense/zeroscope_v2_576w", torch_dtype=torch.float16)
-pipe.enable_model_cpu_offload()
-
-# memory optimization
-pipe.unet.enable_forward_chunking(chunk_size=1, dim=1)
-pipe.enable_vae_slicing()
-
-prompt = "Darth Vader surfing a wave"
-video_frames = pipe(prompt, num_frames=24).frames[0]
-video_path = export_to_video(video_frames)
-video_path
-```
-
-Now the video can be upscaled:
-
-```py
-pipe = DiffusionPipeline.from_pretrained("cerspense/zeroscope_v2_XL", torch_dtype=torch.float16)
-pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
-pipe.enable_model_cpu_offload()
-
-# memory optimization
-pipe.unet.enable_forward_chunking(chunk_size=1, dim=1)
-pipe.enable_vae_slicing()
-
-video = [Image.fromarray(frame).resize((1024, 576)) for frame in video_frames]
-
-video_frames = pipe(prompt, video=video, strength=0.6).frames[0]
-video_path = export_to_video(video_frames)
-video_path
-```
-
-Here are some sample outputs:
-
- Sample outputs: "Darth vader surfing in waves." (embedded media omitted)
- -## Tips - -Video generation is memory-intensive and one way to reduce your memory usage is to set `enable_forward_chunking` on the pipeline's UNet so you don't run the entire feedforward layer at once. Breaking it up into chunks in a loop is more efficient. - -Check out the [Text or image-to-video](text-img2vid) guide for more details about how certain parameters can affect video generation and how to optimize inference by reducing memory usage. - - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. - - - -## TextToVideoSDPipeline -[[autodoc]] TextToVideoSDPipeline - - all - - __call__ - -## VideoToVideoSDPipeline -[[autodoc]] VideoToVideoSDPipeline - - all - - __call__ - -## TextToVideoSDPipelineOutput -[[autodoc]] pipelines.text_to_video_synthesis.TextToVideoSDPipelineOutput diff --git a/diffusers/docs/source/en/api/pipelines/text_to_video_zero.md b/diffusers/docs/source/en/api/pipelines/text_to_video_zero.md deleted file mode 100644 index c6bf30fed7af81984729508e61e8669b97d1b49a..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/text_to_video_zero.md +++ /dev/null @@ -1,302 +0,0 @@ - - -# Text2Video-Zero - -[Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators](https://huggingface.co/papers/2303.13439) is by Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, [Zhangyang Wang](https://www.ece.utexas.edu/people/faculty/atlas-wang), Shant Navasardyan, [Humphrey Shi](https://www.humphreyshi.com). - -Text2Video-Zero enables zero-shot video generation using either: -1. A textual prompt -2. A prompt combined with guidance from poses or edges -3. Video Instruct-Pix2Pix (instruction-guided video editing) - -Results are temporally consistent and closely follow the guidance and textual prompts. - -![teaser-img](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/t2v_zero_teaser.png) - -The abstract from the paper is: - -*Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets. In this paper, we introduce a new task of zero-shot text-to-video generation and propose a low-cost approach (without any training or optimization) by leveraging the power of existing text-to-image synthesis methods (e.g., Stable Diffusion), making them suitable for the video domain. -Our key modifications include (i) enriching the latent codes of the generated frames with motion dynamics to keep the global scene and the background time consistent; and (ii) reprogramming frame-level self-attention using a new cross-frame attention of each frame on the first frame, to preserve the context, appearance, and identity of the foreground object. -Experiments show that this leads to low overhead, yet high-quality and remarkably consistent video generation. Moreover, our approach is not limited to text-to-video synthesis but is also applicable to other tasks such as conditional and content-specialized video generation, and Video Instruct-Pix2Pix, i.e., instruction-guided video editing. 
-As experiments show, our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.* - -You can find additional information about Text2Video-Zero on the [project page](https://text2video-zero.github.io/), [paper](https://arxiv.org/abs/2303.13439), and [original codebase](https://github.com/Picsart-AI-Research/Text2Video-Zero). - -## Usage example - -### Text-To-Video - -To generate a video from prompt, run the following Python code: -```python -import torch -from diffusers import TextToVideoZeroPipeline -import imageio - -model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5" -pipe = TextToVideoZeroPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda") - -prompt = "A panda is playing guitar on times square" -result = pipe(prompt=prompt).images -result = [(r * 255).astype("uint8") for r in result] -imageio.mimsave("video.mp4", result, fps=4) -``` -You can change these parameters in the pipeline call: -* Motion field strength (see the [paper](https://arxiv.org/abs/2303.13439), Sect. 3.3.1): - * `motion_field_strength_x` and `motion_field_strength_y`. Default: `motion_field_strength_x=12`, `motion_field_strength_y=12` -* `T` and `T'` (see the [paper](https://arxiv.org/abs/2303.13439), Sect. 3.3.1) - * `t0` and `t1` in the range `{0, ..., num_inference_steps}`. Default: `t0=45`, `t1=48` -* Video length: - * `video_length`, the number of frames video_length to be generated. Default: `video_length=8` - -We can also generate longer videos by doing the processing in a chunk-by-chunk manner: -```python -import torch -from diffusers import TextToVideoZeroPipeline -import numpy as np - -model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5" -pipe = TextToVideoZeroPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda") -seed = 0 -video_length = 24 #24 ÷ 4fps = 6 seconds -chunk_size = 8 -prompt = "A panda is playing guitar on times square" - -# Generate the video chunk-by-chunk -result = [] -chunk_ids = np.arange(0, video_length, chunk_size - 1) -generator = torch.Generator(device="cuda") -for i in range(len(chunk_ids)): - print(f"Processing chunk {i + 1} / {len(chunk_ids)}") - ch_start = chunk_ids[i] - ch_end = video_length if i == len(chunk_ids) - 1 else chunk_ids[i + 1] - # Attach the first frame for Cross Frame Attention - frame_ids = [0] + list(range(ch_start, ch_end)) - # Fix the seed for the temporal consistency - generator.manual_seed(seed) - output = pipe(prompt=prompt, video_length=len(frame_ids), generator=generator, frame_ids=frame_ids) - result.append(output.images[1:]) - -# Concatenate chunks and save -result = np.concatenate(result) -result = [(r * 255).astype("uint8") for r in result] -imageio.mimsave("video.mp4", result, fps=4) -``` - - -- #### SDXL Support -In order to use the SDXL model when generating a video from prompt, use the `TextToVideoZeroSDXLPipeline` pipeline: - -```python -import torch -from diffusers import TextToVideoZeroSDXLPipeline - -model_id = "stabilityai/stable-diffusion-xl-base-1.0" -pipe = TextToVideoZeroSDXLPipeline.from_pretrained( - model_id, torch_dtype=torch.float16, variant="fp16", use_safetensors=True -).to("cuda") -``` - -### Text-To-Video with Pose Control -To generate a video from prompt with additional pose control - -1. 
Download a demo video - - ```python - from huggingface_hub import hf_hub_download - - filename = "__assets__/poses_skeleton_gifs/dance1_corr.mp4" - repo_id = "PAIR/Text2Video-Zero" - video_path = hf_hub_download(repo_type="space", repo_id=repo_id, filename=filename) - ``` - - -2. Read video containing extracted pose images - ```python - from PIL import Image - import imageio - - reader = imageio.get_reader(video_path, "ffmpeg") - frame_count = 8 - pose_images = [Image.fromarray(reader.get_data(i)) for i in range(frame_count)] - ``` - To extract pose from actual video, read [ControlNet documentation](controlnet). - -3. Run `StableDiffusionControlNetPipeline` with our custom attention processor - - ```python - import torch - from diffusers import StableDiffusionControlNetPipeline, ControlNetModel - from diffusers.pipelines.text_to_video_synthesis.pipeline_text_to_video_zero import CrossFrameAttnProcessor - - model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5" - controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16) - pipe = StableDiffusionControlNetPipeline.from_pretrained( - model_id, controlnet=controlnet, torch_dtype=torch.float16 - ).to("cuda") - - # Set the attention processor - pipe.unet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2)) - pipe.controlnet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2)) - - # fix latents for all frames - latents = torch.randn((1, 4, 64, 64), device="cuda", dtype=torch.float16).repeat(len(pose_images), 1, 1, 1) - - prompt = "Darth Vader dancing in a desert" - result = pipe(prompt=[prompt] * len(pose_images), image=pose_images, latents=latents).images - imageio.mimsave("video.mp4", result, fps=4) - ``` -- #### SDXL Support - - Since our attention processor also works with SDXL, it can be utilized to generate a video from prompt using ControlNet models powered by SDXL: - ```python - import torch - from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel - from diffusers.pipelines.text_to_video_synthesis.pipeline_text_to_video_zero import CrossFrameAttnProcessor - - controlnet_model_id = 'thibaud/controlnet-openpose-sdxl-1.0' - model_id = 'stabilityai/stable-diffusion-xl-base-1.0' - - controlnet = ControlNetModel.from_pretrained(controlnet_model_id, torch_dtype=torch.float16) - pipe = StableDiffusionControlNetPipeline.from_pretrained( - model_id, controlnet=controlnet, torch_dtype=torch.float16 - ).to('cuda') - - # Set the attention processor - pipe.unet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2)) - pipe.controlnet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2)) - - # fix latents for all frames - latents = torch.randn((1, 4, 128, 128), device="cuda", dtype=torch.float16).repeat(len(pose_images), 1, 1, 1) - - prompt = "Darth Vader dancing in a desert" - result = pipe(prompt=[prompt] * len(pose_images), image=pose_images, latents=latents).images - imageio.mimsave("video.mp4", result, fps=4) - ``` - -### Text-To-Video with Edge Control - -To generate a video from prompt with additional Canny edge control, follow the same steps described above for pose-guided generation using [Canny edge ControlNet model](https://huggingface.co/lllyasviel/sd-controlnet-canny). - - -### Video Instruct-Pix2Pix - -To perform text-guided video editing (with [InstructPix2Pix](pix2pix)): - -1. 
Download a demo video - - ```python - from huggingface_hub import hf_hub_download - - filename = "__assets__/pix2pix video/camel.mp4" - repo_id = "PAIR/Text2Video-Zero" - video_path = hf_hub_download(repo_type="space", repo_id=repo_id, filename=filename) - ``` - -2. Read video from path - ```python - from PIL import Image - import imageio - - reader = imageio.get_reader(video_path, "ffmpeg") - frame_count = 8 - video = [Image.fromarray(reader.get_data(i)) for i in range(frame_count)] - ``` - -3. Run `StableDiffusionInstructPix2PixPipeline` with our custom attention processor - ```python - import torch - from diffusers import StableDiffusionInstructPix2PixPipeline - from diffusers.pipelines.text_to_video_synthesis.pipeline_text_to_video_zero import CrossFrameAttnProcessor - - model_id = "timbrooks/instruct-pix2pix" - pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda") - pipe.unet.set_attn_processor(CrossFrameAttnProcessor(batch_size=3)) - - prompt = "make it Van Gogh Starry Night style" - result = pipe(prompt=[prompt] * len(video), image=video).images - imageio.mimsave("edited_video.mp4", result, fps=4) - ``` - - -### DreamBooth specialization - -Methods **Text-To-Video**, **Text-To-Video with Pose Control** and **Text-To-Video with Edge Control** -can run with custom [DreamBooth](../../training/dreambooth) models, as shown below for -[Canny edge ControlNet model](https://huggingface.co/lllyasviel/sd-controlnet-canny) and -[Avatar style DreamBooth](https://huggingface.co/PAIR/text2video-zero-controlnet-canny-avatar) model: - -1. Download a demo video - - ```python - from huggingface_hub import hf_hub_download - - filename = "__assets__/canny_videos_mp4/girl_turning.mp4" - repo_id = "PAIR/Text2Video-Zero" - video_path = hf_hub_download(repo_type="space", repo_id=repo_id, filename=filename) - ``` - -2. Read video from path - ```python - from PIL import Image - import imageio - - reader = imageio.get_reader(video_path, "ffmpeg") - frame_count = 8 - canny_edges = [Image.fromarray(reader.get_data(i)) for i in range(frame_count)] - ``` - -3. Run `StableDiffusionControlNetPipeline` with custom trained DreamBooth model - ```python - import torch - from diffusers import StableDiffusionControlNetPipeline, ControlNetModel - from diffusers.pipelines.text_to_video_synthesis.pipeline_text_to_video_zero import CrossFrameAttnProcessor - - # set model id to custom model - model_id = "PAIR/text2video-zero-controlnet-canny-avatar" - controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16) - pipe = StableDiffusionControlNetPipeline.from_pretrained( - model_id, controlnet=controlnet, torch_dtype=torch.float16 - ).to("cuda") - - # Set the attention processor - pipe.unet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2)) - pipe.controlnet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2)) - - # fix latents for all frames - latents = torch.randn((1, 4, 64, 64), device="cuda", dtype=torch.float16).repeat(len(canny_edges), 1, 1, 1) - - prompt = "oil painting of a beautiful girl avatar style" - result = pipe(prompt=[prompt] * len(canny_edges), image=canny_edges, latents=latents).images - imageio.mimsave("video.mp4", result, fps=4) - ``` - -You can filter out some available DreamBooth-trained models with [this link](https://huggingface.co/models?search=dreambooth). 
- - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. - - - -## TextToVideoZeroPipeline -[[autodoc]] TextToVideoZeroPipeline - - all - - __call__ - -## TextToVideoZeroSDXLPipeline -[[autodoc]] TextToVideoZeroSDXLPipeline - - all - - __call__ - -## TextToVideoPipelineOutput -[[autodoc]] pipelines.text_to_video_synthesis.pipeline_text_to_video_zero.TextToVideoPipelineOutput diff --git a/diffusers/docs/source/en/api/pipelines/unclip.md b/diffusers/docs/source/en/api/pipelines/unclip.md deleted file mode 100644 index f379ffd63f536358b1e954a07e06b0f09ddb09a9..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/unclip.md +++ /dev/null @@ -1,37 +0,0 @@ - - -# unCLIP - -[Hierarchical Text-Conditional Image Generation with CLIP Latents](https://huggingface.co/papers/2204.06125) is by Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen. The unCLIP model in 🤗 Diffusers comes from kakaobrain's [karlo](https://github.com/kakaobrain/karlo). - -The abstract from the paper is following: - -*Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.* - -You can find lucidrains' DALL-E 2 recreation at [lucidrains/DALLE2-pytorch](https://github.com/lucidrains/DALLE2-pytorch). - - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. 
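For reference, here is a minimal text-to-image sketch with [`UnCLIPPipeline`] and the kakaobrain karlo checkpoint mentioned above; the prompt and the fp16 setting are illustrative choices, not requirements:

```python
import torch
from diffusers import UnCLIPPipeline

pipe = UnCLIPPipeline.from_pretrained("kakaobrain/karlo-v1-alpha", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "a high-resolution photograph of a big red frog on a green leaf"
image = pipe(prompt).images[0]
image.save("frog.png")
```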
- - - -## UnCLIPPipeline -[[autodoc]] UnCLIPPipeline - - all - - __call__ - -## UnCLIPImageVariationPipeline -[[autodoc]] UnCLIPImageVariationPipeline - - all - - __call__ - -## ImagePipelineOutput -[[autodoc]] pipelines.ImagePipelineOutput diff --git a/diffusers/docs/source/en/api/pipelines/unidiffuser.md b/diffusers/docs/source/en/api/pipelines/unidiffuser.md deleted file mode 100644 index 553a6d30015258970514c61d4b9b3cc65e4afa8a..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/unidiffuser.md +++ /dev/null @@ -1,205 +0,0 @@ - - -# UniDiffuser - -The UniDiffuser model was proposed in [One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale](https://huggingface.co/papers/2303.06555) by Fan Bao, Shen Nie, Kaiwen Xue, Chongxuan Li, Shi Pu, Yaole Wang, Gang Yue, Yue Cao, Hang Su, Jun Zhu. - -The abstract from the paper is: - -*This paper proposes a unified diffusion framework (dubbed UniDiffuser) to fit all distributions relevant to a set of multi-modal data in one model. Our key insight is -- learning diffusion models for marginal, conditional, and joint distributions can be unified as predicting the noise in the perturbed data, where the perturbation levels (i.e. timesteps) can be different for different modalities. Inspired by the unified view, UniDiffuser learns all distributions simultaneously with a minimal modification to the original diffusion model -- perturbs data in all modalities instead of a single modality, inputs individual timesteps in different modalities, and predicts the noise of all modalities instead of a single modality. UniDiffuser is parameterized by a transformer for diffusion models to handle input types of different modalities. Implemented on large-scale paired image-text data, UniDiffuser is able to perform image, text, text-to-image, image-to-text, and image-text pair generation by setting proper timesteps without additional overhead. In particular, UniDiffuser is able to produce perceptually realistic samples in all tasks and its quantitative results (e.g., the FID and CLIP score) are not only superior to existing general-purpose models but also comparable to the bespoken models (e.g., Stable Diffusion and DALL-E 2) in representative tasks (e.g., text-to-image generation).* - -You can find the original codebase at [thu-ml/unidiffuser](https://github.com/thu-ml/unidiffuser) and additional checkpoints at [thu-ml](https://huggingface.co/thu-ml). - - - -There is currently an issue on PyTorch 1.X where the output images are all black or the pixel values become `NaNs`. This issue can be mitigated by switching to PyTorch 2.X. - - - -This pipeline was contributed by [dg845](https://github.com/dg845). ❤️ - -## Usage Examples - -Because the UniDiffuser model is trained to model the joint distribution of (image, text) pairs, it is capable of performing a diverse range of generation tasks: - -### Unconditional Image and Text Generation - -Unconditional generation (where we start from only latents sampled from a standard Gaussian prior) from a [`UniDiffuserPipeline`] will produce a (image, text) pair: - -```python -import torch - -from diffusers import UniDiffuserPipeline - -device = "cuda" -model_id_or_path = "thu-ml/unidiffuser-v1" -pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16) -pipe.to(device) - -# Unconditional image and text generation. The generation task is automatically inferred. 
-sample = pipe(num_inference_steps=20, guidance_scale=8.0) -image = sample.images[0] -text = sample.text[0] -image.save("unidiffuser_joint_sample_image.png") -print(text) -``` - -This is also called "joint" generation in the UniDiffuser paper, since we are sampling from the joint image-text distribution. - -Note that the generation task is inferred from the inputs used when calling the pipeline. -It is also possible to manually specify the unconditional generation task ("mode") manually with [`UniDiffuserPipeline.set_joint_mode`]: - -```python -# Equivalent to the above. -pipe.set_joint_mode() -sample = pipe(num_inference_steps=20, guidance_scale=8.0) -``` - -When the mode is set manually, subsequent calls to the pipeline will use the set mode without attempting to infer the mode. -You can reset the mode with [`UniDiffuserPipeline.reset_mode`], after which the pipeline will once again infer the mode. - -You can also generate only an image or only text (which the UniDiffuser paper calls "marginal" generation since we sample from the marginal distribution of images and text, respectively): - -```python -# Unlike other generation tasks, image-only and text-only generation don't use classifier-free guidance -# Image-only generation -pipe.set_image_mode() -sample_image = pipe(num_inference_steps=20).images[0] -# Text-only generation -pipe.set_text_mode() -sample_text = pipe(num_inference_steps=20).text[0] -``` - -### Text-to-Image Generation - -UniDiffuser is also capable of sampling from conditional distributions; that is, the distribution of images conditioned on a text prompt or the distribution of texts conditioned on an image. -Here is an example of sampling from the conditional image distribution (text-to-image generation or text-conditioned image generation): - -```python -import torch - -from diffusers import UniDiffuserPipeline - -device = "cuda" -model_id_or_path = "thu-ml/unidiffuser-v1" -pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16) -pipe.to(device) - -# Text-to-image generation -prompt = "an elephant under the sea" - -sample = pipe(prompt=prompt, num_inference_steps=20, guidance_scale=8.0) -t2i_image = sample.images[0] -t2i_image -``` - -The `text2img` mode requires that either an input `prompt` or `prompt_embeds` be supplied. You can set the `text2img` mode manually with [`UniDiffuserPipeline.set_text_to_image_mode`]. - -### Image-to-Text Generation - -Similarly, UniDiffuser can also produce text samples given an image (image-to-text or image-conditioned text generation): - -```python -import torch - -from diffusers import UniDiffuserPipeline -from diffusers.utils import load_image - -device = "cuda" -model_id_or_path = "thu-ml/unidiffuser-v1" -pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16) -pipe.to(device) - -# Image-to-text generation -image_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/unidiffuser/unidiffuser_example_image.jpg" -init_image = load_image(image_url).resize((512, 512)) - -sample = pipe(image=init_image, num_inference_steps=20, guidance_scale=8.0) -i2t_text = sample.text[0] -print(i2t_text) -``` - -The `img2text` mode requires that an input `image` be supplied. You can set the `img2text` mode manually with [`UniDiffuserPipeline.set_image_to_text_mode`]. 
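As a concrete sketch of that, mirroring the joint-mode example earlier and reusing the `pipe` and `init_image` objects defined above:

```python
# Explicitly select image-conditioned text generation instead of relying on mode inference.
pipe.set_image_to_text_mode()
sample = pipe(image=init_image, num_inference_steps=20, guidance_scale=8.0)
print(sample.text[0])
```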
- -### Image Variation - -The UniDiffuser authors suggest performing image variation through a "round-trip" generation method, where given an input image, we first perform an image-to-text generation, and then perform a text-to-image generation on the outputs of the first generation. -This produces a new image which is semantically similar to the input image: - -```python -import torch - -from diffusers import UniDiffuserPipeline -from diffusers.utils import load_image - -device = "cuda" -model_id_or_path = "thu-ml/unidiffuser-v1" -pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16) -pipe.to(device) - -# Image variation can be performed with an image-to-text generation followed by a text-to-image generation: -# 1. Image-to-text generation -image_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/unidiffuser/unidiffuser_example_image.jpg" -init_image = load_image(image_url).resize((512, 512)) - -sample = pipe(image=init_image, num_inference_steps=20, guidance_scale=8.0) -i2t_text = sample.text[0] -print(i2t_text) - -# 2. Text-to-image generation -sample = pipe(prompt=i2t_text, num_inference_steps=20, guidance_scale=8.0) -final_image = sample.images[0] -final_image.save("unidiffuser_image_variation_sample.png") -``` - -### Text Variation - -Similarly, text variation can be performed on an input prompt with a text-to-image generation followed by an image-to-text generation: - -```python -import torch - -from diffusers import UniDiffuserPipeline - -device = "cuda" -model_id_or_path = "thu-ml/unidiffuser-v1" -pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16) -pipe.to(device) - -# Text variation can be performed with a text-to-image generation followed by an image-to-text generation: -# 1. Text-to-image generation -prompt = "an elephant under the sea" - -sample = pipe(prompt=prompt, num_inference_steps=20, guidance_scale=8.0) -t2i_image = sample.images[0] -t2i_image.save("unidiffuser_text2img_sample_image.png") - -# 2. Image-to-text generation -sample = pipe(image=t2i_image, num_inference_steps=20, guidance_scale=8.0) -final_prompt = sample.text[0] -print(final_prompt) -``` - - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. - - - -## UniDiffuserPipeline -[[autodoc]] UniDiffuserPipeline - - all - - __call__ - -## ImageTextPipelineOutput -[[autodoc]] pipelines.ImageTextPipelineOutput diff --git a/diffusers/docs/source/en/api/pipelines/value_guided_sampling.md b/diffusers/docs/source/en/api/pipelines/value_guided_sampling.md deleted file mode 100644 index d21dbf04d7eeb6ec14f2c5a923bb18d955f32832..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/value_guided_sampling.md +++ /dev/null @@ -1,38 +0,0 @@ - - -# Value-guided planning - - - -🧪 This is an experimental pipeline for reinforcement learning! - - - -This pipeline is based on the [Planning with Diffusion for Flexible Behavior Synthesis](https://huggingface.co/papers/2205.09991) paper by Michael Janner, Yilun Du, Joshua B. Tenenbaum, Sergey Levine.
- -The abstract from the paper is: - -*Model-based reinforcement learning methods often use learning only for the purpose of estimating an approximate dynamics model, offloading the rest of the decision-making work to classical trajectory optimizers. While conceptually simple, this combination has a number of empirical shortcomings, suggesting that learned models may not be well-suited to standard trajectory optimization. In this paper, we consider what it would look like to fold as much of the trajectory optimization pipeline as possible into the modeling problem, such that sampling from the model and planning with it become nearly identical. The core of our technical approach lies in a diffusion probabilistic model that plans by iteratively denoising trajectories. We show how classifier-guided sampling and image inpainting can be reinterpreted as coherent planning strategies, explore the unusual and useful properties of diffusion-based planning methods, and demonstrate the effectiveness of our framework in control settings that emphasize long-horizon decision-making and test-time flexibility.* - -You can find additional information about the model on the [project page](https://diffusion-planning.github.io/), the [original codebase](https://github.com/jannerm/diffuser), or try it out in a demo [notebook](https://colab.research.google.com/drive/1rXm8CX4ZdN5qivjJ2lhwhkOmt_m0CvU0#scrollTo=6HXJvhyqcITc&uniqifier=1). - -The script to run the model is available [here](https://github.com/huggingface/diffusers/tree/main/examples/reinforcement_learning). - - - -Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. - - - -## ValueGuidedRLPipeline -[[autodoc]] diffusers.experimental.ValueGuidedRLPipeline diff --git a/diffusers/docs/source/en/api/pipelines/wuerstchen.md b/diffusers/docs/source/en/api/pipelines/wuerstchen.md deleted file mode 100644 index 4d90ad46dc6448193cc7402fbe0be4af8ebfbc4d..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/pipelines/wuerstchen.md +++ /dev/null @@ -1,163 +0,0 @@ - - -# Würstchen - - - -[Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models](https://huggingface.co/papers/2306.00637) is by Pablo Pernias, Dominic Rampas, Mats L. Richter and Christopher Pal and Marc Aubreville. - -The abstract from the paper is: - -*We introduce Würstchen, a novel architecture for text-to-image synthesis that combines competitive performance with unprecedented cost-effectiveness for large-scale text-to-image diffusion models. A key contribution of our work is to develop a latent diffusion technique in which we learn a detailed but extremely compact semantic image representation used to guide the diffusion process. This highly compressed representation of an image provides much more detailed guidance compared to latent representations of language and this significantly reduces the computational requirements to achieve state-of-the-art results. Our approach also improves the quality of text-conditioned image generation based on our user preference study. The training requirements of our approach consists of 24,602 A100-GPU hours - compared to Stable Diffusion 2.1's 200,000 GPU hours. 
Our approach also requires less training data to achieve these results. Furthermore, our compact latent representations allows us to perform inference over twice as fast, slashing the usual costs and carbon footprint of a state-of-the-art (SOTA) diffusion model significantly, without compromising the end performance. In a broader comparison against SOTA models our approach is substantially more efficient and compares favorably in terms of image quality. We believe that this work motivates more emphasis on the prioritization of both performance and computational accessibility.* - -## Würstchen Overview -Würstchen is a diffusion model, whose text-conditional model works in a highly compressed latent space of images. Why is this important? Compressing data can reduce computational costs for both training and inference by magnitudes. Training on 1024x1024 images is way more expensive than training on 32x32. Usually, other works make use of a relatively small compression, in the range of 4x - 8x spatial compression. Würstchen takes this to an extreme. Through its novel design, we achieve a 42x spatial compression. This was unseen before because common methods fail to faithfully reconstruct detailed images after 16x spatial compression. Würstchen employs a two-stage compression, what we call Stage A and Stage B. Stage A is a VQGAN, and Stage B is a Diffusion Autoencoder (more details can be found in the [paper](https://huggingface.co/papers/2306.00637)). A third model, Stage C, is learned in that highly compressed latent space. This training requires fractions of the compute used for current top-performing models, while also allowing cheaper and faster inference. - -## Würstchen v2 comes to Diffusers - -After the initial paper release, we have improved numerous things in the architecture, training and sampling, making Würstchen competitive to current state-of-the-art models in many ways. We are excited to release this new version together with Diffusers. Here is a list of the improvements. - -- Higher resolution (1024x1024 up to 2048x2048) -- Faster inference -- Multi Aspect Resolution Sampling -- Better quality - - -We are releasing 3 checkpoints for the text-conditional image generation model (Stage C). Those are: - -- v2-base -- v2-aesthetic -- **(default)** v2-interpolated (50% interpolation between v2-base and v2-aesthetic) - -We recommend using v2-interpolated, as it has a nice touch of both photorealism and aesthetics. Use v2-base for finetunings as it does not have a style bias and use v2-aesthetic for very artistic generations. -A comparison can be seen here: - - - -## Text-to-Image Generation - -For the sake of usability, Würstchen can be used with a single pipeline. This pipeline can be used as follows: - -```python -import torch -from diffusers import AutoPipelineForText2Image -from diffusers.pipelines.wuerstchen import DEFAULT_STAGE_C_TIMESTEPS - -pipe = AutoPipelineForText2Image.from_pretrained("warp-ai/wuerstchen", torch_dtype=torch.float16).to("cuda") - -caption = "Anthropomorphic cat dressed as a fire fighter" -images = pipe( - caption, - width=1024, - height=1536, - prior_timesteps=DEFAULT_STAGE_C_TIMESTEPS, - prior_guidance_scale=4.0, - num_images_per_prompt=2, -).images -``` - -For explanation purposes, we can also initialize the two main pipelines of Würstchen individually. Würstchen consists of 3 stages: Stage C, Stage B, Stage A. They all have different jobs and work only together. 
When generating text-conditional images, Stage C will first generate the latents in a very compressed latent space. This is what happens in the `prior_pipeline`. Afterwards, the generated latents will be passed to Stage B, which decompresses the latents into a bigger latent space of a VQGAN. These latents can then be decoded by Stage A, which is a VQGAN, into the pixel-space. Stage B & Stage A are both encapsulated in the `decoder_pipeline`. For more details, take a look at the [paper](https://huggingface.co/papers/2306.00637). - -```python -import torch -from diffusers import WuerstchenDecoderPipeline, WuerstchenPriorPipeline -from diffusers.pipelines.wuerstchen import DEFAULT_STAGE_C_TIMESTEPS - -device = "cuda" -dtype = torch.float16 -num_images_per_prompt = 2 - -prior_pipeline = WuerstchenPriorPipeline.from_pretrained( - "warp-ai/wuerstchen-prior", torch_dtype=dtype -).to(device) -decoder_pipeline = WuerstchenDecoderPipeline.from_pretrained( - "warp-ai/wuerstchen", torch_dtype=dtype -).to(device) - -caption = "Anthropomorphic cat dressed as a fire fighter" -negative_prompt = "" - -prior_output = prior_pipeline( - prompt=caption, - height=1024, - width=1536, - timesteps=DEFAULT_STAGE_C_TIMESTEPS, - negative_prompt=negative_prompt, - guidance_scale=4.0, - num_images_per_prompt=num_images_per_prompt, -) -decoder_output = decoder_pipeline( - image_embeddings=prior_output.image_embeddings, - prompt=caption, - negative_prompt=negative_prompt, - guidance_scale=0.0, - output_type="pil", -).images[0] -decoder_output -``` - -## Speed-Up Inference -You can make use of `torch.compile` function and gain a speed-up of about 2-3x: - -```python -prior_pipeline.prior = torch.compile(prior_pipeline.prior, mode="reduce-overhead", fullgraph=True) -decoder_pipeline.decoder = torch.compile(decoder_pipeline.decoder, mode="reduce-overhead", fullgraph=True) -``` - -## Limitations - -- Due to the high compression employed by Würstchen, generations can lack a good amount -of detail. To our human eye, this is especially noticeable in faces, hands etc. -- **Images can only be generated in 128-pixel steps**, e.g. the next higher resolution -after 1024x1024 is 1152x1152 -- The model lacks the ability to render correct text in images -- The model often does not achieve photorealism -- Difficult compositional prompts are hard for the model - -The original codebase, as well as experimental ideas, can be found at [dome272/Wuerstchen](https://github.com/dome272/Wuerstchen). - - -## WuerstchenCombinedPipeline - -[[autodoc]] WuerstchenCombinedPipeline - - all - - __call__ - -## WuerstchenPriorPipeline - -[[autodoc]] WuerstchenPriorPipeline - - all - - __call__ - -## WuerstchenPriorPipelineOutput - -[[autodoc]] pipelines.wuerstchen.pipeline_wuerstchen_prior.WuerstchenPriorPipelineOutput - -## WuerstchenDecoderPipeline - -[[autodoc]] WuerstchenDecoderPipeline - - all - - __call__ - -## Citation - -```bibtex - @misc{pernias2023wuerstchen, - title={Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models}, - author={Pablo Pernias and Dominic Rampas and Mats L. Richter and Christopher J. 
Pal and Marc Aubreville}, - year={2023}, - eprint={2306.00637}, - archivePrefix={arXiv}, - primaryClass={cs.CV} - } -``` diff --git a/diffusers/docs/source/en/api/quantization.md b/diffusers/docs/source/en/api/quantization.md deleted file mode 100644 index 2fbde9e707ea8bb763eb41a36d9f0fa2622f444b..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/quantization.md +++ /dev/null @@ -1,33 +0,0 @@ - - -# Quantization - -Quantization techniques reduce memory and computational costs by representing weights and activations with lower-precision data types like 8-bit integers (int8). This enables loading larger models you normally wouldn't be able to fit into memory, and speeding up inference. Diffusers supports 8-bit and 4-bit quantization with [bitsandbytes](https://huggingface.co/docs/bitsandbytes/en/index). - -Quantization techniques that aren't supported in Transformers can be added with the [`DiffusersQuantizer`] class. - - - -Learn how to quantize models in the [Quantization](../quantization/overview) guide. - - - - -## BitsAndBytesConfig - -[[autodoc]] BitsAndBytesConfig - -## DiffusersQuantizer - -[[autodoc]] quantizers.base.DiffusersQuantizer diff --git a/diffusers/docs/source/en/api/schedulers/cm_stochastic_iterative.md b/diffusers/docs/source/en/api/schedulers/cm_stochastic_iterative.md deleted file mode 100644 index 89e50b5d6b614f8fb8bd4408f78c2566caeaad6d..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/schedulers/cm_stochastic_iterative.md +++ /dev/null @@ -1,27 +0,0 @@ - - -# CMStochasticIterativeScheduler - -[Consistency Models](https://huggingface.co/papers/2303.01469) by Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever introduced a multistep and onestep scheduler (Algorithm 1) that is capable of generating good samples in one or a small number of steps. - -The abstract from the paper is: - -*Diffusion models have significantly advanced the fields of image, audio, and video generation, but they depend on an iterative sampling process that causes slow generation. To overcome this limitation, we propose consistency models, a new family of models that generate high quality samples by directly mapping noise to data. They support fast one-step generation by design, while still allowing multistep sampling to trade compute for sample quality. They also support zero-shot data editing, such as image inpainting, colorization, and super-resolution, without requiring explicit training on these tasks. Consistency models can be trained either by distilling pre-trained diffusion models, or as standalone generative models altogether. Through extensive experiments, we demonstrate that they outperform existing distillation techniques for diffusion models in one- and few-step sampling, achieving the new state-of-the-art FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 for one-step generation. When trained in isolation, consistency models become a new family of generative models that can outperform existing one-step, non-adversarial generative models on standard benchmarks such as CIFAR-10, ImageNet 64x64 and LSUN 256x256.* - -The original codebase can be found at [openai/consistency_models](https://github.com/openai/consistency_models). 
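As a hedged usage sketch, the scheduler is typically driven through a [`ConsistencyModelPipeline`]; the checkpoint id below is an assumption and may need to be swapped for the consistency model you actually want to sample from:

```python
import torch

from diffusers import ConsistencyModelPipeline

# Assumed checkpoint id; any distilled consistency model checkpoint should work similarly.
pipe = ConsistencyModelPipeline.from_pretrained(
    "openai/diffusers-cd_imagenet64_l2", torch_dtype=torch.float16
).to("cuda")

# One-step generation with the CMStochasticIterativeScheduler used by the pipeline.
image = pipe(num_inference_steps=1).images[0]

# Multistep sampling trades extra steps for higher sample quality.
image = pipe(num_inference_steps=2).images[0]
```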
- -## CMStochasticIterativeScheduler -[[autodoc]] CMStochasticIterativeScheduler - -## CMStochasticIterativeSchedulerOutput -[[autodoc]] schedulers.scheduling_consistency_models.CMStochasticIterativeSchedulerOutput diff --git a/diffusers/docs/source/en/api/schedulers/consistency_decoder.md b/diffusers/docs/source/en/api/schedulers/consistency_decoder.md deleted file mode 100644 index a9eaa5336dcda592d1d947e838027d040b6f39f7..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/schedulers/consistency_decoder.md +++ /dev/null @@ -1,21 +0,0 @@ - - -# ConsistencyDecoderScheduler - -This scheduler is a part of the [`ConsistencyDecoderPipeline`] and was introduced in [DALL-E 3](https://openai.com/dall-e-3). - -The original codebase can be found at [openai/consistency_models](https://github.com/openai/consistency_models). - - -## ConsistencyDecoderScheduler -[[autodoc]] schedulers.scheduling_consistency_decoder.ConsistencyDecoderScheduler diff --git a/diffusers/docs/source/en/api/schedulers/cosine_dpm.md b/diffusers/docs/source/en/api/schedulers/cosine_dpm.md deleted file mode 100644 index 7685269c21452f4efc6f9765269754e696bce131..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/schedulers/cosine_dpm.md +++ /dev/null @@ -1,24 +0,0 @@ - - -# CosineDPMSolverMultistepScheduler - -The [`CosineDPMSolverMultistepScheduler`] is a variant of [`DPMSolverMultistepScheduler`] with cosine schedule, proposed by Nichol and Dhariwal (2021). -It is being used in the [Stable Audio Open](https://arxiv.org/abs/2407.14358) paper and the [Stability-AI/stable-audio-tool](https://github.com/Stability-AI/stable-audio-tool) codebase. - -This scheduler was contributed by [Yoach Lacombe](https://huggingface.co/ylacombe). - -## CosineDPMSolverMultistepScheduler -[[autodoc]] CosineDPMSolverMultistepScheduler - -## SchedulerOutput -[[autodoc]] schedulers.scheduling_utils.SchedulerOutput diff --git a/diffusers/docs/source/en/api/schedulers/ddim.md b/diffusers/docs/source/en/api/schedulers/ddim.md deleted file mode 100644 index 952855dbd2ac50e5b5e66350c993f0ed620323c4..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/schedulers/ddim.md +++ /dev/null @@ -1,82 +0,0 @@ - - -# DDIMScheduler - -[Denoising Diffusion Implicit Models](https://huggingface.co/papers/2010.02502) (DDIM) by Jiaming Song, Chenlin Meng and Stefano Ermon. - -The abstract from the paper is: - -*Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, yet they require simulating a Markov chain for many steps to produce a sample. -To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models -with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a Markovian diffusion process. -We construct a class of non-Markovian diffusion processes that lead to the same training objective, but whose reverse process can be much faster to sample from. 
-We empirically demonstrate that DDIMs can produce high quality samples 10× to 50× faster in terms of wall-clock time compared to DDPMs, allow us to trade off computation for sample quality, and can perform semantically meaningful image interpolation directly in the latent space.* - -The original codebase of this paper can be found at [ermongroup/ddim](https://github.com/ermongroup/ddim), and you can contact the author on [tsong.me](https://tsong.me/). - -## Tips - -The paper [Common Diffusion Noise Schedules and Sample Steps are Flawed](https://huggingface.co/papers/2305.08891) claims that a mismatch between the training and inference settings leads to suboptimal inference generation results for Stable Diffusion. To fix this, the authors propose: - - - -🧪 This is an experimental feature! - - - -1. rescale the noise schedule to enforce zero terminal signal-to-noise ratio (SNR) - -```py -pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config, rescale_betas_zero_snr=True) -``` - -2. train a model with `v_prediction` (add the following argument to the [train_text_to_image.py](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py) or [train_text_to_image_lora.py](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_lora.py) scripts) - -```bash ---prediction_type="v_prediction" -``` - -3. change the sampler to always start from the last timestep - -```py -pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing") -``` - -4. rescale classifier-free guidance to prevent over-exposure - -```py -image = pipe(prompt, guidance_rescale=0.7).images[0] -``` - -For example: - -```py -from diffusers import DiffusionPipeline, DDIMScheduler -import torch - -pipe = DiffusionPipeline.from_pretrained("ptx0/pseudo-journey-v2", torch_dtype=torch.float16) -pipe.scheduler = DDIMScheduler.from_config( - pipe.scheduler.config, rescale_betas_zero_snr=True, timestep_spacing="trailing" -) -pipe.to("cuda") - -prompt = "A lion in galaxies, spirals, nebulae, stars, smoke, iridescent, intricate detail, octane render, 8k" -image = pipe(prompt, guidance_rescale=0.7).images[0] -image -``` - -## DDIMScheduler -[[autodoc]] DDIMScheduler - -## DDIMSchedulerOutput -[[autodoc]] schedulers.scheduling_ddim.DDIMSchedulerOutput diff --git a/diffusers/docs/source/en/api/schedulers/ddim_inverse.md b/diffusers/docs/source/en/api/schedulers/ddim_inverse.md deleted file mode 100644 index 82069cce4c538b5eac76547a1cc34b2af661978f..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/schedulers/ddim_inverse.md +++ /dev/null @@ -1,19 +0,0 @@ - - -# DDIMInverseScheduler - -`DDIMInverseScheduler` is the inverted scheduler from [Denoising Diffusion Implicit Models](https://huggingface.co/papers/2010.02502) (DDIM) by Jiaming Song, Chenlin Meng and Stefano Ermon. -The implementation is mostly based on the DDIM inversion definition from [Null-text Inversion for Editing Real Images using Guided Diffusion Models](https://huggingface.co/papers/2211.09794). 
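As a hedged sketch, the inverse scheduler is usually attached next to a regular [`DDIMScheduler`] on a pipeline that exposes an `inverse_scheduler` component (the checkpoint id is only an example):

```python
import torch

from diffusers import DDIMInverseScheduler, DDIMScheduler, StableDiffusionDiffEditPipeline

# Example checkpoint; other Stable Diffusion checkpoints should work similarly.
pipe = StableDiffusionDiffEditPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
pipe.inverse_scheduler = DDIMInverseScheduler.from_config(pipe.scheduler.config)

# `invert` runs the deterministic DDIM process backwards to estimate the latents
# that would regenerate `raw_image` (a PIL image you load separately):
# inverted_latents = pipe.invert(prompt="a bowl of fruits", image=raw_image).latents
```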
- -## DDIMInverseScheduler -[[autodoc]] DDIMInverseScheduler diff --git a/diffusers/docs/source/en/api/schedulers/ddpm.md b/diffusers/docs/source/en/api/schedulers/ddpm.md deleted file mode 100644 index cfe3815b67323546c770b97df5e5296fab91db9a..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/schedulers/ddpm.md +++ /dev/null @@ -1,25 +0,0 @@ - - -# DDPMScheduler - -[Denoising Diffusion Probabilistic Models](https://huggingface.co/papers/2006.11239) (DDPM) by Jonathan Ho, Ajay Jain and Pieter Abbeel proposes a diffusion based model of the same name. In the context of the 🤗 Diffusers library, DDPM refers to the discrete denoising scheduler from the paper as well as the pipeline. - -The abstract from the paper is: - -*We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and our models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding. On the unconditional CIFAR10 dataset, we obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. On 256x256 LSUN, we obtain sample quality similar to ProgressiveGAN. Our implementation is available at [this https URL](https://github.com/hojonathanho/diffusion).* - -## DDPMScheduler -[[autodoc]] DDPMScheduler - -## DDPMSchedulerOutput -[[autodoc]] schedulers.scheduling_ddpm.DDPMSchedulerOutput diff --git a/diffusers/docs/source/en/api/schedulers/deis.md b/diffusers/docs/source/en/api/schedulers/deis.md deleted file mode 100644 index 4a449b32bf0d1fe9b42e6a4dd7270924a04af47a..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/schedulers/deis.md +++ /dev/null @@ -1,34 +0,0 @@ - - -# DEISMultistepScheduler - -Diffusion Exponential Integrator Sampler (DEIS) is proposed in [Fast Sampling of Diffusion Models with Exponential Integrator](https://huggingface.co/papers/2204.13902) by Qinsheng Zhang and Yongxin Chen. `DEISMultistepScheduler` is a fast high order solver for diffusion ordinary differential equations (ODEs). - -This implementation modifies the polynomial fitting formula in log-rho space instead of the original linear `t` space in the DEIS paper. The modification enjoys closed-form coefficients for exponential multistep update instead of relying on the numerical solver. - -The abstract from the paper is: - -*The past few years have witnessed the great success of Diffusion models~(DMs) in generating high-fidelity samples in generative modeling tasks. A major limitation of the DM is its notoriously slow sampling procedure which normally requires hundreds to thousands of time discretization steps of the learned diffusion process to reach the desired accuracy. Our goal is to develop a fast sampling method for DMs with a much less number of steps while retaining high sample quality. To this end, we systematically analyze the sampling procedure in DMs and identify key factors that affect the sample quality, among which the method of discretization is most crucial. By carefully examining the learned diffusion process, we propose Diffusion Exponential Integrator Sampler~(DEIS).
It is based on the Exponential Integrator designed for discretizing ordinary differential equations (ODEs) and leverages a semilinear structure of the learned diffusion process to reduce the discretization error. The proposed method can be applied to any DMs and can generate high-fidelity samples in as few as 10 steps. In our experiments, it takes about 3 minutes on one A6000 GPU to generate 50k images from CIFAR10. Moreover, by directly using pre-trained DMs, we achieve the state-of-art sampling performance when the number of score function evaluation~(NFE) is limited, e.g., 4.17 FID with 10 NFEs, 3.37 FID, and 9.74 IS with only 15 NFEs on CIFAR10. Code is available at [this https URL](https://github.com/qsh-zh/deis).* - -## Tips - -It is recommended to set `solver_order` to 2 or 3, while `solver_order=1` is equivalent to [`DDIMScheduler`]. - -Dynamic thresholding from [Imagen](https://huggingface.co/papers/2205.11487) is supported, and for pixel-space -diffusion models, you can set `thresholding=True` to use the dynamic thresholding. - -## DEISMultistepScheduler -[[autodoc]] DEISMultistepScheduler - -## SchedulerOutput -[[autodoc]] schedulers.scheduling_utils.SchedulerOutput diff --git a/diffusers/docs/source/en/api/schedulers/dpm_discrete.md b/diffusers/docs/source/en/api/schedulers/dpm_discrete.md deleted file mode 100644 index cb95f3781ecf2f9cd2fe410d7ccc58bf895cd138..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/schedulers/dpm_discrete.md +++ /dev/null @@ -1,23 +0,0 @@ - - -# KDPM2DiscreteScheduler - -The `KDPM2DiscreteScheduler` is inspired by the [Elucidating the Design Space of Diffusion-Based Generative Models](https://huggingface.co/papers/2206.00364) paper, and the scheduler is ported from and created by [Katherine Crowson](https://github.com/crowsonkb/). - -The original codebase can be found at [crowsonkb/k-diffusion](https://github.com/crowsonkb/k-diffusion). - -## KDPM2DiscreteScheduler -[[autodoc]] KDPM2DiscreteScheduler - -## SchedulerOutput -[[autodoc]] schedulers.scheduling_utils.SchedulerOutput diff --git a/diffusers/docs/source/en/api/schedulers/dpm_discrete_ancestral.md b/diffusers/docs/source/en/api/schedulers/dpm_discrete_ancestral.md deleted file mode 100644 index 97d205b3cc4ce1fdcc7acf310729592a3b7a5d0f..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/schedulers/dpm_discrete_ancestral.md +++ /dev/null @@ -1,23 +0,0 @@ - - -# KDPM2AncestralDiscreteScheduler - -The `KDPM2DiscreteScheduler` with ancestral sampling is inspired by the [Elucidating the Design Space of Diffusion-Based Generative Models](https://huggingface.co/papers/2206.00364) paper, and the scheduler is ported from and created by [Katherine Crowson](https://github.com/crowsonkb/). - -The original codebase can be found at [crowsonkb/k-diffusion](https://github.com/crowsonkb/k-diffusion). 
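A hedged sketch of dropping the scheduler into an existing pipeline with `from_config` (the model id is a placeholder):

```python
import torch

from diffusers import DiffusionPipeline, KDPM2AncestralDiscreteScheduler

# Placeholder checkpoint; any Stable Diffusion-style pipeline can be used.
pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Reuse the existing scheduler config so the beta schedule and timestep settings carry over.
pipe.scheduler = KDPM2AncestralDiscreteScheduler.from_config(pipe.scheduler.config)

image = pipe("a photo of an astronaut riding a horse on mars").images[0]
```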
- -## KDPM2AncestralDiscreteScheduler -[[autodoc]] KDPM2AncestralDiscreteScheduler - -## SchedulerOutput -[[autodoc]] schedulers.scheduling_utils.SchedulerOutput diff --git a/diffusers/docs/source/en/api/schedulers/dpm_sde.md b/diffusers/docs/source/en/api/schedulers/dpm_sde.md deleted file mode 100644 index fe87bb96ee17a27b6ca3d571f11e15d8af373359..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/schedulers/dpm_sde.md +++ /dev/null @@ -1,21 +0,0 @@ - - -# DPMSolverSDEScheduler - -The `DPMSolverSDEScheduler` is inspired by the stochastic sampler from the [Elucidating the Design Space of Diffusion-Based Generative Models](https://huggingface.co/papers/2206.00364) paper, and the scheduler is ported from and created by [Katherine Crowson](https://github.com/crowsonkb/). - -## DPMSolverSDEScheduler -[[autodoc]] DPMSolverSDEScheduler - -## SchedulerOutput -[[autodoc]] schedulers.scheduling_utils.SchedulerOutput diff --git a/diffusers/docs/source/en/api/schedulers/edm_euler.md b/diffusers/docs/source/en/api/schedulers/edm_euler.md deleted file mode 100644 index 228f0505e3bc8e9763f735b0763127541d5faa61..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/schedulers/edm_euler.md +++ /dev/null @@ -1,22 +0,0 @@ - - -# EDMEulerScheduler - -The Karras formulation of the Euler scheduler (Algorithm 2) from the [Elucidating the Design Space of Diffusion-Based Generative Models](https://huggingface.co/papers/2206.00364) paper by Karras et al. This is a fast scheduler which can often generate good outputs in 20-30 steps. The scheduler is based on the original [k-diffusion](https://github.com/crowsonkb/k-diffusion/blob/481677d114f6ea445aa009cf5bd7a9cdee909e47/k_diffusion/sampling.py#L51) implementation by [Katherine Crowson](https://github.com/crowsonkb/). - - -## EDMEulerScheduler -[[autodoc]] EDMEulerScheduler - -## EDMEulerSchedulerOutput -[[autodoc]] schedulers.scheduling_edm_euler.EDMEulerSchedulerOutput diff --git a/diffusers/docs/source/en/api/schedulers/edm_multistep_dpm_solver.md b/diffusers/docs/source/en/api/schedulers/edm_multistep_dpm_solver.md deleted file mode 100644 index 88ca639a924c7aeaddf621cef5ffcf37d8c8a6cc..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/schedulers/edm_multistep_dpm_solver.md +++ /dev/null @@ -1,24 +0,0 @@ - - -# EDMDPMSolverMultistepScheduler - -`EDMDPMSolverMultistepScheduler` is a [Karras formulation](https://huggingface.co/papers/2206.00364) of `DPMSolverMultistepScheduler`, a multistep scheduler from [DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps](https://huggingface.co/papers/2206.00927) and [DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models](https://huggingface.co/papers/2211.01095) by Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. - -DPMSolver (and the improved version DPMSolver++) is a fast dedicated high-order solver for diffusion ODEs with convergence order guarantee. Empirically, DPMSolver sampling with only 20 steps can generate high-quality -samples, and it can generate quite good samples even in 10 steps. 
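Because this variant assumes the EDM (Karras) parameterization, it is typically paired with an EDM-trained checkpoint; a hedged sketch (the model id is an assumption):

```python
import torch

from diffusers import DiffusionPipeline, EDMDPMSolverMultistepScheduler

# Assumed EDM-style checkpoint; substitute a model trained with the EDM parameterization.
pipe = DiffusionPipeline.from_pretrained(
    "playgroundai/playground-v2.5-1024px-aesthetic", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = EDMDPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

image = pipe("a cozy cabin in a snowy forest", num_inference_steps=20).images[0]
```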
- -## EDMDPMSolverMultistepScheduler -[[autodoc]] EDMDPMSolverMultistepScheduler - -## SchedulerOutput -[[autodoc]] schedulers.scheduling_utils.SchedulerOutput diff --git a/diffusers/docs/source/en/api/schedulers/euler.md b/diffusers/docs/source/en/api/schedulers/euler.md deleted file mode 100644 index 9c98118bd795d60c2c85faf6e7ba8cdfdae92575..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/schedulers/euler.md +++ /dev/null @@ -1,22 +0,0 @@ - - -# EulerDiscreteScheduler - -The Euler scheduler (Algorithm 2) is from the [Elucidating the Design Space of Diffusion-Based Generative Models](https://huggingface.co/papers/2206.00364) paper by Karras et al. This is a fast scheduler which can often generate good outputs in 20-30 steps. The scheduler is based on the original [k-diffusion](https://github.com/crowsonkb/k-diffusion/blob/481677d114f6ea445aa009cf5bd7a9cdee909e47/k_diffusion/sampling.py#L51) implementation by [Katherine Crowson](https://github.com/crowsonkb/). - - -## EulerDiscreteScheduler -[[autodoc]] EulerDiscreteScheduler - -## EulerDiscreteSchedulerOutput -[[autodoc]] schedulers.scheduling_euler_discrete.EulerDiscreteSchedulerOutput diff --git a/diffusers/docs/source/en/api/schedulers/euler_ancestral.md b/diffusers/docs/source/en/api/schedulers/euler_ancestral.md deleted file mode 100644 index eba9b063005affc741c37f7c7e42d741720795ca..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/schedulers/euler_ancestral.md +++ /dev/null @@ -1,21 +0,0 @@ - - -# EulerAncestralDiscreteScheduler - -A scheduler that uses ancestral sampling with Euler method steps. This is a fast scheduler which can often generate good outputs in 20-30 steps. The scheduler is based on the original [k-diffusion](https://github.com/crowsonkb/k-diffusion/blob/481677d114f6ea445aa009cf5bd7a9cdee909e47/k_diffusion/sampling.py#L72) implementation by [Katherine Crowson](https://github.com/crowsonkb/). - -## EulerAncestralDiscreteScheduler -[[autodoc]] EulerAncestralDiscreteScheduler - -## EulerAncestralDiscreteSchedulerOutput -[[autodoc]] schedulers.scheduling_euler_ancestral_discrete.EulerAncestralDiscreteSchedulerOutput diff --git a/diffusers/docs/source/en/api/schedulers/flow_match_euler_discrete.md b/diffusers/docs/source/en/api/schedulers/flow_match_euler_discrete.md deleted file mode 100644 index a8907f96f7549022f04540a30e4269fdfbfabac5..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/schedulers/flow_match_euler_discrete.md +++ /dev/null @@ -1,18 +0,0 @@ - - -# FlowMatchEulerDiscreteScheduler - -`FlowMatchEulerDiscreteScheduler` is based on the flow-matching sampling introduced in [Stable Diffusion 3](https://arxiv.org/abs/2403.03206). - -## FlowMatchEulerDiscreteScheduler -[[autodoc]] FlowMatchEulerDiscreteScheduler diff --git a/diffusers/docs/source/en/api/schedulers/flow_match_heun_discrete.md b/diffusers/docs/source/en/api/schedulers/flow_match_heun_discrete.md deleted file mode 100644 index 642f8ffc7dcca1dd7945f96877457e33a6c7d29a..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/schedulers/flow_match_heun_discrete.md +++ /dev/null @@ -1,18 +0,0 @@ - - -# FlowMatchHeunDiscreteScheduler - -`FlowMatchHeunDiscreteScheduler` is based on the flow-matching sampling introduced in [EDM](https://arxiv.org/abs/2403.03206). 
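A hedged sketch of swapping the Heun flow-matching sampler into a flow-matching pipeline such as Stable Diffusion 3 (the model id is an assumption, and access to the checkpoint may be gated):

```python
import torch

from diffusers import FlowMatchHeunDiscreteScheduler, StableDiffusion3Pipeline

# Assumed flow-matching checkpoint; this scheduler only makes sense for flow-matching models.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = FlowMatchHeunDiscreteScheduler.from_config(pipe.scheduler.config)

image = pipe("a photo of a corgi wearing sunglasses", num_inference_steps=28).images[0]
```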
- -## FlowMatchHeunDiscreteScheduler -[[autodoc]] FlowMatchHeunDiscreteScheduler diff --git a/diffusers/docs/source/en/api/schedulers/heun.md b/diffusers/docs/source/en/api/schedulers/heun.md deleted file mode 100644 index bca5cf743d05ab8a96fee62473627bf7af4cf7fc..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/schedulers/heun.md +++ /dev/null @@ -1,21 +0,0 @@ - - -# HeunDiscreteScheduler - -The Heun scheduler (Algorithm 1) is from the [Elucidating the Design Space of Diffusion-Based Generative Models](https://huggingface.co/papers/2206.00364) paper by Karras et al. The scheduler is ported from the [k-diffusion](https://github.com/crowsonkb/k-diffusion) library and created by [Katherine Crowson](https://github.com/crowsonkb/). - -## HeunDiscreteScheduler -[[autodoc]] HeunDiscreteScheduler - -## SchedulerOutput -[[autodoc]] schedulers.scheduling_utils.SchedulerOutput diff --git a/diffusers/docs/source/en/api/schedulers/ipndm.md b/diffusers/docs/source/en/api/schedulers/ipndm.md deleted file mode 100644 index eeeee8aea32eb56767ad3f00d1cea661c81644c4..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/schedulers/ipndm.md +++ /dev/null @@ -1,21 +0,0 @@ - - -# IPNDMScheduler - -`IPNDMScheduler` is a fourth-order Improved Pseudo Linear Multistep scheduler. The original implementation can be found at [crowsonkb/v-diffusion-pytorch](https://github.com/crowsonkb/v-diffusion-pytorch/blob/987f8985e38208345c1959b0ea767a625831cc9b/diffusion/sampling.py#L296). - -## IPNDMScheduler -[[autodoc]] IPNDMScheduler - -## SchedulerOutput -[[autodoc]] schedulers.scheduling_utils.SchedulerOutput diff --git a/diffusers/docs/source/en/api/schedulers/lcm.md b/diffusers/docs/source/en/api/schedulers/lcm.md deleted file mode 100644 index 93e80ea16933be6f200b6ab87e10aa276b152501..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/schedulers/lcm.md +++ /dev/null @@ -1,21 +0,0 @@ - - -# Latent Consistency Model Multistep Scheduler - -## Overview - -Multistep and onestep scheduler (Algorithm 3) introduced alongside latent consistency models in the paper [Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference](https://arxiv.org/abs/2310.04378) by Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. -This scheduler should be able to generate good samples from [`LatentConsistencyModelPipeline`] in 1-8 steps. - -## LCMScheduler -[[autodoc]] LCMScheduler diff --git a/diffusers/docs/source/en/api/schedulers/lms_discrete.md b/diffusers/docs/source/en/api/schedulers/lms_discrete.md deleted file mode 100644 index a0f4aea8a79077f8f3cdd6833c198c63925ee138..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/schedulers/lms_discrete.md +++ /dev/null @@ -1,21 +0,0 @@ - - -# LMSDiscreteScheduler - -`LMSDiscreteScheduler` is a linear multistep scheduler for discrete beta schedules. The scheduler is ported from and created by [Katherine Crowson](https://github.com/crowsonkb/), and the original implementation can be found at [crowsonkb/k-diffusion](https://github.com/crowsonkb/k-diffusion/blob/481677d114f6ea445aa009cf5bd7a9cdee909e47/k_diffusion/sampling.py#L181). 
- -## LMSDiscreteScheduler -[[autodoc]] LMSDiscreteScheduler - -## LMSDiscreteSchedulerOutput -[[autodoc]] schedulers.scheduling_lms_discrete.LMSDiscreteSchedulerOutput diff --git a/diffusers/docs/source/en/api/schedulers/multistep_dpm_solver.md b/diffusers/docs/source/en/api/schedulers/multistep_dpm_solver.md deleted file mode 100644 index ff6e5688e24ff7f1a7e46357431572808b5eee16..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/schedulers/multistep_dpm_solver.md +++ /dev/null @@ -1,35 +0,0 @@ - - -# DPMSolverMultistepScheduler - -`DPMSolverMultistepScheduler` is a multistep scheduler from [DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps](https://huggingface.co/papers/2206.00927) and [DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models](https://huggingface.co/papers/2211.01095) by Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. - -DPMSolver (and the improved version DPMSolver++) is a fast dedicated high-order solver for diffusion ODEs with convergence order guarantee. Empirically, DPMSolver sampling with only 20 steps can generate high-quality -samples, and it can generate quite good samples even in 10 steps. - -## Tips - -It is recommended to set `solver_order` to 2 for guide sampling, and `solver_order=3` for unconditional sampling. - -Dynamic thresholding from [Imagen](https://huggingface.co/papers/2205.11487) is supported, and for pixel-space -diffusion models, you can set both `algorithm_type="dpmsolver++"` and `thresholding=True` to use the dynamic -thresholding. This thresholding method is unsuitable for latent-space diffusion models such as -Stable Diffusion. - -The SDE variant of DPMSolver and DPM-Solver++ is also supported, but only for the first and second-order solvers. This is a fast SDE solver for the reverse diffusion SDE. It is recommended to use the second-order `sde-dpmsolver++`. - -## DPMSolverMultistepScheduler -[[autodoc]] DPMSolverMultistepScheduler - -## SchedulerOutput -[[autodoc]] schedulers.scheduling_utils.SchedulerOutput diff --git a/diffusers/docs/source/en/api/schedulers/multistep_dpm_solver_inverse.md b/diffusers/docs/source/en/api/schedulers/multistep_dpm_solver_inverse.md deleted file mode 100644 index b77a5cf1407963e0ea7ead0f176b97141c0df1f8..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/schedulers/multistep_dpm_solver_inverse.md +++ /dev/null @@ -1,30 +0,0 @@ - - -# DPMSolverMultistepInverse - -`DPMSolverMultistepInverse` is the inverted scheduler from [DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps](https://huggingface.co/papers/2206.00927) and [DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models](https://huggingface.co/papers/2211.01095) by Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. - -The implementation is mostly based on the DDIM inversion definition of [Null-text Inversion for Editing Real Images using Guided Diffusion Models](https://huggingface.co/papers/2211.09794) and notebook implementation of the [`DiffEdit`] latent inversion from [Xiang-cd/DiffEdit-stable-diffusion](https://github.com/Xiang-cd/DiffEdit-stable-diffusion/blob/main/diffedit.ipynb). 
- -## Tips - -Dynamic thresholding from [Imagen](https://huggingface.co/papers/2205.11487) is supported, and for pixel-space -diffusion models, you can set both `algorithm_type="dpmsolver++"` and `thresholding=True` to use the dynamic -thresholding. This thresholding method is unsuitable for latent-space diffusion models such as -Stable Diffusion. - -## DPMSolverMultistepInverseScheduler -[[autodoc]] DPMSolverMultistepInverseScheduler - -## SchedulerOutput -[[autodoc]] schedulers.scheduling_utils.SchedulerOutput diff --git a/diffusers/docs/source/en/api/schedulers/overview.md b/diffusers/docs/source/en/api/schedulers/overview.md deleted file mode 100644 index af287454e15d4f205ccf8c2f5d5934b6710f7f08..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/schedulers/overview.md +++ /dev/null @@ -1,73 +0,0 @@ - - -# Schedulers - -🤗 Diffusers provides many scheduler functions for the diffusion process. A scheduler takes a model's output (the sample which the diffusion process is iterating on) and a timestep to return a denoised sample. The timestep is important because it dictates where in the diffusion process the step is; data is generated by iterating forward *n* timesteps and inference occurs by propagating backward through the timesteps. Based on the timestep, a scheduler may be *discrete* in which case the timestep is an `int` or *continuous* in which case the timestep is a `float`. - -Depending on the context, a scheduler defines how to iteratively add noise to an image or how to update a sample based on a model's output: - -- during *training*, a scheduler adds noise (there are different algorithms for how to add noise) to a sample to train a diffusion model -- during *inference*, a scheduler defines how to update a sample based on a pretrained model's output - -Many schedulers are implemented from the [k-diffusion](https://github.com/crowsonkb/k-diffusion) library by [Katherine Crowson](https://github.com/crowsonkb/), and they're also widely used in A1111. 
To help you map the schedulers from k-diffusion and A1111 to the schedulers in 🤗 Diffusers, take a look at the table below: - -| A1111/k-diffusion | 🤗 Diffusers | Usage | -|---------------------|-------------------------------------|---------------------------------------------------------------------------------------------------------------| -| DPM++ 2M | [`DPMSolverMultistepScheduler`] | | -| DPM++ 2M Karras | [`DPMSolverMultistepScheduler`] | init with `use_karras_sigmas=True` | -| DPM++ 2M SDE | [`DPMSolverMultistepScheduler`] | init with `algorithm_type="sde-dpmsolver++"` | -| DPM++ 2M SDE Karras | [`DPMSolverMultistepScheduler`] | init with `use_karras_sigmas=True` and `algorithm_type="sde-dpmsolver++"` | -| DPM++ 2S a | N/A | very similar to `DPMSolverSinglestepScheduler` | -| DPM++ 2S a Karras | N/A | very similar to `DPMSolverSinglestepScheduler(use_karras_sigmas=True, ...)` | -| DPM++ SDE | [`DPMSolverSinglestepScheduler`] | | -| DPM++ SDE Karras | [`DPMSolverSinglestepScheduler`] | init with `use_karras_sigmas=True` | -| DPM2 | [`KDPM2DiscreteScheduler`] | | -| DPM2 Karras | [`KDPM2DiscreteScheduler`] | init with `use_karras_sigmas=True` | -| DPM2 a | [`KDPM2AncestralDiscreteScheduler`] | | -| DPM2 a Karras | [`KDPM2AncestralDiscreteScheduler`] | init with `use_karras_sigmas=True` | -| DPM adaptive | N/A | | -| DPM fast | N/A | | -| Euler | [`EulerDiscreteScheduler`] | | -| Euler a | [`EulerAncestralDiscreteScheduler`] | | -| Heun | [`HeunDiscreteScheduler`] | | -| LMS | [`LMSDiscreteScheduler`] | | -| LMS Karras | [`LMSDiscreteScheduler`] | init with `use_karras_sigmas=True` | -| N/A | [`DEISMultistepScheduler`] | | -| N/A | [`UniPCMultistepScheduler`] | | - -## Noise schedules and schedule types -| A1111/k-diffusion | 🤗 Diffusers | -|--------------------------|----------------------------------------------------------------------------| -| Karras | init with `use_karras_sigmas=True` | -| sgm_uniform | init with `timestep_spacing="trailing"` | -| simple | init with `timestep_spacing="trailing"` | -| exponential | init with `timestep_spacing="linspace"`, `use_exponential_sigmas=True` | -| beta | init with `timestep_spacing="linspace"`, `use_beta_sigmas=True` | - -All schedulers are built from the base [`SchedulerMixin`] class which implements low level utilities shared by all schedulers. - -## SchedulerMixin -[[autodoc]] SchedulerMixin - -## SchedulerOutput -[[autodoc]] schedulers.scheduling_utils.SchedulerOutput - -## KarrasDiffusionSchedulers - -[`KarrasDiffusionSchedulers`] are a broad generalization of schedulers in 🤗 Diffusers. The schedulers in this class are distinguished at a high level by their noise sampling strategy, the type of network and scaling, the training strategy, and how the loss is weighed. - -The different schedulers in this class, depending on the ordinary differential equations (ODE) solver type, fall into the above taxonomy and provide a good abstraction for the design of the main schedulers implemented in 🤗 Diffusers. The schedulers in this class are given [here](https://github.com/huggingface/diffusers/blob/a69754bb879ed55b9b6dc9dd0b3cf4fa4124c765/src/diffusers/schedulers/scheduling_utils.py#L32). 
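As a concrete example of the mapping and noise-schedule tables above, a hedged sketch that recreates A1111's "DPM++ 2M Karras" on a Stable Diffusion checkpoint (the model id is a placeholder):

```python
import torch

from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler

pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# "DPM++ 2M Karras" maps to DPMSolverMultistepScheduler initialized with Karras sigmas.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config, use_karras_sigmas=True
)

image = pipe("a fantasy landscape with mountains and a river").images[0]
```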
- -## PushToHubMixin - -[[autodoc]] utils.PushToHubMixin diff --git a/diffusers/docs/source/en/api/schedulers/pndm.md b/diffusers/docs/source/en/api/schedulers/pndm.md deleted file mode 100644 index ed959d53e0262004ff8d7b8818fcf23981a932be..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/schedulers/pndm.md +++ /dev/null @@ -1,21 +0,0 @@ - - -# PNDMScheduler - -`PNDMScheduler`, or pseudo numerical methods for diffusion models, uses more advanced ODE integration techniques like the Runge-Kutta and linear multi-step method. The original implementation can be found at [crowsonkb/k-diffusion](https://github.com/crowsonkb/k-diffusion/blob/481677d114f6ea445aa009cf5bd7a9cdee909e47/k_diffusion/sampling.py#L181). - -## PNDMScheduler -[[autodoc]] PNDMScheduler - -## SchedulerOutput -[[autodoc]] schedulers.scheduling_utils.SchedulerOutput diff --git a/diffusers/docs/source/en/api/schedulers/repaint.md b/diffusers/docs/source/en/api/schedulers/repaint.md deleted file mode 100644 index 3b19e344a0bf3e252c7521e00737f5ada22fe15f..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/schedulers/repaint.md +++ /dev/null @@ -1,27 +0,0 @@ - - -# RePaintScheduler - -`RePaintScheduler` is a DDPM-based inpainting scheduler for unsupervised inpainting with extreme masks. It is designed to be used with the [`RePaintPipeline`], and it is based on the paper [RePaint: Inpainting using Denoising Diffusion Probabilistic Models](https://huggingface.co/papers/2201.09865) by Andreas Lugmayr et al. - -The abstract from the paper is: - -*Free-form inpainting is the task of adding new content to an image in the regions specified by an arbitrary binary mask. Most existing approaches train for a certain distribution of masks, which limits their generalization capabilities to unseen mask types. Furthermore, training with pixel-wise and perceptual losses often leads to simple textural extensions towards the missing areas instead of semantically meaningful generation. In this work, we propose RePaint: A Denoising Diffusion Probabilistic Model (DDPM) based inpainting approach that is applicable to even extreme masks. We employ a pretrained unconditional DDPM as the generative prior. To condition the generation process, we only alter the reverse diffusion iterations by sampling the unmasked regions using the given image information. Since this technique does not modify or condition the original DDPM network itself, the model produces high-quality and diverse output images for any inpainting form. We validate our method for both faces and general-purpose image inpainting using standard and extreme masks. RePaint outperforms state-of-the-art Autoregressive, and GAN approaches for at least five out of six mask distributions. GitHub Repository: [this http URL](http://git.io/RePaint).* - -The original implementation can be found at [andreas128/RePaint](https://github.com/andreas128/). - -## RePaintScheduler -[[autodoc]] RePaintScheduler - -## RePaintSchedulerOutput -[[autodoc]] schedulers.scheduling_repaint.RePaintSchedulerOutput diff --git a/diffusers/docs/source/en/api/schedulers/score_sde_ve.md b/diffusers/docs/source/en/api/schedulers/score_sde_ve.md deleted file mode 100644 index 43bce146be84a3091af6caf094d129a933a52b87..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/schedulers/score_sde_ve.md +++ /dev/null @@ -1,25 +0,0 @@ - - -# ScoreSdeVeScheduler - -`ScoreSdeVeScheduler` is a variance exploding stochastic differential equation (SDE) scheduler. 
It was introduced in the [Score-Based Generative Modeling through Stochastic Differential Equations](https://huggingface.co/papers/2011.13456) paper by Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, Ben Poole. - -The abstract from the paper is: - -*Creating noise from data is easy; creating data from noise is generative modeling. We present a stochastic differential equation (SDE) that smoothly transforms a complex data distribution to a known prior distribution by slowly injecting noise, and a corresponding reverse-time SDE that transforms the prior distribution back into the data distribution by slowly removing the noise. Crucially, the reverse-time SDE depends only on the time-dependent gradient field (\aka, score) of the perturbed data distribution. By leveraging advances in score-based generative modeling, we can accurately estimate these scores with neural networks, and use numerical SDE solvers to generate samples. We show that this framework encapsulates previous approaches in score-based generative modeling and diffusion probabilistic modeling, allowing for new sampling procedures and new modeling capabilities. In particular, we introduce a predictor-corrector framework to correct errors in the evolution of the discretized reverse-time SDE. We also derive an equivalent neural ODE that samples from the same distribution as the SDE, but additionally enables exact likelihood computation, and improved sampling efficiency. In addition, we provide a new way to solve inverse problems with score-based models, as demonstrated with experiments on class-conditional generation, image inpainting, and colorization. Combined with multiple architectural improvements, we achieve record-breaking performance for unconditional image generation on CIFAR-10 with an Inception score of 9.89 and FID of 2.20, a competitive likelihood of 2.99 bits/dim, and demonstrate high fidelity generation of 1024 x 1024 images for the first time from a score-based generative model.* - -## ScoreSdeVeScheduler -[[autodoc]] ScoreSdeVeScheduler - -## SdeVeOutput -[[autodoc]] schedulers.scheduling_sde_ve.SdeVeOutput diff --git a/diffusers/docs/source/en/api/schedulers/score_sde_vp.md b/diffusers/docs/source/en/api/schedulers/score_sde_vp.md deleted file mode 100644 index 4b25b259708a6b72714806b5c90bc23d31444242..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/schedulers/score_sde_vp.md +++ /dev/null @@ -1,28 +0,0 @@ - - -# ScoreSdeVpScheduler - -`ScoreSdeVpScheduler` is a variance preserving stochastic differential equation (SDE) scheduler. It was introduced in the [Score-Based Generative Modeling through Stochastic Differential Equations](https://huggingface.co/papers/2011.13456) paper by Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, Ben Poole. - -The abstract from the paper is: - -*Creating noise from data is easy; creating data from noise is generative modeling. We present a stochastic differential equation (SDE) that smoothly transforms a complex data distribution to a known prior distribution by slowly injecting noise, and a corresponding reverse-time SDE that transforms the prior distribution back into the data distribution by slowly removing the noise. Crucially, the reverse-time SDE depends only on the time-dependent gradient field (\aka, score) of the perturbed data distribution. 
By leveraging advances in score-based generative modeling, we can accurately estimate these scores with neural networks, and use numerical SDE solvers to generate samples. We show that this framework encapsulates previous approaches in score-based generative modeling and diffusion probabilistic modeling, allowing for new sampling procedures and new modeling capabilities. In particular, we introduce a predictor-corrector framework to correct errors in the evolution of the discretized reverse-time SDE. We also derive an equivalent neural ODE that samples from the same distribution as the SDE, but additionally enables exact likelihood computation, and improved sampling efficiency. In addition, we provide a new way to solve inverse problems with score-based models, as demonstrated with experiments on class-conditional generation, image inpainting, and colorization. Combined with multiple architectural improvements, we achieve record-breaking performance for unconditional image generation on CIFAR-10 with an Inception score of 9.89 and FID of 2.20, a competitive likelihood of 2.99 bits/dim, and demonstrate high fidelity generation of 1024 x 1024 images for the first time from a score-based generative model.* - - - -🚧 This scheduler is under construction! - - - -## ScoreSdeVpScheduler -[[autodoc]] schedulers.deprecated.scheduling_sde_vp.ScoreSdeVpScheduler diff --git a/diffusers/docs/source/en/api/schedulers/singlestep_dpm_solver.md b/diffusers/docs/source/en/api/schedulers/singlestep_dpm_solver.md deleted file mode 100644 index 063678f5cfb29f4f6de4960ca2889a1d78d7033b..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/schedulers/singlestep_dpm_solver.md +++ /dev/null @@ -1,35 +0,0 @@ - - -# DPMSolverSinglestepScheduler - -`DPMSolverSinglestepScheduler` is a single step scheduler from [DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps](https://huggingface.co/papers/2206.00927) and [DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models](https://huggingface.co/papers/2211.01095) by Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. - -DPMSolver (and the improved version DPMSolver++) is a fast dedicated high-order solver for diffusion ODEs with convergence order guarantee. Empirically, DPMSolver sampling with only 20 steps can generate high-quality -samples, and it can generate quite good samples even in 10 steps. - -The original implementation can be found at [LuChengTHU/dpm-solver](https://github.com/LuChengTHU/dpm-solver). - -## Tips - -It is recommended to set `solver_order` to 2 for guide sampling, and `solver_order=3` for unconditional sampling. - -Dynamic thresholding from [Imagen](https://huggingface.co/papers/2205.11487) is supported, and for pixel-space -diffusion models, you can set both `algorithm_type="dpmsolver++"` and `thresholding=True` to use dynamic -thresholding. This thresholding method is unsuitable for latent-space diffusion models such as -Stable Diffusion. 
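A hedged sketch applying the tip above, using `solver_order=2` for guided (classifier-free guidance) sampling (the model id is a placeholder):

```python
import torch

from diffusers import DiffusionPipeline, DPMSolverSinglestepScheduler

pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# solver_order=2 is recommended for guided sampling; use 3 for unconditional sampling.
pipe.scheduler = DPMSolverSinglestepScheduler.from_config(
    pipe.scheduler.config, solver_order=2
)

image = pipe("a watercolor painting of a lighthouse", num_inference_steps=20).images[0]
```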
- -## DPMSolverSinglestepScheduler -[[autodoc]] DPMSolverSinglestepScheduler - -## SchedulerOutput -[[autodoc]] schedulers.scheduling_utils.SchedulerOutput diff --git a/diffusers/docs/source/en/api/schedulers/stochastic_karras_ve.md deleted file mode 100644 index 2d08b3289c95a1ff47bf9b3c4b64cfc41f939814..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/schedulers/stochastic_karras_ve.md +++ /dev/null @@ -1,21 +0,0 @@ - - -# KarrasVeScheduler - -`KarrasVeScheduler` is a stochastic sampler tailored to variance-expanding (VE) models. It is based on the [Elucidating the Design Space of Diffusion-Based Generative Models](https://huggingface.co/papers/2206.00364) and [Score-based generative modeling through stochastic differential equations](https://huggingface.co/papers/2011.13456) papers. - -## KarrasVeScheduler -[[autodoc]] KarrasVeScheduler - -## KarrasVeOutput -[[autodoc]] schedulers.deprecated.scheduling_karras_ve.KarrasVeOutput \ No newline at end of file diff --git a/diffusers/docs/source/en/api/schedulers/tcd.md deleted file mode 100644 index 27fc111d644a5065a36f1cffdfcc487f1b4e8d2d..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/schedulers/tcd.md +++ /dev/null @@ -1,29 +0,0 @@ - - -# TCDScheduler - -[Trajectory Consistency Distillation](https://huggingface.co/papers/2402.19159) by Jianbin Zheng, Minghui Hu, Zhongyi Fan, Chaoyue Wang, Changxing Ding, Dacheng Tao and Tat-Jen Cham introduced Strategic Stochastic Sampling (Algorithm 4), which is capable of generating good samples in a small number of steps. An advanced iteration of the multistep scheduler (Algorithm 1) from [Consistency Models](https://huggingface.co/papers/2303.01469), Strategic Stochastic Sampling is specifically tailored to the trajectory consistency function. - -The abstract from the paper is: - -*Latent Consistency Model (LCM) extends the Consistency Model to the latent space and leverages the guided consistency distillation technique to achieve impressive performance in accelerating text-to-image synthesis. However, we observed that LCM struggles to generate images with both clarity and detailed intricacy. To address this limitation, we initially delve into and elucidate the underlying causes. Our investigation identifies that the primary issue stems from errors in three distinct areas. Consequently, we introduce Trajectory Consistency Distillation (TCD), which encompasses trajectory consistency function and strategic stochastic sampling. The trajectory consistency function diminishes the distillation errors by broadening the scope of the self-consistency boundary condition and endowing the TCD with the ability to accurately trace the entire trajectory of the Probability Flow ODE. Additionally, strategic stochastic sampling is specifically designed to circumvent the accumulated errors inherent in multi-step consistency sampling, which is meticulously tailored to complement the TCD model. Experiments demonstrate that TCD not only significantly enhances image quality at low NFEs but also yields more detailed results compared to the teacher model at high NFEs.* - -The original codebase can be found at [jabir-zheng/TCD](https://github.com/jabir-zheng/TCD).
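
To make Strategic Stochastic Sampling concrete, here is a minimal sketch of the usual setup: SDXL plus a TCD-distilled LoRA, a handful of steps, no classifier-free guidance, and the stochasticity controlled through `eta`. The LoRA repository id, the prompt, and the step/`eta` values are assumptions for illustration.

```py
import torch
from diffusers import StableDiffusionXLPipeline, TCDScheduler

base_id = "stabilityai/stable-diffusion-xl-base-1.0"
tcd_lora_id = "h1t/TCD-SDXL-LoRA"  # assumed TCD-distilled LoRA checkpoint

pipe = StableDiffusionXLPipeline.from_pretrained(base_id, torch_dtype=torch.float16).to("cuda")
pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config)

pipe.load_lora_weights(tcd_lora_id)
pipe.fuse_lora()

# Few steps, no CFG; `eta` controls the stochasticity of Strategic Stochastic Sampling.
image = pipe(
    "a cinematic photo of a lighthouse at dawn",
    num_inference_steps=4,
    guidance_scale=0,
    eta=0.3,
).images[0]
```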
- -## TCDScheduler -[[autodoc]] TCDScheduler - - -## TCDSchedulerOutput -[[autodoc]] schedulers.scheduling_tcd.TCDSchedulerOutput - diff --git a/diffusers/docs/source/en/api/schedulers/unipc.md deleted file mode 100644 index d82345996fba60eb07cb1fd947923e6b75d6cdae..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/schedulers/unipc.md +++ /dev/null @@ -1,35 +0,0 @@ - - -# UniPCMultistepScheduler - -`UniPCMultistepScheduler` is a training-free framework designed for fast sampling of diffusion models. It was introduced in [UniPC: A Unified Predictor-Corrector Framework for Fast Sampling of Diffusion Models](https://huggingface.co/papers/2302.04867) by Wenliang Zhao, Lujia Bai, Yongming Rao, Jie Zhou, and Jiwen Lu. - -It consists of a corrector (UniC) and a predictor (UniP) that share a unified analytical form and support arbitrary orders. -UniPC is by design model-agnostic, supporting pixel-space/latent-space DPMs on unconditional/conditional sampling. It can also be applied to both noise prediction and data prediction models. The corrector UniC can also be applied after any off-the-shelf solver to increase the order of accuracy. - -The abstract from the paper is: - -*Diffusion probabilistic models (DPMs) have demonstrated a very promising ability in high-resolution image synthesis. However, sampling from a pre-trained DPM is time-consuming due to the multiple evaluations of the denoising network, making it more and more important to accelerate the sampling of DPMs. Despite recent progress in designing fast samplers, existing methods still cannot generate satisfying images in many applications where fewer steps (e.g., <10) are favored. In this paper, we develop a unified corrector (UniC) that can be applied after any existing DPM sampler to increase the order of accuracy without extra model evaluations, and derive a unified predictor (UniP) that supports arbitrary order as a byproduct. Combining UniP and UniC, we propose a unified predictor-corrector framework called UniPC for the fast sampling of DPMs, which has a unified analytical form for any order and can significantly improve the sampling quality over previous methods, especially in extremely few steps. We evaluate our methods through extensive experiments including both unconditional and conditional sampling using pixel-space and latent-space DPMs. Our UniPC can achieve 3.87 FID on CIFAR10 (unconditional) and 7.51 FID on ImageNet 256×256 (conditional) with only 10 function evaluations. Code is available at [this https URL](https://github.com/wl-zhao/UniPC).* - -## Tips - -It is recommended to set `solver_order` to 2 for guided sampling, and `solver_order=3` for unconditional sampling. - -Dynamic thresholding from [Imagen](https://huggingface.co/papers/2205.11487) is supported, and for pixel-space -diffusion models, you can set both `predict_x0=True` and `thresholding=True` to use dynamic thresholding. This thresholding method is unsuitable for latent-space diffusion models such as Stable Diffusion.
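
As a rough illustration of these tips, the sketch below swaps `UniPCMultistepScheduler` into a pipeline and samples in about 10 steps; the checkpoint id and prompt are placeholders, and `thresholding` is left disabled because Stable Diffusion is a latent-space model.

```py
import torch
from diffusers import DiffusionPipeline, UniPCMultistepScheduler

# Placeholder checkpoint id; any diffusers-format checkpoint with a scheduler config works.
pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# solver_order=2 for guided sampling; UniPC is designed to work well with very few steps.
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, solver_order=2)

image = pipe("a watercolor painting of a fox", num_inference_steps=10).images[0]
```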
- -## UniPCMultistepScheduler -[[autodoc]] UniPCMultistepScheduler - -## SchedulerOutput -[[autodoc]] schedulers.scheduling_utils.SchedulerOutput diff --git a/diffusers/docs/source/en/api/schedulers/vq_diffusion.md b/diffusers/docs/source/en/api/schedulers/vq_diffusion.md deleted file mode 100644 index b21cba9ee5ae7311476f14430c6523c96e7ed752..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/schedulers/vq_diffusion.md +++ /dev/null @@ -1,25 +0,0 @@ - - -# VQDiffusionScheduler - -`VQDiffusionScheduler` converts the transformer model's output into a sample for the unnoised image at the previous diffusion timestep. It was introduced in [Vector Quantized Diffusion Model for Text-to-Image Synthesis](https://huggingface.co/papers/2111.14822) by Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, Baining Guo. - -The abstract from the paper is: - -*We present the vector quantized diffusion (VQ-Diffusion) model for text-to-image generation. This method is based on a vector quantized variational autoencoder (VQ-VAE) whose latent space is modeled by a conditional variant of the recently developed Denoising Diffusion Probabilistic Model (DDPM). We find that this latent-space method is well-suited for text-to-image generation tasks because it not only eliminates the unidirectional bias with existing methods but also allows us to incorporate a mask-and-replace diffusion strategy to avoid the accumulation of errors, which is a serious problem with existing methods. Our experiments show that the VQ-Diffusion produces significantly better text-to-image generation results when compared with conventional autoregressive (AR) models with similar numbers of parameters. Compared with previous GAN-based text-to-image methods, our VQ-Diffusion can handle more complex scenes and improve the synthesized image quality by a large margin. Finally, we show that the image generation computation in our method can be made highly efficient by reparameterization. With traditional AR methods, the text-to-image generation time increases linearly with the output image resolution and hence is quite time consuming even for normal size images. The VQ-Diffusion allows us to achieve a better trade-off between quality and speed. Our experiments indicate that the VQ-Diffusion model with the reparameterization is fifteen times faster than traditional AR methods while achieving a better image quality.* - -## VQDiffusionScheduler -[[autodoc]] VQDiffusionScheduler - -## VQDiffusionSchedulerOutput -[[autodoc]] schedulers.scheduling_vq_diffusion.VQDiffusionSchedulerOutput diff --git a/diffusers/docs/source/en/api/utilities.md b/diffusers/docs/source/en/api/utilities.md deleted file mode 100644 index d4f4d7d7964ff716d6ece305ba5dfe2bab8a8ce3..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/utilities.md +++ /dev/null @@ -1,43 +0,0 @@ - - -# Utilities - -Utility and helper functions for working with 🤗 Diffusers. 
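
As a quick sketch of how a couple of the helpers documented below fit together (the image URLs are placeholders, not real files):

```py
from diffusers.utils import load_image, make_image_grid

# Placeholder URLs; substitute any images you want to inspect side by side.
urls = [
    "https://example.com/image_0.png",
    "https://example.com/image_1.png",
]

images = [load_image(url) for url in urls]  # each entry is a PIL.Image.Image
grid = make_image_grid(images, rows=1, cols=len(images))
grid.save("grid.png")
```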
- -## numpy_to_pil - -[[autodoc]] utils.numpy_to_pil - -## pt_to_pil - -[[autodoc]] utils.pt_to_pil - -## load_image - -[[autodoc]] utils.load_image - -## export_to_gif - -[[autodoc]] utils.export_to_gif - -## export_to_video - -[[autodoc]] utils.export_to_video - -## make_image_grid - -[[autodoc]] utils.make_image_grid - -## randn_tensor - -[[autodoc]] utils.torch_utils.randn_tensor diff --git a/diffusers/docs/source/en/api/video_processor.md b/diffusers/docs/source/en/api/video_processor.md deleted file mode 100644 index 6461c46c286f14fcbb142859a33ad3675ee10a54..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/api/video_processor.md +++ /dev/null @@ -1,21 +0,0 @@ - - -# Video Processor - -The [`VideoProcessor`] provides a unified API for video pipelines to prepare inputs for VAE encoding and post-processing outputs once they're decoded. The class inherits [`VaeImageProcessor`] so it includes transformations such as resizing, normalization, and conversion between PIL Image, PyTorch, and NumPy arrays. - -## VideoProcessor - -[[autodoc]] video_processor.VideoProcessor.preprocess_video - -[[autodoc]] video_processor.VideoProcessor.postprocess_video diff --git a/diffusers/docs/source/en/community_projects.md b/diffusers/docs/source/en/community_projects.md deleted file mode 100644 index 4ab1829871c820e1ea9e1cd554748f053b5f0307..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/community_projects.md +++ /dev/null @@ -1,82 +0,0 @@ - - -# Community Projects - -Welcome to Community Projects. This space is dedicated to showcasing the incredible work and innovative applications created by our vibrant community using the `diffusers` library. - -This section aims to: - -- Highlight diverse and inspiring projects built with `diffusers` -- Foster knowledge sharing within our community -- Provide real-world examples of how `diffusers` can be leveraged - -Happy exploring, and thank you for being part of the Diffusers community! - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
| Project Name | Description |
|---|---|
| dream-textures | Stable Diffusion built-in to Blender |
| HiDiffusion | Increases the resolution and speed of your diffusion model by only adding a single line of code |
| IC-Light | IC-Light is a project to manipulate the illumination of images |
| InstantID | InstantID: Zero-shot Identity-Preserving Generation in Seconds |
| IOPaint | Image inpainting tool powered by state-of-the-art AI models. Remove any unwanted object, defect, or person from your pictures, or erase and replace (powered by Stable Diffusion) anything in your pictures |
| Kohya | Gradio GUI for Kohya's Stable Diffusion trainers |
| MagicAnimate | MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model |
| OOTDiffusion | Outfitting Fusion based Latent Diffusion for Controllable Virtual Try-on |
| SD.Next | SD.Next: Advanced Implementation of Stable Diffusion and other Diffusion-based generative image models |
| stable-dreamfusion | Text-to-3D & Image-to-3D & Mesh Exportation with NeRF + Diffusion |
| StoryDiffusion | StoryDiffusion can create a magic story by generating consistent images and videos |
| StreamDiffusion | A Pipeline-Level Solution for Real-Time Interactive Generation |
| Stable Diffusion Server | A server configured for Inpainting/Generation/img2img with one stable diffusion model |
diff --git a/diffusers/docs/source/en/conceptual/contribution.md b/diffusers/docs/source/en/conceptual/contribution.md deleted file mode 100644 index b4d33cb5e39cf43f4c99735600f5d8aa951d2b84..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/conceptual/contribution.md +++ /dev/null @@ -1,568 +0,0 @@ - - -# How to contribute to Diffusers 🧨 - -We ❤️ contributions from the open-source community! Everyone is welcome, and all types of participation –not just code– are valued and appreciated. Answering questions, helping others, reaching out, and improving the documentation are all immensely valuable to the community, so don't be afraid and get involved if you're up for it! - -Everyone is encouraged to start by saying 👋 in our public Discord channel. We discuss the latest trends in diffusion models, ask questions, show off personal projects, help each other with contributions, or just hang out ☕. Join us on Discord - -Whichever way you choose to contribute, we strive to be part of an open, welcoming, and kind community. Please, read our [code of conduct](https://github.com/huggingface/diffusers/blob/main/CODE_OF_CONDUCT.md) and be mindful to respect it during your interactions. We also recommend you become familiar with the [ethical guidelines](https://huggingface.co/docs/diffusers/conceptual/ethical_guidelines) that guide our project and ask you to adhere to the same principles of transparency and responsibility. - -We enormously value feedback from the community, so please do not be afraid to speak up if you believe you have valuable feedback that can help improve the library - every message, comment, issue, and pull request (PR) is read and considered. - -## Overview - -You can contribute in many ways ranging from answering questions on issues and discussions to adding new diffusion models to the core library. - -In the following, we give an overview of different ways to contribute, ranked by difficulty in ascending order. All of them are valuable to the community. - -* 1. Asking and answering questions on [the Diffusers discussion forum](https://discuss.huggingface.co/c/discussion-related-to-httpsgithubcomhuggingfacediffusers) or on [Discord](https://discord.gg/G7tWnz98XR). -* 2. Opening new issues on [the GitHub Issues tab](https://github.com/huggingface/diffusers/issues/new/choose) or new discussions on [the GitHub Discussions tab](https://github.com/huggingface/diffusers/discussions/new/choose). -* 3. Answering issues on [the GitHub Issues tab](https://github.com/huggingface/diffusers/issues) or discussions on [the GitHub Discussions tab](https://github.com/huggingface/diffusers/discussions). -* 4. Fix a simple issue, marked by the "Good first issue" label, see [here](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22). -* 5. Contribute to the [documentation](https://github.com/huggingface/diffusers/tree/main/docs/source). -* 6. Contribute a [Community Pipeline](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3Acommunity-examples). -* 7. Contribute to the [examples](https://github.com/huggingface/diffusers/tree/main/examples). -* 8. Fix a more difficult issue, marked by the "Good second issue" label, see [here](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22Good+second+issue%22). -* 9. 
Add a new pipeline, model, or scheduler, see ["New Pipeline/Model"](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22New+pipeline%2Fmodel%22) and ["New scheduler"](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22New+scheduler%22) issues. For this contribution, please have a look at [Design Philosophy](https://github.com/huggingface/diffusers/blob/main/PHILOSOPHY.md). - -As said before, **all contributions are valuable to the community**. -In the following, we will explain each contribution a bit more in detail. - -For all contributions 4 - 9, you will need to open a PR. It is explained in detail how to do so in [Opening a pull request](#how-to-open-a-pr). - -### 1. Asking and answering questions on the Diffusers discussion forum or on the Diffusers Discord - -Any question or comment related to the Diffusers library can be asked on the [discussion forum](https://discuss.huggingface.co/c/discussion-related-to-httpsgithubcomhuggingfacediffusers/) or on [Discord](https://discord.gg/G7tWnz98XR). Such questions and comments include (but are not limited to): -- Reports of training or inference experiments in an attempt to share knowledge -- Presentation of personal projects -- Questions to non-official training examples -- Project proposals -- General feedback -- Paper summaries -- Asking for help on personal projects that build on top of the Diffusers library -- General questions -- Ethical questions regarding diffusion models -- ... - -Every question that is asked on the forum or on Discord actively encourages the community to publicly -share knowledge and might very well help a beginner in the future who has the same question you're -having. Please do pose any questions you might have. -In the same spirit, you are of immense help to the community by answering such questions because this way you are publicly documenting knowledge for everybody to learn from. - -**Please** keep in mind that the more effort you put into asking or answering a question, the higher -the quality of the publicly documented knowledge. In the same way, well-posed and well-answered questions create a high-quality knowledge database accessible to everybody, while badly posed questions or answers reduce the overall quality of the public knowledge database. -In short, a high quality question or answer is *precise*, *concise*, *relevant*, *easy-to-understand*, *accessible*, and *well-formatted/well-posed*. For more information, please have a look through the [How to write a good issue](#how-to-write-a-good-issue) section. - -**NOTE about channels**: -[*The forum*](https://discuss.huggingface.co/c/discussion-related-to-httpsgithubcomhuggingfacediffusers/63) is much better indexed by search engines, such as Google. Posts are ranked by popularity rather than chronologically. Hence, it's easier to look up questions and answers that we posted some time ago. -In addition, questions and answers posted in the forum can easily be linked to. -In contrast, *Discord* has a chat-like format that invites fast back-and-forth communication. -While it will most likely take less time for you to get an answer to your question on Discord, your -question won't be visible anymore over time. Also, it's much harder to find information that was posted a while back on Discord. We therefore strongly recommend using the forum for high-quality questions and answers in an attempt to create long-lasting knowledge for the community. 
If discussions on Discord lead to very interesting answers and conclusions, we recommend posting the results on the forum to make the information more available for future readers. - -### 2. Opening new issues on the GitHub issues tab - -The 🧨 Diffusers library is robust and reliable thanks to the users who notify us of -the problems they encounter. So thank you for reporting an issue. - -Remember, GitHub issues are reserved for technical questions directly related to the Diffusers library, bug reports, feature requests, or feedback on the library design. - -In a nutshell, this means that everything that is **not** related to the **code of the Diffusers library** (including the documentation) should **not** be asked on GitHub, but rather on either the [forum](https://discuss.huggingface.co/c/discussion-related-to-httpsgithubcomhuggingfacediffusers/63) or [Discord](https://discord.gg/G7tWnz98XR). - -**Please consider the following guidelines when opening a new issue**: -- Make sure you have searched whether your issue has already been asked before (use the search bar on GitHub under Issues). -- Please never report a new issue on another (related) issue. If another issue is highly related, please -open a new issue nevertheless and link to the related issue. -- Make sure your issue is written in English. Please use one of the great, free online translation services, such as [DeepL](https://www.deepl.com/translator) to translate from your native language to English if you are not comfortable in English. -- Check whether your issue might be solved by updating to the newest Diffusers version. Before posting your issue, please make sure that `python -c "import diffusers; print(diffusers.__version__)"` is higher or matches the latest Diffusers version. -- Remember that the more effort you put into opening a new issue, the higher the quality of your answer will be and the better the overall quality of the Diffusers issues. - -New issues usually include the following. - -#### 2.1. Reproducible, minimal bug reports - -A bug report should always have a reproducible code snippet and be as minimal and concise as possible. -This means in more detail: -- Narrow the bug down as much as you can, **do not just dump your whole code file**. -- Format your code. -- Do not include any external libraries except for Diffusers depending on them. -- **Always** provide all necessary information about your environment; for this, you can run: `diffusers-cli env` in your shell and copy-paste the displayed information to the issue. -- Explain the issue. If the reader doesn't know what the issue is and why it is an issue, (s)he cannot solve it. -- **Always** make sure the reader can reproduce your issue with as little effort as possible. If your code snippet cannot be run because of missing libraries or undefined variables, the reader cannot help you. Make sure your reproducible code snippet is as minimal as possible and can be copy-pasted into a simple Python shell. -- If in order to reproduce your issue a model and/or dataset is required, make sure the reader has access to that model or dataset. You can always upload your model or dataset to the [Hub](https://huggingface.co) to make it easily downloadable. Try to keep your model and dataset as small as possible, to make the reproduction of your issue as effortless as possible. - -For more information, please have a look through the [How to write a good issue](#how-to-write-a-good-issue) section. 
- -You can open a bug report [here](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=bug&projects=&template=bug-report.yml). - -#### 2.2. Feature requests - -A world-class feature request addresses the following points: - -1. Motivation first: -* Is it related to a problem/frustration with the library? If so, please explain -why. Providing a code snippet that demonstrates the problem is best. -* Is it related to something you would need for a project? We'd love to hear -about it! -* Is it something you worked on and think could benefit the community? -Awesome! Tell us what problem it solved for you. -2. Write a *full paragraph* describing the feature; -3. Provide a **code snippet** that demonstrates its future use; -4. In case this is related to a paper, please attach a link; -5. Attach any additional information (drawings, screenshots, etc.) you think may help. - -You can open a feature request [here](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=&template=feature_request.md&title=). - -#### 2.3 Feedback - -Feedback about the library design and why it is good or not good helps the core maintainers immensely to build a user-friendly library. To understand the philosophy behind the current design philosophy, please have a look [here](https://huggingface.co/docs/diffusers/conceptual/philosophy). If you feel like a certain design choice does not fit with the current design philosophy, please explain why and how it should be changed. If a certain design choice follows the design philosophy too much, hence restricting use cases, explain why and how it should be changed. -If a certain design choice is very useful for you, please also leave a note as this is great feedback for future design decisions. - -You can open an issue about feedback [here](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=&template=feedback.md&title=). - -#### 2.4 Technical questions - -Technical questions are mainly about why certain code of the library was written in a certain way, or what a certain part of the code does. Please make sure to link to the code in question and please provide details on -why this part of the code is difficult to understand. - -You can open an issue about a technical question [here](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=bug&template=bug-report.yml). - -#### 2.5 Proposal to add a new model, scheduler, or pipeline - -If the diffusion model community released a new model, pipeline, or scheduler that you would like to see in the Diffusers library, please provide the following information: - -* Short description of the diffusion pipeline, model, or scheduler and link to the paper or public release. -* Link to any of its open-source implementation(s). -* Link to the model weights if they are available. - -If you are willing to contribute to the model yourself, let us know so we can best guide you. Also, don't forget -to tag the original author of the component (model, scheduler, pipeline, etc.) by GitHub handle if you can find it. - -You can open a request for a model/pipeline/scheduler [here](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=New+model%2Fpipeline%2Fscheduler&template=new-model-addition.yml). - -### 3. Answering issues on the GitHub issues tab - -Answering issues on GitHub might require some technical knowledge of Diffusers, but we encourage everybody to give it a try even if you are not 100% certain that your answer is correct. 
-Some tips to give a high-quality answer to an issue: -- Be as concise and minimal as possible. -- Stay on topic. An answer to the issue should concern the issue and only the issue. -- Provide links to code, papers, or other sources that prove or encourage your point. -- Answer in code. If a simple code snippet is the answer to the issue or shows how the issue can be solved, please provide a fully reproducible code snippet. - -Also, many issues tend to be simply off-topic, duplicates of other issues, or irrelevant. It is of great -help to the maintainers if you can answer such issues, encouraging the author of the issue to be -more precise, provide the link to a duplicated issue or redirect them to [the forum](https://discuss.huggingface.co/c/discussion-related-to-httpsgithubcomhuggingfacediffusers/63) or [Discord](https://discord.gg/G7tWnz98XR). - -If you have verified that the issued bug report is correct and requires a correction in the source code, -please have a look at the next sections. - -For all of the following contributions, you will need to open a PR. It is explained in detail how to do so in the [Opening a pull request](#how-to-open-a-pr) section. - -### 4. Fixing a "Good first issue" - -*Good first issues* are marked by the [Good first issue](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22) label. Usually, the issue already -explains how a potential solution should look so that it is easier to fix. -If the issue hasn't been closed and you would like to try to fix this issue, you can just leave a message "I would like to try this issue.". There are usually three scenarios: -- a.) The issue description already proposes a fix. In this case and if the solution makes sense to you, you can open a PR or draft PR to fix it. -- b.) The issue description does not propose a fix. In this case, you can ask what a proposed fix could look like and someone from the Diffusers team should answer shortly. If you have a good idea of how to fix it, feel free to directly open a PR. -- c.) There is already an open PR to fix the issue, but the issue hasn't been closed yet. If the PR has gone stale, you can simply open a new PR and link to the stale PR. PRs often go stale if the original contributor who wanted to fix the issue suddenly cannot find the time anymore to proceed. This often happens in open-source and is very normal. In this case, the community will be very happy if you give it a new try and leverage the knowledge of the existing PR. If there is already a PR and it is active, you can help the author by giving suggestions, reviewing the PR or even asking whether you can contribute to the PR. - - -### 5. Contribute to the documentation - -A good library **always** has good documentation! The official documentation is often one of the first points of contact for new users of the library, and therefore contributing to the documentation is a **highly -valuable contribution**. - -Contributing to the library can have many forms: - -- Correcting spelling or grammatical errors. -- Correct incorrect formatting of the docstring. If you see that the official documentation is weirdly displayed or a link is broken, we would be very happy if you take some time to correct it. -- Correct the shape or dimensions of a docstring input or output tensor. -- Clarify documentation that is hard to understand or incorrect. -- Update outdated code examples. -- Translating the documentation to another language. 
- -Anything displayed on [the official Diffusers doc page](https://huggingface.co/docs/diffusers/index) is part of the official documentation and can be corrected, adjusted in the respective [documentation source](https://github.com/huggingface/diffusers/tree/main/docs/source). - -Please have a look at [this page](https://github.com/huggingface/diffusers/tree/main/docs) on how to verify changes made to the documentation locally. - -### 6. Contribute a community pipeline - -> [!TIP] -> Read the [Community pipelines](../using-diffusers/custom_pipeline_overview#community-pipelines) guide to learn more about the difference between a GitHub and Hugging Face Hub community pipeline. If you're interested in why we have community pipelines, take a look at GitHub Issue [#841](https://github.com/huggingface/diffusers/issues/841) (basically, we can't maintain all the possible ways diffusion models can be used for inference but we also don't want to prevent the community from building them). - -Contributing a community pipeline is a great way to share your creativity and work with the community. It lets you build on top of the [`DiffusionPipeline`] so that anyone can load and use it by setting the `custom_pipeline` parameter. This section will walk you through how to create a simple pipeline where the UNet only does a single forward pass and calls the scheduler once (a "one-step" pipeline). - -1. Create a one_step_unet.py file for your community pipeline. This file can contain whatever package you want to use as long as it's installed by the user. Make sure you only have one pipeline class that inherits from [`DiffusionPipeline`] to load model weights and the scheduler configuration from the Hub. Add a UNet and scheduler to the `__init__` function. - - You should also add the `register_modules` function to ensure your pipeline and its components can be saved with [`~DiffusionPipeline.save_pretrained`]. - -```py -from diffusers import DiffusionPipeline -import torch - -class UnetSchedulerOneForwardPipeline(DiffusionPipeline): - def __init__(self, unet, scheduler): - super().__init__() - - self.register_modules(unet=unet, scheduler=scheduler) -``` - -1. In the forward pass (which we recommend defining as `__call__`), you can add any feature you'd like. For the "one-step" pipeline, create a random image and call the UNet and scheduler once by setting `timestep=1`. - -```py - from diffusers import DiffusionPipeline - import torch - - class UnetSchedulerOneForwardPipeline(DiffusionPipeline): - def __init__(self, unet, scheduler): - super().__init__() - - self.register_modules(unet=unet, scheduler=scheduler) - - def __call__(self): - image = torch.randn( - (1, self.unet.config.in_channels, self.unet.config.sample_size, self.unet.config.sample_size), - ) - timestep = 1 - - model_output = self.unet(image, timestep).sample - scheduler_output = self.scheduler.step(model_output, timestep, image).prev_sample - - return scheduler_output -``` - -Now you can run the pipeline by passing a UNet and scheduler to it or load pretrained weights if the pipeline structure is identical. 
- -```py -from diffusers import DDPMScheduler, UNet2DModel - -scheduler = DDPMScheduler() -unet = UNet2DModel() - -pipeline = UnetSchedulerOneForwardPipeline(unet=unet, scheduler=scheduler) -output = pipeline() -# load pretrained weights -pipeline = UnetSchedulerOneForwardPipeline.from_pretrained("google/ddpm-cifar10-32", use_safetensors=True) -output = pipeline() -``` - -You can either share your pipeline as a GitHub community pipeline or Hub community pipeline. - - - - -Share your GitHub pipeline by opening a pull request on the Diffusers [repository](https://github.com/huggingface/diffusers) and add the one_step_unet.py file to the [examples/community](https://github.com/huggingface/diffusers/tree/main/examples/community) subfolder. - - - - -Share your Hub pipeline by creating a model repository on the Hub and uploading the one_step_unet.py file to it. - - - - -### 7. Contribute to training examples - -Diffusers examples are a collection of training scripts that reside in [examples](https://github.com/huggingface/diffusers/tree/main/examples). - -We support two types of training examples: - -- Official training examples -- Research training examples - -Research training examples are located in [examples/research_projects](https://github.com/huggingface/diffusers/tree/main/examples/research_projects) whereas official training examples include all folders under [examples](https://github.com/huggingface/diffusers/tree/main/examples) except the `research_projects` and `community` folders. -The official training examples are maintained by the Diffusers' core maintainers whereas the research training examples are maintained by the community. -This is because of the same reasons put forward in [6. Contribute a community pipeline](#6-contribute-a-community-pipeline) for official pipelines vs. community pipelines: It is not feasible for the core maintainers to maintain all possible training methods for diffusion models. -If the Diffusers core maintainers and the community consider a certain training paradigm to be too experimental or not popular enough, the corresponding training code should be put in the `research_projects` folder and maintained by the author. - -Both official training and research examples consist of a directory that contains one or more training scripts, a `requirements.txt` file, and a `README.md` file. In order for the user to make use of the -training examples, it is required to clone the repository: - -```bash -git clone https://github.com/huggingface/diffusers -``` - -as well as to install all additional dependencies required for training: - -```bash -cd diffusers -pip install -r examples//requirements.txt -``` - -Therefore when adding an example, the `requirements.txt` file shall define all pip dependencies required for your training example so that once all those are installed, the user can run the example's training script. See, for example, the [DreamBooth `requirements.txt` file](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/requirements.txt). - -Training examples of the Diffusers library should adhere to the following philosophy: -- All the code necessary to run the examples should be found in a single Python file. -- One should be able to run the example from the command line with `python .py --args`. -- Examples should be kept simple and serve as **an example** on how to use Diffusers for training. 
The purpose of example scripts is **not** to create state-of-the-art diffusion models, but rather to reproduce known training schemes without adding too much custom logic. As a byproduct of this point, our examples also strive to serve as good educational materials. - -To contribute an example, it is highly recommended to look at already existing examples such as [dreambooth](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth.py) to get an idea of how they should look like. -We strongly advise contributors to make use of the [Accelerate library](https://github.com/huggingface/accelerate) as it's tightly integrated -with Diffusers. -Once an example script works, please make sure to add a comprehensive `README.md` that states how to use the example exactly. This README should include: -- An example command on how to run the example script as shown [here](https://github.com/huggingface/diffusers/tree/main/examples/dreambooth#running-locally-with-pytorch). -- A link to some training results (logs, models, etc.) that show what the user can expect as shown [here](https://api.wandb.ai/report/patrickvonplaten/xm6cd5q5). -- If you are adding a non-official/research training example, **please don't forget** to add a sentence that you are maintaining this training example which includes your git handle as shown [here](https://github.com/huggingface/diffusers/tree/main/examples/research_projects/intel_opts#diffusers-examples-with-intel-optimizations). - -If you are contributing to the official training examples, please also make sure to add a test to its folder such as [examples/dreambooth/test_dreambooth.py](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/test_dreambooth.py). This is not necessary for non-official training examples. - -### 8. Fixing a "Good second issue" - -*Good second issues* are marked by the [Good second issue](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22Good+second+issue%22) label. Good second issues are -usually more complicated to solve than [Good first issues](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22). -The issue description usually gives less guidance on how to fix the issue and requires -a decent understanding of the library by the interested contributor. -If you are interested in tackling a good second issue, feel free to open a PR to fix it and link the PR to the issue. If you see that a PR has already been opened for this issue but did not get merged, have a look to understand why it wasn't merged and try to open an improved PR. -Good second issues are usually more difficult to get merged compared to good first issues, so don't hesitate to ask for help from the core maintainers. If your PR is almost finished the core maintainers can also jump into your PR and commit to it in order to get it merged. - -### 9. Adding pipelines, models, schedulers - -Pipelines, models, and schedulers are the most important pieces of the Diffusers library. -They provide easy access to state-of-the-art diffusion technologies and thus allow the community to -build powerful generative AI applications. - -By adding a new model, pipeline, or scheduler you might enable a new powerful use case for any of the user interfaces relying on Diffusers which can be of immense value for the whole generative AI ecosystem. 
- -Diffusers has a couple of open feature requests for all three components - feel free to gloss over them -if you don't know yet what specific component you would like to add: -- [Model or pipeline](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22New+pipeline%2Fmodel%22) -- [Scheduler](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22New+scheduler%22) - -Before adding any of the three components, it is strongly recommended that you give the [Philosophy guide](philosophy) a read to better understand the design of any of the three components. Please be aware that we cannot merge model, scheduler, or pipeline additions that strongly diverge from our design philosophy -as it will lead to API inconsistencies. If you fundamentally disagree with a design choice, please open a [Feedback issue](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=&template=feedback.md&title=) instead so that it can be discussed whether a certain design pattern/design choice shall be changed everywhere in the library and whether we shall update our design philosophy. Consistency across the library is very important for us. - -Please make sure to add links to the original codebase/paper to the PR and ideally also ping the original author directly on the PR so that they can follow the progress and potentially help with questions. - -If you are unsure or stuck in the PR, don't hesitate to leave a message to ask for a first review or help. - -#### Copied from mechanism - -A unique and important feature to understand when adding any pipeline, model or scheduler code is the `# Copied from` mechanism. You'll see this all over the Diffusers codebase, and the reason we use it is to keep the codebase easy to understand and maintain. Marking code with the `# Copied from` mechanism forces the marked code to be identical to the code it was copied from. This makes it easy to update and propagate changes across many files whenever you run `make fix-copies`. - -For example, in the code example below, [`~diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is the original code and `AltDiffusionPipelineOutput` uses the `# Copied from` mechanism to copy it. The only difference is changing the class prefix from `Stable` to `Alt`. - -```py -# Copied from diffusers.pipelines.stable_diffusion.pipeline_output.StableDiffusionPipelineOutput with Stable->Alt -class AltDiffusionPipelineOutput(BaseOutput): - """ - Output class for Alt Diffusion pipelines. - - Args: - images (`List[PIL.Image.Image]` or `np.ndarray`) - List of denoised PIL images of length `batch_size` or NumPy array of shape `(batch_size, height, width, - num_channels)`. - nsfw_content_detected (`List[bool]`) - List indicating whether the corresponding generated image contains "not-safe-for-work" (nsfw) content or - `None` if safety checking could not be performed. - """ -``` - -To learn more, read this section of the [~Don't~ Repeat Yourself*](https://huggingface.co/blog/transformers-design-philosophy#4-machine-learning-models-are-static) blog post. - -## How to write a good issue - -**The better your issue is written, the higher the chances that it will be quickly resolved.** - -1. Make sure that you've used the correct template for your issue. You can pick between *Bug Report*, *Feature Request*, *Feedback about API Design*, *New model/pipeline/scheduler addition*, *Forum*, or a blank issue. 
Make sure to pick the correct one when opening [a new issue](https://github.com/huggingface/diffusers/issues/new/choose). -2. **Be precise**: Give your issue a fitting title. Try to formulate your issue description as simple as possible. The more precise you are when submitting an issue, the less time it takes to understand the issue and potentially solve it. Make sure to open an issue for one issue only and not for multiple issues. If you found multiple issues, simply open multiple issues. If your issue is a bug, try to be as precise as possible about what bug it is - you should not just write "Error in diffusers". -3. **Reproducibility**: No reproducible code snippet == no solution. If you encounter a bug, maintainers **have to be able to reproduce** it. Make sure that you include a code snippet that can be copy-pasted into a Python interpreter to reproduce the issue. Make sure that your code snippet works, *i.e.* that there are no missing imports or missing links to images, ... Your issue should contain an error message **and** a code snippet that can be copy-pasted without any changes to reproduce the exact same error message. If your issue is using local model weights or local data that cannot be accessed by the reader, the issue cannot be solved. If you cannot share your data or model, try to make a dummy model or dummy data. -4. **Minimalistic**: Try to help the reader as much as you can to understand the issue as quickly as possible by staying as concise as possible. Remove all code / all information that is irrelevant to the issue. If you have found a bug, try to create the easiest code example you can to demonstrate your issue, do not just dump your whole workflow into the issue as soon as you have found a bug. E.g., if you train a model and get an error at some point during the training, you should first try to understand what part of the training code is responsible for the error and try to reproduce it with a couple of lines. Try to use dummy data instead of full datasets. -5. Add links. If you are referring to a certain naming, method, or model make sure to provide a link so that the reader can better understand what you mean. If you are referring to a specific PR or issue, make sure to link it to your issue. Do not assume that the reader knows what you are talking about. The more links you add to your issue the better. -6. Formatting. Make sure to nicely format your issue by formatting code into Python code syntax, and error messages into normal code syntax. See the [official GitHub formatting docs](https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax) for more information. -7. Think of your issue not as a ticket to be solved, but rather as a beautiful entry to a well-written encyclopedia. Every added issue is a contribution to publicly available knowledge. By adding a nicely written issue you not only make it easier for maintainers to solve your issue, but you are helping the whole community to better understand a certain aspect of the library. - -## How to write a good PR - -1. Be a chameleon. Understand existing design patterns and syntax and make sure your code additions flow seamlessly into the existing code base. Pull requests that significantly diverge from existing design patterns or user interfaces will not be merged. -2. Be laser focused. A pull request should solve one problem and one problem only. 
Make sure to not fall into the trap of "also fixing another problem while we're adding it". It is much more difficult to review pull requests that solve multiple, unrelated problems at once. -3. If helpful, try to add a code snippet that displays an example of how your addition can be used. -4. The title of your pull request should be a summary of its contribution. -5. If your pull request addresses an issue, please mention the issue number in -the pull request description to make sure they are linked (and people -consulting the issue know you are working on it); -6. To indicate a work in progress please prefix the title with `[WIP]`. These -are useful to avoid duplicated work, and to differentiate it from PRs ready -to be merged; -7. Try to formulate and format your text as explained in [How to write a good issue](#how-to-write-a-good-issue). -8. Make sure existing tests pass; -9. Add high-coverage tests. No quality testing = no merge. -- If you are adding new `@slow` tests, make sure they pass using -`RUN_SLOW=1 python -m pytest tests/test_my_new_model.py`. -CircleCI does not run the slow tests, but GitHub Actions does every night! -10. All public methods must have informative docstrings that work nicely with markdown. See [`pipeline_latent_diffusion.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion.py) for an example. -11. Due to the rapidly growing repository, it is important to make sure that no files that would significantly weigh down the repository are added. This includes images, videos, and other non-text files. We prefer to leverage a hf.co hosted `dataset` like -[`hf-internal-testing`](https://huggingface.co/hf-internal-testing) or [huggingface/documentation-images](https://huggingface.co/datasets/huggingface/documentation-images) to place these files. -If an external contribution, feel free to add the images to your PR and ask a Hugging Face member to migrate your images -to this dataset. - -## How to open a PR - -Before writing code, we strongly advise you to search through the existing PRs or -issues to make sure that nobody is already working on the same thing. If you are -unsure, it is always a good idea to open an issue to get some feedback. - -You will need basic `git` proficiency to be able to contribute to -🧨 Diffusers. `git` is not the easiest tool to use but it has the greatest -manual. Type `git --help` in a shell and enjoy. If you prefer books, [Pro -Git](https://git-scm.com/book/en/v2) is a very good reference. - -Follow these steps to start contributing ([supported Python versions](https://github.com/huggingface/diffusers/blob/83bc6c94eaeb6f7704a2a428931cf2d9ad973ae9/setup.py#L270)): - -1. Fork the [repository](https://github.com/huggingface/diffusers) by -clicking on the 'Fork' button on the repository's page. This creates a copy of the code -under your GitHub user account. - -2. Clone your fork to your local disk, and add the base repository as a remote: - - ```bash - $ git clone git@github.com:/diffusers.git - $ cd diffusers - $ git remote add upstream https://github.com/huggingface/diffusers.git - ``` - -3. Create a new branch to hold your development changes: - - ```bash - $ git checkout -b a-descriptive-name-for-my-changes - ``` - -**Do not** work on the `main` branch. - -4. 
Set up a development environment by running the following command in a virtual environment: - - ```bash - $ pip install -e ".[dev]" - ``` - -If you have already cloned the repo, you might need to `git pull` to get the most recent changes in the -library. - -5. Develop the features on your branch. - -As you work on the features, you should make sure that the test suite -passes. You should run the tests impacted by your changes like this: - - ```bash - $ pytest tests/.py - ``` - -Before you run the tests, please make sure you install the dependencies required for testing. You can do so -with this command: - - ```bash - $ pip install -e ".[test]" - ``` - -You can also run the full test suite with the following command, but it takes -a beefy machine to produce a result in a decent amount of time now that -Diffusers has grown a lot. Here is the command for it: - - ```bash - $ make test - ``` - -🧨 Diffusers relies on `black` and `isort` to format its source code -consistently. After you make changes, apply automatic style corrections and code verifications -that can't be automated in one go with: - - ```bash - $ make style - ``` - -🧨 Diffusers also uses `ruff` and a few custom scripts to check for coding mistakes. Quality -control runs in CI, however, you can also run the same checks with: - - ```bash - $ make quality - ``` - -Once you're happy with your changes, add changed files using `git add` and -make a commit with `git commit` to record your changes locally: - - ```bash - $ git add modified_file.py - $ git commit -m "A descriptive message about your changes." - ``` - -It is a good idea to sync your copy of the code with the original -repository regularly. This way you can quickly account for changes: - - ```bash - $ git pull upstream main - ``` - -Push the changes to your account using: - - ```bash - $ git push -u origin a-descriptive-name-for-my-changes - ``` - -6. Once you are satisfied, go to the -webpage of your fork on GitHub. Click on 'Pull request' to send your changes -to the project maintainers for review. - -7. It's OK if maintainers ask you for changes. It happens to core contributors -too! So everyone can see the changes in the Pull request, work in your local -branch and push the changes to your fork. They will automatically appear in -the pull request. - -### Tests - -An extensive test suite is included to test the library behavior and several examples. Library tests can be found in -the [tests folder](https://github.com/huggingface/diffusers/tree/main/tests). - -We like `pytest` and `pytest-xdist` because it's faster. From the root of the -repository, here's how to run tests with `pytest` for the library: - -```bash -$ python -m pytest -n auto --dist=loadfile -s -v ./tests/ -``` - -In fact, that's how `make test` is implemented! - -You can specify a smaller set of tests in order to test only the feature -you're working on. - -By default, slow tests are skipped. Set the `RUN_SLOW` environment variable to -`yes` to run them. This will download many gigabytes of models — make sure you -have enough disk space and a good Internet connection, or a lot of patience! - -```bash -$ RUN_SLOW=yes python -m pytest -n auto --dist=loadfile -s -v ./tests/ -``` - -`unittest` is fully supported, here's how to run tests with it: - -```bash -$ python -m unittest discover -s tests -t . 
-v -$ python -m unittest discover -s examples -t examples -v -``` - -### Syncing forked main with upstream (HuggingFace) main - -To avoid pinging the upstream repository which adds reference notes to each upstream PR and sends unnecessary notifications to the developers involved in these PRs, -when syncing the main branch of a forked repository, please, follow these steps: -1. When possible, avoid syncing with the upstream using a branch and PR on the forked repository. Instead, merge directly into the forked main. -2. If a PR is absolutely necessary, use the following steps after checking out your branch: -```bash -$ git checkout -b your-branch-for-syncing -$ git pull --squash --no-commit upstream main -$ git commit -m '' -$ git push --set-upstream origin your-branch-for-syncing -``` - -### Style guide - -For documentation strings, 🧨 Diffusers follows the [Google style](https://google.github.io/styleguide/pyguide.html). \ No newline at end of file diff --git a/diffusers/docs/source/en/conceptual/ethical_guidelines.md b/diffusers/docs/source/en/conceptual/ethical_guidelines.md deleted file mode 100644 index 426aed032d77315e2ebcc8ba9b532dc80c41c0c3..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/conceptual/ethical_guidelines.md +++ /dev/null @@ -1,63 +0,0 @@ - - -# 🧨 Diffusers’ Ethical Guidelines - -## Preamble - -[Diffusers](https://huggingface.co/docs/diffusers/index) provides pre-trained diffusion models and serves as a modular toolbox for inference and training. - -Given its real case applications in the world and potential negative impacts on society, we think it is important to provide the project with ethical guidelines to guide the development, users’ contributions, and usage of the Diffusers library. - -The risks associated with using this technology are still being examined, but to name a few: copyrights issues for artists; deep-fake exploitation; sexual content generation in inappropriate contexts; non-consensual impersonation; harmful social biases perpetuating the oppression of marginalized groups. -We will keep tracking risks and adapt the following guidelines based on the community's responsiveness and valuable feedback. - - -## Scope - -The Diffusers community will apply the following ethical guidelines to the project’s development and help coordinate how the community will integrate the contributions, especially concerning sensitive topics related to ethical concerns. - - -## Ethical guidelines - -The following ethical guidelines apply generally, but we will primarily implement them when dealing with ethically sensitive issues while making a technical choice. Furthermore, we commit to adapting those ethical principles over time following emerging harms related to the state of the art of the technology in question. - -- **Transparency**: we are committed to being transparent in managing PRs, explaining our choices to users, and making technical decisions. - -- **Consistency**: we are committed to guaranteeing our users the same level of attention in project management, keeping it technically stable and consistent. - -- **Simplicity**: with a desire to make it easy to use and exploit the Diffusers library, we are committed to keeping the project’s goals lean and coherent. - -- **Accessibility**: the Diffusers project helps lower the entry bar for contributors who can help run it even without technical expertise. Doing so makes research artifacts more accessible to the community. 
- -- **Reproducibility**: we aim to be transparent about the reproducibility of upstream code, models, and datasets when made available through the Diffusers library. - -- **Responsibility**: as a community and through teamwork, we hold a collective responsibility to our users by anticipating and mitigating this technology's potential risks and dangers. - - -## Examples of implementations: Safety features and Mechanisms - -The team works daily to make the technical and non-technical tools available to deal with the potential ethical and social risks associated with diffusion technology. Moreover, the community's input is invaluable in ensuring these features' implementation and raising awareness with us. - -- [**Community tab**](https://huggingface.co/docs/hub/repositories-pull-requests-discussions): it enables the community to discuss and better collaborate on a project. - -- **Bias exploration and evaluation**: the Hugging Face team provides a [space](https://huggingface.co/spaces/society-ethics/DiffusionBiasExplorer) to demonstrate the biases in Stable Diffusion interactively. In this sense, we support and encourage bias explorers and evaluations. - -- **Encouraging safety in deployment** - - - [**Safe Stable Diffusion**](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_safe): It mitigates the well-known issue that models, like Stable Diffusion, that are trained on unfiltered, web-crawled datasets tend to suffer from inappropriate degeneration. Related paper: [Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models](https://arxiv.org/abs/2211.05105). - - - [**Safety Checker**](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/safety_checker.py): It checks and compares the class probability of a set of hard-coded harmful concepts in the embedding space against an image after it has been generated. The harmful concepts are intentionally hidden to prevent reverse engineering of the checker. - -- **Staged released on the Hub**: in particularly sensitive situations, access to some repositories should be restricted. This staged release is an intermediary step that allows the repository’s authors to have more control over its use. - -- **Licensing**: [OpenRAILs](https://huggingface.co/blog/open_rail), a new type of licensing, allow us to ensure free access while having a set of restrictions that ensure more responsible use. diff --git a/diffusers/docs/source/en/conceptual/evaluation.md b/diffusers/docs/source/en/conceptual/evaluation.md deleted file mode 100644 index 8dfbc8f2ac8004dbb3c677d80c6e2ec1f30573f4..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/conceptual/evaluation.md +++ /dev/null @@ -1,567 +0,0 @@ - - -# Evaluating Diffusion Models - - - Open In Colab - - -Evaluation of generative models like [Stable Diffusion](https://huggingface.co/docs/diffusers/stable_diffusion) is subjective in nature. But as practitioners and researchers, we often have to make careful choices amongst many different possibilities. So, when working with different generative models (like GANs, Diffusion, etc.), how do we choose one over the other? - -Qualitative evaluation of such models can be error-prone and might incorrectly influence a decision. -However, quantitative metrics don't necessarily correspond to image quality. 
So, usually, a combination -of both qualitative and quantitative evaluations provides a stronger signal when choosing one model -over the other. - -In this document, we provide a non-exhaustive overview of qualitative and quantitative methods to evaluate Diffusion models. For quantitative methods, we specifically focus on how to implement them alongside `diffusers`. - -The methods shown in this document can also be used to evaluate different [noise schedulers](https://huggingface.co/docs/diffusers/main/en/api/schedulers/overview) keeping the underlying generation model fixed. - -## Scenarios - -We cover Diffusion models with the following pipelines: - -- Text-guided image generation (such as the [`StableDiffusionPipeline`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/text2img)). -- Text-guided image generation, additionally conditioned on an input image (such as the [`StableDiffusionImg2ImgPipeline`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/img2img) and [`StableDiffusionInstructPix2PixPipeline`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/pix2pix)). -- Class-conditioned image generation models (such as the [`DiTPipeline`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/dit)). - -## Qualitative Evaluation - -Qualitative evaluation typically involves human assessment of generated images. Quality is measured across aspects such as compositionality, image-text alignment, and spatial relations. Common prompts provide a degree of uniformity for subjective metrics. -DrawBench and PartiPrompts are prompt datasets used for qualitative benchmarking. DrawBench and PartiPrompts were introduced by [Imagen](https://imagen.research.google/) and [Parti](https://parti.research.google/) respectively. - -From the [official Parti website](https://parti.research.google/): - -> PartiPrompts (P2) is a rich set of over 1600 prompts in English that we release as part of this work. P2 can be used to measure model capabilities across various categories and challenge aspects. - -![parti-prompts](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/evaluation_diffusion_models/parti-prompts.png) - -PartiPrompts has the following columns: - -- Prompt -- Category of the prompt (such as “Abstract”, “World Knowledge”, etc.) -- Challenge reflecting the difficulty (such as “Basic”, “Complex”, “Writing & Symbols”, etc.) - -These benchmarks allow for side-by-side human evaluation of different image generation models. - -For this, the 🧨 Diffusers team has built **Open Parti Prompts**, which is a community-driven qualitative benchmark based on Parti Prompts to compare state-of-the-art open-source diffusion models: -- [Open Parti Prompts Game](https://huggingface.co/spaces/OpenGenAI/open-parti-prompts): For 10 parti prompts, 4 generated images are shown and the user selects the image that suits the prompt best. -- [Open Parti Prompts Leaderboard](https://huggingface.co/spaces/OpenGenAI/parti-prompts-leaderboard): The leaderboard comparing the currently best open-sourced diffusion models to each other. - -To manually compare images, let’s see how we can use `diffusers` on a couple of PartiPrompts. - -Below we show some prompts sampled across different challenges: Basic, Complex, Linguistic Structures, Imagination, and Writing & Symbols. Here we are using PartiPrompts as a [dataset](https://huggingface.co/datasets/nateraw/parti-prompts). 
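The generation snippet further below assumes an `sd_pipeline` object is already available. As a minimal sketch, it can be instantiated the same way as in the quantitative evaluation section later in this document:

```python
import torch
from diffusers import StableDiffusionPipeline

# Assumed setup for the qualitative snippets that follow; the same checkpoint
# is loaded again in the quantitative evaluation section below.
sd_pipeline = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
```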
- -```python -from datasets import load_dataset - -# prompts = load_dataset("nateraw/parti-prompts", split="train") -# prompts = prompts.shuffle() -# sample_prompts = [prompts[i]["Prompt"] for i in range(5)] - -# Fixing these sample prompts in the interest of reproducibility. -sample_prompts = [ - "a corgi", - "a hot air balloon with a yin-yang symbol, with the moon visible in the daytime sky", - "a car with no windows", - "a cube made of porcupine", - 'The saying "BE EXCELLENT TO EACH OTHER" written on a red brick wall with a graffiti image of a green alien wearing a tuxedo. A yellow fire hydrant is on a sidewalk in the foreground.', -] -``` - -Now we can use these prompts to generate some images using Stable Diffusion ([v1-4 checkpoint](https://huggingface.co/CompVis/stable-diffusion-v1-4)): - -```python -import torch - -seed = 0 -generator = torch.manual_seed(seed) - -images = sd_pipeline(sample_prompts, num_images_per_prompt=1, generator=generator).images -``` - -![parti-prompts-14](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/evaluation_diffusion_models/parti-prompts-14.png) - -We can also set `num_images_per_prompt` accordingly to compare different images for the same prompt. Running the same pipeline but with a different checkpoint ([v1-5](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5)), yields: - -![parti-prompts-15](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/evaluation_diffusion_models/parti-prompts-15.png) - -Once several images are generated from all the prompts using multiple models (under evaluation), these results are presented to human evaluators for scoring. For -more details on the DrawBench and PartiPrompts benchmarks, refer to their respective papers. - - - -It is useful to look at some inference samples while a model is training to measure the -training progress. In our [training scripts](https://github.com/huggingface/diffusers/tree/main/examples/), we support this utility with additional support for -logging to TensorBoard and Weights & Biases. - - - -## Quantitative Evaluation - -In this section, we will walk you through how to evaluate three different diffusion pipelines using: - -- CLIP score -- CLIP directional similarity -- FID - -### Text-guided image generation - -[CLIP score](https://arxiv.org/abs/2104.08718) measures the compatibility of image-caption pairs. Higher CLIP scores imply higher compatibility 🔼. The CLIP score is a quantitative measurement of the qualitative concept "compatibility". Image-caption pair compatibility can also be thought of as the semantic similarity between the image and the caption. CLIP score was found to have high correlation with human judgement. 
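Concretely, for an image `I` and caption `C` with CLIP embeddings `E_I` and `E_C`, the quantity computed by the `torchmetrics` implementation used below is commonly written as:

```latex
\mathrm{CLIPScore}(I, C) = \max\bigl(100 \cdot \cos(E_I, E_C),\ 0\bigr)
```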
- -Let's first load a [`StableDiffusionPipeline`]: - -```python -from diffusers import StableDiffusionPipeline -import torch - -model_ckpt = "CompVis/stable-diffusion-v1-4" -sd_pipeline = StableDiffusionPipeline.from_pretrained(model_ckpt, torch_dtype=torch.float16).to("cuda") -``` - -Generate some images with multiple prompts: - -```python -prompts = [ - "a photo of an astronaut riding a horse on mars", - "A high tech solarpunk utopia in the Amazon rainforest", - "A pikachu fine dining with a view to the Eiffel Tower", - "A mecha robot in a favela in expressionist style", - "an insect robot preparing a delicious meal", - "A small cabin on top of a snowy mountain in the style of Disney, artstation", -] - -images = sd_pipeline(prompts, num_images_per_prompt=1, output_type="np").images - -print(images.shape) -# (6, 512, 512, 3) -``` - -And then, we calculate the CLIP score. - -```python -from torchmetrics.functional.multimodal import clip_score -from functools import partial - -clip_score_fn = partial(clip_score, model_name_or_path="openai/clip-vit-base-patch16") - -def calculate_clip_score(images, prompts): - images_int = (images * 255).astype("uint8") - clip_score = clip_score_fn(torch.from_numpy(images_int).permute(0, 3, 1, 2), prompts).detach() - return round(float(clip_score), 4) - -sd_clip_score = calculate_clip_score(images, prompts) -print(f"CLIP score: {sd_clip_score}") -# CLIP score: 35.7038 -``` - -In the above example, we generated one image per prompt. If we generated multiple images per prompt, we would have to take the average score from the generated images per prompt. - -Now, if we wanted to compare two checkpoints compatible with the [`StableDiffusionPipeline`] we should pass a generator while calling the pipeline. First, we generate images with a -fixed seed with the [v1-4 Stable Diffusion checkpoint](https://huggingface.co/CompVis/stable-diffusion-v1-4): - -```python -seed = 0 -generator = torch.manual_seed(seed) - -images = sd_pipeline(prompts, num_images_per_prompt=1, generator=generator, output_type="np").images -``` - -Then we load the [v1-5 checkpoint](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5) to generate images: - -```python -model_ckpt_1_5 = "stable-diffusion-v1-5/stable-diffusion-v1-5" -sd_pipeline_1_5 = StableDiffusionPipeline.from_pretrained(model_ckpt_1_5, torch_dtype=weight_dtype).to(device) - -images_1_5 = sd_pipeline_1_5(prompts, num_images_per_prompt=1, generator=generator, output_type="np").images -``` - -And finally, we compare their CLIP scores: - -```python -sd_clip_score_1_4 = calculate_clip_score(images, prompts) -print(f"CLIP Score with v-1-4: {sd_clip_score_1_4}") -# CLIP Score with v-1-4: 34.9102 - -sd_clip_score_1_5 = calculate_clip_score(images_1_5, prompts) -print(f"CLIP Score with v-1-5: {sd_clip_score_1_5}") -# CLIP Score with v-1-5: 36.2137 -``` - -It seems like the [v1-5](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5) checkpoint performs better than its predecessor. Note, however, that the number of prompts we used to compute the CLIP scores is quite low. For a more practical evaluation, this number should be way higher, and the prompts should be diverse. - - - -By construction, there are some limitations in this score. The captions in the training dataset -were crawled from the web and extracted from `alt` and similar tags associated an image on the internet. -They are not necessarily representative of what a human being would use to describe an image. 
Hence we -had to "engineer" some prompts here. - - - -### Image-conditioned text-to-image generation - -In this case, we condition the generation pipeline with an input image as well as a text prompt. Let's take the [`StableDiffusionInstructPix2PixPipeline`], as an example. It takes an edit instruction as an input prompt and an input image to be edited. - -Here is one example: - -![edit-instruction](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/evaluation_diffusion_models/edit-instruction.png) - -One strategy to evaluate such a model is to measure the consistency of the change between the two images (in [CLIP](https://huggingface.co/docs/transformers/model_doc/clip) space) with the change between the two image captions (as shown in [CLIP-Guided Domain Adaptation of Image Generators](https://arxiv.org/abs/2108.00946)). This is referred to as the "**CLIP directional similarity**". - -- Caption 1 corresponds to the input image (image 1) that is to be edited. -- Caption 2 corresponds to the edited image (image 2). It should reflect the edit instruction. - -Following is a pictorial overview: - -![edit-consistency](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/evaluation_diffusion_models/edit-consistency.png) - -We have prepared a mini dataset to implement this metric. Let's first load the dataset. - -```python -from datasets import load_dataset - -dataset = load_dataset("sayakpaul/instructpix2pix-demo", split="train") -dataset.features -``` - -```bash -{'input': Value(dtype='string', id=None), - 'edit': Value(dtype='string', id=None), - 'output': Value(dtype='string', id=None), - 'image': Image(decode=True, id=None)} -``` - -Here we have: - -- `input` is a caption corresponding to the `image`. -- `edit` denotes the edit instruction. -- `output` denotes the modified caption reflecting the `edit` instruction. - -Let's take a look at a sample. - -```python -idx = 0 -print(f"Original caption: {dataset[idx]['input']}") -print(f"Edit instruction: {dataset[idx]['edit']}") -print(f"Modified caption: {dataset[idx]['output']}") -``` - -```bash -Original caption: 2. FAROE ISLANDS: An archipelago of 18 mountainous isles in the North Atlantic Ocean between Norway and Iceland, the Faroe Islands has 'everything you could hope for', according to Big 7 Travel. It boasts 'crystal clear waterfalls, rocky cliffs that seem to jut out of nowhere and velvety green hills' -Edit instruction: make the isles all white marble -Modified caption: 2. WHITE MARBLE ISLANDS: An archipelago of 18 mountainous white marble isles in the North Atlantic Ocean between Norway and Iceland, the White Marble Islands has 'everything you could hope for', according to Big 7 Travel. It boasts 'crystal clear waterfalls, rocky cliffs that seem to jut out of nowhere and velvety green hills' -``` - -And here is the image: - -```python -dataset[idx]["image"] -``` - -![edit-dataset](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/evaluation_diffusion_models/edit-dataset.png) - -We will first edit the images of our dataset with the edit instruction and compute the directional similarity. 
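The snippets in this section reuse a `device` string and a `generator` that are assumed to have been defined earlier in the guide, for example:

```python
import torch

# Assumed definitions carried over from the previous sections.
device = "cuda"
generator = torch.manual_seed(0)
```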
- -Let's first load the [`StableDiffusionInstructPix2PixPipeline`]: - -```python -from diffusers import StableDiffusionInstructPix2PixPipeline - -instruct_pix2pix_pipeline = StableDiffusionInstructPix2PixPipeline.from_pretrained( - "timbrooks/instruct-pix2pix", torch_dtype=torch.float16 -).to(device) -``` - -Now, we perform the edits: - -```python -import numpy as np - - -def edit_image(input_image, instruction): - image = instruct_pix2pix_pipeline( - instruction, - image=input_image, - output_type="np", - generator=generator, - ).images[0] - return image - -input_images = [] -original_captions = [] -modified_captions = [] -edited_images = [] - -for idx in range(len(dataset)): - input_image = dataset[idx]["image"] - edit_instruction = dataset[idx]["edit"] - edited_image = edit_image(input_image, edit_instruction) - - input_images.append(np.array(input_image)) - original_captions.append(dataset[idx]["input"]) - modified_captions.append(dataset[idx]["output"]) - edited_images.append(edited_image) -``` - -To measure the directional similarity, we first load CLIP's image and text encoders: - -```python -from transformers import ( - CLIPTokenizer, - CLIPTextModelWithProjection, - CLIPVisionModelWithProjection, - CLIPImageProcessor, -) - -clip_id = "openai/clip-vit-large-patch14" -tokenizer = CLIPTokenizer.from_pretrained(clip_id) -text_encoder = CLIPTextModelWithProjection.from_pretrained(clip_id).to(device) -image_processor = CLIPImageProcessor.from_pretrained(clip_id) -image_encoder = CLIPVisionModelWithProjection.from_pretrained(clip_id).to(device) -``` - -Notice that we are using a particular CLIP checkpoint, i.e., `openai/clip-vit-large-patch14`. This is because the Stable Diffusion pre-training was performed with this CLIP variant. For more details, refer to the [documentation](https://huggingface.co/docs/transformers/model_doc/clip). 
- -Next, we prepare a PyTorch `nn.Module` to compute directional similarity: - -```python -import torch.nn as nn -import torch.nn.functional as F - - -class DirectionalSimilarity(nn.Module): - def __init__(self, tokenizer, text_encoder, image_processor, image_encoder): - super().__init__() - self.tokenizer = tokenizer - self.text_encoder = text_encoder - self.image_processor = image_processor - self.image_encoder = image_encoder - - def preprocess_image(self, image): - image = self.image_processor(image, return_tensors="pt")["pixel_values"] - return {"pixel_values": image.to(device)} - - def tokenize_text(self, text): - inputs = self.tokenizer( - text, - max_length=self.tokenizer.model_max_length, - padding="max_length", - truncation=True, - return_tensors="pt", - ) - return {"input_ids": inputs.input_ids.to(device)} - - def encode_image(self, image): - preprocessed_image = self.preprocess_image(image) - image_features = self.image_encoder(**preprocessed_image).image_embeds - image_features = image_features / image_features.norm(dim=1, keepdim=True) - return image_features - - def encode_text(self, text): - tokenized_text = self.tokenize_text(text) - text_features = self.text_encoder(**tokenized_text).text_embeds - text_features = text_features / text_features.norm(dim=1, keepdim=True) - return text_features - - def compute_directional_similarity(self, img_feat_one, img_feat_two, text_feat_one, text_feat_two): - sim_direction = F.cosine_similarity(img_feat_two - img_feat_one, text_feat_two - text_feat_one) - return sim_direction - - def forward(self, image_one, image_two, caption_one, caption_two): - img_feat_one = self.encode_image(image_one) - img_feat_two = self.encode_image(image_two) - text_feat_one = self.encode_text(caption_one) - text_feat_two = self.encode_text(caption_two) - directional_similarity = self.compute_directional_similarity( - img_feat_one, img_feat_two, text_feat_one, text_feat_two - ) - return directional_similarity -``` - -Let's put `DirectionalSimilarity` to use now. - -```python -dir_similarity = DirectionalSimilarity(tokenizer, text_encoder, image_processor, image_encoder) -scores = [] - -for i in range(len(input_images)): - original_image = input_images[i] - original_caption = original_captions[i] - edited_image = edited_images[i] - modified_caption = modified_captions[i] - - similarity_score = dir_similarity(original_image, edited_image, original_caption, modified_caption) - scores.append(float(similarity_score.detach().cpu())) - -print(f"CLIP directional similarity: {np.mean(scores)}") -# CLIP directional similarity: 0.0797976553440094 -``` - -Like the CLIP Score, the higher the CLIP directional similarity, the better it is. - -It should be noted that the `StableDiffusionInstructPix2PixPipeline` exposes two arguments, namely, `image_guidance_scale` and `guidance_scale` that let you control the quality of the final edited image. We encourage you to experiment with these two arguments and see the impact of that on the directional similarity. - -We can extend the idea of this metric to measure how similar the original image and edited version are. To do that, we can just do `F.cosine_similarity(img_feat_two, img_feat_one)`. For these kinds of edits, we would still want the primary semantics of the images to be preserved as much as possible, i.e., a high similarity score. 
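As a sketch of that extension, the CLIP image encoder wrapped by `dir_similarity` above can be reused directly to score how much of the original image content is preserved after the edit:

```python
# Minimal sketch: image-to-image CLIP similarity between originals and edits.
preservation_scores = []

for i in range(len(input_images)):
    img_feat_one = dir_similarity.encode_image(input_images[i])
    img_feat_two = dir_similarity.encode_image(edited_images[i])
    # Higher cosine similarity -> more of the original semantics preserved.
    score = F.cosine_similarity(img_feat_two, img_feat_one)
    preservation_scores.append(float(score.detach().cpu()))

print(f"Mean image-image CLIP similarity: {np.mean(preservation_scores)}")
```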
- -We can use these metrics for similar pipelines such as the [`StableDiffusionPix2PixZeroPipeline`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/pix2pix_zero#diffusers.StableDiffusionPix2PixZeroPipeline). - - - -Both CLIP score and CLIP direction similarity rely on the CLIP model, which can make the evaluations biased. - - - -***Extending metrics like IS, FID (discussed later), or KID can be difficult*** when the model under evaluation was pre-trained on a large image-captioning dataset (such as the [LAION-5B dataset](https://laion.ai/blog/laion-5b/)). This is because underlying these metrics is an InceptionNet (pre-trained on the ImageNet-1k dataset) used for extracting intermediate image features. The pre-training dataset of Stable Diffusion may have limited overlap with the pre-training dataset of InceptionNet, so it is not a good candidate here for feature extraction. - -***Using the above metrics helps evaluate models that are class-conditioned. For example, [DiT](https://huggingface.co/docs/diffusers/main/en/api/pipelines/dit). It was pre-trained being conditioned on the ImageNet-1k classes.*** - -### Class-conditioned image generation - -Class-conditioned generative models are usually pre-trained on a class-labeled dataset such as [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k). Popular metrics for evaluating these models include Fréchet Inception Distance (FID), Kernel Inception Distance (KID), and Inception Score (IS). In this document, we focus on FID ([Heusel et al.](https://arxiv.org/abs/1706.08500)). We show how to compute it with the [`DiTPipeline`](https://huggingface.co/docs/diffusers/api/pipelines/dit), which uses the [DiT model](https://arxiv.org/abs/2212.09748) under the hood. - -FID aims to measure how similar are two datasets of images. As per [this resource](https://mmgeneration.readthedocs.io/en/latest/quick_run.html#fid): - -> Fréchet Inception Distance is a measure of similarity between two datasets of images. It was shown to correlate well with the human judgment of visual quality and is most often used to evaluate the quality of samples of Generative Adversarial Networks. FID is calculated by computing the Fréchet distance between two Gaussians fitted to feature representations of the Inception network. - -These two datasets are essentially the dataset of real images and the dataset of fake images (generated images in our case). FID is usually calculated with two large datasets. However, for this document, we will work with two mini datasets. - -Let's first download a few images from the ImageNet-1k training set: - -```python -from zipfile import ZipFile -import requests - - -def download(url, local_filepath): - r = requests.get(url) - with open(local_filepath, "wb") as f: - f.write(r.content) - return local_filepath - -dummy_dataset_url = "https://hf.co/datasets/sayakpaul/sample-datasets/resolve/main/sample-imagenet-images.zip" -local_filepath = download(dummy_dataset_url, dummy_dataset_url.split("/")[-1]) - -with ZipFile(local_filepath, "r") as zipper: - zipper.extractall(".") -``` - -```python -from PIL import Image -import os - -dataset_path = "sample-imagenet-images" -image_paths = sorted([os.path.join(dataset_path, x) for x in os.listdir(dataset_path)]) - -real_images = [np.array(Image.open(path).convert("RGB")) for path in image_paths] -``` - -These are 10 images from the following ImageNet-1k classes: "cassette_player", "chain_saw" (x2), "church", "gas_pump" (x3), "parachute" (x2), and "tench". - -

- [Figure: Real images.]

- -Now that the images are loaded, let's apply some lightweight pre-processing on them to use them for FID calculation. - -```python -from torchvision.transforms import functional as F - - -def preprocess_image(image): - image = torch.tensor(image).unsqueeze(0) - image = image.permute(0, 3, 1, 2) / 255.0 - return F.center_crop(image, (256, 256)) - -real_images = torch.cat([preprocess_image(image) for image in real_images]) -print(real_images.shape) -# torch.Size([10, 3, 256, 256]) -``` - -We now load the [`DiTPipeline`](https://huggingface.co/docs/diffusers/api/pipelines/dit) to generate images conditioned on the above-mentioned classes. - -```python -from diffusers import DiTPipeline, DPMSolverMultistepScheduler - -dit_pipeline = DiTPipeline.from_pretrained("facebook/DiT-XL-2-256", torch_dtype=torch.float16) -dit_pipeline.scheduler = DPMSolverMultistepScheduler.from_config(dit_pipeline.scheduler.config) -dit_pipeline = dit_pipeline.to("cuda") - -words = [ - "cassette player", - "chainsaw", - "chainsaw", - "church", - "gas pump", - "gas pump", - "gas pump", - "parachute", - "parachute", - "tench", -] - -class_ids = dit_pipeline.get_label_ids(words) -output = dit_pipeline(class_labels=class_ids, generator=generator, output_type="np") - -fake_images = output.images -fake_images = torch.tensor(fake_images) -fake_images = fake_images.permute(0, 3, 1, 2) -print(fake_images.shape) -# torch.Size([10, 3, 256, 256]) -``` - -Now, we can compute the FID using [`torchmetrics`](https://torchmetrics.readthedocs.io/). - -```python -from torchmetrics.image.fid import FrechetInceptionDistance - -fid = FrechetInceptionDistance(normalize=True) -fid.update(real_images, real=True) -fid.update(fake_images, real=False) - -print(f"FID: {float(fid.compute())}") -# FID: 177.7147216796875 -``` - -The lower the FID, the better it is. Several things can influence FID here: - -- Number of images (both real and fake) -- Randomness induced in the diffusion process -- Number of inference steps in the diffusion process -- The scheduler being used in the diffusion process - -For the last two points, it is, therefore, a good practice to run the evaluation across different seeds and inference steps, and then report an average result. - - - -FID results tend to be fragile as they depend on a lot of factors: - -* The specific Inception model used during computation. -* The implementation accuracy of the computation. -* The image format (not the same if we start from PNGs vs JPGs). - -Keeping that in mind, FID is often most useful when comparing similar runs, but it is -hard to reproduce paper results unless the authors carefully disclose the FID -measurement code. - -These points apply to other related metrics too, such as KID and IS. - - - -As a final step, let's visually inspect the `fake_images`. - -
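One possible way to do that inspection (assuming the `fake_images` tensor built above) is to tile the samples into a single grid image:

```python
from torchvision.utils import make_grid
from torchvision.transforms.functional import to_pil_image

# `fake_images` is a float tensor of shape (10, 3, 256, 256) with values in [0, 1].
grid = make_grid(fake_images, nrow=5)
to_pil_image(grid).save("fake-images-grid.png")
```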

- [Figure: Fake images.]

diff --git a/diffusers/docs/source/en/conceptual/philosophy.md b/diffusers/docs/source/en/conceptual/philosophy.md deleted file mode 100644 index 7a351239982b778530635f2fdf4ca69e79a1c259..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/conceptual/philosophy.md +++ /dev/null @@ -1,110 +0,0 @@ - - -# Philosophy - -🧨 Diffusers provides **state-of-the-art** pretrained diffusion models across multiple modalities. -Its purpose is to serve as a **modular toolbox** for both inference and training. - -We aim at building a library that stands the test of time and therefore take API design very seriously. - -In a nutshell, Diffusers is built to be a natural extension of PyTorch. Therefore, most of our design choices are based on [PyTorch's Design Principles](https://pytorch.org/docs/stable/community/design.html#pytorch-design-philosophy). Let's go over the most important ones: - -## Usability over Performance - -- While Diffusers has many built-in performance-enhancing features (see [Memory and Speed](https://huggingface.co/docs/diffusers/optimization/fp16)), models are always loaded with the highest precision and lowest optimization. Therefore, by default diffusion pipelines are always instantiated on CPU with float32 precision if not otherwise defined by the user. This ensures usability across different platforms and accelerators and means that no complex installations are required to run the library. -- Diffusers aims to be a **light-weight** package and therefore has very few required dependencies, but many soft dependencies that can improve performance (such as `accelerate`, `safetensors`, `onnx`, etc...). We strive to keep the library as lightweight as possible so that it can be added without much concern as a dependency on other packages. -- Diffusers prefers simple, self-explainable code over condensed, magic code. This means that short-hand code syntaxes such as lambda functions, and advanced PyTorch operators are often not desired. - -## Simple over easy - -As PyTorch states, **explicit is better than implicit** and **simple is better than complex**. This design philosophy is reflected in multiple parts of the library: -- We follow PyTorch's API with methods like [`DiffusionPipeline.to`](https://huggingface.co/docs/diffusers/main/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.to) to let the user handle device management. -- Raising concise error messages is preferred to silently correct erroneous input. Diffusers aims at teaching the user, rather than making the library as easy to use as possible. -- Complex model vs. scheduler logic is exposed instead of magically handled inside. Schedulers/Samplers are separated from diffusion models with minimal dependencies on each other. This forces the user to write the unrolled denoising loop. However, the separation allows for easier debugging and gives the user more control over adapting the denoising process or switching out diffusion models or schedulers. -- Separately trained components of the diffusion pipeline, *e.g.* the text encoder, the unet, and the variational autoencoder, each have their own model class. This forces the user to handle the interaction between the different model components, and the serialization format separates the model components into different files. However, this allows for easier debugging and customization. DreamBooth or Textual Inversion training -is very simple thanks to Diffusers' ability to separate single components of the diffusion pipeline. 
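To make the model/scheduler separation described above concrete, a minimal unrolled denoising loop (a sketch, assuming an unconditional DDPM checkpoint such as `google/ddpm-cat-256`) might look like this:

```python
import torch
from diffusers import UNet2DModel, DDPMScheduler

model = UNet2DModel.from_pretrained("google/ddpm-cat-256")
scheduler = DDPMScheduler.from_pretrained("google/ddpm-cat-256")

scheduler.set_timesteps(50)
sample = torch.randn(1, 3, model.config.sample_size, model.config.sample_size)

for t in scheduler.timesteps:
    with torch.no_grad():
        noise_pred = model(sample, t).sample  # the model predicts the noise residual
    sample = scheduler.step(noise_pred, t, sample).prev_sample  # the scheduler computes the previous, less noisy sample
```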
- -## Tweakable, contributor-friendly over abstraction - -For large parts of the library, Diffusers adopts an important design principle of the [Transformers library](https://github.com/huggingface/transformers), which is to prefer copy-pasted code over hasty abstractions. This design principle is very opinionated and stands in stark contrast to popular design principles such as [Don't repeat yourself (DRY)](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself). -In short, just like Transformers does for modeling files, Diffusers prefers to keep an extremely low level of abstraction and very self-contained code for pipelines and schedulers. -Functions, long code blocks, and even classes can be copied across multiple files which at first can look like a bad, sloppy design choice that makes the library unmaintainable. -**However**, this design has proven to be extremely successful for Transformers and makes a lot of sense for community-driven, open-source machine learning libraries because: -- Machine Learning is an extremely fast-moving field in which paradigms, model architectures, and algorithms are changing rapidly, which therefore makes it very difficult to define long-lasting code abstractions. -- Machine Learning practitioners like to be able to quickly tweak existing code for ideation and research and therefore prefer self-contained code over one that contains many abstractions. -- Open-source libraries rely on community contributions and therefore must build a library that is easy to contribute to. The more abstract the code, the more dependencies, the harder to read, and the harder to contribute to. Contributors simply stop contributing to very abstract libraries out of fear of breaking vital functionality. If contributing to a library cannot break other fundamental code, not only is it more inviting for potential new contributors, but it is also easier to review and contribute to multiple parts in parallel. - -At Hugging Face, we call this design the **single-file policy** which means that almost all of the code of a certain class should be written in a single, self-contained file. To read more about the philosophy, you can have a look -at [this blog post](https://huggingface.co/blog/transformers-design-philosophy). - -In Diffusers, we follow this philosophy for both pipelines and schedulers, but only partly for diffusion models. The reason we don't follow this design fully for diffusion models is because almost all diffusion pipelines, such -as [DDPM](https://huggingface.co/docs/diffusers/api/pipelines/ddpm), [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview#stable-diffusion-pipelines), [unCLIP (DALL·E 2)](https://huggingface.co/docs/diffusers/api/pipelines/unclip) and [Imagen](https://imagen.research.google/) all rely on the same diffusion model, the [UNet](https://huggingface.co/docs/diffusers/api/models/unet2d-cond). - -Great, now you should have generally understood why 🧨 Diffusers is designed the way it is 🤗. -We try to apply these design principles consistently across the library. Nevertheless, there are some minor exceptions to the philosophy or some unlucky design choices. If you have feedback regarding the design, we would ❤️ to hear it [directly on GitHub](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=&template=feedback.md&title=). - -## Design Philosophy in Details - -Now, let's look a bit into the nitty-gritty details of the design philosophy. 
Diffusers essentially consists of three major classes: [pipelines](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines), [models](https://github.com/huggingface/diffusers/tree/main/src/diffusers/models), and [schedulers](https://github.com/huggingface/diffusers/tree/main/src/diffusers/schedulers). -Let's walk through more in-detail design decisions for each class. - -### Pipelines - -Pipelines are designed to be easy to use (therefore do not follow [*Simple over easy*](#simple-over-easy) 100%), are not feature complete, and should loosely be seen as examples of how to use [models](#models) and [schedulers](#schedulers) for inference. - -The following design principles are followed: -- Pipelines follow the single-file policy. All pipelines can be found in individual directories under src/diffusers/pipelines. One pipeline folder corresponds to one diffusion paper/project/release. Multiple pipeline files can be gathered in one pipeline folder, as it’s done for [`src/diffusers/pipelines/stable-diffusion`](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines/stable_diffusion). If pipelines share similar functionality, one can make use of the [# Copied from mechanism](https://github.com/huggingface/diffusers/blob/125d783076e5bd9785beb05367a2d2566843a271/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_img2img.py#L251). -- Pipelines all inherit from [`DiffusionPipeline`]. -- Every pipeline consists of different model and scheduler components, that are documented in the [`model_index.json` file](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5/blob/main/model_index.json), are accessible under the same name as attributes of the pipeline and can be shared between pipelines with [`DiffusionPipeline.components`](https://huggingface.co/docs/diffusers/main/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.components) function. -- Every pipeline should be loadable via the [`DiffusionPipeline.from_pretrained`](https://huggingface.co/docs/diffusers/main/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.from_pretrained) function. -- Pipelines should be used **only** for inference. -- Pipelines should be very readable, self-explanatory, and easy to tweak. -- Pipelines should be designed to build on top of each other and be easy to integrate into higher-level APIs. -- Pipelines are **not** intended to be feature-complete user interfaces. For feature-complete user interfaces one should rather have a look at [InvokeAI](https://github.com/invoke-ai/InvokeAI), [Diffuzers](https://github.com/abhishekkrthakur/diffuzers), and [lama-cleaner](https://github.com/Sanster/lama-cleaner). -- Every pipeline should have one and only one way to run it via a `__call__` method. The naming of the `__call__` arguments should be shared across all pipelines. -- Pipelines should be named after the task they are intended to solve. -- In almost all cases, novel diffusion pipelines shall be implemented in a new pipeline folder/file. - -### Models - -Models are designed as configurable toolboxes that are natural extensions of [PyTorch's Module class](https://pytorch.org/docs/stable/generated/torch.nn.Module.html). They only partly follow the **single-file policy**. - -The following design principles are followed: -- Models correspond to **a type of model architecture**. *E.g.* the [`UNet2DConditionModel`] class is used for all UNet variations that expect 2D image inputs and are conditioned on some context. 
-- All models can be found in [`src/diffusers/models`](https://github.com/huggingface/diffusers/tree/main/src/diffusers/models) and every model architecture shall be defined in its file, e.g. [`unets/unet_2d_condition.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/unets/unet_2d_condition.py), [`transformers/transformer_2d.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/transformers/transformer_2d.py), etc... -- Models **do not** follow the single-file policy and should make use of smaller model building blocks, such as [`attention.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention.py), [`resnet.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/resnet.py), [`embeddings.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/embeddings.py), etc... **Note**: This is in stark contrast to Transformers' modeling files and shows that models do not really follow the single-file policy. -- Models intend to expose complexity, just like PyTorch's `Module` class, and give clear error messages. -- Models all inherit from `ModelMixin` and `ConfigMixin`. -- Models can be optimized for performance when it doesn’t demand major code changes, keeps backward compatibility, and gives significant memory or compute gain. -- Models should by default have the highest precision and lowest performance setting. -- To integrate new model checkpoints whose general architecture can be classified as an architecture that already exists in Diffusers, the existing model architecture shall be adapted to make it work with the new checkpoint. One should only create a new file if the model architecture is fundamentally different. -- Models should be designed to be easily extendable to future changes. This can be achieved by limiting public function arguments, configuration arguments, and "foreseeing" future changes, *e.g.* it is usually better to add `string` "...type" arguments that can easily be extended to new future types instead of boolean `is_..._type` arguments. Only the minimum amount of changes shall be made to existing architectures to make a new model checkpoint work. -- The model design is a difficult trade-off between keeping code readable and concise and supporting many model checkpoints. For most parts of the modeling code, classes shall be adapted for new model checkpoints, while there are some exceptions where it is preferred to add new classes to make sure the code is kept concise and -readable long-term, such as [UNet blocks](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/unets/unet_2d_blocks.py) and [Attention processors](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py). - -### Schedulers - -Schedulers are responsible to guide the denoising process for inference as well as to define a noise schedule for training. They are designed as individual classes with loadable configuration files and strongly follow the **single-file policy**. - -The following design principles are followed: -- All schedulers are found in [`src/diffusers/schedulers`](https://github.com/huggingface/diffusers/tree/main/src/diffusers/schedulers). -- Schedulers are **not** allowed to import from large utils files and shall be kept very self-contained. -- One scheduler Python file corresponds to one scheduler algorithm (as might be defined in a paper). 
-- If schedulers share similar functionalities, we can make use of the `# Copied from` mechanism. -- Schedulers all inherit from `SchedulerMixin` and `ConfigMixin`. -- Schedulers can be easily swapped out with the [`ConfigMixin.from_config`](https://huggingface.co/docs/diffusers/main/en/api/configuration#diffusers.ConfigMixin.from_config) method as explained in detail [here](../using-diffusers/schedulers). -- Every scheduler has to have a `set_num_inference_steps`, and a `step` function. `set_num_inference_steps(...)` has to be called before every denoising process, *i.e.* before `step(...)` is called. -- Every scheduler exposes the timesteps to be "looped over" via a `timesteps` attribute, which is an array of timesteps the model will be called upon. -- The `step(...)` function takes a predicted model output and the "current" sample (x_t) and returns the "previous", slightly more denoised sample (x_t-1). -- Given the complexity of diffusion schedulers, the `step` function does not expose all the complexity and can be a bit of a "black box". -- In almost all cases, novel schedulers shall be implemented in a new scheduling file. \ No newline at end of file diff --git a/diffusers/docs/source/en/imgs/access_request.png b/diffusers/docs/source/en/imgs/access_request.png deleted file mode 100644 index 1a19908c64bd08dcba67f10375813d2821bf6f66..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/imgs/access_request.png +++ /dev/null @@ -1,3 +0,0 @@ -version https://git-lfs.github.com/spec/v1 -oid sha256:9688dabf75e180590251cd1f75d18966f9c94d5d6584bc7d0278b698c175c61f -size 104814 diff --git a/diffusers/docs/source/en/imgs/diffusers_library.jpg b/diffusers/docs/source/en/imgs/diffusers_library.jpg deleted file mode 100644 index 07ba9c6571a3f070d9d10b78dccfd4d4537dd539..0000000000000000000000000000000000000000 Binary files a/diffusers/docs/source/en/imgs/diffusers_library.jpg and /dev/null differ diff --git a/diffusers/docs/source/en/index.md b/diffusers/docs/source/en/index.md deleted file mode 100644 index 957d90786dd796f5b1b7f75c6db84aeb7e26cf63..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/index.md +++ /dev/null @@ -1,48 +0,0 @@ - - -


- -# Diffusers - -🤗 Diffusers is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. Whether you're looking for a simple inference solution or want to train your own diffusion model, 🤗 Diffusers is a modular toolbox that supports both. Our library is designed with a focus on [usability over performance](conceptual/philosophy#usability-over-performance), [simple over easy](conceptual/philosophy#simple-over-easy), and [customizability over abstractions](conceptual/philosophy#tweakable-contributorfriendly-over-abstraction). - -The library has three main components: - -- State-of-the-art diffusion pipelines for inference with just a few lines of code. There are many pipelines in 🤗 Diffusers, check out the table in the pipeline [overview](api/pipelines/overview) for a complete list of available pipelines and the task they solve. -- Interchangeable [noise schedulers](api/schedulers/overview) for balancing trade-offs between generation speed and quality. -- Pretrained [models](api/models) that can be used as building blocks, and combined with schedulers, for creating your own end-to-end diffusion systems. - - diff --git a/diffusers/docs/source/en/installation.md b/diffusers/docs/source/en/installation.md deleted file mode 100644 index 74cfa70d70fc17031755be410cb2f94a475dbc6b..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/installation.md +++ /dev/null @@ -1,164 +0,0 @@ - - -# Installation - -🤗 Diffusers is tested on Python 3.8+, PyTorch 1.7.0+, and Flax. Follow the installation instructions below for the deep learning library you are using: - -- [PyTorch](https://pytorch.org/get-started/locally/) installation instructions -- [Flax](https://flax.readthedocs.io/en/latest/) installation instructions - -## Install with pip - -You should install 🤗 Diffusers in a [virtual environment](https://docs.python.org/3/library/venv.html). -If you're unfamiliar with Python virtual environments, take a look at this [guide](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/). -A virtual environment makes it easier to manage different projects and avoid compatibility issues between dependencies. - -Start by creating a virtual environment in your project directory: - -```bash -python -m venv .env -``` - -Activate the virtual environment: - -```bash -source .env/bin/activate -``` - -You should also install 🤗 Transformers because 🤗 Diffusers relies on its models: - - - - -Note - PyTorch only supports Python 3.8 - 3.11 on Windows. -```bash -pip install diffusers["torch"] transformers -``` - - -```bash -pip install diffusers["flax"] transformers -``` - - - -## Install with conda - -After activating your virtual environment, with `conda` (maintained by the community): - -```bash -conda install -c conda-forge diffusers -``` - -## Install from source - -Before installing 🤗 Diffusers from source, make sure you have PyTorch and 🤗 Accelerate installed. - -To install 🤗 Accelerate: - -```bash -pip install accelerate -``` - -Then install 🤗 Diffusers from source: - -```bash -pip install git+https://github.com/huggingface/diffusers -``` - -This command installs the bleeding edge `main` version rather than the latest `stable` version. -The `main` version is useful for staying up-to-date with the latest developments. -For instance, if a bug has been fixed since the last official release but a new release hasn't been rolled out yet. -However, this means the `main` version may not always be stable. 
-We strive to keep the `main` version operational, and most issues are usually resolved within a few hours or a day. -If you run into a problem, please open an [Issue](https://github.com/huggingface/diffusers/issues/new/choose) so we can fix it even sooner! - -## Editable install - -You will need an editable install if you'd like to: - -* Use the `main` version of the source code. -* Contribute to 🤗 Diffusers and need to test changes in the code. - -Clone the repository and install 🤗 Diffusers with the following commands: - -```bash -git clone https://github.com/huggingface/diffusers.git -cd diffusers -``` - - - -```bash -pip install -e ".[torch]" -``` - - -```bash -pip install -e ".[flax]" -``` - - - -These commands will link the folder you cloned the repository to and your Python library paths. -Python will now look inside the folder you cloned to in addition to the normal library paths. -For example, if your Python packages are typically installed in `~/anaconda3/envs/main/lib/python3.10/site-packages/`, Python will also search the `~/diffusers/` folder you cloned to. - - - -You must keep the `diffusers` folder if you want to keep using the library. - - - -Now you can easily update your clone to the latest version of 🤗 Diffusers with the following command: - -```bash -cd ~/diffusers/ -git pull -``` - -Your Python environment will find the `main` version of 🤗 Diffusers on the next run. - -## Cache - -Model weights and files are downloaded from the Hub to a cache which is usually your home directory. You can change the cache location by specifying the `HF_HOME` or `HUGGINFACE_HUB_CACHE` environment variables or configuring the `cache_dir` parameter in methods like [`~DiffusionPipeline.from_pretrained`]. - -Cached files allow you to run 🤗 Diffusers offline. To prevent 🤗 Diffusers from connecting to the internet, set the `HF_HUB_OFFLINE` environment variable to `True` and 🤗 Diffusers will only load previously downloaded files in the cache. - -```shell -export HF_HUB_OFFLINE=True -``` - -For more details about managing and cleaning the cache, take a look at the [caching](https://huggingface.co/docs/huggingface_hub/guides/manage-cache) guide. - -## Telemetry logging - -Our library gathers telemetry information during [`~DiffusionPipeline.from_pretrained`] requests. -The data gathered includes the version of 🤗 Diffusers and PyTorch/Flax, the requested model or pipeline class, -and the path to a pretrained checkpoint if it is hosted on the Hugging Face Hub. -This usage data helps us debug issues and prioritize new features. -Telemetry is only sent when loading models and pipelines from the Hub, -and it is not collected if you're loading local files. - -We understand that not everyone wants to share additional information,and we respect your privacy. -You can disable telemetry collection by setting the `DISABLE_TELEMETRY` environment variable from your terminal: - -On Linux/MacOS: -```bash -export DISABLE_TELEMETRY=YES -``` - -On Windows: -```bash -set DISABLE_TELEMETRY=YES -``` diff --git a/diffusers/docs/source/en/optimization/coreml.md b/diffusers/docs/source/en/optimization/coreml.md deleted file mode 100644 index d090ef0ed3ba76ea2877669820ff908f51d52d44..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/optimization/coreml.md +++ /dev/null @@ -1,164 +0,0 @@ - - -# How to run Stable Diffusion with Core ML - -[Core ML](https://developer.apple.com/documentation/coreml) is the model format and machine learning library supported by Apple frameworks. 
If you are interested in running Stable Diffusion models inside your macOS or iOS/iPadOS apps, this guide will show you how to convert existing PyTorch checkpoints into the Core ML format and use them for inference with Python or Swift. - -Core ML models can leverage all the compute engines available in Apple devices: the CPU, the GPU, and the Apple Neural Engine (or ANE, a tensor-optimized accelerator available in Apple Silicon Macs and modern iPhones/iPads). Depending on the model and the device it's running on, Core ML can mix and match compute engines too, so some portions of the model may run on the CPU while others run on GPU, for example. - - - -You can also run the `diffusers` Python codebase on Apple Silicon Macs using the `mps` accelerator built into PyTorch. This approach is explained in depth in [the mps guide](mps), but it is not compatible with native apps. - - - -## Stable Diffusion Core ML Checkpoints - -Stable Diffusion weights (or checkpoints) are stored in the PyTorch format, so you need to convert them to the Core ML format before we can use them inside native apps. - -Thankfully, Apple engineers developed [a conversion tool](https://github.com/apple/ml-stable-diffusion#-converting-models-to-core-ml) based on `diffusers` to convert the PyTorch checkpoints to Core ML. - -Before you convert a model, though, take a moment to explore the Hugging Face Hub – chances are the model you're interested in is already available in Core ML format: - -- the [Apple](https://huggingface.co/apple) organization includes Stable Diffusion versions 1.4, 1.5, 2.0 base, and 2.1 base -- [coreml community](https://huggingface.co/coreml-community) includes custom finetuned models -- use this [filter](https://huggingface.co/models?pipeline_tag=text-to-image&library=coreml&p=2&sort=likes) to return all available Core ML checkpoints - -If you can't find the model you're interested in, we recommend you follow the instructions for [Converting Models to Core ML](https://github.com/apple/ml-stable-diffusion#-converting-models-to-core-ml) by Apple. - -## Selecting the Core ML Variant to Use - -Stable Diffusion models can be converted to different Core ML variants intended for different purposes: - -- The type of attention blocks used. The attention operation is used to "pay attention" to the relationship between different areas in the image representations and to understand how the image and text representations are related. Attention is compute- and memory-intensive, so different implementations exist that consider the hardware characteristics of different devices. For Core ML Stable Diffusion models, there are two attention variants: - * `split_einsum` ([introduced by Apple](https://machinelearning.apple.com/research/neural-engine-transformers)) is optimized for ANE devices, which is available in modern iPhones, iPads and M-series computers. - * The "original" attention (the base implementation used in `diffusers`) is only compatible with CPU/GPU and not ANE. It can be *faster* to run your model on CPU + GPU using `original` attention than ANE. See [this performance benchmark](https://huggingface.co/blog/fast-mac-diffusers#performance-benchmarks) as well as some [additional measures provided by the community](https://github.com/huggingface/swift-coreml-diffusers/issues/31) for additional details. - -- The supported inference framework. - * `packages` are suitable for Python inference. 
This can be used to test converted Core ML models before attempting to integrate them inside native apps, or if you want to explore Core ML performance but don't need to support native apps. For example, an application with a web UI could perfectly use a Python Core ML backend. - * `compiled` models are required for Swift code. The `compiled` models in the Hub split the large UNet model weights into several files for compatibility with iOS and iPadOS devices. This corresponds to the [`--chunk-unet` conversion option](https://github.com/apple/ml-stable-diffusion#-converting-models-to-core-ml). If you want to support native apps, then you need to select the `compiled` variant. - -The official Core ML Stable Diffusion [models](https://huggingface.co/apple/coreml-stable-diffusion-v1-4/tree/main) include these variants, but the community ones may vary: - -``` -coreml-stable-diffusion-v1-4 -├── README.md -├── original -│ ├── compiled -│ └── packages -└── split_einsum - ├── compiled - └── packages -``` - -You can download and use the variant you need as shown below. - -## Core ML Inference in Python - -Install the following libraries to run Core ML inference in Python: - -```bash -pip install huggingface_hub -pip install git+https://github.com/apple/ml-stable-diffusion -``` - -### Download the Model Checkpoints - -To run inference in Python, use one of the versions stored in the `packages` folders because the `compiled` ones are only compatible with Swift. You may choose whether you want to use `original` or `split_einsum` attention. - -This is how you'd download the `original` attention variant from the Hub to a directory called `models`: - -```Python -from huggingface_hub import snapshot_download -from pathlib import Path - -repo_id = "apple/coreml-stable-diffusion-v1-4" -variant = "original/packages" - -model_path = Path("./models") / (repo_id.split("/")[-1] + "_" + variant.replace("/", "_")) -snapshot_download(repo_id, allow_patterns=f"{variant}/*", local_dir=model_path, local_dir_use_symlinks=False) -print(f"Model downloaded at {model_path}") -``` - -### Inference[[python-inference]] - -Once you have downloaded a snapshot of the model, you can test it using Apple's Python script. - -```shell -python -m python_coreml_stable_diffusion.pipeline --prompt "a photo of an astronaut riding a horse on mars" -i ./models/coreml-stable-diffusion-v1-4_original_packages/original/packages -o --compute-unit CPU_AND_GPU --seed 93 -``` - -Pass the path of the downloaded checkpoint with `-i` flag to the script. `--compute-unit` indicates the hardware you want to allow for inference. It must be one of the following options: `ALL`, `CPU_AND_GPU`, `CPU_ONLY`, `CPU_AND_NE`. You may also provide an optional output path, and a seed for reproducibility. - -The inference script assumes you're using the original version of the Stable Diffusion model, `CompVis/stable-diffusion-v1-4`. If you use another model, you *have* to specify its Hub id in the inference command line, using the `--model-version` option. This works for models already supported and custom models you trained or fine-tuned yourself. 
- -For example, if you want to use [`stable-diffusion-v1-5/stable-diffusion-v1-5`](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5): - -```shell -python -m python_coreml_stable_diffusion.pipeline --prompt "a photo of an astronaut riding a horse on mars" --compute-unit ALL -o output --seed 93 -i models/coreml-stable-diffusion-v1-5_original_packages --model-version stable-diffusion-v1-5/stable-diffusion-v1-5 -``` - -## Core ML inference in Swift - -Running inference in Swift is slightly faster than in Python because the models are already compiled in the `mlmodelc` format. This is noticeable on app startup when the model is loaded but shouldn’t be noticeable if you run several generations afterward. - -### Download - -To run inference in Swift on your Mac, you need one of the `compiled` checkpoint versions. We recommend you download them locally using Python code similar to the previous example, but with one of the `compiled` variants: - -```Python -from huggingface_hub import snapshot_download -from pathlib import Path - -repo_id = "apple/coreml-stable-diffusion-v1-4" -variant = "original/compiled" - -model_path = Path("./models") / (repo_id.split("/")[-1] + "_" + variant.replace("/", "_")) -snapshot_download(repo_id, allow_patterns=f"{variant}/*", local_dir=model_path, local_dir_use_symlinks=False) -print(f"Model downloaded at {model_path}") -``` - -### Inference[[swift-inference]] - -To run inference, please clone Apple's repo: - -```bash -git clone https://github.com/apple/ml-stable-diffusion -cd ml-stable-diffusion -``` - -And then use Apple's command line tool, [Swift Package Manager](https://www.swift.org/package-manager/#): - -```bash -swift run StableDiffusionSample --resource-path models/coreml-stable-diffusion-v1-4_original_compiled --compute-units all "a photo of an astronaut riding a horse on mars" -``` - -You have to specify in `--resource-path` one of the checkpoints downloaded in the previous step, so please make sure it contains compiled Core ML bundles with the extension `.mlmodelc`. The `--compute-units` has to be one of these values: `all`, `cpuOnly`, `cpuAndGPU`, `cpuAndNeuralEngine`. - -For more details, please refer to the [instructions in Apple's repo](https://github.com/apple/ml-stable-diffusion). - -## Supported Diffusers Features - -The Core ML models and inference code don't support many of the features, options, and flexibility of 🧨 Diffusers. These are some of the limitations to keep in mind: - -- Core ML models are only suitable for inference. They can't be used for training or fine-tuning. -- Only two schedulers have been ported to Swift, the default one used by Stable Diffusion and `DPMSolverMultistepScheduler`, which we ported to Swift from our `diffusers` implementation. We recommend you use `DPMSolverMultistepScheduler`, since it produces the same quality in about half the steps. -- Negative prompts, classifier-free guidance scale, and image-to-image tasks are available in the inference code. Advanced features such as depth guidance, ControlNet, and latent upscalers are not available yet. - -Apple's [conversion and inference repo](https://github.com/apple/ml-stable-diffusion) and our own [swift-coreml-diffusers](https://github.com/huggingface/swift-coreml-diffusers) repos are intended as technology demonstrators to enable other developers to build upon. - -If you feel strongly about any missing features, please feel free to open a feature request or, better yet, a contribution PR 🙂. 
- -## Native Diffusers Swift app - -One easy way to run Stable Diffusion on your own Apple hardware is to use [our open-source Swift repo](https://github.com/huggingface/swift-coreml-diffusers), based on `diffusers` and Apple's conversion and inference repo. You can study the code, compile it with [Xcode](https://developer.apple.com/xcode/) and adapt it for your own needs. For your convenience, there's also a [standalone Mac app in the App Store](https://apps.apple.com/app/diffusers/id1666309574), so you can play with it without having to deal with the code or IDE. If you are a developer and have determined that Core ML is the best solution to build your Stable Diffusion app, then you can use the rest of this guide to get started with your project. We can't wait to see what you'll build 🙂. diff --git a/diffusers/docs/source/en/optimization/deepcache.md b/diffusers/docs/source/en/optimization/deepcache.md deleted file mode 100644 index ce3a44269788b45e1838e794e9f375bc3b23f558..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/optimization/deepcache.md +++ /dev/null @@ -1,62 +0,0 @@ - - -# DeepCache -[DeepCache](https://huggingface.co/papers/2312.00858) accelerates [`StableDiffusionPipeline`] and [`StableDiffusionXLPipeline`] by strategically caching and reusing high-level features while efficiently updating low-level features by taking advantage of the U-Net architecture. - -Start by installing [DeepCache](https://github.com/horseee/DeepCache): -```bash -pip install DeepCache -``` - -Then load and enable the [`DeepCacheSDHelper`](https://github.com/horseee/DeepCache#usage): - -```diff - import torch - from diffusers import StableDiffusionPipeline - pipe = StableDiffusionPipeline.from_pretrained('stable-diffusion-v1-5/stable-diffusion-v1-5', torch_dtype=torch.float16).to("cuda") - -+ from DeepCache import DeepCacheSDHelper -+ helper = DeepCacheSDHelper(pipe=pipe) -+ helper.set_params( -+ cache_interval=3, -+ cache_branch_id=0, -+ ) -+ helper.enable() - - image = pipe("a photo of an astronaut on a moon").images[0] -``` - -The `set_params` method accepts two arguments: `cache_interval` and `cache_branch_id`. `cache_interval` means the frequency of feature caching, specified as the number of steps between each cache operation. `cache_branch_id` identifies which branch of the network (ordered from the shallowest to the deepest layer) is responsible for executing the caching processes. -Opting for a lower `cache_branch_id` or a larger `cache_interval` can lead to faster inference speed at the expense of reduced image quality (ablation experiments of these two hyperparameters can be found in the [paper](https://arxiv.org/abs/2312.00858)). Once those arguments are set, use the `enable` or `disable` methods to activate or deactivate the `DeepCacheSDHelper`. - -
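To see what the cache is buying you, you can toggle the helper and time the same prompt with and without it. This is a minimal, self-contained sketch that repeats the setup from the snippet above; the exact timings depend on your GPU:

```py
import time
import torch
from diffusers import StableDiffusionPipeline
from DeepCache import DeepCacheSDHelper

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
helper = DeepCacheSDHelper(pipe=pipe)
helper.set_params(cache_interval=3, cache_branch_id=0)

def timed_generation(prompt):
    # synchronize so the wall-clock time reflects the actual GPU work
    torch.cuda.synchronize()
    start = time.perf_counter()
    image = pipe(prompt).images[0]
    torch.cuda.synchronize()
    return image, time.perf_counter() - start

helper.enable()
_, with_cache = timed_generation("a photo of an astronaut on a moon")
helper.disable()
_, without_cache = timed_generation("a photo of an astronaut on a moon")
print(f"DeepCache enabled: {with_cache:.2f}s, disabled: {without_cache:.2f}s")
```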
- -You can find more generated samples (original pipeline vs DeepCache) and the corresponding inference latency in the [WandB report](https://wandb.ai/horseee/DeepCache/runs/jwlsqqgt?workspace=user-horseee). The prompts are randomly selected from the [MS-COCO 2017](https://cocodataset.org/#home) dataset. - -## Benchmark - -We tested how much faster DeepCache accelerates [Stable Diffusion v2.1](https://huggingface.co/stabilityai/stable-diffusion-2-1) with 50 inference steps on an NVIDIA RTX A5000, using different configurations for resolution, batch size, cache interval (I), and cache branch (B). - -| **Resolution** | **Batch size** | **Original** | **DeepCache(I=3, B=0)** | **DeepCache(I=5, B=0)** | **DeepCache(I=5, B=1)** | -|----------------|----------------|--------------|-------------------------|-------------------------|-------------------------| -| 512| 8| 15.96| 6.88(2.32x)| 5.03(3.18x)| 7.27(2.20x)| -| | 4| 8.39| 3.60(2.33x)| 2.62(3.21x)| 3.75(2.24x)| -| | 1| 2.61| 1.12(2.33x)| 0.81(3.24x)| 1.11(2.35x)| -| 768| 8| 43.58| 18.99(2.29x)| 13.96(3.12x)| 21.27(2.05x)| -| | 4| 22.24| 9.67(2.30x)| 7.10(3.13x)| 10.74(2.07x)| -| | 1| 6.33| 2.72(2.33x)| 1.97(3.21x)| 2.98(2.12x)| -| 1024| 8| 101.95| 45.57(2.24x)| 33.72(3.02x)| 53.00(1.92x)| -| | 4| 49.25| 21.86(2.25x)| 16.19(3.04x)| 25.78(1.91x)| -| | 1| 13.83| 6.07(2.28x)| 4.43(3.12x)| 7.15(1.93x)| diff --git a/diffusers/docs/source/en/optimization/fp16.md b/diffusers/docs/source/en/optimization/fp16.md deleted file mode 100644 index 7a8fee02b7f5e04ccde8e90e308234a2822ccaa7..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/optimization/fp16.md +++ /dev/null @@ -1,129 +0,0 @@ - - -# Speed up inference - -There are several ways to optimize Diffusers for inference speed, such as reducing the computational burden by lowering the data precision or using a lightweight distilled model. There are also memory-efficient attention implementations, [xFormers](xformers) and [scaled dot product attention](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) in PyTorch 2.0, that reduce memory usage which also indirectly speeds up inference. Different speed optimizations can be stacked together to get the fastest inference times. - -> [!TIP] -> Optimizing for inference speed or reduced memory usage can lead to improved performance in the other category, so you should try to optimize for both whenever you can. This guide focuses on inference speed, but you can learn more about lowering memory usage in the [Reduce memory usage](memory) guide. - -The inference times below are obtained from generating a single 512x512 image from the prompt "a photo of an astronaut riding a horse on mars" with 50 DDIM steps on a NVIDIA A100. - -| setup | latency | speed-up | -|----------|---------|----------| -| baseline | 5.27s | x1 | -| tf32 | 4.14s | x1.27 | -| fp16 | 3.51s | x1.50 | -| combined | 3.41s | x1.54 | - -## TensorFloat-32 - -On Ampere and later CUDA devices, matrix multiplications and convolutions can use the [TensorFloat-32 (tf32)](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) mode for faster, but slightly less accurate computations. By default, PyTorch enables tf32 mode for convolutions but not matrix multiplications. Unless your network requires full float32 precision, we recommend enabling tf32 for matrix multiplications. It can significantly speed up computations with typically negligible loss in numerical accuracy. 
- -```python -import torch - -torch.backends.cuda.matmul.allow_tf32 = True -``` - -Learn more about tf32 in the [Mixed precision training](https://huggingface.co/docs/transformers/en/perf_train_gpu_one#tf32) guide. - -## Half-precision weights - -To save GPU memory and get more speed, set `torch_dtype=torch.float16` to load and run the model weights directly with half-precision weights. - -```Python -import torch -from diffusers import DiffusionPipeline - -pipe = DiffusionPipeline.from_pretrained( - "stable-diffusion-v1-5/stable-diffusion-v1-5", - torch_dtype=torch.float16, - use_safetensors=True, -) -pipe = pipe.to("cuda") -``` - -> [!WARNING] -> Don't use [torch.autocast](https://pytorch.org/docs/stable/amp.html#torch.autocast) in any of the pipelines as it can lead to black images and is always slower than pure float16 precision. - -## Distilled model - -You could also use a distilled Stable Diffusion model and autoencoder to speed up inference. During distillation, many of the UNet's residual and attention blocks are shed to reduce the model size by 51% and improve latency on CPU/GPU by 43%. The distilled model is faster and uses less memory while generating images of comparable quality to the full Stable Diffusion model. - -> [!TIP] -> Read the [Open-sourcing Knowledge Distillation Code and Weights of SD-Small and SD-Tiny](https://huggingface.co/blog/sd_distillation) blog post to learn more about how knowledge distillation training works to produce a faster, smaller, and cheaper generative model. - -The inference times below are obtained from generating 4 images from the prompt "a photo of an astronaut riding a horse on mars" with 25 PNDM steps on a NVIDIA A100. Each generation is repeated 3 times with the distilled Stable Diffusion v1.4 model by [Nota AI](https://hf.co/nota-ai). - -| setup | latency | speed-up | -|------------------------------|---------|----------| -| baseline | 6.37s | x1 | -| distilled | 4.18s | x1.52 | -| distilled + tiny autoencoder | 3.83s | x1.66 | - -Let's load the distilled Stable Diffusion model and compare it against the original Stable Diffusion model. - -```py -from diffusers import StableDiffusionPipeline -import torch - -distilled = StableDiffusionPipeline.from_pretrained( - "nota-ai/bk-sdm-small", torch_dtype=torch.float16, use_safetensors=True, -).to("cuda") -prompt = "a golden vase with different flowers" -generator = torch.manual_seed(2023) -image = distilled("a golden vase with different flowers", num_inference_steps=25, generator=generator).images[0] -image -``` - -
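To compare against the original model, you can load the full Stable Diffusion v1.5 checkpoint and generate the same prompt with the same seed. This is a minimal sketch mirroring the distilled example above:

```py
from diffusers import StableDiffusionPipeline
import torch

original = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True,
).to("cuda")

# same prompt and seed as the distilled example so the outputs are comparable
prompt = "a golden vase with different flowers"
generator = torch.manual_seed(2023)
image = original(prompt, num_inference_steps=25, generator=generator).images[0]
image
```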
*Image comparison: original Stable Diffusion vs. distilled Stable Diffusion.*
- -### Tiny AutoEncoder - -To speed inference up even more, replace the autoencoder with a [distilled version](https://huggingface.co/sayakpaul/taesdxl-diffusers) of it. - -```py -import torch -from diffusers import AutoencoderTiny, StableDiffusionPipeline - -distilled = StableDiffusionPipeline.from_pretrained( - "nota-ai/bk-sdm-small", torch_dtype=torch.float16, use_safetensors=True, -).to("cuda") -distilled.vae = AutoencoderTiny.from_pretrained( - "sayakpaul/taesd-diffusers", torch_dtype=torch.float16, use_safetensors=True, -).to("cuda") - -prompt = "a golden vase with different flowers" -generator = torch.manual_seed(2023) -image = distilled("a golden vase with different flowers", num_inference_steps=25, generator=generator).images[0] -image -``` - -
*Sample output: distilled Stable Diffusion + Tiny AutoEncoder.*
-
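The same swap also works for the full (non-distilled) Stable Diffusion pipeline. This is a sketch under the assumption that the tiny autoencoder checkpoint used above is compatible with Stable Diffusion v1.5 latents:

```py
import torch
from diffusers import AutoencoderTiny, StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True,
).to("cuda")
# replace the full VAE with the tiny autoencoder to speed up decoding
pipe.vae = AutoencoderTiny.from_pretrained(
    "sayakpaul/taesd-diffusers", torch_dtype=torch.float16, use_safetensors=True,
).to("cuda")

prompt = "a golden vase with different flowers"
generator = torch.manual_seed(2023)
image = pipe(prompt, num_inference_steps=25, generator=generator).images[0]
image
```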
- -More tiny autoencoder models for other Stable Diffusion models, like Stable Diffusion 3, are available from [madebyollin](https://huggingface.co/madebyollin). \ No newline at end of file diff --git a/diffusers/docs/source/en/optimization/habana.md b/diffusers/docs/source/en/optimization/habana.md deleted file mode 100644 index 86a0cf0ba01911492c51d07c556a66676d31ec86..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/optimization/habana.md +++ /dev/null @@ -1,76 +0,0 @@ - - -# Habana Gaudi - -🤗 Diffusers is compatible with Habana Gaudi through 🤗 [Optimum](https://huggingface.co/docs/optimum/habana/usage_guides/stable_diffusion). Follow the [installation](https://docs.habana.ai/en/latest/Installation_Guide/index.html) guide to install the SynapseAI and Gaudi drivers, and then install Optimum Habana: - -```bash -python -m pip install --upgrade-strategy eager optimum[habana] -``` - -To generate images with Stable Diffusion 1 and 2 on Gaudi, you need to instantiate two instances: - -- [`~optimum.habana.diffusers.GaudiStableDiffusionPipeline`], a pipeline for text-to-image generation. -- [`~optimum.habana.diffusers.GaudiDDIMScheduler`], a Gaudi-optimized scheduler. - -When you initialize the pipeline, you have to specify `use_habana=True` to deploy it on HPUs and to get the fastest possible generation, you should enable **HPU graphs** with `use_hpu_graphs=True`. - -Finally, specify a [`~optimum.habana.GaudiConfig`] which can be downloaded from the [Habana](https://huggingface.co/Habana) organization on the Hub. - -```python -from optimum.habana import GaudiConfig -from optimum.habana.diffusers import GaudiDDIMScheduler, GaudiStableDiffusionPipeline - -model_name = "stabilityai/stable-diffusion-2-base" -scheduler = GaudiDDIMScheduler.from_pretrained(model_name, subfolder="scheduler") -pipeline = GaudiStableDiffusionPipeline.from_pretrained( - model_name, - scheduler=scheduler, - use_habana=True, - use_hpu_graphs=True, - gaudi_config="Habana/stable-diffusion-2", -) -``` - -Now you can call the pipeline to generate images by batches from one or several prompts: - -```python -outputs = pipeline( - prompt=[ - "High quality photo of an astronaut riding a horse in space", - "Face of a yellow cat, high resolution, sitting on a park bench", - ], - num_images_per_prompt=10, - batch_size=4, -) -``` - -For more information, check out 🤗 Optimum Habana's [documentation](https://huggingface.co/docs/optimum/habana/usage_guides/stable_diffusion) and the [example](https://github.com/huggingface/optimum-habana/tree/main/examples/stable-diffusion) provided in the official GitHub repository. - -## Benchmark - -We benchmarked Habana's first-generation Gaudi and Gaudi2 with the [Habana/stable-diffusion](https://huggingface.co/Habana/stable-diffusion) and [Habana/stable-diffusion-2](https://huggingface.co/Habana/stable-diffusion-2) Gaudi configurations (mixed precision bf16/fp32) to demonstrate their performance. 
- -For [Stable Diffusion v1.5](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5) on 512x512 images: - -| | Latency (batch size = 1) | Throughput | -| ---------------------- |:------------------------:|:---------------------------:| -| first-generation Gaudi | 3.80s | 0.308 images/s (batch size = 8) | -| Gaudi2 | 1.33s | 1.081 images/s (batch size = 8) | - -For [Stable Diffusion v2.1](https://huggingface.co/stabilityai/stable-diffusion-2-1) on 768x768 images: - -| | Latency (batch size = 1) | Throughput | -| ---------------------- |:------------------------:|:-------------------------------:| -| first-generation Gaudi | 10.2s | 0.108 images/s (batch size = 4) | -| Gaudi2 | 3.17s | 0.379 images/s (batch size = 8) | diff --git a/diffusers/docs/source/en/optimization/memory.md b/diffusers/docs/source/en/optimization/memory.md deleted file mode 100644 index a2150f9aa0b7c9da52215d8ef115de33b3dc3784..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/optimization/memory.md +++ /dev/null @@ -1,332 +0,0 @@ - - -# Reduce memory usage - -A barrier to using diffusion models is the large amount of memory required. To overcome this challenge, there are several memory-reducing techniques you can use to run even some of the largest models on free-tier or consumer GPUs. Some of these techniques can even be combined to further reduce memory usage. - - - -In many cases, optimizing for memory or speed leads to improved performance in the other, so you should try to optimize for both whenever you can. This guide focuses on minimizing memory usage, but you can also learn more about how to [Speed up inference](fp16). - - - -The results below are obtained from generating a single 512x512 image from the prompt a photo of an astronaut riding a horse on mars with 50 DDIM steps on a Nvidia Titan RTX, demonstrating the speed-up you can expect as a result of reduced memory consumption. - -| | latency | speed-up | -| ---------------- | ------- | ------- | -| original | 9.50s | x1 | -| fp16 | 3.61s | x2.63 | -| channels last | 3.30s | x2.88 | -| traced UNet | 3.21s | x2.96 | -| memory-efficient attention | 2.63s | x3.61 | - -## Sliced VAE - -Sliced VAE enables decoding large batches of images with limited VRAM or batches with 32 images or more by decoding the batches of latents one image at a time. You'll likely want to couple this with [`~ModelMixin.enable_xformers_memory_efficient_attention`] to reduce memory use further if you have xFormers installed. - -To use sliced VAE, call [`~StableDiffusionPipeline.enable_vae_slicing`] on your pipeline before inference: - -```python -import torch -from diffusers import StableDiffusionPipeline - -pipe = StableDiffusionPipeline.from_pretrained( - "stable-diffusion-v1-5/stable-diffusion-v1-5", - torch_dtype=torch.float16, - use_safetensors=True, -) -pipe = pipe.to("cuda") - -prompt = "a photo of an astronaut riding a horse on mars" -pipe.enable_vae_slicing() -#pipe.enable_xformers_memory_efficient_attention() -images = pipe([prompt] * 32).images -``` - -You may see a small performance boost in VAE decoding on multi-image batches, and there should be no performance impact on single-image batches. - -## Tiled VAE - -Tiled VAE processing also enables working with large images on limited VRAM (for example, generating 4k images on 8GB of VRAM) by splitting the image into overlapping tiles, decoding the tiles, and then blending the outputs together to compose the final image. 
You should also use tiled VAE with [`~ModelMixin.enable_xformers_memory_efficient_attention`] to reduce memory use further if you have xFormers installed.

To use tiled VAE processing, call [`~StableDiffusionPipeline.enable_vae_tiling`] on your pipeline before inference:

```python
import torch
from diffusers import StableDiffusionPipeline, UniPCMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    use_safetensors=True,
)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")
prompt = "a beautiful landscape photograph"
pipe.enable_vae_tiling()
#pipe.enable_xformers_memory_efficient_attention()

image = pipe([prompt], width=3840, height=2224, num_inference_steps=20).images[0]
```

The output image has some tile-to-tile tone variation because the tiles are decoded separately, but you shouldn't see any sharp and obvious seams between the tiles. Tiling is turned off for images that are 512x512 or smaller.

## CPU offloading

Offloading the weights to the CPU and only loading them on the GPU when performing the forward pass can also save memory. Often, this technique can reduce memory consumption to less than 3GB.

To perform CPU offloading, call [`~StableDiffusionPipeline.enable_sequential_cpu_offload`]:

```Python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    use_safetensors=True,
)

prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_sequential_cpu_offload()
image = pipe(prompt).images[0]
```

CPU offloading works on submodules rather than whole models. This is the best way to minimize memory consumption, but inference is much slower due to the iterative nature of the diffusion process. The UNet component of the pipeline runs several times (as many as `num_inference_steps`); each time, the different UNet submodules are sequentially onloaded and offloaded as needed, resulting in a large number of memory transfers.

> [!TIP]
> Consider using [model offloading](#model-offloading) if you want to optimize for speed because it is much faster. The tradeoff is your memory savings won't be as large.

> [!WARNING]
> When using [`~StableDiffusionPipeline.enable_sequential_cpu_offload`], don't move the pipeline to CUDA beforehand or else the gain in memory consumption will only be minimal (see this [issue](https://github.com/huggingface/diffusers/issues/1934) for more information).
>
> [`~StableDiffusionPipeline.enable_sequential_cpu_offload`] is a stateful operation that installs hooks on the models.

## Model offloading

> [!TIP]
> Model offloading requires 🤗 Accelerate version 0.17.0 or higher.

[Sequential CPU offloading](#cpu-offloading) preserves a lot of memory but it makes inference slower because submodules are moved to the GPU as needed, and they're immediately returned to the CPU when a new module runs.

Full-model offloading is an alternative that moves whole models to the GPU, instead of handling each model's constituent *submodules*. There is a negligible impact on inference time (compared with moving the pipeline to `cuda`), and it still provides some memory savings.

During model offloading, only one of the main components of the pipeline (typically the text encoder, UNet and VAE) is placed on the GPU while the others wait on the CPU.
Components like the UNet that run for multiple iterations stay on the GPU until they're no longer needed.

Enable model offloading by calling [`~StableDiffusionPipeline.enable_model_cpu_offload`] on the pipeline:

```Python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    use_safetensors=True,
)

prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_model_cpu_offload()
image = pipe(prompt).images[0]
```

> [!WARNING]
> In order to properly offload models after they're called, it is required to run the entire pipeline so that the models are called in the pipeline's expected order. Exercise caution if models are reused outside the context of the pipeline after hooks have been installed. See [Removing Hooks](https://huggingface.co/docs/accelerate/en/package_reference/big_modeling#accelerate.hooks.remove_hook_from_module) for more information.
>
> [`~StableDiffusionPipeline.enable_model_cpu_offload`] is a stateful operation that installs hooks on the models and state on the pipeline.

## Channels-last memory format

The channels-last memory format is an alternative way of ordering NCHW tensors in memory to preserve dimension ordering. Channels-last tensors are ordered in such a way that the channels become the densest dimension (storing images pixel-per-pixel). Since not all operators currently support the channels-last format, it may result in worse performance, but you should still try it and see if it works for your model.

For example, to set the pipeline's UNet to use the channels-last format:

```python
print(pipe.unet.conv_out.state_dict()["weight"].stride())  # (2880, 9, 3, 1)
pipe.unet.to(memory_format=torch.channels_last)  # in-place operation
print(
    pipe.unet.conv_out.state_dict()["weight"].stride()
)  # (2880, 1, 960, 320) having a stride of 1 for the 2nd dimension proves that it works
```

## Tracing

Tracing runs an example input tensor through the model and captures the operations that are performed on it as that input makes its way through the model's layers. The executable or `ScriptFunction` that is returned is optimized with just-in-time compilation.
- -To trace a UNet: - -```python -import time -import torch -from diffusers import StableDiffusionPipeline -import functools - -# torch disable grad -torch.set_grad_enabled(False) - -# set variables -n_experiments = 2 -unet_runs_per_experiment = 50 - - -# load inputs -def generate_inputs(): - sample = torch.randn((2, 4, 64, 64), device="cuda", dtype=torch.float16) - timestep = torch.rand(1, device="cuda", dtype=torch.float16) * 999 - encoder_hidden_states = torch.randn((2, 77, 768), device="cuda", dtype=torch.float16) - return sample, timestep, encoder_hidden_states - - -pipe = StableDiffusionPipeline.from_pretrained( - "stable-diffusion-v1-5/stable-diffusion-v1-5", - torch_dtype=torch.float16, - use_safetensors=True, -).to("cuda") -unet = pipe.unet -unet.eval() -unet.to(memory_format=torch.channels_last) # use channels_last memory format -unet.forward = functools.partial(unet.forward, return_dict=False) # set return_dict=False as default - -# warmup -for _ in range(3): - with torch.inference_mode(): - inputs = generate_inputs() - orig_output = unet(*inputs) - -# trace -print("tracing..") -unet_traced = torch.jit.trace(unet, inputs) -unet_traced.eval() -print("done tracing") - - -# warmup and optimize graph -for _ in range(5): - with torch.inference_mode(): - inputs = generate_inputs() - orig_output = unet_traced(*inputs) - - -# benchmarking -with torch.inference_mode(): - for _ in range(n_experiments): - torch.cuda.synchronize() - start_time = time.time() - for _ in range(unet_runs_per_experiment): - orig_output = unet_traced(*inputs) - torch.cuda.synchronize() - print(f"unet traced inference took {time.time() - start_time:.2f} seconds") - for _ in range(n_experiments): - torch.cuda.synchronize() - start_time = time.time() - for _ in range(unet_runs_per_experiment): - orig_output = unet(*inputs) - torch.cuda.synchronize() - print(f"unet inference took {time.time() - start_time:.2f} seconds") - -# save the model -unet_traced.save("unet_traced.pt") -``` - -Replace the `unet` attribute of the pipeline with the traced model: - -```python -from diffusers import StableDiffusionPipeline -import torch -from dataclasses import dataclass - - -@dataclass -class UNet2DConditionOutput: - sample: torch.Tensor - - -pipe = StableDiffusionPipeline.from_pretrained( - "stable-diffusion-v1-5/stable-diffusion-v1-5", - torch_dtype=torch.float16, - use_safetensors=True, -).to("cuda") - -# use jitted unet -unet_traced = torch.jit.load("unet_traced.pt") - - -# del pipe.unet -class TracedUNet(torch.nn.Module): - def __init__(self): - super().__init__() - self.in_channels = pipe.unet.config.in_channels - self.device = pipe.unet.device - - def forward(self, latent_model_input, t, encoder_hidden_states): - sample = unet_traced(latent_model_input, t, encoder_hidden_states)[0] - return UNet2DConditionOutput(sample=sample) - - -pipe.unet = TracedUNet() - -with torch.inference_mode(): - image = pipe([prompt] * 1, num_inference_steps=50).images[0] -``` - -## Memory-efficient attention - -Recent work on optimizing bandwidth in the attention block has generated huge speed-ups and reductions in GPU memory usage. The most recent type of memory-efficient attention is [Flash Attention](https://arxiv.org/abs/2205.14135) (you can check out the original code at [HazyResearch/flash-attention](https://github.com/HazyResearch/flash-attention)). - - - -If you have PyTorch >= 2.0 installed, you should not expect a speed-up for inference when enabling `xformers`. 
- - - -To use Flash Attention, install the following: - -- PyTorch > 1.12 -- CUDA available -- [xFormers](xformers) - -Then call [`~ModelMixin.enable_xformers_memory_efficient_attention`] on the pipeline: - -```python -from diffusers import DiffusionPipeline -import torch - -pipe = DiffusionPipeline.from_pretrained( - "stable-diffusion-v1-5/stable-diffusion-v1-5", - torch_dtype=torch.float16, - use_safetensors=True, -).to("cuda") - -pipe.enable_xformers_memory_efficient_attention() - -with torch.inference_mode(): - sample = pipe("a small cat") - -# optional: You can disable it via -# pipe.disable_xformers_memory_efficient_attention() -``` - -The iteration speed when using `xformers` should match the iteration speed of PyTorch 2.0 as described [here](torch2.0). diff --git a/diffusers/docs/source/en/optimization/mps.md b/diffusers/docs/source/en/optimization/mps.md deleted file mode 100644 index 2c6dc9306cf9dd2cd9d54ae5d65d89d4ea6dedc3..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/optimization/mps.md +++ /dev/null @@ -1,74 +0,0 @@ - - -# Metal Performance Shaders (MPS) - -🤗 Diffusers is compatible with Apple silicon (M1/M2 chips) using the PyTorch [`mps`](https://pytorch.org/docs/stable/notes/mps.html) device, which uses the Metal framework to leverage the GPU on MacOS devices. You'll need to have: - -- macOS computer with Apple silicon (M1/M2) hardware -- macOS 12.6 or later (13.0 or later recommended) -- arm64 version of Python -- [PyTorch 2.0](https://pytorch.org/get-started/locally/) (recommended) or 1.13 (minimum version supported for `mps`) - -The `mps` backend uses PyTorch's `.to()` interface to move the Stable Diffusion pipeline on to your M1 or M2 device: - -```python -from diffusers import DiffusionPipeline - -pipe = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5") -pipe = pipe.to("mps") - -# Recommended if your computer has < 64 GB of RAM -pipe.enable_attention_slicing() - -prompt = "a photo of an astronaut riding a horse on mars" -image = pipe(prompt).images[0] -image -``` - - - -Generating multiple prompts in a batch can [crash](https://github.com/huggingface/diffusers/issues/363) or fail to work reliably. We believe this is related to the [`mps`](https://github.com/pytorch/pytorch/issues/84039) backend in PyTorch. While this is being investigated, you should iterate instead of batching. - - - -If you're using **PyTorch 1.13**, you need to "prime" the pipeline with an additional one-time pass through it. This is a temporary workaround for an issue where the first inference pass produces slightly different results than subsequent ones. You only need to do this pass once, and after just one inference step you can discard the result. - -```diff - from diffusers import DiffusionPipeline - - pipe = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5").to("mps") - pipe.enable_attention_slicing() - - prompt = "a photo of an astronaut riding a horse on mars" - # First-time "warmup" pass if PyTorch version is 1.13 -+ _ = pipe(prompt, num_inference_steps=1) - - # Results match those from the CPU device after the warmup pass. - image = pipe(prompt).images[0] -``` - -## Troubleshoot - -M1/M2 performance is very sensitive to memory pressure. When this occurs, the system automatically swaps if it needs to which significantly degrades performance. - -To prevent this from happening, we recommend *attention slicing* to reduce memory pressure during inference and prevent swapping. 
This is especially relevant if your computer has less than 64GB of system RAM, or if you generate images at non-standard resolutions larger than 512×512 pixels. Call the [`~DiffusionPipeline.enable_attention_slicing`] function on your pipeline: - -```py -from diffusers import DiffusionPipeline -import torch - -pipeline = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16", use_safetensors=True).to("mps") -pipeline.enable_attention_slicing() -``` - -Attention slicing performs the costly attention operation in multiple steps instead of all at once. It usually improves performance by ~20% in computers without universal memory, but we've observed *better performance* in most Apple silicon computers unless you have 64GB of RAM or more. diff --git a/diffusers/docs/source/en/optimization/neuron.md b/diffusers/docs/source/en/optimization/neuron.md deleted file mode 100644 index b10050e64d7f976451c84b734e8f537a75c888fe..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/optimization/neuron.md +++ /dev/null @@ -1,61 +0,0 @@ - - -# AWS Neuron - -Diffusers functionalities are available on [AWS Inf2 instances](https://aws.amazon.com/ec2/instance-types/inf2/), which are EC2 instances powered by [Neuron machine learning accelerators](https://aws.amazon.com/machine-learning/inferentia/). These instances aim to provide better compute performance (higher throughput, lower latency) with good cost-efficiency, making them good candidates for AWS users to deploy diffusion models to production. - -[Optimum Neuron](https://huggingface.co/docs/optimum-neuron/en/index) is the interface between Hugging Face libraries and AWS Accelerators, including AWS [Trainium](https://aws.amazon.com/machine-learning/trainium/) and AWS [Inferentia](https://aws.amazon.com/machine-learning/inferentia/). It supports many of the features in Diffusers with similar APIs, so it is easier to learn if you're already familiar with Diffusers. Once you have created an AWS Inf2 instance, install Optimum Neuron. - -```bash -python -m pip install --upgrade-strategy eager optimum[neuronx] -``` - - - -We provide pre-built [Hugging Face Neuron Deep Learning AMI](https://aws.amazon.com/marketplace/pp/prodview-gr3e6yiscria2) (DLAMI) and Optimum Neuron containers for Amazon SageMaker. It's recommended to correctly set up your environment. - - - -The example below demonstrates how to generate images with the Stable Diffusion XL model on an inf2.8xlarge instance (you can switch to cheaper inf2.xlarge instances once the model is compiled). To generate some images, use the [`~optimum.neuron.NeuronStableDiffusionXLPipeline`] class, which is similar to the [`StableDiffusionXLPipeline`] class in Diffusers. - -Unlike Diffusers, you need to compile models in the pipeline to the Neuron format, `.neuron`. Launch the following command to export the model to the `.neuron` format. - -```bash -optimum-cli export neuron --model stabilityai/stable-diffusion-xl-base-1.0 \ - --batch_size 1 \ - --height 1024 `# height in pixels of generated image, eg. 768, 1024` \ - --width 1024 `# width in pixels of generated image, eg. 768, 1024` \ - --num_images_per_prompt 1 `# number of images to generate per prompt, defaults to 1` \ - --auto_cast matmul `# cast only matrix multiplication operations` \ - --auto_cast_type bf16 `# cast operations from FP32 to BF16` \ - sd_neuron_xl/ -``` - -Now generate some images with the pre-compiled SDXL model. 
- -```python ->>> from optimum.neuron import NeuronStableDiffusionXLPipeline - ->>> stable_diffusion_xl = NeuronStableDiffusionXLPipeline.from_pretrained("sd_neuron_xl/") ->>> prompt = "a pig with wings flying in floating US dollar banknotes in the air, skyscrapers behind, warm color palette, muted colors, detailed, 8k" ->>> image = stable_diffusion_xl(prompt).images[0] -``` - -peggy generated by sdxl on inf2 - -Feel free to check out more guides and examples on different use cases from the Optimum Neuron [documentation](https://huggingface.co/docs/optimum-neuron/en/inference_tutorials/stable_diffusion#generate-images-with-stable-diffusion-models-on-aws-inferentia)! diff --git a/diffusers/docs/source/en/optimization/onnx.md b/diffusers/docs/source/en/optimization/onnx.md deleted file mode 100644 index 84c0d0c263e51866fbc2fffb74d26b30753cea89..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/optimization/onnx.md +++ /dev/null @@ -1,86 +0,0 @@ - - -# ONNX Runtime - -🤗 [Optimum](https://github.com/huggingface/optimum) provides a Stable Diffusion pipeline compatible with ONNX Runtime. You'll need to install 🤗 Optimum with the following command for ONNX Runtime support: - -```bash -pip install -q optimum["onnxruntime"] -``` - -This guide will show you how to use the Stable Diffusion and Stable Diffusion XL (SDXL) pipelines with ONNX Runtime. - -## Stable Diffusion - -To load and run inference, use the [`~optimum.onnxruntime.ORTStableDiffusionPipeline`]. If you want to load a PyTorch model and convert it to the ONNX format on-the-fly, set `export=True`: - -```python -from optimum.onnxruntime import ORTStableDiffusionPipeline - -model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5" -pipeline = ORTStableDiffusionPipeline.from_pretrained(model_id, export=True) -prompt = "sailing ship in storm by Leonardo da Vinci" -image = pipeline(prompt).images[0] -pipeline.save_pretrained("./onnx-stable-diffusion-v1-5") -``` - - - -Generating multiple prompts in a batch seems to take too much memory. While we look into it, you may need to iterate instead of batching. - - - -To export the pipeline in the ONNX format offline and use it later for inference, -use the [`optimum-cli export`](https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli) command: - -```bash -optimum-cli export onnx --model stable-diffusion-v1-5/stable-diffusion-v1-5 sd_v15_onnx/ -``` - -Then to perform inference (you don't have to specify `export=True` again): - -```python -from optimum.onnxruntime import ORTStableDiffusionPipeline - -model_id = "sd_v15_onnx" -pipeline = ORTStableDiffusionPipeline.from_pretrained(model_id) -prompt = "sailing ship in storm by Leonardo da Vinci" -image = pipeline(prompt).images[0] -``` - -
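Other Stable Diffusion tasks follow the same pattern. As a sketch, image-to-image works through the corresponding ORT pipeline class (assuming `ORTStableDiffusionImg2ImgPipeline` is available in your version of Optimum and is loaded from the same exported directory):

```python
from diffusers.utils import load_image
from optimum.onnxruntime import ORTStableDiffusionImg2ImgPipeline

pipeline = ORTStableDiffusionImg2ImgPipeline.from_pretrained("sd_v15_onnx")
init_image = load_image(
    "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
).resize((512, 512))
prompt = "sailing ship in storm by Leonardo da Vinci"
image = pipeline(prompt=prompt, image=init_image, strength=0.75).images[0]
```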
- -You can find more examples in 🤗 Optimum [documentation](https://huggingface.co/docs/optimum/), and Stable Diffusion is supported for text-to-image, image-to-image, and inpainting. - -## Stable Diffusion XL - -To load and run inference with SDXL, use the [`~optimum.onnxruntime.ORTStableDiffusionXLPipeline`]: - -```python -from optimum.onnxruntime import ORTStableDiffusionXLPipeline - -model_id = "stabilityai/stable-diffusion-xl-base-1.0" -pipeline = ORTStableDiffusionXLPipeline.from_pretrained(model_id) -prompt = "sailing ship in storm by Leonardo da Vinci" -image = pipeline(prompt).images[0] -``` - -To export the pipeline in the ONNX format and use it later for inference, use the [`optimum-cli export`](https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli) command: - -```bash -optimum-cli export onnx --model stabilityai/stable-diffusion-xl-base-1.0 --task stable-diffusion-xl sd_xl_onnx/ -``` - -SDXL in the ONNX format is supported for text-to-image and image-to-image. diff --git a/diffusers/docs/source/en/optimization/open_vino.md b/diffusers/docs/source/en/optimization/open_vino.md deleted file mode 100644 index b2af9d9d62e14d22d703a11ac7f47e878677c829..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/optimization/open_vino.md +++ /dev/null @@ -1,80 +0,0 @@ - - -# OpenVINO - -🤗 [Optimum](https://github.com/huggingface/optimum-intel) provides Stable Diffusion pipelines compatible with OpenVINO to perform inference on a variety of Intel processors (see the [full list](https://docs.openvino.ai/latest/openvino_docs_OV_UG_supported_plugins_Supported_Devices.html) of supported devices). - -You'll need to install 🤗 Optimum Intel with the `--upgrade-strategy eager` option to ensure [`optimum-intel`](https://github.com/huggingface/optimum-intel) is using the latest version: - -```bash -pip install --upgrade-strategy eager optimum["openvino"] -``` - -This guide will show you how to use the Stable Diffusion and Stable Diffusion XL (SDXL) pipelines with OpenVINO. - -## Stable Diffusion - -To load and run inference, use the [`~optimum.intel.OVStableDiffusionPipeline`]. If you want to load a PyTorch model and convert it to the OpenVINO format on-the-fly, set `export=True`: - -```python -from optimum.intel import OVStableDiffusionPipeline - -model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5" -pipeline = OVStableDiffusionPipeline.from_pretrained(model_id, export=True) -prompt = "sailing ship in storm by Rembrandt" -image = pipeline(prompt).images[0] - -# Don't forget to save the exported model -pipeline.save_pretrained("openvino-sd-v1-5") -``` - -To further speed-up inference, statically reshape the model. If you change any parameters such as the outputs height or width, you’ll need to statically reshape your model again. - -```python -# Define the shapes related to the inputs and desired outputs -batch_size, num_images, height, width = 1, 1, 512, 512 - -# Statically reshape the model -pipeline.reshape(batch_size, height, width, num_images) -# Compile the model before inference -pipeline.compile() - -image = pipeline( - prompt, - height=height, - width=width, - num_images_per_prompt=num_images, -).images[0] -``` -
- -You can find more examples in the 🤗 Optimum [documentation](https://huggingface.co/docs/optimum/intel/inference#stable-diffusion), and Stable Diffusion is supported for text-to-image, image-to-image, and inpainting. - -## Stable Diffusion XL - -To load and run inference with SDXL, use the [`~optimum.intel.OVStableDiffusionXLPipeline`]: - -```python -from optimum.intel import OVStableDiffusionXLPipeline - -model_id = "stabilityai/stable-diffusion-xl-base-1.0" -pipeline = OVStableDiffusionXLPipeline.from_pretrained(model_id) -prompt = "sailing ship in storm by Rembrandt" -image = pipeline(prompt).images[0] -``` - -To further speed-up inference, [statically reshape](#stable-diffusion) the model as shown in the Stable Diffusion section. - -You can find more examples in the 🤗 Optimum [documentation](https://huggingface.co/docs/optimum/intel/inference#stable-diffusion-xl), and running SDXL in OpenVINO is supported for text-to-image and image-to-image. diff --git a/diffusers/docs/source/en/optimization/tgate.md b/diffusers/docs/source/en/optimization/tgate.md deleted file mode 100644 index 90e0bc32f71b085244ce60b43bb7538c767724ef..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/optimization/tgate.md +++ /dev/null @@ -1,182 +0,0 @@ -# T-GATE - -[T-GATE](https://github.com/HaozheLiu-ST/T-GATE/tree/main) accelerates inference for [Stable Diffusion](../api/pipelines/stable_diffusion/overview), [PixArt](../api/pipelines/pixart), and [Latency Consistency Model](../api/pipelines/latent_consistency_models.md) pipelines by skipping the cross-attention calculation once it converges. This method doesn't require any additional training and it can speed up inference from 10-50%. T-GATE is also compatible with other optimization methods like [DeepCache](./deepcache). - -Before you begin, make sure you install T-GATE. - -```bash -pip install tgate -pip install -U torch diffusers transformers accelerate DeepCache -``` - - -To use T-GATE with a pipeline, you need to use its corresponding loader. - -| Pipeline | T-GATE Loader | -|---|---| -| PixArt | TgatePixArtLoader | -| Stable Diffusion XL | TgateSDXLLoader | -| Stable Diffusion XL + DeepCache | TgateSDXLDeepCacheLoader | -| Stable Diffusion | TgateSDLoader | -| Stable Diffusion + DeepCache | TgateSDDeepCacheLoader | - -Next, create a `TgateLoader` with a pipeline, the gate step (the time step to stop calculating the cross attention), and the number of inference steps. Then call the `tgate` method on the pipeline with a prompt, gate step, and the number of inference steps. - -Let's see how to enable this for several different pipelines. 
- - - - -Accelerate `PixArtAlphaPipeline` with T-GATE: - -```py -import torch -from diffusers import PixArtAlphaPipeline -from tgate import TgatePixArtLoader - -pipe = PixArtAlphaPipeline.from_pretrained("PixArt-alpha/PixArt-XL-2-1024-MS", torch_dtype=torch.float16) - -gate_step = 8 -inference_step = 25 -pipe = TgatePixArtLoader( - pipe, - gate_step=gate_step, - num_inference_steps=inference_step, -).to("cuda") - -image = pipe.tgate( - "An alpaca made of colorful building blocks, cyberpunk.", - gate_step=gate_step, - num_inference_steps=inference_step, -).images[0] -``` - - - -Accelerate `StableDiffusionXLPipeline` with T-GATE: - -```py -import torch -from diffusers import StableDiffusionXLPipeline -from diffusers import DPMSolverMultistepScheduler -from tgate import TgateSDXLLoader - -pipe = StableDiffusionXLPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", - torch_dtype=torch.float16, - variant="fp16", - use_safetensors=True, -) -pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config) - -gate_step = 10 -inference_step = 25 -pipe = TgateSDXLLoader( - pipe, - gate_step=gate_step, - num_inference_steps=inference_step, -).to("cuda") - -image = pipe.tgate( - "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k.", - gate_step=gate_step, - num_inference_steps=inference_step -).images[0] -``` - - - -Accelerate `StableDiffusionXLPipeline` with [DeepCache](https://github.com/horseee/DeepCache) and T-GATE: - -```py -import torch -from diffusers import StableDiffusionXLPipeline -from diffusers import DPMSolverMultistepScheduler -from tgate import TgateSDXLDeepCacheLoader - -pipe = StableDiffusionXLPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", - torch_dtype=torch.float16, - variant="fp16", - use_safetensors=True, -) -pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config) - -gate_step = 10 -inference_step = 25 -pipe = TgateSDXLDeepCacheLoader( - pipe, - cache_interval=3, - cache_branch_id=0, -).to("cuda") - -image = pipe.tgate( - "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k.", - gate_step=gate_step, - num_inference_steps=inference_step -).images[0] -``` - - - -Accelerate `latent-consistency/lcm-sdxl` with T-GATE: - -```py -import torch -from diffusers import StableDiffusionXLPipeline -from diffusers import UNet2DConditionModel, LCMScheduler -from diffusers import DPMSolverMultistepScheduler -from tgate import TgateSDXLLoader - -unet = UNet2DConditionModel.from_pretrained( - "latent-consistency/lcm-sdxl", - torch_dtype=torch.float16, - variant="fp16", -) -pipe = StableDiffusionXLPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", - unet=unet, - torch_dtype=torch.float16, - variant="fp16", -) -pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) - -gate_step = 1 -inference_step = 4 -pipe = TgateSDXLLoader( - pipe, - gate_step=gate_step, - num_inference_steps=inference_step, - lcm=True -).to("cuda") - -image = pipe.tgate( - "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k.", - gate_step=gate_step, - num_inference_steps=inference_step -).images[0] -``` - - - -T-GATE also supports [`StableDiffusionPipeline`] and [PixArt-alpha/PixArt-LCM-XL-2-1024-MS](https://hf.co/PixArt-alpha/PixArt-LCM-XL-2-1024-MS). 
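For the plain [`StableDiffusionPipeline`], the loader listed in the table above is `TgateSDLoader`. The following is a minimal sketch that assumes it takes the same arguments as the other loaders shown in this section:

```py
import torch
from diffusers import StableDiffusionPipeline
from tgate import TgateSDLoader

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
)

gate_step = 10
inference_step = 25
pipe = TgateSDLoader(
    pipe,
    gate_step=gate_step,
    num_inference_steps=inference_step,
).to("cuda")

image = pipe.tgate(
    "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k.",
    gate_step=gate_step,
    num_inference_steps=inference_step,
).images[0]
```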
- -## Benchmarks -| Model | MACs | Param | Latency | Zero-shot 10K-FID on MS-COCO | -|-----------------------|----------|-----------|---------|---------------------------| -| SD-1.5 | 16.938T | 859.520M | 7.032s | 23.927 | -| SD-1.5 w/ T-GATE | 9.875T | 815.557M | 4.313s | 20.789 | -| SD-2.1 | 38.041T | 865.785M | 16.121s | 22.609 | -| SD-2.1 w/ T-GATE | 22.208T | 815.433 M | 9.878s | 19.940 | -| SD-XL | 149.438T | 2.570B | 53.187s | 24.628 | -| SD-XL w/ T-GATE | 84.438T | 2.024B | 27.932s | 22.738 | -| Pixart-Alpha | 107.031T | 611.350M | 61.502s | 38.669 | -| Pixart-Alpha w/ T-GATE | 65.318T | 462.585M | 37.867s | 35.825 | -| DeepCache (SD-XL) | 57.888T | - | 19.931s | 23.755 | -| DeepCache w/ T-GATE | 43.868T | - | 14.666s | 23.999 | -| LCM (SD-XL) | 11.955T | 2.570B | 3.805s | 25.044 | -| LCM w/ T-GATE | 11.171T | 2.024B | 3.533s | 25.028 | -| LCM (Pixart-Alpha) | 8.563T | 611.350M | 4.733s | 36.086 | -| LCM w/ T-GATE | 7.623T | 462.585M | 4.543s | 37.048 | - -The latency is tested on an NVIDIA 1080TI, MACs and Params are calculated with [calflops](https://github.com/MrYxJ/calculate-flops.pytorch), and the FID is calculated with [PytorchFID](https://github.com/mseitzer/pytorch-fid). diff --git a/diffusers/docs/source/en/optimization/tome.md b/diffusers/docs/source/en/optimization/tome.md deleted file mode 100644 index 3e574efbfe1bf4fe7ebb4d06a84ddc8fc60c7fa0..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/optimization/tome.md +++ /dev/null @@ -1,96 +0,0 @@ - - -# Token merging - -[Token merging](https://huggingface.co/papers/2303.17604) (ToMe) merges redundant tokens/patches progressively in the forward pass of a Transformer-based network which can speed-up the inference latency of [`StableDiffusionPipeline`]. - -Install ToMe from `pip`: - -```bash -pip install tomesd -``` - -You can use ToMe from the [`tomesd`](https://github.com/dbolya/tomesd) library with the [`apply_patch`](https://github.com/dbolya/tomesd?tab=readme-ov-file#usage) function: - -```diff - from diffusers import StableDiffusionPipeline - import torch - import tomesd - - pipeline = StableDiffusionPipeline.from_pretrained( - "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True, - ).to("cuda") -+ tomesd.apply_patch(pipeline, ratio=0.5) - - image = pipeline("a photo of an astronaut riding a horse on mars").images[0] -``` - -The `apply_patch` function exposes a number of [arguments](https://github.com/dbolya/tomesd#usage) to help strike a balance between pipeline inference speed and the quality of the generated tokens. The most important argument is `ratio` which controls the number of tokens that are merged during the forward pass. - -As reported in the [paper](https://huggingface.co/papers/2303.17604), ToMe can greatly preserve the quality of the generated images while boosting inference speed. By increasing the `ratio`, you can speed-up inference even further, but at the cost of some degraded image quality. - -To test the quality of the generated images, we sampled a few prompts from [Parti Prompts](https://parti.research.google/) and performed inference with the [`StableDiffusionPipeline`] with the following settings: - -
- -We didn’t notice any significant decrease in the quality of the generated samples, and you can check out the generated samples in this [WandB report](https://wandb.ai/sayakpaul/tomesd-results/runs/23j4bj3i?workspace=). If you're interested in reproducing this experiment, use this [script](https://gist.github.com/sayakpaul/8cac98d7f22399085a060992f411ecbd). - -## Benchmarks - -We also benchmarked the impact of `tomesd` on the [`StableDiffusionPipeline`] with [xFormers](https://huggingface.co/docs/diffusers/optimization/xformers) enabled across several image resolutions. The results are obtained from A100 and V100 GPUs in the following development environment: - -```bash -- `diffusers` version: 0.15.1 -- Python version: 3.8.16 -- PyTorch version (GPU?): 1.13.1+cu116 (True) -- Huggingface_hub version: 0.13.2 -- Transformers version: 4.27.2 -- Accelerate version: 0.18.0 -- xFormers version: 0.0.16 -- tomesd version: 0.1.2 -``` - -To reproduce this benchmark, feel free to use this [script](https://gist.github.com/sayakpaul/27aec6bca7eb7b0e0aa4112205850335). The results are reported in seconds, and where applicable we report the speed-up percentage over the vanilla pipeline when using ToMe and ToMe + xFormers. - -| **GPU** | **Resolution** | **Batch size** | **Vanilla** | **ToMe** | **ToMe + xFormers** | -|----------|----------------|----------------|-------------|----------------|---------------------| -| **A100** | 512 | 10 | 6.88 | 5.26 (+23.55%) | 4.69 (+31.83%) | -| | 768 | 10 | OOM | 14.71 | 11 | -| | | 8 | OOM | 11.56 | 8.84 | -| | | 4 | OOM | 5.98 | 4.66 | -| | | 2 | 4.99 | 3.24 (+35.07%) | 2.1 (+37.88%) | -| | | 1 | 3.29 | 2.24 (+31.91%) | 2.03 (+38.3%) | -| | 1024 | 10 | OOM | OOM | OOM | -| | | 8 | OOM | OOM | OOM | -| | | 4 | OOM | 12.51 | 9.09 | -| | | 2 | OOM | 6.52 | 4.96 | -| | | 1 | 6.4 | 3.61 (+43.59%) | 2.81 (+56.09%) | -| **V100** | 512 | 10 | OOM | 10.03 | 9.29 | -| | | 8 | OOM | 8.05 | 7.47 | -| | | 4 | 5.7 | 4.3 (+24.56%) | 3.98 (+30.18%) | -| | | 2 | 3.14 | 2.43 (+22.61%) | 2.27 (+27.71%) | -| | | 1 | 1.88 | 1.57 (+16.49%) | 1.57 (+16.49%) | -| | 768 | 10 | OOM | OOM | 23.67 | -| | | 8 | OOM | OOM | 18.81 | -| | | 4 | OOM | 11.81 | 9.7 | -| | | 2 | OOM | 6.27 | 5.2 | -| | | 1 | 5.43 | 3.38 (+37.75%) | 2.82 (+48.07%) | -| | 1024 | 10 | OOM | OOM | OOM | -| | | 8 | OOM | OOM | OOM | -| | | 4 | OOM | OOM | 19.35 | -| | | 2 | OOM | 13 | 10.78 | -| | | 1 | OOM | 6.66 | 5.54 | - -As seen in the tables above, the speed-up from `tomesd` becomes more pronounced for larger image resolutions. It is also interesting to note that with `tomesd`, it is possible to run the pipeline on a higher resolution like 1024x1024. You may be able to speed-up inference even more with [`torch.compile`](torch2.0). diff --git a/diffusers/docs/source/en/optimization/torch2.0.md b/diffusers/docs/source/en/optimization/torch2.0.md deleted file mode 100644 index 01ea00310a75e980f8cec15496f97bcc407082e0..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/optimization/torch2.0.md +++ /dev/null @@ -1,421 +0,0 @@ - - -# PyTorch 2.0 - -🤗 Diffusers supports the latest optimizations from [PyTorch 2.0](https://pytorch.org/get-started/pytorch-2.0/) which include: - -1. A memory-efficient attention implementation, scaled dot product attention, without requiring any extra dependencies such as xFormers. -2. 
[`torch.compile`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html), a just-in-time (JIT) compiler to provide an extra performance boost when individual models are compiled. - -Both of these optimizations require PyTorch 2.0 or later and 🤗 Diffusers > 0.13.0. - -```bash -pip install --upgrade torch diffusers -``` - -## Scaled dot product attention - -[`torch.nn.functional.scaled_dot_product_attention`](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention) (SDPA) is an optimized and memory-efficient attention (similar to xFormers) that automatically enables several other optimizations depending on the model inputs and GPU type. SDPA is enabled by default if you're using PyTorch 2.0 and the latest version of 🤗 Diffusers, so you don't need to add anything to your code. - -However, if you want to explicitly enable it, you can set a [`DiffusionPipeline`] to use [`~models.attention_processor.AttnProcessor2_0`]: - -```diff - import torch - from diffusers import DiffusionPipeline -+ from diffusers.models.attention_processor import AttnProcessor2_0 - - pipe = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True).to("cuda") -+ pipe.unet.set_attn_processor(AttnProcessor2_0()) - - prompt = "a photo of an astronaut riding a horse on mars" - image = pipe(prompt).images[0] -``` - -SDPA should be as fast and memory efficient as `xFormers`; check the [benchmark](#benchmark) for more details. - -In some cases - such as making the pipeline more deterministic or converting it to other formats - it may be helpful to use the vanilla attention processor, [`~models.attention_processor.AttnProcessor`]. To revert to [`~models.attention_processor.AttnProcessor`], call the [`~UNet2DConditionModel.set_default_attn_processor`] function on the pipeline: - -```diff - import torch - from diffusers import DiffusionPipeline - - pipe = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True).to("cuda") -+ pipe.unet.set_default_attn_processor() - - prompt = "a photo of an astronaut riding a horse on mars" - image = pipe(prompt).images[0] -``` - -## torch.compile - -The `torch.compile` function can often provide an additional speed-up to your PyTorch code. In 🤗 Diffusers, it is usually best to wrap the UNet with `torch.compile` because it does most of the heavy lifting in the pipeline. - -```python -from diffusers import DiffusionPipeline -import torch - -pipe = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True).to("cuda") -pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) -images = pipe(prompt, num_inference_steps=steps, num_images_per_prompt=batch_size).images[0] -``` - -Depending on GPU type, `torch.compile` can provide an *additional speed-up* of **5-300x** on top of SDPA! If you're using more recent GPU architectures such as Ampere (A100, 3090), Ada (4090), and Hopper (H100), `torch.compile` is able to squeeze even more performance out of these GPUs. - -Compilation requires some time to complete, so it is best suited for situations where you prepare your pipeline once and then perform the same type of inference operations multiple times. For example, calling the compiled pipeline on a different image size triggers compilation again which can be expensive. 
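To see the compilation cost in isolation, you can time a few consecutive calls; the first one includes compilation and should be noticeably slower than the rest. This is a minimal sketch, and the exact numbers depend on your GPU and PyTorch version:

```python
import time
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

prompt = "a photo of an astronaut riding a horse on mars"
for i in range(3):
    start = time.perf_counter()
    _ = pipe(prompt, num_inference_steps=25).images[0]
    torch.cuda.synchronize()
    print(f"run {i}: {time.perf_counter() - start:.1f}s")  # run 0 pays the compilation cost
```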
-
-For more information about the different options available for `torch.compile`, refer to the [`torch_compile`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) tutorial.
-
-> [!TIP]
-> Learn more about other ways PyTorch 2.0 can help optimize your model in the [Accelerate inference of text-to-image diffusion models](../tutorials/fast_diffusion) tutorial.
-
-## Benchmark
-
-We ran a comprehensive benchmark with PyTorch 2.0's efficient attention implementation and `torch.compile` across different GPUs and batch sizes for five of our most used pipelines. The code was benchmarked on 🤗 Diffusers v0.17.0.dev0, which contains optimizations for `torch.compile` usage (see [this pull request](https://github.com/huggingface/diffusers/pull/3313) for more details).
-
-Expand the dropdown below to find the code used to benchmark each pipeline:
- -### Stable Diffusion text-to-image - -```python -from diffusers import DiffusionPipeline -import torch - -path = "stable-diffusion-v1-5/stable-diffusion-v1-5" - -run_compile = True # Set True / False - -pipe = DiffusionPipeline.from_pretrained(path, torch_dtype=torch.float16, use_safetensors=True) -pipe = pipe.to("cuda") -pipe.unet.to(memory_format=torch.channels_last) - -if run_compile: - print("Run torch compile") - pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) - -prompt = "ghibli style, a fantasy landscape with castles" - -for _ in range(3): - images = pipe(prompt=prompt).images -``` - -### Stable Diffusion image-to-image - -```python -from diffusers import StableDiffusionImg2ImgPipeline -from diffusers.utils import load_image -import torch - -url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg" - -init_image = load_image(url) -init_image = init_image.resize((512, 512)) - -path = "stable-diffusion-v1-5/stable-diffusion-v1-5" - -run_compile = True # Set True / False - -pipe = StableDiffusionImg2ImgPipeline.from_pretrained(path, torch_dtype=torch.float16, use_safetensors=True) -pipe = pipe.to("cuda") -pipe.unet.to(memory_format=torch.channels_last) - -if run_compile: - print("Run torch compile") - pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) - -prompt = "ghibli style, a fantasy landscape with castles" - -for _ in range(3): - image = pipe(prompt=prompt, image=init_image).images[0] -``` - -### Stable Diffusion inpainting - -```python -from diffusers import StableDiffusionInpaintPipeline -from diffusers.utils import load_image -import torch - -img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png" -mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png" - -init_image = load_image(img_url).resize((512, 512)) -mask_image = load_image(mask_url).resize((512, 512)) - -path = "runwayml/stable-diffusion-inpainting" - -run_compile = True # Set True / False - -pipe = StableDiffusionInpaintPipeline.from_pretrained(path, torch_dtype=torch.float16, use_safetensors=True) -pipe = pipe.to("cuda") -pipe.unet.to(memory_format=torch.channels_last) - -if run_compile: - print("Run torch compile") - pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) - -prompt = "ghibli style, a fantasy landscape with castles" - -for _ in range(3): - image = pipe(prompt=prompt, image=init_image, mask_image=mask_image).images[0] -``` - -### ControlNet - -```python -from diffusers import StableDiffusionControlNetPipeline, ControlNetModel -from diffusers.utils import load_image -import torch - -url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg" - -init_image = load_image(url) -init_image = init_image.resize((512, 512)) - -path = "stable-diffusion-v1-5/stable-diffusion-v1-5" - -run_compile = True # Set True / False -controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16, use_safetensors=True) -pipe = StableDiffusionControlNetPipeline.from_pretrained( - path, controlnet=controlnet, torch_dtype=torch.float16, use_safetensors=True -) - -pipe = pipe.to("cuda") -pipe.unet.to(memory_format=torch.channels_last) -pipe.controlnet.to(memory_format=torch.channels_last) - -if run_compile: - 
print("Run torch compile") - pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) - pipe.controlnet = torch.compile(pipe.controlnet, mode="reduce-overhead", fullgraph=True) - -prompt = "ghibli style, a fantasy landscape with castles" - -for _ in range(3): - image = pipe(prompt=prompt, image=init_image).images[0] -``` - -### DeepFloyd IF text-to-image + upscaling - -```python -from diffusers import DiffusionPipeline -import torch - -run_compile = True # Set True / False - -pipe_1 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-M-v1.0", variant="fp16", text_encoder=None, torch_dtype=torch.float16, use_safetensors=True) -pipe_1.to("cuda") -pipe_2 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-II-M-v1.0", variant="fp16", text_encoder=None, torch_dtype=torch.float16, use_safetensors=True) -pipe_2.to("cuda") -pipe_3 = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16, use_safetensors=True) -pipe_3.to("cuda") - - -pipe_1.unet.to(memory_format=torch.channels_last) -pipe_2.unet.to(memory_format=torch.channels_last) -pipe_3.unet.to(memory_format=torch.channels_last) - -if run_compile: - pipe_1.unet = torch.compile(pipe_1.unet, mode="reduce-overhead", fullgraph=True) - pipe_2.unet = torch.compile(pipe_2.unet, mode="reduce-overhead", fullgraph=True) - pipe_3.unet = torch.compile(pipe_3.unet, mode="reduce-overhead", fullgraph=True) - -prompt = "the blue hulk" - -prompt_embeds = torch.randn((1, 2, 4096), dtype=torch.float16) -neg_prompt_embeds = torch.randn((1, 2, 4096), dtype=torch.float16) - -for _ in range(3): - image_1 = pipe_1(prompt_embeds=prompt_embeds, negative_prompt_embeds=neg_prompt_embeds, output_type="pt").images - image_2 = pipe_2(image=image_1, prompt_embeds=prompt_embeds, negative_prompt_embeds=neg_prompt_embeds, output_type="pt").images - image_3 = pipe_3(prompt=prompt, image=image_1, noise_level=100).images -``` -
- -The graph below highlights the relative speed-ups for the [`StableDiffusionPipeline`] across five GPU families with PyTorch 2.0 and `torch.compile` enabled. The benchmarks for the following graphs are measured in *number of iterations/second*. - -![t2i_speedup](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/pt2_benchmarks/t2i_speedup.png) - -To give you an even better idea of how this speed-up holds for the other pipelines, consider the following -graph for an A100 with PyTorch 2.0 and `torch.compile`: - -![a100_numbers](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/pt2_benchmarks/a100_numbers.png) - -In the following tables, we report our findings in terms of the *number of iterations/second*. - -### A100 (batch size: 1) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 21.66 | 23.13 | 44.03 | 49.74 | -| SD - img2img | 21.81 | 22.40 | 43.92 | 46.32 | -| SD - inpaint | 22.24 | 23.23 | 43.76 | 49.25 | -| SD - controlnet | 15.02 | 15.82 | 32.13 | 36.08 | -| IF | 20.21 /
13.84 /
24.00 | 20.12 /
13.70 /
24.03 | ❌ | 97.34 /
27.23 /
111.66 | -| SDXL - txt2img | 8.64 | 9.9 | - | - | - -### A100 (batch size: 4) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 11.6 | 13.12 | 14.62 | 17.27 | -| SD - img2img | 11.47 | 13.06 | 14.66 | 17.25 | -| SD - inpaint | 11.67 | 13.31 | 14.88 | 17.48 | -| SD - controlnet | 8.28 | 9.38 | 10.51 | 12.41 | -| IF | 25.02 | 18.04 | ❌ | 48.47 | -| SDXL - txt2img | 2.44 | 2.74 | - | - | - -### A100 (batch size: 16) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 3.04 | 3.6 | 3.83 | 4.68 | -| SD - img2img | 2.98 | 3.58 | 3.83 | 4.67 | -| SD - inpaint | 3.04 | 3.66 | 3.9 | 4.76 | -| SD - controlnet | 2.15 | 2.58 | 2.74 | 3.35 | -| IF | 8.78 | 9.82 | ❌ | 16.77 | -| SDXL - txt2img | 0.64 | 0.72 | - | - | - -### V100 (batch size: 1) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 18.99 | 19.14 | 20.95 | 22.17 | -| SD - img2img | 18.56 | 19.18 | 20.95 | 22.11 | -| SD - inpaint | 19.14 | 19.06 | 21.08 | 22.20 | -| SD - controlnet | 13.48 | 13.93 | 15.18 | 15.88 | -| IF | 20.01 /
9.08 /
23.34 | 19.79 /
8.98 /
24.10 | ❌ | 55.75 /
11.57 /
57.67 | - -### V100 (batch size: 4) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 5.96 | 5.89 | 6.83 | 6.86 | -| SD - img2img | 5.90 | 5.91 | 6.81 | 6.82 | -| SD - inpaint | 5.99 | 6.03 | 6.93 | 6.95 | -| SD - controlnet | 4.26 | 4.29 | 4.92 | 4.93 | -| IF | 15.41 | 14.76 | ❌ | 22.95 | - -### V100 (batch size: 16) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 1.66 | 1.66 | 1.92 | 1.90 | -| SD - img2img | 1.65 | 1.65 | 1.91 | 1.89 | -| SD - inpaint | 1.69 | 1.69 | 1.95 | 1.93 | -| SD - controlnet | 1.19 | 1.19 | OOM after warmup | 1.36 | -| IF | 5.43 | 5.29 | ❌ | 7.06 | - -### T4 (batch size: 1) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 6.9 | 6.95 | 7.3 | 7.56 | -| SD - img2img | 6.84 | 6.99 | 7.04 | 7.55 | -| SD - inpaint | 6.91 | 6.7 | 7.01 | 7.37 | -| SD - controlnet | 4.89 | 4.86 | 5.35 | 5.48 | -| IF | 17.42 /
2.47 /
18.52 | 16.96 /
2.45 /
18.69 | ❌ | 24.63 /
2.47 /
23.39 | -| SDXL - txt2img | 1.15 | 1.16 | - | - | - -### T4 (batch size: 4) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 1.79 | 1.79 | 2.03 | 1.99 | -| SD - img2img | 1.77 | 1.77 | 2.05 | 2.04 | -| SD - inpaint | 1.81 | 1.82 | 2.09 | 2.09 | -| SD - controlnet | 1.34 | 1.27 | 1.47 | 1.46 | -| IF | 5.79 | 5.61 | ❌ | 7.39 | -| SDXL - txt2img | 0.288 | 0.289 | - | - | - -### T4 (batch size: 16) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 2.34s | 2.30s | OOM after 2nd iteration | 1.99s | -| SD - img2img | 2.35s | 2.31s | OOM after warmup | 2.00s | -| SD - inpaint | 2.30s | 2.26s | OOM after 2nd iteration | 1.95s | -| SD - controlnet | OOM after 2nd iteration | OOM after 2nd iteration | OOM after warmup | OOM after warmup | -| IF * | 1.44 | 1.44 | ❌ | 1.94 | -| SDXL - txt2img | OOM | OOM | - | - | - -### RTX 3090 (batch size: 1) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 22.56 | 22.84 | 23.84 | 25.69 | -| SD - img2img | 22.25 | 22.61 | 24.1 | 25.83 | -| SD - inpaint | 22.22 | 22.54 | 24.26 | 26.02 | -| SD - controlnet | 16.03 | 16.33 | 17.38 | 18.56 | -| IF | 27.08 /
9.07 /
31.23 | 26.75 /
8.92 /
31.47 | ❌ | 68.08 /
11.16 /
65.29 | - -### RTX 3090 (batch size: 4) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 6.46 | 6.35 | 7.29 | 7.3 | -| SD - img2img | 6.33 | 6.27 | 7.31 | 7.26 | -| SD - inpaint | 6.47 | 6.4 | 7.44 | 7.39 | -| SD - controlnet | 4.59 | 4.54 | 5.27 | 5.26 | -| IF | 16.81 | 16.62 | ❌ | 21.57 | - -### RTX 3090 (batch size: 16) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 1.7 | 1.69 | 1.93 | 1.91 | -| SD - img2img | 1.68 | 1.67 | 1.93 | 1.9 | -| SD - inpaint | 1.72 | 1.71 | 1.97 | 1.94 | -| SD - controlnet | 1.23 | 1.22 | 1.4 | 1.38 | -| IF | 5.01 | 5.00 | ❌ | 6.33 | - -### RTX 4090 (batch size: 1) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 40.5 | 41.89 | 44.65 | 49.81 | -| SD - img2img | 40.39 | 41.95 | 44.46 | 49.8 | -| SD - inpaint | 40.51 | 41.88 | 44.58 | 49.72 | -| SD - controlnet | 29.27 | 30.29 | 32.26 | 36.03 | -| IF | 69.71 /
18.78 /
85.49 | 69.13 /
18.80 /
85.56 | ❌ | 124.60 /
26.37 /
138.79 | -| SDXL - txt2img | 6.8 | 8.18 | - | - | - -### RTX 4090 (batch size: 4) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** |
-|:---:|:---:|:---:|:---:|:---:|
-| SD - txt2img | 12.62 | 12.84 | 15.32 | 15.59 |
-| SD - img2img | 12.61 | 12.79 | 15.35 | 15.66 |
-| SD - inpaint | 12.65 | 12.81 | 15.3 | 15.58 |
-| SD - controlnet | 9.1 | 9.25 | 11.03 | 11.22 |
-| IF | 31.88 | 31.14 | ❌ | 43.92 |
-| SDXL - txt2img | 2.19 | 2.35 | - | - |
-
-### RTX 4090 (batch size: 16)
-
-| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 3.17 | 3.2 | 3.84 | 3.85 | -| SD - img2img | 3.16 | 3.2 | 3.84 | 3.85 | -| SD - inpaint | 3.17 | 3.2 | 3.85 | 3.85 | -| SD - controlnet | 2.23 | 2.3 | 2.7 | 2.75 | -| IF | 9.26 | 9.2 | ❌ | 13.31 | -| SDXL - txt2img | 0.52 | 0.53 | - | - | - -## Notes - -* Follow this [PR](https://github.com/huggingface/diffusers/pull/3313) for more details on the environment used for conducting the benchmarks. -* For the DeepFloyd IF pipeline where batch sizes > 1, we only used a batch size of > 1 in the first IF pipeline for text-to-image generation and NOT for upscaling. That means the two upscaling pipelines received a batch size of 1. - -*Thanks to [Horace He](https://github.com/Chillee) from the PyTorch team for their support in improving our support of `torch.compile()` in Diffusers.* diff --git a/diffusers/docs/source/en/optimization/xdit.md b/diffusers/docs/source/en/optimization/xdit.md deleted file mode 100644 index 33ff8dc255d0166beaa4231b0bec2064a97db651..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/optimization/xdit.md +++ /dev/null @@ -1,121 +0,0 @@ -# xDiT - -[xDiT](https://github.com/xdit-project/xDiT) is an inference engine designed for the large scale parallel deployment of Diffusion Transformers (DiTs). xDiT provides a suite of efficient parallel approaches for Diffusion Models, as well as GPU kernel accelerations. - -There are four parallel methods supported in xDiT, including [Unified Sequence Parallelism](https://arxiv.org/abs/2405.07719), [PipeFusion](https://arxiv.org/abs/2405.14430), CFG parallelism and data parallelism. The four parallel methods in xDiT can be configured in a hybrid manner, optimizing communication patterns to best suit the underlying network hardware. - -Optimization orthogonal to parallelization focuses on accelerating single GPU performance. In addition to utilizing well-known Attention optimization libraries, we leverage compilation acceleration technologies such as torch.compile and onediff. - -The overview of xDiT is shown as follows. - -
*(figure: overview of the xDiT architecture)*
-You can install xDiT using the following command: - - -```bash -pip install xfuser -``` - -Here's an example of using xDiT to accelerate inference of a Diffusers model. - -```diff - import torch - from diffusers import StableDiffusion3Pipeline - - from xfuser import xFuserArgs, xDiTParallel - from xfuser.config import FlexibleArgumentParser - from xfuser.core.distributed import get_world_group - - def main(): -+ parser = FlexibleArgumentParser(description="xFuser Arguments") -+ args = xFuserArgs.add_cli_args(parser).parse_args() -+ engine_args = xFuserArgs.from_cli_args(args) -+ engine_config, input_config = engine_args.create_config() - - local_rank = get_world_group().local_rank - pipe = StableDiffusion3Pipeline.from_pretrained( - pretrained_model_name_or_path=engine_config.model_config.model, - torch_dtype=torch.float16, - ).to(f"cuda:{local_rank}") - -# do anything you want with pipeline here - -+ pipe = xDiTParallel(pipe, engine_config, input_config) - - pipe( - height=input_config.height, - width=input_config.height, - prompt=input_config.prompt, - num_inference_steps=input_config.num_inference_steps, - output_type=input_config.output_type, - generator=torch.Generator(device="cuda").manual_seed(input_config.seed), - ) - -+ if input_config.output_type == "pil": -+ pipe.save("results", "stable_diffusion_3") - -if __name__ == "__main__": - main() - -``` - -As you can see, we only need to use xFuserArgs from xDiT to get configuration parameters, and pass these parameters along with the pipeline object from the Diffusers library into xDiTParallel to complete the parallelization of a specific pipeline in Diffusers. - -xDiT runtime parameters can be viewed in the command line using `-h`, and you can refer to this [usage](https://github.com/xdit-project/xDiT?tab=readme-ov-file#2-usage) example for more details. - -xDiT needs to be launched using torchrun to support its multi-node, multi-GPU parallel capabilities. For example, the following command can be used for 8-GPU parallel inference: - -```bash -torchrun --nproc_per_node=8 ./inference.py --model models/FLUX.1-dev --data_parallel_degree 2 --ulysses_degree 2 --ring_degree 2 --prompt "A snowy mountain" "A small dog" --num_inference_steps 50 -``` - -## Supported models - -A subset of Diffusers models are supported in xDiT, such as Flux.1, Stable Diffusion 3, etc. The latest supported models can be found [here](https://github.com/xdit-project/xDiT?tab=readme-ov-file#-supported-dits). - -## Benchmark -We tested different models on various machines, and here is some of the benchmark data. - -### Flux.1-schnell -
*(benchmark figures for Flux.1-schnell)*
- -### Stable Diffusion 3 -
*(benchmark figures for Stable Diffusion 3)*
- -### HunyuanDiT -
*(benchmark figures for HunyuanDiT)*
- -More detailed performance metric can be found on our [github page](https://github.com/xdit-project/xDiT?tab=readme-ov-file#perf). - -## Reference - -[xDiT-project](https://github.com/xdit-project/xDiT) - -[USP: A Unified Sequence Parallelism Approach for Long Context Generative AI](https://arxiv.org/abs/2405.07719) - -[PipeFusion: Displaced Patch Pipeline Parallelism for Inference of Diffusion Transformer Models](https://arxiv.org/abs/2405.14430) \ No newline at end of file diff --git a/diffusers/docs/source/en/optimization/xformers.md b/diffusers/docs/source/en/optimization/xformers.md deleted file mode 100644 index 4ef0da9e890dcc68d6f4b143d71841c2701d573e..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/optimization/xformers.md +++ /dev/null @@ -1,35 +0,0 @@ - - -# xFormers - -We recommend [xFormers](https://github.com/facebookresearch/xformers) for both inference and training. In our tests, the optimizations performed in the attention blocks allow for both faster speed and reduced memory consumption. - -Install xFormers from `pip`: - -```bash -pip install xformers -``` - - - -The xFormers `pip` package requires the latest version of PyTorch. If you need to use a previous version of PyTorch, then we recommend [installing xFormers from the source](https://github.com/facebookresearch/xformers#installing-xformers). - - - -After xFormers is installed, you can use `enable_xformers_memory_efficient_attention()` for faster inference and reduced memory consumption as shown in this [section](memory#memory-efficient-attention). - - - -According to this [issue](https://github.com/huggingface/diffusers/issues/2234#issuecomment-1416931212), xFormers `v0.0.16` cannot be used for training (fine-tune or DreamBooth) in some GPUs. If you observe this problem, please install a development version as indicated in the issue comments. - - diff --git a/diffusers/docs/source/en/quantization/bitsandbytes.md b/diffusers/docs/source/en/quantization/bitsandbytes.md deleted file mode 100644 index 118511b75d50e4c21ed948cc85e9a135f86398b3..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/quantization/bitsandbytes.md +++ /dev/null @@ -1,260 +0,0 @@ - - -# bitsandbytes - -[bitsandbytes](https://huggingface.co/docs/bitsandbytes/index) is the easiest option for quantizing a model to 8 and 4-bit. 8-bit quantization multiplies outliers in fp16 with non-outliers in int8, converts the non-outlier values back to fp16, and then adds them together to return the weights in fp16. This reduces the degradative effect outlier values have on a model's performance. - -4-bit quantization compresses a model even further, and it is commonly used with [QLoRA](https://hf.co/papers/2305.14314) to finetune quantized LLMs. - - -To use bitsandbytes, make sure you have the following libraries installed: - -```bash -pip install diffusers transformers accelerate bitsandbytes -U -``` - -Now you can quantize a model by passing a [`BitsAndBytesConfig`] to [`~ModelMixin.from_pretrained`]. This works for any model in any modality, as long as it supports loading with [Accelerate](https://hf.co/docs/accelerate/index) and contains `torch.nn.Linear` layers. 
- - - - -Quantizing a model in 8-bit halves the memory-usage: - -```py -from diffusers import FluxTransformer2DModel, BitsAndBytesConfig - -quantization_config = BitsAndBytesConfig(load_in_8bit=True) - -model_8bit = FluxTransformer2DModel.from_pretrained( - "black-forest-labs/FLUX.1-dev", - subfolder="transformer", - quantization_config=quantization_config -) -``` - -By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter if you want: - -```py -from diffusers import FluxTransformer2DModel, BitsAndBytesConfig - -quantization_config = BitsAndBytesConfig(load_in_8bit=True) - -model_8bit = FluxTransformer2DModel.from_pretrained( - "black-forest-labs/FLUX.1-dev", - subfolder="transformer", - quantization_config=quantization_config, - torch_dtype=torch.float32 -) -model_8bit.transformer_blocks.layers[-1].norm2.weight.dtype -``` - -Once a model is quantized, you can push the model to the Hub with the [`~ModelMixin.push_to_hub`] method. The quantization `config.json` file is pushed first, followed by the quantized model weights. You can also save the serialized 4-bit models locally with [`~ModelMixin.save_pretrained`]. - - - - -Quantizing a model in 4-bit reduces your memory-usage by 4x: - -```py -from diffusers import FluxTransformer2DModel, BitsAndBytesConfig - -quantization_config = BitsAndBytesConfig(load_in_4bit=True) - -model_4bit = FluxTransformer2DModel.from_pretrained( - "black-forest-labs/FLUX.1-dev", - subfolder="transformer", - quantization_config=quantization_config -) -``` - -By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter if you want: - -```py -from diffusers import FluxTransformer2DModel, BitsAndBytesConfig - -quantization_config = BitsAndBytesConfig(load_in_4bit=True) - -model_4bit = FluxTransformer2DModel.from_pretrained( - "black-forest-labs/FLUX.1-dev", - subfolder="transformer", - quantization_config=quantization_config, - torch_dtype=torch.float32 -) -model_4bit.transformer_blocks.layers[-1].norm2.weight.dtype -``` - -Call [`~ModelMixin.push_to_hub`] after loading it in 4-bit precision. You can also save the serialized 4-bit models locally with [`~ModelMixin.save_pretrained`]. - - - - - - -Training with 8-bit and 4-bit weights are only supported for training *extra* parameters. - - - -Check your memory footprint with the `get_memory_footprint` method: - -```py -print(model.get_memory_footprint()) -``` - -Quantized models can be loaded from the [`~ModelMixin.from_pretrained`] method without needing to specify the `quantization_config` parameters: - -```py -from diffusers import FluxTransformer2DModel, BitsAndBytesConfig - -quantization_config = BitsAndBytesConfig(load_in_4bit=True) - -model_4bit = FluxTransformer2DModel.from_pretrained( - "hf-internal-testing/flux.1-dev-nf4-pkg", subfolder="transformer" -) -``` - -## 8-bit (LLM.int8() algorithm) - - - -Learn more about the details of 8-bit quantization in this [blog post](https://huggingface.co/blog/hf-bitsandbytes-integration)! - - - -This section explores some of the specific features of 8-bit models, such as outlier thresholds and skipping module conversion. - -### Outlier threshold - -An "outlier" is a hidden state value greater than a certain threshold, and these values are computed in fp16. 
While the values are usually normally distributed ([-3.5, 3.5]), this distribution can be very different for large models ([-60, 6] or [6, 60]). 8-bit quantization works well for values ~5, but beyond that, there is a significant performance penalty. A good default threshold value is 6, but a lower threshold may be needed for more unstable models (small models or finetuning). - -To find the best threshold for your model, we recommend experimenting with the `llm_int8_threshold` parameter in [`BitsAndBytesConfig`]: - -```py -from diffusers import FluxTransformer2DModel, BitsAndBytesConfig - -quantization_config = BitsAndBytesConfig( - load_in_8bit=True, llm_int8_threshold=10, -) - -model_8bit = FluxTransformer2DModel.from_pretrained( - "black-forest-labs/FLUX.1-dev", - subfolder="transformer", - quantization_config=quantization_config, -) -``` - -### Skip module conversion - -For some models, you don't need to quantize every module to 8-bit which can actually cause instability. For example, for diffusion models like [Stable Diffusion 3](../api/pipelines/stable_diffusion/stable_diffusion_3), the `proj_out` module can be skipped using the `llm_int8_skip_modules` parameter in [`BitsAndBytesConfig`]: - -```py -from diffusers import SD3Transformer2DModel, BitsAndBytesConfig - -quantization_config = BitsAndBytesConfig( - load_in_8bit=True, llm_int8_skip_modules=["proj_out"], -) - -model_8bit = SD3Transformer2DModel.from_pretrained( - "stabilityai/stable-diffusion-3-medium-diffusers", - subfolder="transformer", - quantization_config=quantization_config, -) -``` - - -## 4-bit (QLoRA algorithm) - - - -Learn more about its details in this [blog post](https://huggingface.co/blog/4bit-transformers-bitsandbytes). - - - -This section explores some of the specific features of 4-bit models, such as changing the compute data type, using the Normal Float 4 (NF4) data type, and using nested quantization. - - -### Compute data type - -To speedup computation, you can change the data type from float32 (the default value) to bf16 using the `bnb_4bit_compute_dtype` parameter in [`BitsAndBytesConfig`]: - -```py -import torch -from diffusers import BitsAndBytesConfig - -quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16) -``` - -### Normal Float 4 (NF4) - -NF4 is a 4-bit data type from the [QLoRA](https://hf.co/papers/2305.14314) paper, adapted for weights initialized from a normal distribution. You should use NF4 for training 4-bit base models. This can be configured with the `bnb_4bit_quant_type` parameter in the [`BitsAndBytesConfig`]: - -```py -from diffusers import BitsAndBytesConfig - -nf4_config = BitsAndBytesConfig( - load_in_4bit=True, - bnb_4bit_quant_type="nf4", -) - -model_nf4 = SD3Transformer2DModel.from_pretrained( - "stabilityai/stable-diffusion-3-medium-diffusers", - subfolder="transformer", - quantization_config=nf4_config, -) -``` - -For inference, the `bnb_4bit_quant_type` does not have a huge impact on performance. However, to remain consistent with the model weights, you should use the `bnb_4bit_compute_dtype` and `torch_dtype` values. - -### Nested quantization - -Nested quantization is a technique that can save additional memory at no additional performance cost. This feature performs a second quantization of the already quantized weights to save an additional 0.4 bits/parameter. 
- -```py -from diffusers import BitsAndBytesConfig - -double_quant_config = BitsAndBytesConfig( - load_in_4bit=True, - bnb_4bit_use_double_quant=True, -) - -double_quant_model = SD3Transformer2DModel.from_pretrained( - "stabilityai/stable-diffusion-3-medium-diffusers", - subfolder="transformer", - quantization_config=double_quant_config, -) -``` - -## Dequantizing `bitsandbytes` models - -Once quantized, you can dequantize the model to the original precision but this might result in a small quality loss of the model. Make sure you have enough GPU RAM to fit the dequantized model. - -```python -from diffusers import BitsAndBytesConfig - -double_quant_config = BitsAndBytesConfig( - load_in_4bit=True, - bnb_4bit_use_double_quant=True, -) - -double_quant_model = SD3Transformer2DModel.from_pretrained( - "stabilityai/stable-diffusion-3-medium-diffusers", - subfolder="transformer", - quantization_config=double_quant_config, -) -model.dequantize() -``` - -## Resources - -* [End-to-end notebook showing Flux.1 Dev inference in a free-tier Colab](https://gist.github.com/sayakpaul/c76bd845b48759e11687ac550b99d8b4) -* [Training](https://gist.github.com/sayakpaul/05afd428bc089b47af7c016e42004527) \ No newline at end of file diff --git a/diffusers/docs/source/en/quantization/overview.md b/diffusers/docs/source/en/quantization/overview.md deleted file mode 100644 index d8adbc85a259c04d56c9d86ea5d8e13f2681796c..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/quantization/overview.md +++ /dev/null @@ -1,35 +0,0 @@ - - -# Quantization - -Quantization techniques focus on representing data with less information while also trying to not lose too much accuracy. This often means converting a data type to represent the same information with fewer bits. For example, if your model weights are stored as 32-bit floating points and they're quantized to 16-bit floating points, this halves the model size which makes it easier to store and reduces memory-usage. Lower precision can also speedup inference because it takes less time to perform calculations with fewer bits. - - - -Interested in adding a new quantization method to Transformers? Refer to the [Contribute new quantization method guide](https://huggingface.co/docs/transformers/main/en/quantization/contribute) to learn more about adding a new quantization method. - - - - - -If you are new to the quantization field, we recommend you to check out these beginner-friendly courses about quantization in collaboration with DeepLearning.AI: - -* [Quantization Fundamentals with Hugging Face](https://www.deeplearning.ai/short-courses/quantization-fundamentals-with-hugging-face/) -* [Quantization in Depth](https://www.deeplearning.ai/short-courses/quantization-in-depth/) - - - -## When to use what? - -This section will be expanded once Diffusers has multiple quantization backends. Currently, we only support `bitsandbytes`. [This resource](https://huggingface.co/docs/transformers/main/en/quantization/overview#when-to-use-what) provides a good overview of the pros and cons of different quantization techniques. 
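Since `bitsandbytes` is currently the only supported backend, the practical question is usually how a quantized model gets used end to end. The sketch below is illustrative rather than canonical: it reuses the Flux checkpoint and `BitsAndBytesConfig` options from the bitsandbytes guide above, and the bf16 dtype and CPU offloading are assumptions you can drop or change.

```py
import torch
from diffusers import BitsAndBytesConfig, FluxPipeline, FluxTransformer2DModel

# Quantize only the large transformer; the other pipeline components stay in bf16.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
)
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
)

# Pass the quantized transformer in place of the full-precision one.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", transformer=transformer, torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # optional, reduces peak GPU memory

image = pipe("a tiny astronaut hatching from an egg on the moon").images[0]
```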
\ No newline at end of file diff --git a/diffusers/docs/source/en/quicktour.md b/diffusers/docs/source/en/quicktour.md deleted file mode 100644 index 2d9f7fe3736ab5915496adfa5e8c25bc942588bf..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/quicktour.md +++ /dev/null @@ -1,320 +0,0 @@ - - -[[open-in-colab]] - -# Quicktour - -Diffusion models are trained to denoise random Gaussian noise step-by-step to generate a sample of interest, such as an image or audio. This has sparked a tremendous amount of interest in generative AI, and you have probably seen examples of diffusion generated images on the internet. 🧨 Diffusers is a library aimed at making diffusion models widely accessible to everyone. - -Whether you're a developer or an everyday user, this quicktour will introduce you to 🧨 Diffusers and help you get up and generating quickly! There are three main components of the library to know about: - -* The [`DiffusionPipeline`] is a high-level end-to-end class designed to rapidly generate samples from pretrained diffusion models for inference. -* Popular pretrained [model](./api/models) architectures and modules that can be used as building blocks for creating diffusion systems. -* Many different [schedulers](./api/schedulers/overview) - algorithms that control how noise is added for training, and how to generate denoised images during inference. - -The quicktour will show you how to use the [`DiffusionPipeline`] for inference, and then walk you through how to combine a model and scheduler to replicate what's happening inside the [`DiffusionPipeline`]. - - - -The quicktour is a simplified version of the introductory 🧨 Diffusers [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/diffusers_intro.ipynb) to help you get started quickly. If you want to learn more about 🧨 Diffusers' goal, design philosophy, and additional details about its core API, check out the notebook! - - - -Before you begin, make sure you have all the necessary libraries installed: - -```py -# uncomment to install the necessary libraries in Colab -#!pip install --upgrade diffusers accelerate transformers -``` - -- [🤗 Accelerate](https://huggingface.co/docs/accelerate/index) speeds up model loading for inference and training. -- [🤗 Transformers](https://huggingface.co/docs/transformers/index) is required to run the most popular diffusion models, such as [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview). - -## DiffusionPipeline - -The [`DiffusionPipeline`] is the easiest way to use a pretrained diffusion system for inference. It is an end-to-end system containing the model and the scheduler. You can use the [`DiffusionPipeline`] out-of-the-box for many tasks. Take a look at the table below for some supported tasks, and for a complete list of supported tasks, check out the [🧨 Diffusers Summary](./api/pipelines/overview#diffusers-summary) table. 
- -| **Task** | **Description** | **Pipeline** -|------------------------------|--------------------------------------------------------------------------------------------------------------|-----------------| -| Unconditional Image Generation | generate an image from Gaussian noise | [unconditional_image_generation](./using-diffusers/unconditional_image_generation) | -| Text-Guided Image Generation | generate an image given a text prompt | [conditional_image_generation](./using-diffusers/conditional_image_generation) | -| Text-Guided Image-to-Image Translation | adapt an image guided by a text prompt | [img2img](./using-diffusers/img2img) | -| Text-Guided Image-Inpainting | fill the masked part of an image given the image, the mask and a text prompt | [inpaint](./using-diffusers/inpaint) | -| Text-Guided Depth-to-Image Translation | adapt parts of an image guided by a text prompt while preserving structure via depth estimation | [depth2img](./using-diffusers/depth2img) | - -Start by creating an instance of a [`DiffusionPipeline`] and specify which pipeline checkpoint you would like to download. -You can use the [`DiffusionPipeline`] for any [checkpoint](https://huggingface.co/models?library=diffusers&sort=downloads) stored on the Hugging Face Hub. -In this quicktour, you'll load the [`stable-diffusion-v1-5`](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5) checkpoint for text-to-image generation. - - - -For [Stable Diffusion](https://huggingface.co/CompVis/stable-diffusion) models, please carefully read the [license](https://huggingface.co/spaces/CompVis/stable-diffusion-license) first before running the model. 🧨 Diffusers implements a [`safety_checker`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/safety_checker.py) to prevent offensive or harmful content, but the model's improved image generation capabilities can still produce potentially harmful content. - - - -Load the model with the [`~DiffusionPipeline.from_pretrained`] method: - -```python ->>> from diffusers import DiffusionPipeline - ->>> pipeline = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", use_safetensors=True) -``` - -The [`DiffusionPipeline`] downloads and caches all modeling, tokenization, and scheduling components. You'll see that the Stable Diffusion pipeline is composed of the [`UNet2DConditionModel`] and [`PNDMScheduler`] among other things: - -```py ->>> pipeline -StableDiffusionPipeline { - "_class_name": "StableDiffusionPipeline", - "_diffusers_version": "0.21.4", - ..., - "scheduler": [ - "diffusers", - "PNDMScheduler" - ], - ..., - "unet": [ - "diffusers", - "UNet2DConditionModel" - ], - "vae": [ - "diffusers", - "AutoencoderKL" - ] -} -``` - -We strongly recommend running the pipeline on a GPU because the model consists of roughly 1.4 billion parameters. -You can move the generator object to a GPU, just like you would in PyTorch: - -```python ->>> pipeline.to("cuda") -``` - -Now you can pass a text prompt to the `pipeline` to generate an image, and then access the denoised image. By default, the image output is wrapped in a [`PIL.Image`](https://pillow.readthedocs.io/en/stable/reference/Image.html?highlight=image#the-image-class) object. - -```python ->>> image = pipeline("An image of a squirrel in Picasso style").images[0] ->>> image -``` - -
- -Save the image by calling `save`: - -```python ->>> image.save("image_of_squirrel_painting.png") -``` - -### Local pipeline - -You can also use the pipeline locally. The only difference is you need to download the weights first: - -```bash -!git lfs install -!git clone https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5 -``` - -Then load the saved weights into the pipeline: - -```python ->>> pipeline = DiffusionPipeline.from_pretrained("./stable-diffusion-v1-5", use_safetensors=True) -``` - -Now, you can run the pipeline as you would in the section above. - -### Swapping schedulers - -Different schedulers come with different denoising speeds and quality trade-offs. The best way to find out which one works best for you is to try them out! One of the main features of 🧨 Diffusers is to allow you to easily switch between schedulers. For example, to replace the default [`PNDMScheduler`] with the [`EulerDiscreteScheduler`], load it with the [`~diffusers.ConfigMixin.from_config`] method: - -```py ->>> from diffusers import EulerDiscreteScheduler - ->>> pipeline = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", use_safetensors=True) ->>> pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config) -``` - -Try generating an image with the new scheduler and see if you notice a difference! - -In the next section, you'll take a closer look at the components - the model and scheduler - that make up the [`DiffusionPipeline`] and learn how to use these components to generate an image of a cat. - -## Models - -Most models take a noisy sample, and at each timestep it predicts the *noise residual* (other models learn to predict the previous sample directly or the velocity or [`v-prediction`](https://github.com/huggingface/diffusers/blob/5e5ce13e2f89ac45a0066cb3f369462a3cf1d9ef/src/diffusers/schedulers/scheduling_ddim.py#L110)), the difference between a less noisy image and the input image. You can mix and match models to create other diffusion systems. - -Models are initiated with the [`~ModelMixin.from_pretrained`] method which also locally caches the model weights so it is faster the next time you load the model. For the quicktour, you'll load the [`UNet2DModel`], a basic unconditional image generation model with a checkpoint trained on cat images: - -```py ->>> from diffusers import UNet2DModel - ->>> repo_id = "google/ddpm-cat-256" ->>> model = UNet2DModel.from_pretrained(repo_id, use_safetensors=True) -``` - -To access the model parameters, call `model.config`: - -```py ->>> model.config -``` - -The model configuration is a 🧊 frozen 🧊 dictionary, which means those parameters can't be changed after the model is created. This is intentional and ensures that the parameters used to define the model architecture at the start remain the same, while other parameters can still be adjusted during inference. - -Some of the most important parameters are: - -* `sample_size`: the height and width dimension of the input sample. -* `in_channels`: the number of input channels of the input sample. -* `down_block_types` and `up_block_types`: the type of down- and upsampling blocks used to create the UNet architecture. -* `block_out_channels`: the number of output channels of the downsampling blocks; also used in reverse order for the number of input channels of the upsampling blocks. -* `layers_per_block`: the number of ResNet blocks present in each UNet block. 
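For example, for the cat checkpoint loaded above you can read a couple of these values directly; they determine the shape of the noisy sample created in the next step:

```py
>>> model.config.sample_size
256
>>> model.config.in_channels
3
```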
- -To use the model for inference, create the image shape with random Gaussian noise. It should have a `batch` axis because the model can receive multiple random noises, a `channel` axis corresponding to the number of input channels, and a `sample_size` axis for the height and width of the image: - -```py ->>> import torch - ->>> torch.manual_seed(0) - ->>> noisy_sample = torch.randn(1, model.config.in_channels, model.config.sample_size, model.config.sample_size) ->>> noisy_sample.shape -torch.Size([1, 3, 256, 256]) -``` - -For inference, pass the noisy image and a `timestep` to the model. The `timestep` indicates how noisy the input image is, with more noise at the beginning and less at the end. This helps the model determine its position in the diffusion process, whether it is closer to the start or the end. Use the `sample` method to get the model output: - -```py ->>> with torch.no_grad(): -... noisy_residual = model(sample=noisy_sample, timestep=2).sample -``` - -To generate actual examples though, you'll need a scheduler to guide the denoising process. In the next section, you'll learn how to couple a model with a scheduler. - -## Schedulers - -Schedulers manage going from a noisy sample to a less noisy sample given the model output - in this case, it is the `noisy_residual`. - - - -🧨 Diffusers is a toolbox for building diffusion systems. While the [`DiffusionPipeline`] is a convenient way to get started with a pre-built diffusion system, you can also choose your own model and scheduler components separately to build a custom diffusion system. - - - -For the quicktour, you'll instantiate the [`DDPMScheduler`] with its [`~diffusers.ConfigMixin.from_config`] method: - -```py ->>> from diffusers import DDPMScheduler - ->>> scheduler = DDPMScheduler.from_pretrained(repo_id) ->>> scheduler -DDPMScheduler { - "_class_name": "DDPMScheduler", - "_diffusers_version": "0.21.4", - "beta_end": 0.02, - "beta_schedule": "linear", - "beta_start": 0.0001, - "clip_sample": true, - "clip_sample_range": 1.0, - "dynamic_thresholding_ratio": 0.995, - "num_train_timesteps": 1000, - "prediction_type": "epsilon", - "sample_max_value": 1.0, - "steps_offset": 0, - "thresholding": false, - "timestep_spacing": "leading", - "trained_betas": null, - "variance_type": "fixed_small" -} -``` - - - -💡 Unlike a model, a scheduler does not have trainable weights and is parameter-free! - - - -Some of the most important parameters are: - -* `num_train_timesteps`: the length of the denoising process or, in other words, the number of timesteps required to process random Gaussian noise into a data sample. -* `beta_schedule`: the type of noise schedule to use for inference and training. -* `beta_start` and `beta_end`: the start and end noise values for the noise schedule. - -To predict a slightly less noisy image, pass the following to the scheduler's [`~diffusers.DDPMScheduler.step`] method: model output, `timestep`, and current `sample`. - -```py ->>> less_noisy_sample = scheduler.step(model_output=noisy_residual, timestep=2, sample=noisy_sample).prev_sample ->>> less_noisy_sample.shape -torch.Size([1, 3, 256, 256]) -``` - -The `less_noisy_sample` can be passed to the next `timestep` where it'll get even less noisy! Let's bring it all together now and visualize the entire denoising process. - -First, create a function that postprocesses and displays the denoised image as a `PIL.Image`: - -```py ->>> import PIL.Image ->>> import numpy as np - - ->>> def display_sample(sample, i): -... 
image_processed = sample.cpu().permute(0, 2, 3, 1) -... image_processed = (image_processed + 1.0) * 127.5 -... image_processed = image_processed.numpy().astype(np.uint8) - -... image_pil = PIL.Image.fromarray(image_processed[0]) -... display(f"Image at step {i}") -... display(image_pil) -``` - -To speed up the denoising process, move the input and model to a GPU: - -```py ->>> model.to("cuda") ->>> noisy_sample = noisy_sample.to("cuda") -``` - -Now create a denoising loop that predicts the residual of the less noisy sample, and computes the less noisy sample with the scheduler: - -```py ->>> import tqdm - ->>> sample = noisy_sample - ->>> for i, t in enumerate(tqdm.tqdm(scheduler.timesteps)): -... # 1. predict noise residual -... with torch.no_grad(): -... residual = model(sample, t).sample - -... # 2. compute less noisy image and set x_t -> x_t-1 -... sample = scheduler.step(residual, t, sample).prev_sample - -... # 3. optionally look at image -... if (i + 1) % 50 == 0: -... display_sample(sample, i + 1) -``` - -Sit back and watch as a cat is generated from nothing but noise! 😻 - -
- -## Next steps - -Hopefully, you generated some cool images with 🧨 Diffusers in this quicktour! For your next steps, you can: - -* Train or finetune a model to generate your own images in the [training](./tutorials/basic_training) tutorial. -* See example official and community [training or finetuning scripts](https://github.com/huggingface/diffusers/tree/main/examples#-diffusers-examples) for a variety of use cases. -* Learn more about loading, accessing, changing, and comparing schedulers in the [Using different Schedulers](./using-diffusers/schedulers) guide. -* Explore prompt engineering, speed and memory optimizations, and tips and tricks for generating higher-quality images with the [Stable Diffusion](./stable_diffusion) guide. -* Dive deeper into speeding up 🧨 Diffusers with guides on [optimized PyTorch on a GPU](./optimization/fp16), and inference guides for running [Stable Diffusion on Apple Silicon (M1/M2)](./optimization/mps) and [ONNX Runtime](./optimization/onnx). diff --git a/diffusers/docs/source/en/stable_diffusion.md b/diffusers/docs/source/en/stable_diffusion.md deleted file mode 100644 index fc20d259f5f7f6f906440346c092edd3c1a38c3f..0000000000000000000000000000000000000000 --- a/diffusers/docs/source/en/stable_diffusion.md +++ /dev/null @@ -1,261 +0,0 @@ - - -# Effective and efficient diffusion - -[[open-in-colab]] - -Getting the [`DiffusionPipeline`] to generate images in a certain style or include what you want can be tricky. Often times, you have to run the [`DiffusionPipeline`] several times before you end up with an image you're happy with. But generating something out of nothing is a computationally intensive process, especially if you're running inference over and over again. - -This is why it's important to get the most *computational* (speed) and *memory* (GPU vRAM) efficiency from the pipeline to reduce the time between inference cycles so you can iterate faster. - -This tutorial walks you through how to generate faster and better with the [`DiffusionPipeline`]. - -Begin by loading the [`stable-diffusion-v1-5/stable-diffusion-v1-5`](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5) model: - -```python -from diffusers import DiffusionPipeline - -model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5" -pipeline = DiffusionPipeline.from_pretrained(model_id, use_safetensors=True) -``` - -The example prompt you'll use is a portrait of an old warrior chief, but feel free to use your own prompt: - -```python -prompt = "portrait photo of a old warrior chief" -``` - -## Speed - - - -💡 If you don't have access to a GPU, you can use one for free from a GPU provider like [Colab](https://colab.research.google.com/)! - - - -One of the simplest ways to speed up inference is to place the pipeline on a GPU the same way you would with any PyTorch module: - -```python -pipeline = pipeline.to("cuda") -``` - -To make sure you can use the same image and improve on it, use a [`Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) and set a seed for [reproducibility](./using-diffusers/reusing_seeds): - -```python -import torch - -generator = torch.Generator("cuda").manual_seed(0) -``` - -Now you can generate an image: - -```python -image = pipeline(prompt, generator=generator).images[0] -image -``` - -
- -This process took ~30 seconds on a T4 GPU (it might be faster if your allocated GPU is better than a T4). By default, the [`DiffusionPipeline`] runs inference with full `float32` precision for 50 inference steps. You can speed this up by switching to a lower precision like `float16` or running fewer inference steps. - -Let's start by loading the model in `float16` and generate an image: - -```python -import torch - -pipeline = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16, use_safetensors=True) -pipeline = pipeline.to("cuda") -generator = torch.Generator("cuda").manual_seed(0) -image = pipeline(prompt, generator=generator).images[0] -image -``` - -
- -This time, it only took ~11 seconds to generate the image, which is almost 3x faster than before! - - - -💡 We strongly suggest always running your pipelines in `float16`, and so far, we've rarely seen any degradation in output quality. - - - -Another option is to reduce the number of inference steps. Choosing a more efficient scheduler could help decrease the number of steps without sacrificing output quality. You can find which schedulers are compatible with the current model in the [`DiffusionPipeline`] by calling the `compatibles` method: - -```python -pipeline.scheduler.compatibles -[ - diffusers.schedulers.scheduling_lms_discrete.LMSDiscreteScheduler, - diffusers.schedulers.scheduling_unipc_multistep.UniPCMultistepScheduler, - diffusers.schedulers.scheduling_k_dpm_2_discrete.KDPM2DiscreteScheduler, - diffusers.schedulers.scheduling_deis_multistep.DEISMultistepScheduler, - diffusers.schedulers.scheduling_euler_discrete.EulerDiscreteScheduler, - diffusers.schedulers.scheduling_dpmsolver_multistep.DPMSolverMultistepScheduler, - diffusers.schedulers.scheduling_ddpm.DDPMScheduler, - diffusers.schedulers.scheduling_dpmsolver_singlestep.DPMSolverSinglestepScheduler, - diffusers.schedulers.scheduling_k_dpm_2_ancestral_discrete.KDPM2AncestralDiscreteScheduler, - diffusers.utils.dummy_torch_and_torchsde_objects.DPMSolverSDEScheduler, - diffusers.schedulers.scheduling_heun_discrete.HeunDiscreteScheduler, - diffusers.schedulers.scheduling_pndm.PNDMScheduler, - diffusers.schedulers.scheduling_euler_ancestral_discrete.EulerAncestralDiscreteScheduler, - diffusers.schedulers.scheduling_ddim.DDIMScheduler, -] -``` - -The Stable Diffusion model uses the [`PNDMScheduler`] by default which usually requires ~50 inference steps, but more performant schedulers like [`DPMSolverMultistepScheduler`], require only ~20 or 25 inference steps. Use the [`~ConfigMixin.from_config`] method to load a new scheduler: - -```python -from diffusers import DPMSolverMultistepScheduler - -pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config) -``` - -Now set the `num_inference_steps` to 20: - -```python -generator = torch.Generator("cuda").manual_seed(0) -image = pipeline(prompt, generator=generator, num_inference_steps=20).images[0] -image -``` - -
- -Great, you've managed to cut the inference time to just 4 seconds! ⚡️ - -## Memory - -The other key to improving pipeline performance is consuming less memory, which indirectly implies more speed, since you're often trying to maximize the number of images generated per second. The easiest way to see how many images you can generate at once is to try out different batch sizes until you get an `OutOfMemoryError` (OOM). - -Create a function that'll generate a batch of images from a list of prompts and `Generators`. Make sure to assign each `Generator` a seed so you can reuse it if it produces a good result. - -```python -def get_inputs(batch_size=1): - generator = [torch.Generator("cuda").manual_seed(i) for i in range(batch_size)] - prompts = batch_size * [prompt] - num_inference_steps = 20 - - return {"prompt": prompts, "generator": generator, "num_inference_steps": num_inference_steps} -``` - -Start with `batch_size=4` and see how much memory you've consumed: - -```python -from diffusers.utils import make_image_grid - -images = pipeline(**get_inputs(batch_size=4)).images -make_image_grid(images, 2, 2) -``` - -Unless you have a GPU with more vRAM, the code above probably returned an `OOM` error! Most of the memory is taken up by the cross-attention layers. Instead of running this operation in a batch, you can run it sequentially to save a significant amount of memory. All you have to do is configure the pipeline to use the [`~DiffusionPipeline.enable_attention_slicing`] function: - -```python -pipeline.enable_attention_slicing() -``` - -Now try increasing the `batch_size` to 8! - -```python -images = pipeline(**get_inputs(batch_size=8)).images -make_image_grid(images, rows=2, cols=4) -``` - -
- -Whereas before you couldn't even generate a batch of 4 images, now you can generate a batch of 8 images at ~3.5 seconds per image! This is probably the fastest you can go on a T4 GPU without sacrificing quality. - -## Quality - -In the last two sections, you learned how to optimize the speed of your pipeline by using `fp16`, reducing the number of inference steps by using a more performant scheduler, and enabling attention slicing to reduce memory consumption. Now you're going to focus on how to improve the quality of generated images. - -### Better checkpoints - -The most obvious step is to use better checkpoints. The Stable Diffusion model is a good starting point, and since its official launch, several improved versions have also been released. However, using a newer version doesn't automatically mean you'll get better results. You'll still have to experiment with different checkpoints yourself, and do a little research (such as using [negative prompts](https://minimaxir.com/2022/11/stable-diffusion-negative-prompt/)) to get the best results. - -As the field grows, there are more and more high-quality checkpoints finetuned to produce certain styles. Try exploring the [Hub](https://huggingface.co/models?library=diffusers&sort=downloads) and [Diffusers Gallery](https://huggingface.co/spaces/huggingface-projects/diffusers-gallery) to find one you're interested in! - -### Better pipeline components - -You can also try replacing the current pipeline components with a newer version. Let's try loading the latest [autoencoder](https://huggingface.co/stabilityai/stable-diffusion-2-1/tree/main/vae) from Stability AI into the pipeline, and generate some images: - -```python -from diffusers import AutoencoderKL - -vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16).to("cuda") -pipeline.vae = vae -images = pipeline(**get_inputs(batch_size=8)).images -make_image_grid(images, rows=2, cols=4) -``` - -
- -### Better prompt engineering - -The text prompt you use to generate an image is super important, so much so that it is called *prompt engineering*. Some considerations to keep during prompt engineering are: - -- How is the image or similar images of the one I want to generate stored on the internet? -- What additional detail can I give that steers the model towards the style I want? - -With this in mind, let's improve the prompt to include color and higher quality details: - -```python -prompt += ", tribal panther make up, blue on red, side profile, looking away, serious eyes" -prompt += " 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta" -``` - -Generate a batch of images with the new prompt: - -```python -images = pipeline(**get_inputs(batch_size=8)).images -make_image_grid(images, rows=2, cols=4) -``` - -
-
-
-
-Pretty impressive! Let's tweak the second image (corresponding to the `Generator` with a seed of `1`) a bit more by adding some text about the age of the subject:
-
-```python
-prompts = [
-    "portrait photo of the oldest warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta",
-    "portrait photo of an old warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta",
-    "portrait photo of a warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta",
-    "portrait photo of a young warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta",
-]
-
-generator = [torch.Generator("cuda").manual_seed(1) for _ in range(len(prompts))]
-images = pipeline(prompt=prompts, generator=generator, num_inference_steps=25).images
-make_image_grid(images, rows=2, cols=2)
-```
-
-
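-The prompt isn't the only quality lever. With the seed held fixed, you can also sweep sampler settings such as `guidance_scale` and keep whichever value you like best. The sketch below reuses the first prompt from the list above; the values are only examples, not tuned recommendations:
-
-```python
-# Hold the seed fixed so the only thing changing between images is guidance_scale.
-images = []
-for guidance_scale in (6.0, 7.5, 9.0, 12.0):
-    generator = torch.Generator("cuda").manual_seed(1)
-    image = pipeline(
-        prompt=prompts[0],
-        generator=generator,
-        num_inference_steps=25,
-        guidance_scale=guidance_scale,
-    ).images[0]
-    images.append(image)
-
-make_image_grid(images, rows=2, cols=2)
-```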
-
-
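-Another inexpensive quality lever is the [negative prompt](https://minimaxir.com/2022/11/stable-diffusion-negative-prompt/) mentioned earlier, which describes what you *don't* want in the image. The negative prompt text below is only an example of terms people commonly exclude:
-
-```python
-negative_prompt = "blurry, low quality, deformed, watermark, text"
-
-images = pipeline(
-    **get_inputs(batch_size=8),
-    negative_prompt=8 * [negative_prompt],
-).images
-make_image_grid(images, rows=2, cols=4)
-```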
-
-## Next steps
-
-In this tutorial, you learned how to optimize a [`DiffusionPipeline`] for computational and memory efficiency as well as how to improve the quality of generated outputs. If you're interested in making your pipeline even faster, take a look at the following resources:
-
-- Learn how [PyTorch 2.0](./optimization/torch2.0) and [`torch.compile`](https://pytorch.org/docs/stable/generated/torch.compile.html) can yield 5-300% faster inference speed. On an A100 GPU, inference can be up to 50% faster!
-- If you can't use PyTorch 2, we recommend installing [xFormers](./optimization/xformers). Its memory-efficient attention mechanism works great with PyTorch 1.13.1 for faster speed and reduced memory consumption.
-- Other optimization techniques, such as model offloading, are covered in [this guide](./optimization/fp16).
diff --git a/diffusers/docs/source/en/training/adapt_a_model.md b/diffusers/docs/source/en/training/adapt_a_model.md
deleted file mode 100644
index e6a088675a34f418245f9332acc29587ee7f692e..0000000000000000000000000000000000000000
--- a/diffusers/docs/source/en/training/adapt_a_model.md
+++ /dev/null
@@ -1,47 +0,0 @@
-# Adapt a model to a new task
-
-Many diffusion systems share the same components, allowing you to adapt a pretrained model for one task to an entirely different task.
-
-This guide will show you how to adapt a pretrained text-to-image model for inpainting by initializing and modifying the architecture of a pretrained [`UNet2DConditionModel`].
-
-## Configure UNet2DConditionModel parameters
-
-A [`UNet2DConditionModel`] by default accepts 4 channels in the [input sample](https://huggingface.co/docs/diffusers/v0.16.0/en/api/models#diffusers.UNet2DConditionModel.in_channels). For example, load a pretrained text-to-image model like [`stable-diffusion-v1-5/stable-diffusion-v1-5`](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5) and take a look at the number of `in_channels`:
-
-```py
-from diffusers import StableDiffusionPipeline
-
-pipeline = StableDiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", use_safetensors=True)
-pipeline.unet.config["in_channels"]
-4
-```
-
-Inpainting requires 9 channels in the input sample. You can check this value in a pretrained inpainting model like [`runwayml/stable-diffusion-inpainting`](https://huggingface.co/runwayml/stable-diffusion-inpainting):
-
-```py
-from diffusers import StableDiffusionPipeline
-
-pipeline = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-inpainting", use_safetensors=True)
-pipeline.unet.config["in_channels"]
-9
-```
-
-To adapt your text-to-image model for inpainting, you'll need to change the number of `in_channels` from 4 to 9.
-
-Initialize a [`UNet2DConditionModel`] with the pretrained text-to-image model weights, and change `in_channels` to 9. Changing the number of `in_channels` means you need to set `ignore_mismatched_sizes=True` and `low_cpu_mem_usage=False` to avoid a size mismatch error because the shape is different now.
-
-```py
-from diffusers import UNet2DConditionModel
-
-model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5"
-unet = UNet2DConditionModel.from_pretrained(
-    model_id,
-    subfolder="unet",
-    in_channels=9,
-    low_cpu_mem_usage=False,
-    ignore_mismatched_sizes=True,
-    use_safetensors=True,
-)
-```
-
-The pretrained weights of the other components from the text-to-image model are initialized from their checkpoints, but the input channel weights (`conv_in.weight`) of the `unet` are randomly initialized. It is important to finetune the model for inpainting because otherwise the model returns noise.
diff --git a/diffusers/docs/source/en/training/cogvideox.md b/diffusers/docs/source/en/training/cogvideox.md
deleted file mode 100644
index 657e58bfd5eb0ff37f9e38d15ff5f364858d1da6..0000000000000000000000000000000000000000
--- a/diffusers/docs/source/en/training/cogvideox.md
+++ /dev/null
@@ -1,291 +0,0 @@
-
-# CogVideoX
-
-CogVideoX is a text-to-video generation model focused on creating more coherent videos that stay aligned with a prompt. It achieves this using several methods:
-
-- a 3D variational autoencoder that compresses videos spatially and temporally, improving compression rate and video accuracy.
-
-- an expert transformer block to help align text and video, and a 3D full attention module for capturing and creating spatially and temporally accurate videos.
-
-Evaluation across video-instruction dimensions found that CogVideoX performs well on consistent theme, dynamic information, consistent background, object information, smooth motion, color, scene, appearance style, and temporal style, but struggles with human action, spatial relationships, and multiple objects.
-
-Finetuning with Diffusers can help compensate for these weaknesses.
-
-## Data Preparation
-
-The training scripts accept data in two formats.
-
-The first format is suited for small-scale training, and the second format uses a CSV format, which is more appropriate for streaming data for large-scale training. In the future, Diffusers will support the `