Find more information about SqueezeBits Inc. on http://squeezebits.com/


What is Stable Diffusion?

Stable Diffusion is a deep learning-based text-to-image generation model released in 2022 by Stability AI. Since the model was open-sourced to the public, a number of applications have been built on top of it (e.g., image-to-image and text-to-video generation).

The Stable Diffusion model for text-to-image generation consists of three sub-models: a text encoder, a diffusion model (U-Net), and a decoder. In total, Stable Diffusion has more than 1B parameters and requires more than 2GB of memory (assuming FP16 precision). As a result, most Stable Diffusion-based applications rely on GPUs, either locally or in the cloud.
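The three-stage flow described above can be sketched with toy stand-ins for each sub-model. The shapes and arithmetic below are illustrative placeholders only (the 77x768 embedding mirrors CLIP-style encoders, and the 64x64 latent mirrors the usual 8x spatial downsampling), not the real Stable Diffusion weights or scheduler.

```python
import numpy as np

# Toy stand-ins for the three Stable Diffusion sub-models.
# All shapes and arithmetic are illustrative placeholders.
def text_encoder(prompt: str) -> np.ndarray:
    # Map the prompt to a fixed-size embedding (77x768, mirroring
    # CLIP-style text encoders; hypothetical here).
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.standard_normal((77, 768)).astype(np.float32)

def unet_denoise(latent: np.ndarray, text_emb: np.ndarray, step: int) -> np.ndarray:
    # Predict and subtract a noise estimate (placeholder arithmetic
    # standing in for the U-Net forward pass).
    noise_estimate = 0.1 * latent
    return latent - noise_estimate

def decoder(latent: np.ndarray) -> np.ndarray:
    # Upsample the 64x64 latent to a 512x512 image (nearest-neighbor
    # stand-in for the VAE decoder) and map values into [0, 1].
    img = latent.repeat(8, axis=0).repeat(8, axis=1)
    return np.clip((img + 1) / 2, 0, 1)

def generate(prompt: str, steps: int = 20) -> np.ndarray:
    # Text-to-image: encode prompt, iteratively denoise a random
    # latent, then decode the latent into pixel space.
    text_emb = text_encoder(prompt)
    latent = np.random.default_rng(0).standard_normal((64, 64, 3)).astype(np.float32)
    for t in range(steps):
        latent = unet_denoise(latent, text_emb, t)
    return decoder(latent)

image = generate("a photo of an astronaut riding a horse")
print(image.shape)  # (512, 512, 3)
```

The point of the sketch is the control flow: the text encoder runs once, the U-Net runs once per denoising step (20 steps here), and the decoder runs once at the end. This is why the U-Net dominates latency and is the main target for compression.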


Compressing Stable Diffusion for Mobile

A month ago (2023-02-24), Qualcomm demonstrated the world's first on-device Stable Diffusion on an Android phone. Powered by the Snapdragon 8 Gen 2 AP (Application Processor), the Stable Diffusion v1.5 model generated a 512x512 image within 15 seconds on a Galaxy S23 device, using 20 denoising steps. The model was quantized to INT8 precision using Post-Training Quantization and ran on the Hexagon processor of the Snapdragon AP.
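Post-Training Quantization, as mentioned above, maps trained FP32/FP16 weights to INT8 without retraining. A minimal sketch of the common asymmetric per-tensor affine scheme is shown below; this is illustrative only and is not Qualcomm's actual quantization toolchain.

```python
import numpy as np

# Minimal post-training quantization (PTQ) sketch: asymmetric
# per-tensor affine quantization of one weight tensor to INT8.
def quantize_int8(w: np.ndarray):
    qmin, qmax = -128, 127
    w_min, w_max = float(w.min()), float(w.max())
    # One FP step per INT8 step; zero_point aligns the FP range to qmin.
    scale = (w_max - w_min) / (qmax - qmin)
    zero_point = int(round(qmin - w_min / scale))
    q = np.clip(np.round(w / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    # Recover an FP approximation of the original weights.
    return (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
q, scale, zp = quantize_int8(w)
w_hat = dequantize(q, scale, zp)
# Round-trip error is bounded by one quantization step.
print(float(np.abs(w - w_hat).max()) <= scale)  # True
```

INT8 storage cuts the weight footprint 2x versus FP16 (4x versus FP32) and lets integer-friendly accelerators such as the Hexagon processor run the model efficiently, at the cost of the bounded rounding error shown above.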

Our team at SqueezeBits, together with our collaborators at the SNU-VLSI lab, has developed Mobile Stable Diffusion by compressing the Stable Diffusion v2.1 model to run on a Galaxy S22 device. We applied three compression techniques: (1) quantization, (2) pruning, and (3) knowledge distillation. The resulting model runs in less than 8 seconds on the mobile GPU (Adreno) of the Snapdragon 8 Gen 1 AP, using TensorFlow Lite and its GPU delegate. Since we targeted the mobile GPU, we used FP16 precision for activations and INT8/FP16 mixed precision for weights. The number of denoising steps was originally 20, but we reduced it while preserving output quality via a distillation technique. We also pruned uninfluential parts of the model to compress it further.
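Of the three techniques, pruning is the simplest to illustrate. The sketch below shows global magnitude-based unstructured pruning at a fixed sparsity; the specific recipe (criterion, sparsity level, granularity) is an illustrative assumption, not our actual pruning pipeline.

```python
import numpy as np

# Minimal magnitude-based pruning sketch: zero out the `sparsity`
# fraction of weights with the smallest absolute value. Illustrative
# only; not SqueezeBits' actual pruning recipe.
def magnitude_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    # Threshold at the k-th smallest magnitude; keep strictly larger weights.
    threshold = np.sort(np.abs(w), axis=None)[k - 1]
    mask = np.abs(w) > threshold
    return w * mask

rng = np.random.default_rng(0)
w = rng.standard_normal((128, 128)).astype(np.float32)
w_pruned = magnitude_prune(w, sparsity=0.5)
print(float((w_pruned == 0).mean()))  # ~0.5
```

In practice, pruning is usually followed by a short fine-tuning (or distillation) phase so the remaining weights can recover the accuracy lost by zeroing out the small-magnitude ones, which is consistent with combining pruning and knowledge distillation as described above.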