Latent Guard: a Safety Framework for Text-to-image Generation

Runtao Liu1, Ashkan Khakzar2, Jindong Gu2, Qifeng Chen1, Philip Torr2, Fabio Pizzati2
1Hong Kong University of Science and Technology, 2University of Oxford
runtao219 [at] gmail [dot] com

image


An open-source, efficient, and extensible framework for enhancing safety🛡️ in text-to-image (T2I) generation🖼️, designed to prevent misuse and improve flexibility.

  • ⚡️Fast: Detects unsafe input prompts in approximately 1ms and can be trained on a single NVIDIA 3090 GPU.
  • 🔧Extensible: Supports customized unsafe concepts to block; compatible with any T2I model built on a text encoder, such as SD and SDXL.
  • 🏆State-of-the-art: Faster performance, higher accuracy, and better scalability than existing safety methods.
  • 🌐Open: All processes—data generation, training, testing, and inference—are fully open-source.

[2024/09/25 New]🚀🚀🚀: We released our code📝 and the model weights⚙️!

[2024/07]: We released our dataset CoPro.

[2024/07]: Our paper has been accepted by ECCV 2024.

BibTeX

@article{liu2024latent,
  title={Latent Guard: a Safety Framework for Text-to-image Generation},
  author={Liu, Runtao and Khakzar, Ashkan and Gu, Jindong and Chen, Qifeng and Torr, Philip and Pizzati, Fabio},
  journal={arXiv preprint arXiv:2404.08031},
  year={2024}
}

Motivation & Background

image

Recent text-to-image generators are composed of a text encoder and a diffusion model. Deploying them without appropriate safety measures creates risks of misuse (left). We propose Latent Guard (right), a safety method designed to block malicious input prompts. Our idea is to detect the presence of blacklisted concepts in a learned latent space on top of the text encoder. This makes it possible to detect blacklisted concepts beyond their exact wording, and it extends to some adversarial attacks too ("<ADV>"). The blacklist can be adapted at test time by adding or removing concepts without retraining. Blocked prompts are not processed by the diffusion model, which saves computational cost.
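The blacklist check described above can be illustrated with a minimal sketch. It assumes that prompts and concepts have already been mapped to vectors in the learned latent space, and that a prompt is blocked when its cosine similarity to any blacklisted concept reaches a threshold. The `Blacklist` class, the threshold value, and the toy 3-dimensional vectors are hypothetical and are not taken from the released code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class Blacklist:
    """Hypothetical container mapping a concept name to its latent embedding."""

    def __init__(self, threshold):
        self.threshold = threshold  # assumed decision threshold on similarity
        self.concepts = {}

    def add(self, name, embedding):
        # Adding a concept only stores a vector; nothing is retrained.
        self.concepts[name] = embedding

    def remove(self, name):
        self.concepts.pop(name, None)

    def check(self, prompt_embedding):
        """Return (blocked, closest_concept, similarity)."""
        best_name, best_score = None, -1.0
        for name, emb in self.concepts.items():
            score = cosine(prompt_embedding, emb)
            if score > best_score:
                best_name, best_score = name, score
        blocked = best_name is not None and best_score >= self.threshold
        return blocked, best_name, best_score


# Toy embeddings standing in for the output of the learned mapping.
blacklist = Blacklist(threshold=0.8)
blacklist.add("concept_a", [1.0, 0.0, 0.0])
blacklist.add("concept_b", [0.0, 1.0, 0.0])

unsafe_like = [0.9, 0.1, 0.0]  # close to concept_a
safe_like = [0.0, 0.0, 1.0]    # orthogonal to both concepts

blocked_unsafe, matched, _ = blacklist.check(unsafe_like)
blocked_safe, _, _ = blacklist.check(safe_like)

# Test-time adaptation: drop a concept and check the same prompt again.
blacklist.remove("concept_a")
blocked_after_removal, _, _ = blacklist.check(unsafe_like)
```

In this sketch a blocked prompt would simply never be forwarded to the diffusion model, which is where the saving in computation comes from.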

Abstract

With the ability to generate high-quality images, text-to-image (T2I) models can be exploited for creating inappropriate content. To prevent misuse, existing safety measures are based either on text blacklists, which can be easily circumvented, or on harmful content classification, which requires large datasets for training and offers low flexibility. Hence, we propose Latent Guard, a framework designed to improve safety measures in text-to-image generation. Inspired by blacklist-based approaches, Latent Guard learns a latent space on top of the T2I model's text encoder, where it is possible to check for the presence of harmful concepts in the input text embeddings. Our proposed framework is composed of a data generation pipeline specific to the task using large language models, ad-hoc architectural components, and a contrastive learning strategy to benefit from the generated data. The effectiveness of our method is verified on three datasets and against four baselines.

Approach

image

Overview of Latent Guard. We first generate a dataset of safe and unsafe prompts centered around blacklisted concepts (left). Then, we leverage pretrained textual encoders to extract features, and map them to a learned latent space with our Embedding Mapping Layer (center). Only the Embedding Mapping Layer is trained, while all other parameters are kept frozen. We train by imposing a contrastive loss on the extracted embedding, bringing closer the embeddings of unsafe prompts and concepts, while separating them from safe ones (right).
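To make the contrastive objective concrete, here is a minimal sketch of a generic InfoNCE-style loss. It is an assumption for illustration, not necessarily the exact loss used in the paper: for one concept embedding, unsafe prompt embeddings act as positives and safe prompt embeddings as negatives. The function name `contrastive_loss`, the temperature value, and the 2-dimensional toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(concept, unsafe_prompts, safe_prompts, temperature=0.1):
    """InfoNCE-style loss for a single concept.

    Each unsafe prompt is pulled toward the concept; all safe prompts
    are pushed away by appearing in the denominator.
    """
    neg_sum = sum(math.exp(cosine(concept, s) / temperature) for s in safe_prompts)
    losses = []
    for u in unsafe_prompts:
        pos = math.exp(cosine(concept, u) / temperature)
        losses.append(-math.log(pos / (pos + neg_sum)))
    return sum(losses) / len(losses)


concept = [1.0, 0.0]

# Unsafe prompt near the concept, safe prompt far away: low loss.
well_separated = contrastive_loss(
    concept, unsafe_prompts=[[0.9, 0.1]], safe_prompts=[[-1.0, 0.2]]
)

# Roles swapped (unsafe far, safe near): high loss.
mixed = contrastive_loss(
    concept, unsafe_prompts=[[-1.0, 0.2]], safe_prompts=[[0.9, 0.1]]
)
```

In a real training loop the gradient of such a loss would update only the Embedding Mapping Layer, since the text encoder stays frozen.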

Dataset CoPro Generation

image

CoPro generation. For concepts \(c \in \mathcal{C}\), we sample unsafe prompts \(\mathcal{U}\) with an LLM, as described in Section 3.1 of the paper. Then, we create Synonym prompts \(\mathcal{U}^\text{syn}\) by replacing \(c\) with a synonym, also using an LLM. Furthermore, we use an adversarial attack method to replace \(c\) with an Adversarial text "<ADV>", obtaining \(\mathcal{U}^\text{adv}\). Safe prompts \(\mathcal{S}\) are obtained from \(\mathcal{U}\). This is done for both ID and OOD data.
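The relationship between the four prompt types can be sketched as a small record type. This is only an illustration of the data layout: in the pipeline described above, the synonym and safe prompts come from an LLM and the adversarial text comes from an attack method, whereas here they are replaced by hypothetical stub callables (`synonym_of`, `make_safe`) and plain string replacement. The class name `CoProSample` and all example strings are assumptions.

```python
from dataclasses import dataclass


@dataclass
class CoProSample:
    concept: str      # c, an element of the concept set C
    unsafe: str       # prompt in U, contains the concept verbatim
    synonym: str      # prompt in U^syn, concept replaced by a synonym
    adversarial: str  # prompt in U^adv, concept replaced by adversarial text
    safe: str         # prompt in S, derived from the unsafe prompt


def build_sample(concept, unsafe_prompt, synonym_of, make_safe, adv_text="<ADV>"):
    """Assemble one record; synonym_of and make_safe are hypothetical stubs
    standing in for the LLM calls, adv_text for the attack output."""
    if concept not in unsafe_prompt:
        raise ValueError("the unsafe prompt must contain the concept")
    return CoProSample(
        concept=concept,
        unsafe=unsafe_prompt,
        synonym=unsafe_prompt.replace(concept, synonym_of(concept)),
        adversarial=unsafe_prompt.replace(concept, adv_text),
        safe=make_safe(unsafe_prompt, concept),
    )


# Neutral placeholder strings; no real CoPro content is reproduced here.
sample = build_sample(
    concept="forbidden_item",
    unsafe_prompt="a photo of a forbidden_item on a table",
    synonym_of=lambda c: "banned_item",
    make_safe=lambda prompt, c: prompt.replace(c, "teapot"),
)
```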

Qualitative and Quantitative Results

Evaluation on CoPro. We report accuracy (a) and AUC (b) for Latent Guard and the baselines on CoPro. Latent Guard ranks first or second in all setups while training only on Explicit ID training data. Panel (c) shows example CoPro prompts and the images generated from them. The unsafe generated images attest to the quality of our dataset. Latent Guard is the only method that blocks all the tested prompts.

image
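For readers unfamiliar with the two metrics reported in this evaluation, the sketch below computes accuracy at a fixed threshold and AUC from per-prompt scores. The scores, labels, and threshold are made-up toy values, not results from the paper, and the pairwise AUC formula is a standard textbook definition rather than the authors' evaluation script.

```python
def accuracy(scores, labels, threshold):
    """Fraction of prompts classified correctly when score >= threshold means unsafe."""
    preds = [1 if s >= threshold else 0 for s in scores]
    correct = sum(1 for p, y in zip(preds, labels) if p == y)
    return correct / len(labels)


def auc(scores, labels):
    """Probability that a random unsafe prompt scores above a random safe one
    (ties count as one half). Threshold-free, unlike accuracy."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: label 1 = unsafe prompt, label 0 = safe prompt.
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1]
labels = [1, 1, 1, 0, 0, 0]

acc = accuracy(scores, labels, threshold=0.5)
area = auc(scores, labels)
```

Accuracy depends on the chosen threshold, while AUC summarizes ranking quality across all thresholds, which is why reporting both is informative.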

Evaluation on Unseen Datasets. We test Latent Guard on two existing datasets, Unsafe Diffusion and I2P++. Although the distribution of input T2I prompts differs from the one in CoPro, Latent Guard still outperforms all baselines and achieves robust classification.

image

Speed and Feature Space Analysis

Computational cost. We measure processing time and memory usage for different batch sizes and different numbers of concepts in \(\mathcal{C}_\text{check}\). In all cases, the requirements remain low.

image

Feature space analysis. In the CLIP latent space, safe and unsafe embeddings are mixed (left). Training Latent Guard on CoPro makes distinct safe and unsafe regions emerge naturally (right).

image