Exploring the Limits of Diffusion Models to Generate Person Detection Training Datasets
Authors: Hugo Rodríguez Arce, Miguel Ortiz Huamani
Date: 17.09.2024
Abstract
Current diffusion models could assist in creating training datasets for Deep Neural Network (DNN)-based person detectors by producing high-quality, realistic, and custom images of non-existent people and objects, avoiding privacy issues. However, these models have difficulties in generating images of people in a fully controlled way. Problems such as abnormal proportions, distortions of the body or face, extra limbs, or elements that do not match the input text prompt may occur. Moreover, biases related to factors like gender, clothing type and colors, ethnicity or age can also limit the control over the generated images. Both generative AI models and DNN-based person detectors need large sets of annotated images that reflect the diverse visual appearances expected in the application context. In this paper, we explore the capabilities of state-of-the-art text-to-image diffusion models for person image generation and propose a methodology to exploit them for training DNN-based person detectors. For the generation of virtual persons, this includes variations in the environment, such as illumination or background, and in person characteristics, such as body pose, skin tone, gender, age, clothing types and colors, as well as multiple types of partial occlusion with other objects (or people). Our method leverages explainability techniques to gain a better understanding of the behaviour of the diffusion models and of the relation between inputs and outputs, in order to improve the diversity of the person detection training dataset. Experimental results on the WiderPerson benchmark with a YOLOX detection model trained using the proposed methodology show the potential of this approach.
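As an illustration of the prompt-driven variation described above, the following is a minimal sketch (not the authors' released code), assuming the Hugging Face diffusers library and a Stable Diffusion v1.5 checkpoint; the attribute lists, prompt template, and negative prompt are hypothetical placeholders, not taken from the paper.

```python
# Sketch: generate person images with combinatorial prompt variations
# (age, gender, clothing, background) using Stable Diffusion via `diffusers`.
# Checkpoint name, attribute lists, and prompt wording are illustrative assumptions.
import itertools
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

ages = ["young", "middle-aged", "elderly"]
genders = ["man", "woman"]
clothing = ["a red jacket", "a business suit", "sportswear"]
backgrounds = ["a busy street at night", "a sunny park", "an office corridor"]

for i, (age, gender, outfit, scene) in enumerate(
    itertools.product(ages, genders, clothing, backgrounds)
):
    prompt = (
        f"a photo of a {age} {gender} wearing {outfit}, "
        f"full body, walking in {scene}, realistic, detailed"
    )
    # Negative prompt targets the failure modes mentioned in the abstract.
    negative = "deformed, extra limbs, distorted face, disproportionate body"
    image = pipe(prompt, negative_prompt=negative, num_inference_steps=30).images[0]
    image.save(f"synthetic_person_{i:04d}.png")
```

In the paper's methodology these variations would additionally cover occlusions and be filtered and annotated before detector training; the sketch only shows how diverse prompts can be enumerated.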
BIB_text
title = {Exploring the Limits of Diffusion Models to Generate Person Detection Training Datasets},
pages = {132060T},
keywds = {
Person Detection; Stable Diffusion; Synthetic Dataset
}
abstract = {
Current diffusion models could assist in creating training datasets for Deep Neural Network (DNN)-based person detectors by producing high-quality, realistic, and custom images of non-existent people and objects, avoiding privacy issues. However, these models have difficulties in generating images of people in a fully controlled way. Problems such as abnormal proportions, distortions of the body or face, extra limbs, or elements that do not match the input text prompt may occur. Moreover, biases related to factors like gender, clothing type and colors, ethnicity or age can also limit the control over the generated images. Both generative AI models and DNN-based person detectors need large sets of annotated images that reflect the diverse visual appearances expected in the application context. In this paper, we explore the capabilities of state-of-the-art text-to-image diffusion models for person image generation and propose a methodology to exploit them for training DNN-based person detectors. For the generation of virtual persons, this includes variations in the environment, such as illumination or background, and in person characteristics, such as body pose, skin tone, gender, age, clothing types and colors, as well as multiple types of partial occlusion with other objects (or people). Our method leverages explainability techniques to gain a better understanding of the behaviour of the diffusion models and of the relation between inputs and outputs, in order to improve the diversity of the person detection training dataset. Experimental results on the WiderPerson benchmark with a YOLOX detection model trained using the proposed methodology show the potential of this approach.
}
date = {2024-09-17},
}