Data-centric design and training of deep neural networks with multiple data modalities for vision-based perception systems
Advances in computer vision and machine learning have revolutionized the ability to build systems that process and interpret digital data, enabling them to mimic human perception and paving the way for a wide range of applications. In recent years, both disciplines have made significant progress, fueled by advances in deep learning techniques. Deep learning is a discipline that uses deep neural networks (DNNs) to teach machines to recognize patterns and make predictions based on data. Deep learning-based perception systems are increasingly prevalent in diverse fields where humans and machines collaborate to combine their strengths. These fields include the automotive sector, industry, and medicine, where the goals include enhancing safety, supporting diagnosis, and automating repetitive tasks.
However, the success of deep learning algorithms depends heavily on data, and this dependency strongly limits the creation and success of a new DNN. Quality data for solving a specific problem are essential but hard, or even impracticable, to obtain in most developments. Data-centric artificial intelligence emphasizes the importance of using high-quality data that effectively convey what a model must learn. Motivated by the challenges and necessity of data, this thesis formulates and validates five hypotheses on the acquisition and impact of data in DNN design and training.
Specifically, we investigate and propose different methodologies to obtain suitable data for training DNNs in problems with limited access to large-scale data sources. We explore two potential solutions for obtaining data, both of which rely on synthetic data generation. Firstly, we investigate the process of generating synthetic training data using 3D graphics-based models and the impact of different design choices on the accuracy of the resulting DNNs. Beyond that, we propose a methodology to automate the data generation process and generate varied annotated data by replicating a custom 3D environment given an input configuration file. Secondly, we propose a generative adversarial network (GAN) that generates annotated images using both limited annotated data and unannotated in-the-wild data. Typically, limited annotated datasets have accurate annotations but lack realism and variability, which the in-the-wild data can compensate for. We analyze the suitability of the data generated with our GAN-based method for DNN training.
This thesis also presents a data-oriented DNN design, as data can present very different properties depending on their source. We differentiate sources based on the sensor modality used to obtain the data (e.g., camera, LiDAR) or the data generation domain (e.g., real, synthetic). On the one hand, we redesign an image-oriented object detection DNN architecture to process point clouds from the LiDAR sensor and optionally incorporate information from RGB images. On the other hand, we adapt a DNN to learn from both real and synthetic images while minimizing the domain gap between the features learned from each domain.
We have validated our hypotheses on various unresolved computer vision problems that are critical for numerous real-world vision-based systems. Our findings demonstrate that synthetic data generated using 3D models and environments are suitable for DNN training. However, we also highlight that the design choices made during the generation process, such as lighting and camera distortion, significantly affect the accuracy of the resulting DNN. Additionally, we show that a simulated 3D environment can assist in designing better sensor setups for a target task.
Furthermore, we demonstrate that GANs offer an alternative means of generating training data by exploiting labeled and existing unlabeled data to generate new samples that are suitable for DNN training without a simulation environment.
Finally, we show that adapting DNN design and training to data modality and source can increase model accuracy. More specifically, we demonstrate that modifying a predefined architecture designed for images to accommodate the peculiarities of point clouds results in state-of-the-art performance in 3D object detection. The DNN can be designed to handle data from a single modality or leverage data from different sources. Furthermore, when training with real and synthetic data, considering their domain gap and designing a DNN architecture accordingly improves model accuracy.