Data analysis and machine learning approaches for time series pre- and post-processing pipelines
Directors: Marco Quartulli (Vicomtech), Basilio Sierra Araujo (University)
In the industrial domain, different kinds of sensor devices capture data continuously and monitor the operation of machines in real time. The cost and size of sensors have fallen dramatically in recent years, making the digitisation of machines and processes affordable. In the context of the Industrial Internet of Things (IIoT), concerns arise from both the large quantity and the sometimes low quality of the data, which is typically messy: observations can be noisy, missing or lost in communication. These limitations can lead to results that negatively impact business decisions.
This research focuses on time series data, which poses unique challenges due to the need to properly take into account autocorrelation, trends, seasonality, and gaps. Moreover, as time series data is often continuously generated, it is important that cleaning algorithms support near real-time operation. Furthermore, as the data evolves, the cleaning strategy needs to change in an adaptive and incremental way, in order to avoid having to start the cleaning process from scratch each time.
The objective of this thesis is to verify the possibility of applying machine learning-inspired process flows to data pre-processing steps. For that purpose, this work proposes methods that are capable of selecting optimal pre-processing strategies and providing insight into the errors made. The proposed methods generate pre- and post-processing models which are trained using available historical data, by minimising empirical loss functions.
Specifically, this dissertation studies time series compression, feature joining, observation imputation and surrogate model generation processes. In each of them, the optimal selection and combination of multiple strategies is pursued. The selection is driven by data characteristics and by user-defined system properties and limitations.
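As an illustration of strategy selection by minimising an empirical loss, the following minimal sketch picks an imputation strategy for each signal window. It is not the method of the thesis: the three candidate strategies, the mean absolute error loss, the non-overlapping windows and all function names are assumptions made for the example. Known observations are hidden, each candidate reconstructs them, and the candidate with the lowest error on the hidden points is selected.

```python
import numpy as np


def impute_mean(x):
    """Replace gaps with the mean of the observed values."""
    out = x.copy()
    out[np.isnan(out)] = np.nanmean(x)
    return out


def impute_ffill(x):
    """Replace gaps with the last observed value (forward fill)."""
    positions = np.arange(len(x))
    # Index of the most recent observed sample at or before each position.
    last_seen = np.where(~np.isnan(x), positions, -1)
    last_seen = np.maximum.accumulate(last_seen)
    filled = np.where(last_seen >= 0, x[np.maximum(last_seen, 0)], np.nan)
    # Leading gaps have no previous value: use the first observed value.
    filled[np.isnan(filled)] = x[~np.isnan(x)][0]
    return filled


def impute_linear(x):
    """Replace gaps by linear interpolation between observed neighbours."""
    out = x.copy()
    positions = np.arange(len(x))
    known = ~np.isnan(x)
    out[~known] = np.interp(positions[~known], positions[known], x[known])
    return out


# Hypothetical candidate set; any other strategies could be registered here.
STRATEGIES = {"mean": impute_mean, "ffill": impute_ffill, "linear": impute_linear}


def select_strategy(window, hidden_idx):
    """Hide known samples of a complete window and score every candidate.

    Returns the name of the strategy with the lowest mean absolute error
    on the hidden samples, together with all the losses.
    """
    masked = window.copy()
    masked[hidden_idx] = np.nan
    losses = {}
    for name, strategy in STRATEGIES.items():
        estimate = strategy(masked)
        losses[name] = float(np.mean(np.abs(estimate[hidden_idx] - window[hidden_idx])))
    return min(losses, key=losses.get), losses


def select_per_window(series, window_size, hidden_idx):
    """Apply the selection independently to consecutive, non-overlapping windows."""
    starts = range(0, len(series) - window_size + 1, window_size)
    return [select_strategy(series[s:s + window_size], hidden_idx)[0] for s in starts]


# Synthetic historical data: a linear ramp followed by a step signal.
ramp = np.arange(10.0)
step = np.array([0.0] * 5 + [5.0] * 5)
history = np.concatenate([ramp, step])
chosen = select_per_window(history, window_size=10, hidden_idx=[2, 4, 7])
print(chosen)
```

On this synthetic input, linear interpolation reconstructs the ramp exactly, while forward fill reconstructs the step signal exactly, so the selected strategy changes from one window to the next.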
The general results indicate that the proposed approach identifies optimal pre- and post-processing strategies for univariate time series on a window-by-window basis, showing its capability to adapt to the current signal window. Specific details of the data generation process, dependencies on other internal or external variables, and even noise can affect which pre-processing strategy is selected. Controlling the error in the process is critical in order to detect model drift and, as a consequence, to retrain the generated model to maintain data quality. Implementations of the approach have made it possible to ensure data quality control in real-world project scenarios.
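The role of error control in detecting drift can be sketched as follows. This is only one possible rule, assumed for illustration: the per-window errors of an initial reference period define a threshold (mean plus k standard deviations), and any later window whose error exceeds it is flagged as a candidate for retraining. The function name, the threshold rule and the error values are hypothetical and are not results from the thesis.

```python
import numpy as np


def drift_flags(errors, n_baseline, k=3.0):
    """Flag windows whose error exceeds a threshold learned on a baseline.

    errors      per-window reconstruction errors, in time order
    n_baseline  number of initial windows treated as the reference period
    k           assumed sensitivity: threshold = mean + k * std of the baseline
    """
    errors = np.asarray(errors, dtype=float)
    baseline = errors[:n_baseline]
    threshold = baseline.mean() + k * baseline.std()
    flags = errors > threshold
    flags[:n_baseline] = False  # the reference period is never flagged
    return flags, threshold


# Made-up per-window errors: the seventh window shows a sudden increase.
window_errors = [0.10, 0.12, 0.11, 0.09, 0.10, 0.11, 0.50, 0.12]
flags, threshold = drift_flags(window_errors, n_baseline=5)
retrain_at = np.flatnonzero(flags).tolist()
print(retrain_at, round(threshold, 4))
```

In a near real-time setting, a flagged window would trigger retraining of the generated model on recent data, after which the baseline would be recomputed.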