Time series clustering and data augmentation techniques to improve the forecast of Dengue cases in Paraguay with deep learning
Adviser: Schaerer Serra, Christian Emilio
Date of publishing: 2020
Type of publication: master thesis
Dengue fever is a public health problem, and accurate forecasts can help governments take the best preventive actions. As the volume of available data continuously increases, machine learning and deep learning (DL) models have become an attractive approach. However, it is difficult to make accurate predictions in areas with fewer cases. In this work, traditional approaches such as LARS LASSO Regression (LR), Random Forest (RF), and Support Vector Regression (SVR) are compared with DL models based on Long Short-Term Memory (LSTM), considering weekly Dengue incidence and climate data in 217 cities in Paraguay.

Several cities may exhibit heterogeneous behavior and yield poor accuracy. To mitigate this problem, two approaches are proposed: clustering and data augmentation. First, a clustering analysis of the time series was performed, using silhouette scores to measure how well observations are clustered. The results indicate that hierarchical clustering combined with correlation is the most appropriate approach. Several LSTM models are then compared on subgroups of similar time series. Second, several data augmentation techniques were applied, and the synthetic time series obtained were used as input to train the models. The results indicate that the synthetic series obtained with the Bayesian estimation technique are the ones that improved the performance of the model.

The Root Mean Square Error (RMSE) confirms that the clustered LSTM models improve accuracy by 19.48 ± 18.80% and that LSTM with Bayesian-based data augmentation improves it by 16.86 ± 16.57%. The main contributions of this work are two techniques that can improve the performance of time series models by combining information from similar time series and weather data.
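The clustering step described in the abstract (hierarchical clustering with a correlation-based distance, evaluated by the silhouette score) can be illustrated with a minimal sketch. This is not the thesis code: the function name `cluster_series`, the average-linkage choice, the number of clusters, and the synthetic weekly series are all assumptions made for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import silhouette_score


def cluster_series(series, n_clusters):
    """Group time series by correlation distance (hypothetical helper).

    series: array of shape (n_series, n_weeks), one row per city.
    Returns the cluster label of each row and the silhouette score.
    """
    # Correlation distance: 1 - Pearson correlation between two rows.
    dist = pdist(series, metric="correlation")
    # Agglomerative (hierarchical) clustering; average linkage is an assumption.
    tree = linkage(dist, method="average")
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    # Silhouette score computed on the same precomputed distance matrix.
    score = silhouette_score(squareform(dist), labels, metric="precomputed")
    return labels, score


# Synthetic stand-in for weekly incidence: two groups of cities whose
# seasonal cycles are half a year out of phase (illustrative data only).
rng = np.random.default_rng(0)
weeks = np.arange(104)
season = np.sin(2 * np.pi * weeks / 52)
group_a = season + 0.1 * rng.standard_normal((5, weeks.size))
group_b = -season + 0.1 * rng.standard_normal((5, weeks.size))
series = np.vstack([group_a, group_b])

labels, score = cluster_series(series, n_clusters=2)
print(labels, round(score, 3))
```

In a setting like the one described, the silhouette score would be compared across distance measures and numbers of clusters, and a separate forecasting model would then be trained on each resulting subgroup of similar series.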