I completely agree with your perspective. Deep learning models can indeed offer advantages over classical statistical models in some settings. We are building benchmarks and comparisons to clarify when the more complex models are actually better.
We also want to show with this experiment the importance of creating benchmarks. In many use cases, practitioners choose more sophisticated models because they assume this will give them better accuracy. Our main point is that robust benchmarks should always be built before committing to a more complex model.
We noted that the paper only compares NeuralProphet against Prophet and does not include standard time series datasets (such as those from the M competitions). So we decided to test the model against simpler models (ETS in this case) using the StatsForecast library (https://github.com/Nixtla/statsforecast/).
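For readers unfamiliar with the ETS family, here is a minimal sketch of its simplest member, simple exponential smoothing with a fixed smoothing parameter. This is an illustration only, not the StatsForecast implementation (which also optimizes the parameters and handles trend and seasonality):

```python
def simple_exp_smoothing(y, alpha=0.3, h=1):
    """Simple exponential smoothing (the simplest ETS variant).

    y: list of observations; alpha: smoothing parameter in (0, 1);
    h: forecast horizon. Returns a flat h-step-ahead forecast equal
    to the final smoothed level."""
    level = y[0]
    for obs in y[1:]:
        # New level is a weighted average of the latest observation
        # and the previous level.
        level = alpha * obs + (1 - alpha) * level
    return [level] * h
```

The point of the benchmark is that even a model this simple, fitted well, is a hard baseline to beat.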
I think the problem arises from the datasets used to evaluate the performance of the models. In the case of Prophet's paper, only one time series is used (the number of events created on Facebook). From the results comparing AutoARIMA vs. Prophet (https://github.com/Nixtla/statsforecast/tree/main/experiment..., using the same datasets as in the ETS vs. NeuralProphet experiment), we can conclude that ETS is also better than Prophet. Regarding NeuralProphet vs. Prophet, the results are not conclusive for these datasets.
The pipeline we have developed improves on the state of the art in the markets you mention in the following ways:
1. It is a fully automated end-to-end pipeline for forecast generation. The pipeline covers preprocessing (such as missing value imputation), feature generation (static and dynamic), forecast generation, and a module to validate forecasts on important time series competition datasets.
2. Users can deploy the pipeline in their own cloud quickly. We use Terraform (https://github.com/Nixtla/nixtla/tree/main/iac/terraform/aws), so it is very simple to deploy the pipeline on AWS. We are working on releasing Terraform configurations for other clouds such as Azure and Google Cloud.
3. Users can use their own models. Just create a fork of the repo and make the appropriate modifications to include any model the user wants to deploy. On our side, we are working to include Deep Learning models with the nixtlats library (https://github.com/nixtla/nixtlats/) that we also developed.
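As a concrete illustration of the preprocessing stage in point 1, here is a minimal sketch of forward-fill missing value imputation. This shows one reasonable strategy, not the pipeline's actual implementation:

```python
def forward_fill(y, fallback=0.0):
    """Replace None gaps with the last observed value (forward fill).

    Leading gaps, which have no prior observation, fall back to
    `fallback`. This is one simple imputation strategy among many."""
    filled, last = [], None
    for value in y:
        if value is not None:
            last = value
        # Use the most recent observation; if none exists yet,
        # use the fallback value.
        filled.append(last if last is not None else fallback)
    return filled
```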
About benchmarking using statistical models, we highly recommend using statsforecast (https://github.com/Nixtla/statsforecast) that we created. It is designed to be highly efficient in fitting statistical models on millions of time series. More complex models can be built on the results to get a positive Forecast Value Added.
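To make the Forecast Value Added idea concrete, here is a minimal sketch: FVA is the reduction in forecast error achieved by a candidate model relative to a simple baseline. The helper names below are hypothetical, not from any Nixtla library:

```python
def mae(actual, pred):
    """Mean absolute error between actuals and predictions."""
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)

def forecast_value_added(actual, model_pred, baseline_pred):
    """FVA: baseline error minus model error. A positive value means
    the extra model complexity actually improved accuracy; a negative
    value means the simple baseline was better."""
    return mae(actual, baseline_pred) - mae(actual, model_pred)
```

Only models with a clearly positive FVA over a well-fitted statistical baseline justify their added cost.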
We missed that, sorry. At the moment, for forecasting the pipeline uses the mlforecast library (https://github.com/nixtla/mlforecast) that builds upon sklearn, xgboost and lightgbm.
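To illustrate the general approach behind such libraries: tree-based regressors like those in sklearn, xgboost, and lightgbm need a tabular input, so the series is typically reframed as rows of lagged values. The helper below is a hypothetical sketch of that framing, not mlforecast's API:

```python
def make_lag_features(y, lags=(1, 2)):
    """Turn a univariate series into (X, target) rows using lagged
    values -- the tabular framing that lag-based forecasting libraries
    feed to sklearn / xgboost / lightgbm regressors."""
    max_lag = max(lags)
    X, target = [], []
    # Each usable time step becomes one training row: its features are
    # the values `lag` steps back, and its target is the current value.
    for t in range(max_lag, len(y)):
        X.append([y[t - lag] for lag in lags])
        target.append(y[t])
    return X, target
```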
In addition, we are about to include state-of-the-art Deep Learning models from the nixtlats library (https://github.com/nixtla/nixtlats/).
We agree that in most cases Prophet is not a good benchmark; however, we wanted to use it because it is one of the most widely used forecasting libraries. For the same reason, we also tested the solution against AWS Forecast, obtaining better results.
Besides better performance and scalability, the pipeline we created covers all the stages of time series forecasting: preprocessing (e.g., missing value imputation), creation of static and dynamic features, forecast generation, and finally evaluation on datasets from important competitions. (https://github.com/Nixtla/tsfeatures)
On the deployment side, the entire pipeline can be quickly deployed in the user's cloud using Terraform, which reduces development time. (https://github.com/Nixtla/nixtla)