I currently have a kdb+ database with ~1mil rows of financial tick data. What is the best way to break up this time-series financial data into train/dev/test sets for ML?
This paper suggests the use of k-fold cross-validation, which partitions the data into complementary subsets. But it’s from spring 2014, and after reading it I’m still unclear on how to implement it in practice. Is this the best solution, or is something like hold-out validation more appropriate for financial data? I also found this paper on building a neural network in kdb+, but it didn’t give any practical real-world examples of dividing the dataset into appropriate categories.
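For context, the simplest thing I can think of is a plain chronological hold-out split in q, something like the sketch below. (The table name `trade`, the assumption that it’s already sorted by time, and the 70/15/15 proportions are all just placeholders on my part.)

```q
/ chronological hold-out split; `trade` is a placeholder name for a
/ tick table already sorted by time
n:count trade;
c1:floor 0.7*n;                 / end of training set
c2:floor 0.85*n;                / end of dev set
train:c1#trade;                 / first 70% of rows
dev:(c2-c1)#c1 _ trade;         / next 15%
test:c2 _ trade;                / final 15%
```

Is a strict chronological cut like this the “right” hold-out for tick data, or should the boundaries fall on calendar dates rather than row counts?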
I know data augmentation and shuffling are great approaches if you have a small dataset or for classification problems, e.g. using linear or softmax regression on cat pics to decide whether it’s a cat or not. But I’m not sure how those approaches would work with time-series data. It seems like the order of the rows (days/prices) plays an important role in training the model, and if that’s the case, shuffling the rows would blow up the loss function and destroy the model.
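If plain shuffling is off the table, is the standard alternative the forward-chaining (walk-forward) variant of k-fold, where each fold trains on everything up to some point in time and validates on the next block? A rough sketch of what I think that would look like in q (again, `trade` and the fold count are just my assumptions):

```q
/ forward-chaining cross-validation: split the time-sorted table into
/ k+1 consecutive blocks; fold i trains on blocks 1..i and validates
/ on block i+1, so validation data is always strictly later in time
k:5;
n:count trade;
b:(floor (1+til k)*n%k+1),n;    / k+1 block boundaries, last exactly n
mk:{[t;b;i] `train`valid!(b[i]#t;(b[i+1]-b[i])#b[i] _ t)};
folds:mk[trade;b] each til k;   / list of k train/valid dictionaries
```

Does something like this line up with what the paper intends, or is there an accepted idiom in q/kdb+ for it?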
Augmenting time-series data is not something I’m familiar with. Do you have any examples of how to do this?