I currently have a kdb+ database with ~1mil rows of financial tick data. What is the best way to break up this time-series financial data into train/dev/test sets for ML?
This paper suggests the use of k-fold cross-validation, which partitions the data into complementary subsets. But it’s from spring 2014, and after reading it I’m still unclear on how to implement it in practice. Is this the best solution, or is something like hold-out validation more appropriate for financial data? I also found this paper on building a neural network in kdb+, but it didn’t give any practical real-world examples of dividing the dataset into appropriate categories.
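For context, the simplest thing I can think of is a plain chronological hold-out split in q, something like the sketch below. (The table name `trade`, the assumption that it’s already sorted by time, and the 70/15/15 proportions are all just placeholders on my part.)

```q
/ chronological hold-out split; `trade` is a placeholder name for a
/ tick table already sorted by time
n:count trade;
c1:floor 0.7*n;                 / end of training set
c2:floor 0.85*n;                / end of dev set
train:c1#trade;                 / first 70% of rows
dev:(c2-c1)#c1 _ trade;         / next 15%
test:c2 _ trade;                / final 15%
```

Is a strict chronological cut like this the “right” hold-out for tick data, or should the boundaries fall on calendar dates rather than row counts?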
I know data augmentation and shuffling are great approaches if you have a small dataset or for classification problems, e.g. using linear or softmax regression on cat pics to decide whether it’s a cat or not. But I’m not sure how those approaches would work with time-series data. It seems like the order of the rows (days/prices) plays an important role in training the model, and if that’s the case, shuffling the rows would blow up the loss function and destroy the model.
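If plain shuffling is off the table, is the standard alternative the forward-chaining (walk-forward) variant of k-fold, where each fold trains on everything up to some point in time and validates on the next block? A rough sketch of what I think that would look like in q (again, `trade` and the fold count are just my assumptions):

```q
/ forward-chaining cross-validation: split the time-sorted table into
/ k+1 consecutive blocks; fold i trains on blocks 1..i and validates
/ on block i+1, so validation data is always strictly later in time
k:5;
n:count trade;
b:(floor (1+til k)*n%k+1),n;    / k+1 block boundaries, last exactly n
mk:{[t;b;i] `train`valid!(b[i]#t;(b[i+1]-b[i])#b[i] _ t)};
folds:mk[trade;b] each til k;   / list of k train/valid dictionaries
```

Does something like this line up with what the paper intends, or is there an accepted idiom in q/kdb+ for it?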
Augmenting time-series data is not something I’m familiar with. Do you have any examples of how to do this?