Optimal Database Structure

Hey all,

New to kdb+ and have a few questions on optimal database structure. 

I have a daily point-in-time dataset of financial metrics for a large universe of stocks. Some of these metrics are updated daily, others are only updated once in a while (a few times per year). I need to query this database efficiently in the following manner: on day d, which stocks match a given set of criteria (say metric1 > 10 and metric2 < 5). In this example, metric1 is updated daily and metric2 is updated quarterly.

Here are the two different approaches I can think of:

Option 1

A straightforward approach would be to have a table that simply repeats the last value for metrics that are not updated frequently (in this case, metric2 is repeated unless its value changes). This makes for simple queries, but will require a lot of storage space (unless kdb+ optimizes away repeated values? Can it?).

date        security  | metric1  metric2  …
2010.01.01  AAPL      | 11.42    32.05    …

Option 2

Another option I see is to only store the updates:

date        security  metric   value
2010.01.01  AAPL      metric2  32.05
2010.03.25  AAPL      metric2  33.12

This approach uses less space, but the query complexity increases - I am not sure about the performance, but I assume it will also be slower. To obtain my results, I need to get the latest value for each security & metric pair, then filter down that intermediate dataset based on the query. I am still unsure how to efficiently build this query (or set of queries). How do I efficiently fetch the most recent data point for each security & metric pair? Then how do I query this intermediate result? Do I have to create a pivot table?

I hope I explained my situation clearly - if not, please let me know and I will elaborate. My question is: which of these options, if any, is a sensible approach to my problem? Is there a better structure that I am overlooking?

My primary concern is speed, but since the dataset is rather large, storage space might be an issue. I prefer Option 1 but I am afraid this type of table will take up too much space, unless kdb+ can optimize away the repeated values (for storage purposes).

Any insight will be much appreciated. Thanks in advance!

  • Rob

Hi Rob

I guess there are a couple of approaches.  First of all, it might depend on how you are partitioning your database - the standard is daily partitions, but you can also do month, year and integer partitions.  Perhaps in your case yearly partitions might be appropriate if you have to query across multiple days at once.  If you were to use yearly partitions with Option 1 above and apply compression, then you will get smaller storage on the more repetitive files.  For example, if you create a small table with two metrics drawn from different-sized domains and compare the compression stats (I don’t want to assume anything about sort order at this stage):

q)n:10000000

q)t:([]metric1:n?500000?100f;metric2:n?1000?100f)

q)`:t1/ set t

`:t1/

q).z.zd:16 2 6
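/ .z.zd above sets default compression for subsequent writes: 2^16-byte logical blocks, algorithm 2 (gzip), level 6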

q)`:t2/ set t

`:t2/


aq$ du -sh *

153M t1

80M t2

aq$ du -sh t2

80M t2

aq$ du -sh t2/*

56M t2/metric1

24M t2/metric2

The disadvantage of yearly partitions and compression is that you might hit 32-bit limits quicker than if you were using daily partitions (assuming you are not a licensed customer).

For option 2, to get the last value for each pair you can do:

select by security, metric from table where date<D

To sub filter, you can do something like 

select from (select by security, metric from table where date<D) where (metric=`metric1) and value>10, (metric=`metric2) and value>5

You could also pivot the sub result and pull directly from there : http://code.kx.com/wiki/Pivot

(BTW you shouldn’t use value for a column name as it is a reserved word)
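Just to illustrate that "last value per pair" step on a toy table (made-up numbers, and with the column renamed to val as per the note above):

q)t:([]date:2010.01.01 2010.01.01 2010.03.25;security:3#`AAPL;metric:`metric1`metric2`metric2;val:11.42 32.05 33.12)
q)D:2010.04.01
q)select by security,metric from t where date<D
/ returns a table keyed on security,metric holding the last row per pair:
/ metric1 -> 2010.01.01 11.42, metric2 -> 2010.03.25 33.12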

Another approach would be to split the data into 2 tables.  Have one daily table (which may be daily partitioned, or still month/year partitioned) and one non-daily table which is stored down as a flat splay.  The non-daily table could either be stored in the name/value pair format of option 2, or the square format of option 1.  You could then access them independently and join the results at run time. 

Thanks

– 
Mark Rooney
Financial Software Developer
AQUAQ Analytics

Sorry, the last sub filtering example should be something like this:

select from (select from (select by security,metric from t where date<D) where ((metric=`metric1) and val>10) or (metric=`metric2) and val>5) where 2=(count;i) fby security

or easier to extend:

comparisons:([metric:`metric1`metric2]comp:10 5)

select from ((select by security,metric from t where date<D) ij comparisons) where val>comp,(count comparisons)=(count;i) fby security

or you could just use the Pivot.

Hello Mark,

Thanks for your quick response! I will study your reply and test it out on my end. Since I have daily data, I too believe yearly partitions make sense.

I am indeed planning on using the 32-bit version for my project. How big would my yearly dataset need to be for me to start running into issues with the 32-bit version? 

Finally, just in case this could make a difference, my data will be ordered by date and new data will be appended on a daily (end of day) basis. 

Thanks again for your help,

Rob

You could also use temporal data http://code.kx.com/wiki/Cookbook/TemporalData

t:([]sym:8#`a`b;met:8#`m1`m1`m2`m2;date:2014.01.01 + til 8;v:til 8)

tt:`s#3!`sym`met`date xasc t

`date xasc tt

sym met date      | v

------------------| -

a   m1  2014.01.01| 0

b   m1  2014.01.02| 1

a   m2  2014.01.03| 2

b   m2  2014.01.04| 3

a   m1  2014.01.05| 4

b   m1  2014.01.06| 5

a   m2  2014.01.07| 6

b   m2  2014.01.08| 7

tt((`a;`m1;2014.01.03);(`b;`m1;2014.01.03))

v

0

1

Note that you will need to re-apply the sort attribute every time you load the table from disk (since I think the sort attribute is not saved).

I have never used this with a partitioned table, but if you need to partition it, you will need to remove and reapply the key every time you save/load as well. Also, you might need the initial value at the start of each partition.
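For the save/load round trip, a rough sketch could look like this (the directory `:tt/ is just an example name):

`:tt/ set .Q.en[`:.;0!tt]               / splay the unkeyed data, enumerating the symbol columns
tt:`s#3!`sym`met`date xasc get `:tt/    / on load: re-sort, re-key and re-apply the sort attribute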

Hi Rob

In terms of size of the data set it depends upon how you are accessing it.  With yearly partitions the files are bigger, so if you are doing operations which access the full column (rather than being able to run within smaller date partitions) then you might hit problems. 

The thing I mentioned about compression doesn’t seem to apply any more with 3.2 (which is great!).  With 3.1 and previous (I don’t know the exact versions) accessing compressed data usually used more memory, but that seems to have gone now, e.g.

q)n:10000000

q)t1:t2:([]date:raze 100000#'`date$til 100;m1:n?10000;m2:n?10000;m3:n?10000)

q).Q.dpft[`:.;2000;`date;`t1]

`t1

q).z.zd:15 2 6

q).Q.dpft[`:.;2000;`date;`t2]

`t2

check size, load:

aq$ du -sh 2000/*

267M 2000/t1

74M 2000/t2

3.1:

q)\ts select from t1 where date=2000.01.30                                                                                                                                                                

16 4720384

q)\ts select from t2 where date=2000.01.30                                                                                                                                                                

19 7235544

q)\ts select from t1 where m1>8000                                                                                                                                                                        

296 75498880

q)\ts select from t2 where m1>8000                                                                                                                                                                        

1490 356025352

3.2:

q)\ts select from t1 where date=2000.01.30

16 4720384

q)\ts select from t2 where date=2000.01.30

20 4719984

q)\ts select from t1 where m1>8000

266 75498864

q)\ts select from t2 where m1>8000

1649 75987440

So maybe you can just use yearly partitions, put a `p attribute on date (which you can easily re-set after you append new data) and compression. 

The other approach is to split the tables and then join at run time: 

q)n:10000000

q)s:-100000 ? `5

q)t1:([]date:raze 100000#'`date$til 100;sym:raze 100#'s;m1:n?10000)

q).Q.dpft[`:.;2000;`date;`t1]

`t1

q)t2:([]date:raze 100000#'`date$til 100;sym:raze 100#'s;m2:n?10000;m3:n?10000) asc -10000 ? n

q).Q.dpft[`:.;;`date;`t2]

`t2

aq$ du -sh *

153M 2000

588K sym

248K t2

To join the full table:

aj[`sym`date;select from t1;update `g#sym from select from t2]

(or do something smarter depending on how you actually want to query the data)
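For example (purely illustrative, reusing the example tables above; D is whatever date you are querying), you could filter the daily table first and only join what you need:

q)D:2000.01.30
q)res:aj[`sym`date;select from t1 where date=D,m1>8000;update `g#sym from select from t2]
q)select from res where m2<5000    / then filter on the low-frequency metrics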

If the t2 table (the lower-frequency data) is small enough, then you could look at storing it with a `p attribute on sym rather than date, which will make these time joins quicker for your use case.  The issue is that when you append new data you have to re-sort the t2 table, so assuming that isn’t too time- or memory-intensive you should be fine. 
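Something like this would do the re-sort and re-apply the attribute on disk (assuming the flat splay lives at `:t2/):

q)`sym xasc `:t2/      / sort the splayed t2 on disk by sym
q)@[`:t2/;`sym;`p#]    / re-apply the parted attribute to the sym column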

Thanks

– 
Mark Rooney
Financial Software Developer
AQUAQ Analytics

Sorry, there appear to have been some formatting/text wrapping problems in my previous post. I have pasted the relevant code snippets below (hopefully without formatting issues this time!).

q)n:10000000

q)t1:t2:([]date:raze 100000#'`date$til 100;m1:n?10000;m2:n?10000;m3:n?10000)

q).Q.dpft[`:.;2000;`date;`t1]

`t1

q).z.zd:15 2 6

q).Q.dpft[`:.;2000;`date;`t2]

`t2

check size, load:

aq$ du -sh 2000/*

267M    2000/t1

 74M     2000/t2

3.1:

q)\ts select from t1 where date=2000.01.30     

16 4720384

q)\ts select from t2 where date=2000.01.30

19 7235544

q)\ts select from t1 where m1>8000

296 75498880

q)\ts select from t2 where m1>8000

1490 356025352

3.2:

q)\ts select from t1 where date=2000.01.30

16 4720384

q)\ts select from t2 where date=2000.01.30

20 4719984

q)\ts select from t1 where m1>8000

266 75498864

q)\ts select from t2 where m1>8000

1649 75987440

So maybe you can just use yearly partitions, put a `p attribute on date (which you can easily re-set after you append new data) and compression.
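For the re-set, an on-disk amend along these lines should do it (the path is just this example's yearly partition):

q)@[`:2000/t1;`date;`p#]    / re-apply the parted attribute to the date column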

The other approach is to split the tables and then join at run time: 

q)n:10000000

q)s:-100000 ? `5

q)t1:([]date:raze 100000#'`date$til 100;sym:raze 100#'s;m1:n?10000)

q).Q.dpft[`:.;2000;`date;`t1]

`t1

q)t2:([]date:raze 100000#'`date$til 100;sym:raze 100#'s;m2:n?10000;m3:n?10000) asc -10000 ? n

q).Q.dpft[`:.;;`date;`t2]

`t2

aq$ du -sh *

153M    2000

588K    sym

248K    t2

Hello Komsit,

Very interesting - I was not aware of this feature. I will look into it and report back if I have any questions.

Cheers,

Rob

Mark,

Thanks again for your quick reply. 

I have been running some tests thanks to your examples. 

For highly repetitive data (which will probably make up half of my columns or so) the compression ratio is quite impressive. For 30M data points, I went from 344MB to 18MB with your .z.zd settings. Given this impressive result, I think I might just go with the simplest approach - 1 large table that has a point for each date/security/metric combination, repeating values when no new data is available. 

This will facilitate my queries, especially since I want to handle dynamic queries that may potentially contain data transformation functions (moving averages, etc).
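For example, if I understand the syntax correctly, something like this should give a 20-day moving average per security (table and column names are just placeholders):

update ma20:20 mavg metric1 by security from t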

Thanks again for your help,

Rob