Hey all,
New to kdb+ and have a few questions on optimal database structure.
I have a daily point-in-time dataset of financial metrics for a large universe of stocks. Some of these metrics are updated daily, others are only updated once in a while (a few times per year). I need to query this database efficiently in the following manner: on day d, which stocks match a given set of criteria (say metric1 > 10 and metric2 < 5). In this example, metric1 is updated daily and metric2 is updated quarterly.
Here are the two different approaches I can think of:
Option 1
A straightforward approach would be to have a table that simply repeats the last value for metrics that are not updated frequently (in this case, metric2 is repeated unless its value changes). This makes for simple queries, but will require a lot of storage space (unless kdb+ optimizes away repeated values? Can it?).
date security | metric1 metric2 …
2010.01.01 AAPL | 11.42 32.05 …
…
Option 2
Another option I see is to only store the updates:
date security metric value
2010.01.01 AAPL metric2 32.05
2010.03.25 AAPL metric2 33.12
…
This approach uses less space, but the query complexity increases - not sure about the performance but I assume it will also be slower. To obtain my results, I need to get the latest value for each security&metric pair, then filter down that intermediate dataset based on the query. I am still unsure on how to efficiently build this query (or set of queries). How do I efficiently fetch the most recent data point for each security&metric pair? Then how do I query this intermediate result? Do I have create a pivot table?
I hope I explained my situation clearly - if not please let me know and I will elaborate. My questions is which of these options, if any, is a sensible approach to my problem? Is there a better structure that I am overlooking?
My primary concern is speed, but since the dataset is rather large, storage space might be an issue. I prefer Option 1 but I am afraid this type of table will take up too much space, unless kdb+ can optimize away the repeated values (for storage purposes).
Any insight will be much appreciated. Thanks in advance!
- Rob