Partition HDB by date vs sym

Hello,

If I have just OHLCV data going back 30 years, is it still better to partition this by date?  It seems like it would be a waste of file I/O with so many dates, and guessing it would be slower to query by symbol over the entire history.  Would it be better if I partitioned this by symbol?

Thanks

a db cannot be partitioned on sym (unless you map the values to integers). the partition domain must be an integral type: date, month, year or int.

sounds like you might like to partition by month, but remember that each day you load new values, you'll need to re-save the whole month's partition. also note that your 'where' clause should always begin by filtering on month.

q)n:100000
q)t:`date xasc ([]date:n?.z.D;sym:n?`3;o:n?1f;h:n?1f;l:n?1f;c:n?1f;v:n?100)
q)g:t group "m"$t`date
q){[p;t].Q.dpft[`:/tmp/db;p;`sym;`ohlcv set t]}'[key g;value g];
q)\l /tmp/db
q)meta ohlcv

c     t f a
month m    
sym   s   p
date d    
o     f    
h     f    
l     f    
c     f    
v     j    
q)select from ohlcv where month="m"$.z.D
month   sym date       o         h         l         c         v
-----------------------------------------------------------------
2019.06 aal 2019.06.07 0.6585121 0.2368865 0.2398769 0.743995  70
2019.06 abj 2019.06.16 0.945153  0.9333705 0.5127791 0.1373071 95
2019.06 abm 2019.06.05 0.5951452 0.4736208 0.4860415 0.5169919 87
2019.06 adb 2019.06.18 0.1304718 0.4402366 0.5900309 0.3641042 69
2019.06 aea 2019.06.07 0.5544269 0.9710216 0.2554513 0.3101449 34
2019.06 aff 2019.06.14 0.1325721 0.9624151 0.6051182 0.5902899 86
2019.06 aga 2019.06.18 0.4927291 0.1165071 0.513648  0.1028709 15
2019.06 agd 2019.06.12 0.540583  0.1415103 0.1403603 0.2423614 57
2019.06 aif 2019.06.03 0.841483  0.4736343 0.409631  0.4257846 0
2019.06 ajc 2019.06.04 0.2016999 0.368853  0.6672423 0.8280397 86
2019.06 ajf 2019.06.18 0.5192833 0.6322922 0.7383055 0.1426414 75
2019.06 ajh 2019.06.03 0.3839794 0.5397272 0.4614627 0.3412326 63
2019.06 ajk 2019.06.17 0.5619156 0.4685905 0.8980506 0.4295941 53
2019.06 akg 2019.06.18 0.9107939 0.6785924 0.7272209 0.3877076 38
2019.06 akh 2019.06.14 0.4304974 0.7199357 0.6680404 0.8985443 30
..
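On the monthly re-save point above: each day you would read back the current month's rows, append the new day, and write the partition again. A minimal sketch in the same style, where `today` is a hypothetical table holding the day's new rows (same schema as ohlcv, no month column). Note that `ohlcv set` shadows the mapped table in the session:

q)m:"m"$.z.D
q)mt:delete month from select from ohlcv where month=m  / this month's existing rows
q).Q.dpft[`:/tmp/db;m;`sym;`ohlcv set mt,today]         / re-save the whole month

Reload with \l /tmp/db afterwards so queries see the new rows.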

Hi Roni,
Assuming that you're querying individual syms across the entire history, then partitioning by int (where the ints map to syms) makes sense, as the data is already segregated by sym and the number of file I/O operations is minimised. Can you elaborate on the number of syms present and the daily volume of records coming in?

David

Oh ok, I thought I could partition by sym, but partitioning by month will work. Thanks for this. I should have roughly 8k syms and one record per sym daily.

Thanks much appreciated

Roni Hoffman
1-519-859-1082

Hi again Roni, just following up.

Whatever method of partitioning you settle on really depends on the sort of queries you're likely to be running.
You obviously don't want to be loading data you aren't actually interested in, with regard to both memory use and query speed.

If your queries are focused within a particular window of time and consider all syms, then some sort of temporal partitioning makes sense.
If your main focus is a particular sym, then I'd partition by "symbol" (int). In that case you could do the following to partition by sym (with some modification for your own data set):

Create a sym-int map for your syms:

map:s!til count s:distinct table`sym

Add a new column to your table that contains the int value for the sym in that record (passing the table by name updates the global in place):

update sym_int:map sym from `table

Then save the data down to the separate int partitions. .Q.dpft takes the table by name, so keep a reference to the full table and write each sym's slice to the `table global before saving:

src:table
{[p].Q.dpft[`:hdb;p;`sym_int;`table set select from src where sym_int=p]}each til count map
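Once saved, queries against the int-partitioned db should filter on the virtual int column first, translating syms through the map. A sketch, assuming you also persist the map to the hdb root under a hypothetical name `symmap` (the sym `IBM` is also just an example):

`:/hdb/symmap set map                 / persist the sym-int map alongside the db
\l /hdb
map:get`:/hdb/symmap
select from table where int=map`IBM   / int is the virtual partition column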

As your incoming data is simply being appended to a daily partition, you might want to create another database that's partitioned by "sym" (int) and transfer the data over, depending on what sort of resources you have.
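One hedged sketch of that end-of-day transfer: append the day's rows to each sym's splayed table in the int-partitioned db, where `today` is a hypothetical in-memory table of the day's records with the sym_int column already added. Note that appending this way loses the parted attribute on disk, so you may want to reapply it or periodically rewrite the partition with .Q.dpft:

{[p]hsym[`$"/hdb/",string[p],"/table/"]upsert .Q.en[`:/hdb]select from today where sym_int=p}each exec distinct sym_int from today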