Finding the oldest record in a large partitioned table

Hi all - enjoying this mailgroup a lot - thanks for all the useful input! Two questions:

  1. Given the below, is there a faster solution to find the oldest record?

I’m assuming a large partitioned table and a maximum timeframe of 50 years.

  1. Is there a prettier way to find out if there’s a record for a given date?

(I’m still finding out about grepping the directory structure on disk, i.e. do we have 

a non-empty file /tmp/db/1995.01.01/t/date)


create:{[size;date]

sym: `$“:/tmp/db/”,(string date),“/t/”;

t:( date:x+00:30*til size; price:size?10);

sym set t

}

/ 10000 directories - 110MB - uncomment to generate:

/ create[100] each .z.D - til 10000

/ is there a prettier way?

check:{0 < first exec cnt from select cnt: count date from t where date=x}

/ x is the search window which will be repeatedly halved

find:{

/ generate halfway marks for search

steps: reverse {2*x}[x>;1];

/ if date exists, turn left, else return right

{$[check x;x-y;x+y]}/[.z.D;steps]}

system(“l /tmp/db”)

show “binary search”

\t find[100*365]

/ 2 ms on my Mac

show “select min date from t”

\t select min date from t

/ 480 ms on my Mac


Thanks,

Will

Hi,

The counts of partitions can be retrieved usign the .Q namespace, of which there is an example here: http://code.kx.com/q/ref/dotq/#qpn-partition-counts

For a table t, .Q.cn will get the partition counts, which will then be visible in .Q.pn.

The counts for each date can be viewed with .Q.pv!.Q.pn`t.

You should be able get the oldest partition containing values with: .Q.pv first where 0<.Q.pn`t.

The dictionary: (.Q.pv!0<.Q.pn`t) should allow you to find out if a partition is populated for table t.

Regards,

Thomas

AquaQ Analytics

Hi Thomas,

Thanks for pointing that out, that seems like a great option. In the environment I need this I won’t be able to run these functions - consequences of specific large scale systems.

The above works well, however, is there a cleaner way to phrase this query? The exec.. select … approach comes from ‘Q for Mortals’ to work around a limitation in partitioned tables.

check:{0 < first exec cnt from select cnt: count date from t where date=x}

Thanks,

Will

Since you are querying HDB’s… 

Try “select cnt:count i=0 from t where date=x”   -> This uses the same concept of .Q.cn and .Q.pn internally and maps to find the partitions in a second. Please note, this gives exceptional performance, but only for HDB’s.

Regards

Sujoy