does kdb+ support aggregation on non-time fields?

Hi guys,

Looking to find some info if KDB+ supports aggregation/roll-up/group-by on non-time fields? 

My use case: historic price data, which needs to be compared. 

Most queries should happen on the last known price-point of a particular ‘product’ and than roll-up over various dimensions such as product-category and/or brand. 

I’ve seen druid.io which seems to be marketed for this case, but it seems a minimal production cluster runs to be pretty costly (ymmv). Moreover, it uses zookeeper which I’ve had a lot of problems with in the past. 

Instead, I thought that if I could do this with KDB+ 32bit, possibly splitting the load over multiple processes (which seems possible), I could get away with a much lighter setup. I’ve read somewhere that it is possible to only keep the latest price-point in RAM (which is what most queries use) and do the occasional historic queries from disk/ssd. This should keep me within the 4GB ram * #32 bit processes. 

Lastly, perhaps it’s useful to mention that the dimensions along which data should be aggregated are known up front. Not sure if kbd+ has something like materialized views that could help here.

Would this work? Any suggestions more than welcome.

Best,

Geert-Jan

Hi Geert-Jan,

You can get the last row from a table for each combination of product, category and brand with a query like this:

select by product,category,brand from table

If you wanted the average price over each product,category and brand you could do:

select avg price by product,category,brand from table

A subscriber to a tickerplant in a kdb+ tick setup that keeps the last record by sym and exchange in a table would look like this:

upd:{[t;x] .[t;();,;select by sym,ex from x]}

h:hopen `$“:”,.z.x 0 

h(“.u.sub”;;

So you could do something similar, defining the upd function to be:

upd:{[t;x] .[t;();,;select by product,category,brand from x]}

Thanks

Ben

Thanks Ben - that’s easier than I thought. 

The terse syntax is something to get used to though. 

Looking through the cookbook this might just be the new db to learn for me!

As a ‘stress-test’ would you mind sharing if the following more involved queries would be possible as well in your opinion? 

Not looking for actual implementations - unless you feel like it ;)

given: 

  • dictionary of format: ‘timestamp | tenantId | productId | productCat | brand | price’

  • current tenant is identified as tenantId=1

query in pseudo-code: 

  • show nr of products where current tenant is cheapest (as checked by last price) checked over all tenants grouped by brand

  • show products where current tenant is cheapest (as checked by last price) grouped by tenant (other than self) limit 10 products per tenant start at product #20

  • show avg = (last(price) over all tenants)), relativePriceAsPercentage = ((last(price) of tenantId=1) - avg / avg) order by relativePriceAsPercentage desc limit 10

Thanks a lot,

Geert

so you want to start with some sample data…

q)f:{asc (timestamp:x?2000.01.01D;tenantId:x?10;productId:(x?10)+10*b;brand:b:x?10;price:(x?1000)%100)}

q)f 10

timestamp tenantId productId brand price

------------------------------------------------------------

2000.01.18D14:22:04.754586070 7 10 1 6.41

2000.01.30D11:54:47.978075444 0 47 4 3.62

2000.06.02D05:57:18.775929808 6 38 3 3.78

2000.09.09D14:38:41.996048836 0 17 1 6.06

2000.09.11D10:20:44.273744524 5 63 6 3.84

2001.03.17D02:46:34.172586504 8 44 4 7.38

2001.07.03D16:27:15.278282312 0 29 2 3.04

2002.01.03D12:01:25.593561080 0 25 2 0.5

2003.01.31D01:14:28.967411664 9 73 7 5.9

2003.04.15D14:31:42.939721936 7 76 7 7.06

q)t:f 1000 /thousand row table

and i’ll hand the baton over to the next poster for the queries as my favourite show is on…

best,

jack.

Hi Geert-Jan,

Those queries should all be possible; using the table that effbiae posted:

  • show nr of products where current tenant is cheapest (as checked by last price) checked over all tenants grouped by brand

q)select count productId by brand from (update minprice:min price by productId from select by productId,tenantId from t) where tenantId=1,price=minprice

brand| productId

-----| ---------

3    | 1

7    | 2

8    | 1

This gets a table with the last prices for each combination of productId and tenantId, and adds a column with the minimum price by productId to it. Then it selects where the tenantId is 1 and the price is the minimum, groups them by brand and counts them

  • show products where current tenant is cheapest (as checked by last price) grouped by tenant (other than self) limit 10 products per tenant start at product #20

I’m not sure if this is exactly what you’re looking for here, but this will get for each tenant the first 10 productId’s greater than 20 where that tenant is the cheapest for the product

q)select 10 sublist productId by tenantId from select from (update minprice:min price by productId from select by productId,tenantId from t where productId>=20) where price=minprice

tenantId| productId

--------| -----------------------------

0       | 41 42 45 81 88

1       | 36 73 75 83

2       | 22 34 50 84 87 91 95 99

3       | 28 30 47 53 55 58 60 78

4       | 25 29 57 67 68 79 80 93 98

5       | 23 37 43 61 64 70 72 90 96 97

  • show avg = (last(price) over all tenants)), relativePriceAsPercentage = ((last(price) of tenantId=1) - avg / avg) order by relativePriceAsPercentage desc limit 10

q)10 sublist `relprice xdesc select productId,price,avprice,relprice:(price-avprice)%avprice from select from (update avprice:avg price by productId from select by productId,tenantId from t) where tenantId=1

productId price avprice relprice


32        8.54  3.928   1.174134

2         8.4   3.908   1.149437

10        9.58  4.728   1.026227

33        9.65  5.067   0.90448

65        9.29  4.893   0.8986307

11        7.43  3.921   0.8949248

14        9.89  5.297   0.8670946

80        7.68  4.165   0.8439376

79        6.96  3.802   0.8306155

28        9.75  5.365   0.8173346

This adds the average price column on to a table of the last prices for each combination of productId and tenantId, then selects from this where the tenantId is 1, and calculates the relative price from the result. 10 sublist `relprice xdesc … sorts the table by relprice and takes the first 10

Thanks

Ben

This is awesome! Thanks.

Hard time that I commit some time to start learning q and kdb. 

Cheers,

Geert

> learning q and kdb. 

it takes some effort (unless you’re a genius) but the effort is well worth it.  

learning kdb and where it came from can be a rewarding journey.

Definitely. Plowing through ‘Q for mere mortals’ as we speak :)