slow aggregate by multiple columns - 10x slower than J

I’m evaluating kdb against an existing J application and was surprised to see how slow an aggregate over multiple columns is - both in memory and against an HDB

The application aggregates approximately 51 million rows down to 13 million customer-product combinations

The kdb code takes approximately 31 seconds vs 3 seconds with J

Is there a faster way to do the sum in kdb?

/test 1 - using symbols

n: 13000000;

cust: n?`8;

prod: n?`8;

v: n?100

a:([]cust:cust; prod:prod; v:v)

q)\t select sum(v) by cust, prod from a

31058

/test 2 - using strings, very slow

n: 13000000;

cust: string n?`8;

prod: string n?`8;

v: n?100

a:([]cust:cust; prod:prod; v:v)

q)\t select sum(v) by cust, prod from a

116745

comparison J code

n=:13000000

cust=: _8[\ a. {~ (65+?(8*n)#26)

prod=: _8[\ a. {~ (65+?(8*n)#26)

v=: ?.n#100

agg=: 3 : 0 

keys=:i.~ |: i.~ every (cust;prod)

c=.((~.keys) { cust)

p=.((~.keys) { prod)

s=.keys +//. v

c;p;s

)

NB. 3.57 seconds

6!:2 'r=.agg 0'

3.57139

 ({.@$) every r

13000000 13000000 13000000

After further testing, I was able to get it down to about 4x difference instead of 10x:

\t {sum each x[`v][group[flip (x[`cust]; x[`prod])]]}(select v, cust, prod from a)

12887

This is getting warmer:

\t {sum each a.v[(x?x)]} (flip (a.cust;a.prod))

7342

Still, J is at 3.5 seconds.
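For reference, the group-based version above can be packaged as a small function that returns a table rather than a dictionary. A minimal sketch (aggGrp is just an illustrative name; assumes the in-memory table a from test 1):

aggGrp:{[t] g:group flip (t`cust;t`prod); (flip `cust`prod!flip key g),'([]v:sum each t[`v] value g)}   / group on the composite key, then sum v per group

r:aggGrp select cust,prod,v from a   / same content as select sum v by cust,prod from a, up to row order and keying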

Does your data really look like that?  The test data I generated had 13m unique combinations … In which case what you want is just a! 

If your data is predominantly unique, you could maybe do something like 

q)\t r:(`cust`prod xkey a inds) + select sum v by cust,prod from a til[count a] except inds:(select cust,prod from a) ? d:distinct select cust,prod from a

5861

i.e. grab one index of each distinct item and add to it the sum by cust,prod of the remaining items.  There might be better ways
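Broken into steps, the idea is roughly this (the intermediate names are illustrative):

k:select cust,prod from a   / just the key columns

d:distinct k   / one row per distinct cust,prod combination

inds:k?d   / index of the first occurrence of each distinct key

rest:til[count a] except inds   / every other row

r:(`cust`prod xkey a inds) + select sum v by cust,prod from a rest   / keyed-table addition merges the two pieces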

No, my actual data is about 5x that size (around 50 million rows) but has 13 million distinct combinations. The sample data was just used to reproduce the cost of the grouping

That sped up the in-memory query:

q)\t r:(`cust`prod xkey a inds) + select sum v by cust,prod from a til[count a] except inds:(select cust,prod from a) ? d:distinct select cust,prod from a

6809

on my actual data:

q)view::select cust, prod, v from tbl where month within (2014.01m;2014.06m)                                                   

q)\t r:(`cust`prod xkey view inds) + select sum v by cust,prod from view til[count `view] except inds:(select cust,prod from `view) ? d:distinct select cust,prod from `view

12672                                                                                                                                                                          

compared to:

q)\t select sum(v) by cust, prod from `view

9944

q)count d

6437008

Parallel processes don’t speed it up. Each process is actively working and completes within a few seconds, but the bulk of the time is spent in the master merging the results
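For context, a split-by-month map looks something like the sketch below (whether over secondary threads with peach or over separate processes, the shape is the same; tbl is the partitioned table from the query above). The per-month sums come back quickly - it is the re-aggregation of the partial results in the main process that dominates:

parts:{select sum v by cust,prod from tbl where month=x} peach 2014.01m + til 6

r:select sum v by cust,prod from raze 0!'parts   / the merge step that takes most of the time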


Do you need the view? 

select sum v by cust,prod from tbl where month within  2014.01 2014.06m

That will execute in parallel across partitions, which should provide a win, particularly if you are using SSDs / a segmented database or the data is hot in the file cache.

On my creaky spinning rust box I get almost a 2x speedup using 2 cores
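The parallel map over the partitions only kicks in if q has secondary threads; a quick way to check (the db path and thread count below are illustrative):

\s   / number of secondary threads - start q with e.g. "q db -s 4" to enable them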

tbh, I think you found a very good solution yourself (the second one in your second post)…

Focusing on on-disk data, as your “actual” data is on disk…

q -s 10

n: 13000000;

cust: n?`8;

prod: n?`8;

v: n?100

a:([]cust:cust; prod:prod; v:v)

month:n?2014.01m + til 6

a:([]month:month; cust:cust; prod:prod; v:v)

/ write one splayed, sym-enumerated partition per month under the current directory
(` sv/:(hsym each `$string distinct month),\:`a`) set' .Q.en[`:.;] each {select from a where month=x} each distinct month

\l .

`a in .Q.pt

1b

//sum by

\ts:3 r1:select sum v by cust,prod from a where month within  2014.01 2014.06m

302857 2046823856

/avg of 100952.3

//mod yours to work with on-disk data

\ts:3 r2:select cust,prod,v:sum each v[{x?x}flip(cust;prod)] from a where month within  2014.01 2014.06m

4947 805309584

/avg of 1649 - 1.6 seconds - 61X speedup

//fby  - never realized it was so much faster than select by

\ts:3 r3:select cust,prod,v from a where month within  2014.01 2014.06m,v=(sum;v) fby (cust;prod)

11225 335547120

/avg of 3741.667 - 27X speedup - you may also want to consider this given the memory usage of yours

//check if results are the same

(mmm[0]~mmm[1])&mmm[0]~mmm[2]

1b

don’t forget to play about with attributes and you should see a speed-up - but I don’t think you can here, as the 13m rows are a subset of the full data anyway
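For the in-memory table, applying an attribute would look something like this (illustrative only - whether it actually pays off for the two-column grouping is something to test, and as noted it may not be applicable to the full dataset):

update `g#cust from `a   / grouped attribute on one of the key columns

\t select sum v by cust,prod from a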

HTH and interesting question.

Sean


fby only works here because the synthetic dataset is all unique keys. Any cust,prod key whose v has more than one element will be omitted from the result.

So when operating on the full 50M set which has non-uniques the answer will be incorrect.

eg.

//fby returns empty table 

t:([]cust: 1 1 2 2; prod: 0 0 1 1; v: 1 2 3 4);

select sum v by cust,prod from t;

select cust,prod,v from t where v=(sum;v) fby (cust;prod);

//fby “works” 

t:([]cust: 1 2 3 4; prod: 0 0 1 1; v: 1 2 3 4);

select sum v by cust,prod from t;

select cust,prod,v from t where v=(sum;v) fby (cust;prod);

Fair enough… point taken. I should have considered that. I also just realised joebo’s query does the same thing, so I was just result matching.

Nonetheless, rearrange the query and omit the “where v”…

distinct update v:(sum;v) fby (cust;prod) from select cust,prod,v from a where month within 2014.01 2014.06m

I thought fby would simply sum the requested groupings across the whole table, but it seems to sum the groups per partition, hence the nested select is required. It could be an awful lot quicker without it.

Anyway, this gives a 10X speedup over “select sum v by”, and this time it deals with duplicate combinations.
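On the small duplicate-key table t from the example above, the rearranged form can be sanity-checked against the plain group-by:

t:([]cust: 1 1 2 2; prod: 0 0 1 1; v: 1 2 3 4)

select sum v by cust,prod from t   / (1;0)->3, (2;1)->7

distinct update v:(sum;v) fby (cust;prod) from select cust,prod,v from t   / same two summed rows, unkeyed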

Thanks,
Sean