Efficient top n query

Hello!

I am trying to find an efficient way of expressing the following SQL query in q:

SELECT passenger_count, trip_distance, total_amount FROM trips ORDER BY total_amount DESC LIMIT 100;

The best I’ve been able to come up with:

`total_amount xdesc select passenger_count, trip_distance, total_amount from trips


This works, but is probably inefficient because without the limit clause the entire dataset has to be sorted. Additionally, the query allocates a lot of memory which causes table data to be evicted so all subsequent queries have to read from disk again (which takes about 8 minutes). When run in ClickHouse, this query uses very little memory and completes within 3s after the first run.

Are there any better ways of expressing this query that would lead to more efficient execution?

Thanks,

Clemens

If I understand correctly, the LIMIT clause in SQL specifies the number of rows to be returned that match the select criteria. You can achieve the same thing in kdb+ by using ‘100#’. For example, your query will look something like `total_amount xdesc 100#select passenger_count, trip_distance, total_amount from trips. This will select the first 100 rows that match your criteria and will only sort the result rather than the entire dataset. Please let me know if this is what you were looking for.
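As a tiny illustration of take on its own (a toy list, not your data):

q)3#100 200 300 400 500
100 200 300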

Thanks Alex! Not sure why I didn’t manage to find the # operator myself ;)
What I’m trying to do is to select the 100 rows where total_amount is largest (out of all rows).

So your suggested query only needs a small adjustment:

100#`total_amount xdesc select passenger_count, trip_distance, total_amount from trips

However, this seems to have the same performance/memory usage problems as my original attempt.

I’m guessing kdb doesn’t perform any query rewriting optimizations (at least for this query), so the 100# take is applied only after the entire dataset has been sorted.

The way I was hoping this query would be executed is by evaluating 100# and `total_amount xdesc jointly, using a constant amount of memory, e.g. by keeping a running set of the 100 largest elements in a min-heap.

I suppose this kind of operation might just not be supported?
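Something like this sketch is roughly the evaluation I have in mind (illustrative only, not how kdb+ actually executes the query; v stands for an in-memory copy of the total_amount column):

/ fold over fixed-size chunks, keeping only a running top 100;
/ peak memory is bounded by the chunk size, not by count v
step:{[acc;chunk]100 sublist desc acc,chunk};
step/[0#0f;0N 1000000#v]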

I just wonder, can any RDBMS do that without sorting the whole dataset? I think the best they can do is collect some statistics and pick a better sorting algorithm?

You’re right that this query still requires scanning the entire dataset, but it doesn’t necessarily require a sort/copy of the entire dataset if you maintain a running set of just the 100 largest elements during query execution.
ClickHouse claims to be able to do this (https://clickhouse.yandex/docs/en/query_language/queries/#order-by-clause) and in fact doesn’t show any increase in memory usage when running this query. kdb+ grows its memory usage to the maximum available on my system (60GB).

The ‘select’ statement allows a few extra parameters:

select[n;order]

documented here:

https://code.kx.com/wiki/Reference/select

with some sample data, we can see that the query is faster than sorting the whole table, and produces the same results:

q)trips:flip `passenger_count`trip_distance`total_amount!10000000?/:(100;100f;100f)
q)\ts a:100#`total_amount xdesc select passenger_count, trip_distance, total_amount from trips
2399 536872368
q)\ts b:select[100;>total_amount] passenger_count, trip_distance, total_amount from trips
1602 536871520
q)a~b
1b

Many thanks, this looks like it might do what I want!
The select[n] form does not work with memory-mapped partitioned tables, which dissuaded me from trying it before.

But presumably I could just load the relevant columns into an in-memory table and then run the query on that.

My first attempt at doing so:

trips2:([]passenger_count:(select passenger_count from trips);total_amount:(select total_amount from trips);trip_distance:(select trip_distance from trips))

Unfortunately this fills up all my memory, and all my swap, and then dies.

Same goes for this second attempt:

pc:select passenger_count from trips

ta:select total_amount from trips

td:select trip_distance from trips

trips2:([]passenger_count:pc;trip_distance:td;total_amount:ta)

The original dataset contains 1.43 billion rows, and a rough back-of-the-envelope calculation gives me (4 + 1 + 4) bytes × 1.43 × 10^9 ≈ 12 GiB, so I don’t really know where that extra memory is coming from.

The select queries on the individual columns work as expected; the memory increase only comes when I create the trips2 table.

Even this runs out of memory which makes me think I’m just making some noob mistake:

ta:select total_amount from trips

trips2:([]total_amount:ta)

Just running this causes ~4GB in memory growth, which is about 3-4 times higher than I would have expected:

pc:select passenger_count from trips

About 6GB for this which is closer but also strange:

ta:select total_amount from trips

So I hypothesized that select might return data in some result format that is not very efficient or suitable for creating tables.

Running type gives me:

q)type pc
98h
q)type ta
98h

I haven’t quite figured out what this means yet, but the original types were X and E so this is a bit surprising.
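A quick check suggests 98h is simply the type number for a table, so each of those selects returns a one-column table rather than a bare vector; exec gives the vector itself (a toy example, not my data):

q)type select a from ([]a:1 2 3)  / 98h: a table
q)type exec a from ([]a:1 2 3)    / 7h: a long vector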

Anyway, it’s getting late so I will resume my investigation tomorrow if I can find the time.

Thanks all for the help so far.

I see. It would probably be useful to use a combination of sublist and iasc to achieve what you want. For example: 

select passenger_count, trip_distance, total_amount from trips where i in 100 sublist iasc total_amount

(or, if the table is partitioned: 100 sublist `total_amount xasc select passenger_count, trip_distance, total_amount from trips where i in 100 sublist iasc total_amount)


iasc returns the indices needed to sort the list argument and sublist will take the first 100.
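A tiny illustration of the two together:

q)iasc 30 10 20
1 2 0
q)2 sublist iasc 30 10 20
1 2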


Let’s assume we have this table:

trips:([]passenger_count:1000000?500i;trip_distance:1000000?1000f;total_amount:1000000?10000f)


The query from above will return:


passenger_count trip_distance total_amount
------------------------------------------
279             692.8919      0.008079223
201             347.8761      0.02150191
334             485.4019      0.02427492
176             676.7511      0.07542549
300             762.484       0.07903203
299             119.5263      0.0825664
106             153.9039      0.1044618



Which is the same as what we would get from:


q)100#`total_amount xasc select passenger_count, trip_distance, total_amount from trips
passenger_count trip_distance total_amount
------------------------------------------
279             692.8919      0.008079223
201             347.8761      0.02150191
334             485.4019      0.02427492
176             676.7511      0.07542549
300             762.484       0.07903203
299             119.5263      0.0825664
106             153.9039      0.1044618


q)\ts `total_amount xasc select passenger_count, trip_distance, total_amount from trips where i in 100 sublist iasc total_amount
124 25167104

q)\ts 100#`total_amount xasc select passenger_count, trip_distance, total_amount from trips
158 29361632


But, as the timings show, slightly more efficient.


Regards

Alex



Magnificent, thank you for this wizardry!
This is the query I ended up using:

100 sublist `total_amount xdesc select passenger_count, trip_distance, total_amount from trips where i in 100 sublist idesc total_amount

It runs within 14s and 2GB of resident RAM which beats my original query by orders of magnitude.

Try separating the retrieval of total_amount from the rest of the table. All of total_amount is needed, but only 100 records of the whole table.

Here are some clues as to why you’re seeing that memory footprint: https://www.aquaq.co.uk/q/adventure-in-retrieving-memory-size-of-kdb-object/

idesc (aka monadic >) sorts the column and returns the indices. These indices can be used to retrieve table rows. But idesc has to sort everything, which is expensive; we only need the top separated from the rest.
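A minimal sketch of that separation on a flat in-memory table (same names as earlier in the thread):

ta:exec total_amount from trips   / pull only the sort column
ix:100 sublist idesc ta           / row indices of the 100 largest values
trips ix                          / fetch just those 100 rows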

Here’s an experiment to get better performance for this use case: do a binary search that converges on the threshold, then pick the rows that are above that threshold. It seems to give a 3-4x speed-up and uses about 30 times less memory. This approach can also run in parallel, which helps speed a bit but costs a lot more memory. It’s not fully optimal.

topN:{[list;n]
  / generate a series like .. 32 16 8 4 2 1
  bits:reverse 2 xexp til ceiling 2 xlog max[list]-min[list];
  / estimate the threshold where at least n values are larger
  / cmp:{[n;list;x]n>sum .Q.fc[x<]list}[n;list]; / uncomment for parallel
  cmp:n>sum list>;
  threshold:{$[x r:y+z;y;r]}[cmp]/[0;bits];
  / harvest and get the precise top
  n#key desc result!list result:where list>threshold
  };


trips:flip `passenger_count`trip_distance`total_amount!10000000?/:(100;100f;100f)


q)\ts show result:topN[trips`total_amount;100]
/409 17826400
q)\ts show expected:100#idesc trips`total_amount;
/1505 536871552


The indexes can be used to retrieve rows:


q)trips result

Will

Nice one!
Your query didn’t quite work for me because my trips is a partitioned memory-mapped table (and there were some other type mismatches), but I was able to construct this (less idiomatic) query based on your idea:

cnt:count trips

minmax:select min total_amount, max total_amount from trips

tamin:first minmax`total_amount

tamax:first minmax`total_amount1

low:0

mid:cnt%10

high:cnt

estimate:((tamax - tamin) % 10) + tamin

cnt2:count select total_amount from trips where total_amount > estimate

while[(cnt2<100)|(cnt2>100000);
  estimate:(((tamax - tamin)*mid) % cnt) + tamin;
  cnt2:count select total_amount from trips where total_amount > estimate;
  if[cnt2<100;high:mid;mid:mid - ((mid - low)*0.9)];
  if[cnt2>100000;low:mid;mid:mid + ((high - mid)%2)]]

result:select total_amount, passenger_count, trip_distance from trips where total_amount > estimate

It’s also not quite a binary search (change the % 10 and * 0.9 to % 2 to make it one); those adjustments were necessary to ensure reasonably fast convergence on my dataset, which is strongly skewed, with most values small but a few very large.

So this won’t work in general as is, but I believe there are fairly robust methods that could ensure fast convergence on all but the most degenerate datasets by utilizing information about the value at the current pivot which could be computed concurrently with cnt2. (I will leave that as an exercise to the reader :P)

I’ve also omitted the final sort, but that shouldn’t affect performance since the result set is small.
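(If I ever need the ordering back, restoring it on the small result should be cheap, e.g. 100 sublist `total_amount xdesc result.)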

Runtime is now down to 5.2s:

q)\ts cnt:count trips;minmax:select min total_amount, max total_amount from trips;tamin:first minmax`total_amount;tamax:first minmax`total_amount1;low:0;mid:cnt%10;high:cnt;estimate:((tamax - tamin) % 10) + tamin;cnt2:count select total_amount from trips where total_amount > estimate;while[(cnt2<100)|(cnt2>100000);estimate:(((tamax - tamin)*mid) % cnt) + tamin;cnt2:count select total_amount from trips where total_amount > estimate;if[cnt2<100;high:mid;mid:mid - ((mid - low)*0.9)];if[cnt2>100000;low:mid;mid:mid + ((high - mid)%2)]];result:select total_amount, passenger_count, trip_distance from trips where total_amount > estimate

5204 204304

 

There are probably further optimizations to be made, but I think this will suffice for my purposes.

Many thanks,

Clemens

Nice work, well done :)

I’ve experimented a bit more: assuming a normal distribution, the initial estimate can be improved. Luckily Kx has posted a function that approximates the z-score for a given probability.

Adding multithreading (-s 20) gave a more pronounced speed improvement now, up to 4x. I figure the internal parallelism kicks in both for the statistics and the population measurements.
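For reference, the estimate in topN below works out to est = avg + xn 1 - n % count, i.e. the usual mean + z-score × sigma threshold with sigma taken as 1; that shortcut only holds because the nor-generated test data is standard normal. For general data you would presumably also query dev and scale the z-score by it.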

/ run with -s 20

/ https://en.wikipedia.org/wiki/Standard_deviation

/ https://statistics.laerd.com/statistical-guides/standard-score-3.php

/ from https://github.com/KxSystems/kdb/blob/master/stat.q

pi:acos -1

xn:{$[.5>x;0-.z.s 1-x;.92>x;
  (x*2.50662823884+l*-18.61500062529+l*41.39119773534+l*-25.44106049637)%1+l*-8.47351093090+l*23.08336743743+l*-21.06224101826+3.13082909833*l:x*x-:.5;
  0.3374754822726147+l*0.9761690190917186+l*0.1607979714918209+l*0.0276438810333863+l*0.0038405729373609+l*0.0003951896511919+l*0.0000321767881768+l*0.0000002888167364+0.0000003960315187*l:log 0-log 1-x]}

nor:{$[x=2*n:x div 2;raze sqrt[-2*log n?1f]*/:(sin;cos)@\:(2*pi)*n?1f;-1_.z.s 1+x]}


topN:{[table;field;n]
  / estimate threshold assuming a normal distribution
  / use functional form to allow parameterization
  stats:first ?[table;();0b;`count`avg`max!((count;field);(avg;field);(max;field))];
  est:stats[`avg] + xn 1 - n % stats`count;
  cmp:{[table;field;n;est]
    n>first ?[table;enlist(>;field;est);0b;enlist[`x]!enlist(count;`i)]`x
    }[table;field;n];
  / double window to the left until happy, account for overestimation
  threshold:{[top;x]top- 2*abs top-x}[stats[`max]]/[cmp;est];
  / harvest
  ?[table;enlist(>;field;threshold);0b;()]
  };


generatePartition:{
  -1"generating partition for ",-3!x;
  n:5000000;
  `trips set update total_amount:nor[n] from flip `sym`passenger_count`trip_distance!n?/:(`5;100;100f);
  .Q.dpft[`:db;x;`sym;`trips]
  };


/ generatePartition each 2015.01.01+til 10;

\l db

\ts show t:`total_amount xdesc topN[trips;`total_amount;100]

570 4494000

Will