Merge joins and other performance related questions

I often find myself in need to join two tables which are sorted and unkeyed.

Do i understand correctly, there’s no O(n) operation to do it? Merge join would do well, but it’s not here, or I missed it. I have to do it in C now.

Also, i often do v[where v within (a;b)] on large enough v where resulting value can only be 10% or less. During evaluation, the intermediate large vector is built (result of within function), then it is passed to where function, then indexed application to v is done. Things are even worse in case of v[where (v>a) and (v<b)] and similar requests.

I wrote where_within in C, it resulted in 10x speedup in my usual case, but maybe I miss something important.

Also I cannot imagine the implementation of distinct, look at this case:

q)t:20000000?20000
q)ts:`$string t
q)\t distinct ts
58
q)\t sum t
26

The distinct is almost O(n). How can it be that good?

What hash set implementation do they use for symbols?

I understand, work is done on interned pointers. I could only invent hash map where i use pointer>>4 as a hash, and then just modulo internal table size (large enough), and using next slots if current is busy. I don’t know how to make this approach generic, apart from making internal table size large enough (3 times the initial array, for example). Any general thoughts on “distinct” implementation?

Thank you.

P.S. I understand and appreciate the minimalist approach of kdb+ though.

  1. What attributes are you applying to the tables in question, this should give you a performance boost

q)n:10000000

q)t1:(sym:n?`3;col1:n?10.0;col2:n?.Q.a)

q)t2:(sym:n?`3;col3:n?10;col4:n?100.0)  )

q)\t t1 lj 1!t2

131

q)update g#sym from t1

`t1

q)update g#sym from t2

`t2

q)\t t1 lj 1!t2

86