small challenge for the q gods

given a large list of string phrases, /hundreds of thousandsk:(“how fast”;" “;“perform this”;“job”;…)and given another, much larger, list of string phrases, /millionsv:(“how fast”;“can q”;“perform this task”;”?“;”!“;…)how does one, with maximum performance, calculate the sum of alloccurrences of each k within all of v?for example, in the above case, given k[0 1 2 3] and v[0 1 2 3 4], itshould yield:“how fast” 1” " 4"perform this" 1"job" 0alternatively, v could be viewed as a single delimited stringto complicate things, the chars “*” are found within the elements ofboth k,vhave tried something like the following:v0:" "sv vt:flipphrasenum!(k;{count ss[v0;x]}peach k)using various roundabout methods for dealing with “*” in k,vneed to significantly improve performance

i think this is not a nice problem for q if the builtin ss is not good enough for you
you should write this directly in C
see the multi-pattern search part at http://en.wikipedia.org/wiki/Rabin-karp
? Attila

hi atilla, k.os.tao,

if q is not so strong here, lex (or flex) will do a good
job at matching these sort of patterns. from q, you can
generate a lex file (call it l.l) that looks like:

%{
long n=3D0;
main()
{for(;yylex()>0;);
printf(“counted %ld matches\n”,n);
return 0;
}
%}
%%
“how fast” n++;
“perform this” n++;
“job” n++;
.
/**** EOF *****/

and build your scanner and run with test input:
$ flex -o n.c n.l && gcc n.c -o n -lfl && echo “how fast does it job
question?” |./n

counted 2 matches

ta, jack.

attila, jack - thank you for your suggestionsi am enjoying exploring this task in q and am learning a lot in theprocess.i have defined various methods that work, with big speed improvements,but which fail on larger lists due to the 32-bit address space limit.a couple additional questions:1. is there a faster way to get the following result? {(key g)!count each value g:group x} where x is a large list2. what is the best way to serialize to a large dictionary ofdictionaries to disk?It is too large to generate it within memory all at once.I am hoping to do something analogous to splaying and/or partitioningwith tables.On Dec 13, 11:22?pm, Jack Andrews <effb…> wrote:> hi atilla, k.os.tao,>> if q is not so strong here, lex (or flex) will do a good> job at matching these sort of patterns. ?from q, you can> generate a lex file (call it l.l) that looks like:>> %{> long n=0;> main()> {for(;yylex()>0;);> ?printf(“counted %ld matches\n”,n);> ?return 0;}>> %}> %%> “how fast” n++;> “perform this” n++;> “job” n++;> .> / EOF* />> and build your scanner and run with test input:> $ flex -o n.c n.l && gcc n.c -o n -lfl && echo “how fast does it job> question?” |./n>> counted 2 matches>> ta, jack.>> On Mon, Dec 13, 2010 at 10:41 PM, Attila Vrabecz>> <attila.vrab…> wrote:>> > i think this is not a nice problem for q if the builtin ss is not good enough for you> > you should write this directly in C> > see the multi-pattern search part athttp://en.wikipedia.org/wiki/Rabin-karp> > ? Attila>> > On Sat, Dec 11, 2010 at 3:52 AM, k.os.tao <k.os…> wrote:>> >> given a large list of string phrases, ?/hundreds of thousands>> >> k:(“how fast”;" “;“perform this”;“job”;…)>> >> and given another, much larger, list of string phrases, ?/millions>> >> v:(“how fast”;“can q”;“perform this task”;”?“;”!";…)>> >> how does one, with maximum performance, calculate the sum of all> >> occurrences of each k within all of v?>> >> for example, in the above case, given k[0 1 2 3] and v[0 1 2 3 4], it> >> should yield:>> >> “how fast” ? 1> >> " " ? 4> >> “perform this” ? 1> >> “job” ? 0>> >> alternatively, v could be viewed as a single delimited string>> >> to complicate things, the chars “[]" are found within the elements of> >> both k,v>> >> have tried something like the following:>> >> v0:" "sv v> >> t:flipphrasenum!(k;{count ss[v0;x]}peach k)>> >> using various roundabout methods for dealing with "” in k,v>> >> need to significantly improve performance>> >> –> >>

Submitted via Google Groups</k.os…></attila.vrab…></effb…>

To: personal-kdbplus@googlegroups.com
X-Mailer: Apple Mail (2.1081)

q)count each group x

is not really faster, but less code
and by just looking at this it should be relatively obvious
that it won’t be easy to find a faster expression if any

an idea would be doing something like this
q)l:(int$())!int$()
q)l+:1

but that is actually slower

you cannot splay dictionaries
but you could just write down a two-column table and use that
On 14 Dec 2010, at 21:15, k.os.tao wrote:

> attila, jack - thank you for your suggestions
>
> i am enjoying exploring this task in q and am learning a lot in the
> process.
> i have defined various methods that work, with big speed improvements,
> but which fail on larger lists due to the 32-bit address space limit.
>
> a couple additional questions:
>
> 1. is there a faster way to get the following result?
> {(key g)!count each value g:group x} where x is a large list
>
> 2. what is the best way to serialize to a large dictionary of
> dictionaries to disk?
> It is too large to generate it within memory all at once.
> I am hoping to do something analogous to splaying and/or partitioning
> with tables.
>
>
> On Dec 13, 11:22 pm, Jack Andrews <effb…> wrote:
>> hi atilla, k.os.tao,
>>
>> if q is not so strong here, lex (or flex) will do a good
>> job at matching these sort of patterns. from q, you can
>> generate a lex file (call it l.l) that looks like:
>>
>> %{
>> long n=0;
>> main()
>> {for(;yylex()>0;);
>> printf(“counted %ld matches\n”,n);
>> return 0;}
>>
>> %}
>> %%
>> “how fast” n++;
>> “perform this” n++;
>> “job” n++;
>> .
>> / EOF* /
>>
>> and build your scanner and run with test input:
>> $ flex -o n.c n.l && gcc n.c -o n -lfl && echo “how fast does it job
>> question?” |./n
>>
>> counted 2 matches
>>
>> ta, jack.
>>
>> On Mon, Dec 13, 2010 at 10:41 PM, Attila Vrabecz
>>
>> <attila.vrab…> wrote:
>>
>>> i think this is not a nice problem for q if the builtin ss is not =
good enough for you
>>> you should write this directly in C
>>> see the multi-pattern search part =
athttp://en.wikipedia.org/wiki/Rabin-karp
>>> Attila
>>
>>> On Sat, Dec 11, 2010 at 3:52 AM, k.os.tao <k.os…> =
wrote:
>>
>>>> given a large list of string phrases, /hundreds of thousands
>>
>>>> k:(“how fast”;" “;“perform this”;“job”;…)
>>
>>>> and given another, much larger, list of string phrases, /millions
>>
>>>> v:(“how fast”;“can q”;“perform this task”;”?“;”!";…)
>>
>>>> how does one, with maximum performance, calculate the sum of all
>>>> occurrences of each k within all of v?
>>
>>>> for example, in the above case, given k[0 1 2 3] and v[0 1 2 3 4], =
it
>>>> should yield:
>>
>>>> “how fast” 1
>>>> " " 4
>>>> “perform this” 1
>>>> “job” 0
>>
>>>> alternatively, v could be viewed as a single delimited string
>>
>>>> to complicate things, the chars “[]" are found within the elements =
of
>>>> both k,v
>>
>>>> have tried something like the following:
>>
>>>> v0:" "sv v
>>>> t:flipphrasenum!(k;{count ss[v0;x]}peach k)
>>
>>>> using various roundabout methods for dealing with "
” in k,v
>>
>>>> need to significantly improve performance
>>
>>>> –
>>>> You received this message because you are subscribed to the Google =
Groups “Kdb+ Personal Developers” group.
>>>> To post to this group, send email to =
personal-kdbplus@googlegroups.com.
>>>> To unsubscribe from this group, send email to =
personal-kdbplus+unsubscribe@googlegroups.com.
>>>> For more options, visit this group =
athttp://groups.google.com/group/personal-kdbplus?hl=en.
>>
>>> –
>>> You received this message because you are subscribed to the Google =
Groups “Kdb+ Personal Developers” group.
>>> To post to this group, send email to =
personal-kdbplus@googlegroups.com.
>>> To unsubscribe from this group, send email to =
personal-kdbplus+unsubscribe@googlegroups.com.
>>> For more options, visit this group =
athttp://groups.google.com/group/personal-kdbplus?hl=en.
>
> –
> You received this message because you are subscribed to the Google =
Groups “Kdb+ Personal Developers” group.
> To post to this group, send email to =
personal-kdbplus@googlegroups.com.
> To unsubscribe from this group, send email to =
personal-kdbplus+unsubscribe@googlegroups.com.
> For more options, visit this group at =
http://groups.google.com/group/personal-kdbplus?hl=en.
>


</k.os…></attila.vrab…></effb…>

correction: depending on the density of x
l+:1 might be actually faster
but not by much

ie
x:1000000?100
vs
x:1000000?10000
? Attila

attila, thanks for the code shortening suggestion…its amazing howmuch one coding change can simplify the entire routine once factoredin properlyOn Dec 14, 4:23?pm, Attila Vrabecz <attila.vrab…> wrote:> q)count each group x>> is not really faster, but less code> and by just looking at this it should be relatively obvious> that it won’t be easy to find a faster expression if any>> an idea would be doing something like this> q)l:(int$())!int$()> q)l+:1>> but that is actually slower>> you cannot splay dictionaries> but you could just write down a two-column table and use that> On 14 Dec 2010, at 21:15, k.os.tao wrote:>> > attila, jack - thank you for your suggestions>> > i am enjoying exploring this task in q and am learning a lot in the> > process.> > i have defined various methods that work, with big speed improvements,> > but which fail on larger lists due to the 32-bit address space limit.>> > a couple additional questions:>> > 1. is there a faster way to get the following result?> > ? {(key g)!count each value g:group x} where x is a large list>> > 2. what is the best way to serialize to a large dictionary of> > dictionaries to disk?> > It is too large to generate it within memory all at once.> > I am hoping to do something analogous to splaying and/or partitioning> > with tables.>> > On Dec 13, 11:22 pm, Jack Andrews <effb…> wrote:> >> hi atilla, k.os.tao,>> >> if q is not so strong here, lex (or flex) will do a good> >> job at matching these sort of patterns. ?from q, you can> >> generate a lex file (call it l.l) that looks like:>> >> %{> >> long n=0;> >> main()> >> {for(;yylex()>0;);> >> ?printf(“counted %ld matches\n”,n);> >> ?return 0;}>> >> %}> >> %%> >> “how fast” n++;> >> “perform this” n++;> >> “job” n++;> >> .> >> / EOF* />> >> and build your scanner and run with test input:> >> $ flex -o n.c n.l && gcc n.c -o n -lfl && echo “how fast does it job> >> question?” |./n>> >> counted 2 matches>> >> ta, jack.>> >> On Mon, Dec 13, 2010 at 10:41 PM, Attila Vrabecz>> >> <attila.vrab…> wrote:>> >>> i think this is not a nice problem for q if the builtin ss is not good enough for you> >>> you should write this directly in C> >>> see the multi-pattern search part athttp://en.wikipedia.org/wiki/Rabin-karp> >>> ? Attila>> >>> On Sat, Dec 11, 2010 at 3:52 AM, k.os.tao <k.os…> wrote:>> >>>> given a large list of string phrases, ?/hundreds of thousands>> >>>> k:(“how fast”;" “;“perform this”;“job”;…)>> >>>> and given another, much larger, list of string phrases, ?/millions>> >>>> v:(“how fast”;“can q”;“perform this task”;”?“;”!";…)>> >>>> how does one, with maximum performance, calculate the sum of all> >>>> occurrences of each k within all of v?>> >>>> for example, in the above case, given k[0 1 2 3] and v[0 1 2 3 4], it> >>>> should yield:>> >>>> “how fast” ? 1> >>>> " " ? 4> >>>> “perform this” ? 1> >>>> “job” ? 0>> >>>> alternatively, v could be viewed as a single delimited string>> >>>> to complicate things, the chars “[]" are found within the elements of> >>>> both k,v>> >>>> have tried something like the following:>> >>>> v0:" "sv v> >>>> t:flipphrasenum!(k;{count ss[v0;x]}peach k)>> >>>> using various roundabout methods for dealing with "” in k,v>> >>>> need to significantly improve performance>> >>>> –> >>>>

Submitted via Google Groups</k.os…></attila.vrab…></effb…></attila.vrab…>

yes, getting a good handle on the looping behavior of dictionaries and tables is very helpful in reducing code size

for a function f and a dictionary d, “f each d” matches “(key d)!f each value d” – a dictionary of the keys and the function applied to the values

for a table t, “f each t” matches “f each(cols t)!/:flip get flip t” – a table of the same column names with the function applied to each of the “row dictionaries”

(“f each’t” thus matches “(cols t)!/:f each’flip get flip t”, etc.)

keys#dictionary is another handy one most people find quite late – for a dictionary d and a list of keys k, “k#d” matches “k!d k”

aaron, thanks for the tips, they are greatly appreciated.the more i code in q, the more i want to learn k. i found an old kreference manual (v2, from 1998!), is there a more recent version? ifnot, any idea if the old one is still useful, or has the syntaxchanged too much since then?On Dec 15, 9:09?pm, Aaron Davies <aaron.dav…> wrote:> yes, getting a good handle on the looping behavior of dictionaries and tables is very helpful in reducing code size>> for a function f and a dictionary d, “f each d” matches “(key d)!f each value d” – a dictionary of the keys and the function applied to the values>> for a table t, “f each t” matches “f each(cols t)!/:flip get flip t” – a table of the same column names with the function applied to each of the “row dictionaries”>> (“f each’t” thus matches “(cols t)!/:f each’flip get flip t”, etc.)>> keys#dictionary is another handy one most people find quite late – for a dictionary d and a list of keys k, “k#d” matches “k!d k”>> On Dec 15, 2010, at 12:30 PM, k.os.tao wrote:>>>> > attila, thanks for the code shortening suggestion…its amazing how> > much one coding change can simplify the entire routine once factored> > in properly>> > On Dec 14, 4:23 pm, Attila Vrabecz <attila.vrab…> wrote:> >> q)count each group x>> >> is not really faster, but less code> >> and by just looking at this it should be relatively obvious> >> that it won’t be easy to find a faster expression if any>> >> an idea would be doing something like this> >> q)l:(int$())!int$()> >> q)l+:1>> >> but that is actually slower>> >> you cannot splay dictionaries> >> but you could just write down a two-column table and use that> >> On 14 Dec 2010, at 21:15, k.os.tao wrote:>> >>> attila, jack - thank you for your suggestions>> >>> i am enjoying exploring this task in q and am learning a lot in the> >>> process.> >>> i have defined various methods that work, with big speed improvements,> >>> but which fail on larger lists due to the 32-bit address space limit.>> >>> a couple additional questions:>> >>> 1. is there a faster way to get the following result?> >>> ? {(key g)!count each value g:group x} where x is a large list>> >>> 2. what is the best way to serialize to a large dictionary of> >>> dictionaries to disk?> >>> It is too large to generate it within memory all at once.> >>> I am hoping to do something analogous to splaying and/or partitioning> >>> with tables.>> >>> On Dec 13, 11:22 pm, Jack Andrews <effb…> wrote:> >>>> hi atilla, k.os.tao,>> >>>> if q is not so strong here, lex (or flex) will do a good> >>>> job at matching these sort of patterns. ?from q, you can> >>>> generate a lex file (call it l.l) that looks like:>> >>>> %{> >>>> long n=0;> >>>> main()> >>>> {for(;yylex()>0;);> >>>> ?printf(“counted %ld matches\n”,n);> >>>> ?return 0;}>> >>>> %}> >>>> %%> >>>> “how fast” n++;> >>>> “perform this” n++;> >>>> “job” n++;> >>>> .> >>>> / EOF* />> >>>> and build your scanner and run with test input:> >>>> $ flex -o n.c n.l && gcc n.c -o n -lfl && echo “how fast does it job> >>>> question?” |./n>> >>>> counted 2 matches>> >>>> ta, jack.>> >>>> On Mon, Dec 13, 2010 at 10:41 PM, Attila Vrabecz>> >>>> <attila.vrab…> wrote:>> >>>>> i think this is not a nice problem for q if the builtin ss is not good enough for you> >>>>> you should write this directly in C> >>>>> see the multi-pattern search part athttp://en.wikipedia.org/wiki/Rabin-karp> >>>>> ? Attila>> >>>>> On Sat, Dec 11, 2010 at 3:52 AM, k.os.tao <k.os…> wrote:>> >>>>>> given a large list of string phrases, ?/hundreds of thousands>> >>>>>> k:(“how fast”;" “;“perform this”;“job”;…)>> >>>>>> and given another, much larger, list of string phrases, ?/millions>> >>>>>> v:(“how fast”;“can q”;“perform this task”;”?“;”!";…)>> >>>>>> how does one, with maximum performance, calculate the sum of all> >>>>>> occurrences of each k within all of v?>> >>>>>> for example, in the above case, given k[0 1 2 3] and v[0 1 2 3 4], it> >>>>>> should yield:>> >>>>>> “how fast” ? 1> >>>>>> " " ? 4> >>>>>> “perform this” ? 1> >>>>>> “job” ? 0>> >>>>>> alternatively, v could be viewed as a single delimited string>> >>>>>> to complicate things, the chars “[]" are found within the elements of> >>>>>> both k,v>> >>>>>> have tried something like the following:>> >>>>>> v0:" "sv v> >>>>>> t:flipphrasenum!(k;{count ss[v0;x]}peach k)>> >>>>>> using various roundabout methods for dealing with "” in k,v>> >>>>>> need to significantly improve performance>> >>>>>> –> >>>>>>

Submitted via Google Groups</k.os…></attila.vrab…></effb…></attila.vrab…></aaron.dav…>

charset=us-ascii
Message-Id: <382924BE-D903-48A5-A24A-7D332142C3F0@gmail.com>
Cc: Kdb+ Personal Developers
X-Mailer: iPhone Mail (8C148)
From: Attila Vrabecz <attila.vrabecz>
Subject: Re: [personal kdb+] Re: small challenge for the q gods
Date: Thu, 16 Dec 2010 20:17:52 +0000
To: “personal-kdbplus@googlegroups.com

It is close enough, there is nothing else - and it is very well written
k is much more fun and beatiful than q=20
You will have fun

Attila</attila.vrabecz>

> the more i code in q, the more i want to learn k. i found an old k
> reference manual (v2, from 1998!), is there a more recent version? if
> not, any idea if the old one is still useful, or has the syntax
> changed too much since then?

it’s vaguely useful, but a lot of things have changed in k4

your best bet is to start reading q.k and working out k equivalents to
everything you write in q

(and conversely whenever you break inside a k function stop and try to
figure out exactly how it works)

some things convert very easily, e.g. “count each group x” from above is “#:'=x”

some things are harder :)

attila, jack, aaron:thanks for your comments and suggestions. i hope everyone enjoyed thewinter solstice festivities.regarding this task, after restructuring the approach I have been ableto cut the run-time down from 5 1/2 hours (when using brute force ssin parallel in q) to about 50 seconds (using the below function f),and absent dreaded wsfull decapitation. the original brute forceattempt made by a friend in .net was estimated to take 2 weeks tocomplete.in the above benchmark numbers i am calculating all starting indicesfor exact matches of approximately 300,000 keyphrases (word sequenceswith counts 1-30) within a 100MB xml document which, after strippingout the xml values, is equivalent to a single string (or symbolvector) containing about 20 million words.find the function below:f:{((key d)i)!r i:where not(0#0N)~/:r:{(1+neg count x)+{y where y in1+x}over x}peach value d:(x where all each x in key gk)#gk:(kwhere(k:distinct raze x)in key gv)#gv:group y};/f produces a dictionary of (keyphrases from x where exact matchesexist in y)->(lists of starting indices for the matches), where /x is the lookup keyphrase list, as a list of symbol (word) lists,and /y is the lookup value list, as a single list of symbols (words)<< i believe the above result to be correct, although i have notrigorously checked its correctness. >>the original brute force method in q looked like:{v ss x}peach k;/k is a list of strings of words separated by spaces (kephrases)/v is a single string of words separated by spaces (lookup values)i am sure the function f could be optimized further with codeimprovements, logical refinements, further use of parallelcalculation, etc.running a similar routine over more cores (here 4 were used) and whereparallel dictionary lookup does not lead to wsfull decapitation, itcould run much, much faster.<< a tangent: how cool would it be if one could natively run koperations over nvidias massively multicore gpu cards? gpeach anyone?other (inferior, in my opinion) interpreters (e.g. mathematica,matlab) have some gpu functionality in place. >>my conclusion thus far: the k language is pure genius. it allows oneto perform abstract mathematical operations naturally and conciselywith tremendousperformance and scalability and a near zero system footprint. whileits primary usage seems to be in finance for high performance datafeedand database operations, I see no reason why it could not be used foranything - data mining, scientific research, etc.. modifying the abovefunction f to do regular expressions and general functional stringoperations should be fairly straightforward.Happy New Year everyone,`k.osOn Dec 16, 9:17?pm, Aaron Davies <aaron.dav…> wrote:> > the more i code in q, the more i want to learn k. i found an old k> > reference manual (v2, from 1998!), is there a more recent version? if> > not, any idea if the old one is still useful, or has the syntax> > changed too much since then?>> it’s vaguely useful, but a lot of things have changed in k4>> your best bet is to start reading q.k and working out k equivalents to> everything you write in q>> (and conversely whenever you break inside a k function stop and try to> figure out exactly how it works)>> some things convert very easily, e.g. “count each group x” from above is “#:'=x”>> some things are harder :)</aaron.dav…>