RE: [personal kdb+] kdb+tick with schemaless events

From: "David Demner (AquaQ)" <david.demner>
To: personal-kdbplus@googlegroups.com
Date: Wed, 29 Apr 2015 06:54:29 -0700
Subject: RE: [personal kdb+] kdb+tick with schemaless events

Forgot to include an example of what the performance impact might be, even when selecting a smaller part of the table:


q)n:1000000;t:`sym xasc([]time:n?0D; sym:n?200; data:n#enlist(1 2!(1 2;1 2)))
q)`:t/ set 0#t;
q)`:t/ upsert t;
q)@[`:t/;`sym;`p#]
q)\ts:100 select time from t where sym=9 /cost of selecting one simple column for one sym
31 66144
q)\ts:100 select time,sym from t where sym=9 /minimal incremental cost to adding a second column
31 131680
q)\ts:100 select time,sym,data from t where sym=9 /massive incremental cost when adding the (not much larger on disk) complex column
55684 168389072


Ah, you're right. It reads the whole table into memory when selecting on the complex column.

I suppose this is a reason to just serialize the complex column manually with -8!'. I modified your benchmark and can now get efficient random access to all columns. I wonder why kdb doesn't do this automatically?

q)n:1000000;t:`sym xasc([]time:n?0D; sym:n?200; data:n#enlist(1 2!(1 2;1 2)))
q)`:t/ set 0#t;
q)`:t/ upsert t;
q)@[`:t/;`sym;`p#];
q)\l .
q)\ts:100 select time,sym from t where sym=9
9 131776
q)\ts:100 select time,sym,data from t where sym=9
53227 172583568

q)n:1000000;t:update -8!'data from `sym xasc([]time:n?0D; sym:n?200; data:n#enlist(1 2!(1 2;1 2)))
q)`:t/ set t;
q)\l .
q)\ts:100 select time,sym from t where sym=9
10 131776
q)\ts:100 select time,sym,-9!'data from t where sym=9
201 1682656
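Since -8! serializes any q object to a byte vector and -9! parses it back, a minimal round-trip sanity check on the example row (a sketch, not from the original benchmark) is:

q)d:1 2!(1 2;1 2)
q)d~-9!-8!d    / serialize then deserialize; an exact match returns 1b
1b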

29.04.2015, 16:57, “David Demner (AquaQ)” <david.demner>:
> Forgot to include an example of what the performance impact might be, even when selecting a smaller part of the table:
>
> q)n:1000000;t:`sym xasc([]time:n?0D; sym:n?200; data:n#enlist(1 2!(1 2;1 2)))
> q)`:t/ set 0#t;
> q)`:t/ upsert t;
> q)@[`:t/;`sym;`p#]
> q)\ts:100 select time from t where sym=9 /cost of selecting one simple column for one sym
> 31 66144
> q)\ts:100 select time,sym from t where sym=9 /minimal incremental cost to adding a second column
> 31 131680
> q)\ts:100 select time,sym,data from t where sym=9 /massive incremental cost when adding the (not much larger on disk) complex column
> 55684 168389072

From: "David Demner (AquaQ)" <david.demner>
X-Mailer: Microsoft Outlook 15.0

Hm, yeah… The difference is that the serialized column is a nested binary list (which kdb+ can access randomly) vs a complex object that kdb+ has to do a full column scan on.

q)n:1000000;t:`sym xasc([]time:n?0D; sym:n?200; data:n#enlist(1 2!(1 2;1 2)))
q)`:t/ set 0#t;
q)`:t/ upsert t;
q)@[`:t/;`sym;`p#]
t -> complex object
04/29/2015 02:13 PM 22 .d
04/29/2015 02:13 PM 73,000,008 data
04/29/2015 02:13 PM 8,007,800 sym
04/29/2015 02:13 PM 8,000,016 time
4 File(s) 89,007,846 bytes

q)n:1000000;t:`sym xasc([]time:n?0D; sym:n?200; data:-8!'n#enlist(1 2!(1 2;1 2)))
q)`:t2/ set 0#t;
q)`:t2/ upsert t;
q)@[`:t2/;`sym;`p#]
t2 -> nested list (note the data# file)
04/29/2015 02:13 PM 22 .d
04/29/2015 02:13 PM 8,000,016 data
04/29/2015 02:13 PM 81,000,000 data#
04/29/2015 02:13 PM 8,007,800 sym
04/29/2015 02:13 PM 8,000,016 time
5 File(s) 105,007,854 bytes
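One way to see this structural difference in memory (a quick sketch; the type codes are standard q): serializing a single row with -8! yields a simple byte vector, so -8!'data becomes a list of byte vectors, which kdb+ splays as a nested column — the data# file holds the concatenated bytes and data holds the per-row anchors.

q)type 1 2!(1 2;1 2)      / one row of the complex column: a dictionary
99h
q)type -8!1 2!(1 2;1 2)   / the same row serialized: a simple byte vector
4h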

> I wonder why kdb doesn’t do this automatically?
The serialized version consumes 16 more bytes per table row; maybe that's why kx wouldn't want to do it automatically? Other than that, though, the gains look pretty substantial.
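The 16 bytes/row figure can be checked directly from the two directory listings above (89,007,846 bytes for t vs 105,007,854 for t2, over one million rows):

q)105007854-89007846   / extra bytes on disk for 1,000,000 rows
16000008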

-----Original Message-----
From: personal-kdbplus@googlegroups.com [mailto:personal-kdbplus@googlegroups.com] On Behalf Of Josh Myzie
Sent: Wednesday, April 29, 2015 9:21 AM
To: personal-kdbplus@googlegroups.com
Subject: Re: [personal kdb+] kdb+tick with schemaless events

Ah, you're right. It reads the whole table into memory when selecting on the complex column.

I suppose this is a reason to just serialize the complex column manually with -8!'. I modified your benchmark and can now get efficient random access to all columns. I wonder why kdb doesn't do this automatically?

q)n:1000000;t:`sym xasc([]time:n?0D; sym:n?200; data:n#enlist(1 2!(1 2;1 2)))
q)`:t/ set 0#t;
q)`:t/ upsert t;
q)@[`:t/;`sym;`p#];
q)\l .
q)\ts:100 select time,sym from t where sym=9
9 131776
q)\ts:100 select time,sym,data from t where sym=9
53227 172583568

q)n:1000000;t:update -8!'data from `sym xasc([]time:n?0D; sym:n?200; data:n#enlist(1 2!(1 2;1 2)))
q)`:t/ set t;
q)\l .
q)\ts:100 select time,sym from t where sym=9
10 131776
q)\ts:100 select time,sym,-9!'data from t where sym=9
201 1682656

29.04.2015, 16:57, “David Demner (AquaQ)” <david.demner>:
> Forgot to include an example of what the performance impact might be, even when selecting a smaller part of the table:
>
> q)n:1000000;t:`sym xasc([]time:n?0D; sym:n?200; data:n#enlist(1 2!(1 2;1 2)))
> q)`:t/ set 0#t;
> q)`:t/ upsert t;
> q)@[`:t/;`sym;`p#]
> q)\ts:100 select time from t where sym=9 /cost of selecting one simple column for one sym
> 31 66144
> q)\ts:100 select time,sym from t where sym=9 /minimal incremental cost to adding a second column
> 31 131680
> q)\ts:100 select time,sym,data from t where sym=9 /massive incremental cost when adding the (not much larger on disk) complex column
> 55684 168389072