iterating over variable-length records

I have a binary file composed of variable-length records. Each record starts with a 4-byte record type identifier, followed by a 4-byte little-endian length, then the data.
I’m trying to find the record boundaries using this code:

nextTag:{[data;ptr]ptr+8+0x00 sv reverse data[ptr+4+til 4]};
tags:1_nextTag[data]\[count[data]>;0];

with a 16MB file containing 750k records, it takes over 2 seconds.

However the equivalent C++ code:

char *ptr = &data[0];
char *endp = ptr+flen;
vector<char*> records;
while (ptr < endp) {
    records.push_back(ptr);
    ptr += 8+*((uint32_t*)(ptr+4));
}

does the same in a matter of milliseconds.
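For reference, the loop above can be written as a small self-contained function. This is only a sketch: the names `read_le32` and `record_offsets` are hypothetical, and unlike the snippet above it assembles the length byte by byte (so it does not depend on host byte order or alignment) and stops at a truncated trailing record instead of reading past the end of the buffer.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: read a 4-byte little-endian unsigned integer.
static std::uint32_t read_le32(const unsigned char* p) {
    return std::uint32_t(p[0]) | std::uint32_t(p[1]) << 8 |
           std::uint32_t(p[2]) << 16 | std::uint32_t(p[3]) << 24;
}

// Hypothetical helper: byte offset of every complete record in the buffer.
// Assumed layout per record: 4-byte type, 4-byte little-endian length, payload.
std::vector<std::size_t> record_offsets(const unsigned char* data, std::size_t flen) {
    std::vector<std::size_t> offsets;
    std::size_t pos = 0;
    while (flen - pos >= 8) {                     // room for a full header
        std::uint32_t len = read_le32(data + pos + 4);
        if (len > flen - pos - 8) break;          // payload is truncated: stop
        offsets.push_back(pos);
        pos += 8 + static_cast<std::size_t>(len); // jump to the next header
    }
    return offsets;
}
```

Returning offsets rather than pointers keeps the result valid if the buffer is later moved or reallocated.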

Is there a way to do it faster in Q?

I cannot think of any faster q code, but your C++ code can be trivially modified to define a function that can be called from q:

#include <stdint.h>
#include "k.h"
K1(tags){
    J i = 0;
    K r = ktn(KJ,0);
    while (i < xn) {
        ja(&r,&i);
        i+=8+*((uint32_t*)(xG+i+4));
    }
    R r;
}

Compile this as described at <http://code.kx.com/q/interfaces/using-c-functions/>, and you can use the resulting DLL as follows:

n:750000
m:100
tags:`:./tags 2:(`tags;1)
data:raze{0xDEADBEEF,(reverse 0x00 vs "i"$x),x?0xFF}each i:n?m
\t j:tags data
(1_j)~-1 _ sums i + 8

The timing shows about 20 ms, and the last line checks that the result is correct.
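For readers less used to q, the test data and the expected result can be spelled out in C++ as well. This is a rough sketch, not a translation of the q session: `TestFile` and `make_test_file` are hypothetical names, and the payload is a constant filler byte rather than random bytes. Each record is the tag 0xDEADBEEF, a 4-byte little-endian length, then the payload, so the expected start of each record is the running sum of (length + 8) over the records before it.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for the generated bytes and the expected offsets.
struct TestFile {
    std::vector<unsigned char> bytes;   // concatenated records
    std::vector<std::size_t>   starts;  // expected byte offset of each record
};

// Hypothetical helper: build one record per requested payload length.
TestFile make_test_file(const std::vector<std::uint32_t>& lengths) {
    TestFile f;
    for (std::uint32_t len : lengths) {
        f.starts.push_back(f.bytes.size());          // running sum of (len + 8)
        const unsigned char tag[4] = {0xDE, 0xAD, 0xBE, 0xEF};
        f.bytes.insert(f.bytes.end(), tag, tag + 4);
        for (int shift = 0; shift < 32; shift += 8)  // least significant byte first
            f.bytes.push_back(static_cast<unsigned char>(len >> shift));
        // Constant filler payload (an assumption; the q test uses random bytes).
        f.bytes.insert(f.bytes.end(), len, static_cast<unsigned char>(0xFF));
    }
    return f;
}
```

Comparing `starts` with the output of any boundary-finding routine gives the same kind of check as the last line of the q session.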

I thought about that too. My problem with DLLs is that you can’t rebuild a DLL as long as any application is using it, so I have to close every q process that has the DLL loaded.

On Monday, 19 June 2017 at 21:31:41 UTC+1, Alexander Belopolsky wrote:

I cannot think of any faster q code, but your C++ code can be trivially modified to define a function that can be called from q:


There are several ways to handle rebuilding the DLLs:

  1. Don’t.  Make your code perfect the first time you save it. :-)
  2. Save different versions under different names by incorporating the version number in the DLL name.
  3. Give each process its own private copy of the DLL.

While it is possible to modify q code of a running process, it is rarely a good idea.  The added benefit of the restriction on rebuilding the DLLs is that it promotes sound software development practices.

On 19 Jun 2017, at 21:31, Alexander Belopolsky <alexander.belopolsky@gmail.com> wrote:

I cannot think of any faster q code, but your C++ code can be trivially modified to define a function that can be called from q:

We can do some simple tricks by moving some constants out of the loop, but it won't be a huge improvement:

q)\ts tags:1_nextTag[data]\[count[data]>;0];

1586 28777936

q)t:{z+8+0x00 sv y z+x}4+reverse til 4

q)\ts 1_t[data]\[count[data]>;0]

836 28777776

If everything were 4-byte aligned, you could parse it all at once with 1: and then all that would be left is some simpler pointer chasing:

q)d:2_data

q)\ts (enlist"i";enlist 4)1:d

140 67109328
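The same idea can be sketched outside q. Assuming every record starts on a 4-byte boundary and every length is a multiple of 4 (which the format described in the question does not guarantee), the buffer can be decoded into 32-bit words in one pass, and the walk becomes plain index arithmetic on that array. The names `decode_words` and `record_word_offsets` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Decode the whole buffer into little-endian 32-bit words in one pass
// (roughly what reading everything as 4-byte ints achieves).
std::vector<std::uint32_t> decode_words(const unsigned char* data, std::size_t nbytes) {
    std::vector<std::uint32_t> words(nbytes / 4);
    for (std::size_t w = 0; w < words.size(); ++w) {
        const unsigned char* p = data + 4 * w;
        words[w] = std::uint32_t(p[0]) | std::uint32_t(p[1]) << 8 |
                   std::uint32_t(p[2]) << 16 | std::uint32_t(p[3]) << 24;
    }
    return words;
}

// Chase record headers in units of words:
// words[i] is the record type, words[i + 1] is the payload length in bytes.
std::vector<std::size_t> record_word_offsets(const std::vector<std::uint32_t>& words) {
    std::vector<std::size_t> offsets;
    std::size_t i = 0;
    while (words.size() - i >= 2) {                 // room for a full header
        std::uint32_t len = words[i + 1];
        if (len % 4 != 0) break;                    // alignment assumption violated
        std::size_t payload = len / 4;              // payload size in words
        if (payload > words.size() - i - 2) break;  // truncated record
        offsets.push_back(i);
        i += 2 + payload;
    }
    return offsets;
}
```

The returned offsets are word indices; multiplying by 4 gives byte offsets into the original buffer.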