Creating virtual table for KDB-X Parquet DB

Is there a way to create a virtual table for a directory structure that looks like -

/path/to/data

Inside that directory I have partition on column A (which has let's say 1000 partitions)

and in each of those column A partition, I have another 1000 partitions on column B

And then if there are 100 such datasets then will scale up even more

From the examples in documentation, I saw that for each partition pair (A,B) need to map the exact file path. Trying to see if there is a way to avoid that/or do it easily and quickly (as the number of files will be 1M+ in this data)?

1 Answer

1

If you had a folder structure like:

$ tree .
.
├── A1
│   ├── B1
│   │   ├── part1.parquet
│   │   └── part2.parquet
│   └── B2
│       ├── part1.parquet
│       └── part2.parquet
└── A2
    ├── B1
    │   ├── part1.parquet
    │   └── part2.parquet
    └── B2
        ├── part1.parquet
        └── part2.parquet

6 directories, 8 files

You can build information using the folder structure:

q)files:([] file:system"find . -name *.parquet")
q)files:update split:"/" vs/:file from files
q)files:update file:hsym `$file, columnA:`$.[;(::;1)] split,columnB:`$.[;(::;2)] split from files

q)files
file                   split                          columnA columnB
---------------------------------------------------------------------
:./A2/B2/part2.parquet ,"." "A2" "B2" "part2.parquet" A2      B2
:./A2/B2/part1.parquet ,"." "A2" "B2" "part1.parquet" A2      B2
:./A2/B1/part2.parquet ,"." "A2" "B1" "part2.parquet" A2      B1
:./A2/B1/part1.parquet ,"." "A2" "B1" "part1.parquet" A2      B1
:./A1/B2/part2.parquet ,"." "A1" "B2" "part2.parquet" A1      B2
:./A1/B2/part1.parquet ,"." "A1" "B2" "part1.parquet" A1      B2
:./A1/B1/part2.parquet ,"." "A1" "B1" "part2.parquet" A1      B1
:./A1/B1/part1.parquet ,"." "A1" "B1" "part1.parquet" A1      B1

And then use the information to build the virtual table:

q).pq:use`kx.pq
q).pq.t:use`kx.pq.t
q)t:.pq.t.mkP (`columnA`columnB#files)!.pq.pq each exec file from files

q)select from t
columnA columnB c_bool c_int8
-----------------------------
A2      B2      0      5
A2      B2      0      6
A2      B2      1      7
A2      B2      0      8
A2      B2      0      9
A2      B2      1      1
A2      B2      1      2
A2      B2      0      3
A2      B2      1      4
A2      B1      0      5
A2      B1      0      6
A2      B1      1      7
A2      B1      0      8
A2      B1      0      9
A2      B1      1      1
A2      B1      1      2
A2      B1      0      3
A2      B1      1      4
A1      B2      0      5
A1      B2      0      6
A1      B2      1      7
A1      B2      0      8
A1      B2      0      9
A1      B2      1      1
A1      B2      1      2
A1      B2      0      3
A1      B2      1      4
A1      B1      0      5
A1      B1      0      6
A1      B1      1      7
A1      B1      0      8
A1      B1      0      9
A1      B1      1      1
A1      B1      1      2
A1      B1      0      3
A1      B1      1      4

To convert this to a more dynamic reusable pattern you could create a utility in a file buildVT.q:

.pq:use`kx.pq
.pq.t:use`kx.pq.t

buildVT:{[path;levels]
 files:([] file:system"find ",(1 _ string path)," -name *.parquet");
 files:update split:"/" vs/:file from files;
 lv:1+count where "/"=string path;
 levels:(),levels;
 files:![files;();0b;(`file,levels)!enlist[({hsym `$x};`file)],{(({`$.[y;(::;x)]};x);`split)} each lv+til count levels];
 .pq.t.mkP (levels#files)!.pq.pq each exec file from files
 };

Then call buildVT passing the root path and names for the nested levels:

q)\l buildVT.q
q)t:buildVT[`:/home/rocuinneagain/pqHDB;`columnA`columnB]
q)select from t
columnA columnB c_bool c_int8
-----------------------------
A2      B2      0      5
A2      B2      0      6
A2      B2      1      7
A2      B2      0      8
A2      B2      0      9
...

If you have a large number of files rather than running buildVT each time you could save the files table out to a .csv and use that on startup for faster creation of the virtual table.