Loading newline-separated JSON docs

Hi,

I am trying to load ORC files [1] into kdb+.
My initial strategy is as follows:


item:flip columnnames!"SIIISHSSIIISSHHHHHHHHHHHHHHHPPDHHPPSSSHISSISISISHIHSHHHHPPHHHPSPPS"$\:();

loadfile:{[file]
 cmd:" " sv ("./ORC-1.5.3-Linux/bin/orc-contents";1_string file); / returns 1 JSON document per line
 tmp:.j.k each system cmd;
 tmp:columnnames xcol tmp;
 item::item,tmp; / :: so the global item is amended, not a local
 }


loadfile each filelist;


I wanted to ask 2 things:


1/ Is there a canonical way to preserve the schema for each JSON document? My simplistic approach above doesn't seem to work that way.


2/ I understand that using named pipes [2] instead could improve performance. However, I can't figure out how to use .Q.fps in combination with .j.k: all the samples I have seen so far load CSV with the (types;",")0: syntax, and I don't see how to use .j.k instead.


Cheers, Mathias


[1] https://orc.apache.org

[2] https://code.kx.com/q/cookbook/named-pipes/

I'm not sure what you mean by preserving the schema. If each document has a different schema, how do you plan to store them when they don't conform to a common one?

On your point about .Q.fps, the easiest way to figure out what to do with the stream of data is to extract out a chunk and inspect it, like so:

.Q.fps[{$[not `global in key `.;`global set x;()]};`:/path/to/myFIFO]

Then you can inspect the global to see what a chunk of the stream looks like. 
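For instance, once a chunk has been captured you can poke at it like this (assuming the pipe carries newline-delimited JSON, one document per line, as orc-contents emits):

q)count global      / number of lines in the captured chunk
q)first global      / one raw JSON document, as a char vector
q).j.k first global / parsed into a q dictionary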

Terry

Thanks Terry! Your function set me on the right path!
This is where I am at the moment:

item:flip columnnames!"SIIISHSSIIISSHHHHHHHHHHHHHHHPPDHHPPSSSHISSISISISHIHSHHHHPPHHHPSPPS"$\:();


.Q.fps[{[x] y:.j.k peach x; z:columnnames xcol y; `item upsert z};`:fifo] / explicit [x] keeps y and z as locals

It turns out, though, that .j.k only returns C or f types.

And thus I am not able to upsert into my `item definition.
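To illustrate with a made-up document:

q)doc:.j.k "{\"shop_id\":42,\"name\":\"abc\"}"
q)doc`shop_id   / numbers come back as floats
42f
q)doc`name      / strings come back as char lists
"abc"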

What is the best way to align the types of z and my target table?


Cheers, Mathias



Something like this might work for you, though it may not be the most efficient:

q)tmp:.j.k .j.j orig:([]col1:`aa`bb`cc;col2:11 22 33i;col3:.z.D-til 3;col4:1 2 3h)

q)clms:`col1`col2`col3`col4;

q)types:"SIDH";

q)t:flip clms!types$\:();

q)`t upsert {@[x;y;{$[0h=type y;x$y;("h"$.Q.t?lower x)$y]}z]}/[tmp;cols tmp;types]; / tok the string (general-list) columns, numeric-cast the simple ones

q)t~orig

1b
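The ("h"$.Q.t?lower x)$y branch works because .Q.t is the list of type characters indexed by datatype number, so looking up the lower-cased type char and casting the position to a short yields the numeric cast code:

q).Q.t?lower "J"               / "j" sits at position 7, the long type number
7
q)("h"$.Q.t?lower "J")$1 2 3f  / i.e. 7h$, same as `long$
1 2 3
q)"J"$("1";"22")               / whereas tok parses strings
1 22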


/ or in your example (assuming you've defined a types variable):

`item upsert {@[x;y;{$[0h=type y;x$y;("h"$.Q.t?lower x)$y]}z]}/[z;cols z;types]
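Here types would be the type-char string matching columnnames, i.e. the same one used to define item:

types:"SIIISHSSIIISSHHHHHHHHHHHHHHHPPDHHPPSSSHISSISISISHIHSHHHHPPHHHPSPPS"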

This was super helpful!

I ended up keeping only a subset of the fields present in the original JSON:

.Q.fps[{[x] y:.j.k peach x; z:columnnames xcol y; `item upsert select `int$shop_id, `long$price from z};`:fifo]

/ shop_id and price are both parsed as f

Which was working fine until I wanted to cast a string field in the JSON doc into a long:

.Q.fps[{[x] y:.j.k peach x; z:columnnames xcol y; `item upsert select `int$shop_id, "J"$string_in_json_but_actually_a_long, `long$price from z};`:fifo]

It seems that I am still not understanding some fundamentals of q.

In theory the "J"$ should work. Are you 100% certain that every value in the string_in_json_but_actually_a_long column is a string of type 10h? Using meta doesn't tell the whole story; it only reports the types of the first row. You would need to run distinct type each z`string_in_json_but_actually_a_long to be certain.

Values that are null in the JSON will actually get parsed as null floats (0n), even if the column is supposed to be a string column. I suspect your column has a mixture of strings and null floats, and the null floats are what makes your "J"$ cast error.
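You can see both behaviours on a made-up fragment:

q).j.k "{\"a\":\"12\",\"b\":null}"   / "12" stays a char list, null becomes 0n
a| "12"
b| 0n
q)"J"$0n                             / tok wants string input, so this signals 'type
'type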


Terry


Try a more careful cast like this instead:

select {$[10h=type x;"J"$x;0Nj]}each string_in_json_but_actually_a_long from z
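Applied to a toy mixture of strings and null floats it yields longs, with nulls where the floats were:

q){$[10h=type x;"J"$x;0Nj]}each ("12";0n;"34")
12 0N 34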

Terry