I have a bunch of tables on the file system. I need to read all of them into memory and merge them together.
It seems like using raze or (uj), I will inevitable read all of them, then merge them together. That will at least double the memory usage.
How do I merge them efficiently?
What if these tables are organized as played table, would that change anything?
It depends on what you want to do with the “merged” table.
The following are talking about loading from CSV, the ideas behind merging large data sets, however, should be applicable to your case as well:
http://code.kx.com/wiki/Cookbook/LoadingFromLargeFiles
http://code.kx.com/wiki/Cookbook/LoadingFromLargeFilesAndSplaying
Hello,
Would it be possible for you to provide some more information? Are
all the tables on disk? What size are they? What memory
constraints do you have?
You suggested that splaying the tables might help. When you load a
splayed table, the table is mapped to memory, rather than being
loaded into RAM. However upon joining, they will be loaded into
RAM. I understand that your concern is loading all the tables for
the join and the memory cost associated with that.
It may be suitable to join one table at a time, then upsert into
your final splayed table on disk. I’ve outlined how you might
approach this below.
Turn on immediate garbage collection:
\g 1
For instance, create an empty splayed table, with the same schema
as the end goal table, t:
t::(a:$();b:
int$();c:int$();d:
int$();e:int$())<br class='""'>
:hdb/ujtab/ set .Q.en[:hdb] t</font><br class='""'> <br class='""'> Create a few sample tables to be joined:<br class='""'> <font class='""' face='"Courier' new courier monospace>ta:([]a:
qw
e;b:10
11 12i;c:1 2 3i)
tb:(a:r
ty;d:18 16 15i;e:112 221 332i)</font><br class='""'> <br class='""'> Join each table (uj) and upsert into your splayed table on disk (enumerating with .Q.en, assuming you have a sym column):<br class='""'> <font class='""' face='"Courier' new courier monospace>{
:hdb/ujtab/
upsert .Q.en[`:hdb] t uj value x} each `ta`tb
With immediate gc turned on, kdb will free up memory as each table
is joined and should keep memory usage to a minimum.
Feel free to follow up with more details and I might be able to
offer a more optimised solution.
Thanks,
Matthew
AquaQ Analytics