Most memory efficient way of merging tables

KdbNoob · April 6, 2016, 4:13am

I have a bunch of tables on the file system. I need to read all of them into memory and merge them together.

It seems that with raze or uj I will inevitably read all of them first and then merge them together. That will at least double the memory usage.

How do I merge them efficiently?

What if these tables were stored as splayed tables? Would that change anything?

Flying1 · April 7, 2016, 2:45am

It depends on what you want to do with the “merged” table.

The following pages are about loading from CSV; however, the ideas behind merging large data sets should apply to your case as well:

http://code.kx.com/wiki/Cookbook/LoadingFromLargeFiles

http://code.kx.com/wiki/Cookbook/LoadingFromLargeFilesAndSplaying

Matthew_McAuley · April 7, 2016, 4:56pm

Hello,

Would it be possible for you to provide some more information? Are
all the tables on disk? What size are they? What memory
constraints do you have?

You suggested that splaying the tables might help. When you load a
splayed table, the table is mapped to memory, rather than being
loaded into RAM. However, upon joining, they will be loaded into
RAM. I understand that your concern is loading all the tables for
the join and the memory cost associated with that.
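To see the difference between mapping and loading on your own system, a rough check along these lines may help. This is only a sketch: the path db/trade and the column names are made up for illustration, and the exact figures reported by .Q.w will vary by kdb+ version and platform.

```q
/ hypothetical splayed table used only for this check (no symbol columns, so no enumeration needed)
`:db/trade/ set ([]price:1000000?100f;size:1000000?1000);

before:.Q.w[];            / memory statistics before opening the table
tr:get `:db/trade/;       / open the splayed table from disk
after:.Q.w[];             / memory statistics after opening it

/ compare the change in heap usage (used) with the change in mapped memory (mmap)
(after-before)`used`mmap
```

Running the same comparison around a join on tr should show how much of the data ends up in the heap once the columns are actually used.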

It may be suitable to join one table at a time, then upsert into
your final splayed table on disk. I’ve outlined how you might
approach this below.

Turn on immediate garbage collection:

\g 1

For instance, create an empty splayed table, with the same schema
as the end goal table, t:

t:([]a:`$();b:`int$();c:`int$();d:`int$();e:`int$())
`:hdb/ujtab/ set .Q.en[`:hdb] t

Create a few sample tables to be joined:

ta:([]a:`q`w`e;b:10 11 12i;c:1 2 3i)
tb:([]a:`r`t`y;d:18 16 15i;e:112 221 332i)

Join each table (uj) and upsert into your splayed table on disk
(enumerating with .Q.en, assuming you have a sym column):

{`:hdb/ujtab/ upsert .Q.en[`:hdb] t uj value x} each `ta`tb

With immediate gc turned on, kdb will free up memory as each table
is joined and should keep memory usage to a minimum.
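Since your tables live on the file system rather than in memory, the same idea could be applied by reading one file at a time. The following is a minimal sketch, not a tuned solution: it assumes each source table is saved as a single serialized file under a hypothetical directory called data, and the names files and mergeOne are made up for illustration. The sample tables are written to disk first only so that the example runs on its own.

```q
\g 1

/ hypothetical layout: each source table is one file under `:data
`:data/ta set ([]a:`q`w`e;b:10 11 12i;c:1 2 3i);
`:data/tb set ([]a:`r`t`y;d:18 16 15i;e:112 221 332i);

/ empty target with the full schema, splayed under `:hdb/ujtab/
t:([]a:`$();b:`int$();c:`int$();d:`int$();e:`int$());
`:hdb/ujtab/ set .Q.en[`:hdb] t;

/ build full file handles from the directory listing, e.g. `:data/ta
files:{` sv x,y}[`:data] each key `:data;

/ read one file, pad it to the target schema with uj, enumerate symbols, append to disk
mergeOne:{[f] `:hdb/ujtab/ upsert .Q.en[`:hdb] t uj get f;};
mergeOne each files;

/ row count of the merged table on disk
count get `:hdb/ujtab/
```

Only one source table is held in memory at any point, so peak usage should be driven by the largest single file rather than by the sum of all of them.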

Feel free to follow up with more details and I might be able to
offer a more optimised solution.

Thanks,

Matthew

AquaQ Analytics
