How to query union of in-memory and splayed on-disk tables

I am trying to incorporate new data into my database of splayed tables without downtime.
My new idea is to have an in-memory database with the same schema to hold “today's”
data and run my queries on the union of each pair of tables, one from the today database
and one from the on-disk database.
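For concreteness, a minimal sketch of the kind of combined query I have in mind (table and column names are just examples, and the “historic” table here stands in for the splayed one):

/ hypothetical tables, just to show the shape of the query
histTrade: ([] sym:`a`b`c; price:10 20 30f)   / stand-in for the splayed on-disk table
todayTrade:([] sym:`a`c; price:11 31f)        / in-memory "today" table, same schema

/ naive combined query: the inner select materialises the whole on-disk table in memory
select avg price by sym from (select from histTrade) , todayTrade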

Overnight I would write a new on-disk database which incorporates the data from the
today database (up to midnight). In the meantime I continue to serve queries using the
today database and the old on-disk database.

When the new on-disk database is ready, the records from before midnight in the
in-memory database are deleted, and the new on-disk database replaces the old.

The problem I am worried about is that taking the union of my in-memory and on-disk
tables would produce an intermediate result, at least as large as the on-disk table,
held entirely in memory.

Part of the problem is that some of today's records will need to replace identically keyed
records in the on-disk database, so the union is really an upsert, and the on-disk database
needs to have a key.
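As a toy illustration of the upsert behaviour I mean, using keyed in-memory tables (the real on-disk table is splayed, so this is only the semantics I am after):

/ hypothetical keyed tables: today's row for id 3 should replace the historic one
hist: ([id:1 2 3] px:10 20 30f)
today:([id:3 4] px:33 40f)

hist upsert today   / ids 1 2 3 4 with px 10 20 33 40; a plain , on keyed tables behaves the same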

More generally, when you are using an on-disk table to produce intermediate results,
how can you avoid holding those intermediate results entirely in memory?

Ideally, your queries against the on-disk database should not simply be a “select from …” where everything in the on-disk database is mapped into memory and returned as is.

If your query performs computations over certain columns of your on-disk database, there should be a way to combine your in-memory version of the database with the on-disk version – not at the level of the raw “select from …”, but at the level of the final computed result.

This is actually a similar idea to the so-called map-reduce methodology, where you map both your on-disk and in-memory copies of the data to some intermediate data structure (hopefully much smaller than the raw datasets), and then combine them in a reduction step to produce the final result.

So what you mean is that I need to perform the filtering and aggregation independently on the in-memory and on-disk datasets,
and then provide a way to combine the two aggregates. For example, if I were computing an average, I would compute the sum
and count independently on each dataset, and then compute the final result as the ratio of the total sum to the total count.
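In q, that two-phase average might look roughly like this (hypothetical tables and names; in practice the first partial would be a query against the splayed table):

/ hypothetical stand-ins for the on-disk and in-memory tables
histTrade: ([] sym:`a`b`a; price:10 20 12f)
todayTrade:([] sym:`a`b; price:14 22f)

/ map: the same partial aggregation (sum and count per sym) runs on each dataset
partial:{select s:sum price, c:count i by sym from x}
ph:partial histTrade     / would really run against the splayed table
pt:partial todayTrade

/ reduce: add the partials together, then take the ratio for the final average
update avgPrice:s%c from select sum s, sum c by sym from (0!ph),0!pt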

That's what I thought; I'm hoping for a better way (posting a new question).

Hi Dan,

What you are describing here is a fairly standard approach for a kdb+ setup - you have new (usually today's) data in memory and old (historic, prior to today) data on disk.  The two separate processes are generally called the RDB (real-time database) and HDB (historical database) respectively. You can use a gateway process to query across them and produce a combined view.  As Flying said, the approach to the gateway query should be

aggregate[sub aggregation from RDB; sub aggregation from HDB]

rather than 

aggregate[raw data from RDB; raw data from HDB]
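To make that concrete, here is a minimal sketch of such a gateway function in q, assuming the gateway holds open handles hRDB and hHDB to the two processes and that both serve a trade table (the names and the query are illustrative, not taken from any particular framework):

/ assumed handles, opened elsewhere, e.g. hRDB:hopen`::5010; hHDB:hopen`::5012
avgPriceBySym:{[]
  q:"select s:sum price, c:count i by sym from trade";
  parts:(hRDB;hHDB)@\:q;                       / run the sub-aggregation on RDB and HDB
  / reduce in the gateway: total the partial sums/counts, then take the ratio
  update avgPrice:s%c from select sum s, sum c by sym from raze 0!'parts }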

Merging the two datasets is a common problem, which necessitates some coordination.  Writing to two different versions of the data and switching over between them (Flying's suggestion) is a good one as long as you can have two copies of the data.  If you are dealing with partitioned data, then you can build new partitions separately without having to replicate the whole dataset, and move them into place when done.  Until the new on-disk data is ready, the data should be kept in memory so clients can still access it.  We put together some code to help with this; it's described here:

https://www.aquaq.co.uk/q/avoiding-end-of-day-halts-with-torq/
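Independently of TorQ, a rough sketch of the partitioned end-of-day step might be the following (paths and table names are made up; .Q.dpft writes a global table as a splayed, sym-enumerated date partition):

/ sketch only: at end of day, write the in-memory trade table out as a new
/ date partition of a hypothetical HDB rooted at /data/hdb
.Q.dpft[`:/data/hdb; .z.D-1; `sym; `trade];   / splay into the .z.D-1 partition, parted on sym
delete from `trade;                           / clear the in-memory copy once it is safely on disk
system "l /data/hdb"                          / or tell the HDB process over a handle to reload its root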