How to query union of in-memory and splayed on-disk tables

I am trying to incorporate new data into my database of splayed tables without downtime.
My new idea is to have an in-memory database with the same schema to hold “today's”
data and run my queries on the union of each pair of tables, one from the today database
and one from the on-disk database.
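For concreteness, a minimal sketch of the kind of combined query I have in mind (table and column names are just examples, and the “historic” table here stands in for the splayed one):

/ hypothetical tables, just to show the shape of the query
histTrade: ([] sym:`a`b`c; price:10 20 30f)   / stand-in for the splayed on-disk table
todayTrade:([] sym:`a`c; price:11 31f)        / in-memory "today" table, same schema

/ naive combined query: the inner select materialises the whole on-disk table in memory
select avg price by sym from (select from histTrade) , todayTrade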

Overnight I would write a new on-disk database which incorporates the data from the
today database (up to midnight). In the meantime I continue to serve queries using the
today database and the old on-disk database.

When the new on-disk database is ready, the records from before midnight in the
in-memory database are deleted, and the new on-disk database replaces the old.

The problem I am worried about is that taking the union of my in-memory and on-disk
tables would produce an intermediate result, at least as large as the on-disk table,
held entirely in memory.

Part of the problem is that some of today's records will need to replace identically keyed
records in the on-disk database, so the union is really an upsert, and the on-disk database
needs to have a key.
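As a toy illustration of the upsert behaviour I mean, using keyed in-memory tables (the real on-disk table is splayed, so this is only the semantics I am after):

/ hypothetical keyed tables: today's row for id 3 should replace the historic one
hist: ([id:1 2 3] px:10 20 30f)
today:([id:3 4] px:33 40f)

hist upsert today   / ids 1 2 3 4 with px 10 20 33 40; a plain , on keyed tables behaves the same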

More generally, when you are using an on-disk table to produce intermediate results,
how can you avoid holding those intermediate results entirely in memory?

Ideally, your queries against the on-disk database should not simply be a “select from …” where everything in the on-disk database is mapped into memory and returned as is.

If your query performs computations over certain columns of your on-disk database, there should be a way to combine your in-memory version of the database with the on-disk version – not at the level of the raw “select from …”, but at the level of the final computed result.

This is actually a similar idea to the so-called map-reduce methodology, where you map both your on-disk and in-memory copies of the data to some intermediate data structure (hopefully much smaller than the raw datasets), and then combine them in a reduction step to produce the final result.

So what you mean is that I need to perform the filtering and aggregation independently on the in-memory and on-disk datasets,
and then provide a way to combine the two aggregates. For example, if I were computing an average, I would compute the sum
and count independently on each dataset, and then compute the final result as the ratio of the total sum to the total count.
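In q, that two-phase average might look roughly like this (hypothetical tables and names; in practice the first partial would be a query against the splayed table):

/ hypothetical stand-ins for the on-disk and in-memory tables
histTrade: ([] sym:`a`b`a; price:10 20 12f)
todayTrade:([] sym:`a`b; price:14 22f)

/ map: the same partial aggregation (sum and count per sym) runs on each dataset
partial:{select s:sum price, c:count i by sym from x}
ph:partial histTrade     / would really run against the splayed table
pt:partial todayTrade

/ reduce: add the partials together, then take the ratio for the final average
update avgPrice:s%c from select sum s, sum c by sym from (0!ph),0!pt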

That's what I thought; I'm hoping for a better way (posting a new question).

Hi Dan,

What you are describing here is a fairly standard approach for a kdb+ setup - you have new (usually today's) data in memory and old (historic, prior to today) data on disk.  The two separate processes are generally called the RDB (real-time database) and HDB (historical database) respectively. You can use a gateway process to query across them and produce a combined view.  As Flying said, the approach to the gateway query should be

aggregate[sub aggregation from RDB; sub aggregation from HDB]

rather than 

aggregate[raw data from RDB; raw data from HDB]
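To make that concrete, here is a minimal sketch of such a gateway function in q, assuming the gateway holds open handles hRDB and hHDB to the two processes and that both serve a trade table (the names and the query are illustrative, not taken from any particular framework):

/ assumed handles, opened elsewhere, e.g. hRDB:hopen`::5010; hHDB:hopen`::5012
avgPriceBySym:{[]
  q:"select s:sum price, c:count i by sym from trade";
  parts:(hRDB;hHDB)@\:q;                       / run the sub-aggregation on RDB and HDB
  / reduce in the gateway: total the partial sums/counts, then take the ratio
  update avgPrice:s%c from select sum s, sum c by sym from raze 0!'parts }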

Merging the two datasets is a common problem, which necessitates some coordination.  Writing to two different versions of the data and switching over between them (Flying's suggestion) is a good one as long as you can have two copies of the data.  If you are dealing with partitioned data, then you can build new partitions separately without having to replicate the whole dataset, and move them into place when done.  Until the new on-disk data is ready, the data should be kept in memory so clients can still access it.  We put together some code to help with this; it's described here:

https://www.aquaq.co.uk/q/avoiding-end-of-day-halts-with-torq/
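Independently of TorQ, a rough sketch of the partitioned end-of-day step might be the following (paths and table names are made up; .Q.dpft writes a global table as a splayed, sym-enumerated date partition):

/ sketch only: at end of day, write the in-memory trade table out as a new
/ date partition of a hypothetical HDB rooted at /data/hdb
.Q.dpft[`:/data/hdb; .z.D-1; `sym; `trade];   / splay into the .z.D-1 partition, parted on sym
delete from `trade;                           / clear the in-memory copy once it is safely on disk
system "l /data/hdb"                          / or tell the HDB process over a handle to reload its root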