Joining CSV files tens of GBs large

I regularly do data-analysis tasks where I just want to slurp in a few CSV files, join them together (the joins are usually inner and left/right joins), filter out some rows, and finally output the results.

Currently I use SAS for this.

Sounds fairly simple, until I tell you that the CSV files can be tens of GBs each, and there will be a minimum of 3 such tables that need to be loaded and joined.

The hardware usually has 8GB of free physical RAM (the OS and other programs consume the rest) and an almost idle quad-core CPU.

Can kdb+ carry out these joins without requiring me to use a DB like Postgres/ParAccel?

Hi Carson

Short answer is yes. From your description, it sounds like you probably wouldn't have enough memory available to do it all in memory (i.e. you would have to write to disk and analyse from disk, perhaps piece-by-piece).
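
For illustration, assuming two small made-up tables t1 and t2 that share an id column (the table names, columns and data below are all invented), the joins and filters you describe look roughly like this in q:

    / t1 and t2 are invented example data
    t1:([] id:1 2 3; sym:`a`b`c; px:10.0 20.0 30.0)
    t2:([] id:2 3 4; qty:200 300 400)

    / left join: keep every row of t1, pulling in qty where id matches (t2 keyed on id)
    t1 lj 1!t2

    / inner join: keep only rows whose id appears in both tables
    t1 ij 1!t2

    / filter rows after the join and write the result back out as CSV
    `:result.csv 0: csv 0: select from (t1 ij 1!t2) where qty>250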

There is a small tutorial here http://code.kx.com/wiki/Cookbook/LoadingFromLargeFiles which shows how to download and create some example large CSV files, load them in chunks, write them out to disk, and re-sort at the end. You could adapt some of this for your purposes.
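
For a rough idea of what that looks like, here is a minimal sketch along the same lines. The file name, database path, column names and types are placeholders, and the CSV is assumed to have no header row. .Q.fs reads the file in manageable chunks and keeps only one chunk in memory at a time, which is what keeps the working set well under your 8GB:

    / placeholder columns: id (long), sym (symbol), px (float)
    c:`id`sym`px

    / parse one chunk (a list of lines) into a table
    parseChunk:{flip c!("JSF";",") 0: x}

    / read trades.csv chunk by chunk, appending each chunk to the splayed
    / table /db/trades with symbol columns enumerated against /db/sym
    .Q.fs[{`:/db/trades/ upsert .Q.en[`:/db] parseChunk x}] `:trades.csv

    / afterwards the table can be memory-mapped and queried without
    / loading it all into RAM
    \l /db
    select from trades where px>100

Once everything is loaded you would re-sort on disk at the end, as the page describes, and then run the joins against the mapped tables.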

Thanks 

Jonny