how to speed up writing partitioned table from text data files

Hi all,

I’m planning on using 32-bit kdb+/q for a scientific application, namely reading, storing, and manipulating data from quartz crystal microbalance (QCM) experiments. I’m running the m32 and l32 distributions and am not planning on moving to w32.

Our current QCM controller writes its data to a .tsv log with 14 columns where only 3 columns are relevant (i.e. time, crystal frequency, and temperature).*

At the moment my code reads the data (~50-200 million rows) into memory and I do my analysis there. This will not be possible on the 32 bit distribution for longer experiments that routinely cause 'wsfull errors during analysis, and future setups will have 100x or more rows. Robustness is desired here and not loading my data into memory will allow the code to be run on what will likely be lower-end PCs.

I’ve read the “kdb+ for Mortals” and “kx/Cookbook” entries on .Q.fs and related partitioned table functions to avoid being spoon-fed, but I am unsure about its performance.


Why is .Q.fs so slow when writing to disk compared to just reading with 0: for my (currently small) data? Is it because I’m reading from and writing to the same drive? How might I overcome this to achieve better performance for reading these data files?


q) colnames: times1ls1ms2ls2mfreqs2fs1rs2rs1ths2thtempT2foo;q) \t .Q.fs[flip colnames!("sssssfffffffff";"\t")0:x}]$“qcm/sample.tsv"1338q) \t .Q.fs[0N!flip colnames!(“sssssfffffffff”;”\t")0:x}]$"qcm/sample.tsv"2862q) \t .Q.fs[{:t upsert flip colnames!(“sssssfffffffff”;“\t”)0:x}]$"qcm/sample.tsv"74494q) \t ("sssssfffffffff";"\t")0:$"qcm/sample.tsv"1314`

Thank you in advance for reading. Any clarification or help is appreciated.

Here’s a sample of the format of the .tsv files for reference, the preamble and column headings are dropped with tail.**

~$ tail -n +3 qcm/sample.tsv3:11:46 PM None None None None 5986009.536 -3 0 0 0 0 121.1 03:11:46 PM None None None None 5986009.536 -3 0 0 0 0 121.1 03:11:46 PM None None None None 5986009.536 -3 0 0 0 0 121.1 03:11:46 PM None None None None 5986009.517 -3 0.08 0 0.00017 0 121.1 03:11:46 PM None None None None 5986009.508 -3 0.04 0 0 0 121.1 03:11:46 PM None None None None 5986009.503 -3 0.02 0 0 0 121.1 0

*There is no easy way to change this, and the plan is to develop a new QCM controller that will feed into kdb+tick to write directly to partitioned tables, so I would rather not attempt to clean up the data beforehand. For the moment, we will have to read in all 14 columns.

**The headers are not easy to call in q, so I just replace them with my own column name list later on.

why dont you just take the three columns you want? that way it might even fit memory (just put spaces for the unused one for types)

appending with `:t upsert is slow as it keeps rewriting the whole file

you could append to a splayed table `:t/ (but you might have to call .Q.en) with .Q.fs which does not have this problem

and you could probably change the default 128k size to something much bigger with .Q.fsn

hope that helps,

   Attila

Hi Goran,

I’m facing the same problem you described. Writing to disk is incredibly slow. Have you figured out a way around this? 

Content-ID: <7CA5C39317FED74CBE4E01BD5CD8E5CC@GBRP123.PROD.OUTLOOK.COM>

Couple of points here;

  • You can exclude columns from being loaded in by not defining a type and leaving white space.

q)flip timefreqtemp!("s&nbsp; &nbsp; f &nbsp; &nbsp; f&nbsp; ";",")0:$“sample.csv”

time        freq    temp 


03:11:46 PM 5986010 121.1

03:11:46 PM 5986010 121.1

03:11:46 PM 5986010 121.1

03:11:46 PM 5986010 121.1

03:11:46 PM 5986010 121.1

03:11:46 PM 5986010 121.1

  • `:t upset table is saving it down as a flat file, for large datasets it should be save as splayed, with a leading forward slash.

http://code.kx.com/wiki/Cookbook/SplayedTables

  • You should remove the PM from your time column and cast it to a time before saving it down (instead of currently saving as a symbol).

q)update (“T”$-3_)'[time] from flip timefreqtemp!("*&nbsp; &nbsp; f &nbsp; &nbsp; f&nbsp; ";",")0:$“sample.csv”

time         freq    temp 


03:11:46.000 5986010 121.1

03:11:46.000 5986010 121.1

03:11:46.000 5986010 121.1

03:11:46.000 5986010 121.1

03:11:46.000 5986010 121.1

03:11:46.000 5986010 121.1

Thanks,

Caolan