Hi,
I need to process over 1500 TAQ files (each over 500 MB) to read and store one day's trades and quotes. The usual process is to loop over each file, read it and store the data, but this method is very slow. I also need to keep memory consumption in mind.
Option 1 -
I made a function, read_taq_file, which reads and processes the file passed to it and returns the dataset. Can I use this function with peach and, say, 16 slaves?
files:all_taq_files                        / list of TAQ file paths
while[count files;
  chunk:16 sublist files;                  / take 16 files at a time
  data:raze read_taq_file peach chunk;     / parse the chunk in parallel on the slaves
  / ... store the data for these 16 processed files to disk to free memory ...
  files:16 _ files]
Will this method work?
Option 2 -
Another (more efficient) way would be to start multiple instances, all reading, processing and storing data simultaneously. The problem is the conflict while writing data. It is a partitioned database with enumerations. How do I implement a locking mechanism?
Yup, I tried parallel reads. The disk works fine in terms of i/o capacity. I couldn't try writes because I haven't yet figured out how to avoid conflicts while writing to the same files.
But for now, I would ignore the problem, or maybe reduce the slaves to 8.
You might find this article useful: http://code.kx.com/wiki/Cookbook/LoadingFromLargeFiles (see the Parallel Loading section near the bottom). But unless your files are non-overlapping in terms of which partitions their data belongs to, parallel loading will be hard to implement. If this is a one-off operation I'd leave it running overnight and be done with it. I remember loading approximately 1TB of CSV files into a kdb+ database in less than 12 hours without even using slaves.
If the data overlaps you can try to identify “clusters” of files like this:
File 1, File 2 -> day 1
File 3 -> day 1, day 2
File 4 -> day 2
File 5 -> day 2, day 3
File 6, File 7 -> day 3
etc.
Then you can process the clusters in parallel and process the overlapping bits sequentially as the last step. In the example above you can process File 1 and File 2 while simultaneously processing File 6 and File 7, etc. A rough sketch of this grouping follows below.
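A very rough, simplified q sketch of that idea (per-day groups of single-day files rather than general clusters; the fileDays dictionary, its dates and the loadFile function are purely illustrative, not from the thread):

fileDays:`file1`file2`file3`file4`file5`file6`file7!
  (1#2016.04.18;1#2016.04.18;2016.04.18 2016.04.19;1#2016.04.19;2016.04.19 2016.04.20;1#2016.04.20;1#2016.04.20)
single:where 1=count each fileDays           / files whose data falls in exactly one day
g:group first each fileDays single           / day -> positions within single
{loadFile each x} peach single g             / one worker per day, so no two workers write the same partition
loadFile each where 1<count each fileDays    / overlapping files last, sequentially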
You can't easily avoid write conflicts. The file that really matters, though, is the sym file - kdb+ creates a lock on it when it writes to it, so you can enumerate without conflict.
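For reference, that enumeration step is just a call like the following (the HDB root `:hdb and the trades table name are illustrative):

/ kdb+ takes the lock on `:hdb/sym itself during this call, so concurrent loaders are safe
enumerated:.Q.en[`:hdb] trades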
Assuming the cpu:io ratio is high (i.e. the majority of the time is spent parsing the data rather than reading/writing), you could run it in parallel by writing to multiple temp directories. For example:
assume N slave processes
each slave reads a set of files and writes the resultant splayed table(s) to a temporary location, e.g. `:temp0/tabname, `:temp1/tabname ... `:tempN/tabname. The tables are all enumerated against the main sym file in the HDB directory.
when all slaves are complete, a separate process stitches the tables together into the final result set, i.e. reads in the table from each temp directory and upserts it to the HDB directory.
Obviously there is a bit more i/o here (as the data is being read twice and written twice) but it would allow you to load in parallel without conflict. It can also be done in relatively low memory (e.g. the files can be read in chunks, so large in-memory tables can be avoided). If the data for an instrument is all contained in the same file (and assuming you are going down the standard `p#sym approach) then you don't have to worry about a big on-disk sort at the end.
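A minimal q sketch of the temp-directory approach, under assumed names (read_taq_file from the earlier post; the HDB root `:hdb, the trade table, the 2016.04.22 partition and the `:temp0..`:tempN directories are all illustrative):

/ run on each of the N worker processes: parse a file and append the result,
/ enumerated against the shared sym file, to that worker's own temp directory
loadOne:{[tmpdir;f]
  (` sv tmpdir,`trade,`) upsert .Q.en[`:hdb] read_taq_file f;}
/ e.g. worker 0 does: loadOne[`:temp0] each itsFiles

/ final single-process stitch step: load the enumeration domain, then append
/ each temp table into the target partition of the HDB
stitch:{[tmpdirs]
  `sym set get `:hdb/sym;                                   / so the enumerations resolve
  {`:hdb/2016.04.22/trade/ upsert get ` sv x,`trade} each tmpdirs;}
/ e.g. stitch hsym each `$("temp0";"temp1";"temp2")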
Another similar approach would be to have multiple readers and a single writer - the reader processes read the files and pass chunks of data as async messages to the writer process. Ideally you would pass all the data for a single instrument in the same message so you can avoid a sort at the end. This approach would involve less i/o but, at a guess, more memory, unless the writer process can keep up with all the data being thrown at it by the readers. This is probably a bit more like the peach version you have mentioned, but personally I would rather use multi-process than multi-slave for something like this because it gives a bit more control and flexibility. You should probably look at .z.pd as well (http://code.kx.com/wiki/Reference/peach#Peach_using_multiple_processes_.28Distributed_each.29)
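As a rough illustration of the readers/writer split (the port number, the upd name, the 2016.04.22/trade target and myFiles are all assumptions, not from the thread), the writer could simply expose an upd function and the readers could send it chunks asynchronously:

/ --- writer process, started with e.g.: q writer.q -p 5000 ---
upd:{[t;chunk]
  / enumerate the chunk against the HDB sym file and append it to the partition
  (` sv `:hdb/2016.04.22,t,`) upsert .Q.en[`:hdb] chunk;}

/ --- each reader process ---
h:hopen `::5000                          / connect to the writer
sendFile:{[f]
  (neg h)(`upd;`trade;read_taq_file f);  / async send, so the reader does not block on the writer
  }
sendFile each myFiles                    / myFiles: this reader's share of the 1500 files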
I went through the respective links, and I also liked the idea of multiple temp directories. Will try both and update for sure. And yes, I will definitely implement locking for the sym file.
Do you really need to process a single file in parallel? Different files would go to different partitions/tables, I think. Since you have over 1500 files, maybe it's enough to make sure a single process writes a single file and no other process touches that file. This is actually very simple to achieve, and since, as Jonny mentioned, the sym file is locked automatically, you only need to make sure that processes do not work on the same file.
I’ve done this in the past using a global “lock dir” plus status files. The algorithm works as follows (for each process):
1. Try “mkdir lock.dir”
   a) if this fails, wait a bit and try again; repeat until success
2. If the lock succeeded, select any file (or one chosen according to your own rules) that does not have a “status file” yet
3. Create a “status file” for the selected file
4. Release the lock with “rmdir lock.dir”
5. Process the file and write the data
6. Start from 1 again.
I’ve selected “mkdir lock.dir” as my locking mechanism as it is atomic at the OS level. Obviously all processes should try to create the same “lock.dir”.
It has one additional feature that turned out handy: if you need to, you can pause processing just by creating the “lock.dir” manually.
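A minimal q sketch of that loop, assuming Unix-style mkdir/rmdir/sleep and string file paths (lock.dir, the .status suffix and processFile are illustrative names, not from the post):

lockdir:"lock.dir"
acquire:{while[not @[{system "mkdir ",x;1b};lockdir;0b];system "sleep 1"]}  / spin until mkdir succeeds
release:{system "rmdir ",lockdir}
claim:{[files]                                    / pick an unclaimed file; "" if none are left
  acquire[];
  left:files where {()~key hsym `$x,".status"} each files;
  f:$[count left;first left;""];
  if[count f;(hsym `$f,".status") set ()];        / mark the chosen file as taken
  release[];
  f}
/ each worker repeats: claim a file, process it, until none are left
run:{[files]while[count f:claim files;processFile f]}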
kind regards
wieslaw
On 22 April 2016 at 14:31, Jonny Press <pressjonny0@gmail.com> wrote:
Sounds good
Just to be clear - you don’t need to implement locking on the sym file - kdb+ does it automatically