Create a splayed table in parallel?

It is easy to create CSV files in parallel, since the files are independent of each other: you can launch as many processes as you want and have each one write its own CSV file to the file system.

Splayed tables, however, are tricky, since multiple sub-directories share the same “sym” file. It's unsafe to have multiple processes writing to the same “sym” file.
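To make the shared dependency concrete, here is a minimal sketch of two partitions being written from one process. The HDB root `/tmp/hdb`, the table `trade` and its columns are made up for illustration; `.Q.en` and `set` are the standard calls for enumerating and splaying.

```q
/ Hypothetical HDB root and sample data; all names and paths are illustrative.
hdb:`:/tmp/hdb
trade:([]sym:`ibm`msft`ibm;price:101.5 45.2 101.7)

/ .Q.en enumerates the symbol columns of a table against <hdb>/sym,
/ appending any new symbols to that file, and returns the enumerated table.
`:/tmp/hdb/2015.08.21/trade/ set .Q.en[hdb] trade
`:/tmp/hdb/2015.08.22/trade/ set .Q.en[hdb] trade

/ Both partition directories now depend on the single file /tmp/hdb/sym.
get `:/tmp/hdb/sym
```

If two processes each ran one of the `set` lines at the same time, both would be appending to `/tmp/hdb/sym`, which is where the concern about concurrent writers comes from.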

How do we get around this?

http://code.kx.com/wiki/Cookbook/LoadingFromLargeFiles#Parallel_Loading

q will attempt to acquire a lock on the sym file when it’s enumerating against it, so theoretically you should be able to have multiple processes enumerating against the same sym file in parallel.

    q).z.i
    27864
    q)`:sym?`a
    `sym$`a

strace shows:

    $ strace -p 27864
    Process 27864 attached - interrupt to quit
    read(0, "`:sym?`a\n", 4080) = 9
    stat("sym", 0x7fff6e0fc5b0) = -1 ENOENT (No such file or directory)
    unlink("sym$") = -1 ENOENT (No such file or directory)
    open("sym$", O_RDWR|O_CREAT, 0666) = 3
    write(3, "\377\1\v\0\0\0\0\0", 8) = 8
    close(3) = 0
    rename("sym$", "sym") = 0
    stat("sym#", 0x7fff6e0fc560) = -1 ENOENT (No such file or directory)
    open("sym", O_RDWR|O_CREAT, 0666) = 3
    read(3, "\377\1\v\0\0\0\0\0", 8) = 8
    lseek(3, 0, SEEK_SET) = 0
    fcntl(3, F_SETLKW, {type=F_WRLCK, whence=SEEK_CUR, start=0, len=0}) = 0
    lseek(3, 2, SEEK_SET) = 2
    fstat(3, {st_mode=S_IFREG|0644, st_size=8, ...}) = 0

https://www.gnu.org/software/libc/manual/html_node/File-Locks.html

However, if your sym file is on NFS this might not be safe: http://0pointer.de/blog/projects/locking.html

Thanks, Martin. Unfortunately I am on NFS. Is there a good way around it?

As shown in the diagram, the enumService should perform the locking itself, for cases where NFS locking might not be safe. Is this overkill? Is there a simpler method? Thanks.

Sorry, Yan, I may have misunderstood you. Do you mean that loading in parallel on NFS is still safe?

The following “multiple parser + single writer” design should be simpler than the previous enumService design. The kdb+ writer process handles incoming IPC messages in FIFO order, so parsing happens in parallel while writing (and therefore enumeration against the sym file) happens sequentially.
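The single-writer idea can be sketched in q roughly as follows. This is only an outline under stated assumptions: the port `5010`, the function name `writePart`, the CSV path and the two-column schema are all hypothetical, and the HDB root `/nfs/hdb` is taken from the question in this thread.

```q
/ ---- writer.q (hypothetical): start with  q writer.q -p 5010 ----
hdb:`:/nfs/hdb

/ writePart is an illustrative name. d is a date, tn a table name,
/ t an unenumerated table sent by a parser process.
writePart:{[d;tn;t]
  p:` sv hdb,(`$string d),tn,`;   / e.g. `:/nfs/hdb/2015.08.23/trade/
  p set .Q.en[hdb] t;             / enumerate against hdb/sym, then splay
  p}

/ ---- parser.q (hypothetical): run one per input file, in parallel ----
h:hopen `::5010
/ assumes a CSV with a header row and two columns: a symbol and a float
t:("SF";enlist ",") 0: `:/data/trade_2015.08.23.csv
h(`writePart;2015.08.23;`trade;t)   / synchronous call: returns once written
hclose h
```

Since only the writer ever touches the sym file, the design does not rely on file locking over NFS at all. The synchronous call is a choice made here so that each parser knows its partition has been written; an asynchronous send would also keep the writes sequential.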

Tom Martin: If I have 2 kdb instances on the same host enumerating against /nfs/hdb/sym at the same time, will the locking work properly?

I think it’s implementation-specific and depends on your OS.

http://man7.org/linux/man-pages/man2/fcntl.2.html

**Record locking and NFS**

Before Linux 3.12, if an NFSv4 client loses contact with the server for a period of time (defined as more than 90 seconds with no communication), it might lose and regain a lock without ever being aware of the fact. (The period of time after which contact is assumed lost is known as the NFSv4 leasetime. On a Linux NFS server, this can be determined by looking at _/proc/fs/nfsd/nfsv4leasetime_, which expresses the period in seconds. The default value for this file is 90.) This scenario potentially risks data corruption, since another process might acquire a lock in the intervening period and perform file I/O.

Since Linux 3.12, if an NFSv4 client loses contact with the server, any I/O to the file by a process which "thinks" it holds a lock will fail until that process closes and reopens the file. A kernel parameter, _nfs.recover\_lost\_locks_, can be set to 1 to obtain the pre-3.12 behavior, whereby the client will attempt to recover lost locks when contact is reestablished with the server. Because of the attendant risk of data corruption, this parameter defaults to 0 (disabled).

On Sunday, August 23, 2015 at 11:58:14 PM UTC+1, Yan Yan wrote:

Tom Martin: If I have 2 kdb instances on the same host enumerating against /nfs/hdb/sym at the same time, will the locking work properly?