Fast file reading

roryk · June 4, 2024, 4:41am

https://learninghub.kx.com/forums/topic/fast-file-reading

I was testing the speed of loading in a file with one word on each line, and found reading it as a CSV is faster and uses less memory than read0, despite being a more roundabout method. Out of interest, is there a reason why? And is there anything faster?

\ts:100 read0 words
2297 18874480
\ts:100 first (1#"*";" ")0: words
1003 14680704

megan_mcp · June 5, 2024, 11:27am

Hi @roryk

If the file contains spaces, 0: with a space delimiter and a single column type only loads a subset of the data: the first field of each line.

It could also just be the read vs mmap performance.

Hope this helps.

Thanks,

Megan

roryk · June 6, 2024, 5:59am

Hi, the file doesn't contain spaces; it's just one word per line, so the results are the same. If mmap is faster, is there a reason read0 doesn't use it, or is it just a potential optimisation that hasn't been implemented?

It doesn't really matter, as it's only a couple of milliseconds even for a fairly large file, but it would be nice to have a deeper understanding of the performance of various operations.

megan_mcp · June 7, 2024, 12:00pm

Hi @roryk

I reached out to one of our developers on this and this was their response:

"Actually looks like read0 is using a load of memcmp calls & scanning for \n
where 0: is using memchr to find it in a single call:

q)\ts:1 (1#"*";"-")0:`:testf
221 36800

seconds  usecs/call     calls      function
-------- ----------- --------- --------------------
0.095358          95      1003 memchr
0.060194         106       564 memmove

240799 0.000319 memchr("qwertyuiopasdfghjklzxcvbnm\nqwert"..., '\n', 13311) = 0x7fa1eaf3d0d7
240799 0.005258 memchr("qwertyuiopasdfghjklzxcvbnm\nqwert"..., '-', 26) = 0
240799 0.005147 memmove(0x7fa1e6b121d0, "qwertyuiopasdfghjklzxcvbnm", 26) = 0x7fa1e6b121d0

q)\ts:1 read0 `:testf
2161 52624

seconds  usecs/call     calls      function
-------- ----------- --------- --------------------
1.332051          98     13505 memcmp
0.102397          97      1046 memmove

240799 0.000144 memcmp(0x4bf876, 0x7fa1e6b7f410, 1, 113) = 0xffffff93
240799 0.000152 memcmp(0x4bf876, 0x7fa1e6b7f411, 1, 119) = 0xffffffa5
240799 0.000172 memcmp(0x4bf876, 0x7fa1e6b7f412, 1, 101) = 0xffffff98
240799 0.000160 memcmp(0x4bf876, 0x7fa1e6b7f413, 1, 114) = 0xffffff96
240799 0.000144 memcmp(0x4bf876, 0x7fa1e6b7f414, 1, 116) = 0xffffff91
240799 0.000144 memcmp(0x4bf876, 0x7fa1e6b7f415, 1, 121) = 0xffffff95
240799 0.000144 memcmp(0x4bf876, 0x7fa1e6b7f416, 1, 117) = 0xffffffa1
240799 0.000154 memcmp(0x4bf876, 0x7fa1e6b7f417, 1, 105) = 0xffffff9b
240799 0.000162 memcmp(0x4bf876, 0x7fa1e6b7f418, 1, 111) = 0xffffff9a
240799 0.000236 memcmp(0x4bf876, 0x7fa1e6b7f419, 1, 112) = 0xffffffa9
240799 0.000150 memcmp(0x4bf876, 0x7fa1e6b7f41a, 1, 97) = 0xffffff97
240799 0.000149 memcmp(0x4bf876, 0x7fa1e6b7f41b, 1, 115) = 0xffffffa6
240799 0.000149 memcmp(0x4bf876, 0x7fa1e6b7f41c, 1, 100) = 0xffffffa4
240799 0.000222 memcmp(0x4bf876, 0x7fa1e6b7f41d, 1, 102) = 0xffffffa3
240799 0.000175 memcmp(0x4bf876, 0x7fa1e6b7f41e, 1, 103) = 0xffffffa2
240799 0.000313 memcmp(0x4bf876, 0x7fa1e6b7f41f, 1, 104) = 0xffffffa0
240799 0.000311 memcmp(0x4bf876, 0x7fa1e6b7f420, 1, 106) = 0xffffff9f
240799 0.000312 memcmp(0x4bf876, 0x7fa1e6b7f421, 1, 107) = 0xffffff9e
240799 0.000239 memcmp(0x4bf876, 0x7fa1e6b7f422, 1, 108) = 0xffffff90
240799 0.000149 memcmp(0x4bf876, 0x7fa1e6b7f423, 1, 122) = 0xffffff92
240799 0.000148 memcmp(0x4bf876, 0x7fa1e6b7f424, 1, 120) = 0xffffffa7
240799 0.000218 memcmp(0x4bf876, 0x7fa1e6b7f425, 1, 99) = 0xffffff94
240799 0.000156 memcmp(0x4bf876, 0x7fa1e6b7f426, 1, 118) = 0xffffffa8
240799 0.000148 memcmp(0x4bf876, 0x7fa1e6b7f427, 1, 98) = 0xffffff9c
240799 0.000208 memcmp(0x4bf876, 0x7fa1e6b7f428, 1, 110) = 0xffffff9d
240799 0.000326 memcmp(0x4bf876, 0x7fa1e6b7f429, 1, 109) = 0
240799 0.000270 memmove(0x7fa1e6b121d0, "qwertyuiopasdfghjklzxcvbnm", 26) = 0x7fa1e6b121d0
240799 0.000174 memmove(0x7fa1e6b75f78, "\300!\261\346\241\177\0\0", 8) = 0x7fa1e6b75f78
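
To make the difference in the two traces concrete, here is a minimal C sketch of the two scanning patterns. It is not kdb+ source: the function names count_lines_bytewise and count_lines_memchr are hypothetical, and the code only assumes what the trace suggests, namely one 1-byte memcmp per character in the read0 case versus one memchr per line in the 0: case.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical illustration, not kdb+ code.
   Scan for '\n' with one 1-byte memcmp per character, similar in spirit
   to the memcmp pattern in the read0 trace. */
size_t count_lines_bytewise(const char *buf, size_t len)
{
    const char nl = '\n';
    size_t lines = 0;
    for (size_t i = 0; i < len; i++) {
        if (memcmp(&nl, buf + i, 1) == 0)
            lines++;
    }
    return lines;
}

/* Hypothetical illustration, not kdb+ code.
   Scan for '\n' with one memchr call per line, as in the 0: trace. */
size_t count_lines_memchr(const char *buf, size_t len)
{
    size_t lines = 0;
    const char *p = buf;
    const char *end = buf + len;
    while (p < end) {
        const char *hit = memchr(p, '\n', (size_t)(end - p));
        if (hit == NULL)
            break;          /* no further newline in the buffer */
        lines++;
        p = hit + 1;        /* continue after the newline just found */
    }
    return lines;
}
```

Both functions return the number of newline characters in the buffer. The first makes one library call per byte, while the second makes roughly one call per line and leaves the inner loop to the C library, which is typically optimised for this kind of search. That would be consistent with the 13505 memcmp calls versus 1003 memchr calls in the summaries above, although the sketch says nothing about how read0 is actually implemented.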

megan_mcp · June 7, 2024, 12:10pm

@roryk

I can follow up on this further if you would like to know why read0 doesn't use memchr (and/or mmap)?

roryk · June 9, 2024, 7:24pm

Hi Megan,
Thanks for the information!
It would be interesting to know the reason, but I don't want to take up too much of a dev's time on a random curiosity. If someone is free and wants to look into it, that would be great, but no problem if not.

sujoy · June 16, 2024, 3:46am

I think read0 (0::) reads one line at a time, while 0: picks out columns. 0: will be faster if you know exactly which position you are looking at.
