https://learninghub.kx.com/forums/topic/compression-for-null-string-column
Hi,
We're seeing compressed null string columns take up more space on disk than expected. Would anyone be able to shed some light on this behaviour?
Example:
q)n:10000000;tab:([]time:n#.z.p;val:n?1000;str:n#enlist "");(`:tab/;17;2;5) set tab
`:tab/
q)-21!`:tab/str
compressedLength | 14074225
uncompressedLength| 80004096
algorithm | 2i
logicalBlockSize | 17i
zipLevel | 5i
q)-21!`$":tab/str#"
compressedLength | 24189
uncompressedLength| 20004096
algorithm | 2i
logicalBlockSize | 17i
zipLevel | 5i
According to this page, "the non-sharp file is a serialized q list of integers representing the lengths of each sublist of the original list."
For a null string column we'd expect the non-sharp file to just contain zeroes, which should compress better than what we're seeing.
Using 4.0 2020.06.18
Thanks,
Eoghan
Can you test against a newer version of 4.0?
My 4.1 gets much improved numbers:
q)(.z.K;.z.k)
4.1
2024.04.29
q)n:10000000;tab:([]time:n#.z.p;val:n?1000;str:n#enlist "");(`:tab/;17;2;5) set tab
`:tab/
q)-21!`:tab/str
compressedLength | 136807
uncompressedLength| 80004096
algorithm | 2i
logicalBlockSize | 17i
zipLevel | 5i
//Your compression 5.6x
q)80004096%14074225
5.684441
//Compression now 584x
q)80004096%136807
584.7953
q)-21!`$":tab/str#"
compressedLength | 93
uncompressedLength| 4098
algorithm | 2i
logicalBlockSize | 17i
zipLevel | 5i
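To make the same comparison on another column without doing the division by hand, something like the following could be used. It is only a sketch: `ratio` is a name made up for this example, and it relies on `-21!` returning an empty dictionary for a file that is not compressed.

```q
/ ratio: uncompressed-to-compressed size ratio for one file, read from -21! stats
/ (hypothetical helper, not part of q); returns 0n if the file is not compressed
ratio:{[f] s:-21!f; $[count s; s[`uncompressedLength]%s`compressedLength; 0n]}

/ apply to the main file and to its companion # file
ratio each (`:tab/str; `$":tab/str#")
```

Running it before and after an upgrade should make any change in either file easy to spot.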
I expect the improvement first appears in the release described by this 4.0 README entry for 2022.04.15:
2022.04.15
NEW
anymap write now detects consecutive deduplicated (address matching) toplevel objects, skipping them to save space
q)a:("hi";"there";"world");`:a0 set a;`:a1 set a@where 1000 2000 3000;(hcount`$":a0#")=hcount`$":a1#"
improved memory efficiency of writing nested data sourced from a type 77 file, commonly encountered during compression of files. e.g.
q)`:a set 500000 100#"abc";system"ts `:b set get`:a" / was 76584400 bytes, now 8390720
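A quick way to check whether a given process is at or past that release is to compare `.z.K` and `.z.k` against the README date. This is a sketch under assumptions: `hasDedup` is a made-up name, 2022.04.15 is taken as the cut-off for 4.0 purely from the entry quoted above, and later major versions are assumed to include the change (the 4.1 numbers earlier in the thread are consistent with that).

```q
/ hasDedup: does this version/release date fall at or after the README entry?
/ hypothetical helper; the 2022.04.15 cut-off is an assumption taken from the README quote
hasDedup:{[ver;rel] (ver>4.0) or (ver=4.0) and rel>=2022.04.15}

/ check the running process
hasDedup[.z.K;.z.k]
```

For the build in the original post, `hasDedup[4.0;2020.06.18]` gives `0b`.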
Thanks @rocuinneagain , will test.
FYI, we'd also tested changing the type from string to symbol. The symbol column compresses at the same ratio as in your example:
q)show c:count get`:eohara_dev/strCol
18809996
q)vals:`sym?c#`
q)sym
`symbol$()
q)(`:eohara_dev/test;17;2;5)set vals
`:eohara_dev/test
q)-21!`:eohara_dev/test
compressedLength | 257281
uncompressedLength| 150484064
algorithm | 2i
logicalBlockSize | 17i
zipLevel | 5i
q)150484064%257281
584.9016
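One possible reason the two ratios line up so closely: the uncompressed lengths reported in this thread for the string column and the symbol column are both consistent with 8 bytes per row plus the same 4096 bytes. That is just arithmetic on the figures above, not a statement about the file format, and the names `rows` and `size` are only for this example.

```q
rows:10000000 18809996      / row counts from the two examples in this thread
size:80004096 150484064     / uncompressedLength reported by -21! for each file
size-8*rows                 / bytes left over at 8 bytes per row: 4096 4096
```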