Compression for null string column

https://learninghub.kx.com/forums/topic/compression-for-null-string-column

Hi,


We're seeing compressed null string columns take up more space on disk than expected. Would anyone be able to shed some light on this behaviour?


Example:


q)n:10000000;tab:([]time:n#.z.p;val:n?1000;str:n#enlist "");(`:tab/;17;2;5) set tab
`:tab/
q)-21!`:tab/str
compressedLength  | 14074225
uncompressedLength| 80004096
algorithm         | 2i
logicalBlockSize  | 17i
zipLevel          | 5i
q)-21!`$":tab/str#"
compressedLength  | 24189
uncompressedLength| 20004096
algorithm         | 2i
logicalBlockSize  | 17i
zipLevel          | 5i


According to this page, "the non-sharp file is a serialized q list of integers representing the lengths of each sublist of the original list."

For a null string column we'd expect the non-sharp file to just contain zeroes, which should compress better than what we're seeing.
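
As a sanity check on the figures above, the uncompressed lengths work out to a fixed 8 bytes per row for the non-sharp file and 2 bytes per row for the sharp file, each with 4096 bytes left over. This is only arithmetic on the reported numbers; reading the 4096 as a file header is an assumption, and the names `main` and `sharp` are just for illustration.

```q
n:10000000            / row count from the example
main:80004096-8*n     / bytes left after 8 per row in tab/str
sharp:20004096-2*n    / bytes left after 2 per row in tab/str#
(main;sharp)          / both come to 4096
```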


Using 4.0 2020.06.18


Thanks,

Eoghan

Can you test against a newer version of 4.0?


My 4.1 gets much improved numbers:

q)(.z.K;.z.k)
4.1
2024.04.29
q)n:10000000;tab:([]time:n#.z.p;val:n?1000;str:n#enlist "");(`:tab/;17;2;5) set tab
`:tab/
q)-21!`:tab/str
compressedLength  | 136807
uncompressedLength| 80004096
algorithm         | 2i
logicalBlockSize  | 17i
zipLevel          | 5i
//Your compression 5.6x
q)80004096%14074225
5.684441
//Compression now 584x
q)80004096%136807
584.7953
q)-21!`$":tab/str#"
compressedLength  | 93
uncompressedLength| 4098
algorithm         | 2i
logicalBlockSize  | 17i
zipLevel          | 5i
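
The ratios above can also be read straight from the dictionary that -21! returns. A minimal sketch, where `ratio` is a hypothetical helper name and not part of q:

```q
/ ratio: hypothetical helper; takes the stats dictionary returned by -21!
/ and divides uncompressed size by compressed size
ratio:{[s] s[`uncompressedLength]%s`compressedLength}

/ the 4.1 figures for tab/str quoted above
ratio `compressedLength`uncompressedLength!136807 80004096
```

Against files on disk this would be called as ratio -21!`:tab/str, assuming the file is compressed so that both keys are present.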

I expect the improvement arrived with this entry in the 4.0 README for 2022.04.15:


2022.04.15
NEW
anymap write now detects consecutive deduplicated (address matching) toplevel objects, skipping them to save space
q)a:("hi";"there";"world");`:a0 set a;`:a1 set a@where 1000 2000 3000;(hcount`$":a0#")=hcount`$":a1#"
improved memory efficiency of writing nested data sourced from a type 77 file, commonly encountered during compression of files. e.g.
q)`:a set 500000 100#"abc";system"ts `:b set get`:a" / was 76584400 bytes, now 8390720
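
A quick way to check a given build against that README date might look like the sketch below. It assumes the 2022.04.15 entry really is where the change landed (an expectation in this thread, not a confirmed fact) and that every major version after 4.0 includes it; `hasFix` is a hypothetical name.

```q
/ hasFix: hypothetical helper, not part of q
/ ver is a major version as in .z.K, rel a release date as in .z.k
hasFix:{[ver;rel] (ver>4.0) or (ver=4.0) and rel>=2022.04.15}

hasFix[4.0;2020.06.18]   / the build from the original question
hasFix[.z.K;.z.k]        / the running process
```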

Thanks @rocuinneagain, will test.


FYI, we'd also tested changing the type from string to symbol; the symbol column compresses at about the same ratio as in your example:


q)show c:count get`:eohara_dev/strCol
18809996
q)vals:`sym?c#`
q)sym
`symbol$()
q)(`:eohara_dev/test;17;2;5)set vals
`:eohara_dev/test
q)-21!`:eohara_dev/test
compressedLength  | 257281
uncompressedLength| 150484064
algorithm         | 2i
logicalBlockSize  | 17i
zipLevel          | 5i
q)150484064%257281
584.9016