incomplete data load with 0:

Hello:

   I'm reading the blog post at http://www.benfrederickson.com/distance-metrics/ and trying to use 32-bit kdb+/q on Linux to run some experiments.

After downloading and unzipping the dataset from http://www.dtic.upf.edu/~ocelma/MusicRecommendationDataset/lastfm-360K.html, I loaded it with:

q)t1: ("SSSI";"\t") 0: `$":usersha1-artmbid-artname-plays.tsv"

q)t1

00000c289a1829a808ac09c00daf10bc3c4e223b 00000c289a1829a808ac09c00daf10bc3c4e..

3bd73256-3905-4f3a-97e2-8b341527f805     f2fb0ff0-5679-42ec-a55c-15109ce6e320..

betty blowtorch                          die Ärzte                          ..

2137                                     1099                                ..

q)count t1 0

8668227

q)

But the file itself has about twice as many lines:

> wc -l usersha1-artmbid-artname-plays.tsv 

17559530 usersha1-artmbid-artname-plays.tsv

So 0: is not reading the full data.

For example, user 42cf1f37b26b59c9960362939af89ff938ae7d9e is missing from t1 but present in the tsv.

I also tried

grep 42cf1f37b26b59c9960362939af89ff938ae7d9e usersha1-artmbid-artname-plays.tsv > 1.tsv

and 1.tsv then loads with no problem.

Any idea what's going on?

Thanks in advance.

It's a pretty large text file. You might have more luck using .Q.fs as detailed in: http://code.kx.com/wiki/Cookbook/LoadingFromLargeFiles


From: personal-kdbplus@googlegroups.com [mailto:personal-kdbplus@googlegroups.com] On Behalf Of Hao Deng
Sent: Tuesday, May 10, 2016 10:25 AM
To: Kdb+ Personal Developers <personal-kdbplus@googlegroups.com>
Subject: [personal kdb+] incomplete data load with 0:


Maybe there are unmatched quotes?
You can get rid of them by turning every double quote into a single quote:

$ cat usersha1-artmbid-artname-plays.tsv | tr '"' "'" > new.tsv

and then load new.tsv.

kdb+ 3.3 is sensitive to unmatched double quotes.
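Before rewriting the file, it may help to confirm that unmatched quotes are really present. A minimal sketch, assuming "unmatched" can be approximated as "a line containing an odd number of double quotes"; the function name odd_quote_lines is hypothetical, not part of kdb+ or any standard tool:

```shell
# Hypothetical helper: print "line number: text" for every line that
# contains an odd number of double quotes.
# With '"' as the field separator, a line holding k quotes splits into
# k+1 fields, so an even, non-zero field count means k is odd.
odd_quote_lines() {
  awk -F'"' 'NF > 0 && NF % 2 == 0 { print NR ": " $0 }' "$1"
}
```

For example, `odd_quote_lines usersha1-artmbid-artname-plays.tsv | head` would show the first few suspect lines, if any exist.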

@charles: You are right, it works.

q)t1: `user`mbid`artname`plays!("SSSI";"\t")0: `$":new.tsv"

q)count t1 `user

17559530

wc -l usersha1-artmbid-artname-plays.tsv 

17559530 usersha1-artmbid-artname-plays.tsv

Is it possible to change the 0: function to raise an error or display an error message in this case?
Also, is it possible to fix this double-quote bug?

It's a side effect of allowing line breaks inside quoted fields.
The behavior becomes configurable in the forthcoming 3.4.
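To illustrate why a stray quote can shrink the row count, here is a rough sketch of the general mechanism, not kdb+'s actual parser: a reader that allows line breaks inside quoted fields keeps appending lines to the current record until the quotes balance. The sample data and the awk logic are assumptions made up for this illustration.

```shell
# Build a tiny 4-line TSV; lines 2 and 4 each contain one lone double quote.
sample=$(mktemp)
printf 'u1\tm1\tabc\t1\nu2\tm2\t12" mix\t2\nu3\tm3\tdef\t3\nu4\tm4\t7" single\t4\n' > "$sample"

# Plain line count, as wc -l sees the file.
lines=$(wc -l < "$sample" | tr -d ' ')

# Quote-aware count: while a quote is still open (odd running total),
# the next line is treated as a continuation of the same record.
records=$(awk '{ n += gsub(/"/, "&"); if (n % 2 == 0) { r++; n = 0 } }
               END { print r + 0 }' "$sample")

echo "lines=$lines records=$records"   # prints: lines=4 records=2
rm -f "$sample"
```

Here lines 2 through 4 collapse into a single record because the quote opened on line 2 is only closed on line 4, which is the same kind of silent row loss seen in the thread.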