Hi All,
I am looking at Reddit data to carry out some analysis on social networks.
I’m sourcing my data from the following site:
https://files.pushshift.io/reddit/comments/
(Thanks to Jason)
The site contains a set of compressed JSON files in various archive formats. I am trying to load those files into a kdb+ database.
As an example, I unzip the JSON file in RC_2010_12.bz2. I can read the resultant text file (say using an application such as more) and it looks like regular JSON.
However, when I try loading the file, I get a report of illegal JSON inside the file.
I’m using .j.k raze read0`:RC_2010_12.txt
and the error I get is:
'illegal char { at 458
[0] .j.k raze read0`:RC_2010_12.txt
I did go to the 458th character in the file but everything around there looked ok.
Could you advise what the issue might be and how I might get around it?
Thanks and regards,
Simon
To be clear on this: although the message says there is an illegal “{”, I see all brackets as correctly paired. Also, imports of this file into other database platforms work without issue.
Thanks and regards,
Simon
Hi Simon,
that looks like jsonl (one complete JSON object per line), not a single JSON document
http://jsonlines.org/
so at the moment, you need to parse each line, i.e.
q)r:.j.k@/:read0`$":RC_2010-12"
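To see why the raze version fails, here is a minimal sketch with two made-up records (the field names and values are assumptions, not taken from the real dump). Joining the lines gives "{...}{...}", which is not one JSON value, so the parser should stop at the second "{", which would match the error reported above.

```q
/ two hypothetical records standing in for lines of the dump
lines:("{\"id\":\"a1\",\"score\":1}";"{\"id\":\"b2\",\"score\":2}")

/ raze glues the lines into a single string "{...}{...}";
/ trap the parse error so the script keeps running
@[.j.k;raze lines;{"failed: ",x}]

/ parsing line by line gives one dictionary per record
r:.j.k each lines
count r
```

.j.k each is the same thing as .j.k@/: here, just spelled with a keyword.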
and then to collapse to a table:
q)k:distinct raze key each r
q)t:k#/:r
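A small sketch of what those two lines do, again on made-up records (the second one deliberately lacks the score field). Taking the same key list from every dictionary gives a list of conforming dictionaries, which q treats as a table; a field missing from a record should come back as a null in that row.

```q
/ hypothetical records; the second has no "score" field
r:.j.k each ("{\"id\":\"a1\",\"score\":1}";"{\"id\":\"b2\"}")

/ union of all field names seen across the records
k:distinct raze key each r

/ take the same keys, in the same order, from every dictionary
t:k#/:r

/ column names of the resulting table
cols t
```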
you’ll need to fix up ~3 columns as null is read as 0n, e.g.
q)update author_flair_text:count[i]#enlist"" from `t where author_flair_text~\:0n
q)update author_flair_css_class:count[i]#enlist"" from `t where author_flair_css_class~\:0n
q)update distinguished:count[i]#enlist"" from `t where distinguished~\:0n
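The fix-up can be tried on a toy table first. This is only a sketch with two invented rows, written with each-left (~\:) so that every cell of the mixed column is compared against 0n in turn.

```q
/ hypothetical two-row sample: the second record carries a JSON null
r:.j.k each ("{\"id\":\"a1\",\"distinguished\":\"moderator\"}";"{\"id\":\"b2\",\"distinguished\":null}")
t:(distinct raze key each r)#/:r

/ distinguished is now a mixed column: a string in one row, 0n in the other
/ ~\: matches each cell against 0n, so only the null rows are selected;
/ count[i] is the number of selected rows, each replaced by an empty string
update distinguished:count[i]#enlist"" from `t where distinguished~\:0n

/ the amended table
t
```

After the update every cell in the column is a string, which makes the column much easier to work with downstream.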
edited seems to be a mix of bools and ints.
hth,
Charlie
Thanks Charlie - I’ll give this a crack and report back how I go.
Really appreciate you taking the time.
Simon
For good order: this works! Now I’ll try to understand how it works.
Media_embed looks funky - that’s probably because for the most part it was just an empty pair of curly brackets.
Thanks again Charlie - I’ve been struggling for weeks.