Search for words inside a list of lists of words

Hello,
I have a list of words:

q)words
"digby"
"morrell"
"born"
"10"
"october"
"1979"
"is"
,"a"
"former"
"australian"
"rules"
"footballer"
"who"
"played"
"with"
"the"
"kangaroos"
"and"
"carlton"
"in"
"football"
"league"
..

I also have a list of lists (each containing words):

q)kspl
("digby";"morrell";"born";"10";"october";"1979";"is";,"a";"former";"australia..
("alfred";,"j";"lewy";"aka";"sandy";"graduated";"from";"university";"of";"chi..
("harpdog";"brown";"is";,"a";"singer";"and";"harmonica";"player";"who";"has";..
("franz";"rottensteiner";"born";"in";"waidmannsfeld";"lower";"austria";"on";"..
("henry";"krvits";"born";"30";"december";"1974";"in";"tallinn";"better";"know..
("sam";"henderson";"born";"october";"18";"1969";"is";"an";"american";"cartoon..
("aaron";"lacrate";"is";"an";"american";"music";"producer";"recording";"artis..
("trevor";"ferguson";"aka";"john";"farrow";"born";"11";"november";"1947";"is"..
("grant";"nelson";"born";"27";"april";"1971";"in";"london";"also";"known";"as..
("cathy";"caruth";"born";"1955";"is";"frank";,"h";,"t";"rhodes";"professor";"..
("sophia";"violet";"sophie";"crumb";"born";"september";"27";"1981";"is";"an";..
("jenn";"ashworth";"is";"an";"english";"writer";"she";"was";"born";"in";"1982..
("jonathan";"hoefler";"born";"august";"22";"1970";"is";"an";"american";"typef..
("anthony";"fitzhardinge";"gueterbock";"18th";"baron";"berkeley";"and";"obe";..
("david";"chernushenko";"born";"june";"1963";"in";"calgary";"alberta";"is";,"..
("joerg";"steineck";"is";,"a";"german";"filmmaker";"editor";"and";"graphic";"..
("fr";"andrew";"pinsent";"born";"19";"august";"1966";"is";"research";"directo..
("paddy";"dunne";"was";,"a";"gaelic";"football";"player";"from";"park";"in";"..
("alexandros";"mouzas";"born";"1962";"is";,"a";"greek";"composer";"he";"studi..
("john";"angus";"campbell";"born";"march";"10";"1942";"in";"portland";"oregon..
("chris";"batstone";"was";"the";"20002002";"lead";"singer";"of";"thirdwave";"..
("ceiron";"thomas";"born";"23";"october";"1983";"is";,"a";"welsh";"rugby";"un..

Each row of the list of lists is shaped as follows:

q)kspl[0]
"digby"
"morrell"
"born"
"10"
"october"
"1979"
"is"
,"a"
"former"
"australian"
"rules"
"footballer"
"who"
"played"
"with"
"the"
"kangaroos"
"and"
"carlton"
"in"
"football"
"league"
..

and so on (created by applying " " vs to each string in a list of strings).

I need to know which rows of 'kspl' each word of 'words' appears in, so I do:

q)lc:{dr in kspl x}each til count kspl;

q)lc
11111111111111111111111111111111111111111111111111111111111111111111111111111..
00000011000000110101000001011000000011000010100100000000001000000100100000000..
00000011000010110101000001001000000011000000100111000000001000000100000000000..
00100011000000110101000001001000000011010010100100000000001000000000000000000..
00100011000010110101000001001000000001000000100010000010000000001010000000000..
00101011000000110101000001001000000011000000100010000000001000000000000000000..
00110011000001110101000001101100000011000011100111000000000000000110000010000..
00100011000010110101000001101100000011000011100110010000001000100110000100000..

This returns, for each row of 'kspl', one boolean flag per word of 'words' (held in dr in the code above), marking whether that word occurs in the row.

This is, however, extremely slow. How can I speed it up? I have 59071 rows and 580893 words in all.

Thank you, 

Kumar

P.S.: The calculation of lc above is akin to the document frequency calculation (df in tf-idf).

Do this kspl and dr look like yours?

/try to generate data like yours

dr:gotherooscommadonsin2020 /words you're interested in

kspl:string 0N 10#dr,580000?dr,100?`5 /10 word sentences
dr:string dr
dr in/: kspl
1111111111b
0100000000b
0000001000b
0000000100b
0000000000b
0000000000b
1000000100b
..

Do you want "go go go in2020 comma" to 'score' 11111b?
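To make the shape of the in/: result concrete, here is a minimal sketch on toy data. The names docs, kw, hits, df and rowsByKw and the sample words are made up for illustration; docs stands in for kspl and kw for dr.

```q
/ toy data: three short "documents", already split into words (made-up values)
docs:(("the";"roos";"go");("go";"dons");("in";"the";"league"))
/ hypothetical keyword list, standing in for dr
kw:("go";"the";"carlton")
/ each-right: test the whole keyword list against every document in turn
/ the result has one boolean vector per document, one flag per keyword
hits:kw in/: docs
/ document frequency: how many documents contain each keyword
df:sum hits
/ rows in which each keyword appears, keyed by keyword
rowsByKw:kw!where each flip hits
```

Because in/: handles all rows in one primitive call, there is no need for a lambda over til count kspl.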

//There is another method, which I personally prefer: take the distinct words and, for each one, collect the indexes of the rows it appears in.

//What's more, it creates a mapping for every word rather than only your keywords.

//Sometimes you may want to add a keyword without re-analysing previous text files, so it might be better to keep a full mapping of words -> rows and just index into the map when needed.

\ts old:dr!where each flip dr in/: kspl

100 2511904

\ts ind:sums count each kspl ; new:select distinct 1+ind bin i by kspl from ([]kspl:raze kspl)

234 46662784

//match?

(x:value old)~new@/:dr

1b

//now just find the words you want

1!(dr),'new@/:dr

//for comparison sake how does in/: perform on full word search

drfull:exec kspl from key new

\ts oldfull:drfull!where each flip drfull in/: kspl

577 14634000

//running on a sample file

//let's take a sample file: http://www.textfiles.com/law/citizen5.txt

t:read0`:C:/Users/Sean/Desktop/text.txt

pt:" " vs' t;

//take 100 test words

dr:100?except[distinct raze pt;enlist ""]  /and any other keywords you don't want

\ts old:dr!where each flip dr in/: pt

20 459280

fulldr:except[distinct raze pt;enlist ""]

\ts old:fulldr!where each flip fulldr in/: pt

958 16218512

\ts ind:sums count each pt ; new:select distinct 1+ind bin i by pt from ([]pt:raze pt)

25 935040

/match?

(x:value old)~new@/:fulldr

1b

You would probably see larger improvements on bigger files, and more still as the keyword list grows, which I suspect is what you are doing now or will be doing soon.
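The word -> rows mapping can be sketched on toy data as follows. This is an illustrative variant that uses group on a plain list instead of a select, not the exact code timed above; docs, w, rows and idx are made-up names and the sample words are invented.

```q
/ toy data: three short "documents", already split into words (made-up values)
docs:(("the";"roos";"go");("go";"dons");("in";"the";"league"))
/ running total of words per document
ind:sums count each docs
/ flatten to one long list of word occurrences
w:raze docs
/ row number of every occurrence, via binary search on the running totals
rows:1+ind bin til count w
/ inverted index: each distinct word -> distinct rows it occurs in
idx:distinct each rows@group w
/ look up several words at once by passing a list of strings
idx("go";"the")
```

The index is built once over every word, so adding a keyword later is only a dictionary lookup rather than another pass over the text.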

HTH,

Sean

Thanks very much, Jack and Sean.
Jack - dr, in my case, contains around 580000 words. I've changed your dr to (580000#dr) in my code. It seems to take around the same time.

(And it is fine if "go go go in2020 comma" scores 11111b, as you asked, no worries.)

Sean - Let me test this code in detail; I'll post back.

Cheers, 

Kumar