https://learninghub.kx.com/forums/topic/embedpy-beautifulsoup
Hi All,
I am currently looking into scraping data using BeautifulSoup4 in embedPy.
My issue is that I am having difficulty accessing the list of lists produced by the find_all method of a BeautifulSoup object. I have run the code below in Python and it all works as expected.
For simplicity, I have defined a document in Python over which to apply BeautifulSoup. At the bottom of this post there are some lines of code to create the sample HTML document I used with BeautifulSoup. It is built line by line because I was having all kinds of trouble trying to define it as one very long string in JupyterQ (another post in its own right!).
// import the BeautifulSoup library
bs4: .p.import `bs4
// careful: the double speech marks below need to be the straight ones, not the right-leaning monstrosities from Microsoft Office.
bs: bs4[`:BeautifulSoup;example_html;"html.parser"]
// at this point, running bs[`:prettify;::]` correctly returns the loaded HTML document.
// now run the find_all method to find the a tags that have an href attribute (again watching for those speech marks).
rslt: bs[`:find_all;"a";`href pykw 1b]
Now if I evaluate rslt` I get a list of 2 foreign elements. This agrees with the two records found when I run the same process in pure Python.
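As an independent cross-check of that count, here is a minimal sketch that uses only Python's standard-library html.parser rather than BeautifulSoup. It is a stand-in for find_all("a", href=True), not the same call. SAMPLE_HTML, HrefCollector and collect_hrefs are hypothetical names, and the sample string is a shortened, assumed version of the test data at the bottom of the post.

```python
from html.parser import HTMLParser

# Hypothetical, shortened stand-in for example_html (assumed markup,
# not the exact document from the post).
SAMPLE_HTML = (
    '<a href="http://somegreatsite.com">Link Name</a> is a link to another nifty site'
    'Send me mail at <a href="mailto:support@yourcompany.com">support@yourcompany.com</a>.'
    '<a>This is an empty anchor</a>'
)


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag that carries an href attribute."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lower-cased.
        attr_map = dict(attrs)
        if tag == "a" and "href" in attr_map:
            self.hrefs.append(attr_map["href"])


def collect_hrefs(html):
    """Return the href values of all anchors in html, in document order."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


hrefs = collect_hrefs(SAMPLE_HTML)
print(len(hrefs))
print(hrefs)
```

The anchor without an href is skipped, so the sketch reports two links for this sample, in line with the two elements seen from embedPy.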
Here I get stuck.
// I think the error is that when I ran the find_all method, I should have run it via .p.qcallable:
rslt: .p.qcallable bs[`:find_all;"a";`href pykw 1b]
However, when I do that, I am left with this object:
.[code[foreign]]`.p.q2pargsenlist
I can't seem to manipulate this object at all.
Could you advise how I might use this function to end up with an object which Q sees as a list of 2 strings?
I include the test data I used below. It is built line by line because something seems to overflow in JupyterQ if you do it as a single string with \n.
Regards,
Simon
-------------------------------------------------------------------
example_html: ""
example_html: example_html,""
example_html: example_html,"Your Title Here"
example_html: example_html,""
example_html: example_html,""
example_html: example_html,""
example_html: example_html,""
example_html: example_html,""
example_html: example_html,""
example_html: example_html,"<a href=\"http://somegreatsite.com\">Link Name is a link to another nifty site"
example_html: example_html,"This is a Header"
example_html: example_html,"This is a Medium Header"
example_html: example_html,"Send me mail at <a href=\"mailto:support@yourcompany.com\">support@yourcompany.com."
example_html: example_html,"This is a paragraph!"
example_html: example_html,""
example_html: example_html,"This is a new paragraph!"
example_html: example_html,"This is a new sentence without a paragraph break, in bold italics."
example_html: example_html,"This is an empty anchor"
example_html: example_html,""
example_html: example_html,""
example_html: example_html,""
example_html: example_html,""