A simple Python module to query HTML pages (or XML in general) using (almost) all available CSS selectors and rules. It doesn't bore you with weird objects: you get just plain old lists and dicts.
Why, if we already have BeautifulSoup? Because:
- bs4 doesn't support advanced selectors, such as `a:not(.not-this-a)` (the `:not` selector).
- It gets closer to the lxml performance range.
- I wanted to make something useful in Nim.
- Nim is a very flexible and powerful language that I am delving a bit deeper into.
- Nimquery is a great Nim module/package/library that gives us the querying capabilities.
- Nimpy is an awesome Nim module/package/library that builds a native Python extension (think numpy or pandas) from a Nim module.
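
To make the `:not` example concrete, here is a small sketch of what `a:not(.not-this-a)` is meant to select: every `a` element whose class list does not contain `not-this-a`. It emulates that one selector by hand with the standard library's `html.parser`; it is not how nemo or Nimquery work internally, and the class name `NotClassAnchors` and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class NotClassAnchors(HTMLParser):
    """Collect <a> start tags whose class list lacks a given class (hypothetical helper)."""

    def __init__(self, excluded_class):
        super().__init__()
        self.excluded_class = excluded_class
        self.matches = []  # attribute dicts of the matching <a> tags

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # The class attribute may hold several space-separated names, or be absent.
        classes = (attributes.get("class") or "").split()
        if self.excluded_class not in classes:
            self.matches.append(attributes)


# Made-up sample markup: one link carries the excluded class, two do not.
sample_html = (
    '<p><a class="not-this-a" href="/skip">skip</a>'
    '<a class="keep" href="/one">one</a>'
    '<a href="/two">two</a></p>'
)

parser = NotClassAnchors("not-this-a")
parser.feed(sample_html)
hrefs = [match["href"] for match in parser.matches]
print(hrefs)  # ['/one', '/two']
```

A real selector engine handles the whole selector grammar (combinators, attribute tests, nesting); this only shows the meaning of the negation.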

Build it on your OS:
- Make sure you have nim and nimble installed and working.
- Clone this repo.
- Run `nimble bld` to generate the shared library.
- Run `nimble tst` to test it with a bundled Python script.
- And you are good to go!

Build it in a Docker container (for use with Alpine or Ubuntu containers):
- Make sure you have make and docker installed and working.
- Clone this repo.
- Run `make build` to get the Alpine version (for Ubuntu, set `LINUX = ubuntu`).
- Run `make test` to test it with the bundled Python script in the same container used for the build.
- There you have your `nemo.so` file to put into your desired container!
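
Once the shared library is in place, a quick way to confirm that Python can see it is to ask the import system, without running any query. The sketch below assumes the file is named `nemo.so` and sits in a directory you choose; the helper `can_import` and the path `/opt/nemo` are hypothetical, not part of this repo.

```python
import importlib
import importlib.util
import sys
from pathlib import Path


def can_import(module_name, search_dir=None):
    """Return True if the import system can locate module_name.

    If search_dir is given, it is put at the front of sys.path first,
    which is one simple way to expose a directory holding a .so file.
    """
    if search_dir is not None:
        directory = str(Path(search_dir).resolve())
        if directory not in sys.path:
            sys.path.insert(0, directory)
    # Make sure files added after interpreter start are noticed.
    importlib.invalidate_caches()
    return importlib.util.find_spec(module_name) is not None


# "/opt/nemo" is a placeholder: use the directory where you copied nemo.so.
print(can_import("nemo", "/opt/nemo"))
```

If this prints `False`, check that the directory is correct and that the file was built for the same OS and libc as the container (for example, the Alpine build for an Alpine container).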

Prebuilt binaries (macOS, Alpine and Ubuntu only, for the lazy ones):

Usage:
import nemo  # assuming this is in the module's path

# some_html holds the HTML (or XML) document to query, assumed here to be a string
queries = [
    'body span a:not(.first-item)',
    # all 'a's inside 'span's in 'body' that are not in the '.first-item' class
    '[href$=".pdf"]',
    # all links to pdfs
    'p, span',
    # all of the 'p's and 'span's
]
results = dict(nemo.find(some_html, queries))
# a dict mapping each query string to a list of its findings,
# where each finding is a dict of the element's attributes,
# with its content under the key 'text', like:
{
    'body span a:not(.first-item)': [{'tag': 'a', 'text': 'hi', 'class': 'last-item'}],
    '[href$=".pdf"]': [
        {'tag': 'a', 'href': 'link-to-pdf'},
        {'tag': 'a', 'href': 'link-to-other-pdf'},
    ],
    'p, span': [
        # loads of elements, or maybe none, who knows
    ],
}
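
Because the findings are plain lists and dicts, ordinary comprehensions are enough to post-process them. The sketch below works on sample data copied from the result shape shown above rather than on real `nemo.find` output, and the helper `attribute_values` is a hypothetical name introduced here.

```python
# Sample data in the documented result shape; real output depends on
# the document passed to nemo.find.
results = {
    'body span a:not(.first-item)': [
        {'tag': 'a', 'text': 'hi', 'class': 'last-item'},
    ],
    '[href$=".pdf"]': [
        {'tag': 'a', 'href': 'link-to-pdf'},
        {'tag': 'a', 'href': 'link-to-other-pdf'},
    ],
    'p, span': [],
}


def attribute_values(results, query, attribute):
    """List one attribute's value for each finding of a query.

    Findings that lack the attribute are skipped, and an unknown
    query yields an empty list instead of raising KeyError.
    """
    return [
        finding[attribute]
        for finding in results.get(query, [])
        if attribute in finding
    ]


pdf_links = attribute_values(results, '[href$=".pdf"]', 'href')
link_texts = attribute_values(results, 'body span a:not(.first-item)', 'text')
match_counts = {query: len(found) for query, found in results.items()}

print(pdf_links)   # ['link-to-pdf', 'link-to-other-pdf']
print(link_texts)  # ['hi']
print(match_counts['p, span'])  # 0
```

Skipping findings without the requested attribute keeps the helper safe for mixed results such as `'p, span'`, where not every element carries the same attributes.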