Use pysimdjson for parsing wat records #49
Hi @silentninja, thanks for the PR and testing it out.
Because of the observed incompatibilities, it looks like a drop-in use of simdjson isn't recommended.
If we use simdjson.loads(json_blob) or the parse method with recursion (simdjson.Parser().parse(json_blob, True)), the performance gains are mostly lost.
I'm currently running a couple of performance tests and will report back about them on issue #41.
  if self.is_wat_json_record(record):
      # WAT (response) record
-     record = json.loads(self.get_payload_stream(record).read())
+     record = self.json_extractor.parse(self.get_payload_stream(record).read())
Unfortunately, processing the JSON may raise an exception:
  File "/mnt/data/wastl/proj/cc/git/cc-pyspark/sparkcc.py", line 377, in iterate_records
    for res in self.process_record(record):
               ~~~~~~~~~~~~~~~~~~~^^^^^^^^
  File "/mnt/data/wastl/proj/cc/git/cc-pyspark/server_count.py", line 42, in process_record
    server_names.append(headers[header].strip())
                        ^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'csimdjson.Array' object has no attribute 'strip'
Here is a short snippet showing why this happens:
>>> import simdjson
>>> type(simdjson.Parser().parse('[1,2]'))
<class 'csimdjson.Array'>
>>> type(simdjson.Parser().parse('[1,2]', True))
<class 'list'>
Would need to add an extra check:
... or isinstance(headers[header], simdjson.Array)
But then it's no drop-in replacement anymore.
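For illustration only, the extra check could be wrapped in a small helper like the one below. This is a sketch, not the code in server_count.py: the helper name `header_values` and the module-level tuple `_SEQUENCE_TYPES` are hypothetical, and it assumes that a header value is either a single string or a sequence of strings.

```python
# Hypothetical helper: accept header values from json.loads or from a
# non-recursive simdjson parse, where multi-valued headers arrive as
# simdjson.Array instead of list.
try:
    import simdjson
    _SEQUENCE_TYPES = (list, simdjson.Array)
except ImportError:
    # simdjson is optional; without it only plain lists can occur
    _SEQUENCE_TYPES = (list,)


def header_values(value):
    """Return the stripped header value(s) as a plain Python list."""
    if isinstance(value, _SEQUENCE_TYPES):
        return [str(v).strip() for v in value]
    return [value.strip()]
```

Every call site that inspects a parsed value would need a check of this kind, which is why the change stops being a drop-in replacement.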
  try:
      import simdjson
      self.json = simdjson.Parser()
      self.parse = self.json.parse
Could write:
self.parse = lambda j: self.json.parse(j, True)
to force recursive parsing and avoid incompatibilities.
However, then one of the major performance benefits of the simdjson module fades away.
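A minimal sketch of that variant, with a fallback to the standard json module when simdjson is not installed. The class name `JsonExtractor` is hypothetical and only loosely modelled on the `json_extractor` attribute seen in the diff; the fallback branch is an assumption, not something shown in this PR.

```python
import json


class JsonExtractor:
    """Hypothetical wrapper that always returns plain dict/list objects."""

    def __init__(self):
        try:
            import simdjson
            parser = simdjson.Parser()
            # Second argument True = recursive: convert the whole document
            # to Python objects instead of returning lazy Array/Object proxies.
            self.parse = lambda blob: parser.parse(blob, True)
        except ImportError:
            # Assumed fallback: behave exactly like the original code
            self.parse = json.loads


extractor = JsonExtractor()
parsed = extractor.parse('{"Server": ["nginx", "cloudflare"]}')
```

With this wrapper the calling code keeps working unchanged, at the cost of the full conversion to Python objects on every record.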
  if not l:
      continue
- if url_attr in l:
+ if url_attr is not None and url_attr in l:
Good observation and thanks for testing this. Unfortunately, it's not the only incompatibility.
Fixes #41
Made changes to the examples to use pysimdjson for parsing wat records and avoid causing the error mentioned in TkTech/pysimdjson#122.