-
-
Notifications
You must be signed in to change notification settings - Fork 868
Fix json parsing #1836
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Fix json parsing #1836
Conversation
A string value in the JSON could itself contain a codeblock.
The codeblock-pattern regex does not work for nested codeblocks or codeblocks within the JSON payload itself. Using the general regex pattern first fixes the parsing in such cases.
'paren' -> 'brace'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Important
Looks good to me! 👍
Reviewed everything up to 8998850 in 52 seconds. Click for details.
- Reviewed
106
lines of code in3
files - Skipped
0
files when reviewing. - Skipped posting
3
draft comments. View those below. - Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. instructor/utils/core.py:57
- Draft comment:
Simplified JSON extraction now uses the first and last brace instead of regex, which improves readability and predictability. Be sure this approach meets all edge cases (e.g., multiple JSON objects) as intended. - Reason this comment was not posted:
Confidence changes required:80%
<= threshold85%
None
2. tests/test_json_extraction_edge_cases.py:90
- Draft comment:
The tests for nested code blocks and JSON values with inner code blocks are now improved. They effectively verify that the new extraction method correctly isolates the JSON content. - Reason this comment was not posted:
Confidence changes required:80%
<= threshold85%
None
3. tests/test_json_extraction_edge_cases.py:212
- Draft comment:
Async extraction tests are currently skipped. Consider enabling them (using pytest-asyncio) to provide full coverage of the async JSON extraction functionality. - Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.
Workflow ID: wflow_3wd31cHc4gxhiUUn
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
@jxnl Hey Jason, this is a pretty small PR, can you review it please ? It's fixing two tests and essentially removing some code. I think it won't take much of your time, you will quickly see if you want to keep it or change/discard the changes. Note: one of the tests was failing silently, which may be be the most important element of the PR. Thanks! |
Nested code blocks break JSON parsing
JSON parsing with
extract_json_from_codeblock
fails when nested code blocks are in the input string.I have seen at least two kinds of errors, which I outline below.
To solve this issue, I propose to simply remove the use of regex. The code becomes simpler and more predictable. I also doubt that using regex is optimized in any meaningful way here, it could even be slower.
Nested 'outer' code blocks
The
_JSON_CODEBLOCK_PATTERN
regex used inextract_json_from_codeblock
is too greedy, making the function completely miss the JSON payload.Bad test
The only test covering this 'nested code blocks' case wasn't actually testing anything. 'Hacked' by Claude I guess ... See the
tests.test_json_extraction_edge_cases
module:Nested code block
In this case,
extract_json_from_codeblock
was extracting 'Inner start'.Code block in the JSON payload itself
In a similar way, if one value in the JSON is a string which happens to contain a code block, the parsing fails.
Code block in the JSON itself
In this case,
extract_json_from_codeblock
was extracting ' {"name": "'Related issue
I added this new test case in
tests.test_json_extraction_edge_cases
.Important
Simplifies JSON extraction in
extract_json_from_codeblock
by removing regex, fixing nested code block handling, and updating tests.extract_json_from_codeblock
incore.py
now extracts JSON by finding the first '{' and last '}' instead of using regex.test_nested_codeblocks
intest_json_extraction_edge_cases.py
to correctly test nested code blocks.test_json_with_codeblock_in_a_value
intest_json_extraction_edge_cases.py
to test JSON values with code blocks.test_json_extraction.py
to remove misleading comments about regex.core.py
.This description was created by
for 8998850. You can customize this summary. It will automatically update as commits are pushed.