Skip to content

Conversation

NicolasPllr1
Copy link

@NicolasPllr1 NicolasPllr1 commented Oct 4, 2025

Nested code blocks break JSON parsing

JSON parsing with extract_json_from_codeblock fails when nested code blocks are in the input string.

I have seen at least two kinds of errors, which I outline below.

To solve this issue, I propose to simply remove the use of regex. The code becomes simpler and more predictable. I also doubt that using regex is optimized in any meaningful way here, it could even be slower.

Nested 'outer' code blocks

The _JSON_CODEBLOCK_PATTERN regex used in extract_json_from_codeblock is too greedy, making the function completely miss the JSON payload.

Bad test

The only test covering this 'nested code blocks' case wasn't actually testing anything. 'Hacked' by Claude I guess ... See the tests.test_json_extraction_edge_cases module:

Nested code block
class TestJSONExtractionEdgeCases:

    def test_nested_codeblocks(self):
        """Test extraction with nested code blocks."""
        text = """
        Outer start
        ```
        Inner start
        ```json
        {"level": "inner"}
        ```
        Inner end
        ```
        Outer end
        """
        # Our regex might have limitations with nested code blocks
        # Let's test this a different way

        # Simplified test with just the JSON part
        simplified = """
        ```json
        {"level": "inner"}
        ```
        """
        result = extract_json_from_codeblock(simplified)
        parsed = json.loads(result)
        assert parsed["level"] == "inner"

In this case, extract_json_from_codeblock was extracting 'Inner start'.

Code block in the JSON payload itself

In a similar way, if one value in the JSON is a string which happens to contain a code block, the parsing fails.

Code block in the JSON itself
    def test_json_with_codeblock_in_a_value(self):
        """Test extraction of JSON that has a value containing a codeblock."""
        text = """
        ```json
        {"name": "```string value with a codeblock```"}
        ```
        """
        result = extract_json_from_codeblock(text)
        parsed = json.loads(result)
        assert parsed["name"] == "```string value with a codeblock```"

In this case, extract_json_from_codeblock was extracting ' {"name": "'

Related issue

I added this new test case in tests.test_json_extraction_edge_cases.


Important

Simplifies JSON extraction in extract_json_from_codeblock by removing regex, fixing nested code block handling, and updating tests.

  • Behavior:
    • extract_json_from_codeblock in core.py now extracts JSON by finding the first '{' and last '}' instead of using regex.
    • Handles nested code blocks and JSON values containing code blocks.
  • Tests:
    • Fixes test_nested_codeblocks in test_json_extraction_edge_cases.py to correctly test nested code blocks.
    • Adds test_json_with_codeblock_in_a_value in test_json_extraction_edge_cases.py to test JSON values with code blocks.
    • Updates test_json_extraction.py to remove misleading comments about regex.
  • Misc:
    • Removes unused regex imports from core.py.

This description was created by Ellipsis for 8998850. You can customize this summary. It will automatically update as commits are pushed.

A string value in the JSON could itself contain a codeblock.
The codeblock-pattern regex does not work for nested codeblocks or
codeblocks within the JSON payload itself.

Using the general regex pattern first fixes the parsing in such cases.
'paren' -> 'brace'
@NicolasPllr1 NicolasPllr1 marked this pull request as ready for review October 4, 2025 20:58
Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important

Looks good to me! 👍

Reviewed everything up to 8998850 in 52 seconds. Click for details.
  • Reviewed 106 lines of code in 3 files
  • Skipped 0 files when reviewing.
  • Skipped posting 3 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. instructor/utils/core.py:57
  • Draft comment:
    Simplified JSON extraction now uses the first and last brace instead of regex, which improves readability and predictability. Be sure this approach meets all edge cases (e.g., multiple JSON objects) as intended.
  • Reason this comment was not posted:
    Confidence changes required: 80% <= threshold 85% None
2. tests/test_json_extraction_edge_cases.py:90
  • Draft comment:
    The tests for nested code blocks and JSON values with inner code blocks are now improved. They effectively verify that the new extraction method correctly isolates the JSON content.
  • Reason this comment was not posted:
    Confidence changes required: 80% <= threshold 85% None
3. tests/test_json_extraction_edge_cases.py:212
  • Draft comment:
    Async extraction tests are currently skipped. Consider enabling them (using pytest-asyncio) to provide full coverage of the async JSON extraction functionality.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.

Workflow ID: wflow_3wd31cHc4gxhiUUn

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.

@NicolasPllr1
Copy link
Author

NicolasPllr1 commented Oct 6, 2025

@jxnl Hey Jason, this is a pretty small PR, can you review it please ?

It's fixing two tests and essentially removing some code. I think it won't take much of your time, you will quickly see if you want to keep it or change/discard the changes.

Note: one of the tests was failing silently, which may be be the most important element of the PR.

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant