@JunnYu JunnYu commented Aug 13, 2021

Fixes #875
Cause: strip("##") also removes "#" characters that are a meaningful part of the original text!
Example:

text = "#1 it is a nice day 1##"
print(text.strip("##"))
# Output: 1 it is a nice day 1
print(text.replace("##", ""))
# Output: #1 it is a nice day 1
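
The behaviour in the example above follows from how Python's `str.strip` treats its argument: as a set of characters to remove from both ends, not as a literal prefix or suffix, so `strip("##")` is equivalent to `strip("#")`. The sketch below illustrates this, together with one possible way to drop only a single leading `##` marker. `strip_wordpiece_prefix` is a hypothetical helper written for illustration; it is not the code changed in this PR.

```python
def strip_wordpiece_prefix(token: str, prefix: str = "##") -> str:
    """Remove at most one leading `prefix`; leave every other '#' untouched.

    Hypothetical helper for illustration only.
    """
    if token.startswith(prefix):
        return token[len(prefix):]
    return token


text = "#1 it is a nice day 1##"

# The argument to strip() is a character set, so "##" behaves exactly like "#":
# every leading and trailing '#' is removed, including the meaningful one in "#1".
stripped = text.strip("##")

# replace() only removes literal occurrences of the two-character string "##".
replaced = text.replace("##", "")

# The helper only touches a "##" at the very start of a token.
continuation = strip_wordpiece_prefix("##ing")
hash_token = strip_wordpiece_prefix("#1")
```

Note that `replace` removes `##` anywhere in the string, while the prefix check only affects the start of a token; which of the two is appropriate depends on how the tokenizer marks sub-word pieces.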

Test code:

examples = [{'id': '56bf41013aeaaa14008c959c', 'title': 'Super_Bowl_50', 'context': '## a niThis was the first Super Bowl to feature a quarterback on both teams who was the #1 pick in their draft classes. Manning was the #1 selection of the 1998 NFL draft, while Newton was picked first in 2011. The matchup also pits the top two picks of the 2011 draft against each other: Newton for Carolina and Von Miller for Denver. Manning and Newton also set the record for the largest age difference between opposing Super Bowl quarterbacks at 13 years and 48 days (Manning was 39, Newton was 26).', 'question': 'In 2011, who was the first player to be chosen in the NFL draft?', 'answers': ['Newton', 'Newton', 'Newton'], 'answer_starts': [171, 171, 171], 'is_impossible': False}]
def prepare_validation_features(examples, tokenizer):
    contexts = [examples[i]['context'] for i in range(len(examples))]
    questions = [examples[i]['question'] for i in range(len(examples))]

    tokenized_examples = tokenizer(
        questions,
        contexts,
        stride=128,
        max_seq_len=512)

    # For validation, there is no need to compute start and end positions
    for i, tokenized_example in enumerate(tokenized_examples):
        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_example['token_type_ids']

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = tokenized_example['overflow_to_sample']
        tokenized_examples[i]["example_id"] = examples[sample_index]['id']

        # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token
        # position is part of the context or not.
        tokenized_examples[i]["offset_mapping"] = [
            (o if sequence_ids[k] == 1 else None)
            for k, o in enumerate(tokenized_example["offset_mapping"])
        ]

    return tokenized_examples[0]
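
# `tokenizer` is assumed to be an already-loaded PaddleNLP tokenizer;
# the original comment does not show how it was created.
outputs = prepare_validation_features(examples, tokenizer)
# Print each context token's character span next to the text it maps to.
for offsets in outputs["offset_mapping"]:
    if offsets:
        start, end = offsets
        print(offsets, examples[0]["context"][start:end])
        print("=" * 20)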

The test output is correct.

@yingyibiao yingyibiao requested a review from smallv0221 August 13, 2021 03:30
@ZeyuChen ZeyuChen added the bug Something isn't working label Aug 13, 2021
@smallv0221

LGTM! Thank you for your contribution!

@ZeyuChen ZeyuChen left a comment

LGTM

@ZeyuChen ZeyuChen merged commit 15a074b into PaddlePaddle:develop Aug 13, 2021
@JunnYu JunnYu deleted the fix#875 branch August 30, 2021 04:14