fix #875 #878

JunnYu · 2021-08-13T03:28:54Z

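Fixes #875
Cause: strip("##") also deletes "#" characters that are meaningful, because str.strip treats its argument as a set of characters to remove from both ends, not as a literal substring.
Example: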

text = "#1 it is a nice day 1##"
print(text.strip("##"))
# Output: 1 it is a nice day 1
print(text.replace("##",""))
# Output: #1 it is a nice day 1
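
For comparison, here is a minimal sketch of a stricter alternative that removes only a single leading "##" and nothing else. The helper name `remove_continuation_prefix` is hypothetical and is not taken from this PR or from PaddleNLP; it only illustrates the difference between character-set stripping and prefix removal.

```python
def remove_continuation_prefix(token: str, prefix: str = "##") -> str:
    """Hypothetical helper: drop one leading `prefix`, leave everything else untouched."""
    if token.startswith(prefix):
        return token[len(prefix):]
    return token


text = "#1 it is a nice day 1##"

# str.strip("##") is the same as str.strip("#"): the argument is a set of characters.
print(text.strip("##") == text.strip("#"))   # True

# Only a real "##" prefix is removed; a lone "#" survives.
print(remove_continuation_prefix("##ing"))   # ing
print(remove_continuation_prefix("#1"))      # #1
```

Unlike `replace("##", "")`, this variant would also leave a "##" in the middle or at the end of a token alone. Whether that matters depends on how the tokenizer produces its tokens.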

Test code:

examples = [{'id': '56bf41013aeaaa14008c959c', 'title': 'Super_Bowl_50', 'context': '## a niThis was the first Super Bowl to feature a quarterback on both teams who was the #1 pick in their draft classes. Manning was the #1 selection of the 1998 NFL draft, while Newton was picked first in 2011. The matchup also pits the top two picks of the 2011 draft against each other: Newton for Carolina and Von Miller for Denver. Manning and Newton also set the record for the largest age difference between opposing Super Bowl quarterbacks at 13 years and 48 days (Manning was 39, Newton was 26).', 'question': 'In 2011, who was the first player to be chosen in the NFL draft?', 'answers': ['Newton', 'Newton', 'Newton'], 'answer_starts': [171, 171, 171], 'is_impossible': False}]
def prepare_validation_features(examples, tokenizer):
    contexts = [examples[i]['context'] for i in range(len(examples))]
    questions = [examples[i]['question'] for i in range(len(examples))]

    tokenized_examples = tokenizer(
        questions,
        contexts,
        stride=128,
        max_seq_len=512)

    # For validation, there is no need to compute start and end positions
    for i, tokenized_example in enumerate(tokenized_examples):
        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_example['token_type_ids']

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = tokenized_example['overflow_to_sample']
        tokenized_examples[i]["example_id"] = examples[sample_index]['id']

        # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token
        # position is part of the context or not.
        tokenized_examples[i]["offset_mapping"] = [
            (o if sequence_ids[k] == 1 else None)
            for k, o in enumerate(tokenized_example["offset_mapping"])
        ]

    return tokenized_examples[0]
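
# NOTE: `tokenizer` must be created beforehand; the original comment does not show which one was used.
outputs = prepare_validation_features(examples, tokenizer)
for ids, offsetmap in zip(outputs["input_ids"], outputs["offset_mapping"]):
    if offsetmap:
        # Print each context token's character span next to the text it covers.
        print(offsetmap, examples[0]["context"][offsetmap[0]:offsetmap[1]])
        print("=" * 20)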

The test results are correct.
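
To make the effect on offsets concrete, here is a toy sketch that needs no tokenizer. It is an assumption-laden illustration, not PaddleNLP's actual offset-mapping code: `toy_offsets` and the hand-written token list are hypothetical, and each token is simply located in the text after being cleaned.

```python
def toy_offsets(text, tokens, clean):
    """Toy offset mapping: find each cleaned token in `text`, scanning left to right."""
    offsets, cursor = [], 0
    for tok in tokens:
        piece = clean(tok)
        start = text.find(piece, cursor)
        end = start + len(piece)
        offsets.append((start, end))
        cursor = end
    return offsets


text = "the #1 pick"
tokens = ["the", "#", "1", "pick"]  # hand-written tokens, assumed for illustration

# Cleaning with strip("##") turns the "#" token into an empty string.
buggy = toy_offsets(text, tokens, lambda t: t.strip("##"))
# Removing only a leading "##" keeps the "#" token intact.
fixed = toy_offsets(text, tokens, lambda t: t[2:] if t.startswith("##") else t)

print(buggy)  # [(0, 3), (3, 3), (5, 6), (7, 11)]
print(fixed)  # [(0, 3), (4, 5), (5, 6), (7, 11)]
```

In the buggy variant the "#" token maps to the empty span (3, 3), so slicing the context with that offset returns nothing. With the stricter cleaning it maps to (4, 5), which is the "#" character itself.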

smallv0221 · 2021-08-13T03:59:14Z

LGTM! Thank you for your contribution!

ZeyuChen

LGTM

fix 875

b686517

yingyibiao requested a review from smallv0221 August 13, 2021 03:30

yingyibiao assigned smallv0221 Aug 13, 2021

ZeyuChen added the bug Something isn't working label Aug 13, 2021

ZeyuChen approved these changes Aug 13, 2021

View reviewed changes

ZeyuChen merged commit 15a074b into PaddlePaddle:develop Aug 13, 2021

JunnYu deleted the fix#875 branch August 30, 2021 04:14
