joey12300
Contributor

PR types

Bug fixes

PR changes

Others

Description

Fix a crash in the faster WordPiece tokenizer on empty string input (for example, a pre-split word that is only whitespace). The following code reproduces it:

from paddlenlp.transformers import AutoTokenizer

model_name_or_path = "ernie-3.0-medium-zh"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_faster=True)

# A single pre-split example whose only word is a space.
data = [[" "]]
ret = tokenizer(data,
                max_length=512,
                padding=True,
                truncation=True,
                is_split_into_words=True)
print(ret)

### error:
"""
terminate called after throwing an instance of 'std::logic_error'
  what():  The split of PreTokenizedString is empty, please call PreTokenizedString::Tokenize first before transform to Encoding.
Aborted
"""

joey12300 requested a review from ZeyuChen on August 5, 2022, 09:17
Member

ZeyuChen left a comment

LGTM

ZeyuChen merged commit 516f549 into PaddlePaddle:develop on Aug 7, 2022