Conversation

@DrownFish19 (Collaborator) commented Aug 23, 2024

PR types

Bug fixes

PR changes

Others

Description

The new tokenizer_config.json now includes added_tokens_decoder, and we load these entries in PretrainedTokenizer._pre_init.

  1. Fixes the inability to add tokens to the llama, gemma, and mamba tokenizers.
  2. Newly added tokens and the original added_tokens_decoder entries are both saved into the added_tokens_decoder dict, so they can be reloaded next time with their ids unchanged.
  3. The saved added_tokens_decoder can be loaded by from_pretrained, guaranteeing that the ids in tokenizer_config.json stay unchanged (see the sketch below).


paddle-bot bot commented Aug 23, 2024

Thanks for your contribution!

@DrownFish19 changed the title from "[tokenizer] fix added_tokens_decoder load" to "[Tokenizer] fix added_tokens_decoder load" on Aug 23, 2024

codecov bot commented Aug 28, 2024

Codecov Report

Attention: Patch coverage is 94.87179% with 2 lines in your changes missing coverage. Please review.

Project coverage is 53.89%. Comparing base (9f6b486) to head (d6f2f38).
Report is 239 commits behind head on develop.

Files with missing lines                    Patch %   Lines
paddlenlp/transformers/gemma/tokenizer.py   81.81%    2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #8997      +/-   ##
===========================================
- Coverage    54.51%   53.89%   -0.63%     
===========================================
  Files          648      652       +4     
  Lines       103473   104388     +915     
===========================================
- Hits         56406    56255     -151     
- Misses       47067    48133    +1066     

☔ View full report in Codecov by Sentry.

"""
return len(self.encoder)

def __len__(self):

@DrownFish19 (Collaborator, Author) commented:

The mamba tokenizer's added_tokens_decoder contains two tokens (ids 0 and 1) that duplicate entries already in the base vocabulary; the previous length calculation counted these two tokens twice.
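
A hedged sketch of a de-duplicated length computation (a minimal illustration; attribute names follow the surrounding diff, and the actual fix in the PR may differ):

    def __len__(self):
        # Take the union of base-vocab ids and added-token ids so that
        # entries such as ids 0 and 1, which appear in both, are only
        # counted once.
        return len(set(range(self.vocab_size)) | set(self.added_tokens_decoder.keys()))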

"""Returns vocab size"""
return self.sp_model.get_piece_size()

def __len__(self):

@DrownFish19 (Collaborator, Author) commented:

Fixes the inability to add tokens; see the sketch below.
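
Background: add_tokens typically assigns each new token the id len(tokenizer), so when __len__ returns only the SentencePiece vocab size, every added token receives an id that collides with an existing piece. A hedged sketch of a __len__ that makes room for additions (attribute names assumed, not quoted from the PR):

    def __len__(self):
        # Count only added tokens whose ids fall outside the base vocab,
        # so special tokens that reuse in-vocab ids are not double
        # counted, while genuine additions extend the effective size.
        return self.vocab_size + len(
            [tok_id for tok_id in self.added_tokens_decoder if tok_id >= self.vocab_size]
        )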

"""Returns vocab size"""
return self.sp_model.get_piece_size()

def __len__(self):

@DrownFish19 (Collaborator, Author) commented:

Fixes the inability to add tokens (same issue as above).

@DrownFish19 changed the title from "[Tokenizer] fix added_tokens_decoder load" to "[Tokenizer] support added_tokens_decoder load" on Aug 28, 2024
@DrownFish19 changed the title from "[Tokenizer] support added_tokens_decoder load" to "[Tokenizer] Support for loading added_tokens_decoder" on Aug 28, 2024
@JunnYu (Member) left a comment:

Mamba OK

@DrownFish19 merged commit 3e7c5ca into PaddlePaddle:develop on Aug 28, 2024
@DrownFish19 deleted the dev_20240823_fix_added_tokens_decoder_load branch on Aug 28, 2024 at 12:38
Mangodadada pushed a commit to Mangodadada/PaddleNLP that referenced this pull request Sep 10, 2024
* fix added_tokens_decoder load

* fix decode

* fix saving and loading added_token_decoder

* fix mamba

* fix special_tokens_map_file load

* fix gemma tokenizer

* fix llama tokenizer

* revert llama tokenizer

* fix _decode