Badly Encoded Tokens/Mojibake

#55
by muchanem - opened

I've noticed that a lot of accent characters, e.g. the "é" in "también" are encoded wrong (it looks like they were forced from a Latin-1 encoding to UTF-8). Take a look with this code

from transformers import AutoTokenizer
llama_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
t = llama_tokenizer.convert_ids_to_tokens([29571])[0]
print(t)

Is anyone else encountering this issue? I've attached an image of the problematic encoding. Is there an easy fix? The issue exists at the model level, not within HuggingFace's code because if you download the tokenizer.json you'll see the same bad encodings.
image.png

muchanem changed discussion title from Badly Encoded Tokens to Badly Encoded Tokens/Mojibake

same problem.
In [5]: from transformers import AutoModelForCausalLM, AutoTokenizer
...:
...: model_id = '/mnt/disk15/wy/meta-llama/Meta-Llama-3-70B-Instruct/'
...: tokenizer = AutoTokenizer.from_pretrained(model_id)
...:
...: text = 'hello, 我是李华,来自中国'
...: tokens = tokenizer.tokenize(text)
...: print(f"Tokenized result: {tokens}")
...:
...: # 将token序列转换为ID序列
...: input_ids = tokenizer.convert_tokens_to_ids(tokens)
...:
...: # 解码ID序列,得到可读文本
...: decoded_text = tokenizer.decode(input_ids)
...: print(f"Decoded result: {decoded_text}")
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Tokenized result: ['hello', ',', 'ĠæĪij', 'æĺ¯', 'æĿİ', 'åįİ', 'ï¼Į', 'æĿ¥èĩª', 'ä¸ŃåĽ½']
Decoded result: hello, 我是李华,来自中国

Sign up or log in to comment