Badly Encoded Tokens/Mojibake

#55

by muchanem - opened Apr 23

Apr 23

I've noticed that a lot of accent characters, e.g. the "é" in "también" are encoded wrong (it looks like they were forced from a Latin-1 encoding to UTF-8). Take a look with this code

from transformers import AutoTokenizer
llama_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
t = llama_tokenizer.convert_ids_to_tokens([29571])[0]
print(t)

Is anyone else encountering this issue? I've attached an image of the problematic encoding. Is there an easy fix? The issue exists at the model level, not within HuggingFace's code because if you download the tokenizer.json you'll see the same bad encodings.

muchanem changed discussion title from Badly Encoded Tokens to Badly Encoded Tokens/Mojibake Apr 23

RoacherM

Apr 28

same problem.
In [5]: from transformers import AutoModelForCausalLM, AutoTokenizer
...:
...: model_id = '/mnt/disk15/wy/meta-llama/Meta-Llama-3-70B-Instruct/'
...: tokenizer = AutoTokenizer.from_pretrained(model_id)
...:
...: text = 'hello, 我是李华，来自中国'
...: tokens = tokenizer.tokenize(text)
...: print(f"Tokenized result: {tokens}")
...:
...: # 将token序列转换为ID序列
...: input_ids = tokenizer.convert_tokens_to_ids(tokens)
...:
...: # 解码ID序列,得到可读文本
...: decoded_text = tokenizer.decode(input_ids)
...: print(f"Decoded result: {decoded_text}")
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Tokenized result: ['hello', ',', 'ĠæĪĳ', 'æĺ¯', 'æĿİ', 'åįİ', 'ï¼Į', 'æĿ¥èĩª', 'ä¸ŃåĽ½']
Decoded result: hello, 我是李华，来自中国

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment