When untrained tokens play "catch me if you can" the Fishing For Margikarp paper is the detective:)
The playbook:
- Inspect token vocab & study encode/decode pattern.
- Brute-force on architecture-dependent indicators (same matrix in token embeddings and final layer) to identify untrained tokens.
- Then verify if identified tokens are out of distribution by prompting a target llm (with no tied threshold).
Quite a bait huh, Cohere:)