Add Unigram tokenizer needed by T5 and FLAN-T5 model families #8089
llama : add Unigram tokenizer
llama : fix preventing crashes when precompiled_charsmap is not present
vocab.n_precompiled_charsmap = gguf_get_arr_n(ctx, precompiled_charsmap_keyidx);
vocab.precompiled_charsmap = (char *) malloc(vocab.n_precompiled_charsmap);
memcpy((void*) vocab.precompiled_charsmap, gguf_get_arr_data(ctx, precompiled_charsmap_keyidx), vocab.n_precompiled_charsmap);
There's a memory leak here. Use std::vector<char> instead of:
uint32_t n_precompiled_charsmap = 0;
char * precompiled_charsmap = NULL;
Good catch! I replaced it with a vector as suggested, but I had to move the endianness correction code from llm_tokenizer_ugm to llm_load_vocab. The tokenizer only holds a const reference to vocab, so modifying the precompiled_charsmap vector buffer there would require const casts.
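For readers following the thread, the suggested change can be sketched roughly as below. This is a minimal illustration, not the code from the PR: `vocab_sketch` and `set_precompiled_charsmap` are hypothetical names, and the `data`/`n` arguments stand in for whatever `gguf_get_arr_data` and `gguf_get_arr_n` return for the charsmap key.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the vocab struct; only the field under discussion is shown.
struct vocab_sketch {
    // The vector owns its buffer and releases it in its destructor,
    // so there is no malloc'ed pointer left to leak.
    std::vector<char> precompiled_charsmap;
};

// Copy `n` bytes starting at `data` into the vector-owned buffer.
// `data` and `n` are assumed to come from the GGUF array accessors.
inline void set_precompiled_charsmap(vocab_sketch & vocab, const void * data, size_t n) {
    assert(data != nullptr || n == 0);
    const char * bytes = static_cast<const char *>(data);
    vocab.precompiled_charsmap.assign(bytes, bytes + n);
}
```

With this layout the separate length field becomes redundant, since `precompiled_charsmap.size()` carries the same information.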
// initialize score_sum to -FLT_MAX so it will be always lower than sums of token scores
std::vector<struct best_tokenization> tokenization_results(input_len + 1, {0, 0, -FLT_MAX});
// at the beginning tokenization score is zero
tokenization_results[0] = { 0, 0, 0 };
Is this supposed to be:
// initialize score_sum to -FLT_MAX so it will be always lower than sums of token scores
std::vector<struct best_tokenization> tokenization_results(input_len + 1, {vocab.special_unk_id, 0, -FLT_MAX});
// at the beginning tokenization score is zero
tokenization_results[0] = { vocab.special_unk_id, 0, 0 };
Currently, a string consisting of a single space character tokenizes to a single PAD token [0], while the AutoTokenizer returns an empty array of tokens in this case. With the change above, llama.cpp returns a single UNK token [2], which is still incorrect, or at least does not match the AutoTokenizer result.
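To make the role of the `tokenization_results` table clearer, here is a rough, self-contained sketch of a Viterbi-style best-path search over a toy vocabulary. It is not the PR's implementation: the field names of `best_tokenization` are assumptions (the snippet above only shows the initializer order), `toy_vocab` and `unigram_tokenize` are hypothetical, matching is done per byte rather than per UTF-8 character, and there is no normalization or merging of consecutive unknown tokens.

```cpp
#include <algorithm>
#include <cassert>
#include <cfloat>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Field names are assumptions; the initializer order follows the snippet above.
struct best_tokenization {
    int32_t token_id;     // token that ends at this input position
    size_t  input_offset; // position where that token starts
    float   score_sum;    // best total score of any tokenization reaching this position
};

// Toy vocabulary (hypothetical): token text -> (token id, log-probability score).
using toy_vocab = std::unordered_map<std::string, std::pair<int32_t, float>>;

inline std::vector<int32_t> unigram_tokenize(const std::string & input, const toy_vocab & vocab,
                                             int32_t unk_id, float unk_score) {
    const size_t input_len = input.size();
    // score_sum starts at -FLT_MAX so any real tokenization scores higher
    std::vector<best_tokenization> results(input_len + 1, {unk_id, 0, -FLT_MAX});
    // the empty prefix has score zero
    results[0] = {unk_id, 0, 0.0f};

    // forward pass: extend the best path ending at `start` with every matching token
    for (size_t start = 0; start < input_len; ++start) {
        const float base = results[start].score_sum;
        bool single_found = false;
        for (size_t end = start + 1; end <= input_len; ++end) {
            const auto it = vocab.find(input.substr(start, end - start));
            if (it == vocab.end()) {
                continue;
            }
            if (end == start + 1) {
                single_found = true;
            }
            const float candidate = base + it->second.second;
            if (candidate > results[end].score_sum) {
                results[end] = {it->second.first, start, candidate};
            }
        }
        // fall back to UNK for a byte with no single-byte token, so every position stays reachable
        if (!single_found) {
            const float candidate = base + unk_score;
            if (candidate > results[start + 1].score_sum) {
                results[start + 1] = {unk_id, start, candidate};
            }
        }
    }

    // backward pass: follow input_offset links from the end of the input to position 0
    std::vector<int32_t> output;
    for (size_t pos = input_len; pos > 0; pos = results[pos].input_offset) {
        output.push_back(results[pos].token_id);
    }
    std::reverse(output.begin(), output.end());
    return output;
}
```

In this sketch the `token_id` stored in `results[0]` is never read during backtracking, because the loop stops at position 0. Whether that also holds for the code under review depends on how it handles empty or whitespace-only input after normalization, which is the case the comment above raises.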
This is the second PR in a series of PRs adding support for T5 and FLAN-T5 models.
This PR adds an implementation of the Unigram tokenizer used in T5 and FLAN-T5 models. It also adds the T5 model architecture, tensors, and model header parameters so that the tokenizer can be tested with the llama-tokenize command.