NatML
BERTWordpieceTokenizer
class NatSuite.MLX.Tokenizers.BERTWordpieceTokenizer : ITokenizer
This tokenizer operates on tokens that have already been pre-tokenized by the BERTBasicTokenizer. It is used to further split complex words into their constituent word pieces.

Creating the Tokenizer

/// <summary>
/// Create a BERT wordpiece tokenizer.
/// </summary>
/// <param name="vocabulary">BERT vocabulary encoding dictionary.</param>
/// <param name="unknownToken">BERT unknown token.</param>
/// <param name="maxCharsPerWord">Maximum characters per word.</param>
BERTWordpieceTokenizer (
    IReadOnlyDictionary<string, int> vocabulary,
    string unknownToken = "[UNK]",
    int maxCharsPerWord = 200
);
The tokenizer is constructed with a BERT vocabulary dictionary that maps each token to its integer encoding. The unknown token and the maximum number of characters per word are optional, defaulting to "[UNK]" and 200 respectively.
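As a minimal sketch, the vocabulary can be built from a standard BERT vocab.txt file, where each line holds one token and the line index is its encoding. The file path and loading code below are illustrative, not part of the NatML API:

using System.Collections.Generic;
using System.IO;
using System.Linq;
using NatSuite.MLX.Tokenizers;

// Build the vocabulary from a BERT vocab file (one token per line; line index = encoding).
var vocabulary = File.ReadLines("vocab.txt") // hypothetical path
    .Select((token, index) => (token, index))
    .ToDictionary(pair => pair.token, pair => pair.index);
// Create the wordpiece tokenizer with the default unknown token and word length limit.
var tokenizer = new BERTWordpieceTokenizer(vocabulary);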

Tokenizing Text

/// <summary>
/// Tokenize a piece of text into its word pieces.
/// For example: input = "unaffable", output = ["una", "##ffa", "##ble"].
/// </summary>
/// <param name="text">A single token or whitespace separated tokens.</param>
/// <returns>A list of wordpiece tokens.</returns>
string[] Tokenize (string text);
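For example, tokenizing the word from the doc comment above (the output assumes the pieces "una", "##ffa", and "##ble" are present in the vocabulary):

// Tokenize a single pre-tokenized word into its word pieces.
string[] wordpieces = tokenizer.Tokenize("unaffable");
// wordpieces == ["una", "##ffa", "##ble"], provided these pieces exist in the vocabulary.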
Refer to the Tokenizing Text section of the ITokenizer interface for more information.
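Wordpiece tokenization is conventionally implemented as a greedy longest-match-first search over the vocabulary. The NatML implementation is not shown here; the following is only an illustrative sketch of that standard scheme, expressed in terms of the constructor parameters above:

using System.Collections.Generic;

// Illustrative greedy longest-match-first wordpiece splitting (not the NatML source).
static string[] WordpieceSplit (
    string word,
    IReadOnlyDictionary<string, int> vocabulary,
    string unknownToken = "[UNK]",
    int maxCharsPerWord = 200
) {
    // Overly long words map directly to the unknown token.
    if (word.Length > maxCharsPerWord)
        return new [] { unknownToken };
    var pieces = new List<string>();
    var start = 0;
    while (start < word.Length) {
        // Find the longest substring starting at `start` that is in the vocabulary.
        var end = word.Length;
        string match = null;
        while (end > start) {
            var piece = word.Substring(start, end - start);
            if (start > 0)
                piece = "##" + piece; // continuation pieces carry the `##` prefix
            if (vocabulary.ContainsKey(piece)) {
                match = piece;
                break;
            }
            end--;
        }
        // If no piece matches, the whole word becomes the unknown token.
        if (match == null)
            return new [] { unknownToken };
        pieces.Add(match);
        start = end;
    }
    return pieces.ToArray();
}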