BERTTokenizer
class NatSuite.MLX.Tokenizers.BERTTokenizer : ITokenizer
This tokenizer is used when making predictions with BERT and DistilBERT natural language models.
This class is part of the NatMLX extension library.

Creating the Tokenizer

/// <summary>
/// Create a BERT tokenizer.
/// </summary>
/// <param name="tokens">BERT vocabulary tokens.</param>
/// <param name="lowercase">Lowercase all tokens.</param>
BERTTokenizer (string[] tokens, bool lowercase = true);
INCOMPLETE.
/// <summary>
/// Create a BERT tokenizer.
/// </summary>
/// <param name="vocabulary">BERT vocabulary encoding dictionary.</param>
/// <param name="lowercase">Lowercase all tokens.</param>
/// <param name="classificationToken">BERT classification token.</param>
/// <param name="separationToken">BERT separation token.</param>
/// <param name="unknownToken">BERT unknown token.</param>
BERTTokenizer (
    IReadOnlyDictionary<string, int> vocabulary,
    bool lowercase = true,
    string classificationToken = "[CLS]",
    string separationToken = "[SEP]",
    string unknownToken = "[UNK]"
);
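The two constructors accept the vocabulary either as a token array or as a token-to-encoding dictionary. The sketch below shows one plausible relationship between the two, assuming that a token's encoding is its index in the array (the usual layout of a BERT `vocab.txt` file). That assumption is not confirmed by this page, and `VocabularySketch` and `ToVocabulary` are hypothetical names, not part of NatMLX.

```csharp
using System;
using System.Collections.Generic;

public static class VocabularySketch {

    // Hypothetical helper: assigns each token its array index as the encoding.
    // ASSUMPTION: index-based encodings; the library may build its vocabulary differently.
    public static IReadOnlyDictionary<string, int> ToVocabulary (string[] tokens) {
        var vocabulary = new Dictionary<string, int>(tokens.Length);
        for (var i = 0; i < tokens.Length; ++i)
            vocabulary[tokens[i]] = i; // a duplicated token keeps its last index
        return vocabulary;
    }
}
```

Under that assumption, the resulting dictionary could be passed to the second constructor, while the raw token array goes to the first.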

Inspecting the Vocabulary

/// <summary>
/// Vocabulary mapping tokens to encodings.
/// </summary>
IReadOnlyDictionary<string, int> vocabulary { get; }
INCOMPLETE.
/// <summary>
/// BERT classification token.
/// </summary>
string classificationToken { get; }

/// <summary>
/// BERT separation token.
/// </summary>
string separationToken { get; }

/// <summary>
/// BERT unknown token.
/// </summary>
string unknownToken { get; }

Tokenizing Text

/// <summary>
/// Tokenize a piece of text into its BERT tokens.
/// </summary>
/// <param name="text">Input text.</param>
/// <returns>BERT tokens.</returns>
string[] Tokenize (string text);
Refer to the Tokenizing Text section of the ITokenizer interface for more information.
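For readers unfamiliar with BERT tokens: BERT models conventionally use WordPiece tokenization, which splits each word into the longest vocabulary entries it can find, marking continuation pieces with a `##` prefix. The following is a minimal, self-contained sketch of that greedy longest-match step. It is not the NatMLX implementation, which this page does not show; `WordPieceSketch` and `SplitWord` are hypothetical names, and the vocabulary is assumed to be a plain set of strings.

```csharp
using System;
using System.Collections.Generic;

public static class WordPieceSketch {

    // Greedy longest-match-first split of a single word into WordPiece tokens.
    // ASSUMPTION: continuation pieces are stored in the vocabulary with a "##" prefix.
    public static List<string> SplitWord (string word, ISet<string> vocabulary, string unknownToken = "[UNK]") {
        var pieces = new List<string>();
        var start = 0;
        while (start < word.Length) {
            var end = word.Length;
            string match = null;
            // Shrink the candidate from the right until it is found in the vocabulary
            while (start < end) {
                var candidate = word.Substring(start, end - start);
                if (start > 0)
                    candidate = "##" + candidate;
                if (vocabulary.Contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            // No piece matched at this position: the whole word maps to the unknown token
            if (match == null)
                return new List<string> { unknownToken };
            pieces.Add(match);
            start = end;
        }
        return pieces;
    }
}
```

With a vocabulary containing `play` and `##ing`, the word `playing` splits into `play` and `##ing`; a word with no matching pieces becomes the unknown token.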

Encoding Tokens

/// <summary>
/// </summary>
/// <param name="tokens"></param>
/// <returns></returns>
int[] Encode (string[] tokens, int size = 0);
INCOMPLETE.
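Since this section is undocumented, the following sketch only illustrates what an encoding step of this shape might do; it should not be read as the behavior of `BERTTokenizer.Encode`. It assumes that each token is looked up in the vocabulary, that unknown tokens fall back to the unknown token's encoding, and that a positive `size` pads or truncates the result to that length. The padding value of `0` and the names `EncodeSketch` and `padding` are likewise assumptions.

```csharp
using System;
using System.Collections.Generic;

public static class EncodeSketch {

    // Hypothetical encoder: maps tokens to integer encodings.
    // ASSUMPTION: size > 0 means "pad or truncate to exactly size"; size == 0 means "no fixed length".
    public static int[] Encode (
        string[] tokens,
        IReadOnlyDictionary<string, int> vocabulary,
        string unknownToken = "[UNK]",
        int size = 0,
        int padding = 0
    ) {
        var length = size > 0 ? size : tokens.Length;
        var result = new int[length];
        for (var i = 0; i < length; ++i) {
            // Positions past the last token are filled with the padding value
            if (i >= tokens.Length) {
                result[i] = padding;
                continue;
            }
            // Out-of-vocabulary tokens use the unknown token's encoding
            // (throws KeyNotFoundException if the unknown token itself is missing)
            result[i] = vocabulary.TryGetValue(tokens[i], out var id) ? id : vocabulary[unknownToken];
        }
        return result;
    }
}
```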