com.microsoft - Tokenizer

Tokenizer - 1

Version

This version of the operator has been available since version 1 of domain com.microsoft.

Summary

Attributes

mark - INT (required) : Boolean flag indicating whether to mark the beginning and end of the text with the start-of-text character (0x02) and the end-of-text character (0x03), respectively.
mincharnum - INT (required) : Minimum number of characters a token must have to appear in the output. For example, if mincharnum is 2, single-character tokens such as “A” and “B” are ignored.
pad_value - STRING (required) : The string used to pad output tensors when the number of tokens extracted doesn’t match the maximum number of tokens found. If start/end markers are needed, padding will appear outside the markers.
separators - STRINGS : An optional list of strings, each a regular expression that matches a separator. Two consecutive segments in X connected by a separator are divided into two tokens. For example, if the input is “Hello World!” and this attribute contains only one space character, the corresponding output would be [“Hello”, “World!”]. To achieve character-level tokenization, set ‘separators’ to [“”], a list that contains only an empty string.
tokenexp - STRING : An optional string. The token’s regular expression in basic POSIX format (pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03). If set, the tokenizer may produce tokens matching the specified pattern. Note that one and only one of ‘tokenexp’ and ‘separators’ should be set.
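
The interaction of these attributes can be illustrated with a small pure-Python emulation. This is a minimal sketch and not the operator's implementation: the names `tokenize_sketch` and `pad_rows` are hypothetical, Python's `re` syntax stands in for the POSIX syntax named above, and the sketch assumes that the start/end markers are emitted as separate tokens with padding appended after the end marker.

```python
import re

START, END = "\x02", "\x03"  # start-of-text / end-of-text characters


def tokenize_sketch(text, separators=None, tokenexp=None, mincharnum=1, mark=False):
    """Hypothetical emulation of the attribute semantics described above."""
    # Exactly one of 'separators' and 'tokenexp' should be set.
    if (separators is None) == (tokenexp is None):
        raise ValueError("set exactly one of 'separators' and 'tokenexp'")

    if tokenexp is not None:
        # Tokens are the substrings that match the pattern.
        tokens = [m.group(0) for m in re.finditer(tokenexp, text)]
    elif "" in separators:
        # An empty separator requests character-level tokenization.
        tokens = list(text)
    else:
        # Each separator is treated as a regular expression (Python syntax here).
        pattern = "|".join(f"(?:{s})" for s in separators)
        tokens = re.split(pattern, text)

    # Drop empty pieces and tokens shorter than mincharnum.
    tokens = [t for t in tokens if t and len(t) >= mincharnum]

    # Assumption: markers are added as separate leading/trailing tokens.
    return [START, *tokens, END] if mark else tokens


def pad_rows(rows, pad_value):
    """Pad every row to the longest row; padding lands after the end marker."""
    width = max((len(r) for r in rows), default=0)
    return [r + [pad_value] * (width - len(r)) for r in rows]


words = tokenize_sketch("Hello World!", separators=[" "])
chars = tokenize_sketch("abc", separators=[""])
long_only = tokenize_sketch("A B CD", separators=[" "], mincharnum=2)
by_pattern = tokenize_sketch("ab 12 cd", tokenexp="[a-z]+")
batch = pad_rows(
    [tokenize_sketch(s, separators=[" "], mark=True) for s in ("a b c", "d")],
    pad_value="#",
)
```

Under these assumptions, `words` reproduces the “Hello World!” example from the ‘separators’ description, and `batch` shows the shorter row padded with `pad_value` after its end marker.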

Inputs

Outputs

Type Constraints

Examples