com.microsoft - Tokenizer
Tokenizer - 1
Version
name: Tokenizer
domain: com.microsoft
since_version: 1
function:
support_level: SupportType.COMMON
shape inference: True
This version of the operator has been available since version 1 of domain com.microsoft.
Summary
Attributes
mark - INT (required) : Boolean flag: whether to mark the beginning and end of the output with the start-of-text character (0x02) and the end-of-text character (0x03).
mincharnum - INT (required) : Minimum number of characters allowed in an output token. For example, if mincharnum is 2, single-character tokens such as “A” and “B” are ignored.
pad_value - STRING (required) : The string used to pad output tensors when the number of tokens extracted doesn’t match the maximum number of tokens found. If start/end markers are needed, padding will appear outside the markers.
separators - STRINGS : An optional list of strings, each a regular expression that matches a separator. Two consecutive segments in X connected by a separator are divided into two tokens. For example, if the input is “Hello World!” and this attribute contains only one space character, the corresponding output would be [“Hello”, “World!”]. To achieve character-level tokenization, set ‘separators’ to [“”], a list containing a single empty string.
tokenexp - STRING : An optional string. Token’s regular expression in basic POSIX format (pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03). If set, tokenizer may produce tokens matching the specified pattern. Note that one and only one of ‘tokenexp’ and ‘separators’ should be set.
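The way these attributes interact can be sketched in plain Python. This is only an illustration of the behavior described above, not the operator's implementation: the function names tokenize_one and tokenize are hypothetical, Python's re syntax stands in for the POSIX dialect, and the exact output layout of the real operator is assumed from the attribute descriptions.

```python
import re

START, END = "\x02", "\x03"  # start-of-text / end-of-text marker characters


def tokenize_one(text, mincharnum, separators=None, tokenexp=None):
    """Split one string into tokens (hypothetical helper, illustrative only)."""
    # Exactly one of 'separators' and 'tokenexp' should be set.
    if (separators is None) == (tokenexp is None):
        raise ValueError("set exactly one of 'separators' and 'tokenexp'")
    if tokenexp is not None:
        # Assumption: Python regex syntax approximates the operator's dialect.
        tokens = [m.group(0) for m in re.finditer(tokenexp, text)]
    elif separators == [""]:
        # A single empty separator means character-level tokenization.
        tokens = list(text)
    else:
        pattern = "|".join(f"(?:{s})" for s in separators)
        tokens = re.split(pattern, text)
    # Drop empty pieces and tokens shorter than mincharnum.
    return [t for t in tokens if t and len(t) >= mincharnum]


def tokenize(batch, mark, mincharnum, pad_value, separators=None, tokenexp=None):
    """Tokenize a list of strings and pad every row to the same length."""
    rows = [tokenize_one(s, mincharnum, separators, tokenexp) for s in batch]
    if mark:
        # Markers wrap the tokens, so padding ends up outside the markers.
        rows = [[START, *r, END] for r in rows]
    width = max((len(r) for r in rows), default=0)
    return [r + [pad_value] * (width - len(r)) for r in rows]


result = tokenize(
    ["Hello World!", "A quick brown fox"],
    mark=0, mincharnum=2, pad_value="#", separators=[" "],
)
print(result)
```

With mincharnum set to 2, the single-character token “A” is dropped from the second string, and the first row is padded with pad_value so that both rows have the same number of entries.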
Inputs
X (heterogeneous) - T:
Outputs
Y (heterogeneous) - T:
Type Constraints
T in ( tensor(string) ): Input/Output is a string tensor
Examples