com.microsoft - Tokenizer#

Tokenizer - 1#

Version

  • name: Tokenizer (GitHub)

  • domain: com.microsoft

  • since_version: 1

  • function:

  • support_level: SupportType.COMMON

  • shape inference: True

This version of the operator has been available since version 1 of domain com.microsoft.

Summary

Attributes

  • mark - INT (required) : Boolean whether to mark the beginning/end character with start of text character (0x02)/end of text character (0x03).

  • mincharnum - INT (required) : Minimum number of characters allowed in the output. For example, if mincharnum is 2, tokens such as “A” and “B” would be ignored

  • pad_value - STRING (required) : The string used to pad output tensors when the tokens extracted doesn’t match the maximum number of tokens found. If start/end markers are needed, padding will appear outside the markers.

  • separators - STRINGS : an optional list of strings attribute that contains a list of separators - regular expressions to match separators Two consecutive segments in X connected by a separator would be divided into two tokens. For example, if the input is “Hello World!” and this attribute contains only one space character, the corresponding output would be [“Hello”, “World!”]. To achieve character-level tokenization, one should set the ‘separators’ to [“”], which contains an empty string.

  • tokenexp - STRING : An optional string. Token’s regular expression in basic POSIX format (pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#ta g_09_03). If set, tokenizer may produce tokens matching the specified pattern. Note that one and only of ‘tokenexp’ and ‘separators’ should be set.

Inputs

  • X (heterogeneous) - T:

Outputs

  • Y (heterogeneous) - T:

Type Constraints

  • T in ( tensor(string) ): Input/Output is a string tensor

Examples