StringNormalizer

StringNormalizer - 10

Version

  • name: StringNormalizer (GitHub)

  • domain: main

  • since_version: 10

  • function: False

  • support_level: SupportType.COMMON

  • shape inference: True

This version of the operator has been available since version 10.

Summary

StringNormalization performs string operations for basic cleaning. This operator has only one input (denoted by X) and only one output (denoted by Y). This operator first examines the elements in the X, and removes elements specified in “stopwords” attribute. After removing stop words, the intermediate result can be further lowercased, uppercased, or just returned depending the “case_change_action” attribute. This operator only accepts [C]- and [1, C]-tensor. If all elements in X are dropped, the output will be the empty value of string tensor with shape [1] if input shape is [C] and shape [1, 1] if input shape is [1, C].

Attributes

  • case_change_action: string enum that cases output to be lowercased/uppercases/unchanged. Valid values are “LOWER”, “UPPER”, “NONE”. Default is “NONE”

  • is_case_sensitive: Boolean. Whether the identification of stop words in X is case- sensitive. Default is false

  • locale: Environment dependent string that denotes the locale according to which output strings needs to be upper/lowercased.Default en_US or platform specific equivalent as decided by the implementation.

  • stopwords: List of stop words. If not set, no word would be removed from X.

Inputs

  • X (heterogeneous) - tensor(string): UTF-8 strings to normalize

Outputs

  • Y (heterogeneous) - tensor(string): UTF-8 Normalized strings

Examples

_nostopwords_nochangecase

import numpy as np
import onnx

input = np.array(["monday", "tuesday"]).astype(object)
output = input

# No stopwords. This is a NOOP
node = onnx.helper.make_node(
    "StringNormalizer",
    inputs=["x"],
    outputs=["y"],
    is_case_sensitive=1,
)
expect(
    node,
    inputs=[input],
    outputs=[output],
    name="test_strnormalizer_nostopwords_nochangecase",
)

_monday_casesensintive_nochangecase

import numpy as np
import onnx

input = np.array(["monday", "tuesday", "wednesday", "thursday"]).astype(object)
output = np.array(["tuesday", "wednesday", "thursday"]).astype(object)
stopwords = ["monday"]

node = onnx.helper.make_node(
    "StringNormalizer",
    inputs=["x"],
    outputs=["y"],
    is_case_sensitive=1,
    stopwords=stopwords,
)
expect(
    node,
    inputs=[input],
    outputs=[output],
    name="test_strnormalizer_export_monday_casesensintive_nochangecase",
)

_monday_casesensintive_lower

import numpy as np
import onnx

input = np.array(["monday", "tuesday", "wednesday", "thursday"]).astype(object)
output = np.array(["tuesday", "wednesday", "thursday"]).astype(object)
stopwords = ["monday"]

node = onnx.helper.make_node(
    "StringNormalizer",
    inputs=["x"],
    outputs=["y"],
    case_change_action="LOWER",
    is_case_sensitive=1,
    stopwords=stopwords,
)
expect(
    node,
    inputs=[input],
    outputs=[output],
    name="test_strnormalizer_export_monday_casesensintive_lower",
)

_monday_casesensintive_upper

import numpy as np
import onnx

input = np.array(["monday", "tuesday", "wednesday", "thursday"]).astype(object)
output = np.array(["TUESDAY", "WEDNESDAY", "THURSDAY"]).astype(object)
stopwords = ["monday"]

node = onnx.helper.make_node(
    "StringNormalizer",
    inputs=["x"],
    outputs=["y"],
    case_change_action="UPPER",
    is_case_sensitive=1,
    stopwords=stopwords,
)
expect(
    node,
    inputs=[input],
    outputs=[output],
    name="test_strnormalizer_export_monday_casesensintive_upper",
)

_monday_empty_output

import numpy as np
import onnx

input = np.array(["monday", "monday"]).astype(object)
output = np.array([""]).astype(object)
stopwords = ["monday"]

node = onnx.helper.make_node(
    "StringNormalizer",
    inputs=["x"],
    outputs=["y"],
    case_change_action="UPPER",
    is_case_sensitive=1,
    stopwords=stopwords,
)
expect(
    node,
    inputs=[input],
    outputs=[output],
    name="test_strnormalizer_export_monday_empty_output",
)

_monday_insensintive_upper_twodim

import numpy as np
import onnx

input = (
    np.array(
        ["Monday", "tuesday", "wednesday", "Monday", "tuesday", "wednesday"]
    )
    .astype(object)
    .reshape([1, 6])
)

# It does upper case cecedille, accented E
# and german umlaut but fails
# with german eszett
output = (
    np.array(["TUESDAY", "WEDNESDAY", "TUESDAY", "WEDNESDAY"])
    .astype(object)
    .reshape([1, 4])
)
stopwords = ["monday"]

node = onnx.helper.make_node(
    "StringNormalizer",
    inputs=["x"],
    outputs=["y"],
    case_change_action="UPPER",
    stopwords=stopwords,
)
expect(
    node,
    inputs=[input],
    outputs=[output],
    name="test_strnormalizer_export_monday_insensintive_upper_twodim",
)