TfIdfVectorizer

TfIdfVectorizer - 9

Version

  • name: TfIdfVectorizer

  • domain: main

  • since_version: 9

  • function: False

  • support_level: SupportType.COMMON

  • shape inference: True

This version of the operator has been available since version 9.

Summary

This transform extracts n-grams from the input sequence and saves them as a vector. The input can be a 1-D tensor of tokens or a 2-D batch of rows; for a 2-D input, the i-th output row is the n-gram representation of the i-th input row. The n-grams to count are supplied by the pool_int64s or pool_strings attribute, sectioned by length via ngram_counts, and each count is written to the output coordinate given by the parallel ngram_indexes attribute. Depending on mode, the output holds raw term frequencies (“TF”) or counts scaled by the weights attribute (“IDF”, “TFIDF”).

Attributes

  • max_gram_length - INT (required) : Maximum n-gram length. If this value is 3, 3-grams will be used to generate the output.

  • max_skip_count - INT (required) : Maximum number of items (integers/strings) to be skipped when constructing an n-gram from X. If max_skip_count=1, min_gram_length=2, and max_gram_length=3, this operator may generate 2-grams with skip_count=0 and skip_count=1, and 3-grams with skip_count=0 and skip_count=1.

  • min_gram_length - INT (required) : Minimum n-gram length. If this value is 2 and max_gram_length is 3, output may contain counts of 2-grams and 3-grams.

  • mode - STRING (required) : The weighting criteria. It can be one of “TF” (term frequency), “IDF” (inverse document frequency), or “TFIDF” (the combination of TF and IDF).

  • ngram_counts - INTS (required) : The starting indexes of 1-grams, 2-grams, and so on in pool. It is useful when determining the boundary between two consecutive collections of n-grams. For example, if ngram_counts is [0, 17, 36], the first (zero-based) indexes of the 1-grams/2-grams/3-grams in pool are 0/17/36, respectively. This format is essentially identical to the CSR (or CSC) sparse matrix format and is used here due to its popularity (see the sketch after this list).

  • ngram_indexes - INTS (required) : List of int64s (type: AttributeProto::INTS). This list is parallel to the specified ‘pool_*’ attribute. The i-th element in ngram_indexes indicates the coordinate of the i-th n-gram in the output tensor.

  • pool_int64s - INTS : List of int64 n-grams learned from the training set. Either this or the pool_strings attribute must be present, but not both. It is a 1-D tensor starting with the collection of all 1-grams and ending with the collection of n-grams. The i-th element in pool stores the n-gram that should be mapped to coordinate ngram_indexes[i] in the output vector.

  • pool_strings - STRINGS : List of string n-grams learned from the training set. Either this or the pool_int64s attribute must be present, but not both. It is a 1-D tensor starting with the collection of all 1-grams and ending with the collection of n-grams. The i-th element in pool stores the n-gram that should be mapped to coordinate ngram_indexes[i] in the output vector.

  • weights - FLOATS : List of floats. This attribute stores the weight of each n-gram in pool. The i-th element in weights is the weight of the i-th n-gram in pool. Its length equals the size of ngram_indexes. By default, weights is an all-one tensor. This attribute is used when mode is “IDF” or “TFIDF” to scale the associated word counts.
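
To make the CSR-like layout concrete, here is a small illustrative sketch (not part of the operator) using the pool from the examples below: ngram_counts splits pool_int64s into its 1-gram and 2-gram sections, and the parallel ngram_indexes entries give the output coordinate of each pooled n-gram.

ngram_counts = [0, 4]                   # 1-grams start at pool[0], 2-grams at pool[4]
ngram_indexes = [0, 1, 2, 3, 4, 5, 6]   # output coordinate of each pooled n-gram
pool_int64s = [2, 3, 5, 4, 5, 6, 7, 8, 6, 7]

unigrams = pool_int64s[ngram_counts[0] : ngram_counts[1]]   # [2, 3, 5, 4]
bigram_flat = pool_int64s[ngram_counts[1] :]                # [5, 6, 7, 8, 6, 7]
bigrams = list(zip(bigram_flat[0::2], bigram_flat[1::2]))   # [(5, 6), (7, 8), (6, 7)]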

Inputs

  • X (heterogeneous) - T: Input for n-gram extraction.

Outputs

  • Y (heterogeneous) - T1: N-gram results.

Type Constraints

  • T in ( tensor(int32), tensor(int64), tensor(string) ): Input is either string UTF-8 or int32/int64

  • T1 in ( tensor(float) ): 1-D tensor of floats
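
As a sanity check, the sketch below (not taken from the test suite) builds a complete model around a TfIdfVectorizer node with onnx.helper and validates it with onnx.checker; running it with onnxruntime, if installed, reproduces the first example in the next section.

import onnx
from onnx import TensorProto, helper

node = helper.make_node(
    "TfIdfVectorizer",
    inputs=["X"],
    outputs=["Y"],
    mode="TF",
    min_gram_length=2,
    max_gram_length=2,
    max_skip_count=0,
    ngram_counts=[0, 4],
    ngram_indexes=[0, 1, 2, 3, 4, 5, 6],
    pool_int64s=[2, 3, 5, 4, 5, 6, 7, 8, 6, 7],
)
graph = helper.make_graph(
    [node],
    "tfidf_example",
    [helper.make_tensor_value_info("X", TensorProto.INT32, [12])],
    [helper.make_tensor_value_info("Y", TensorProto.FLOAT, [7])],
)
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 9)])
onnx.checker.check_model(model)

# Optional, assumes onnxruntime is installed:
# import numpy as np
# import onnxruntime as ort
# sess = ort.InferenceSession(model.SerializeToString())
# X = np.array([1, 1, 3, 3, 3, 7, 8, 6, 7, 5, 6, 8], dtype=np.int32)
# print(sess.run(None, {"X": X})[0])  # [0. 0. 0. 0. 1. 1. 1.]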

Examples
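
The snippets below are extracted from the ONNX node test suite; TfIdfVectorizerHelper and expect are utilities defined there (under onnx/backend/test/case/node), not part of the operator itself. The helper essentially packages the attributes described above into a TfIdfVectorizer node (make_node_noweights builds the node without the optional weights attribute), and expect checks the node against the listed inputs and outputs.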

_tf_only_bigrams_skip0

import numpy as np
import onnx

input = np.array([1, 1, 3, 3, 3, 7, 8, 6, 7, 5, 6, 8]).astype(np.int32)
output = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0]).astype(np.float32)

ngram_counts = np.array([0, 4]).astype(np.int64)
ngram_indexes = np.array([0, 1, 2, 3, 4, 5, 6]).astype(np.int64)
pool_int64s = np.array(
    [2, 3, 5, 4,  # unigrams
     5, 6, 7, 8, 6, 7]  # bigrams
).astype(np.int64)

helper = TfIdfVectorizerHelper(
    mode="TF",
    min_gram_length=2,
    max_gram_length=2,
    max_skip_count=0,
    ngram_counts=ngram_counts,
    ngram_indexes=ngram_indexes,
    pool_int64s=pool_int64s,
)
node = helper.make_node_noweights()
expect(
    node,
    inputs=[input],
    outputs=[output],
    name="test_tfidfvectorizer_tf_only_bigrams_skip0",
)
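
For reference, the expected output above can be reproduced with a few lines of NumPy; this cross-check assumes that skip_count=0 means the two items of a bigram are adjacent in the input.

import numpy as np

tokens = [1, 1, 3, 3, 3, 7, 8, 6, 7, 5, 6, 8]
bigram_pool = [(5, 6), (7, 8), (6, 7)]   # pool_int64s[4:], read in pairs
bigram_slots = [4, 5, 6]                 # ngram_indexes[4:]

tf = np.zeros(7, dtype=np.float32)
for bigram, slot in zip(bigram_pool, bigram_slots):
    tf[slot] = sum(
        tuple(tokens[i : i + 2]) == bigram for i in range(len(tokens) - 1)
    )
print(tf)  # [0. 0. 0. 0. 1. 1. 1.]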

_tf_batch_onlybigrams_skip0

import numpy as np
import onnx

input = np.array([[1, 1, 3, 3, 3, 7], [8, 6, 7, 5, 6, 8]]).astype(np.int32)
output = np.array(
    [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0]]
).astype(np.float32)

ngram_counts = np.array([0, 4]).astype(np.int64)
ngram_indexes = np.array([0, 1, 2, 3, 4, 5, 6]).astype(np.int64)
pool_int64s = np.array(
    [2, 3, 5, 4,  # unigrams
     5, 6, 7, 8, 6, 7]  # bigrams
).astype(np.int64)

helper = TfIdfVectorizerHelper(
    mode="TF",
    min_gram_length=2,
    max_gram_length=2,
    max_skip_count=0,
    ngram_counts=ngram_counts,
    ngram_indexes=ngram_indexes,
    pool_int64s=pool_int64s,
)
node = helper.make_node_noweights()
expect(
    node,
    inputs=[input],
    outputs=[output],
    name="test_tfidfvectorizer_tf_batch_onlybigrams_skip0",
)
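
For a 2-D input each row is vectorized independently: the first row [1, 1, 3, 3, 3, 7] contains none of the pooled bigrams, so its output row is all zeros, while the second row [8, 6, 7, 5, 6, 8] contains (6, 7) and (5, 6) once each (but never (7, 8) adjacently), giving ones at output coordinates 6 and 4.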

_tf_onlybigrams_levelempty

import numpy as np
import onnx

input = np.array([1, 1, 3, 3, 3, 7, 8, 6, 7, 5, 6, 8]).astype(np.int32)
output = np.array([1.0, 1.0, 1.0]).astype(np.float32)

ngram_counts = np.array([0, 0]).astype(np.int64)
ngram_indexes = np.array([0, 1, 2]).astype(np.int64)
pool_int64s = np.array(
    [5, 6, 7, 8, 6, 7]  # no unigrams, only bigrams
).astype(np.int64)

helper = TfIdfVectorizerHelper(
    mode="TF",
    min_gram_length=2,
    max_gram_length=2,
    max_skip_count=0,
    ngram_counts=ngram_counts,
    ngram_indexes=ngram_indexes,
    pool_int64s=pool_int64s,
)
node = helper.make_node_noweights()
expect(
    node,
    inputs=[input],
    outputs=[output],
    name="test_tfidfvectorizer_tf_onlybigrams_levelempty",
)

_tf_onlybigrams_skip5

import numpy as np
import onnx

input = np.array([1, 1, 3, 3, 3, 7, 8, 6, 7, 5, 6, 8]).astype(np.int32)
output = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 3.0, 1.0]).astype(np.float32)

ngram_counts = np.array([0, 4]).astype(np.int64)
ngram_indexes = np.array([0, 1, 2, 3, 4, 5, 6]).astype(np.int64)
pool_int64s = np.array(
    [2, 3, 5, 4,  # unigrams
     5, 6, 7, 8, 6, 7]  # bigrams
).astype(np.int64)

helper = TfIdfVectorizerHelper(
    mode="TF",
    min_gram_length=2,
    max_gram_length=2,
    max_skip_count=5,
    ngram_counts=ngram_counts,
    ngram_indexes=ngram_indexes,
    pool_int64s=pool_int64s,
)
node = helper.make_node_noweights()
expect(
    node,
    inputs=[input],
    outputs=[output],
    name="test_tfidfvectorizer_tf_onlybigrams_skip5",
)
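
Here max_skip_count=5 allows up to five tokens to be skipped between the two items of a bigram. In the input above the pair (7, 8) is then found three times, at zero-based positions (5, 6), (5, 11), and (8, 11), while (5, 6) and (6, 7) still occur once each, which yields the counts 1, 3, and 1 at output coordinates 4, 5, and 6.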

_tf_batch_onlybigrams_skip5

import numpy as np
import onnx

input = np.array([[1, 1, 3, 3, 3, 7], [8, 6, 7, 5, 6, 8]]).astype(np.int32)
output = np.array(
    [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0]]
).astype(np.float32)

ngram_counts = np.array([0, 4]).astype(np.int64)
ngram_indexes = np.array([0, 1, 2, 3, 4, 5, 6]).astype(np.int64)
pool_int64s = np.array(
    [2, 3, 5, 4,  # unigrams
     5, 6, 7, 8, 6, 7]  # bigrams
).astype(np.int64)

helper = TfIdfVectorizerHelper(
    mode="TF",
    min_gram_length=2,
    max_gram_length=2,
    max_skip_count=5,
    ngram_counts=ngram_counts,
    ngram_indexes=ngram_indexes,
    pool_int64s=pool_int64s,
)
node = helper.make_node_noweights()
expect(
    node,
    inputs=[input],
    outputs=[output],
    name="test_tfidfvectorizer_tf_batch_onlybigrams_skip5",
)

_tf_uniandbigrams_skip5

import numpy as np
import onnx

input = np.array([1, 1, 3, 3, 3, 7, 8, 6, 7, 5, 6, 8]).astype(np.int32)
output = np.array([0.0, 3.0, 1.0, 0.0, 1.0, 3.0, 1.0]).astype(np.float32)

ngram_counts = np.array([0, 4]).astype(np.int64)
ngram_indexes = np.array([0, 1, 2, 3, 4, 5, 6]).astype(np.int64)
pool_int64s = np.array(
    [2, 3, 5, 4,  # unigrams
     5, 6, 7, 8, 6, 7]  # bigrams
).astype(np.int64)

helper = TfIdfVectorizerHelper(
    mode="TF",
    min_gram_length=1,
    max_gram_length=2,
    max_skip_count=5,
    ngram_counts=ngram_counts,
    ngram_indexes=ngram_indexes,
    pool_int64s=pool_int64s,
)
node = helper.make_node_noweights()
expect(
    node,
    inputs=[input],
    outputs=[output],
    name="test_tfidfvectorizer_tf_uniandbigrams_skip5",
)

_tf_batch_uniandbigrams_skip5

import numpy as np
import onnx

input = np.array([[1, 1, 3, 3, 3, 7], [8, 6, 7, 5, 6, 8]]).astype(np.int32)
output = np.array(
    [[0.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0]]
).astype(np.float32)

ngram_counts = np.array([0, 4]).astype(np.int64)
ngram_indexes = np.array([0, 1, 2, 3, 4, 5, 6]).astype(np.int64)
pool_int64s = np.array(
    [2, 3, 5, 4,  # unigrams
     5, 6, 7, 8, 6, 7]  # bigrams
).astype(np.int64)

helper = TfIdfVectorizerHelper(
    mode="TF",
    min_gram_length=1,
    max_gram_length=2,
    max_skip_count=5,
    ngram_counts=ngram_counts,
    ngram_indexes=ngram_indexes,
    pool_int64s=pool_int64s,
)
node = helper.make_node_noweights()
expect(
    node,
    inputs=[input],
    outputs=[output],
    name="test_tfidfvectorizer_tf_batch_uniandbigrams_skip5",
)