{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# Introduction to a numpy API for ONNX: CustomClassifier\n", "\n", "This notebook shows how to write a Python classifier with functions similar to those numpy offers, and obtain a class which can be inserted into a pipeline and still be converted into ONNX."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", ""], "text/plain": [""]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": ["%load_ext mlprodict"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## A custom binary classifier\n", "\n", "Let's imagine a classifier that is not that simple to train but not that complex when it comes to predictions. It does the following:\n", "* compute the barycenters of both classes,\n", "* determine a hyperplane containing the two barycenters,\n", "* train a logistic regression on each side of this hyperplane.\n", "\n", "Some data first..."]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"data": {"image/png": "\n", "text/plain": ["
"]}, "metadata": {"needs_background": "light"}, "output_type": "display_data"}], "source": ["from sklearn.datasets import make_classification\n", "from pandas import DataFrame\n", "\n", "X, y = make_classification(200, n_classes=2, n_features=2, n_informative=2,\n", "                           n_redundant=0, n_clusters_per_class=2, hypercube=False)\n", "\n", "df = DataFrame(X)\n", "df.columns = ['X1', 'X2']\n", "df['y'] = y\n", "ax = df[df.y == 0].plot.scatter(x=\"X1\", y=\"X2\", color=\"blue\", label=\"y=0\")\n", "df[df.y == 1].plot.scatter(x=\"X1\", y=\"X2\", color=\"red\", label=\"y=1\", ax=ax);"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Split into train and test as usual."]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": ["from sklearn.model_selection import train_test_split\n", "X_train, X_test, y_train, y_test = train_test_split(X, y)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The model..."]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0,\n", "       1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0,\n", "       1, 1, 1, 0, 0, 0], dtype=int64)"]}, "execution_count": 6, "metadata": {}, "output_type": "execute_result"}], "source": ["import numpy\n", "from sklearn.base import ClassifierMixin, BaseEstimator\n", "from sklearn.linear_model import LogisticRegression\n", "\n", "class TwoLogisticRegression(ClassifierMixin, BaseEstimator):\n", "\n", "    def __init__(self):\n", "        ClassifierMixin.__init__(self)\n", "        BaseEstimator.__init__(self)\n", "\n", "    def fit(self, X, y, sample_weights=None):\n", "        if sample_weights is not None:\n", "            raise NotImplementedError(\"weighted sample not implemented in this example.\")\n", "\n", "        # Barycenters of both classes\n", "        self.weights_ = numpy.array([(y==0).sum(), (y==1).sum()])\n", "        p1 = X[y==0].sum(axis=0) / self.weights_[0]\n", "        p2 = X[y==1].sum(axis=0) / self.weights_[1]\n", "        self.centers_ = numpy.vstack([p1, p2])\n", "\n", "        # A random unit vector orthogonal to the line joining the barycenters\n", "        v = p2 - p1\n", "        v /= numpy.linalg.norm(v)\n", "        x = numpy.random.randn(X.shape[1])\n", "        x -= x.dot(v) * v\n", "        x /= numpy.linalg.norm(x)\n", "        self.hyperplan_ = x.reshape((-1, 1))\n", "\n", "        # Side of the hyperplane for every observation\n", "        sign = ((X - p1) @ self.hyperplan_ >= 0).astype(numpy.int64).ravel()\n", "\n", "        # Train one logistic regression per side\n", "        self.lr0_ = LogisticRegression().fit(X[sign == 0], y[sign == 0])\n", "        self.lr1_ = LogisticRegression().fit(X[sign == 1], y[sign == 1])\n", "\n", "        return self\n", "\n", "    def predict_proba(self, X):\n", "        sign = self.predict_side(X).reshape((-1, 1))\n", "        prob0 = self.lr0_.predict_proba(X)\n", "        prob1 = self.lr1_.predict_proba(X)\n", "        # Keep prob1 where sign == 1 and prob0 where sign == 0\n", "        prob = prob1 * sign - prob0 * (sign - 1)\n", "        return prob\n", "\n", "    def predict(self, X):\n", "        prob = self.predict_proba(X)\n", "        return prob.argmax(axis=1)\n", "\n", "    def predict_side(self, X):\n", "        return ((X - self.centers_[0]) @ self.hyperplan_ >= 0).astype(numpy.int64).ravel()\n", "\n", "\n", "model = TwoLogisticRegression()\n", "model.fit(X_train, y_train)\n", "model.predict(X_test)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Let's compare the model with a single logistic regression. It should be better: applying the same logistic regression on both sides is equivalent to a single logistic regression, and each half logistic regression fits its own side better."]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [{"data": {"text/plain": ["(0.5, 0.64)"]}, "execution_count": 7, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.metrics import accuracy_score\n", "lr = LogisticRegression().fit(X_train, y_train)\n", "accuracy_score(y_test, lr.predict(X_test)), accuracy_score(y_test, model.predict(X_test))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["However, this is true on average but not necessarily for one particular dataset. 
But that's not the point of this notebook."]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[-0.27222367, -0.16954845],\n", "       [ 0.06570281, -0.17501428]])"]}, "execution_count": 8, "metadata": {}, "output_type": "execute_result"}], "source": ["model.centers_"]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[0.01617249],\n", "       [0.99986922]])"]}, "execution_count": 9, "metadata": {}, "output_type": "execute_result"}], "source": ["model.hyperplan_"]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [{"data": {"text/plain": ["(array([[ 1.27524081, -0.16215767]]), array([[-0.25198847, -0.58704473]]))"]}, "execution_count": 10, "metadata": {}, "output_type": "execute_result"}], "source": ["model.lr0_.coef_, model.lr1_.coef_"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Let's draw the model predictions. Colored zones indicate the predicted class, and the green line indicates the hyperplane splitting the feature space in two. A different logistic regression is applied on each side."]}, {"cell_type": "code", "execution_count": 10, "metadata": {"scrolled": false}, "outputs": [{"data": {"image/png": "\n", "text/plain": ["
"]}, "metadata": {"needs_background": "light"}, "output_type": "display_data"}], "source": ["import matplotlib.pyplot as plt\n", "\n", "def draw_line(ax, v, p0, rect, N=50, label=None, color=\"black\"):\n", "    # Draws the line going through p0 with direction v, clipped to rect\n", "    x1, x2, y1, y2 = rect\n", "    v = v / numpy.linalg.norm(v) * (x2 - x1)\n", "    points = [p0 + v * ((i * 2. / N - 2) + (x1 - p0[0]) / v[0]) for i in range(0, N * 4 + 1)]\n", "    arr = numpy.vstack(points)\n", "    arr = arr[arr[:, 0] >= x1]\n", "    arr = arr[arr[:, 0] <= x2]\n", "    arr = arr[arr[:, 1] >= y1]\n", "    arr = arr[arr[:, 1] <= y2]\n", "    ax.plot(arr[:, 0], arr[:, 1], '.', label=label, color=color)\n", "\n", "def zones(ax, model, X):\n", "    # Colors the plane according to the class predicted by the model\n", "    r = (X[:, 0].min(), X[:, 0].max(), X[:, 1].min(), X[:, 1].max())\n", "    h = .02  # step size in the mesh\n", "    xx, yy = numpy.meshgrid(numpy.arange(r[0], r[1], h), numpy.arange(r[2], r[3], h))\n", "    Z = model.predict(numpy.c_[xx.ravel(), yy.ravel()])\n", "    Z = Z.reshape(xx.shape)\n", "    return ax.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)\n", "\n", "fig, ax = plt.subplots(1, 1)\n", "zones(ax, model, X)\n", "df[df.y == 0].plot.scatter(x=\"X1\", y=\"X2\", color=\"blue\", label=\"y=0\", ax=ax)\n", "df[df.y == 1].plot.scatter(x=\"X1\", y=\"X2\", color=\"red\", label=\"y=1\", ax=ax);\n", "rect = (df.X1.min(), df.X1.max(), df.X2.min(), df.X2.max())\n", "draw_line(ax, model.centers_[1] - model.centers_[0], model.centers_[0],\n", "          rect, N=100, label=\"hyperplan\", color=\"green\")\n", "ax.legend();"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Conversion to ONNX = second implementation\n", "\n", "The conversion fails as expected because there is no registered converter for this new model."]}, {"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["MissingShapeCalculator\n", "---\n", "Unable to find a shape calculator for type ''.\n", "It usually means the pipeline being converted contains a\n", "transformer or a predictor with no corresponding 
converter\n", "implemented in sklearn-onnx. If the converted is implemented\n", "in another library, you need to register\n", "the converted so that it can be used by sklearn-onnx (function\n", "update_registered_converter). If the model is not yet covered\n", "by sklearn-onnx, you may raise an issue to\n", "https://github.com/onnx/sklearn-onnx/issues\n", "to get the converter implemented or even contribute to the\n", "project. If the model is a custom model, a new converter must\n", "be implemented. Examples can be found in the gallery.\n", "\n"]}], "source": ["from skl2onnx import to_onnx\n", "one_row = X_train[:1].astype(numpy.float32)\n", "try:\n", "    to_onnx(model, one_row)\n", "except Exception as e:\n", "    print(e.__class__.__name__)\n", "    print(\"---\")\n", "    print(e)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Writing a converter means implementing the prediction methods with ONNX operators. That's very similar to learning a new mathematical language, even if this language is very close to *numpy*. Instead of having a second implementation of the predictions, why not have a single one based on ONNX? That way the conversion to ONNX would be obvious. Well, do you know ONNX operators? Not really... Then why not use numpy functions implemented with ONNX operators? Ok! But how?"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## A single implementation with ONNX operators\n", "\n", "A classifier needs two methods, `predict` and `predict_proba`, and one graph is going to produce both of them. 
The user needs to implement the function producing this graph; a decorator adds the two methods based on this graph."]}, {"cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": ["from mlprodict.npy import onnxsklearn_class\n", "from mlprodict.npy.onnx_variable import MultiOnnxVar\n", "import mlprodict.npy.numpy_onnx_impl as nxnp\n", "import mlprodict.npy.numpy_onnx_impl_skl as nxnpskl\n", "\n", "\n", "@onnxsklearn_class('onnx_graph')\n", "class TwoLogisticRegressionOnnx(ClassifierMixin, BaseEstimator):\n", "\n", "    def __init__(self):\n", "        ClassifierMixin.__init__(self)\n", "        BaseEstimator.__init__(self)\n", "\n", "    def fit(self, X, y, sample_weights=None):\n", "        if sample_weights is not None:\n", "            raise NotImplementedError(\"weighted sample not implemented in this example.\")\n", "\n", "        # Barycenters of both classes\n", "        self.weights_ = numpy.array([(y==0).sum(), (y==1).sum()])\n", "        p1 = X[y==0].sum(axis=0) / self.weights_[0]\n", "        p2 = X[y==1].sum(axis=0) / self.weights_[1]\n", "        self.centers_ = numpy.vstack([p1, p2])\n", "\n", "        # A random unit vector orthogonal to the line joining the barycenters\n", "        v = p2 - p1\n", "        v /= numpy.linalg.norm(v)\n", "        x = numpy.random.randn(X.shape[1])\n", "        x -= x.dot(v) * v\n", "        x /= numpy.linalg.norm(x)\n", "        self.hyperplan_ = x.reshape((-1, 1))\n", "\n", "        # Side of the hyperplane for every observation\n", "        sign = ((X - p1) @ self.hyperplan_ >= 0).astype(numpy.int64).ravel()\n", "\n", "        # Train one logistic regression per side\n", "        self.lr0_ = LogisticRegression().fit(X[sign == 0], y[sign == 0])\n", "        self.lr1_ = LogisticRegression().fit(X[sign == 1], y[sign == 1])\n", "\n", "        return self\n", "\n", "    def onnx_graph(self, X):\n", "        h = self.hyperplan_.astype(X.dtype)\n", "        c = self.centers_.astype(X.dtype)\n", "\n", "        sign = ((X - c[0]) @ h) >= numpy.array([0], dtype=X.dtype)\n", "        cast = sign.astype(X.dtype).reshape((-1, 1))\n", "\n", "        prob0 = nxnpskl.logistic_regression(  # pylint: disable=E1136\n", "            X, model=self.lr0_)[1]\n", "        prob1 = nxnpskl.logistic_regression(  # pylint: disable=E1136\n", "            X, model=self.lr1_)[1]\n", "        # Keep prob1 where cast == 1 and prob0 where cast == 0\n", "        prob = prob1 * cast - prob0 * (cast - numpy.array([1], dtype=X.dtype))\n", "        label = nxnp.argmax(prob, axis=1)\n", "        return MultiOnnxVar(label, prob)"]}, {"cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [{"data": {"text/plain": ["TwoLogisticRegressionOnnx()"]}, "execution_count": 14, "metadata": {}, "output_type": "execute_result"}], "source": ["model = TwoLogisticRegressionOnnx()\n", "model.fit(X_train, y_train)"]}, {"cell_type": "code", "execution_count": 14, "metadata": {"scrolled": false}, "outputs": [{"name": "stderr", "output_type": "stream", "text": ["C:\\Python395_x64\\lib\\site-packages\\xgboost\\compat.py:36: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.\n", "  from pandas import MultiIndex, Int64Index\n"]}, {"data": {"text/plain": ["array([1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0,\n", "       1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0,\n", "       1, 1, 1, 0, 0, 0], dtype=int64)"]}, "execution_count": 15, "metadata": {}, "output_type": "execute_result"}], "source": ["model.predict(X_test.astype(numpy.float32))"]}, {"cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[0.44604164, 0.55395836],\n", "       [0.5958315 , 0.40416852],\n", "       [0.41722754, 0.5827725 ],\n", "       [0.5319096 , 0.46809047],\n", "       [0.47805768, 0.5219424 ]], dtype=float32)"]}, "execution_count": 16, "metadata": {}, "output_type": "execute_result"}], "source": ["model.predict_proba(X_test.astype(numpy.float32))[:5]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["It works with double too."]}, {"cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[0.44604165, 0.55395835],\n", "       [0.5958315 , 0.4041685 ],\n", "       [0.41722751, 0.58277249],\n", "       [0.53190958, 0.46809042],\n", "       [0.47805765, 
0.52194235]])"]}, "execution_count": 17, "metadata": {}, "output_type": "execute_result"}], "source": ["model.predict_proba(X_test.astype(numpy.float64))[:5]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["And now the conversion to ONNX."]}, {"cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": ["onx = to_onnx(model, X_test[:1].astype(numpy.float32),\n", " options={id(model): {'zipmap': False}})"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Let's check the output."]}, {"cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [{"data": {"text/plain": ["{'label': array([1, 0, 1, 0, 1], dtype=int64),\n", " 'probabilities': array([[0.44604164, 0.55395836],\n", " [0.5958315 , 0.40416852],\n", " [0.41722754, 0.5827725 ],\n", " [0.5319096 , 0.46809047],\n", " [0.47805768, 0.5219424 ]], dtype=float32)}"]}, "execution_count": 19, "metadata": {}, "output_type": "execute_result"}], "source": ["from mlprodict.onnxrt import OnnxInference\n", "\n", "oinf = OnnxInference(onx)\n", "oinf.run({'X': X_test[:5].astype(numpy.float32)})"]}, {"cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.5"}}, "nbformat": 4, "nbformat_minor": 4}