CN114091674A - Model inference acceleration method and system based on CPU device - Google Patents

Model inference acceleration method and system based on CPU device

Info

Publication number
CN114091674A
CN114091674A (application CN202210061563.6A)
Authority
CN
China
Prior art keywords
operator
model
onnx
file
openvino
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210061563.6A
Other languages
Chinese (zh)
Inventor
李滨滨
兰伏锋
张涛
薛延波
赵鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Huapin Borui Network Technology Co Ltd
Original Assignee
Beijing Huapin Borui Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Huapin Borui Network Technology Co Ltd filed Critical Beijing Huapin Borui Network Technology Co Ltd
Priority to CN202210061563.6A
Publication of CN114091674A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 - Computing arrangements using knowledge-based models
    • G06N5/04 - Inference or reasoning models
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 - File systems; File servers
    • G06F16/11 - File system administration, e.g. details of archiving or snapshots
    • G06F16/116 - Details of conversion of file system types or formats
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 - Arrangements for software engineering
    • G06F8/70 - Software maintenance or management
    • G06F8/71 - Version control; Configuration management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer Security & Cryptography (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the invention discloses a model inference acceleration method based on a CPU device, comprising the following steps: converting the model into an ONNX file, and performing equivalent replacement and fusion of custom operators based on the model's feature-processing operators; defining the custom operators in OpenVINO and adding replacement methods for them; converting the ONNX file into an IR format file; implementing the computation flow of the OpenVINO operators on the CPU device and compiling it into a dynamic link library; parsing the ONNX file, generating the model configuration file required by the inference server, and packaging the model configuration file, the IR file, and the dynamic link library into the format required by the inference server; and deploying the packaged files on the inference server as a model service. An embodiment of the invention also discloses a model inference acceleration system based on a CPU device. The invention significantly improves the inference performance and throughput of the model and reduces the model's inference latency.

Description

Model inference acceleration method and system based on CPU device
Technical Field
The invention relates to the technical field of machine learning, and in particular to a model inference acceleration method and system based on a CPU (central processing unit) device.
Background
A recommendation system provides personalized recommendation results to users through a recommendation model, and the recall rate of the recommendation results is closely related to the number of features used by the model. The more features a recommendation model uses, the more parameters it has, the larger it becomes, and the longer inference takes. Because the throughput of compute-intensive models is constrained by the parallel computing capability and cache size of the CPU device, accelerating inference of the recommendation model while preserving its recall rate, so as to exploit the CPU device's performance to the fullest and meet the latency requirements of the online inference service, is one of the urgent problems the inference service needs to solve.
"OpenVINO first exploration (hands-on experience)" (https://oldpan.me/archives/openvino-first-try) briefly describes using OpenVINO to accelerate inference of an HNet human pose estimation model. The master's thesis "Research on an edge computing solution for a deep object detection model" from Beijing Jiaotong University discloses the use of TensorFlow for deep object detection, based on an object detection model (a CV model).
Disclosure of Invention
To solve the above problems, the present invention aims to provide a model inference acceleration method and system based on a CPU device, which significantly improve the inference performance of the model, increase its throughput, and reduce its inference latency.
An embodiment of the invention provides a model inference acceleration method based on a CPU device, comprising the following steps:
converting the model into an ONNX format file, and performing equivalent replacement and fusion of custom operators based on the model's feature-processing operators during the conversion to obtain ONNX operators, wherein the model may be trained under different frameworks and an ONNX operator denotes a custom operator in the ONNX format file;
defining the custom operators in OpenVINO and adding replacement methods for them to realize the conversion from ONNX operators to OpenVINO operators, wherein an OpenVINO operator denotes a custom operator in OpenVINO;
converting the ONNX format file into an IR format file, wherein the IR format file comprises an xml file defining the model topology and a bin file storing the model parameters;
implementing the computation flow of the OpenVINO operators on the CPU device and compiling it into a dynamic link library;
parsing the ONNX format file, generating the model configuration file required by the inference server, and packaging the model configuration file, the IR format file, and the dynamic link library into the format required by the inference server, wherein the model configuration file comprises the model's input and output information extracted from the ONNX format file and the path of the compiled dynamic link library;
and deploying the packaged files on the inference server as a model service, so that the model can perform online inference on the inference server through each CPU device.
As a further improvement of the present invention, defining the custom operator in OpenVINO includes defining the custom operator's name, number of inputs, number of outputs, attributes, input and output dimensions, and input and output data types.
As a further improvement of the invention, the model uses multiple feature processing methods, and after the model is converted into an ONNX format file each feature processing method is split into several basic operators;
the equivalent replacement and fusion of custom operators based on the model's feature-processing operators during the conversion to obtain the ONNX operators comprises:
using a pattern matching method: once a pattern is matched, extracting the useful attribute information from all basic operators into which the feature processing method corresponding to the pattern was split, setting the input of the first operator as the input of the custom operator, setting the output of the last operator as the output of the custom operator, and setting the extracted attributes as the attributes of the custom operator, thereby obtaining the ONNX operator.
As a further improvement of the invention, the custom operators comprise discrete-feature operators, continuous-feature operators, and embedding operators;
wherein the discrete-feature operators comprise a CategoricalPlugin operator for looking up a vocabulary and returning a one-hot encoding, a StringToHashPlugin operator for hash binning of inputs with too many categories, and a SpecStringToHashPlugin operator for handling the outlier -1;
the continuous-feature operators comprise an IntBucketizePlugin operator for binning integer features and a FloatBucketizePlugin operator for binning floating-point features;
the embedding operators comprise an EmbeddingPlugin operator, which looks up the dense information corresponding to the input in the index table, converts the sparse input into dense information, and then averages it, and a SafeEmbeddingPlugin operator for handling outliers smaller than 0.
As a further improvement of the present invention, the CategoricalPlugin operator uses a HashMap to build the vocabulary and index table into a hash table during model initialization.
As a further improvement of the invention, the AVX512 instruction set is used to perform the summation and averaging calculations in the computation flow of the embedding operators.
An embodiment of the invention also provides a model inference acceleration system based on a CPU device, comprising:
an automatic model format conversion tool, which converts the model into an ONNX format file and performs equivalent replacement and fusion of custom operators based on the model's feature-processing operators during the conversion to obtain the ONNX operators, wherein the model may be trained under different frameworks; which converts the ONNX format file into an IR format file, wherein the IR format file comprises an xml file defining the model topology and a bin file storing the model parameters; and which parses the ONNX format file, generates the model configuration file required by the inference server, and packages the model configuration file, the IR format file, and the dynamic link library into the format required by the inference server, wherein the model configuration file comprises the model's input and output information extracted from the ONNX format file and the path of the compiled dynamic link library;
a custom operator definition and implementation tool, which defines the custom operators in OpenVINO and adds replacement methods for them to realize the conversion from ONNX operators to OpenVINO operators, wherein an OpenVINO operator denotes a custom operator in OpenVINO; and which implements the computation flow of the OpenVINO operators on the CPU device and compiles it into a dynamic link library;
and an online inference module, which deploys the packaged files on the inference server as a model service, so that the model can perform online inference on the inference server through each CPU device.
As a further improvement of the present invention, defining the custom operator in OpenVINO includes defining the custom operator's name, number of inputs, number of outputs, attributes, input and output dimensions, and input and output data types.
As a further improvement of the invention, the model uses multiple feature processing methods, and after the model is converted into an ONNX format file each feature processing method is split into several basic operators;
and the automatic model format conversion tool uses a pattern matching method: once a pattern is matched, the useful attribute information is extracted from all basic operators into which the feature processing methods corresponding to the pattern were split, the input of the first operator is set as the input of the custom operator, the output of the last operator is set as the output of the custom operator, and the extracted attributes are set as the attributes of the custom operator, thereby obtaining the ONNX operator.
As a further improvement of the invention, the custom operators comprise discrete-feature operators, continuous-feature operators, and embedding operators;
wherein the discrete-feature operators comprise a CategoricalPlugin operator for looking up a vocabulary and returning a one-hot encoding, a StringToHashPlugin operator for hash binning of inputs with too many categories, and a SpecStringToHashPlugin operator for handling the outlier -1;
the continuous-feature operators comprise an IntBucketizePlugin operator for binning integer features and a FloatBucketizePlugin operator for binning floating-point features;
the embedding operators comprise an EmbeddingPlugin operator, which looks up the dense information corresponding to the input in the index table, converts the sparse input into dense information, and then averages it, and a SafeEmbeddingPlugin operator for handling outliers smaller than 0.
As a further improvement of the present invention, the CategoricalPlugin operator uses a HashMap to build the vocabulary and index table into a hash table during model initialization.
As a further improvement of the invention, the AVX512 instruction set is used to perform the summation and averaging calculations in the computation flow of the embedding operators.
Embodiments of the present invention also provide an electronic device, which includes a memory and a processor, where the memory is configured to store one or more computer instructions, and the one or more computer instructions are executed by the processor to implement the method.
Embodiments of the present invention also provide a computer-readable storage medium, on which a computer program is stored, the computer program being executed by a processor to implement the method.
The beneficial effects of the invention are as follows:
The above operations are packaged into an automatic model inference acceleration tool that optimizes the model, and the optimized model can be deployed directly on the inference server. In the deployment stage, the optimized model is much smaller, shortening deployment time; in the inference stage, the optimized model's latency drops substantially and its throughput improves significantly; and at the same throughput, the optimized model needs fewer CPU resources, greatly reducing cost.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to describe the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention; a person skilled in the art can derive other drawings from them without inventive effort.
Fig. 1 is a schematic flowchart of a model inference acceleration method based on a CPU device according to an exemplary embodiment of the present invention;
Fig. 2 is a block diagram of a model inference acceleration system based on a CPU device according to an exemplary embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that, if directional indications (such as up, down, left, right, front, and back) are involved in an embodiment of the present invention, they are only used to explain the relative positional relationship, movement, and so on of the components in a specific posture (as shown in the drawing); if the specific posture changes, the directional indication changes accordingly.
In addition, the terms used in the description of the present invention are for illustrative purposes only and are not intended to limit the scope of the invention. The terms "comprises" and/or "comprising" specify the presence of stated elements, steps, operations, and/or components, but do not preclude the presence or addition of one or more other elements, steps, operations, and/or components. The terms "first," "second," and the like may be used to describe various elements without implying order and without limiting those elements; they are only used to distinguish one element from another. Unless otherwise specified, "a plurality" means two or more. These and/or other aspects will become apparent to those of ordinary skill in the art from the following drawings and description of the embodiments of the present invention. The drawings are only for purposes of illustrating the described embodiments. One skilled in the art will readily recognize from the following description that alternative embodiments of the structures and methods illustrated in the present application may be employed without departing from the principles described herein.
As shown in Fig. 1, the model inference acceleration method based on a CPU device according to an embodiment of the invention comprises the following steps:
S1, converting the model into an ONNX format file, and performing equivalent replacement and fusion of custom operators based on the model's feature-processing operators during the conversion to obtain ONNX operators, wherein the model may be trained under different frameworks and an ONNX operator denotes a custom operator in the ONNX format file;
S2, defining the custom operators in OpenVINO and adding replacement methods for them to realize the conversion from ONNX operators to OpenVINO operators, wherein an OpenVINO operator denotes a custom operator in OpenVINO;
S3, converting the ONNX format file into an IR format file, wherein the IR format file comprises an xml file defining the model topology and a bin file storing the model parameters;
S4, implementing the computation flow of the OpenVINO operators on the CPU device and compiling it into a dynamic link library;
S5, parsing the ONNX format file, generating the model configuration file required by the inference server, and packaging the model configuration file, the IR format file, and the dynamic link library into the format required by the inference server, wherein the model configuration file comprises the model's input and output information extracted from the ONNX format file and the path of the compiled dynamic link library;
S6, deploying the packaged files on the inference server as a model service, so that the model can perform online inference on the inference server through each CPU device.
Because the throughput of compute-intensive workloads is constrained by the parallel computing capability and cache size of the CPU device, the recommendation model has high throughput requirements once its latency budget is met. To solve the problem of high machine cost caused by low throughput, computation and caching must be optimized so that the characteristics of the CPU device are fully exploited, throughput is scaled up while the inference service latency requirement is still met, and resource overhead is reduced. The method encapsulates these operations into an automatic model inference acceleration tool that optimizes the model, and the optimized model can be deployed directly on the inference server. In the deployment stage, the optimized model is much smaller, shortening deployment time; in the inference stage, the optimized model's latency drops substantially and its throughput improves significantly; and at the same throughput, the optimized model needs fewer CPU resources, greatly reducing cost. The method can fuse the network structure of the model's feature processing part, and supports OMP (Intel OpenMP) acceleration of the math library as well as CPU instruction set acceleration (such as AVX512), so that the performance of the recommendation model meets the service targets.
The method supports models trained under various frameworks and automatically converts their formats: the framework-specific format is converted into the ONNX format, the ONNX format into the IR format, and finally the IR format is built into the format required by the inference server. This format conversion pipeline comprises S1, S3, and S5. Converting models trained under different frameworks into the ONNX format unifies model files of different formats into a single format, which simplifies subsequent processing.
ONNX (Open Neural Network Exchange) is an intermediate representation (IR) format for models, used for conversion between the various deep learning training and inference frameworks. The format conversion in S1 can be done either by the framework itself or by a third-party tool. If the training framework supports exporting the model to an ONNX format file, the conversion in S1 can be completed with the method the framework provides. For example, PyTorch provides its own PyTorch-to-ONNX conversion method, torch.onnx.export(), which completes the conversion. If the training framework itself does not support converting the model into an ONNX format file, the conversion in S1 must be done with a third-party tool. For example, TensorFlow does not itself support TensorFlow-to-ONNX conversion, so a third-party tool (such as the tensorflow-onnx tool provided by the ONNX community) is required to convert a TensorFlow model into an ONNX format file. Processing based on the feature-related operators yields the custom operators in the ONNX format file, so that the optimized model can perform the corresponding feature processing through the custom operators. There is at least one custom operator; it may be a single type of operator or several operators of different types implementing different feature processing methods. Custom operators solve the problem that existing deep learning tools (such as OpenVINO) do not support converting and optimizing such models.
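To make this export step concrete, the sketch below shows the two paths just described. It is a minimal illustration, not code from the patent: the model, file names, and shapes are hypothetical stand-ins.

```python
import torch
import torch.nn as nn

# A stand-in for the trained recommendation model (hypothetical; any nn.Module works).
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
model.eval()

dummy_input = torch.randn(1, 128)  # assumption: batch of 1, 128 dense features

# PyTorch's own exporter, torch.onnx.export(), as referenced in the text.
torch.onnx.export(
    model, dummy_input, "model.onnx",
    input_names=["features"], output_names=["scores"],
    opset_version=11,
)

# TensorFlow has no built-in exporter; the ONNX community's tensorflow-onnx
# tool is used instead, e.g. from the shell:
#   python -m tf2onnx.convert --saved-model ./saved_model --output model.onnx
```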
The process of S2 can be implemented with OpenVINO's Model Optimizer, i.e., the custom operators are defined and their replacement methods are added in the Model Optimizer. A replacement method sets the input and output of the OpenVINO operator to the input and output of the ONNX operator, and sets the attribute values of the OpenVINO operator to the attribute values of the ONNX operator, thereby realizing the conversion from the ONNX operator to the OpenVINO operator. For example, if the CategoricalPlugin operator in the ONNX format file has 1 input, 1 output, and 5 attributes, then for the CategoricalPlugin operator in OpenVINO:
input = input of the ONNX CategoricalPlugin operator,
output = output of the ONNX CategoricalPlugin operator,
attribute 1 = value of attribute 1 of the ONNX CategoricalPlugin operator,
attribute 2 = value of attribute 2 of the ONNX CategoricalPlugin operator,
attribute 3 = value of attribute 3 of the ONNX CategoricalPlugin operator,
attribute 4 = value of attribute 4 of the ONNX CategoricalPlugin operator,
attribute 5 = value of attribute 5 of the ONNX CategoricalPlugin operator.
Since each ONNX operator obtained by substituting a custom operator for the feature-processing operators in S1 corresponds to specific computation logic, and the Model Optimizer of the deep learning tool (e.g., OpenVINO) does not know this logic, the interface provided by OpenVINO must be used to implement the conversion logic from the custom operator in ONNX format (the ONNX operator) to the custom operator in OpenVINO (the OpenVINO operator). For each custom operator, the attributes of the ONNX operator are extracted, and the attributes of the OpenVINO operator are then constructed from them.
When extracting attributes, the concrete value of each attribute of the ONNX operator is obtained. When constructing the attributes of the OpenVINO operator, each extracted value is assigned to the corresponding attribute of the OpenVINO operator.
For example, suppose the ONNX operator has two attributes a and b, with a = 1 and b = [0,1]. The OpenVINO operator also has the two attributes a and b, but without concrete values. During the conversion from the ONNX operator to the OpenVINO operator, the values of the two attributes of the ONNX operator are first extracted: 1 and [0,1]. These values are then assigned to the a and b attributes of the OpenVINO operator:
openvino(a) = onnx(a) = 1
openvino(b) = onnx(b) = [0,1]
This completes the construction of the OpenVINO operator attributes.
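A minimal sketch of the ONNX-side attribute extraction described above, using the onnx Python helpers (the operator name and attributes are hypothetical; in practice the OpenVINO-side construction happens inside a Model Optimizer extension written against OpenVINO's API):

```python
from onnx import helper

# A toy ONNX node for a hypothetical custom operator with attributes a and b.
node = helper.make_node(
    "CustomPlugin", inputs=["X"], outputs=["Y"],
    domain="custom",  # custom operators live outside the standard ONNX domain
    a=1, b=[0, 1],
)

# Extract the concrete value of each attribute from the ONNX node...
attrs = {attr.name: helper.get_attribute_value(attr) for attr in node.attribute}

# ...and assign them to the OpenVINO-side operator (shown here as a plain dict;
# in practice this happens inside a Model Optimizer extractor class).
openvino_attrs = {"a": attrs["a"], "b": attrs["b"]}
print(openvino_attrs)  # {'a': 1, 'b': [0, 1]}
```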
The process of S3 can be implemented with the Model Optimizer, i.e., the ONNX format file is converted into an IR format file by the Model Optimizer. IR (Intermediate Representation) is the intermediate format into which model files of other frameworks are converted for OpenVINO; that is, the Model Optimizer converts the ONNX format file into the IR format file supported by OpenVINO. The IR format file consists of two parts: an xml file defining the model topology and a bin file storing the model parameters. Compared with the trained model (i.e., the model before optimization), the converted IR format file is significantly smaller, which shortens subsequent model deployment.
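For illustration, an ONNX-to-IR conversion might be invoked as below. This is an assumption about tooling, not text from the patent: the Model Optimizer entry point and flags vary by OpenVINO version, and the extension path is a placeholder.

```python
import subprocess

# Convert the ONNX file into OpenVINO IR (model.xml + model.bin).
# Assumptions: the Model Optimizer script mo.py is available, and
# --extensions points at the extensions implementing the custom operators.
subprocess.run(
    [
        "python", "mo.py",
        "--input_model", "model.onnx",
        "--output_dir", "ir_model/",
        "--extensions", "/path/to/custom_op_extensions/",
    ],
    check=True,
)
```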
The process of S4 is completed in OpenVINO, i.e., the computation flow of the OpenVINO operators is implemented for CPU devices within OpenVINO. Since OpenVINO's CPU kernels do not include the custom operators, a CPU kernel must be implemented for each operator and compiled into a dynamic link library. The CPU kernels are implemented with the Plugin extension mechanism provided by OpenVINO, and the compilation into a dynamic link library is done with the CMake tool.
S5 generates a model configuration file according to the requirements of the inference server (such as Triton); the configuration file contains the model's input and output information extracted from the ONNX format file, the path of the compiled dynamic link library, and other configuration information. The OpenVINO IR format file, the configuration file, and the dynamic link library are then packaged into the model format required by the inference server (Triton), after which the model can be deployed and served for online inference on the inference server.
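A minimal sketch of this parsing and configuration step is shown below. The type mapping, file names, and the choice of Triton's config.pbtxt layout are assumptions for illustration; the patent only states that input/output information and the dynamic link library path are written into the configuration.

```python
import onnx

# Partial, assumed mapping from ONNX element types to Triton data type strings.
ONNX_TO_TRITON = {1: "TYPE_FP32", 6: "TYPE_INT32", 7: "TYPE_INT64", 8: "TYPE_STRING"}

def tensor_entry(value_info):
    """Render one graph input/output as a Triton config.pbtxt entry."""
    t = value_info.type.tensor_type
    # Symbolic/unknown dimensions (dim_value == 0) become -1 (variable size).
    dims = [d.dim_value if d.dim_value > 0 else -1 for d in t.shape.dim]
    return (f'  {{ name: "{value_info.name}" '
            f'data_type: {ONNX_TO_TRITON[t.elem_type]} dims: {dims} }}')

graph = onnx.load("model.onnx").graph

config = 'name: "recommendation_model"\nbackend: "openvino"\n'
config += "input [\n" + ",\n".join(tensor_entry(i) for i in graph.input) + "\n]\n"
config += "output [\n" + ",\n".join(tensor_entry(o) for o in graph.output) + "\n]\n"

with open("config.pbtxt", "w") as f:
    f.write(config)
```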
The "OpenVINO first exploration" described in the Background derives an ONNX format model from an original PyTorch model and uses an Upsample operator; it targets an HNet human pose estimation model (a CV model). Although the present method is also built on the OpenVINO tool, it performs equivalent replacement and fusion of custom operators based on the model's feature-processing operators while converting the original model into the ONNX format model, which makes it particularly suitable for inference acceleration of recommendation models. The "research on an edge computing solution for a deep object detection model" described in the Background targets an object detection model (a CV model); its model conversion directly uses the OpenVINO tool to convert the pb file of a frozen TensorFlow model into an OpenVINO IR file, and that conversion path supports CV models well. The OpenVINO tool does not support direct conversion of recommendation models; the present invention defines custom operators in the OpenVINO tool and realizes their equivalent replacement and fusion, thereby completing the model conversion, which makes it particularly suitable for inference acceleration of recommendation models. In an alternative embodiment, the custom operator is defined through an interface provided by OpenVINO, and defining the custom operator in OpenVINO includes defining its name, number of inputs, number of outputs, attributes, input and output dimensions, and input and output data types.
Since the custom operators of the present invention are not included in the operator set (opset) of the deep learning tool (e.g., OpenVINO), the interface provided by the tool must be used to define them. For each custom operator, its name, how many inputs and outputs it has, which attributes it provides, and the shape and data type of its inputs and outputs are defined; finally, the 7 defined custom operators are added to OpenVINO's opset.
The attributes of a custom operator are usually constants (i.e., fixed values).
For example, suppose a summation method needs to be customized: Y = X + number, where X is the input, Y is the output, and number is a constant that can be defined as an attribute. If "number" is set to 1, the custom operator implements Y = X + 1.
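As an illustration of what such a definition must capture, the sketch below models the listed items as a plain Python data class. This is a stand-in for OpenVINO's actual extension interface, and all names are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class CustomOpSpec:
    """The pieces the text says must be defined for each custom operator."""
    name: str
    num_inputs: int
    num_outputs: int
    attributes: dict  # attribute name -> constant value
    input_shapes: list = field(default_factory=list)
    output_shapes: list = field(default_factory=list)
    input_dtypes: list = field(default_factory=list)
    output_dtypes: list = field(default_factory=list)

# The Y = X + number example: "number" is a constant attribute fixed at 1.
add_number = CustomOpSpec(
    name="AddNumberPlugin", num_inputs=1, num_outputs=1,
    attributes={"number": 1},
    input_shapes=[["N"]], output_shapes=[["N"]],
    input_dtypes=["float32"], output_dtypes=["float32"],
)
```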
In an alternative embodiment, the model uses multiple feature processing methods, and after the model is converted into an ONNX format file each feature processing method is split into several basic operators;
the equivalent replacement and fusion of custom operators based on the model's feature-processing operators during the conversion to obtain the ONNX operators comprises:
using a pattern matching method: once a pattern is matched, extracting the useful attribute information from all basic operators into which the feature processing method corresponding to the pattern was split, setting the input of the first operator as the input of the custom operator, setting the output of the last operator as the output of the custom operator, and setting the extracted attributes as the attributes of the custom operator, thereby obtaining the ONNX operator.
A recommendation model generally uses three feature processing methods:
1. one-hot encoding of discrete features;
2. binning of continuous features;
3. embedding.
the API interfaces of the three feature processing provided by different frameworks are packaged on the basis of a plurality of low-order APIs, so that after the training model is converted into an ONNX format by using the APIs provided by TensorFlow, Keras or Pythroch, each feature processing method can be split into a plurality of basic operators, and the IO time and the Kernel starting time of a plurality of data can be increased in the reasoning process. In order to reduce IO time and Kernel starting time of data among operators to the maximum extent, in the conversion process, the method adopts a pattern matching (pattern matching) method to match head and tail operators of three feature processing methods, and operator fusion is carried out to obtain the user-defined operator in the ONNX format file. The fusion of the recommendation model feature processing correlation operators trained under different frames is completed through a pattern matching method.
The head and tail operators are the first and last operators implementing a given feature processing method. For example, suppose the feature processing method is Y = aX + b - c:
X -> Mul(a=2) -> Add(b=3) -> Sub(c=1) -> Y,
where the basic operators implementing the method are the multiplication operator Mul, the addition operator Add, and the subtraction operator Sub; the head and tail operators are Mul and Sub respectively, and the intermediate operator is Add.
During pattern matching, once a pattern is matched, the useful attribute information is extracted from the basic operators (head, tail, and intermediate operators), the input of the first (head) operator is set as the input of the custom operator, the output of the last (tail) operator is set as the output of the custom operator, and the extracted attributes are set as the attributes of the custom operator.
For example, when the pattern Mul -> Add -> Sub is matched, a feature processing method has been found, and the 3 operators are fused into one custom operator whose input is X, whose output is Y, and whose attributes are a = 2, b = 3, c = 1; the fused graph is X -> CustomOp(a=2, b=3, c=1) -> Y. A sketch of this fusion on an ONNX graph follows.
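Below is a minimal, self-contained sketch of this fusion on a toy ONNX graph (operator and attribute names such as LinearPlugin are hypothetical; a production matcher would also check that the intermediate outputs have no other consumers):

```python
from onnx import helper, numpy_helper, TensorProto

# Build a toy ONNX graph computing Y = (X * 2 + 3) - 1 as Mul -> Add -> Sub,
# with a, b, c supplied as initializers (constants), as exporters typically do.
X = helper.make_tensor_value_info("X", TensorProto.FLOAT, [1])
Y = helper.make_tensor_value_info("Y", TensorProto.FLOAT, [1])
inits = [
    helper.make_tensor("a", TensorProto.FLOAT, [], [2.0]),
    helper.make_tensor("b", TensorProto.FLOAT, [], [3.0]),
    helper.make_tensor("c", TensorProto.FLOAT, [], [1.0]),
]
nodes = [
    helper.make_node("Mul", ["X", "a"], ["t1"]),
    helper.make_node("Add", ["t1", "b"], ["t2"]),
    helper.make_node("Sub", ["t2", "c"], ["Y"]),
]
graph = helper.make_graph(nodes, "toy", [X], [Y], inits)

# Pattern matching: look for a Mul -> Add -> Sub chain and fuse it.
consts = {t.name: numpy_helper.to_array(t).item() for t in graph.initializer}
ops = list(graph.node)
for i in range(len(ops) - 2):
    head, mid, tail = ops[i], ops[i + 1], ops[i + 2]
    chained = head.output[0] == mid.input[0] and mid.output[0] == tail.input[0]
    if (head.op_type, mid.op_type, tail.op_type) == ("Mul", "Add", "Sub") and chained:
        fused = helper.make_node(
            "LinearPlugin",             # hypothetical fused custom operator
            inputs=[head.input[0]],     # input of the head operator becomes the input
            outputs=[tail.output[0]],   # output of the tail operator becomes the output
            domain="custom",
            a=consts[head.input[1]], b=consts[mid.input[1]], c=consts[tail.input[1]],
        )
        del graph.node[i:i + 3]
        graph.node.insert(i, fused)
        break

print([n.op_type for n in graph.node])  # ['LinearPlugin']
```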
In an optional embodiment, the custom operators comprise discrete-feature operators, continuous-feature operators, and embedding operators;
wherein the discrete-feature operators comprise a CategoricalPlugin operator for looking up a vocabulary and returning a one-hot encoding, a StringToHashPlugin operator for hash binning of inputs with too many categories, and a SpecStringToHashPlugin operator for handling the outlier -1;
the continuous-feature operators comprise an IntBucketizePlugin operator for binning integer features and a FloatBucketizePlugin operator for binning floating-point features;
the embedding operators comprise an EmbeddingPlugin operator for the embedding_lookup mean and a SafeEmbeddingPlugin operator for handling outliers.
The operators fused by pattern matching comprise 7 operators in 3 classes:
1. Discrete features:
1) CategoricalPlugin operator: a table-lookup operator that looks up the vocabulary (i.e., vocabulary_list) and returns the one-hot encoding (a sketch of its semantics follows this list).
The vocabulary lists the possible values of a discrete feature. For example, a person's sex is a discrete feature with only two possibilities, male and female; with 0 for male and 1 for female, the sex's vocabulary_list = [0, 1].
2) StringToHashPlugin operator: an operator that hash-bins the input when it has too many categories.
3) SpecStringToHashPlugin operator: a hash-binning operator that additionally gives special treatment to the outlier -1. The outlier for this operator is -1.
2. Continuous features:
1) IntBucketizePlugin operator: a binning operator for integer features.
2) FloatBucketizePlugin operator: a binning operator for floating-point features.
3. Embedding:
1) EmbeddingPlugin operator: a vectorization operator that looks up the dense information corresponding to the input in the index table, converts the sparse input into dense information, and then averages it (i.e., embedding_lookup mean).
2) SafeEmbeddingPlugin operator: a safe vectorization operator that gives special treatment to outliers smaller than 0. The outliers for this operator are numbers smaller than 0.
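As a minimal sketch of the CategoricalPlugin semantics listed above (plain Python/NumPy standing in for the actual OpenVINO CPU kernel; function and variable names are illustrative):

```python
import numpy as np

def categorical_plugin(values, vocabulary_list):
    """Look up each value in the vocabulary and return its one-hot encoding."""
    # Build the vocabulary into a hash table once; lookups are then O(1)
    # (this is the HashMap optimization discussed below).
    index = {v: i for i, v in enumerate(vocabulary_list)}
    one_hot = np.zeros((len(values), len(vocabulary_list)), dtype=np.float32)
    for row, v in enumerate(values):
        one_hot[row, index[v]] = 1.0
    return one_hot

# sex feature: vocabulary_list = [0, 1] (0 = male, 1 = female)
print(categorical_plugin([1, 0, 0], [0, 1]))
# [[0. 1.]
#  [1. 0.]
#  [1. 0.]]
```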
In an alternative embodiment, the CategoricalPlugin operator uses a HashMap to build the vocabulary and index table into a hash table at model initialization.
In an alternative embodiment, the AVX512 instruction set is used to perform the summation and averaging calculations of the embedding operator.
The invention optimizes the custom operators to improve operator performance, mainly for two operators:
1. For the CategoricalPlugin operator:
A HashMap is a data structure that stores key-value pairs, placing data according to the HashCode of the key so that a key's value can be located directly, which makes access very fast. Therefore a HashMap is built at model initialization to speed up the table lookups.
2. For the embedding operator:
Because the value of hidden_size (the number of hidden units) in embedding (converting discrete vectors to continuous vectors) is usually set to a multiple of 16, the summation and averaging in the embedding computation are done with the AVX512 instruction set. AVX-512 is a vector instruction set that widens the vector registers to 512 bits; each clock cycle can pack 32 double-precision or 64 single-precision floating-point operations, or eight 64-bit or sixteen 32-bit integers, roughly doubling floating-point performance and improving integer computation by about 33%. By accelerating the embedding operator with the AVX512 instruction set, the method significantly improves the inference performance of the recommendation model.
As shown in Fig. 2, the model inference acceleration system based on a CPU device according to an embodiment of the invention comprises:
an automatic model format conversion tool, which converts the model into an ONNX format file and performs equivalent replacement and fusion of custom operators based on the model's feature-processing operators during the conversion to obtain the ONNX operators, wherein the model may be trained under different frameworks; which converts the ONNX format file into an IR format file, wherein the IR format file comprises an xml file defining the model topology and a bin file storing the model parameters; and which parses the ONNX format file, generates the model configuration file required by the inference server, and packages the model configuration file, the IR format file, and the dynamic link library into the format required by the inference server, wherein the model configuration file comprises the model's input and output information extracted from the ONNX format file and the path of the compiled dynamic link library;
a custom operator definition and implementation tool, which defines the custom operators in OpenVINO and adds replacement methods for them to realize the conversion from ONNX operators to OpenVINO operators, wherein an OpenVINO operator denotes a custom operator in OpenVINO; and which implements the computation flow of the OpenVINO operators on the CPU device and compiles it into a dynamic link library;
and an online inference module, which deploys the packaged files on the inference server as a model service, so that the model can perform online inference on the inference server through each CPU device.
The system of the invention mainly comprises the automatic model format conversion tool and the custom operator definition and implementation tool.
The custom operator definition and implementation tool handles operator definition and operator implementation and compilation, specifically:
1. Operator definition:
OpenVINO's opset does not contain the 7 custom operators of the present invention, so they must be defined through the interface provided by OpenVINO. For each operator, its name, how many inputs and outputs it has, which attributes it provides, and the shape and data type of its inputs and outputs are defined; finally, the 7 defined operators are added to OpenVINO's opset.
OpenVINO's Model Optimizer tool has no implementation for converting the 7 ONNX operators added during operator fusion into the OpenVINO operators defined above, so the interface provided by OpenVINO must be used to implement the conversion logic from the ONNX operators to the OpenVINO operators. For each operator, the attributes of the ONNX operator are extracted, and the attributes of the OpenVINO operator are then constructed from them.
2. Operator implementation and compilation:
OpenVINO's CPU kernels do not contain the 7 custom operators of the invention, so a CPU kernel must be implemented for each operator and compiled into a dynamic link library.
The automatic model format conversion tool converts recommendation models trained under different frameworks into OpenVINO IR and then builds the IR into the format required by Triton, specifically:
1. from the trained model to the ONNX model format, completing the equivalent replacement and fusion of the feature-processing operators into custom operators;
2. from the ONNX model format to the OpenVINO IR model format;
3. from the OpenVINO IR model format to the Triton model format.
In an alternative embodiment, the custom operator is defined through an interface provided by OpenVINO, and defining the custom operator in OpenVINO includes defining its name, number of inputs, number of outputs, attributes, input and output dimensions, and input and output data types.
In an alternative embodiment, the model uses multiple feature processing methods, and after the model is converted into an ONNX format file each feature processing method is split into several basic operators;
and the automatic model format conversion tool uses a pattern matching method: once a pattern is matched, the useful attribute information is extracted from all basic operators into which the feature processing methods corresponding to the pattern were split, the input of the first operator is set as the input of the custom operator, the output of the last operator is set as the output of the custom operator, and the extracted attributes are set as the attributes of the custom operator, thereby obtaining the ONNX operator.
In an optional embodiment, the custom operators comprise discrete-feature operators, continuous-feature operators, and embedding operators;
wherein the discrete-feature operators comprise a CategoricalPlugin operator for looking up a vocabulary and returning a one-hot encoding, a StringToHashPlugin operator for hash binning of inputs with too many categories, and a SpecStringToHashPlugin operator for handling the outlier -1;
the continuous-feature operators comprise an IntBucketizePlugin operator for binning integer features and a FloatBucketizePlugin operator for binning floating-point features;
the embedding operators comprise an EmbeddingPlugin operator, which looks up the dense information corresponding to the input in the index table, converts the sparse input into dense information, and then averages it, and a SafeEmbeddingPlugin operator for handling outliers smaller than 0.
In an alternative embodiment, the CategoricalPlugin operator uses a HashMap to build the vocabulary and index table into a hash table at model initialization.
In an alternative embodiment, the AVX512 instruction set is used to perform the summation and averaging calculations of the embedding operator.
The disclosure also relates to an electronic device, including a server, a terminal, and the like. The electronic device comprises: at least one processor; a memory communicatively coupled to the at least one processor; and a communication component communicatively coupled to a storage medium, the communication component receiving and transmitting data under control of the processor; wherein the memory stores instructions executable by the at least one processor to implement the method of the above embodiments.
In an alternative embodiment, the memory is used as a non-volatile computer-readable storage medium for storing non-volatile software programs, non-volatile computer-executable programs, and modules. The processor executes various functional applications of the device and data processing, i.e., implements the method, by executing nonvolatile software programs, instructions, and modules stored in the memory.
The memory may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store a list of options, etc. Further, the memory may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and such remote memory may be connected to the external device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
One or more modules are stored in the memory and, when executed by the one or more processors, perform the methods of any of the method embodiments described above.
This product can execute the method provided by the embodiments of the present application and has the corresponding functional modules and beneficial effects. For technical details not described in detail in this embodiment, refer to the method provided by the embodiments of the present application.
The present disclosure also relates to a computer-readable storage medium for storing a computer-readable program for causing a computer to perform some or all of the above-described method embodiments.
That is, those skilled in the art will understand that all or part of the steps of the methods in the above embodiments may be implemented by a program instructing related hardware, the program being stored in a storage medium and including several instructions that cause a device (which may be a microcontroller, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Furthermore, those of ordinary skill in the art will appreciate that although some embodiments described herein include some features that are included in other embodiments and not others, combinations of features of different embodiments are within the scope of the invention and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
It will be understood by those skilled in the art that while the present invention has been described with reference to exemplary embodiments, various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the invention without departing from the essential scope thereof. Therefore, it is intended that the invention not be limited to the particular embodiment disclosed, but that the invention will include all embodiments falling within the scope of the appended claims.

Claims (14)

1. A model inference acceleration method based on a CPU device, characterized by comprising the following steps:
converting the model into an ONNX format file, and performing equivalent replacement and fusion of custom operators based on the model's feature-processing operators during the conversion to obtain ONNX operators, wherein the model may be trained under different frameworks and an ONNX operator denotes a custom operator in the ONNX format file;
defining the custom operators in OpenVINO and adding replacement methods for them to realize the conversion from ONNX operators to OpenVINO operators, wherein an OpenVINO operator denotes a custom operator in OpenVINO;
converting the ONNX format file into an IR format file, wherein the IR format file comprises an xml file defining the model topology and a bin file storing the model parameters;
implementing the computation flow of the OpenVINO operators on the CPU device and compiling it into a dynamic link library;
parsing the ONNX format file, generating the model configuration file required by the inference server, and packaging the model configuration file, the IR format file, and the dynamic link library into the format required by the inference server, wherein the model configuration file comprises the model's input and output information extracted from the ONNX format file and the path of the compiled dynamic link library;
and deploying the packaged files on the inference server as a model service, so that the model can perform online inference on the inference server through each CPU device.
2. The method of claim 1, wherein defining the custom operator in OpenVINO comprises defining the custom operator's name, number of inputs, number of outputs, attributes, input and output dimensions, and input and output data types.
3. The method of claim 1, wherein the model uses multiple feature processing methods, each feature processing method being split into several basic operators after the model is converted into an ONNX format file,
and wherein the equivalent replacement and fusion of custom operators based on the model's feature-processing operators during the conversion to obtain the ONNX operators comprises:
using a pattern matching method: once a pattern is matched, extracting the useful attribute information from all basic operators into which the feature processing method corresponding to the pattern was split, setting the input of the first operator as the input of the custom operator, setting the output of the last operator as the output of the custom operator, and setting the extracted attributes as the attributes of the custom operator, thereby obtaining the ONNX operator.
4. The method of claim 1, wherein the custom operators comprise discrete-feature operators, continuous-feature operators, and embedding operators;
wherein the discrete-feature operators comprise a CategoricalPlugin operator for looking up a vocabulary and returning a one-hot encoding, a StringToHashPlugin operator for hash binning of inputs with too many categories, and a SpecStringToHashPlugin operator for handling the outlier -1;
the continuous-feature operators comprise an IntBucketizePlugin operator for binning integer features and a FloatBucketizePlugin operator for binning floating-point features;
and the embedding operators comprise an EmbeddingPlugin operator, which looks up the dense information corresponding to the input in the index table, converts the sparse input into dense information, and then averages it, and a SafeEmbeddingPlugin operator for handling outliers smaller than 0.
5. The method of claim 4, wherein the CategoricalPlugin operator uses a HashMap to build the vocabulary and index table into a hash table at model initialization.
6. The method of claim 4, wherein the summation and averaging calculations of the embedding operator are done using the AVX512 instruction set.
7. A model inference acceleration system based on a CPU device, the system comprising:
the automatic model format conversion tool is used for converting the model into an ONNX format file and, during the conversion, performing equivalent replacement and fusion into the custom operators based on the feature-processing-related operators of the model to obtain the ONNX operators, wherein the model is a model trained under different frameworks; the tool is further used for converting the ONNX format file into an IR format file, wherein the IR format file comprises an xml file defining the model topology and a bin file storing the model parameters; the tool is further used for parsing the ONNX format file, generating the model configuration file required by the inference server, and packaging the model configuration file, the IR format file and the dynamic link library into the format required by the inference server, wherein the model configuration file comprises the input and output information of the model extracted from the ONNX format file and the path of the compiled dynamic link library (an illustrative packaged layout follows this claim);
the custom operator definition and implementation tool is used for defining the custom operator in OpenVINO and adding a replacement method for the custom operator, so as to implement the conversion from the ONNX operator to the OpenVINO operator, wherein the OpenVINO operator denotes the custom operator within OpenVINO; the tool is further used for implementing the computation flow of the OpenVINO operator on the CPU device so as to compile the dynamic link library; and
the online reasoning module is used for deploying the packaged files as a model service on the inference server, so as to perform online inference of the model on the inference server through each CPU device.
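To make the packaging step concrete: assuming a Triton-style inference server (the claim does not name one), the packaged files might be laid out as a model repository with a generated config.pbtxt. Every name, dimension, and field value below is hypothetical, and the server-specific parameter that records the dynamic-link-library path is deliberately omitted:

    model_repository/
      ctr_model/
        config.pbtxt          # generated from the parsed ONNX inputs/outputs
        libcustom_ops.so      # compiled dynamic link library of custom operators
        1/
          model.xml           # IR topology
          model.bin           # IR weights

    # config.pbtxt (hypothetical values)
    name: "ctr_model"
    backend: "openvino"
    max_batch_size: 256
    input  [ { name: "feature_ids", data_type: TYPE_INT64, dims: [ 128 ] } ]
    output [ { name: "prob",        data_type: TYPE_FP32,  dims: [ 1 ] } ]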
8. The system of claim 7, wherein defining the custom operator in OpenVINO comprises defining the name, the number of inputs, the number of outputs, the attributes, the dimensions of the inputs and outputs, and the data types of the inputs and outputs of the custom operator.
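As an illustration of such a definition, here is a minimal sketch of a custom operator class in OpenVINO's 2.0-style C++ API. The operator name, the single vocab_size attribute, and the inferred output shape are assumptions for illustration:

    #include <cstdint>
    #include <memory>
    #include <openvino/op/op.hpp>

    class CategoricalPlugin : public ov::op::Op {
     public:
      OPENVINO_OP("CategoricalPlugin", "custom_opset");  // name + opset

      CategoricalPlugin() = default;
      CategoricalPlugin(const ov::Output<ov::Node>& input, int64_t vocab_size)
          : Op({input}), vocab_size_(vocab_size) {
        constructor_validate_and_infer_types();
      }

      // Dimensions and data types of the inputs and outputs; assumes the
      // input rank is static.
      void validate_and_infer_types() override {
        const auto& in_shape = get_input_partial_shape(0);
        set_output_type(0, ov::element::f32,
                        ov::PartialShape{in_shape[0], vocab_size_});
      }

      std::shared_ptr<ov::Node> clone_with_new_inputs(
          const ov::OutputVector& new_args) const override {
        return std::make_shared<CategoricalPlugin>(new_args.at(0), vocab_size_);
      }

      // Attributes of the operator.
      bool visit_attributes(ov::AttributeVisitor& visitor) override {
        visitor.on_attribute("vocab_size", vocab_size_);
        return true;
      }

     private:
      int64_t vocab_size_ = 0;
    };

In current OpenVINO releases such a class can be registered via ov::Core::add_extension with an ov::OpExtension, and an evaluate() override (omitted above) would carry the CPU computation that is compiled into the dynamic link library; both details are sketched here as assumptions, not as the patent's implementation.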
9. The system of claim 7, wherein the model employs a plurality of feature processing methods during feature processing, and each feature processing method is split into a plurality of basic operators after the model is converted into the ONNX format file,
and the automatic model format conversion tool adopts a pattern matching method: after a pattern is matched, it extracts the useful attribute information from all the basic operators into which the feature processing method corresponding to that pattern was split, sets the input of the first operator as the input of the custom operator, sets the output of the tail operator as the output of the custom operator, and sets the extracted attributes as the attributes of the custom operator, thereby obtaining the ONNX operator.
10. The system of claim 7, wherein the custom operators comprise discrete feature operators, continuous feature operators, and embedding operators;
wherein the discrete feature operators comprise a CategoricalPlugin operator for looking up a vocabulary and returning a one-hot encoding, a StringToHashPlugin operator for hash bucketing of inputs with a large number of categories, and a SpecStringToHashPlugin operator for handling the outlier value -1;
the continuous feature operators comprise an IntBucketizePlugin operator for binning integer features and a FloatBucketizePlugin operator for binning floating-point features;
the embedding operators comprise an EmbeddingPlugin operator, which looks up the dense representation corresponding to the input in an index table, converts the sparse input into dense information, and averages it, and a SafeEmbeddingPlugin operator for handling abnormal values less than 0.
11. The system of claim 10, wherein, at model initialization, the CategoricalPlugin operator uses a HashMap to build the vocabulary and the index table into a hash table.
12. The system of claim 10, wherein the summation and averaging in the embedding operator computation are performed using the AVX512 instruction set.
13. An electronic device comprising a memory and a processor, wherein the memory is configured to store one or more computer instructions, and the one or more computer instructions are executed by the processor to implement the method of any one of claims 1-6.
14. A computer-readable storage medium having a computer program stored thereon, the computer program being executable by a processor to implement the method of any one of claims 1-6.

Priority Applications (1)

Application Number: CN202210061563.6A; Priority/Filing Date: 2022-01-19; Title: Model reasoning acceleration method and system based on CPU equipment

Publications (1)

Publication Number: CN114091674A; Publication Date: 2022-02-25

Family ID: 80308649


Citations (5)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
US20210182036A1 * | 2019-12-12 | 2021-06-17 | Huawei Technologies Co., Ltd. | Hardware platform specific operator fusion in machine learning
CN111753948A * | 2020-06-23 | 2020-10-09 | Spreadtrum Communications (Shanghai) Co., Ltd. | Model processing method and related equipment
CN112947935A * | 2021-02-26 | 2021-06-11 | Shanghai SenseTime Intelligent Technology Co., Ltd. | Operation method and device, electronic device and storage medium
CN113947181A * | 2021-09-22 | 2022-01-18 | Zhejiang Lab | Neural network accelerator model conversion method and device
CN113934410A * | 2021-10-19 | 2022-01-14 | Beihang University | Multi-hardware-target deep model optimization and deployment framework supporting custom operators


Non-Patent Citations (1)

OLDPAN: "A First Look at OpenVINO (Hands-on Experience)" ("Openvino初探(实际体验)"), OLDPAN's personal blog *


Legal Events

Date Code Title Description

PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 2022-02-25)