CN114997401A - Adaptive inference acceleration method, apparatus, computer device and storage medium - Google Patents

Adaptive inference acceleration method, apparatus, computer device and storage medium

Info

Publication number
CN114997401A
Authority
CN
China
Prior art keywords
reasoning
inference
network model
acceleration
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210928200.8A
Other languages
Chinese (zh)
Other versions
CN114997401B (en)
Inventor
占望鹏
司超
袁易之
朱新宇
肖博文
王潇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202210928200.8A
Publication of CN114997401A
Application granted
Publication of CN114997401B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 - Computing arrangements using knowledge-based models
    • G06N 5/04 - Inference or reasoning models
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Abstract

The application relates to an adaptive inference acceleration method, apparatus, computer device, storage medium, and computer program product. The method can be applied to the fields of artificial intelligence and intelligent transportation, and comprises the following steps: determining an inference result of a first network model; performing inference acceleration on a second network model on each inference acceleration framework to obtain inference time and inference results; sequentially determining, based on the inference results, the error of the second network model when it performs inference on each inference acceleration framework; taking the inference acceleration frameworks whose errors satisfy an error condition as candidate inference acceleration frameworks, and selecting a target inference acceleration framework from the candidates based on the inference time of the second network model on the candidate frameworks and a target performance index value; and converting the second network model into a target network model corresponding to the target inference acceleration framework. With this method, a network model with improved inference speed can be obtained quickly, which in turn greatly reduces deployment time.

Description

Adaptive inference acceleration method, apparatus, computer device, and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a computer device, a storage medium, and a computer program product for adaptive inference acceleration.
Background
After training of a network model is completed, the trained network model is usually deployed on a corresponding service platform so that it can execute the corresponding service when a client initiates a service request. When the network model is deployed, in order to enable it to complete the service quickly, a specialist usually performs customized acceleration on the network model before deployment. However, having a specialist customize and accelerate the network model during deployment greatly increases the deployment time.
Disclosure of Invention
Therefore, it is necessary to provide an adaptive inference acceleration method, apparatus, computer device, computer-readable storage medium, and computer program product that can quickly produce a network model with improved inference speed, thereby greatly reducing deployment time.
In a first aspect, the present application provides an adaptive inference acceleration method. The method comprises the following steps:
determining an inference result of the first network model;
performing inference acceleration on the second network model on each inference acceleration framework to obtain inference time and an inference result, the second network model being obtained by performing model format conversion on the first network model;
sequentially determining, based on the inference result of the first network model and each inference result of the second network model, the error of the second network model when it performs inference on each inference acceleration framework;
taking the inference acceleration frameworks corresponding to errors that satisfy an error condition as candidate inference acceleration frameworks, and selecting a target inference acceleration framework from the candidate inference acceleration frameworks based on the inference time of the second network model on the candidate frameworks and a target performance index value;
and converting the second network model into a target network model corresponding to the target inference acceleration framework.
In a second aspect, the application further provides an adaptive reasoning acceleration device. The device comprises:
the first determination module is used for determining an inference result of the first network model;
the second determining module is used for carrying out reasoning acceleration on the second network model on each reasoning acceleration framework to obtain reasoning time consumption and a reasoning result; the second network model is obtained by performing model format conversion on the first network model;
a third determining module, configured to sequentially determine, based on the inference result of the first network model and each inference result of the second network model, an error that the second network model infers on each inference acceleration framework;
the selection module is used for taking the inference acceleration frameworks corresponding to errors that satisfy an error condition as candidate inference acceleration frameworks, and selecting a target inference acceleration framework from the candidate frameworks based on the inference time of the second network model on the candidate frameworks and the target performance index value;
and the conversion module is used for converting the second network model into a target network model corresponding to the target reasoning acceleration framework.
In one embodiment, the first determining module is further configured to execute, by the data processor, a first network model, and determine a CPU inference result of the executed first network model on an original training framework; and running the first network model through a graphic processor, and determining a GPU inference result of the running first network model on the original training frame.
In one embodiment, the second determining module is further configured to create a corresponding inference session object based on each inference acceleration framework; loading the second network models to the corresponding reasoning session objects respectively; initializing an inference session object for loading the second network model; and carrying out reasoning acceleration based on the initialized reasoning conversation object to obtain a reasoning result and reasoning time consumption.
In one embodiment, the second determining module is further configured to perform data inference on the data to be processed based on the initialized inference session object, to obtain an inference result and inference time; wherein the second network model is accelerated during the data inference process, and the acceleration processing includes at least one of model quantization, removing inference-independent operations in the second network model, or fusing network layers in the second network model.
In one embodiment, the second determining module is further configured to, when the initialized inference session object runs on the data processor, perform data inference on the data to be processed through the running inference session object to obtain a CPU inference result and CPU inference time; and, when the initialized inference session object runs on the graphics processor, perform data inference on the data to be processed through the running inference session object to obtain a GPU inference result and GPU inference time.
In one embodiment, the selecting module is further configured to determine, based on the inference time consumption and the target performance index value of the second network model in the candidate inference acceleration framework, the number of machines required by the second network model for inference on the candidate inference acceleration framework; and selecting the frames with the machine number meeting the preset conditions from the candidate reasoning acceleration frames as target reasoning acceleration frames.
In one embodiment, the selecting module is further configured to obtain a time-consuming adjustment parameter and a target performance index value; determining a machine number influence factor based on the time consumption adjusting parameter and the inference time consumption of the second network model in the candidate inference acceleration framework; and determining the number of machines required by the second network model when the second network model carries out reasoning on the candidate reasoning acceleration framework according to the machine number influence factor and the target performance index value.
In one embodiment thereof, the apparatus further comprises:
the deployment module is used for responding to model deployment operation and deploying the target network model to a business service platform;
a configuration module, configured to respond to a configuration operation, configure the required machines for the target network model according to the number of machines; wherein the machine comprises a data processor or a graphics processor.
In one embodiment thereof, the apparatus further comprises:
the receiving module is used for receiving an image recognition request when the deployed target network model is an image recognition model;
the calling module is used for calling an image recognition model deployed on the business service platform according to the image recognition request;
and the identification module is used for carrying out image identification on the image to be identified corresponding to the image identification request through the image identification model to obtain an identification result.
In one embodiment, the selecting module is further configured to obtain a machine resource value; determining a total machine resource value based on the machine resource value and the number of machines; and selecting a frame with the total machine resource value meeting a preset condition from the candidate reasoning acceleration frames as a target reasoning acceleration frame.
In one embodiment, the inference result of the first network model comprises at least two inference values, and the inference result of the second network model on each inference acceleration framework comprises at least two inference values;
the third determining module is configured to sequentially determine a difference between at least two inference values of the first network model and at least two inference values of the second network model on each inference acceleration frame, so as to obtain a difference set corresponding to each inference acceleration frame; and taking the maximum difference value in each difference value set or the average difference value of each difference value set as the inference error of the second network model on each inference acceleration framework.
In one embodiment thereof, the apparatus further comprises:
the building module is used for respectively taking each network layer in the first network model as a node; constructing a graph network based on each of the nodes;
and the generation module is used for generating the second network model according to the model parameters of each network layer in the graph network and the first network model.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor implementing the following steps when executing the computer program:
determining an inference result of the first network model;
performing inference acceleration on the second network model on each inference acceleration framework to obtain inference time and an inference result, the second network model being obtained by performing model format conversion on the first network model;
sequentially determining, based on the inference result of the first network model and each inference result of the second network model, the error of the second network model when it performs inference on each inference acceleration framework;
taking the inference acceleration frameworks corresponding to errors that satisfy an error condition as candidate inference acceleration frameworks, and selecting a target inference acceleration framework from the candidate inference acceleration frameworks based on the inference time of the second network model on the candidate frameworks and a target performance index value;
and converting the second network model into a target network model corresponding to the target inference acceleration framework.
In a fourth aspect, the present application further provides a computer-readable storage medium. The computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:
determining an inference result of the first network model;
performing inference acceleration on the second network model on each inference acceleration framework to obtain inference time and an inference result, the second network model being obtained by performing model format conversion on the first network model;
sequentially determining, based on the inference result of the first network model and each inference result of the second network model, the error of the second network model when it performs inference on each inference acceleration framework;
taking the inference acceleration frameworks corresponding to errors that satisfy an error condition as candidate inference acceleration frameworks, and selecting a target inference acceleration framework from the candidate inference acceleration frameworks based on the inference time of the second network model on the candidate frameworks and a target performance index value;
and converting the second network model into a target network model corresponding to the target inference acceleration framework.
In a fifth aspect, the present application further provides a computer program product. The computer program product comprising a computer program which when executed by a processor performs the steps of:
determining an inference result of the first network model;
performing inference acceleration on the second network model on each inference acceleration framework to obtain inference time and an inference result, the second network model being obtained by performing model format conversion on the first network model;
sequentially determining, based on the inference result of the first network model and each inference result of the second network model, the error of the second network model when it performs inference on each inference acceleration framework;
taking the inference acceleration frameworks corresponding to errors that satisfy an error condition as candidate inference acceleration frameworks, and selecting a target inference acceleration framework from the candidate inference acceleration frameworks based on the inference time of the second network model on the candidate frameworks and a target performance index value;
and converting the second network model into a target network model corresponding to the target inference acceleration framework.
According to the adaptive inference acceleration method, apparatus, computer device, storage medium, and computer program product, the inference result of the first network model is determined, and the second network model is then automatically accelerated on each inference acceleration framework to obtain inference time and inference results. Based on the inference result of the first network model and each inference result of the second network model, the error of the second network model on each inference acceleration framework is determined in turn, and the frameworks whose errors meet the precision requirement are selected as candidate inference acceleration frameworks, so that suitable candidate frameworks can be selected quickly without specialists. In addition, a target inference acceleration framework that meets the inference-time and performance-index requirements is selected from the candidate frameworks, and the second network model is converted into a target network model corresponding to the target framework, so that the target network model meets those requirements during inference. Deploying this target network model not only reduces deployment time but also allows the deployed network model to complete the inference process quickly.
Drawings
FIG. 1 is a diagram of an application environment of the adaptive inference acceleration method in one embodiment;
FIG. 2 is a flow diagram of an adaptive inference acceleration method in one embodiment;
FIG. 3 is a diagram of CPU accelerated evaluation in one embodiment;
FIG. 4 is a diagram illustrating accelerated GPU evaluation in one embodiment;
FIG. 5 is a flow diagram that illustrates the construction of inference session objects for inference acceleration, in accordance with an embodiment;
FIG. 6 is a flow chart illustrating an adaptive inference acceleration method according to another embodiment;
FIG. 7 is a block diagram showing the structure of an adaptive inference accelerator in one embodiment;
FIG. 8 is a block diagram showing the construction of an adaptive inference accelerator in another embodiment;
FIG. 9 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Before describing embodiments of the present application, the technology related to the present application is briefly described as follows:
artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, automatic driving, intelligent transportation, and the like.
The adaptive inference acceleration method provided by the embodiments of the present application can be applied to an application environment as shown in FIG. 1, in which the terminal 102 communicates with a first server 104 and a second server 106 through a network. A data storage system may store the data that the first server 104 and the second server 106 need to process; it may be integrated on the first server 104 and the second server 106, respectively, or it may be located on the cloud or on another network server.
The first server 104 (or the second server 106) determines the inference result of the first network model; performs inference acceleration on the second network model on each inference acceleration framework to obtain inference time and inference results, the second network model being obtained by performing model format conversion on the first network model; sequentially determines, based on the inference result of the first network model and each inference result of the second network model, the error of the second network model when it performs inference on each inference acceleration framework; selects a target inference acceleration framework from the candidate inference acceleration frameworks based on the inference time of the second network model on the candidate frameworks and a target performance index value; converts the second network model into a target network model corresponding to the target inference acceleration framework; and then deploys the target network model. Accordingly, when the terminal 102 sends an image recognition request to the second server 106 (or the first server 104) on which the target network model is deployed, the target network model can be used for image recognition to obtain the final recognition result.
The terminal 102 may be a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, an internet of things device, and a portable wearable device, and the internet of things device may be a smart speaker, a smart television, a smart air conditioner, and a smart vehicle-mounted device. The portable wearable device can be a smart watch, a smart bracelet, a head-mounted device, and the like.
The first server 104 and the second server 106 may be independent physical servers, or may be service nodes in a blockchain system; the service nodes in the blockchain system form a peer-to-peer (P2P) network, and the P2P protocol is an application-layer protocol running on top of the Transmission Control Protocol (TCP).
In addition, the first server 104 and the second server 106 may be a server cluster formed by a plurality of physical servers, or may be cloud servers providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, web services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDNs), and big data and artificial intelligence platforms.
The terminal 102, the first server 104, and the second server 106 may be connected through communication connection manners such as bluetooth, USB (Universal Serial Bus), or a network, which is not limited herein.
In one embodiment, an adaptive inference acceleration method is provided. The method can be applied to the first server 104 or the second server 106 in FIG. 1; it is described below by taking its application to the first server 104 (hereinafter referred to as the server) in FIG. 1 as an example. As shown in FIG. 2, the method includes the following steps:
s202, determining an inference result of the first network model.
The first network model may be a neural network model trained on an original training framework, such as a model trained on the PyTorch or TensorFlow framework, and may be a deep learning model. In practical applications, the first network model may be an image recognition model, an image classification model, a video recommendation model, a video potential evaluation model, or the like.
The inference result of the first network model may refer to a result of the first network model in inferring the input data to be processed.
In one embodiment, the server may determine the inference time of the first network model in addition to the inference result of the first network model. The inference time of the first network model may refer to a time consumed by the first network model in inferring the input to-be-processed data.
For example, the first network model performs classification processing on the input image to be classified, so that the result of image classification (e.g., whether a cat or a dog is classified) can be obtained, and the time consumed in the process of classification processing can also be obtained.
In one embodiment, the first network model may be run on the data processor or the graphics processor to perform the inference process, as follows: the server runs the first network model on the data processor, determines the CPU inference result of the running first network model on the original training framework, and may also determine the CPU inference time of the first network model on the original training framework. The server runs the first network model on the graphics processor, determines the GPU inference result of the running first network model on the original training framework, and may also determine the GPU inference time of the first network model on the original training framework.
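A minimal sketch of this baseline measurement, assuming a hypothetical PyTorch model and input; the ResNet-18 model, input shape, and timing approach are illustrative assumptions, not values specified by the application:

```python
# Minimal sketch: timing a trained model ("first network model") on the CPU
# and, if available, on the GPU of the original training framework.
# The ResNet-18 model and the 1x3x224x224 input are illustrative assumptions.
import time
import torch
import torchvision.models as models

model = models.resnet18(weights=None).eval()   # hypothetical first network model
data = torch.randn(1, 3, 224, 224)             # hypothetical data to be processed

def run_once(model, data, device):
    model, data = model.to(device), data.to(device)
    with torch.no_grad():
        start = time.perf_counter()
        result = model(data)
        if device == "cuda":
            torch.cuda.synchronize()           # wait for GPU kernels to finish
        elapsed_ms = (time.perf_counter() - start) * 1000
    return result.cpu(), elapsed_ms

cpu_result, cpu_t1 = run_once(model, data, "cpu")        # CPU inference result / CPU_T1
if torch.cuda.is_available():
    gpu_result, gpu_t1 = run_once(model, data, "cuda")   # GPU inference result / GPU_T1
```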
The data processor may be a central processing unit (CPU), the final execution unit for information processing and program operation. The graphics processing unit (GPU) may be a microprocessor that performs image- and graphics-related operations.
For example, as shown in FIG. 3, the server runs the deep learning model in the original training format (referred to as the original-training-format model for short) on the CPU and performs inference with it, thereby obtaining the CPU inference time (denoted CPU_T1) and the CPU inference result of the original-training-format model on the PyTorch or TensorFlow training framework. In addition, as shown in FIG. 4, the server runs the original-training-format model on the GPU and performs inference with it, thereby obtaining the GPU inference time (denoted GPU_T1) and the GPU inference result of the original-training-format model on the PyTorch or TensorFlow training framework.
And S204, carrying out reasoning acceleration on the second network model on each reasoning acceleration framework to obtain reasoning time consumption and a reasoning result.
The second network model may be a network model in an intermediate representation format used for conversion between various deep learning training and inference frameworks, and is obtained by performing model format conversion on the first network model. The intermediate representation format may refer to a model format supported by the mainstream inference acceleration frameworks. For example, in an actual service, model training may be performed with the two original training frameworks PyTorch and TensorFlow, and a network model in the ONNX (Open Neural Network Exchange) format (abbreviated as ONNX model) is then exported. Because the mainstream inference acceleration frameworks support the ONNX format, the network models from the original training frameworks can be converted into the unified ONNX format for storage after training, and finally converted during deployment into a model format supported by the target device, for example: PyTorch → ONNX → TensorRT, or PyTorch → ONNX → OpenVINO.
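A sketch of the format conversion described above, assuming a hypothetical PyTorch model; the file name, input shape, and opset version are illustrative assumptions rather than values specified by the application:

```python
# Sketch: converting the original-training-format model (PyTorch) into the
# unified ONNX intermediate format ("second network model").
import torch
import torchvision.models as models

model = models.resnet18(weights=None).eval()    # hypothetical first network model
dummy_input = torch.randn(1, 3, 224, 224)       # hypothetical input shape

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",              # intermediate-format file used by the frameworks below
    input_names=["input"],
    output_names=["output"],
    opset_version=13,
)
```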
The inference time of the second network model may refer to a time consumed by the second network model in performing inference acceleration on the input to-be-processed data. The inference result of the second network model may refer to a result obtained by the second network model in performing inference acceleration on the input data to be processed.
The reasoning acceleration may refer to acceleration processing of the second network model during the reasoning process, so that the model increases the reasoning speed and completes the business service more quickly. For example, the image recognition model performs image recognition on an input image to be recognized, and performs acceleration processing on the image recognition model in the recognition process, such as removing operations unrelated to reasoning, so that the model completes classification tasks quickly.
The inference acceleration framework may refer to a framework or an acceleration library that can accelerate the second network model so as to increase the model's inference speed; after the second network model has been accelerated by an inference acceleration framework, the speed of executing the business service after deployment can be increased many times over. In practical applications, the inference acceleration framework may be the ONNXRuntime framework, the OpenVINO framework, the TensorRT framework, or a self-developed acceleration library.
The self-developed acceleration library may include a self-developed CPU acceleration library and a self-developed GPU acceleration library. For operators that are not supported by the CPU inference of open-source acceleration frameworks, or operators with low speed such as image scaling operators, customized operator acceleration and optimization can be performed in C++, and the self-developed CPU acceleration library provides good compatibility and acceleration. Likewise, for operators that are not supported by the GPU inference of open-source acceleration frameworks, or operators with low speed such as image scaling operators, customized GPU operator acceleration and optimization is performed in C++, and the self-developed GPU acceleration library provides good compatibility and acceleration. The self-developed GPU acceleration library also supports hybrid-precision quantization and automatic precision correction for the overflow and precision loss that can occur with half-precision floating-point (fp16) quantization: it automatically detects overflow in each fp16 layer of the model, automatically locates the overflowing layers, automatically changes the quantization precision of those layers to fp32, and automatically repairs the final calculation results of the model.
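The self-developed acceleration library itself is not described in code in this application; the following sketch only illustrates, on assumed NumPy activations, the general idea of detecting fp16 overflow per layer and falling back to fp32 for the overflowing layers:

```python
# Illustrative sketch only: per-layer fp16 overflow detection with automatic
# fallback to fp32, applied to hypothetical per-layer activation values.
import numpy as np

FP16_MAX = float(np.finfo(np.float16).max)   # about 65504

def choose_layer_precision(activations: np.ndarray) -> str:
    """Keep a layer in fp32 if its fp32 activations would overflow fp16."""
    if np.max(np.abs(activations)) >= FP16_MAX:
        return "fp32"
    return "fp16"

# Hypothetical activations captured for two layers during a calibration run.
layer_outputs = {
    "conv1": np.random.randn(1, 64, 112, 112) * 10.0,
    "fc":    np.random.randn(1, 1000) * 1e5,     # large values that overflow fp16
}
precisions = {name: choose_layer_precision(act) for name, act in layer_outputs.items()}
print(precisions)   # e.g. {'conv1': 'fp16', 'fc': 'fp32'}
```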
In an embodiment, S204 may specifically include: when the second network model is evaluated for acceleration on the data processor, the server first runs the second network model on the data processor and then performs inference acceleration of the running second network model on each inference acceleration framework, obtaining the CPU inference time and the CPU inference result of the second network model on each inference acceleration framework. In addition, when the second network model is evaluated for acceleration on the graphics processor, the server runs the second network model on the graphics processor and then performs inference acceleration of the running second network model on each inference acceleration framework, obtaining the GPU inference time and the GPU inference result of the second network model on each inference acceleration framework.
For example, after the deep learning model in the original training format is converted into the ONNX model, the server runs the ONNX model on the CPU, the data to be processed (such as an image to be recognized) is input into the ONNX model, and the ONNX model performs inference acceleration on the ONNXRuntime framework, yielding the CPU inference time (denoted CPU_T2) and the CPU inference result of the ONNX model on the ONNXRuntime framework. Similarly, the CPU inference time (denoted CPU_T3) and CPU inference result of the ONNX model on the OpenVINO framework, and the CPU inference time (denoted CPU_T4) and CPU inference result of the ONNX model on the self-developed CPU acceleration library, can be obtained, as shown in FIG. 3.
For another example, after the deep learning model in the original training format is converted into the ONNX model, the server may also run the ONNX model on the GPU, and the ONNX model performs inference acceleration on the ONNXRuntime framework, yielding the GPU inference time (denoted GPU_T2) and the GPU inference result of the ONNX model on the ONNXRuntime framework. Similarly, the GPU inference time (denoted GPU_T3) and GPU inference result of the ONNX model on the TensorRT framework can be obtained. In addition, the ONNX model is quantized to half-precision floating point (fp16) on the ONNXRuntime framework, yielding the GPU inference time (denoted GPU_T4) and GPU inference result of the fp16-quantized ONNX model on the ONNXRuntime framework; the ONNX model is quantized to fp16 on the TensorRT framework, yielding the GPU inference time (denoted GPU_T5) and GPU inference result of the fp16-quantized ONNX model on the TensorRT framework; and the ONNX model is quantized to integer precision (such as INT8) on the TensorRT framework, yielding the GPU inference time (denoted GPU_T6) and GPU inference result of the INT8-quantized ONNX model on the TensorRT framework. Finally, the inference time (denoted GPU_T7) and GPU inference result of the ONNX model on the self-developed GPU acceleration library can also be obtained, as shown in FIG. 4.
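A sketch of how such per-framework measurements might be taken with ONNX Runtime, assuming a converted file named model.onnx and an illustrative input shape; the CUDA execution provider branch stands in for the GPU evaluation (CPU_T2 / GPU_T2):

```python
# Sketch: measuring inference time and results of the ONNX model with
# ONNX Runtime on the CPU and, when the CUDA execution provider is
# available, on the GPU. File name and input data are assumptions.
import time
import numpy as np
import onnxruntime as ort

data = np.random.randn(1, 3, 224, 224).astype(np.float32)   # hypothetical input

def time_session(providers):
    session = ort.InferenceSession("model.onnx", providers=providers)
    input_name = session.get_inputs()[0].name
    start = time.perf_counter()
    result = session.run(None, {input_name: data})
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms

cpu_result, cpu_t2 = time_session(["CPUExecutionProvider"])
if "CUDAExecutionProvider" in ort.get_available_providers():
    gpu_result, gpu_t2 = time_session(["CUDAExecutionProvider"])
```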
And S206, sequentially determining the inference error of the second network model on each inference acceleration frame based on the inference result of the first network model and each inference result of the second network model.
The error can be called as the precision error of the model and is used for measuring the deviation degree of the second network model from a normal value when the second network model is inferred on each inference acceleration framework. It is noted that the normal value is referenced to the inference result of the first network model, i.e. the normal value is equal to the inference result of the first network model.
In one embodiment, the inference result of the first network model may include one inference value, and the inference result of the second network model on each inference acceleration framework may include one inference value. In this case, the server may sequentially determine the inference error of the second network model on each inference acceleration framework directly from the inference value of the first network model and each inference value of the second network model. For example, if the data to be processed is an image to be recognized, the CPU inference value of the deep learning model in the original training format is 0.9, and the CPU inference values of the ONNX model on the ONNXRuntime framework, the OpenVINO framework, and the self-developed CPU acceleration library are 0.85, 0.82, and 0.84, respectively, then the errors of the ONNX model on the ONNXRuntime framework, the OpenVINO framework, and the self-developed CPU acceleration library are 0.05, 0.08, and 0.06, respectively. Here the error of the ONNX model on the ONNXRuntime framework is the smallest.
In one embodiment, the inference result of the first network model comprises at least two inference values, and the inference result of the second network model on each inference acceleration frame comprises at least two inference values; the server sequentially determines the difference between at least two inference values of the first network model and at least two inference values of the second network model on each inference acceleration frame to obtain a difference set corresponding to each inference acceleration frame; and taking the maximum difference value in each difference value set or the average difference value of each difference value set as the inference error of the second network model on each inference acceleration frame.
For example, if the data to be processed is 4 images to be recognized, the CPU inference values of the deep learning model in the original training format are (0.9, 0.8, 0.6, 0.7), and the CPU inference values of the ONNX model on the ONNXRuntime framework, the OpenVINO framework, and the self-developed CPU acceleration library are (0.85, 0.73, 0.6, 0.66), (0.9, 0.76, 0.58, 0.7), and (0.85, 0.8, 0.6, 0.69), respectively, then the average errors of the ONNX model on the ONNXRuntime framework, the OpenVINO framework, and the self-developed CPU acceleration library are 0.04, 0.015, and 0.015, respectively, and the maximum errors are 0.07, 0.04, and 0.05, respectively. Here, the average error of the ONNX model when inferring on the ONNXRuntime framework is 0.04, and the maximum error is 0.07.
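A sketch of this error computation using the example values above; the maximum difference or the average difference per framework is taken as the inference error:

```python
# Sketch: computing the difference set, average error and maximum error of
# the second network model on each inference acceleration framework.
import numpy as np

reference = np.array([0.9, 0.8, 0.6, 0.7])               # first network model
framework_results = {
    "onnxruntime": np.array([0.85, 0.73, 0.60, 0.66]),
    "openvino":    np.array([0.90, 0.76, 0.58, 0.70]),
    "self_dev":    np.array([0.85, 0.80, 0.60, 0.69]),
}

errors = {}
for name, values in framework_results.items():
    diff = np.abs(reference - values)                     # difference set
    errors[name] = {"avg": float(diff.mean()), "max": float(diff.max())}

print(errors)   # onnxruntime: avg 0.04, max 0.07; openvino: avg 0.015, max 0.04; ...
```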
And S208, taking the inference acceleration frame corresponding to the error meeting the error condition as a candidate inference acceleration frame, and selecting a target inference acceleration frame from the candidate inference acceleration frame based on the inference time consumption and the target performance index value of the second network model in the candidate inference acceleration frame.
The target performance index value may be used to measure the second network model's capacity to process traffic within a specified time; for example, the second network model may process N images to be recognized within 1 minute, where N is a positive integer greater than 1. In practical applications, the target performance index value may include queries per second (QPS).
In one embodiment, the step in S208 of "taking the inference acceleration framework corresponding to an error that satisfies the error condition as a candidate inference acceleration framework" may specifically include: the server sorts the obtained errors from small to large and selects the top-ranked inference acceleration frameworks as candidate inference acceleration frameworks, for example the frameworks ranked in the top 3; or the server obtains a preset error interval and takes the inference acceleration frameworks whose errors fall within the error interval as candidate inference acceleration frameworks.
In an embodiment, the step in S208 of "selecting a target inference acceleration framework from the candidate inference acceleration frameworks based on the inference time and the target performance index value of the second network model on the candidate inference acceleration frameworks" may specifically include: the server determines, based on the inference time and the target performance index value of the second network model on each candidate inference acceleration framework, the number of machines required by the second network model to perform inference on that candidate framework; and selects, from the candidate inference acceleration frameworks, the framework whose number of machines satisfies a preset condition as the target inference acceleration framework.
Wherein, the number of machines refers to the number of machines required by the second network model to reason on the candidate inference acceleration framework, and the machines can be data processors or graphic processors.
The number of machines is calculated as follows. The server obtains a time-consumption adjustment parameter and the target performance index value, determines a machine-number influence factor based on the time-consumption adjustment parameter and the inference time of the second network model on the candidate inference acceleration framework, and determines, from the machine-number influence factor and the target performance index value, the number of machines required by the second network model to perform inference on the candidate inference acceleration framework. Specifically, the number of machines may be calculated with reference to the following formula:
P = QPS × (T / 1000)
where P is the number of machines required by the second network model to perform inference on the candidate inference acceleration framework, QPS is the target performance index value, T is the inference time of the second network model on the candidate inference acceleration framework (in milliseconds), 1000 is the time-consumption adjustment parameter, and T/1000 is the machine-number influence factor.
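A sketch of this calculation; rounding the result up to a whole number of machines is an assumption added for illustration and is not stated in the formula above:

```python
# Sketch of P = QPS x (T / 1000): T is the inference time in milliseconds,
# 1000 is the time-consumption adjustment parameter, QPS is the target
# performance index value.
import math

def machines_needed(target_qps: float, inference_time_ms: float) -> int:
    influence_factor = inference_time_ms / 1000.0       # machine-number influence factor
    return math.ceil(target_qps * influence_factor)     # whole machines required

print(machines_needed(500, 20))   # a 20 ms model at 500 QPS needs 10 machines
```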
After the number of machines required by the second network model to perform inference on each candidate inference acceleration framework is calculated, the target inference acceleration framework can be selected according to the number of machines; that is, the framework whose number of machines satisfies a preset condition is selected from the candidate inference acceleration frameworks as the target inference acceleration framework, for example the candidate inference acceleration framework requiring the smallest number of machines.
In addition, the target inference acceleration framework can be selected according to machine cost, as follows: the server obtains a machine resource value, determines a total machine resource value based on the machine resource value and the number of machines, and selects, from the candidate inference acceleration frameworks, the framework whose total machine resource value satisfies a preset condition as the target inference acceleration framework.
Where a machine resource value may be a numerical value used to measure the value of an individual machine.
For example, machine cost calculation is performed according to the number of machines and the unit price of the machine, and the machine cost = the number of machines × the unit price of the machine, thereby deriving a frame with optimal cost as a target inference acceleration frame.
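A sketch of this cost-based selection using machine cost = number of machines × machine unit price; the candidate frameworks, timings, and unit prices are made-up example values, and machines_needed refers to the sketch above:

```python
# Sketch: choosing the target inference acceleration framework by total
# machine cost. All numbers below are illustrative assumptions.
candidate_frameworks = {
    # framework: (inference_time_ms, machine_unit_price)
    "onnxruntime_cpu": (40.0, 1.0),
    "openvino_cpu":    (25.0, 1.0),
    "tensorrt_gpu":    (5.0,  8.0),
}
target_qps = 500

def total_cost(inference_time_ms, unit_price):
    # machines_needed() is the helper from the previous sketch
    return machines_needed(target_qps, inference_time_ms) * unit_price

best = min(candidate_frameworks, key=lambda f: total_cost(*candidate_frameworks[f]))
print(best)   # the candidate framework with the lowest total machine cost
```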
It should be noted that, since the second network model can run on both kinds of hardware, the data processor and the graphics processor, the type of hardware to run on (the data processor or the graphics processor) can also be selected after the target inference acceleration framework is selected, so that a corresponding number of data processors or graphics processors is configured for the model when it is deployed. For example, after the server selects the target inference acceleration framework based on the inference time and the target performance index value of the second network model on the candidate inference acceleration frameworks, if the second network model ran on the data processor when it was accelerated on the target inference acceleration framework, the data processor may be selected; by analogy, the graphics processor may also be selected.
And S210, converting the second network model into a target network model corresponding to the target reasoning acceleration framework.
In one embodiment, the server performs format conversion on the second network model by using the model converter, and converts the second network model into a target network model corresponding to the target inference acceleration framework. For example, if the number of required machines is the minimum or the cost of the machine is the minimum when the ONNX model is inferred on the OpenVINO framework, the ONNX model is converted into a target network model corresponding to the OpenVINO framework.
After the second network model is converted into the target network model, the target network model may be deployed to a corresponding service platform, and the deployment process is as follows: the server responds to the model deployment operation and deploys the target network model on the business service platform; in response to the configuration operation, the required machines are configured for the target network model in terms of the number of machines, such as configuring a number of data processors or graphics processors.
After the target network model completes deployment, when receiving a corresponding service request, the deployed target network model may be used to execute a service, specifically: and the server receives the service request, and calls the deployed target network model to process the data to be processed corresponding to the service request, so as to complete the service. When the deployed target network model is an image recognition model, receiving an image recognition request; calling an image recognition model deployed on a business service platform according to the image recognition request; and carrying out image recognition on the image to be recognized corresponding to the image recognition request through the image recognition model to obtain a recognition result. It should be noted that the image recognition process may be performed according to the configured CPU or GPU. Therefore, reasoning acceleration is carried out before deployment, and then the model is converted into a model under a target reasoning acceleration framework and supported by a platform, so that the processing speed of the model can be effectively increased.
In this embodiment, the inference result of the first network model is determined, and the second network model is then automatically accelerated on each inference acceleration framework to obtain inference time and inference results. Based on the inference result of the first network model and each inference result of the second network model, the error of the second network model on each inference acceleration framework is determined in turn, and the frameworks whose errors meet the precision requirement are selected as candidate inference acceleration frameworks, so that suitable candidate frameworks can be selected quickly without specialists. In addition, a target inference acceleration framework that meets the inference-time and performance-index requirements is selected from the candidate frameworks, and the second network model is converted into a target network model corresponding to the target framework, so that the target network model meets those requirements during inference. Deploying this target network model not only reduces deployment time but also allows the deployed network model to complete the inference process quickly.
In an embodiment, as shown in fig. 5, S204 may specifically include:
and S502, creating a corresponding reasoning conversation object based on each reasoning acceleration framework.
Different reasoning acceleration frameworks can create different reasoning conversation objects, namely, each reasoning acceleration framework is created with a corresponding reasoning conversation object.
S504, the second network models are respectively loaded to the corresponding reasoning conversation objects, and the reasoning conversation objects loaded with the second network models are initialized.
S506, reasoning acceleration is carried out based on the initialized reasoning conversation object, and a reasoning result and reasoning time consumption are obtained.
And each inference conversation object has a corresponding inference result and inference time consumption on a corresponding inference acceleration frame, namely the second network model has a corresponding inference result and inference time consumption when the inference acceleration is performed on each inference acceleration frame.
The initialized inference session object can perform inference acceleration on the data processor, or it can perform inference acceleration on the graphics processor.
For example, when the second network model is an ONNX model, the inference acceleration process of the ONNX model can be described in the following three stages:
phase 1, an inference session (InferenceSession) object is created.
When creating the inference session object, the server may create it based on the ONNXRuntime framework. For example, when the Session object is constructed at the Python front end, the Python side may call the InferenceSession class constructor in C++ through the onnxruntime_pybind_state binding, thereby completing the creation of the InferenceSession object.
And 2, loading and initializing the model.
After completing the creation of the inference session object, the server loads the ONNX model into the InferenceSession object and then initializes it. For example, when the model needs to be loaded, the server may call the corresponding Load() function at the C++ back end, use it to load the ONNX model into the InferenceSession object, and then initialize the ONNX model.
And 3, running the model.
When the ONNX model needs to be run, the InferenceSession object reads in one piece of data to be processed at a time and then performs inference acceleration to obtain the final output of the model.
In an embodiment, the step of performing inference acceleration based on the initialized inference session object to obtain an inference result and inference time may specifically include: the server performs data inference on the data to be processed based on the initialized inference session object to obtain an inference result and inference time; wherein, in the data reasoning process, the second network model is accelerated; the acceleration process includes at least one of removing inference-independent operations in the second network model or fusing network layers in the second network model.
The operation unrelated to inference can include Dropout operation, for example, Dropout operation can not be executed in the process of inference, so that the inference process can be accelerated.
Fusing the network layers in the second network model may include: and fusing at least two of the convolutional layer (Conv), the parameter normalization layer (BN) and the activation layer (Relu) in the second network model, such as fusing the convolutional layer and the activation layer in the second network model, or fusing the convolutional layer, the parameter normalization layer and the activation layer in the second network model. In the network reasoning stage, because the convolution layer, the parameter normalization layer and the activation layer are in linear operation, the convolution layer, the parameter normalization layer and the activation layer can be fused (namely merged), and the computation of the BN layer is reduced after the convolution layer, the parameter normalization layer and the activation layer are merged, so that the network reasoning can be accelerated. The analytical procedure for model fusion was as follows:
Calculation of the convolutional layer:

y_conv = W · x + b

Calculation of the parameter normalization layer:

y_bn = γ · (y_conv − μ) / √(σ² + ε) + β

where μ and σ² are the running mean and variance of the parameter normalization layer, γ and β are its scale and shift parameters, and ε is a small constant that prevents division by zero.

After merging the parameter normalization layer into the convolutional layer, the calculation formula of the convolutional layer is:

y_bn = γ · (W · x + b − μ) / √(σ² + ε) + β

Let

W_merged = γ · W / √(σ² + ε)

b_merged = γ · (b − μ) / √(σ² + ε) + β

Then the calculation formula of the convolutional layer is:

y = W_merged · x + b_merged

It can be seen that merging the parameter normalization layer into the convolutional layer only changes the weight and bias of the convolutional layer, but it saves the computation of one parameter normalization layer, which helps speed up inference.
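A sketch of this fusion, assuming a PyTorch Conv2d followed by BatchNorm2d in inference (eval) mode; only the convolution's weight and bias change, matching the formulas above:

```python
# Sketch: folding a parameter normalization (BatchNorm) layer into the
# preceding convolutional layer in inference mode.
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)       # gamma / sqrt(var + eps)
    fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
    conv_bias = conv.bias.data if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.data = (conv_bias - bn.running_mean) * scale + bn.bias
    return fused

conv = nn.Conv2d(3, 8, 3, padding=1).eval()
bn = nn.BatchNorm2d(8).eval()
with torch.no_grad():
    bn.running_mean.uniform_(-1.0, 1.0)      # give BN non-trivial statistics
    bn.running_var.uniform_(0.5, 2.0)
    x = torch.randn(1, 3, 32, 32)
    # fused convolution reproduces Conv + BN within numerical tolerance
    assert torch.allclose(bn(conv(x)), fuse_conv_bn(conv, bn)(x), atol=1e-5)
```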
In one embodiment, S506 may specifically include: when the initialized inference session object runs on the data processor, the server performs data inference on the data to be processed through the running inference session object to obtain the CPU inference result and CPU inference time; and when the initialized inference session object runs on the graphics processor, it performs data inference on the data to be processed through the running inference session object to obtain the GPU inference result and GPU inference time.
In the above embodiment, the corresponding inference session object is created based on each inference acceleration framework, the second network model is loaded to the corresponding inference session object, the inference session object loaded with the second network model is initialized, and inference acceleration is performed based on the initialized inference session object, so that the second network model can quickly complete the inference process, and the model deployment time is effectively reduced.
In order to more clearly understand the technical solution of the present application, the following is specifically described herein with reference to fig. 6:
(1) Model training is performed with the PyTorch training framework to obtain a model in the original training format.
In addition, the original-training-format model can also be obtained by performing model training with the TensorFlow training framework.
(2) The original-training-format model is saved with its model parameters and network structure.
(3) The original-training-format model is automatically converted into an inference model in the unified ONNX format (ONNX model for short), which facilitates subsequent inference acceleration.
(4) Adaptive acceleration evaluation of the original-training-format model and the ONNX model is performed on the CPU and the GPU, and the inference time and inference results on the CPU and GPU are output.
As shown in fig. 3, the accelerated evaluation of the original training format model and the ONNX model on the CPU is as follows:
a) Original-training-format model: CPU inference time (denoted CPU_T1) and CPU inference result of the original-training-format model on the PyTorch or TensorFlow training framework;
b) CPU acceleration scheme of the ONNXRuntime framework: the ONNX model performs CPU inference acceleration on the ONNXRuntime framework, yielding the inference time (denoted CPU_T2) and the CPU inference result;
c) CPU acceleration scheme of the OpenVINO framework: the ONNX model performs CPU inference acceleration on the OpenVINO framework, yielding the inference time (denoted CPU_T3) and the CPU inference result;
d) Acceleration scheme of the self-developed CPU acceleration library: for operators that are not supported by the CPU inference of open-source acceleration frameworks, or operators with low speed such as image scaling operators, customized CPU operator acceleration and optimization is performed in C++, and the self-developed CPU acceleration library provides good compatibility and acceleration; in this way, the inference time (denoted CPU_T4) and the CPU inference result corresponding to the self-developed CPU acceleration library can be obtained.
As shown in fig. 4, the acceleration evaluation of the original training-format model and the ONNX model on the GPU is as follows:
a) Original training-format model: the GPU inference time consumption (denoted GPU_T1) and GPU inference result of the original training-format model on the PyTorch or TensorFlow framework;
b) GPU acceleration scheme of the onnxruntime framework: the GPU inference time consumption (denoted GPU_T2) and GPU inference result of the ONNX model on the onnxruntime framework (see the sketch after this list);
c) GPU acceleration scheme of the TensorRT framework: the GPU inference time consumption (denoted GPU_T3) and GPU inference result of the ONNX model on the TensorRT framework;
d) GPU fp16 quantization acceleration scheme of the onnxruntime framework: the GPU fp16 quantized inference time consumption (denoted GPU_T4) and GPU inference result of the ONNX model on the onnxruntime framework;
e) GPU fp16 quantization acceleration scheme of the TensorRT framework: the GPU fp16 quantized inference time consumption (denoted GPU_T5) and GPU inference result of the ONNX model on the TensorRT framework;
f) GPU INT8 quantization acceleration scheme of the TensorRT framework: the GPU INT8 quantized inference time consumption (denoted GPU_T6) and GPU inference result of the ONNX model on the TensorRT framework;
g) Acceleration scheme of the self-developed GPU acceleration library:
for operators that are not supported by GPU inference in the open-source acceleration frameworks, or that run slowly there, such as image scaling operators, custom GPU operator acceleration and optimization is implemented in C++, and the self-developed GPU acceleration library provides good compatibility and acceleration for them, giving the inference time consumption (GPU_T7) and the GPU inference result corresponding to the self-developed GPU acceleration library.
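A minimal sketch of the onnxruntime GPU schemes in b) and d): the ONNX model is run through the CUDAExecutionProvider, optionally after an fp16 conversion, and timed as in the CPU sketch above. The file names and shapes are illustrative assumptions, and onnxruntime-gpu plus onnxconverter-common are assumed to be installed:

```python
import numpy as np
import onnx
import onnxruntime as ort
from onnxconverter_common import float16

# Optional fp16 quantization of the ONNX model (roughly the GPU_T4 scheme)
model = onnx.load("model.onnx")
model_fp16 = float16.convert_float_to_float16(model, keep_io_types=True)  # keep fp32 inputs/outputs
onnx.save(model_fp16, "model_fp16.onnx")

session = ort.InferenceSession(
    "model_fp16.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],  # falls back to CPU if no CUDA device
)
x = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {"input": x})  # GPU inference result; time repeated runs to obtain GPU_T2/GPU_T4
```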
(5) Model accuracy consistency is evaluated.
Step (4) yields 11 sets of inference time consumption and inference results in total: 4 on the CPU and 7 on the GPU. To ensure the correctness of the service results, model accuracy evaluation is performed on the 9 acceleration schemes among these 11 (the original training-format model running on the CPU and on the GPU serve as baselines), so as to select the schemes whose model accuracy meets the requirement.
Model accuracy evaluation takes the original training-format model as the reference and evaluates the accuracy of the other schemes against it. The specific steps are as follows:
If the inference result of the original training-format model is (0.1, 0.2, 0.3, 0.4) and the inference result of the CPU acceleration scheme of the onnxruntime framework is (0.09, 0.21, 0.3, 0.4), the element-wise absolute errors are (0.01, 0.01, 0, 0), so the average error is 0.005 and the maximum error is 0.01. The average and maximum errors of the other acceleration schemes can be calculated in the same way, giving 9 accuracy errors, denoted CPU_diff1–CPU_diff3 and GPU_diff1–GPU_diff6, where CPU_diff1 is the error of the CPU acceleration scheme of the onnxruntime framework, and so on.
(6) An acceleration scheme is selected according to the maximum acceleration benefit.
S1. Exclude acceleration schemes that do not meet the accuracy requirement.
Acceleration schemes whose accuracy errors meet the requirement are selected. For example, among the 9 accuracy errors CPU_diff1–CPU_diff3 and GPU_diff1–GPU_diff6, those greater than 0.01 are eliminated, and the remaining acceleration schemes that meet the accuracy requirement are taken as candidate acceleration schemes.
S2. Calculate the number of machines.
The numbers of CPU machines and GPU machines required by each candidate acceleration scheme are calculated from the target QPS and the inference time consumption (such as CPU_T1–CPU_T4 and GPU_T1–GPU_T7, in milliseconds). The calculation formula is: number of machines = target QPS / (1000 / inference time), i.e., the target QPS divided by the number of requests one machine can serve per second. A minimal worked sketch follows.
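The QPS target and inference timings below are illustrative assumptions, not measurements from this application:

```python
import math

target_qps = 500
inference_times_ms = {"CPU_T2": 40.0, "CPU_T3": 25.0, "GPU_T3": 5.0}  # example candidate schemes

machines = {
    name: math.ceil(target_qps / (1000.0 / t))   # per-machine QPS = 1000 / inference time (ms)
    for name, t in inference_times_ms.items()
}
print(machines)  # {'CPU_T2': 20, 'CPU_T3': 13, 'GPU_T3': 3}
```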
S3. Calculate the machine cost and select the optimal acceleration scheme.
The machine cost is calculated from the number of machines obtained in S2 as: machine cost = number of machines × machine unit price, and the acceleration scheme with the lowest cost is taken as the cost-optimal acceleration scheme.
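Continuing the sketch above, with unit prices as illustrative assumptions (a GPU machine typically costing more than a CPU machine):

```python
machines = {"CPU_T2": 20, "CPU_T3": 13, "GPU_T3": 3}          # from the S2 sketch
unit_price = {"CPU_T2": 1.0, "CPU_T3": 1.0, "GPU_T3": 6.0}    # relative machine unit prices (assumed)

costs = {name: machines[name] * unit_price[name] for name in machines}
best = min(costs, key=costs.get)
print(costs, "->", best)   # {'CPU_T2': 20.0, 'CPU_T3': 13.0, 'GPU_T3': 18.0} -> CPU_T3
```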
S4. Generate a model in the corresponding format according to the inference acceleration framework of the cost-optimal acceleration scheme, for example, an image recognition model in TensorRT format.
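If the cost-optimal scheme is a TensorRT scheme, the ONNX model can be turned into a serialized TensorRT engine. The sketch below uses a TensorRT 8.x-style Python API as an assumption (the exact calls vary between TensorRT versions), and the file names are illustrative:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):        # parse the ONNX model into a TensorRT network
        raise RuntimeError("failed to parse ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)     # enable fp16 if that scheme was selected

engine_bytes = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(engine_bytes)                 # TensorRT-format model to be deployed
```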
(7) The model is deployed to a service platform.
(8) The number of machines is configured for the deployed model according to the target QPS.
It should be understood that, although the steps in the flowcharts of the above embodiments are displayed in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not restricted to the exact order shown and may be performed in other orders. Moreover, at least some of the steps in these flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and their execution order is not necessarily sequential; they may be performed in turns or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
Based on the same inventive concept, the embodiment of the present application further provides an adaptive inference accelerating device for implementing the above-mentioned adaptive inference accelerating method. The implementation scheme for solving the problem provided by the apparatus is similar to the implementation scheme described in the above method, so specific limitations in one or more embodiments of the adaptive inference acceleration apparatus provided below may refer to the limitations on the adaptive inference acceleration method in the foregoing, and details are not described here.
In one embodiment, as shown in fig. 7, there is provided an adaptive inference accelerating apparatus, including: a first determining module 702, a second determining module 704, a third determining module 706, a selecting module 708, and a converting module 710, wherein:
a first determining module 702, configured to determine an inference result of the first network model;
the second determining module 704 is configured to perform inference acceleration on the second network model on each inference acceleration framework to obtain inference time consumption and an inference result; the second network model is obtained by performing model format conversion on the first network model;
a third determining module 706, configured to sequentially determine, based on the inference result of the first network model and each inference result of the second network model, the error of the second network model's inference on each inference acceleration framework;
a selecting module 708, configured to take the inference acceleration frameworks whose errors meet the error condition as candidate inference acceleration frameworks, and to select a target inference acceleration framework from the candidate inference acceleration frameworks based on the inference time consumption of the second network model on the candidate inference acceleration frameworks and the target performance index value;
and the conversion module 710 is configured to convert the second network model into a target network model corresponding to the target inference acceleration framework.
In one embodiment, the first determining module 702 is further configured to run the first network model on the data processor and determine the CPU inference result of the running first network model on the original training framework; and to run the first network model on the graphics processor and determine the GPU inference result of the running first network model on the original training framework.
In this embodiment, the inference result of the first network model is determined, and the second network model is then automatically subjected to inference acceleration on each inference acceleration framework to obtain the inference time consumption and inference results. The error of the second network model's inference on each inference acceleration framework is determined in turn based on the inference result of the first network model and each inference result of the second network model, and the frameworks whose errors meet the accuracy requirement are selected as candidate inference acceleration frameworks, so that suitable candidate frameworks can be chosen quickly without requiring specialists. In addition, a target inference acceleration framework that meets the inference time consumption and corresponding performance index requirements is selected from the candidate inference acceleration frameworks, and the second network model is converted into the target network model corresponding to the target inference acceleration framework, so that the target network model satisfies the inference time consumption and corresponding performance index requirements during inference. Deploying the target network model therefore not only reduces the deployment time but also enables the deployed network model to complete the inference process quickly.
In one embodiment, the second determining module 704 is further configured to create a corresponding inference session object based on each inference acceleration framework; load the second network model into each corresponding inference session object; initialize the inference session objects loaded with the second network model; and perform inference acceleration based on the initialized inference session objects to obtain the inference results and inference time consumption.
In one embodiment, the second determining module 704 is further configured to perform data inference on the data to be processed based on the initialized inference session object to obtain the inference result and inference time consumption; wherein, during data inference, the second network model is subjected to acceleration processing; the acceleration processing includes at least one of model quantization, removal of inference-independent operations in the second network model, or fusion of network layers in the second network model.
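A minimal sketch of how the removal of inference-independent operations and the fusion of network layers can be requested when the inference session object is an onnxruntime session; the file name is an illustrative assumption, and model quantization (for example the fp16 conversion shown earlier) would be a separate step:

```python
import onnxruntime as ort

so = ort.SessionOptions()
# ORT_ENABLE_ALL asks the runtime for constant folding, elimination of redundant
# (inference-independent) nodes, and operator/layer fusion when the graph is loaded.
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

session = ort.InferenceSession("model.onnx", sess_options=so, providers=["CPUExecutionProvider"])
```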
In one embodiment, the second determining module 704 is further configured to: when the initialized inference session object runs on the data processor, perform data inference on the data to be processed through the running inference session object to obtain a CPU inference result and CPU inference time consumption; and when the initialized inference session object runs on the graphics processor, perform data inference on the data to be processed through the running inference session object to obtain a GPU inference result and GPU inference time consumption.
In one embodiment, the selecting module 708 is further configured to determine, based on the inference time consumption of the second network model on the candidate inference acceleration framework and the target performance index value, the number of machines required for the second network model to perform inference on the candidate inference acceleration framework; and to select, from the candidate inference acceleration frameworks, a framework whose number of machines meets a preset condition as the target inference acceleration framework.
In one embodiment, the selecting module 708 is further configured to obtain a time-consumption adjustment parameter and the target performance index value; determine a machine-count influence factor based on the time-consumption adjustment parameter and the inference time consumption of the second network model on the candidate inference acceleration framework; and determine, according to the machine-count influence factor and the target performance index value, the number of machines required when the second network model performs inference on the candidate inference acceleration framework.
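One hedged reading of this embodiment, treating the time-consumption adjustment parameter as a safety margin on the per-request time, so that the influence factor is the adjusted time one request occupies a machine; the interpretation, parameter value, and numbers below are assumptions rather than the definitive calculation of this application:

```python
import math

def machines_needed(target_qps: float, inference_time_ms: float, adjust: float = 1.2) -> int:
    factor = (inference_time_ms * adjust) / 1000.0   # machine-count influence factor: adjusted seconds per request
    per_machine_qps = 1.0 / factor                   # requests one machine can serve per second
    return math.ceil(target_qps / per_machine_qps)

print(machines_needed(target_qps=500, inference_time_ms=20))  # 12 machines with a 1.2x margin
```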
In one embodiment, as shown in fig. 8, the apparatus further comprises:
a deployment module 712, configured to respond to a model deployment operation, deploy the target network model to the service platform;
a configuration module 714, configured to respond to a configuration operation and configure the required machines for the target network model according to the number of machines; wherein the machines include data processors or graphics processors.
In one embodiment, as shown in fig. 8, the apparatus further comprises:
a receiving module 716, configured to receive an image recognition request when the deployed target network model is an image recognition model;
the calling module 718 is configured to call, according to the image recognition request, an image recognition model deployed on the service platform;
and the identifying module 720 is configured to perform image identification on the image to be identified corresponding to the image identification request through the image identification model to obtain an identification result.
In one embodiment, the selecting module 708 is further configured to obtain a machine resource value; determine a total machine resource value based on the machine resource value and the number of machines; and select, from the candidate inference acceleration frameworks, a framework whose total machine resource value meets a preset condition as the target inference acceleration framework.
In one embodiment, the inference result of the first network model comprises at least two inference values, and the inference result of the second network model on each inference acceleration frame comprises at least two inference values;
a third determining module 706, configured to sequentially determine the differences between the at least two inference values of the first network model and the at least two inference values of the second network model on each inference acceleration framework, so as to obtain a difference set corresponding to each inference acceleration framework; and to take the maximum difference in each difference set, or the average difference of each difference set, as the inference error of the second network model on the corresponding inference acceleration framework.
In one embodiment, as shown in fig. 8, the apparatus further comprises:
a building module 722, configured to take each network layer in the first network model as a node respectively; constructing a graph network based on each node;
the generating module 724 is configured to generate a second network model according to model parameters of each network layer in the graph network and the first network model.
In the above embodiment, a corresponding inference session object is created based on each inference acceleration framework, the second network model is loaded into the corresponding inference session object, the inference session object loaded with the second network model is initialized, and inference acceleration is performed based on the initialized inference session object, so that the second network model can complete the inference process quickly and the model deployment time is effectively reduced.
The modules in the adaptive inference accelerator can be wholly or partially implemented by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 9. The computer device comprises a processor, a memory, an Input/Output (I/O) interface and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing reasoning time consumption and reasoning results. The input/output interface of the computer device is used for exchanging information between the processor and an external device. The communication interface of the computer device is used for connecting and communicating with an external terminal through a network. The computer program is executed by a processor to implement an adaptive inference acceleration method.
Those skilled in the art will appreciate that the structure shown in fig. 9 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer devices to which the solution is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the respective adaptive inference acceleration method described above when executing the computer program.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the respective adaptive inference acceleration method described above.
In one embodiment, a computer program product is provided comprising a computer program which, when executed by a processor, performs the steps of the respective adaptive inference acceleration method described above.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, displayed data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data need to comply with the relevant laws and regulations and standards of the relevant country and region.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include a Read-Only Memory (ROM), a magnetic tape, a floppy disk, a flash Memory, an optical Memory, a high-density embedded nonvolatile Memory, a resistive Random Access Memory (ReRAM), a Magnetic Random Access Memory (MRAM), a Ferroelectric Random Access Memory (FRAM), a Phase Change Memory (PCM), a graphene Memory, and the like. Volatile Memory can include Random Access Memory (RAM), external cache Memory, and the like. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others. The databases referred to in various embodiments provided herein may include at least one of relational and non-relational databases. The non-relational database may include, but is not limited to, a block chain based distributed database, and the like. The processors referred to in the embodiments provided herein may be general purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, quantum computing based data processing logic devices, etc., without limitation.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction between them, these combinations should be regarded as falling within the scope of this specification.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present application. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (16)

1. An adaptive inference acceleration method, characterized in that the method comprises:
determining an inference result of the first network model;
carrying out reasoning acceleration on the second network model on each reasoning acceleration framework to obtain reasoning time consumption and a reasoning result; the second network model is obtained by performing model format conversion on the first network model;
sequentially determining the error of reasoning of the second network model on each reasoning acceleration frame based on the reasoning result of the first network model and each reasoning result of the second network model;
selecting a target reasoning acceleration frame from the candidate reasoning acceleration frames based on the reasoning time consumption of the second network model on the candidate reasoning acceleration frames and the target performance index value;
and converting the second network model into a target network model corresponding to the target reasoning acceleration framework.
2. The method of claim 1, wherein determining the inference result of the first network model comprises:
running a first network model through a data processor, and determining a CPU inference result of the running first network model on an original training framework;
and running the first network model through a graphics processor, and determining a GPU inference result of the running first network model on the original training framework.
3. The method of claim 1, wherein performing inference acceleration on the second network model on each inference acceleration framework to obtain inference time consumption and inference results comprises:
establishing a corresponding inference session object based on each reasoning acceleration framework;
loading the second network model into the corresponding inference session objects respectively;
initializing the inference session objects loaded with the second network model;
and carrying out reasoning acceleration based on the initialized inference session objects to obtain a reasoning result and reasoning time consumption.
4. The method of claim 3, wherein the carrying out reasoning acceleration based on the initialized inference session objects to obtain a reasoning result and reasoning time consumption comprises:
performing data reasoning on the data to be processed based on the initialized inference session object to obtain a reasoning result and reasoning time consumption;
wherein, in the data reasoning process, the second network model is accelerated;
the accelerated processing includes at least one of removing inference-independent operations in the second network model or fusing network layers in the second network model.
5. The method of claim 4, wherein the performing data reasoning on the data to be processed based on the initialized inference session object to obtain a reasoning result and reasoning time consumption comprises:
when the initialized inference session object runs on a data processor, carrying out data reasoning on the data to be processed according to the running inference session object to obtain a CPU reasoning result and CPU reasoning time consumption;
and when the initialized inference session object runs on the graphics processor, performing data reasoning on the data to be processed according to the running inference session object to obtain a GPU reasoning result and GPU reasoning time consumption.
6. The method according to any one of claims 1 to 5, wherein the selecting a target inference acceleration frame in the candidate inference acceleration frames based on the inference time consumption and the target performance index value of the second network model in the candidate inference acceleration frames comprises:
determining the number of machines required by the second network model to reason on the candidate inference acceleration framework based on the inference time consumption and the target performance index value of the second network model on the candidate inference acceleration framework;
and selecting the frames with the machine number meeting the preset conditions from the candidate reasoning acceleration frames as target reasoning acceleration frames.
7. The method of claim 6, wherein determining the number of machines required by the second network model to reason on the candidate inference acceleration framework based on the inference time consumption and the target performance metric value of the second network model on the candidate inference acceleration framework comprises:
acquiring a time-consuming adjustment parameter and a target performance index value;
determining a machine number influence factor based on the time consumption adjusting parameter and the inference time consumption of the second network model in the candidate inference acceleration framework;
and determining the number of machines required by the second network model when the second network model carries out reasoning on the candidate reasoning acceleration framework according to the machine number influence factor and the target performance index value.
8. The method of claim 7, wherein after converting the second network model into the target network model corresponding to the target inference acceleration framework, the method further comprises:
responding to model deployment operation, and deploying the target network model to a business service platform;
responding to configuration operation, and configuring required machines for the target network model according to the machine number;
wherein the machine comprises a data processor or a graphics processor.
9. The method of claim 8, further comprising:
when the deployed target network model is an image recognition model, receiving an image recognition request;
calling an image recognition model deployed on the service platform according to the image recognition request;
and carrying out image recognition on the image to be recognized corresponding to the image recognition request through the image recognition model to obtain a recognition result.
10. The method according to claim 6, wherein said selecting, as a target inference acceleration frame, a frame whose number of machines meets a preset condition from among the candidate inference acceleration frames comprises:
acquiring a machine resource value;
determining a total machine resource value based on the machine resource value and the number of machines;
and selecting a frame with the total machine resource value meeting a preset condition from the candidate reasoning acceleration frames as a target reasoning acceleration frame.
11. The method according to any one of claims 1 to 5, wherein the inference result of the first network model comprises at least two inference values, and the inference result of the second network model on each inference acceleration framework comprises at least two inference values;
the sequentially determining the errors of the second network model in reasoning on each reasoning acceleration framework based on the reasoning results of the first network model and each reasoning result of the second network model comprises:
sequentially determining the difference between at least two inference values of the first network model and at least two inference values of the second network model on each inference acceleration frame to obtain a difference set corresponding to each inference acceleration frame;
and taking the maximum difference value in each difference value set or the average difference value of each difference value set as the inference error of the second network model on each inference acceleration framework.
12. The method according to any one of claims 1 to 5, further comprising:
taking each network layer in the first network model as a node respectively;
constructing a graph network based on each of the nodes;
and generating the second network model according to the model parameters of each network layer in the graph network and the first network model.
13. An adaptive inference acceleration apparatus, characterized in that the apparatus comprises:
the first determination module is used for determining an inference result of the first network model;
the second determining module is used for carrying out reasoning acceleration on the second network model on each reasoning acceleration framework to obtain reasoning time consumption and a reasoning result; the second network model is obtained by performing model format conversion on the first network model;
a third determining module, configured to sequentially determine, based on the inference result of the first network model and each inference result of the second network model, an error that the second network model infers on each inference acceleration framework;
the selection module is used for taking the inference acceleration frame corresponding to the error meeting the error condition as a candidate inference acceleration frame, and selecting a target inference acceleration frame in the candidate inference acceleration frame based on the inference time consumption and the target performance index value of the second network model in the candidate inference acceleration frame;
and the conversion module is used for converting the second network model into a target network model corresponding to the target reasoning acceleration framework.
14. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 12.
15. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 12.
16. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, realizes the steps of the method of any one of claims 1 to 12.
CN202210928200.8A 2022-08-03 2022-08-03 Adaptive inference acceleration method, apparatus, computer device, and storage medium Active CN114997401B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210928200.8A CN114997401B (en) 2022-08-03 2022-08-03 Adaptive inference acceleration method, apparatus, computer device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210928200.8A CN114997401B (en) 2022-08-03 2022-08-03 Adaptive inference acceleration method, apparatus, computer device, and storage medium

Publications (2)

Publication Number Publication Date
CN114997401A true CN114997401A (en) 2022-09-02
CN114997401B CN114997401B (en) 2022-11-04

Family

ID=83021639

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210928200.8A Active CN114997401B (en) 2022-08-03 2022-08-03 Adaptive inference acceleration method, apparatus, computer device, and storage medium

Country Status (1)

Country Link
CN (1) CN114997401B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110796242A (en) * 2019-11-01 2020-02-14 广东三维家信息科技有限公司 Neural network model reasoning method and device, electronic equipment and readable medium
CN112784989A (en) * 2019-11-08 2021-05-11 阿里巴巴集团控股有限公司 Inference system, inference method, electronic device, and computer storage medium
CN113723279A (en) * 2021-08-30 2021-11-30 东南大学 Multi-target tracking acceleration method based on time-space optimization in edge computing environment
US20220108134A1 (en) * 2020-10-01 2022-04-07 Nvidia Corporation Unsupervised domain adaptation with neural networks
CN114298284A (en) * 2021-12-29 2022-04-08 浙江大华技术股份有限公司 Network model conversion method, device, system, storage medium and electronic device
CN114662661A (en) * 2022-03-22 2022-06-24 东南大学 Method for accelerating multi-outlet DNN reasoning of heterogeneous processor under edge calculation

Also Published As

Publication number Publication date
CN114997401B (en) 2022-11-04

Similar Documents

Publication Publication Date Title
Ribeiro et al. Mlaas: Machine learning as a service
JP7264376B2 (en) How to generate a general-purpose trained model
US11861474B2 (en) Dynamic placement of computation sub-graphs
CN111768008A (en) Federal learning method, device, equipment and storage medium
CN112799850A (en) Model training method, model prediction method, and model control system
EP4350572A1 (en) Method, apparatus and system for generating neural network model, devices, medium and program product
US11494614B2 (en) Subsampling training data during artificial neural network training
CN109754072B (en) Processing method of network offline model, artificial intelligence processing device and related products
CN111459610B (en) Model deployment method and device
CN110781180B (en) Data screening method and data screening device
CN114327399A (en) Distributed training method, apparatus, computer device, storage medium and product
CN116775183A (en) Task generation method, system, equipment and storage medium based on large language model
CN115358401A (en) Inference service processing method and device, computer equipment and storage medium
CN112149838A (en) Method, device, electronic equipment and storage medium for realizing automatic model building
CN112099848A (en) Service processing method, device and equipment
CN114997401B (en) Adaptive inference acceleration method, apparatus, computer device, and storage medium
WO2023185972A1 (en) Data processing method and apparatus, and electronic device
CN116909748A (en) Computing power resource allocation method and device, electronic equipment and storage medium
US11762948B2 (en) In-process engine implementation using SDK extensions
WO2022127603A1 (en) Model processing method and related device
CN112230911B (en) Model deployment method, device, computer equipment and storage medium
Tiwari et al. NCS based ultra low power optimized machine learning techniques for image classification
US20230186108A1 (en) Control device, method, and program
CN112817560B (en) Computing task processing method, system and computer readable storage medium based on table function
CN111160403A (en) Method and device for multiplexing and discovering API (application program interface)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40073975

Country of ref document: HK