CN114691148A - Model reasoning acceleration method and device, electronic equipment and storage medium

Info

Publication number
CN114691148A
Authority
CN
China
Prior art keywords
deep learning, learning model, target, operator, original
Legal status
Pending
Application number
CN202210374235.1A
Other languages
Chinese (zh)
Inventor
黄贲
田津津
王锐
田少卿
林晓春
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority application: CN202210374235.1A
Publication: CN114691148A
PCT application: PCT/CN2022/126151 (published as WO2023197554A1)

Classifications

    • G06F8/4441 Software engineering; compilation; optimisation: reducing the execution time required by the program code
    • G06F8/447 Software engineering; compilation; encoding: target code generation
    • G06F8/51 Software engineering; transformation of program code: source to source
    • G06N3/042 Neural networks: knowledge-based neural networks; logical representations of neural networks
    • G06N3/045 Neural network architectures: combinations of networks
    • G06N3/08 Neural networks: learning methods
    • G06N3/10 Neural networks: interfaces, programming languages or software development kits, e.g. for simulating neural networks
    • G06N5/04 Computing arrangements using knowledge-based models: inference or reasoning models

Abstract

The disclosure provides a model inference acceleration method and apparatus, an electronic device, and a storage medium, relating to the field of artificial intelligence and in particular to deep learning. The scheme is as follows: acquire an original deep learning model obtained through dynamic-language training; acquire the correspondence between a first description of a target object in the original model, expressed in the dynamic language, and a second description of the same object, expressed in a predetermined static language; convert the target object into an object described in the predetermined static language according to that correspondence, obtaining a target deep learning model; and load the target deep learning model into a predetermined deep learning inference optimizer to obtain an optimized target deep learning model. The disclosure addresses the difficulty, in the related art, of converting a deep learning model trained in a dynamic language into a static-language model suitable for a deep learning inference optimizer.

Description

Model reasoning acceleration method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a model inference acceleration method, apparatus, electronic device, and storage medium.
Background
Deep learning model inference acceleration techniques are used to reduce the inference latency of deep learning models.
In the related art, a deep learning inference optimizer is typically used to accelerate model inference. Deep learning models, however, are usually obtained through dynamic-language training, whereas a deep learning inference optimizer operates on static-language models. To accelerate a model with such an optimizer, the dynamically trained model must therefore first be converted into a static-language model the optimizer can consume; in the related art, this conversion is difficult.
Disclosure of Invention
The disclosure provides a model inference acceleration method, a model inference acceleration apparatus, an electronic device, and a storage medium.
According to an aspect of the present disclosure, there is provided a model inference acceleration method, including: acquiring an original deep learning model obtained through dynamic-language training; acquiring a correspondence between a first description and a second description, where the first description describes a target object of the original deep learning model in the dynamic language and the second description describes the same target object in a predetermined static language; converting the target object into an object described in the predetermined static language according to the correspondence, to obtain a target deep learning model; and loading the target deep learning model into a predetermined deep learning inference optimizer to obtain an optimized target deep learning model.
Optionally, converting the target object of the original deep learning model into an object described in the predetermined static language according to the correspondence, to obtain the target deep learning model, includes at least one of: when the target object is a variable of the original model, adding a type annotation to the variable, the annotation identifying the variable's type as a target type recognized by the predetermined static language; when the target object is an original operator of the original model that the predetermined static language does not support, replacing the original operator with a target operator the language supports; and when the target object is original syntax of the original model that the predetermined static language does not support, replacing the original syntax with target syntax the language supports.
Optionally, loading the target deep learning model into the predetermined deep learning inference optimizer to obtain the optimized target deep learning model includes: detecting whether a first operator of the target deep learning model has a second operator with a mapping relationship in the predetermined deep learning inference optimizer, where the first operator is any operator of the target model; and loading the target deep learning model into the optimizer by mapping each first operator to its corresponding second operator, to obtain the optimized target deep learning model.
Optionally, obtaining the optimized target deep learning model by mapping first operators to corresponding second operators and loading the target deep learning model into the predetermined deep learning inference optimizer includes: when a plurality of first operators of the target model each have a second operator with a mapping relationship, and those second operators contain consecutive operators, fusing the consecutive operators to obtain a fusion operator; and replacing the consecutive operators with the fusion operator and loading the target deep learning model into the optimizer, to obtain the optimized target deep learning model.
Optionally, fusing the consecutive operators to obtain the fusion operator includes: determining, among the consecutive operators, the operators to be fused; and fusing those operators to obtain the fusion operator.
Optionally, determining the operators to be fused among the consecutive operators includes: determining a plurality of fusion modes based on the consecutive operators; acquiring a weight value for each fusion mode; determining a target fusion mode from the plurality of fusion modes based on those weight values; and taking the operators included in the target fusion mode as the operators to be fused.
Optionally, after the target deep learning model is loaded into the predetermined deep learning inference optimizer and the optimized target deep learning model is obtained, the method further includes: compiling the optimized target deep learning model to obtain a compiled model; receiving data to be predicted; and inputting the data to be predicted into the compiled model to obtain prediction result data.
According to another aspect of the present disclosure, there is provided a model inference acceleration apparatus, including: a first obtaining module, configured to acquire an original deep learning model obtained through dynamic-language training; a second obtaining module, configured to acquire the correspondence between a first description and a second description, where the first description describes a target object of the original model in the dynamic language and the second description describes the same target object in a predetermined static language; a conversion module, configured to convert the target object into an object described in the predetermined static language according to the correspondence, obtaining a target deep learning model; and a loading module, configured to load the target deep learning model into a predetermined deep learning inference optimizer to obtain an optimized target deep learning model.
Optionally, the conversion module includes at least one of: an annotation unit, configured to add, when the target object is a variable of the original deep learning model, a type annotation identifying the variable's type as a target type recognized by the predetermined static language; a first replacing unit, configured to replace an original operator of the original model that the predetermined static language does not support with a target operator it does support; and a second replacing unit, configured to replace original syntax of the original model that the predetermined static language does not support with target syntax it does support.
Optionally, the loading module includes: a detection unit, configured to detect whether a first operator of the target deep learning model has a second operator with a mapping relationship in the predetermined deep learning inference optimizer, where the first operator is any operator of the target model; and a mapping unit, configured to load the target deep learning model into the optimizer by mapping each first operator to its corresponding second operator, obtaining the optimized target deep learning model.
Optionally, the mapping unit includes: a fusion subunit, configured to fuse consecutive operators into a fusion operator when a plurality of first operators of the target model each have mapped second operators and those second operators contain consecutive operators; and a mapping subunit, configured to replace the consecutive operators with the fusion operator and load the target deep learning model into the optimizer, obtaining the optimized target deep learning model.
Optionally, the fusion subunit includes: a determining secondary subunit, configured to determine the operators to be fused among the consecutive operators; and a fusing secondary subunit, configured to fuse those operators into the fusion operator.
Optionally, the determining secondary subunit is further configured to: determine a plurality of fusion modes based on the consecutive operators; acquire a weight value for each fusion mode; determine a target fusion mode from those weight values; and take the operators of the target fusion mode as the operators to be fused.
Optionally, the apparatus further includes: a compiling module, configured to compile the optimized target deep learning model to obtain a compiled model; a receiving module, configured to receive data to be predicted; and a prediction module, configured to input the data to be predicted into the compiled model to obtain prediction result data.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform any of the methods described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of the above.
According to another aspect of the disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of any of the above.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present disclosure, nor are they intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. In the drawings:
FIG. 1 is a flow chart of a model inference acceleration method according to a first embodiment of the present disclosure;
FIG. 2 is a flow chart of a model inference acceleration method according to a second embodiment of the present disclosure;
FIG. 3 is a schematic diagram of an architecture of a model inference acceleration apparatus according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a model structure according to an embodiment of the disclosure;
FIG. 5 is a flow chart of a method for operator fusion in a deep learning model according to an embodiment of the present disclosure;
FIG. 6 is a block diagram of a model inference acceleration apparatus according to an embodiment of the present disclosure;
FIG. 7 is a block diagram of an electronic device for implementing a model inference acceleration method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Description of the terms
TorchScript: a sub-language of the Python language provided by PyTorch.
TensorRT: a high-performance deep learning inference optimizer that reduces the inference latency of a deep learning model and improves its throughput.
AST (Abstract Syntax Tree): a tree representation of the abstract syntactic structure of source code written in a programming language.
XTCL (XPU software Compilation Library): a high-performance deep learning inference optimizer for XPU hardware.
IR (Intermediate Representation): an intermediate language between programs and hardware.
ONNX (Open Neural Network Exchange): an open exchange format that serves as an IR language.
In the embodiment of the present disclosure, a model inference acceleration method is provided, and fig. 1 is a flowchart of the model inference acceleration method according to the first embodiment of the present disclosure. As shown in fig. 1, the model inference acceleration method includes the following steps:
Step S101: acquire an original deep learning model obtained through dynamic-language training.
In one embodiment, any of several dynamic languages may be used to train the original deep learning model; Python is one example. To obtain the original model, the required layers are added on top of a base model in a deep learning framework, a classifier and an optimization algorithm are selected, the deep learning model is built accordingly, and the built model is then trained on a training sample data set. Many deep learning frameworks can build and train such models, including PyTorch, Torch, and TensorFlow.
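A minimal sketch (not from the patent) of obtaining such an original model through dynamic-language training in PyTorch; the layer sizes, optimizer, and data are illustrative stand-ins.

```python
import torch
import torch.nn as nn

class OriginalModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Required layers added on top of a base model, per the text above.
        self.backbone = nn.Sequential(
            nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

    def forward(self, x):
        return self.backbone(x)

model = OriginalModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One illustrative training step on random stand-in data.
x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
```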
Step S102: acquire a correspondence between a first description and a second description, where the first description describes a target object of the original deep learning model in the dynamic language and the second description describes the same target object in a predetermined static language.
In one embodiment, the target objects include the variables, syntax, and operators of the original deep learning model; the first description describes them in the dynamic language, and the second description describes them in the predetermined static language.
Step S103: convert the target object of the original deep learning model into an object described in the predetermined static language according to the correspondence, to obtain the target deep learning model.
In one embodiment, the original deep learning model is built and trained in PyTorch using the Python language, and the target deep learning model described in the static language TorchScript is obtained by converting the target objects of the original model.
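A hedged sketch of this conversion using PyTorch's own scripting entry point, which parses the Python AST into TorchScript, on the assumption that the model uses only supported syntax and operators.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

scripted = torch.jit.script(model)  # parse the Python AST into TorchScript IR
scripted.save("target_model.pt")    # serialized, loadable without Python
print(scripted.graph)               # inspect the static computation graph
```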
Step S104: load the target deep learning model into a predetermined deep learning inference optimizer to obtain the optimized target deep learning model.
In some optional embodiments, the target deep learning model converted into the TorchScript language is loaded into the deep learning inference optimizer to obtain the optimized target deep learning model, and inference acceleration is realized from the optimized model together with the optimizer. A deep learning inference optimizer accelerates inference on a deep learning model; there are several kinds, including TensorRT and the graph compilation engine XTCL.
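As a hedged illustration of this loading step, the sketch below uses the torch-tensorrt package; the API (torch_tensorrt.compile, torch_tensorrt.Input) is assumed from that project's public documentation and is not part of the patent, and the input shape is illustrative.

```python
import torch
import torch_tensorrt  # assumed installed; API per the torch-tensorrt docs

model = torch.jit.script(torch.nn.Conv2d(3, 8, 3)).eval().cuda()
optimized = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input((1, 3, 224, 224))],  # illustrative shape
    enabled_precisions={torch.float},                 # keep FP32 kernels
)
prediction = optimized(torch.randn(1, 3, 224, 224, device="cuda"))
```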
With the above method, an original deep learning model obtained through dynamic-language training is acquired; the correspondence between the first (dynamic-language) description and the second (predetermined static-language) description of a target object in the model is acquired; the target object is converted into an object described in the static language according to that correspondence, yielding a target deep learning model; and the target model is loaded into a predetermined deep learning inference optimizer to obtain the optimized target deep learning model. In the related art, the language of an original deep learning model produced by a deep learning framework is dynamic, and after training, inference is run directly inside the framework. Direct in-framework inference, however, computes slowly and therefore has long latency; in fields with strict latency requirements, such as autonomous driving, it cannot meet those requirements. The acquired original model therefore needs to be converted from the dynamic language into a static language, giving an optimized target deep learning model whose inference can be accelerated in the deep learning inference optimizer.
The method addresses the difficulty, in the related art, of converting a deep learning model trained in a dynamic language into a static-language model suitable for a deep learning inference optimizer. In the method, the target objects of the dynamically trained original model are converted into objects described in the predetermined static language according to the correspondence between the first and second descriptions, so the dynamic-language original model is converted into a static-language target model automatically. An optimized target deep learning model that accelerates inference is then obtained from the target model and the predetermined deep learning inference optimizer. This resolves both the difficulty and the inefficiency of such conversions in the related art.
In some optional embodiments, converting the target objects of the original deep learning model into objects described in the predetermined static language according to the correspondence between the first and second descriptions, to obtain the target deep learning model, may include at least one of the following: when the target object is a variable of the original model, adding a type annotation that identifies the variable's type as a target type recognized by the predetermined static language, so that the program knows the type of each variable; when the target object is an original operator of the original model that the predetermined static language does not support, replacing it with a target operator the language supports; and when the target object is original syntax of the original model that the predetermined static language does not support, replacing it with target syntax the language supports.
In one embodiment, an original operator not supported by the predetermined static language is one that cannot be converted into syntax described in that language. For example, when converting an original deep learning model trained in dynamic Python into a target model described in static TorchScript, some syntax that TorchScript does not support must be replaced: a list built in Python may contain variables of multiple types, whereas TorchScript does not support such mixed-type lists, so the values in the list are analyzed to obtain their types and the list is replaced with one list per type. Likewise, some Python operators that TorchScript does not support must be replaced. For example, TorchScript does not support NumPy (Numerical Python), so NumPy-related operators are replaced with PyTorch operators; because TorchScript is a sub-language of Python provided by PyTorch, the replaced operators are then supported by TorchScript.
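A hedged before/after sketch of the three conversion cases just described (type annotation, mixed-type list splitting, NumPy-operator replacement); the module and variable names are illustrative, not from the patent.

```python
import torch
from typing import List

class Converted(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Case 1: annotate the variable so TorchScript can resolve its type.
        scale: float = 0.5
        # Case 2: a Python list mixing int and float is not scriptable, so
        # the mixed-type list is split into one list per type.
        ints: List[int] = [1, 2]
        floats: List[float] = [0.1, 0.2]
        # Case 3: a NumPy call such as np.clip is replaced by its PyTorch
        # equivalent, which TorchScript supports.
        return torch.clamp(x * scale, min=floats[0], max=float(ints[1]))

scripted = torch.jit.script(Converted())  # now converts cleanly
```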
With this method, the variables of the original deep learning model are type-annotated, and the original operators and syntax not supported by the static language are replaced, converting the dynamic-language original model into the static-language target model. From the static-language target model, an optimized target deep learning model that can be loaded into the inference acceleration optimizer is obtained, realizing inference acceleration of the deep learning model and resolving the conversion difficulty of the related art.
In some embodiments, the target deep learning model is loaded into a predetermined deep learning inference optimizer to obtain the inference-accelerated model; this can proceed in several ways, for example: detect whether each first operator of the target deep learning model has a second operator with a mapping relationship in the optimizer, where a first operator is any operator of the target model; then load the target model into the optimizer by mapping each first operator to its corresponding second operator, obtaining the optimized target deep learning model.
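A minimal sketch of the detection step, assuming a lookup table of operator mappings; the trt::-style names are invented for this sketch and are not TensorRT's actual internal names.

```python
# Illustrative mapping table: first operators (TorchScript kinds) to the
# second operators assumed to exist in the optimizer.
SUPPORTED = {
    "aten::conv2d": "trt::Convolution",
    "aten::relu": "trt::Activation",
    "aten::linear": "trt::FullyConnected",
}

def has_mapped_second_operator(op_kind: str) -> bool:
    """Detection step: does the optimizer map this first operator?"""
    return op_kind in SUPPORTED

for kind in ["aten::conv2d", "aten::relu", "my::custom_op"]:
    print(kind, "->", SUPPORTED.get(kind, "<no mapping: fall back to PyTorch>"))
```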
By mapping the first operators of the target deep learning model to the corresponding second operators of the deep learning inference optimizer, the optimized target deep learning model is obtained; performing model prediction with this model improves processing efficiency and accelerates inference.
In some optional embodiments, mapping the first operators of the target deep learning model to corresponding second operators and loading the model into the predetermined deep learning inference optimizer may include: when a plurality of first operators each have second operators with mapping relationships, and those second operators contain consecutive operators, fuse the consecutive operators into a fusion operator, replace the consecutive operators with the fusion operator, and load the target model into the optimizer to obtain the optimized target deep learning model.
Replacing mapped first operators whose execution order is consecutive yields consecutive operators in the deep learning inference optimizer; fusing them into a fusion operator and substituting it for the unfused sequence reduces the data input and output during model execution, increasing the running speed of the target deep learning model and accelerating its inference.
In some optional embodiments, fusing the consecutive operators to obtain the fusion operator may include: determining which of the consecutive operators are to be fused; and fusing them to obtain the fusion operator. Fusing the selected operators reduces data input and output during model execution and accelerates inference of the target deep learning model.
In some optional embodiments, determining the operators to be fused among the consecutive operators may include: determining a plurality of fusion modes based on the consecutive operators; acquiring a weight value for each fusion mode; selecting a target fusion mode from the plurality of fusion modes by those weight values; and taking the operators of the target fusion mode as the operators to be fused (sketched below). Selecting the target fusion mode by weight and fusing its operators yields the optimized deep learning model with the highest processing speed.
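A minimal sketch of weight-based fusion-mode selection, assuming (per the embodiment described later) a default weight of 1 per operator; the candidate modes are illustrative.

```python
# Candidate fusion modes over the consecutive operators; coverage is made up.
candidate_modes = {
    "conv+bn": ["conv", "bn"],
    "conv+bn+relu": ["conv", "bn", "relu"],
}

def mode_weight(ops, per_op_weight=1):
    # Each operator contributes a default weight of 1, so wider fusions win.
    return per_op_weight * len(ops)

target_mode = max(candidate_modes, key=lambda m: mode_weight(candidate_modes[m]))
operators_to_fuse = candidate_modes[target_mode]
print(target_mode, operators_to_fuse)  # conv+bn+relu ['conv', 'bn', 'relu']
```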
In some optional embodiments, after the target deep learning model is loaded into the predetermined deep learning inference optimizer and the optimized target deep learning model is obtained, the method may further include the following steps:
step S201, compiling the optimized target deep learning model to obtain a compiled model.
Step S202, receiving data to be predicted.
Step S203, inputting the data to be predicted into the compiled model to obtain the prediction result data.
By the method, the compiled model is obtained based on the optimized target deep learning model, and data prediction is performed according to the compiled model, so that the data prediction speed is increased.
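A framework-neutral, hedged sketch of steps S201 to S203 in which TorchScript freezing stands in for the compile stage; the patent does not prescribe this particular API, and the model and data are illustrative.

```python
import torch

model = torch.jit.script(torch.nn.Linear(16, 4)).eval()
compiled = torch.jit.freeze(model)       # S201: compile the optimized model
data_to_predict = torch.randn(1, 16)     # S202: receive data to be predicted
prediction = compiled(data_to_predict)   # S203: obtain prediction result data
print(prediction.shape)                  # torch.Size([1, 4])
```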
Based on the above embodiments and optional embodiments, an optional model inference acceleration method is provided and described in detail below.
It should be noted that, in this optional embodiment, the dynamic language is the Python language used with PyTorch, and the deep learning inference optimizer is TensorRT.
In the related art, a deep learning model can be built and trained with PyTorch and inference can be run on the native PyTorch model, but the per-call latency of native PyTorch inference is long. In some fields with high requirements on inference speed, running the native PyTorch model cannot meet those requirements. In autonomous driving, for example, the model must compute quickly; if an image-processing model is run natively in PyTorch, a single call may take as long as 200 milliseconds, in which case the user obtains only 5 image prediction results per second, which clearly cannot meet the requirements of practical applications.
In the related art, to accelerate inference of the deep learning model, the PyTorch-derived model may be converted from the dynamic language into a static language, and the resulting static-language model is then loaded into a deep learning inference optimizer. There are various kinds of optimizer; for example, TensorRT, a deep learning inference optimizer backed by the GPU (Graphics Processing Unit), can be used to accelerate inference. The PyTorch model is loaded into TensorRT as follows: first convert the Python-coded PyTorch deep learning model into an IR language, then load the IR-language model into TensorRT for inference acceleration. The IR language is a static language; the common IR language in the related art is ONNX, an open file format designed for machine learning that stores trained models and can convert a PyTorch model into a model loadable by TensorRT. TensorRT then performs computation graph fusion, tensor fusion, and operator optimization on the loaded network structure graph of the model. The network structure graph consists of nodes and edges: nodes correspond to operators of the deep learning model, and an edge connecting two adjacent nodes corresponds to the data input and output between the corresponding operators.
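For concreteness, a hedged sketch of the related-art export path just described, using PyTorch's standard ONNX exporter; the model and input shape are illustrative.

```python
import torch

model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval()
dummy = torch.randn(1, 3, 224, 224)  # illustrative input shape
torch.onnx.export(model, dummy, "model.onnx",
                  input_names=["input"], output_names=["output"])
# The resulting model.onnx is the IR-language model that TensorRT can load.
```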
The related-art method can load a dynamically trained deep learning model into a deep learning inference optimizer for acceleration, but it has the following drawbacks. The Python-coded PyTorch model is in a dynamic language while the IR language is static, so the variables, syntax, and operators of the model are difficult to convert. When loading the IR-language model, the optimizer must fully support every operator in it; unsupported operators cannot be bypassed and must instead be handled by hand-written plugins that convert them into supported operators, which is labor-intensive. Moreover, the IR-language model obtained in the related art can be loaded only into TensorRT and not into other deep learning inference optimizers; for example, it cannot be loaded into the graph compilation engine XTCL.
In view of this, an optional embodiment of the present disclosure provides a model inference acceleration method that obtains an IR-language deep learning model by automatically type-annotating the variables of the Python-coded PyTorch model and replacing the syntax and operators that the TorchScript language does not support. This resolves the conversion difficulty for variables, syntax, and operators in the related art. Operators supported by the deep learning inference optimizer are replaced, while unsupported operators fall back to their native Python implementation; this removes the need for plugin handling of unsupported operators when loading the IR-language model into the optimizer, and also removes the restriction that the obtained IR-language model can be loaded only into TensorRT.
FIG. 3 is a schematic diagram of the model inference acceleration architecture according to an embodiment of the present disclosure. The inference acceleration method for a deep learning model is described below with reference to FIG. 3.
The inference acceleration method of the deep learning model comprises the following steps:
Step 1: preprocess the Python-coded PyTorch deep learning model using warm-up data, and convert it into a TorchScript deep learning model according to the preprocessing result.
The warm-up data comprises a training sample data set used to train the PyTorch deep learning model.
When converting a deep learning model coded in the dynamic Python language into a static TorchScript deep learning model, variable types may be unrecognizable, the languages may be incompatible, and some operators may be unsupported. The Python-coded PyTorch model therefore needs to be preprocessed so that the Python code can be converted into TorchScript; this preprocessing runs at the Python level. In the architecture of FIG. 3, it is implemented in the front-end import tool.
The method for preprocessing the Python coding-based PyTorch deep learning model specifically comprises the following steps:
Step 1.1: record the types of the variables in the PyTorch deep learning model, as well as the syntax and operators that do not support conversion into the TorchScript language.
It should be understood that TorchScript is a Python sub-language provided by PyTorch: an IR language formed by parsing the Python AST. TorchScript has no dependency on the Python runtime and is free of the limitations of the Python GIL (Global Interpreter Lock).
Step 1.2: use the warm-up data to automatically correct the inference process (fallback process) of the PyTorch deep learning model. Specifically: while warming up the model on the warm-up data, deduce the types accepted and returned by each function in the model, add type annotations to the model's variables, correct the syntax that does not support conversion into TorchScript, and replace the operators that do not support conversion into TorchScript, so that both become convertible into the TorchScript language (a sketch follows after step 1.3).
Step 1.3: correct the constructors (init functions) of the PyTorch deep learning model that do not support conversion into the TorchScript language, so that they do.
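A hedged sketch of the type deduction of step 1.2 (referenced above): wrap each function so a warm-up pass records the argument and return types that later drive automatic annotation. The wrapper mechanism is an illustrative assumption, not the patent's implementation.

```python
import functools
import torch

observed = {}

def record_types(fn):
    """Record the argument and return types seen during the warm-up pass."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        out = fn(*args, **kwargs)
        observed[fn.__name__] = ([type(a).__name__ for a in args],
                                 type(out).__name__)
        return out
    return wrapper

@record_types
def postprocess(x, threshold=0.5):
    return (x > threshold).float()

postprocess(torch.randn(4))  # warm-up pass drives the type deduction
print(observed)              # {'postprocess': (['Tensor'], 'Tensor')}
```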
With this method, the Python-coded PyTorch deep learning model is preprocessed and then loaded in TorchScript format, so that a model coded in the dynamic Python language can run in a static-language environment, for example a C++ environment.
Step 2: optimize the network structure graph of the deep learning model converted into the TorchScript language based on the warm-up data, thereby obtaining an optimized model structure graph.
The network structure graph of the TorchScript deep learning model is the original model structure graph; it can be optimized into the optimized model structure graph through the following steps:
and 2.1, carrying out whole graph analysis on the original model structure graph. The method specifically comprises the following steps: inputting the preheating data into a deep learning model converted into TorchScript language, and acquiring the data input and output conditions of each node in the original model structure diagram. The data input and output condition includes the type and size of the input and output data.
Step 2.2: perform graph optimization on the original model structure graph. Specifically: determine the redundant nodes from the model's runtime behavior and delete them, and prune branches that can never be reached when the model runs (a pruning sketch follows after step 2.4).
Step 2.3: perform subgraph segmentation on the pruned model structure graph. A subgraph is one of several small graphs obtained by splitting the model structure graph according to which operators of the deep learning model the underlying engine (i.e., the deep learning inference optimizer that accelerates the model) supports. Specifically: split the graph-optimized model structure graph according to the underlying engine's support for the model's operators.
Step 2.4: replace and fuse the operators of the deep learning model corresponding to the model structure graph. Specifically, replace and fuse the corresponding operators with operators of the underlying engine, obtaining a graph with replaced and fused operators. Alternatively, pass each segmented subgraph to the underlying engine, treat the engine's corresponding operator as a new single operator that replaces the several operators of the subgraph, and add the replacement operators at the corresponding positions of the graph, obtaining the optimized model structure graph.
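A hedged sketch of the graph optimization of step 2.2 (referenced above): keep only nodes that are both reachable from the inputs and able to reach the outputs. The graph encoding is an illustrative assumption.

```python
from typing import Dict, List

def prune(edges: Dict[str, List[str]],
          inputs: List[str], outputs: List[str]) -> Dict[str, List[str]]:
    """Keep only nodes both reachable from inputs and able to reach outputs."""
    def closure(start: List[str], adj: Dict[str, List[str]]):
        seen, stack = set(), list(start)
        while stack:
            n = stack.pop()
            if n not in seen:
                seen.add(n)
                stack.extend(adj.get(n, []))
        return seen

    reverse: Dict[str, List[str]] = {}
    for src, dsts in edges.items():
        for d in dsts:
            reverse.setdefault(d, []).append(src)

    live = closure(inputs, edges) & closure(outputs, reverse)
    return {n: [d for d in dsts if d in live]
            for n, dsts in edges.items() if n in live}

# Node X is redundant: nothing downstream of it reaches the output.
print(prune({"in": ["A", "X"], "A": ["out"], "X": []}, ["in"], ["out"]))
# -> {'in': ['A'], 'A': ['out']}
```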
There are various types of underlying engine, for example the graph compilation engine XTCL and the GPU-backed deep learning inference optimizer TensorRT.
During subgraph segmentation of the graph-optimized model structure graph, only the consecutive, engine-supported operators are replaced and fused; operators the underlying engine does not support are implemented by PyTorch itself and undergo neither replacement nor fusion. This removes the need to write plugins when the underlying engine does not support an operator in the deep learning model, greatly improving the model's optimization efficiency.
The replacement and fusion of multiple engine-supported operators that are consecutive in execution order is described in detail below.
FIG. 4 is a schematic diagram of an original model structure provided according to the present disclosure. Referring to FIG. 4, the model structure graph contains nodes A, B, E, G, F, and H, which the underlying engine supports, and nodes C, D, and I, which it does not.
FIG. 5 is a flowchart of the operator fusion method in the deep learning model provided by the present disclosure. Taking a deep learning model applied to a single underlying engine as an example, the subgraph segmentation of the model structure graph of FIG. 4 is described below with reference to FIG. 5.
A DFS (Depth-First Search) traversal of the whole model structure graph identifies the nodes not supported by the underlying engine and yields the reverse topological order of the graph. With reference to FIG. 4: the topological order of the graph is A B C D E G F H, and its reverse topological order is H F G E D C B A.
Iterate in the reverse topological order of the model structure graph, fusing nodes along their output edges, and stop a fusion when the two nodes currently being processed would form a loop. Node fusion, explained with the nodes of FIG. 4: node I is not supported by the underlying engine, so it is not fused. Nodes H and F are supported and consecutive in execution order, so they are fused into F(H). Nodes E and G are supported, and G is consecutive with the fused node F(H), so G fuses with F(H) into G(FH); E is consecutive with G(FH) and fuses into E(GFH). The fused node E(GFH) is consecutive with node D, but D is unsupported, so they are not fused. E(GFH) is not fused with node B either, because the resulting fused node B(EGFH) would form a loop with the unsupported node D. Although E(GFH) and B cannot fuse, the supported and consecutive nodes B and A can, giving the fused node A(B). The nodes A, B, C, D, E, G, F, H, I of the original model structure graph thus become A(B), C, D, E(GFH), I.
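A hedged sketch of the loop check used during this reverse-topological fusion; the successor lists below reconstruct FIG. 4 from the narrative and are therefore an assumption.

```python
from typing import Dict, List, Set

edges: Dict[str, List[str]] = {   # successor lists reconstructed from FIG. 4
    "A": ["B"], "B": ["C", "E"], "C": ["D"], "D": ["E"],
    "E": ["G"], "G": ["F"], "F": ["H"], "H": ["I"], "I": [],
}

def creates_cycle(group: Set[str], node: str) -> bool:
    """Would merging `node` into `group` close a loop through outside nodes?"""
    stack = [s for s in edges[node] if s not in group]
    seen: Set[str] = set()
    while stack:
        cur = stack.pop()
        if cur in group:
            return True  # a path leaves the merged node and re-enters it
        if cur not in seen:
            seen.add(cur)
            stack.extend(edges[cur])
    return False

group = {"E", "G", "F", "H"}      # E(GFH) after the earlier merges
print(creates_cycle(group, "B"))  # True: B -> C -> D -> E re-enters the group
print(creates_cycle({"B"}, "A"))  # False: A(B) is safe to form
```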
When the deep learning model targets several underlying engines, the model structure graph is segmented as above for each engine, and the sum of node weights is computed during segmentation, with each node's weight defaulting to 1. After segmenting the graph, the segmentation scheme is cached; at the next segmentation, the cached subgraph with the largest weight is taken as the target subgraph, the corresponding consecutive operators are replaced with the target subgraph to obtain an updated model structure graph, and the updated graph is then segmented further.
It should be appreciated that node fusion reduces the time spent on data input and output between subgraphs and hence the kernel launch overhead of the underlying engine. After the consecutive operators are fused, the consecutive engine-supported operators become a fusion node with a fixed static network structure, which is serialized into a new node.
Step 3: input the data to be predicted into the deep learning model corresponding to the optimized structure graph to obtain the prediction result.
Through these embodiments of the disclosure, automatically type-annotating the variables of the model trained in the dynamic Python language and replacing the syntax and operators that TorchScript does not support resolves the related-art difficulty of converting the model's dynamic language into a static one. Replacing operators supported by TensorRT, fusing multiple consecutive TensorRT-supported operators, and implementing TensorRT-unsupported operators natively in PyTorch removes the related-art need to write plugins for unsupported operators and the lack of support for the graph compilation engine XTCL. Model inference based on the optimized deep learning model obtained through these embodiments is accelerated efficiently: in one embodiment, applying them to the neural network model ResNet-50 improved its throughput (QPS) by 64.8%.
Fig. 6 is a block diagram of a model inference acceleration device according to an embodiment of the present disclosure. Referring to fig. 6, the model inference accelerating apparatus includes a first obtaining module 601, a second obtaining module 602, a converting module 603, and a loading module 604, which will be described in detail below.
A first obtaining module 601, configured to obtain an original deep learning model obtained based on dynamic language training; a second obtaining module 602, connected to the first obtaining module 601, configured to obtain a corresponding relationship between a first description and a second description, where the first description is a description of a target object in the original deep learning model based on a dynamic language, and the second description is a description of the target object in the original deep learning model based on a predetermined static language; a conversion module 603, connected to the second obtaining module 602, configured to convert, based on the correspondence, a target object in the original deep learning model into an object described based on a predetermined static language, so as to obtain a target deep learning model; and a loading module 604, connected to the converting module 603, for loading the target deep learning model into a predetermined deep learning inference optimizer to obtain an optimized target deep learning model.
It should be noted here that the first obtaining module 601, the second obtaining module 602, the conversion module 603, and the loading module 604 correspond to steps S101 to S104 of the model inference acceleration method; the modules match the corresponding steps in implementation examples and application scenarios, but are not limited to the disclosure of the foregoing embodiments.
In some optional embodiments, the conversion module includes at least one of: an annotation unit, configured to add, when the target object is a variable of the original deep learning model, a type annotation identifying the variable's type as a target type recognized by the predetermined static language; a first replacing unit, configured to replace an original operator of the original model that the predetermined static language does not support with a target operator it does support; and a second replacing unit, configured to replace original syntax of the original model that the predetermined static language does not support with target syntax it does support.
In some optional embodiments, the loading module includes: a detection unit, configured to detect whether a first operator of the target deep learning model has a second operator with a mapping relationship in the predetermined deep learning inference optimizer, where the first operator is any operator of the target model; and a mapping unit, configured to load the target deep learning model into the optimizer by mapping each first operator to its corresponding second operator, obtaining the optimized target deep learning model.
In some optional embodiments, the mapping unit comprises: a fusion subunit, configured to fuse continuous operators to obtain a fusion operator when a plurality of first operators in the target deep learning model each have a second operator with a mapping relationship and continuous operators exist among those second operators; and a mapping subunit, configured to replace the continuous operators with the fusion operator and load the target deep learning model into the predetermined deep learning inference optimizer, obtaining an optimized target deep learning model.
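A minimal sketch of the fusion subunit's behavior, reusing the hypothetical operator names above and assuming a single hard-coded fusible pattern (the disclosure does not fix which patterns are fusible):

```python
from typing import Dict, List, Tuple

# Hypothetical: one run of three continuous second operators collapses
# into a single fusion operator.
FUSIBLE: Dict[Tuple[str, str, str], str] = {
    ("trt::Convolution", "trt::Scale", "trt::Activation"): "trt::FusedConvBNReLU",
}

def fuse_continuous(ops: List[str]) -> List[str]:
    fused: List[str] = []
    i = 0
    while i < len(ops):
        window = tuple(ops[i:i + 3])
        if window in FUSIBLE:
            fused.append(FUSIBLE[window])  # replace the run with the fusion operator
            i += 3
        else:
            fused.append(ops[i])
            i += 1
    return fused
```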
In some optional embodiments, the fusion subunit comprises: a determining secondary subunit, configured to determine the operators to be fused among the continuous operators; and a fusion secondary subunit, configured to fuse the operators to be fused to obtain the fusion operator.
In some optional embodiments, the determining secondary subunit is further configured to: determine a plurality of fusion modes based on the continuous operators; obtain a weighted value of each fusion mode; determine a target fusion mode from the plurality of fusion modes based on those weighted values; and determine the operators included in the target fusion mode as the operators to be fused among the continuous operators.
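A sketch of the weighted selection; the candidate fusion modes and their weighted values are invented for illustration, since the disclosure does not specify how the weights are computed:

```python
from typing import Dict

def pick_target_fusion_mode(weighted_modes: Dict[str, float]) -> str:
    """Choose the fusion mode with the largest weighted value."""
    return max(weighted_modes, key=lambda mode: weighted_modes[mode])

candidates = {"conv+bn": 0.6, "conv+bn+relu": 0.9, "bn+relu": 0.4}
target_mode = pick_target_fusion_mode(candidates)  # -> "conv+bn+relu"
```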
In some optional embodiments, the model inference acceleration apparatus further comprises: a compiling module, configured to compile the optimized target deep learning model to obtain a compiled model; a receiving module, configured to receive data to be predicted; and a prediction module, configured to input the data to be predicted into the compiled model to obtain prediction result data.
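Tying these three modules together, a hedged end-to-end sketch: the file name and input shape reuse the hypothetical `SimpleNet` example above, and `torch.jit.freeze` stands in for whatever compilation step a given embodiment uses:

```python
import torch

optimized = torch.jit.load("simple_net.pt").eval()  # optimized target model
compiled = torch.jit.freeze(optimized)              # compiling module's step

data_to_predict = [torch.randn(1, 128)]             # receiving module's input
with torch.no_grad():
    prediction = compiled(data_to_predict)          # prediction result data
print(prediction.shape)                             # torch.Size([1, 10])
```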
According to an embodiment of the present disclosure, the present disclosure also provides an electronic device.
Fig. 7 illustrates a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital processing devices, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown here, their connections and relationships, and their functions are meant as examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in Fig. 7, the electronic device 700 includes a computing unit 701, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. The RAM 703 can also store various programs and data required for the operation of the electronic device 700. The computing unit 701, the ROM 702, and the RAM 703 are connected to one another by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
A number of components in the electronic device 700 are connected to the I/O interface 705, including: an input unit 706, such as a keyboard or a mouse; an output unit 707, such as various types of displays and speakers; a storage unit 708, such as a magnetic disk or an optical disk; and a communication unit 709, such as a network card, a modem, or a wireless communication transceiver. The communication unit 709 allows the electronic device 700 to exchange information/data with other devices via a computer network such as the Internet and/or various telecommunication networks.
The computing unit 701 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 701 executes the methods and processes described above, such as the model inference acceleration method. For example, in some embodiments, the model inference acceleration method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the model inference acceleration method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the model inference acceleration method by any other suitable means (e.g., by means of firmware).
The present disclosure also provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform any of the methods described above.
The present disclosure also provides a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any of the above.
The present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, implements the method of any of the above.
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special- or general-purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (17)

1. A model inference acceleration method, comprising:
acquiring an original deep learning model trained based on a dynamic language;
acquiring a correspondence between a first description and a second description, wherein the first description is a description of a target object in the original deep learning model based on the dynamic language, and the second description is a description of the target object in the original deep learning model based on a predetermined static language;
converting, based on the correspondence, the target object in the original deep learning model into an object described based on the predetermined static language, to obtain a target deep learning model; and
loading the target deep learning model into a predetermined deep learning inference optimizer to obtain an optimized target deep learning model.
2. The method according to claim 1, wherein converting, based on the correspondence, the target object in the original deep learning model into an object described based on the predetermined static language to obtain the target deep learning model comprises at least one of:
performing type annotation on a variable in the original deep learning model when the target object is the variable in the original deep learning model, wherein the type annotation identifies that the type of the variable is a target type recognized by the predetermined static language;
replacing an original operator with a target operator supported by the predetermined static language when the target object is an original operator in the original deep learning model that is not supported by the predetermined static language; and
replacing an original grammar with a target grammar supported by the predetermined static language when the target object is an original grammar in the original deep learning model that is not supported by the predetermined static language.
3. The method of claim 1, wherein loading the target deep learning model into the predetermined deep learning inference optimizer to obtain the optimized target deep learning model comprises:
detecting whether a first operator in the target deep learning model has a second operator with a mapping relationship in the predetermined deep learning inference optimizer, wherein the first operator is any operator in the target deep learning model; and
loading the target deep learning model into the predetermined deep learning inference optimizer by mapping the first operator in the target deep learning model to the corresponding second operator, to obtain the optimized target deep learning model.
4. The method of claim 3, wherein loading the target deep learning model into the predetermined deep learning inference optimizer by mapping the first operator in the target deep learning model to the corresponding second operator, to obtain the optimized target deep learning model, comprises:
fusing continuous operators to obtain a fusion operator when a plurality of first operators in the target deep learning model each have a second operator with a mapping relationship and continuous operators exist among those second operators; and
replacing the continuous operators with the fusion operator, and loading the target deep learning model into the predetermined deep learning inference optimizer to obtain the optimized target deep learning model.
5. The method of claim 4, wherein fusing the continuous operators to obtain the fusion operator comprises:
determining operators to be fused among the continuous operators; and
fusing the operators to be fused to obtain the fusion operator.
6. The method of claim 5, wherein determining the operators to be fused among the continuous operators comprises:
determining a plurality of fusion modes based on the continuous operators;
obtaining a weighted value of each of the plurality of fusion modes;
determining a target fusion mode from the plurality of fusion modes based on the weighted values; and
determining the operators included in the target fusion mode as the operators to be fused among the continuous operators.
7. The method according to any one of claims 1 to 6, further comprising, after loading the target deep learning model into the predetermined deep learning inference optimizer to obtain the optimized target deep learning model:
compiling the optimized target deep learning model to obtain a compiled model;
receiving data to be predicted;
and inputting the data to be predicted into the compiled model to obtain prediction result data.
8. A model inference acceleration apparatus, comprising:
a first obtaining module configured to obtain an original deep learning model trained based on a dynamic language;
a second obtaining module configured to obtain a correspondence between a first description and a second description, wherein the first description is a description of a target object in the original deep learning model based on the dynamic language, and the second description is a description of the target object in the original deep learning model based on a predetermined static language;
a conversion module configured to convert, based on the correspondence, the target object in the original deep learning model into an object described based on the predetermined static language, to obtain a target deep learning model; and
a loading module configured to load the target deep learning model into a predetermined deep learning inference optimizer to obtain an optimized target deep learning model.
9. The apparatus of claim 8, wherein the conversion module comprises at least one of:
an annotation unit configured to perform type annotation on a variable in the original deep learning model when the target object is the variable in the original deep learning model, wherein the type annotation identifies that the type of the variable is a target type recognized by the predetermined static language;
a first replacing unit configured to replace an original operator with a target operator supported by the predetermined static language when the target object is an original operator in the original deep learning model that is not supported by the predetermined static language; and
a second replacing unit configured to replace an original grammar with a target grammar supported by the predetermined static language when the target object is an original grammar in the original deep learning model that is not supported by the predetermined static language.
10. The apparatus of claim 8, wherein the loading module comprises:
a detection unit configured to detect whether a first operator in the target deep learning model has a second operator with a mapping relationship in the predetermined deep learning inference optimizer, wherein the first operator is any operator in the target deep learning model; and
a mapping unit configured to load the target deep learning model into the predetermined deep learning inference optimizer by mapping the first operator in the target deep learning model to the corresponding second operator, to obtain the optimized target deep learning model.
11. The apparatus of claim 10, wherein the mapping unit comprises:
a fusion subunit configured to fuse continuous operators to obtain a fusion operator when a plurality of first operators in the target deep learning model each have a second operator with a mapping relationship and continuous operators exist among those second operators; and
a mapping subunit configured to replace the continuous operators with the fusion operator and load the target deep learning model into the predetermined deep learning inference optimizer to obtain the optimized target deep learning model.
12. The apparatus of claim 11, wherein the fusion subunit comprises:
a determining secondary subunit configured to determine operators to be fused among the continuous operators; and
a fusion secondary subunit configured to fuse the operators to be fused to obtain the fusion operator.
13. The apparatus of claim 12, wherein the determining secondary subunit is further configured to: determine a plurality of fusion modes based on the continuous operators; obtain a weighted value of each of the plurality of fusion modes; determine a target fusion mode from the plurality of fusion modes based on the weighted values; and determine the operators included in the target fusion mode as the operators to be fused among the continuous operators.
14. The apparatus of any of claims 8 to 13, wherein the apparatus further comprises:
a compiling module configured to compile the optimized target deep learning model to obtain a compiled model;
a receiving module configured to receive data to be predicted; and
a prediction module configured to input the data to be predicted into the compiled model to obtain prediction result data.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 7.
16. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1 to 7.
17. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 7.
CN202210374235.1A 2022-04-11 2022-04-11 Model reasoning acceleration method and device, electronic equipment and storage medium Pending CN114691148A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210374235.1A CN114691148A (en) 2022-04-11 2022-04-11 Model reasoning acceleration method and device, electronic equipment and storage medium
PCT/CN2022/126151 WO2023197554A1 (en) 2022-04-11 2022-10-19 Model reasoning acceleration method and apparatus, and electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210374235.1A CN114691148A (en) 2022-04-11 2022-04-11 Model reasoning acceleration method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114691148A true CN114691148A (en) 2022-07-01

Family

ID=82142102

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210374235.1A Pending CN114691148A (en) 2022-04-11 2022-04-11 Model reasoning acceleration method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN114691148A (en)
WO (1) WO2023197554A1 (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113449858A (en) * 2020-03-27 2021-09-28 华为技术有限公司 Processing method of neural network model and related equipment
CN112418427A (en) * 2020-11-25 2021-02-26 广州虎牙科技有限公司 Method, device, system and equipment for providing deep learning unified reasoning service
CN114691148A (en) * 2022-04-11 2022-07-01 北京百度网讯科技有限公司 Model reasoning acceleration method and device, electronic equipment and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200326934A1 (en) * 2020-06-26 2020-10-15 Intel Corporation System to analyze and enhance software based on graph attention networks
CN112819153A (en) * 2020-12-31 2021-05-18 杭州海康威视数字技术股份有限公司 Model transformation method and device
CN113342345A (en) * 2021-05-17 2021-09-03 北京百度网讯科技有限公司 Operator fusion method and device of deep learning framework
CN113448545A (en) * 2021-06-23 2021-09-28 北京百度网讯科技有限公司 Method, apparatus, storage medium, and program product for machine learning model servitization
CN113379070A (en) * 2021-08-13 2021-09-10 苏州浪潮智能科技有限公司 Deep learning framework conversion method, system, storage medium and equipment
CN113986234A (en) * 2021-09-19 2022-01-28 苏州浪潮智能科技有限公司 Cross-platform model reasoning method, system, storage medium and equipment

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023197554A1 (en) * 2022-04-11 2023-10-19 北京百度网讯科技有限公司 Model reasoning acceleration method and apparatus, and electronic device and storage medium
CN115809688A (en) * 2022-08-24 2023-03-17 北京百度网讯科技有限公司 Model debugging method and device, electronic equipment and storage medium
CN115809688B (en) * 2022-08-24 2023-10-24 北京百度网讯科技有限公司 Model debugging method and device, electronic equipment and storage medium
CN117372846A (en) * 2023-10-17 2024-01-09 湖南苏科智能科技有限公司 Target detection method, platform, device and equipment based on embedded platform

Also Published As

Publication number Publication date
WO2023197554A1 (en) 2023-10-19

Similar Documents

Publication Publication Date Title
CN114691148A (en) Model reasoning acceleration method and device, electronic equipment and storage medium
CN113342345A (en) Operator fusion method and device of deep learning framework
CN110780879B (en) Decision execution method, device, equipment and medium based on intelligent compiling technology
CN114820279B (en) Distributed deep learning method and device based on multiple GPUs and electronic equipment
US20220101194A1 (en) Method, electronic device, and computer program product for processing machine learning model
CN112527281A (en) Operator upgrading method and device based on artificial intelligence, electronic equipment and medium
CN112270413A (en) Operator merging method and device, electronic equipment and storage medium
CN115809063A (en) Storage process compiling method, system, electronic equipment and storage medium
CN112783508B (en) File compiling method, device, equipment and storage medium
CN114217848A (en) Dependency relationship processing method and device, electronic equipment and computer storage medium
US11023101B2 (en) System and method for implementing a self service machine learning framework
CN115186738B (en) Model training method, device and storage medium
CN115809688B (en) Model debugging method and device, electronic equipment and storage medium
US20220207427A1 (en) Method for training data processing model, electronic device and storage medium
US9921814B2 (en) Control flow graph analysis
CN110727428B (en) Method and device for converting service logic layer codes and electronic equipment
CN113138760A (en) Page generation method and device, electronic equipment and medium
CN113010182B (en) Method and device for generating upgrade file and electronic equipment
CN113051479B (en) File processing and recommendation information generation methods, devices, equipment and storage medium
CN114780021B (en) Copy repairing method and device, electronic equipment and storage medium
CN114880357A (en) Source code information retrieval method and device, electronic equipment, storage medium and product
CN117648092A (en) Byte code processing method and device
CN117667112A (en) Self-adaptive generation method of front-end development document based on babel
CN115525295A (en) Automatic code editing method and device, electronic equipment and storage medium
CN115563183A (en) Query method, device and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination