
Method and device for determining execution code of deep learning model and storage medium

Info

Publication number
CN113031952A
Authority
CN
China
Prior art keywords
intermediate code
data
cache
cache optimization
code
Prior art date
Legal status: Pending
Application number
CN201911356174.0A
Other languages
Chinese (zh)
Inventor
王劭杰
章放
刘伟良
韩新承
江欣聪
Current Assignee
Shanghai Goldway Intelligent Transportation System Co Ltd
Original Assignee
Shanghai Goldway Intelligent Transportation System Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Goldway Intelligent Transportation System Co Ltd filed Critical Shanghai Goldway Intelligent Transportation System Co Ltd
Priority to CN201911356174.0A
Publication of CN113031952A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/40 Transformation of program code

Abstract

The application discloses a method and a device for determining the execution code of a deep learning model, and a storage medium, belonging to the technical field of data processing. The method includes: determining, based on the data processing manner of the intermediate code of a deep learning model, at least one preprocessed intermediate code corresponding to the intermediate code, the preprocessed intermediate codes being different from one another; converting each preprocessed intermediate code into an executable code to obtain at least one executable code; and determining the execution code of the deep learning model from the at least one executable code. Because at least one preprocessed intermediate code is determined according to the different data processing manners of the intermediate code, and at least one executable code is obtained correspondingly, the execution code of the deep learning model with the highest running efficiency can be determined from the at least one executable code, which avoids the problem that the execution code of the deep learning model can only be determined according to a fixed mode.

Description

Method and device for determining execution code of deep learning model and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method and an apparatus for determining an execution code of a deep learning model, and a storage medium.
Background
With the rapid development of data processing technology, deep learning models are widely applied in fields such as computer vision and natural language processing. In general, the source code of a deep learning model cannot run directly on different types of hardware platforms. Therefore, in use, the source code of the deep learning model is usually first converted into an intermediate code, and the intermediate code is then converted, based on the configuration file of a specific hardware platform, into executable code for that platform, yielding execution code of the deep learning model that can run on the hardware platform.
However, the above code conversion process follows a fixed mode, and the execution code of the deep learning model is determined in only a single way, so the resulting execution code runs inefficiently.
Disclosure of Invention
The application provides a method, an apparatus, a device, and a storage medium for determining the execution code of a deep learning model, which can solve the problem in the related art that the execution code of a deep learning model runs inefficiently.
The technical scheme is as follows:
in one aspect, a method for determining execution codes of a deep learning model is provided, and the method includes:
determining at least one preprocessing intermediate code corresponding to an intermediate code based on a data processing mode of the intermediate code of a deep learning model, wherein the data processing mode comprises data reuse and/or data non-reuse, and the at least one preprocessing intermediate code is different from each other;
converting each preprocessed intermediate code into an executable code to obtain at least one executable code;
determining execution code for the deep learning model from the at least one executable code.
In one possible implementation manner of the present application, the method is applied to any one of the scenarios of target detection, target tracking, semantic segmentation, speech recognition, character recognition, and natural language processing.
In one possible implementation manner of the present application, the intermediate code includes a plurality of intermediate code blocks, each intermediate code block being used for processing different tensor data, and the determining at least one preprocessed intermediate code corresponding to the intermediate code based on a data processing manner of the intermediate code of the deep learning model includes:
using the intermediate code as a preprocessed intermediate code; and/or,
performing cache optimization processing on at least one of the plurality of intermediate code blocks based on the data processing manner of each intermediate code block;
and taking, as a preprocessed intermediate code, an intermediate code composed of the at least one cache-optimized intermediate code block and all intermediate code blocks not subjected to cache optimization.
In a possible implementation manner of the present application, the performing, based on a data processing manner of each intermediate code block, cache optimization processing on at least one intermediate code block in the plurality of intermediate code blocks includes:
when the data processing manners of the plurality of intermediate code blocks all comprise data non-reuse, performing double-cache optimization processing on at least one of the plurality of intermediate code blocks;
the taking, as a preprocessed intermediate code, of an intermediate code composed of the at least one cache-optimized intermediate code block and all intermediate code blocks not subjected to cache optimization includes:
taking, as a preprocessed intermediate code, an intermediate code composed of the at least one double-cache-optimized intermediate code block and all intermediate code blocks not subjected to double-cache optimization.
In a possible implementation manner of the present application, the performing, based on a data processing manner of each intermediate code block, cache optimization processing on at least one intermediate code block in the plurality of intermediate code blocks includes:
when the data processing manner of some of the plurality of intermediate code blocks comprises data reuse and the data processing manner of others comprises data non-reuse, performing double-cache optimization processing on at least one intermediate code block whose data processing manner comprises data non-reuse, and performing ring cache optimization processing on at least one intermediate code block whose data processing manner comprises data reuse;
the taking, as a preprocessed intermediate code, of an intermediate code composed of the at least one cache-optimized intermediate code block and all intermediate code blocks not subjected to cache optimization includes:
taking, as a preprocessed intermediate code, an intermediate code composed of the at least one double-cache-optimized intermediate code block, the at least one ring-cache-optimized intermediate code block, and all intermediate code blocks not subjected to cache optimization.
In a possible implementation manner of the present application, the performing, based on a data processing manner of each intermediate code block, cache optimization processing on at least one intermediate code block in the plurality of intermediate code blocks includes:
when the data processing manners of the plurality of intermediate code blocks all comprise data reuse, performing ring cache optimization processing on at least one of the plurality of intermediate code blocks;
the taking, as a preprocessed intermediate code, of an intermediate code composed of the at least one cache-optimized intermediate code block and all intermediate code blocks not subjected to cache optimization includes:
taking, as a preprocessed intermediate code, an intermediate code composed of the at least one ring-cache-optimized intermediate code block and all intermediate code blocks not subjected to ring cache optimization.
In one possible implementation manner of the present application, the determining, from the at least one executable code, execution code of the deep learning model includes:
determining an execution performance index of each executable code based on tensor data to be tested;
determining execution code for the deep learning model from the at least one executable code based on the execution performance indicators for each executable code.
In a possible implementation manner of the present application, the operation of performing double-cache optimization processing on one intermediate code block in the intermediate code includes the following steps:
changing part of the node information in a syntax tree so that the changed intermediate code block supports the following operations:
alternately caching input data into two input caches, and, each time data is cached, performing calculation processing on the input cache that is not currently being written;
recording which of the two input caches holds the first input data;
alternately caching the calculated data into two output caches, and, each time data is cached, outputting the data in the output cache that is not currently being written;
wherein the syntax tree is used for indicating the running logic of the intermediate code block, and each node information in the syntax tree is used for indicating a running step of the intermediate code block.
In a possible implementation manner of the present application, the operation of performing ring cache optimization processing on one intermediate code block in the intermediate code includes the following steps:
changing part of the node information in a syntax tree so that the changed intermediate code block supports the following operations:
performing ring cache processing on the input data;
alternately caching the ring-cached data into two output caches, and, each time data is cached, outputting the data in the output cache that is not currently being written;
wherein the syntax tree is used for indicating the running logic of the intermediate code block, and each node information in the syntax tree is used for indicating a running step of the intermediate code block.
In another aspect, an apparatus for determining execution code of a deep learning model is provided, the apparatus including:
a first determining module, configured to determine at least one preprocessed intermediate code corresponding to an intermediate code based on a data processing manner of the intermediate code of a deep learning model, where the data processing manner comprises data reuse and/or data non-reuse, and the at least one preprocessed intermediate code is different from each other;
the conversion module is used for converting each preprocessing intermediate code into an executable code to obtain at least one executable code;
a second determination module to determine execution code of the deep learning model from the at least one executable code.
In one possible implementation manner of the present application, the apparatus is applied to any scenario of target detection, target tracking, semantic segmentation, speech recognition, character recognition, and natural language processing.
In a possible implementation manner of the present application, the intermediate code includes a plurality of intermediate code blocks, each of the intermediate code blocks is used for processing different tensor data, and the first determining module is configured to:
using the intermediate code as a preprocessed intermediate code; and/or,
performing cache optimization processing on at least one of the plurality of intermediate code blocks based on the data processing manner of each intermediate code block;
and taking, as a preprocessed intermediate code, an intermediate code composed of the at least one cache-optimized intermediate code block and all intermediate code blocks not subjected to cache optimization.
In one possible implementation manner of the present application, the first determining module is configured to:
when the data processing manners of the plurality of intermediate code blocks all comprise data non-reuse, performing double-cache optimization processing on at least one of the plurality of intermediate code blocks;
and taking, as a preprocessed intermediate code, an intermediate code composed of the at least one double-cache-optimized intermediate code block and all intermediate code blocks not subjected to double-cache optimization.
In one possible implementation manner of the present application, the first determining module is configured to:
when the data processing manner of some of the plurality of intermediate code blocks comprises data reuse and the data processing manner of others comprises data non-reuse, performing double-cache optimization processing on at least one intermediate code block whose data processing manner comprises data non-reuse, and performing ring cache optimization processing on at least one intermediate code block whose data processing manner comprises data reuse;
and taking, as a preprocessed intermediate code, an intermediate code composed of the at least one double-cache-optimized intermediate code block, the at least one ring-cache-optimized intermediate code block, and all intermediate code blocks not subjected to cache optimization.
In one possible implementation manner of the present application, the first determining module is configured to:
when the data processing manners of the plurality of intermediate code blocks all comprise data reuse, performing ring cache optimization processing on at least one of the plurality of intermediate code blocks;
and taking, as a preprocessed intermediate code, an intermediate code composed of the at least one ring-cache-optimized intermediate code block and all intermediate code blocks not subjected to ring cache optimization.
In one possible implementation manner of the present application, the second determining module is configured to:
determining an execution performance index of each executable code based on tensor data to be tested;
determining execution code for the deep learning model from the at least one executable code based on the execution performance indicators for each executable code.
In one possible implementation manner of the present application, the first determining module is configured to:
changing part of the node information in a syntax tree so that the changed intermediate code block supports the following operations:
alternately caching input data into two input caches, and, each time data is cached, performing calculation processing on the input cache that is not currently being written;
recording which of the two input caches holds the first input data;
alternately caching the calculated data into two output caches, and, each time data is cached, outputting the data in the output cache that is not currently being written;
wherein the syntax tree is used for indicating the running logic of the intermediate code block, and each node information in the syntax tree is used for indicating a running step of the intermediate code block.
In one possible implementation manner of the present application, the first determining module is configured to:
changing part of the node information in a syntax tree so that the changed intermediate code block supports the following operations:
performing ring cache processing on the input data;
alternately caching the ring-cached data into two output caches, and, each time data is cached, outputting the data in the output cache that is not currently being written;
wherein the syntax tree is used for indicating the running logic of the intermediate code block, and each node information in the syntax tree is used for indicating a running step of the intermediate code block.
In another aspect, an electronic device is provided, which includes a processor, a communication interface, a memory, and a communication bus. The processor, the communication interface, and the memory communicate with one another through the communication bus; the memory is used to store a computer program, and the processor is used to execute the program stored in the memory to implement the steps of the method for determining the execution code of a deep learning model according to the above aspect.
In another aspect, a computer-readable storage medium is provided, in which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the method for determining the execution code of the deep learning model according to the above aspect.
The technical scheme provided by the application can at least bring the following beneficial effects:
the data processing mode of the intermediate code of the deep learning model can comprise data reuse and/or data non-reuse, and at least one different preprocessing intermediate code corresponding to the intermediate code can be determined according to the different data processing modes of the intermediate code. And converting each preprocessing intermediate code into an executable code to obtain at least one executable code, so that an execution code of a deep learning model with higher operation efficiency can be determined from the at least one executable code, and the problem that the execution code of the deep learning model can only be determined according to a fixed mode is avoided.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a flowchart of a method for determining execution codes of a deep learning model according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a double-cache optimization process according to an embodiment of the present application;
FIG. 3 is a diagram of another double-cache optimization process according to an embodiment of the present application;
FIG. 4 is a diagram of another double-cache optimization process according to an embodiment of the present application;
FIG. 5 is a diagram of another double-cache optimization process according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a ring cache optimization process according to an embodiment of the present application;
FIG. 7 is a diagram of another ring cache optimization process according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of an apparatus for determining execution code of a deep learning model according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Before explaining the method for determining the execution code of the deep learning model provided in the embodiment of the present application in detail, relevant terms, execution subjects, and application scenarios provided in the embodiment of the present application are introduced.
First, related terms related to the embodiments of the present application will be briefly described.
Deep learning compiler: a compiler that converts the source code of a deep learning model to obtain executable code applicable to a specific hardware platform. Further, the deep learning compiler can perform code optimization processing, so that the execution code of the deep learning model it outputs runs more efficiently.
Intermediate code: also called IR (Intermediate Representation), intermediate expression, or intermediate language; a term widely used in the computer industry referring to an internal representation code that is easily converted into executable code and is equivalent to the source code.
Syntax tree: a graphical representation of the structure of code, representing the running logic of the code.
Node: a point in the syntax tree that represents an operational relationship.
Cache optimization: an optimization of code that allows data movement and data calculation to be performed in parallel, hiding memory access time and improving chip utilization.
Double-cache optimization: a processing mode for code that enables parallel processing of data through two caches.
Ring cache optimization: a processing mode for code that enables the code to reuse data.
DSP (Digital Signal Processor): a general-purpose chip often used for signal processing and image processing.
Next, a brief description will be given of an execution body related to an embodiment of the present application.
The method for determining the execution code of the deep learning model provided by the embodiment of the application can be executed by an electronic device, a deep learning compiler can be arranged in the electronic device, and the electronic device can realize the method for determining the execution code of the deep learning model through the deep learning compiler. As an example, the electronic device may be a notebook computer, a portable computer, a desktop computer, and the like, which is not limited in this application.
For ease of understanding, the following description is directed to the general working process of a deep learning compiler, which generally includes the following steps:
1. Lexical analysis: processing the words formed by characters, scanning the source code character by character from left to right, and producing word symbols.
2. Syntax analysis: taking the word symbols as input, analyzing whether a string of word symbols forms a syntactic unit that conforms to the grammar rules, such as an expression, an assignment, or a loop, and checking, according to the grammar rules of the language, whether each statement has a correct logical structure.
3. Semantic checking and intermediate code generation: the intermediate code is an internal representation of the source code; intermediate code can make the code structure logically simpler and clearer.
4. Code optimization: performing equivalent transformation on the intermediate code so that the executable code generated from the transformed intermediate code runs more efficiently, i.e., the executable code takes less running time and occupies less storage space. In the embodiments of the present application, this step is optional.
5. Executable code generation: the last stage of compilation, in which the parsed or optimized intermediate code is converted into executable code.
Next, a brief description is given of an application scenario related to the embodiment of the present application.
The method for determining the execution code of the deep learning model provided by the embodiment of the application can be applied to acceleration operation of the deep learning model in any scene such as target detection, target tracking, semantic segmentation, voice recognition, character recognition, natural language processing and the like.
After the related terms and execution subjects related to the embodiments of the present application are described, a method for determining execution codes of a deep learning model provided by the embodiments of the present application will be described in detail with reference to the accompanying drawings.
Referring to fig. 1, fig. 1 is a flowchart of a method for determining execution code of a deep learning model, which may be implemented by the execution subject, and the method may include the following implementation steps.
Step 101: and determining at least one preprocessing intermediate code corresponding to the intermediate code based on a data processing mode of the intermediate code of the deep learning model, wherein the data processing mode comprises data reuse and/or data non-reuse, and the at least one preprocessing intermediate code is different from each other.
The deep learning model may be a model trained on a hardware platform, and the hardware platform may be an ARM (Advanced RISC Machine, a reduced instruction set computing (RISC) microprocessor), a DSP, or the like. The deep learning model may be built with Caffe, TensorFlow, PyTorch, or the like, which is not limited in this embodiment.
In this embodiment, the data processing manner includes data reuse and/or data non-reuse; that is, the data processing manner may be data reuse only, data non-reuse only, or both data reuse and data non-reuse. Data reuse means that when the intermediate code actually runs (or, equivalently, when the executable code corresponding to the intermediate code runs), one input datum may be used in multiple operations, i.e., one input datum may be processed multiple times. Data non-reuse means that when the intermediate code actually runs, each input datum is used in only one operation, i.e., each input datum is processed only once.
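To make this distinction concrete, the following minimal Python sketch (an illustrative example assumed for this description; the function names and the one-dimensional data are not part of the application) contrasts an element-wise operation, in which each input datum is processed exactly once (data non-reuse), with a sliding-window operation, in which one input datum participates in several operations (data reuse):

def elementwise_square(data):
    # Data non-reuse: each input datum is read for exactly one operation.
    return [x * x for x in data]

def sliding_window_sum(data, window=3):
    # Data reuse: each input datum (away from the edges) is read by
    # `window` different operations.
    return [sum(data[i:i + window]) for i in range(len(data) - window + 1)]

data = [1, 2, 3, 4, 5]
print(elementwise_square(data))   # [1, 4, 9, 16, 25]
print(sliding_window_sum(data))   # [6, 9, 12]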
In general, the source code of the deep learning model cannot be adapted to different types of hardware platforms. For this reason, when the deep learning model needs to be run in a specific hardware platform, the source code of the deep learning model can be generally converted into an intermediate code, that is, an internal representation code equivalent to the source code. After the intermediate code is obtained, further processing may be performed on the intermediate code to generate pre-processed intermediate code.
In this embodiment, at least one mutually different preprocessed intermediate code corresponding to the intermediate code may be determined according to the different data processing manners of the intermediate code. As an example, the specific implementation may include: when the intermediate code includes a plurality of intermediate code blocks, each intermediate code block being used for processing different tensor data, using the intermediate code as a preprocessed intermediate code; and/or performing cache optimization processing on at least one of the plurality of intermediate code blocks based on the data processing manner of each intermediate code block, and taking, as a preprocessed intermediate code, an intermediate code composed of the at least one cache-optimized intermediate code block and all intermediate code blocks not subjected to cache optimization.
That is, the intermediate code may include a plurality of intermediate code blocks, and the plurality of intermediate code blocks may be used to process a plurality of tensor data in parallel. Taking the case where the intermediate code includes a plurality of intermediate code blocks as an example, the process of determining at least one preprocessed intermediate code corresponding to the intermediate code is described below.
In this embodiment, tensor data may be regarded as a three-dimensional data set, and each input datum may be a row of data in the data set or a single element of the data set.
In case the intermediate code blocks have different data processing manners, the obtained at least one preprocessed intermediate code can be divided into two classes. In the first class, none of the intermediate code blocks in the intermediate code undergoes cache optimization, i.e., the intermediate code itself is directly used as a preprocessed intermediate code. In the second class, cache optimization is performed on some of the intermediate code blocks while the remaining blocks are left unoptimized, and the processed intermediate code is used as a preprocessed intermediate code; since each intermediate code block may be cache-optimized in one or more ways, there may correspondingly be one or more preprocessed intermediate codes of the second class.
Therefore, the at least one preprocessed intermediate code may consist of the first-class preprocessed intermediate code, of any one or more second-class preprocessed intermediate codes, or of the first-class preprocessed intermediate code together with any one or more second-class preprocessed intermediate codes.
For example, suppose the intermediate code includes two intermediate code blocks a and b, and there are two cache optimization modes 1 and 2. The intermediate code may be used directly as a preprocessed intermediate code. Alternatively, block a may be optimized according to mode 1 and block b according to mode 1, with the processed intermediate code used as a preprocessed intermediate code; or block a may be optimized according to mode 1 and block b according to mode 2; or only block a may be optimized according to mode 1 while block b is left unoptimized, with the resulting intermediate code used as a preprocessed intermediate code; and so on. In this way, a plurality of second-class preprocessed intermediate codes can be obtained.
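A small Python sketch may help illustrate this enumeration (the data structures below are assumptions made for illustration; they do not come from the application). Each intermediate code block independently receives one of several cache-optimization choices, including "no optimization", and every combination yields one candidate preprocessed intermediate code:

from itertools import product

OPTIONS = ("none", "mode1", "mode2")   # "none" = no cache optimization

def enumerate_preprocessed(blocks):
    # One candidate per assignment of an optimization choice to each block.
    for choices in product(OPTIONS, repeat=len(blocks)):
        yield dict(zip(blocks, choices))

for candidate in enumerate_preprocessed(["a", "b"]):
    print(candidate)
# {'a': 'none', 'b': 'none'} corresponds to the first-class preprocessed
# intermediate code; every other combination is a second-class one.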
As an example, the cache optimization processing manners may include double-cache optimization processing and ring cache optimization processing. In this case, for an intermediate code that includes a plurality of intermediate code blocks, determining at least one preprocessed intermediate code corresponding to the intermediate code may be implemented in the following specific ways:
the first implementation mode comprises the following steps: based on the data processing mode of each intermediate code block, a specific implementation mode for performing cache optimization processing on at least one intermediate code block in the plurality of intermediate code blocks may be as follows: and when the data processing modes of the intermediate code blocks comprise data non-reuse, performing double-cache optimization processing on at least one intermediate code block in the intermediate code blocks. Correspondingly, the specific implementation manner of taking the intermediate code composed of at least one intermediate code block after the cache optimization processing and all intermediate code blocks which are not subjected to the cache optimization processing as a preprocessing intermediate code may be: and taking an intermediate code formed by at least one intermediate code block subjected to double cache optimization processing and all intermediate code blocks which are not subjected to double cache optimization processing as a preprocessing intermediate code.
When the data processing manners of the plurality of intermediate code blocks are all data non-reuse and cache optimization processing needs to be performed on at least one intermediate code block in the intermediate code, double-cache optimization processing can be performed, and a second-class preprocessed intermediate code is thereby determined. For example, when the intermediate code includes two intermediate code blocks a and b, both blocks may be double-cache-optimized, and the processed intermediate code is used as a second-class preprocessed intermediate code. Alternatively, only block a may be double-cache-optimized, and the intermediate code composed of the double-cache-optimized block a and the unoptimized block b is used as a second-class preprocessed intermediate code. Likewise, only block b may be double-cache-optimized, and the intermediate code composed of the double-cache-optimized block b and the unoptimized block a is used as a second-class preprocessed intermediate code.
The second implementation: based on the data processing manner of each intermediate code block, a specific implementation of performing cache optimization processing on at least one of the plurality of intermediate code blocks may be: when the data processing manner of some of the plurality of intermediate code blocks comprises data reuse and the data processing manner of others comprises data non-reuse, performing double-cache optimization processing on at least one intermediate code block whose data processing manner comprises data non-reuse, and performing ring cache optimization processing on at least one intermediate code block whose data processing manner comprises data reuse. Correspondingly, a specific implementation of taking, as a preprocessed intermediate code, an intermediate code composed of the at least one cache-optimized intermediate code block and all intermediate code blocks not subjected to cache optimization may be: taking, as a preprocessed intermediate code, an intermediate code composed of the at least one double-cache-optimized intermediate code block, the at least one ring-cache-optimized intermediate code block, and all intermediate code blocks not subjected to cache optimization.
The intermediate code may be considered to have a first part of intermediate code blocks whose data processing manner is data non-reuse, and these blocks may be double-cache-optimized. Similarly, the remaining intermediate code blocks may be considered a second part whose data processing manner is data reuse, and these blocks may be ring-cache-optimized.
That is, at least one intermediate code block in the first part may be double-cache-optimized, and at least one intermediate code block in the second part may be ring-cache-optimized, so that at least one second-class preprocessed intermediate code can be determined. For example, when the intermediate code includes two intermediate code blocks a and b, block a may be double-cache-optimized and block b ring-cache-optimized, and the processed intermediate code is used as a second-class preprocessed intermediate code. Alternatively, only block a may be double-cache-optimized, and the intermediate code composed of the double-cache-optimized block a and the unoptimized block b is used as a second-class preprocessed intermediate code. Likewise, only block b may be ring-cache-optimized, and the intermediate code composed of the ring-cache-optimized block b and the unoptimized block a is used as a second-class preprocessed intermediate code.
The third implementation: based on the data processing manner of each intermediate code block, a specific implementation of performing cache optimization processing on at least one of the plurality of intermediate code blocks may be: when the data processing manners of the plurality of intermediate code blocks all comprise data reuse, performing ring cache optimization processing on at least one of the plurality of intermediate code blocks. Correspondingly, a specific implementation of taking, as a preprocessed intermediate code, an intermediate code composed of the at least one cache-optimized intermediate code block and all intermediate code blocks not subjected to cache optimization may be: taking, as a preprocessed intermediate code, an intermediate code composed of the at least one ring-cache-optimized intermediate code block and all intermediate code blocks not subjected to ring cache optimization.
When the data processing manners of the plurality of intermediate code blocks are all data reuse and cache optimization processing needs to be performed on at least one intermediate code block in the intermediate code, ring cache optimization processing can be performed, and a second-class preprocessed intermediate code is thereby determined. For example, when the intermediate code includes two intermediate code blocks a and b, both blocks may be ring-cache-optimized, and the processed intermediate code is used as a second-class preprocessed intermediate code. Alternatively, only block a may be ring-cache-optimized, and the intermediate code composed of the ring-cache-optimized block a and the unoptimized block b is used as a second-class preprocessed intermediate code. Likewise, only block b may be ring-cache-optimized, and the intermediate code composed of the ring-cache-optimized block b and the unoptimized block a is used as a second-class preprocessed intermediate code.
Having introduced the various ways of generating second-class preprocessed intermediate codes, the specific processes of double-cache optimization and ring cache optimization are introduced next.
1. The operation of performing double-cache optimization processing on one intermediate code block in the intermediate code includes the following steps: changing part of the node information in a syntax tree so that the changed intermediate code block supports the following operations: (1) alternately caching input data into two input caches, and, each time data is cached, performing calculation processing on the input cache that is not currently being written; (2) recording which of the two input caches holds the first input data; (3) alternately caching the calculated data into two output caches, and, each time data is cached, outputting the data in the output cache that is not currently being written. The syntax tree is used to indicate the running logic of the intermediate code block, and each piece of node information in the syntax tree indicates one running step of the intermediate code block.
The syntax tree is generally an abstract tree structure established based on code information extracted from the source code for running, and is used for representing the running logic of the source code. It should be noted that the running logic of different intermediate code blocks may be represented by different syntax trees, i.e. in a possible implementation, one intermediate code block may correspond to one syntax tree.
Generally, the syntax tree includes a plurality of nodes, each of which may represent a run step, and the plurality of run steps constitute run logic in the syntax tree. The node information may include information such as a node type and a node location, and the modification of the node information is actually a modification of a part of codes in the intermediate code.
That is, double-cache optimization of an intermediate code block, i.e., changing part of the node information in the corresponding syntax tree, in effect makes certain modifications to part of the code in that block. Specifically, the part of the code that allocates the input caches may be modified so that the number of input caches changes from one to two. The part of the code that implements the input loop is modified so that input data can be cached into the two input caches alternately: while one input cache receives data, the data in the other is used for calculation, so input and calculation proceed in parallel. A part of the code indicating the initial input cache is added, i.e., one of the two input caches is chosen to cache the first input datum. The part of the code that allocates the output caches is modified so that the number of output caches changes from one to two. The part of the code that implements the output loop is modified so that data can be output from the two output caches alternately: while one output cache outputs data, the other stores the data produced by calculation, so output and calculation proceed in parallel.
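As a rough illustration of what "changing part of the node information" can look like, the sketch below rewrites a toy syntax tree (the dictionary-based tree shape is an assumption for this description; a real deep learning compiler's IR is more elaborate): the input and output allocation nodes go from one cache to two, and the loop nodes are marked to alternate caches per iteration:

def double_cache_rewrite(tree):
    # Return a modified copy; the original intermediate code block is kept.
    tree = dict(tree)
    tree["alloc_input"] = {"count": 2}               # one input cache -> two
    tree["alloc_output"] = {"count": 2}              # one output cache -> two
    tree["input_loop"] = {"buffer_select": "i % 2",  # alternate caches
                          "first_buffer": 0}         # cache of the 1st datum
    tree["output_loop"] = {"buffer_select": "i % 2"}
    return tree

syntax_tree = {
    "alloc_input":  {"count": 1},
    "alloc_output": {"count": 1},
    "input_loop":   {"buffer_select": "fixed"},
    "output_loop":  {"buffer_select": "fixed"},
}
print(double_cache_rewrite(syntax_tree))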
For example, label the first input cache a, the second input cache b, the first output cache c, and the second output cache d. As shown in fig. 2, at T=1 the 1st datum is written into a; at T=2 the 2nd datum is written into b while the 1st datum is calculated in a and the result is stored in c; at T=3 the 3rd datum is written into a, the 2nd datum is calculated in b and stored in d, and c outputs the 1st result; thereafter a and b alternately receive data, and c and d alternately output results. As shown in fig. 3, at T=N the Nth datum is written into b, the (N-1)th datum is calculated in a and stored in c, and d outputs the (N-2)th result; at T=N+1 the Nth datum is calculated in b and stored in d while c outputs the (N-1)th result; at T=N+2, d outputs the Nth result. Compared with the double-cache variants with two input caches and one output cache shown in fig. 4 and fig. 5, the double-cache scheme of fig. 2 and fig. 3 clearly achieves higher data processing efficiency.
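The schedule of fig. 2 and fig. 3 can be reproduced with a short simulation. The sketch below is assumed and sequential; on real hardware the load, calculate, and output stages of one step run in parallel. Input caches a/b are slots 0/1, output caches c/d are slots 0/1, and doubling each datum stands in for the real calculation:

def double_cache_run(data):
    inputs = [None, None]    # input caches a (slot 0) and b (slot 1)
    outputs = [None, None]   # output caches c (slot 0) and d (slot 1)
    emitted = []
    n = len(data)
    # Step t: load datum t, calculate datum t-1, output datum t-2. The
    # three stages always touch different caches, so they can overlap.
    for t in range(n + 2):
        if t - 2 >= 0:                                   # output stage
            emitted.append(outputs[(t - 2) % 2])
        if 0 <= t - 1 < n:                               # calculate stage
            outputs[(t - 1) % 2] = inputs[(t - 1) % 2] * 2
        if t < n:                                        # load stage
            inputs[t % 2] = data[t]
    return emitted

print(double_cache_run([1, 2, 3, 4]))   # [2, 4, 6, 8]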
2. The operation of performing ring cache optimization processing on one intermediate code block in the intermediate code includes the following steps: changing part of the node information in a syntax tree so that the changed intermediate code block supports the following operations: (1) performing ring cache processing on the input data; (2) alternately caching the ring-cached data into two output caches, and, each time data is cached, outputting the data in the output cache that is not currently being written. The syntax tree is used to indicate the running logic of the intermediate code block, and each piece of node information in the syntax tree indicates one running step of the intermediate code block.
Generally, when the data processing manner of an intermediate code block is data reuse, ring cache optimization processing can be performed on it. When, during the running of the intermediate code block, some input data are used in multiple operations, ring cache optimization allows data used by multiple operations to be input only once, avoiding repeated input of the same data and improving data processing efficiency.
Ring cache optimization of an intermediate code block, i.e., changing part of the node information in the corresponding syntax tree, likewise amounts to modifying part of the code in that block. Specifically, the part of the code that allocates the input caches may be modified so that there can be more than one input cache. The part of the code that executes the input loop is modified to determine a starting input cache, so that successive input data can be written into the input caches in order, starting from that cache. A part of the code indicating the input cache of the first input datum is added, i.e., one of the input caches is chosen to cache the first input datum. The part of the code that allocates the output caches is modified so that the number of output caches changes from one to two. The part of the code that executes the output loop is modified so that data can be output from the two output caches alternately: while one output cache outputs data, the other stores the data produced by calculation, so output and calculation proceed in parallel.
The number of input caches can be determined from the number of input caches used for one calculation and the movement span of that set of caches. As shown in fig. 6, at T=4 the input caches used for calculation are a, b, and c, i.e., 3 caches; at T=5 the input caches used for calculation are b, c, and d, so the window of caches has moved by 1. Since 3 input caches are used per calculation and the movement span is 1, the number of input caches can be determined to be 4.
Usually the starting input cache can be determined from the loop variable and the number of input caches. Let the starting input cache be index, the number of input caches be n, and the loop variable be i. The position of the starting input cache is then given by i % n, i.e., the remainder of the loop variable divided by the number of input caches. For example, as shown in fig. 6, at T=1 the loop variable is 1 and 1 % 4 = 1, so the starting input cache is the first input cache, i.e., the 1st datum is cached in a.
For example, suppose the input cache is divided into four caches a, b, c, and d, and the output cache into two caches e and f. As shown in fig. 6, at T=1 the 1st datum is written into a, and at T=2 the 2nd datum is written into b. At T=3 the 3rd datum is written into c; at T=4 the 4th datum is written into d while a, b, and c are used for calculation and the 1st result is stored in e; at T=5 the 5th datum is written into a while b, c, and d are used for calculation, the 2nd result is stored in f, and e outputs the 1st result; thereafter data are written into a, b, c, and d in turn, and e and f alternately output results. As shown in fig. 7, at T=N the Nth (last) datum is written in while d, a, b, and c are used for calculation, the (N-3)th result is stored in e, and f outputs the (N-4)th result; at T=N+1, b, c, and d are used for calculation, the (N-2)th result is stored in f, and e outputs the (N-3)th result; at T=N+2, c and d are used for calculation, the (N-1)th result is stored in e, and f outputs the (N-2)th result; at T=N+3, d is used for calculation, the Nth result is stored in f, and e outputs the (N-1)th result; at T=N+4, f outputs the Nth result.
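The following sketch simulates the ring cache schedule of fig. 6 and fig. 7 under the same assumptions as the earlier double-cache sketch (a sequential simulation of stages that would run in parallel; a 3-wide sum stands in for the real calculation). Each datum is written into the ring exactly once, at the position given by the loop variable modulo the number of caches as described above, but is read by up to three calculations, which is precisely the data reuse the ring cache provides:

WINDOW = 3              # input caches read by one calculation
SPAN = 1                # how far the calculation window moves per step
N_IN = WINDOW + SPAN    # number of input caches: 4, as in fig. 6

def ring_cache_run(data):
    ring = [None] * N_IN         # input caches a, b, c, d
    outputs = [None, None]       # output caches e, f
    emitted = []
    n = len(data)
    for t in range(n + 2):
        if t - 2 >= WINDOW - 1:                    # output stage
            emitted.append(outputs[(t - 2) % 2])
        k = t - 1                                  # window ending at datum k
        if WINDOW - 1 <= k < n:                    # calculate stage
            window = [ring[(k - j) % N_IN] for j in range(WINDOW)]
            outputs[(k - (WINDOW - 1)) % 2] = sum(window)
        if t < n:
            ring[t % N_IN] = data[t]               # each datum loaded once
    return emitted

print(ring_cache_run([1, 2, 3, 4, 5]))   # [6, 9, 12]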
Step 102: and converting each preprocessing intermediate code into an executable code to obtain at least one executable code.
At least one executable code can be obtained by converting the at least one preprocessed intermediate code. When the same preprocessed intermediate code is targeted at different hardware platforms, the executable code obtained from it also differs. Specifically, the preprocessed intermediate code may be converted into executable code as follows:
1. Parse the copy code that implements data input and output in the preprocessed intermediate code, determine the source address and destination address of the copy, the input data length, and similar information in the copy code, and record this as the first parameter information.
The input data may be a row of the tensor data or a single element of the tensor data, and the corresponding input data length differs accordingly.
2. Acquire the hardware interface code of the specific hardware platform, parse the configuration file in the hardware interface code that indicates parameter information, determine, by character matching or similar means, the destination address, source address, and input data length of the copy indicated in the configuration file, and record this as the second parameter information.
The hardware interface code and configuration file of a specific hardware platform are generally determined and provided by a manufacturer, and the hardware interface code and configuration file of different hardware platforms are different.
3. Replace the corresponding second parameter information with the first parameter information.
That is, the destination address indicated in the configuration file is changed to the destination address parsed from the intermediate code, the source address indicated in the configuration file is changed to the source address parsed from the intermediate code, and the data length indicated in the configuration file is changed to the input data length parsed from the intermediate code.
After the second parameter information of the configuration file has been replaced, the copy code in the preprocessed intermediate code is replaced with the updated hardware interface code, thereby obtaining executable code that can run on the specific hardware platform.
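A condensed Python sketch of steps 1 to 3 follows. The textual formats of the copy code, the configuration file, and the hardware interface code here are all assumptions made for illustration; as noted above, real hardware interface code and configuration files are vendor-specific:

import re

ir_copy_code = "copy(src=0x1000, dst=0x2000, len=256)"        # from the IR

hw_config = {"src": "0x0", "dst": "0x0", "len": "0"}          # second parameter info
hw_interface_template = "dma_transfer({src}, {dst}, {len});"  # vendor-provided (assumed)

def extract_first_params(copy_code):
    # Step 1: character matching (a regex here) recovers the source
    # address, destination address, and input data length.
    match = re.search(r"copy\(src=(\w+), dst=(\w+), len=(\w+)\)", copy_code)
    return dict(zip(("src", "dst", "len"), match.groups()))

first_params = extract_first_params(ir_copy_code)
hw_config.update(first_params)      # step 3: replace second with first params
executable_copy = hw_interface_template.format(**hw_config)
print(executable_copy)              # dma_transfer(0x1000, 0x2000, 256);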
Step 103: execution code for the deep learning model is determined from the at least one executable code.
The at least one executable code may be executed in a specific hardware platform, and generally, a user may select one executable code from the at least one executable code according to actual conditions for executing the deep learning model on the specific hardware platform.
The specific implementation manner of determining the execution code of the deep learning model from the at least one executable code may be: based on tensor data to be tested, an execution performance index of each executable code is determined. The execution code of the deep learning model is determined from the at least one executable code based on the execution performance indicators of each executable code.
Generally, the size of the tensor data influences the execution performance of the executable code in a specific hardware platform, and for the same size of tensor data, the difference of the data in the tensor data does not influence the execution performance of the executable code in the specific hardware platform, so that the size of the tensor data to be tested can be the same as the size of the tensor data to be processed by the deep learning model.
The execution performance index may include an execution time index, an execution memory index, and a comprehensive index, and the comprehensive index may be calculated according to the execution time index and the execution memory index.
That is, the at least one executable code is run in turn on the specific hardware platform to obtain the execution time, execution memory, and comprehensive index of each executable code. The execution performance indexes of the at least one executable code are compared, and the executable code with the shortest execution time, the executable code with the smallest execution memory, and the executable code with the highest comprehensive index are selected. From among these, one executable code can then be chosen according to the actual situation and determined as the execution code of the deep learning model.
It should be noted that the executable code with the shortest execution time, the smallest execution memory, and the highest comprehensive index may be one and the same executable code, or there may be multiple executable codes tied for, e.g., the shortest execution time.
For example, suppose the intermediate code includes two intermediate code blocks and the data processing mode of both is data non-reuse. Since each block can independently be left unchanged or be cache-optimized, four preprocessed intermediate codes can be obtained from the intermediate code and correspondingly converted into four executable codes, namely the 1st, 2nd, 3rd, and 4th executable codes. These are run in turn on the specific hardware platform, their execution time, execution memory, and comprehensive index are determined, and one executable code is then chosen as the execution code of the deep learning model according to actual requirements.
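A sketch of why two non-reuse blocks yield four candidates: each block is independently left unchanged or cache-optimized, so the candidate set is the Cartesian product of the per-block choices. The optimize transform is an assumed per-block double-cache rewrite.

```python
# Enumerate all preprocessed intermediate codes for a list of blocks.
from itertools import product

def enumerate_candidates(blocks, optimize):
    for picks in product([False, True], repeat=len(blocks)):
        yield [optimize(b) if pick else b for b, pick in zip(blocks, picks)]

# Two blocks -> 2 ** 2 = 4 preprocessed intermediate codes.
```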
In the embodiment of the present application, the data processing mode of the intermediate code of the deep learning model may include data reuse and/or data non-reuse, and at least one mutually distinct preprocessed intermediate code corresponding to the intermediate code can be determined according to that data processing mode. Each preprocessed intermediate code is converted into an executable code to obtain at least one executable code, so that an execution code of the deep learning model with higher operating efficiency can be determined from the at least one executable code, avoiding the problem that the execution code of the deep learning model can only be determined in a fixed manner.
Fig. 8 is a schematic structural diagram of an apparatus for determining execution code of a deep learning model according to an embodiment of the present application; the apparatus may be implemented in software, in hardware, or in a combination of the two. The apparatus for determining the execution code of the deep learning model comprises:
a first determining module 810, configured to determine, based on a data processing mode of an intermediate code of a deep learning model, at least one preprocessed intermediate code corresponding to the intermediate code, where the data processing mode includes data reuse and/or data non-reuse, and the preprocessed intermediate codes are different from each other;
a conversion module 820, configured to convert each preprocessed intermediate code into an executable code, so as to obtain at least one executable code;
a second determining module 830, configured to determine an execution code of the deep learning model from the at least one executable code.
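An illustrative arrangement of the three modules as plain Python callables; only the data flow mirrors Fig. 8, the internals are assumptions.

```python
# Hedged sketch of the apparatus: three pluggable stages chained in order.
class ExecutionCodeDeterminer:
    def __init__(self, first_determining, conversion, second_determining):
        self.first_determining = first_determining    # module 810
        self.conversion = conversion                  # module 820
        self.second_determining = second_determining  # module 830

    def determine(self, intermediate_code):
        candidates = self.first_determining(intermediate_code)  # preprocessed IRs
        executables = [self.conversion(c) for c in candidates]  # executable codes
        return self.second_determining(executables)             # execution code
```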
In one possible implementation manner of the present application, the apparatus is applied to any scenario of target detection, target tracking, semantic segmentation, speech recognition, character recognition, and natural language processing.
In a possible implementation manner of the present application, the intermediate code includes a plurality of intermediate code blocks, each of the intermediate code blocks is used for processing different tensor data, and the first determining module 810 is configured to:
using the intermediate code itself as a preprocessed intermediate code; and/or,
performing cache optimization processing on at least one intermediate code block in the plurality of intermediate code blocks based on the data processing mode of each intermediate code block;
and taking an intermediate code formed by the at least one cache-optimized intermediate code block and all intermediate code blocks not subjected to cache optimization as a preprocessed intermediate code.
In one possible implementation manner of the present application, the first determining module 810 is configured to:
when the data processing modes of the plurality of intermediate code blocks all include data non-reuse, performing double-cache optimization on at least one of the intermediate code blocks;
and taking an intermediate code formed by the at least one double-cache-optimized intermediate code block and all intermediate code blocks not subjected to double-cache optimization as a preprocessed intermediate code.
In one possible implementation manner of the present application, the first determining module 810 is configured to:
when the data processing mode of some of the plurality of intermediate code blocks includes data reuse and that of the others includes data non-reuse, performing double-cache optimization on at least one intermediate code block whose data processing mode includes data non-reuse, and performing ring-cache optimization on at least one intermediate code block whose data processing mode includes data reuse;
and taking an intermediate code formed by the at least one double-cache-optimized intermediate code block, the at least one ring-cache-optimized intermediate code block, and all intermediate code blocks not subjected to cache optimization as a preprocessed intermediate code.
In one possible implementation manner of the present application, the first determining module 810 is configured to:
when the data processing modes of the plurality of intermediate code blocks all include data reuse, performing ring-cache optimization on at least one of the intermediate code blocks;
and taking an intermediate code formed by the at least one ring-cache-optimized intermediate code block and all intermediate code blocks not subjected to ring-cache optimization as a preprocessed intermediate code.
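The three cases above amount to a per-block dispatch on the data processing mode. A hedged sketch, with DataMode and the two transforms as assumed names:

```python
# Choose the cache optimization for a block from its data processing mode.
from enum import Enum, auto

class DataMode(Enum):
    REUSE = auto()
    NON_REUSE = auto()

def optimize_block(block, mode, double_cache, ring_cache):
    if mode is DataMode.NON_REUSE:
        return double_cache(block)  # data not reused: overlap transfer with compute
    return ring_cache(block)        # data reused: keep a sliding window resident
```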
In a possible implementation manner of the present application, the second determining module 830 is configured to:
determining an execution performance index of each executable code based on tensor data to be tested;
determining execution code for the deep learning model from the at least one executable code based on the execution performance indicators for each executable code.
In one possible implementation manner of the present application, the first determining module 810 is configured to:
changing part of the node information in the syntax tree so that a changed intermediate code block supports the following operations (see the sketch after this list):
alternately caching input data into two input caches, and, at each caching, performing calculation processing on the input cache of the two that is not currently being written;
recording the cache positions of the first input data in the two input caches;
alternately caching the calculated data into two output caches, and, at each caching, outputting the data from the output cache of the two that is not currently being written;
wherein the syntax tree is used for indicating the running logic of the intermediate code block, and each node information in the syntax tree is used for indicating a running step of the intermediate code block.
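A minimal sequential sketch of the double-cache behavior in the list above. The fetch, compute, and emit helpers are assumptions, and the input tile list is assumed non-empty; on real hardware the filling of one input cache and the computation on the other proceed concurrently (for example via DMA), which this sequential loop only mimics.

```python
# Alternate two input caches and two output caches over a stream of tiles.
def double_buffered_run(tiles, fetch, compute, emit):
    in_buf = [None, None]        # two input caches
    out_buf = [None, None]       # two output caches
    first_input_slot = 0         # recorded cache position of the first input data
    for i, tile in enumerate(tiles):
        fill = i % 2             # input cache being written this round
        in_buf[fill] = fetch(tile)
        if i > 0:                # compute on the cache that is not being written
            work = 1 - fill
            out_buf[work] = compute(in_buf[work])
            emit(out_buf[work])  # drain the freshly computed result
    last = (len(tiles) - 1) % 2  # drain: process the final tile
    out_buf[last] = compute(in_buf[last])
    emit(out_buf[last])
```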
In one possible implementation manner of the present application, the first determining module 810 is configured to:
changing part of the node information in the syntax tree so that a changed intermediate code block supports the following operations (see the sketch after this list):
performing ring-cache processing on the input data;
alternately caching the ring-cache-processed data into two output caches, and, at each caching, outputting the data from the output cache of the two that is not currently being written;
wherein the syntax tree is used for indicating the running logic of the intermediate code block, and each node information in the syntax tree is used for indicating a running step of the intermediate code block.
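A hedged sketch of the ring-cache variant: a fixed-size circular buffer holds a sliding window of input rows, so a row fetched once serves several overlapping computations (as in a convolution), while results still alternate between two output caches. The window size and the fetch, compute, and emit helpers are illustrative assumptions, and the loop is a sequential mimic of behavior that would be concurrent on real hardware.

```python
# Keep a sliding window of rows in a ring cache; reuse each row across
# several overlapping computations.
def ring_cached_run(rows, window, fetch, compute, emit):
    ring = [None] * window               # ring cache sized to the reuse window
    out_buf = [None, None]               # two output caches, used alternately
    for i, row in enumerate(rows):
        ring[i % window] = fetch(row)    # overwrite the oldest slot
        if i >= window - 1:              # a full window is resident
            tile = [ring[(i - window + 1 + k) % window] for k in range(window)]
            slot = i % 2
            out_buf[slot] = compute(tile)
            emit(out_buf[slot])          # drain the freshly computed result
```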
In the embodiment of the present application, the data processing mode of the intermediate code of the deep learning model may include data reuse and/or data non-reuse, and at least one mutually distinct preprocessed intermediate code corresponding to the intermediate code can be determined according to that data processing mode. Each preprocessed intermediate code is converted into an executable code to obtain at least one executable code, so that an execution code of the deep learning model with higher operating efficiency can be determined from the at least one executable code, avoiding the problem that the execution code of the deep learning model can only be determined in a fixed manner.
It should be noted that, when the apparatus for determining execution code of a deep learning model provided in the foregoing embodiment determines the execution code, the division into the functional modules described above is merely an example; in practical applications, the functions may be assigned to different functional modules as needed, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus embodiment above and the method embodiments for determining the execution code of a deep learning model belong to the same concept; for details of the implementation process, refer to the method embodiments, which are not repeated here.
Fig. 9 is a schematic structural diagram of an electronic device 900 according to an embodiment of the present application. The electronic device 900 may vary considerably in configuration and performance, and may include one or more processors (CPUs) 901 and one or more memories 902, where the memory 902 stores at least one instruction that is loaded and executed by the processor 901 to implement the method for determining the execution code of a deep learning model provided by the foregoing method embodiments.
Of course, the electronic device 900 may further have components such as a wired or wireless network interface, a keyboard, and an input/output interface for performing input and output, and may include other components for implementing device functions, which are not described here.
The embodiment of the present application further provides a non-transitory computer-readable storage medium, and when instructions in the storage medium are executed by a processor of a mobile terminal, the mobile terminal is enabled to execute the method for determining the execution code of the deep learning model provided in the embodiment shown in fig. 1.
Embodiments of the present application further provide a computer program product containing instructions, which when executed on a computer, cause the computer to execute the method for determining the execution code of the deep learning model provided in the embodiment shown in fig. 1.
It will be understood by those skilled in the art that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk, or an optical disc.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (12)

1. A method for determining execution code of a deep learning model, the method comprising:
determining at least one preprocessed intermediate code corresponding to an intermediate code based on a data processing mode of the intermediate code of a deep learning model, wherein the data processing mode comprises data reuse and/or data non-reuse, and the preprocessed intermediate codes are different from each other;
converting each preprocessed intermediate code into an executable code to obtain at least one executable code;
determining execution code for the deep learning model from the at least one executable code.
2. The method of claim 1, wherein the method is applied to any scenario of target detection, target tracking, semantic segmentation, speech recognition, text recognition, natural language processing.
3. The method of claim 2, wherein the intermediate code comprises a plurality of intermediate code blocks, each intermediate code block is used for processing different tensor data, and determining at least one preprocessed intermediate code corresponding to the intermediate code based on a data processing mode of the intermediate code of the deep learning model comprises:
using the intermediate code itself as a preprocessed intermediate code; and/or,
performing cache optimization processing on at least one intermediate code block in the plurality of intermediate code blocks based on the data processing mode of each intermediate code block;
and taking an intermediate code formed by at least one intermediate code block after the cache optimization processing and all intermediate code blocks which are not subjected to the cache optimization processing as a preprocessing intermediate code.
4. The method of claim 3, wherein the performing cache optimization processing on at least one of the plurality of intermediate code blocks based on the data processing mode of each intermediate code block comprises:
when the data processing modes of the intermediate code blocks comprise data non-reuse, performing double-cache optimization processing on at least one intermediate code block in the intermediate code blocks;
the step of taking an intermediate code formed by at least one intermediate code block after the cache optimization processing and all intermediate code blocks which are not subjected to the cache optimization processing as a preprocessing intermediate code comprises the following steps:
and taking an intermediate code formed by at least one intermediate code block subjected to double cache optimization processing and all intermediate code blocks which are not subjected to double cache optimization processing as a preprocessing intermediate code.
5. The method of claim 3, wherein the performing cache optimization processing on at least one of the plurality of intermediate code blocks based on the data processing mode of each intermediate code block comprises:
when the data processing mode of some of the plurality of intermediate code blocks comprises data reuse and that of the others comprises data non-reuse, performing double-cache optimization processing on at least one intermediate code block whose data processing mode comprises data non-reuse, and performing ring-cache optimization processing on at least one intermediate code block whose data processing mode comprises data reuse;
the step of taking an intermediate code formed by at least one intermediate code block after the cache optimization processing and all intermediate code blocks which are not subjected to the cache optimization processing as a preprocessing intermediate code comprises the following steps:
and taking an intermediate code formed by the at least one double-cache-optimized intermediate code block, the at least one ring-cache-optimized intermediate code block, and all intermediate code blocks not subjected to cache optimization as a preprocessed intermediate code.
6. The method of claim 3, wherein the performing cache optimization processing on at least one of the plurality of intermediate code blocks based on the data processing mode of each intermediate code block comprises:
when the data processing modes of the plurality of intermediate code blocks all comprise data reuse, performing ring cache optimization processing on at least one of the intermediate code blocks;
the step of taking an intermediate code formed by at least one intermediate code block after the cache optimization processing and all intermediate code blocks which are not subjected to the cache optimization processing as a preprocessing intermediate code comprises the following steps:
and taking an intermediate code formed by at least one intermediate code block subjected to the ring cache optimization processing and all intermediate code blocks not subjected to the ring cache optimization processing as a preprocessing intermediate code.
7. The method of any one of claims 1-6, wherein said determining execution code for the deep learning model from the at least one executable code comprises:
determining an execution performance index of each executable code based on tensor data to be tested;
determining execution code for the deep learning model from the at least one executable code based on the execution performance indicators for each executable code.
8. The method of claim 4 or 5, wherein performing double cache optimization on an intermediate code block in the intermediate code comprises:
and changing part of node information in the syntax tree so that one changed intermediate code block supports the following operations:
alternately caching input data into two input caches, and performing calculation processing in the input cache which is not currently subjected to cache operation in the two input caches while caching the data each time;
recording the cache positions of the first input data in the two input caches;
alternately caching the calculated data into two output caches, and outputting the data in the output cache which is not currently subjected to cache operation in the two output caches while caching each time;
wherein the syntax tree is used for indicating the running logic of the intermediate code block, and each node information in the syntax tree is used for indicating a running step of the intermediate code block.
9. The method of claim 5 or 6, wherein performing a ring cache optimization process on an intermediate code block in the intermediate code comprises:
and changing part of node information in the syntax tree so that one changed intermediate code block supports the following operations:
performing ring cache processing on input data;
alternately caching the ring-cache-processed data into two output caches, and outputting, at each caching, the data from the output cache of the two that is not currently being written;
wherein the syntax tree is used for indicating the running logic of the intermediate code block, and each node information in the syntax tree is used for indicating a running step of the intermediate code block.
10. An apparatus for determining execution code of a deep learning model, the apparatus comprising:
the deep learning model comprises a first determining module, a second determining module and a third determining module, wherein the first determining module is used for determining at least one preprocessing intermediate code corresponding to an intermediate code based on a data processing mode of the intermediate code of the deep learning model, the data processing mode comprises data reuse and/or data non-reuse, and the at least one preprocessing intermediate code is different from each other;
the conversion module is used for converting each preprocessing intermediate code into an executable code to obtain at least one executable code;
a second determination module to determine execution code of the deep learning model from the at least one executable code.
11. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other via the communication bus, the memory is used for storing computer programs, and the processor is used for executing the programs stored in the memory to realize the steps of the method according to any one of claims 1-9.
12. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 9.
CN201911356174.0A 2019-12-25 2019-12-25 Method and device for determining execution code of deep learning model and storage medium Pending CN113031952A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911356174.0A CN113031952A (en) 2019-12-25 2019-12-25 Method and device for determining execution code of deep learning model and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911356174.0A CN113031952A (en) 2019-12-25 2019-12-25 Method and device for determining execution code of deep learning model and storage medium

Publications (1)

Publication Number Publication Date
CN113031952A true CN113031952A (en) 2021-06-25

Family

ID=76458124

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911356174.0A Pending CN113031952A (en) 2019-12-25 2019-12-25 Method and device for determining execution code of deep learning model and storage medium

Country Status (1)

Country Link
CN (1) CN113031952A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113283613A * 2021-07-23 2021-08-20 Shanghai Enflame Technology Co., Ltd. Deep learning model generation method, optimization method, device, equipment and medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination