CN116415622A - Deep neural network reasoning acceleration method, device, equipment and storage medium - Google Patents

Deep neural network reasoning acceleration method, device, equipment and storage medium Download PDF

Info

Publication number
CN116415622A
CN116415622A
Authority
CN
China
Prior art keywords
neural network
deep neural
storage format
calculation
reasoning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310320784.5A
Other languages
Chinese (zh)
Inventor
高希彤
何祥焕
叶可江
须成忠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN202310320784.5A priority Critical patent/CN116415622A/en
Publication of CN116415622A publication Critical patent/CN116415622A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application relates to a deep neural network reasoning acceleration method, device, equipment and storage medium. The method comprises the following steps: constructing an Einsum calculation graph for representing the model structure of a deep neural network; acquiring, based on the Einsum calculation graph, the set of possible storage formats of all intermediate results generated by the deep neural network from layer to layer in the reasoning calculation process, wherein the intermediate results are the non-final output results generated by the deep neural network from layer to layer in the reasoning calculation process; searching the possible storage format sets of all the intermediate results with a local greedy search strategy to obtain the optimal storage format combination that makes the reasoning calculation time of the deep neural network shortest; and deploying the deep neural network according to the optimal storage format combination. According to the embodiments of the application, the reasoning calculation speed of the deep neural network can be greatly improved without changing the model structure of the deep neural network.

Description

Deep neural network reasoning acceleration method, device, equipment and storage medium
Technical Field
The application belongs to the technical field of deep learning, and particularly relates to a deep neural network reasoning acceleration method, device, equipment and storage medium.
Background
In recent years, with the improvement of the computing power of hardware devices and the increasing availability of data, deep learning has achieved remarkable results in fields such as image, text, speech and recommendation systems, and deep neural networks are widely applied across industries. Deep neural networks learn features from large-scale data and generalize to unseen data. However, the cost of this excellent performance is the growth of network scale, mainly reflected in parameter count and computation. Taking the deep neural network VGG16 as an example, its parameters exceed 500MB and must be kept in memory and continuously read and written during reasoning; as for computation, deep neural networks involve a large number of multiply-accumulate operations, e.g. more than 15G floating point operations in VGG16. Such huge parameter counts and computation inevitably slow down the reasoning of deep neural networks, making it difficult to apply them on embedded devices with limited storage and computing capacity and strict real-time requirements, such as smartphones or wearable devices.
To solve the above problems, the prior art accelerates deep neural network reasoning through hardware acceleration and software algorithms, respectively. However, hardware acceleration requires a suitable mapping for each different network structure configuration, which limits its generality. Software algorithms generally involve acceleration methods such as pruning, data quantization, distillation and low-rank decomposition; these methods change the model structure of the deep neural network to some extent, require different solutions for different model structures, and thus generalize poorly.
Disclosure of Invention
The application provides a deep neural network reasoning acceleration method, device, equipment and storage medium, aiming to solve, at least to some extent, one of the technical problems in the prior art.
In order to solve the above problems, the present application provides the following technical solutions:
a deep neural network reasoning acceleration method comprises the following steps:
constructing an Einsum calculation graph for representing the model structure of a deep neural network;
acquiring a set of possible storage formats of all intermediate results generated by the deep neural network in the reasoning calculation process from layer to layer based on the Einsum calculation graph, wherein the intermediate results are non-final output results generated by the deep neural network in the reasoning calculation process from layer to layer;
searching the possible storage format sets of all the intermediate results by adopting a local greedy search strategy to obtain an optimal storage format combination which enables the inference calculation time of the deep neural network to be shortest;
and deploying the deep neural network according to the optimal storage format combination.
The technical scheme adopted by the embodiments of the application further comprises: the constructing of the Einsum calculation graph for representing the deep neural network model structure comprises:
converting the model structure of the deep neural network into the form of a calculation graph G, and replacing the multiply-accumulate operators in the calculation graph with Einsum operators to generate the Einsum calculation graph G_E.
The technical scheme adopted by the embodiments of the application further comprises: the converting of the model structure of the deep neural network into the form of a calculation graph G and replacing the multiply-accumulate operators in the calculation graph with Einsum operators to generate the Einsum calculation graph G_E specifically comprises:
defining possible storage formats of all tensor data in the calculation graph G as a search space;
analyzing the cache hit rates of all operators involved in the calculation graph G, recording the storage format of the tensor data passing through each operator, and removing from the search space the top k% of storage formats that give the operators the worst cache hit rates;
replacing the operators of the calculation graph G, after the worst storage formats are removed, with Einsum operators, so as to convert the calculation graph G into the Einsum calculation graph G_E.
The technical scheme adopted by the embodiments of the application further comprises: the acquiring, based on the Einsum calculation graph, of all intermediate results generated by the deep neural network in the reasoning calculation process specifically comprises:
acquiring all intermediate results V generated by the deep neural network in the reasoning calculation process from layer to layer:
V = {IV_1, IV_2, ..., IV_k}
where IV_1 denotes the 1st intermediate result generated in the reasoning calculation process, and s_i^{IV_1} denotes the i-th storage format of the first intermediate result IV_1;
the possible storage formats of each intermediate result in the Einsum calculation graph are defined as a set S:
S = {s_1, s_2, ..., s_k}
each storage format in the set S of each intermediate result is extracted, giving the set C of possible storage format combinations of all intermediate results:
C = S_{IV_1} × S_{IV_2} × ... × S_{IV_k}
the technical scheme adopted by the embodiment of the application further comprises: the method comprises the steps of searching a possible storage format set of all intermediate results by adopting a local greedy search strategy, and obtaining an optimal storage format combination with the shortest reasoning calculation time of the deep neural network specifically comprises the following steps:
for each intermediate result in the Einsum calculation graph, feeding the combinations of possible storage formats of the current intermediate result and the n intermediate results following it into the deep neural network for calculation, obtaining the reasoning calculation speed of the deep neural network under each storage format in those combinations, selecting the optimal storage format that makes the reasoning calculation speed of the deep neural network fastest, and adding it to the optimal storage format combination C*; and so on, until the storage formats of all intermediate results have been processed, obtaining the optimal storage format combination C* containing the optimal storage format of every intermediate result.
The embodiment of the application adopts another technical scheme that: a deep neural network reasoning acceleration apparatus, comprising:
a calculation graph construction module: used for constructing an Einsum calculation graph representing the model structure of the deep neural network;
a storage format acquisition module: used for acquiring, based on the Einsum calculation graph, the set of possible storage formats of all intermediate results generated by the deep neural network from layer to layer in the reasoning calculation process, wherein the intermediate results are the non-final output results generated by the deep neural network from layer to layer in the reasoning calculation process;
a storage format optimization module: used for searching the possible storage format sets of all intermediate results with a local greedy search strategy to obtain the optimal storage format combination that makes the reasoning calculation time of the deep neural network shortest;
a network deployment module: used for deploying the deep neural network according to the optimal storage format combination.
The technical scheme adopted by the embodiments of the application further comprises: the storage format acquisition module acquiring, based on the Einsum calculation graph, the set of possible storage formats of all intermediate results generated by the deep neural network layer by layer in the reasoning calculation process is specifically:
acquiring all intermediate results V generated by the deep neural network in the reasoning calculation process from layer to layer:
V = {IV_1, IV_2, ..., IV_k}
where IV_1 denotes the 1st intermediate result generated in the reasoning calculation process, and s_i^{IV_1} denotes the i-th storage format of the first intermediate result IV_1;
the possible storage formats of each intermediate result in the Einsum calculation graph are defined as a set S:
S = {s_1, s_2, ..., s_k}
each storage format in the set S of each intermediate result is extracted, giving the set C of possible storage format combinations of all intermediate results:
C = S_{IV_1} × S_{IV_2} × ... × S_{IV_k}
the technical scheme adopted by the embodiment of the application further comprises: the storage format optimization module searches the possible storage format sets of all the intermediate results by adopting a local greedy search strategy, and the optimal storage format combination which enables the inference calculation time of the deep neural network to be shortest is specifically:
for each intermediate result in the Einsum calculation graph, feeding the combinations of possible storage formats of the current intermediate result and the n intermediate results following it into the deep neural network for calculation, obtaining the reasoning calculation speed of the deep neural network under each storage format in those combinations, selecting the optimal storage format that makes the reasoning calculation speed of the deep neural network fastest, and adding it to the optimal storage format combination C*; and so on, until the storage formats of all intermediate results have been processed, obtaining the optimal storage format combination C* containing the optimal storage format of every intermediate result.
The embodiment of the application adopts the following technical scheme: an apparatus comprising a processor, a memory coupled to the processor, wherein,
the memory stores program instructions for implementing the deep neural network reasoning acceleration method;
the processor is configured to execute the program instructions stored in the memory to perform the deep neural network reasoning acceleration method.
The embodiment of the application adopts the following technical scheme: a storage medium storing program instructions executable by a processor for performing the deep neural network reasoning acceleration method.
Compared with the prior art, the beneficial effects of the embodiments of the application are as follows: the deep neural network reasoning acceleration method, device, equipment and storage medium construct the Einsum calculation graph of the deep neural network, acquire, based on the Einsum calculation graph, the set of possible storage formats of all intermediate results generated between layers in the reasoning calculation process of the deep neural network, search the possible storage format sets of all intermediate results with a local greedy search strategy to obtain the optimal storage format combination that makes the reasoning calculation time of the deep neural network shortest, and deploy the deep neural network according to the optimal storage format combination. The reasoning calculation speed of the deep neural network is thereby greatly improved without changing the model structure of the deep neural network or harming its original performance, which facilitates deploying the deep neural network on edge computing devices with lower computing power. With the embodiments of the application, a user needs neither much expertise in accelerating deep neural network model reasoning nor much consideration of hardware characteristics, so the application has high universality.
Drawings
FIG. 1 is a flow chart of a deep neural network reasoning acceleration method of an embodiment of the present application;
FIG. 2 is a conventional calculation graph of a model structure;
FIG. 3 is an Einsum calculation graph provided in an embodiment of the present application;
FIG. 4 is a schematic diagram of a change in a tensor storage format according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a deep neural network reasoning acceleration device according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a device structure according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a storage medium according to an embodiment of the present application.
Detailed Description
The following description of the technical solutions in the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
The terms "first," "second," "third," and the like in this application are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first", "a second", and "a third" may explicitly or implicitly include at least one such feature. In the description of the present application, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise. All directional indications (such as up, down, left, right, front, back … …) in the embodiments of the present application are merely used to explain the relative positional relationship, movement, etc. between the components in a particular gesture (as shown in the drawings), and if the particular gesture changes, the directional indication changes accordingly. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
Referring to fig. 1, a flowchart of a deep neural network reasoning acceleration method according to an embodiment of the present application is shown. The deep neural network reasoning acceleration method of the embodiment of the application comprises the following steps:
s200: converting the model structure of the deep neural network into the form of a calculation graph, and replacing the multiply-accumulate operators in the calculation graph with Einsum (Einstein summation convention) operators to construct the Einsum calculation graph;
in this step, the deep neural network includes, but is not limited to, a network model determined by a transducer and other structures, and the computational graph is a tool for representing the logic and state of the deep learning network model in the training and reasoning process, and for the model, the model structure of the deep neural network can be represented in the form of a computational graph. The conventional calculation graph of the model structure is shown in fig. 2, and z=relu (x×y) is converted into a calculation graph G, where ReLU represents an activation function, and the data flow performs forward calculation and backward gradient calculation according to the flow direction and the operator in the graph to update the tensor state in the calculation graph G, so as to achieve the purpose of training the model.
Hardware memory stores data in a one-dimensional structure, while tensors are retrieved multidimensionally within the network structure. During the reasoning calculation of a deep neural network, different storage formats of tensor data change the number of memory accesses needed when the data is used, and the more accesses there are, the slower the reasoning speed of the deep neural network becomes. Based on this, the embodiments of the application convert the conventional calculation graph into an Einsum calculation graph, use Einsum operators to abstract the input and each layer's output of the deep neural network, change and optimize the storage formats of the input and output tensors, and find the tensor storage formats that let the model reason fastest, thereby reducing the number of memory accesses of the deep neural network during reasoning calculation and accelerating its reasoning.
Specifically, as can be seen from FIG. 2, most of the calculation of a deep neural network model is completed by multiply-accumulate operations, which are carried out by different operators, and different operators differ in importance within the overall calculation. Therefore, the importance of the operators in the deep neural network (namely, their cache hit rates) needs to be analyzed, and according to the analysis result the more important multiply-accumulate operators in the calculation graph G are replaced with Einsum operators, so as to construct the Einsum calculation graph G_E representing the deep neural network model structure. FIG. 3 shows an Einsum calculation graph provided in an embodiment of the present application. The process of constructing the Einsum calculation graph in the embodiments of the application comprises the following steps:
s201: defining possible storage formats of all tensor data in the computational graph G as a search space;
s202: analyzing the cache hit rates of all operators involved in the calculation graph G, recording the storage format of the tensor data passing through each operator, and removing from the search space the top k% of storage formats that give the operators the worst cache hit rates;
s203: replacing the operators of the calculation graph G, after the worst storage formats are removed, with Einsum operators, converting the calculation graph G into the Einsum calculation graph G_E.
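A rough Python sketch of the pruning idea in steps S201-S202 follows. It is assumption-laden: hardware cache hit rates are not directly observable from Python, so the average wall-clock time of the operator under each layout stands in here for the cache behavior the patent records, and axis permutations stand in for storage formats:

import itertools
import time
import numpy as np

def prune_search_space(op, tensor, k=25.0, repeats=5):
    # Score every axis-permutation layout of the tensor by the average
    # runtime of the operator over it (a locality proxy for the cache
    # hit rate), then drop the worst-performing k% from the search space.
    scores = []
    for perm in itertools.permutations(range(tensor.ndim)):
        laid_out = np.ascontiguousarray(tensor.transpose(perm))
        start = time.perf_counter()
        for _ in range(repeats):
            op(laid_out)
        scores.append(((time.perf_counter() - start) / repeats, perm))
    scores.sort()                                   # fastest layouts first
    keep = max(1, int(len(scores) * (1.0 - k / 100.0)))
    return [perm for _, perm in scores[:keep]]

# Example: prune layouts for a reduction operator on a 3-D tensor.
kept = prune_search_space(lambda t: t.sum(axis=-1),
                          np.random.rand(96, 128, 64), k=25.0)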
When tensor multiplication is performed with the Einsum operator, the user only needs to pay attention to the dimensions of the input and output tensors. For example, when realizing the tensor multiplication C = A×B, where the shape of the input tensor A is (i, j, k), the shape of the input tensor B is (k, l), and the shape of the output tensor C is (i, j, l), with i, j, k, l being numbers indicating the sizes of the tensor dimensions, the tensor multiplication using the Einsum operator can be expressed as:
Einsum("ijk,kl→ijl", A, B)    (1)
Here ijl in the above formula represents the shape of the output tensor C; when the user wants to change the shape of the output tensor C, they need only change the order of the character string ijl. FIG. 4 provides a schematic diagram of a tensor storage format change in an embodiment of the present application. In FIG. 4, assuming the shape of the output tensor C is (512, 100, 196) as shown in (a), changing ijl in the above formula to jli changes the storage format of the output tensor C from (a) to (b).
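The effect of the output subscripts on the storage layout can be checked with NumPy's einsum, which follows the same convention (a small sketch for illustration, not the patent's implementation):

import numpy as np

i, j, k, l = 4, 3, 5, 2
A = np.random.rand(i, j, k)   # input tensor A of shape (i, j, k)
B = np.random.rand(k, l)      # input tensor B of shape (k, l)

C1 = np.einsum("ijk,kl->ijl", A, B)   # C stored as (i, j, l), as in formula (1)
C2 = np.einsum("ijk,kl->jli", A, B)   # same values, stored as (j, l, i)

# The two outputs contain identical values under permuted layouts.
assert np.allclose(C1, C2.transpose(2, 0, 1))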
S210: acquiring a possible storage format set of all intermediate results generated by the deep neural network between layers in the reasoning calculation process based on the Einsum calculation graph, wherein the intermediate results are non-final output results generated by the deep neural network between layers in the reasoning calculation process;
in this step, for a deep neural network of a linear layered structure, the non-final output result generated in the inference calculation process is defined as an intermediate result. The deep neural network model can generate a large amount of intermediate results in the process of reasoning calculation, and all intermediate results V generated by the deep neural network in the process of reasoning calculation between layers are firstly obtained:
V = {IV_1, IV_2, ..., IV_k}    (2)
where IV_1 denotes the 1st intermediate result generated in the reasoning calculation process, each intermediate result comprising a plurality of storage formats, and s_i^{IV_1} denotes the i-th storage format of the first intermediate result IV_1. For example, a three-dimensional tensor of shape (196, 512, 256) has six storage formats, corresponding to the arbitrary orderings of its three dimensions, and the three dimensions can be arranged in any order and stored on the computer's hard disk.
The possible storage formats of each intermediate result in the Einsum calculation graph are defined as a set S:
S = {s_1, s_2, ..., s_k}    (3)
Then, each storage format in the set S of each intermediate result is extracted, giving the set C of possible storage format combinations of all intermediate results:
C = S_{IV_1} × S_{IV_2} × ... × S_{IV_k}    (4)
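Treating a storage format as an ordering of tensor axes, the per-result format sets S and the combined search space C of formula (4) can be enumerated as in the following sketch, where the dimension counts are hypothetical:

import itertools

def format_set(ndim):
    # All storage formats of a dense tensor = all orderings of its axes,
    # e.g. 3! = 6 formats for a tensor of shape (196, 512, 256).
    return list(itertools.permutations(range(ndim)))

intermediate_ndims = [3, 3, 2]                 # hypothetical example graph
S = [format_set(nd) for nd in intermediate_ndims]
C = itertools.product(*S)                      # the Cartesian product of formula (4)
print(len(S[0]))                               # 6 candidate formats per 3-D result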
s220: searching the possible storage format sets of all intermediate results in the Einsum calculation graph with a local greedy search strategy to obtain the optimal storage format combination that makes the reasoning calculation time of the deep neural network shortest;
in this step, the local greedy search strategy is: for each intermediate result in the EIinsum calculation graph, sending the possible storage format combinations of the current intermediate result and the n intermediate results after the current intermediate result and the n intermediate results into the deep neural network to calculate, keeping the storage formats of the rest intermediate results unchanged, obtaining the reasoning calculation speed of the deep neural network model under each storage format in the possible storage format combinations, selecting the optimal storage format which enables the reasoning calculation speed of the deep neural network model to be the fastest, and counting the optimal storage format into an optimal storage format combination C * Is a kind of medium. And the like, until the calculation of all the intermediate results is completed, obtaining an optimal storage format combination C of the optimal storage formats containing all the intermediate results * Namely, expressed as:
C* = argmin_C Cost(G_E(C))    (5)
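The following Python sketch illustrates one reading of this local greedy search. The hooks run_inference (which executes the Einsum calculation graph under a given storage format per intermediate result) and format_sets are hypothetical stand-ins for the patent's runtime, and measured time serves as the Cost of formula (5):

import itertools
import time

def local_greedy_search(run_inference, format_sets, n=1, repeats=3):
    chosen = [formats[0] for formats in format_sets]   # arbitrary initial combination
    for idx in range(len(format_sets)):
        window = list(range(idx, min(idx + n + 1, len(format_sets))))
        best_combo, best_time = None, float("inf")
        # Try every format combination over the current result and its n
        # successors, keeping all other formats fixed (the local step).
        for combo in itertools.product(*(format_sets[w] for w in window)):
            trial = list(chosen)
            for w, fmt in zip(window, combo):
                trial[w] = fmt
            start = time.perf_counter()
            for _ in range(repeats):
                run_inference(trial)
            elapsed = (time.perf_counter() - start) / repeats
            if elapsed < best_time:
                best_combo, best_time = combo, elapsed
        chosen[idx] = best_combo[0]   # greedily commit only the current format
    return chosen                     # the optimal combination C*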
s230: and deploying the deep neural network according to the optimal storage format combination, so that the deep neural network performs reasoning calculation according to the optimal storage format combination.
Based on the above, the deep neural network reasoning acceleration method of the embodiments of the application constructs the Einsum calculation graph of the deep neural network, defines the non-final output results generated by the deep neural network between layers in the reasoning calculation process as intermediate results, acquires the set of possible storage formats of all intermediate results, searches the possible storage format sets of all intermediate results in the Einsum calculation graph with a local greedy search strategy to obtain the optimal storage format combination that makes the reasoning calculation time of the deep neural network shortest, and deploys the deep neural network according to the optimal storage format combination. The reasoning calculation speed of the deep neural network is thus greatly improved without changing its model structure or harming its original performance, which facilitates deploying the deep neural network on edge computing devices with lower computing power, such as smartphones or wearable devices. With the embodiments of the application, a user needs neither much expertise in accelerating deep neural network model reasoning nor much consideration of hardware characteristics, so the application has high universality.
Fig. 5 is a schematic structural diagram of a deep neural network reasoning acceleration device according to an embodiment of the present application. The deep neural network reasoning acceleration apparatus 40 of the embodiment of the present application includes:
calculation graph construction module 41: used for constructing an Einsum calculation graph representing the model structure of the deep neural network;
storage format acquisition module 42: used for acquiring, based on the Einsum calculation graph, the set of possible storage formats of all intermediate results generated by the deep neural network from layer to layer in the reasoning calculation process, wherein the intermediate results are the non-final output results generated by the deep neural network from layer to layer in the reasoning calculation process;
storage format optimization module 43: used for searching the possible storage format sets of all intermediate results with a local greedy search strategy to obtain the optimal storage format combination that makes the reasoning calculation time of the deep neural network shortest;
network deployment module 44: used for deploying the deep neural network according to the optimal storage format combination.
Please refer to fig. 6, which is a schematic diagram of an apparatus structure according to an embodiment of the present application. The apparatus 50 comprises:
a memory 51 storing executable program instructions;
a processor 52 connected to the memory 51;
the processor 52 is configured to call the executable program instructions stored in the memory 51 and perform the following steps: constructing an Einsum calculation graph for representing the model structure of a deep neural network; acquiring, based on the Einsum calculation graph, the set of possible storage formats of all intermediate results generated by the deep neural network from layer to layer in the reasoning calculation process, wherein the intermediate results are the non-final output results generated by the deep neural network from layer to layer in the reasoning calculation process; searching the possible storage format sets of all the intermediate results with a local greedy search strategy to obtain the optimal storage format combination that makes the reasoning calculation time of the deep neural network shortest; and deploying the deep neural network according to the optimal storage format combination.
The processor 52 may also be referred to as a CPU (Central Processing Unit). The processor 52 may be an integrated circuit chip having signal processing capabilities. The processor 52 may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a storage medium according to an embodiment of the present application. The storage medium of the embodiment of the present application stores program instructions 61 capable of implementing the following steps: constructing an Einsum calculation graph for representing the model structure of a deep neural network; acquiring, based on the Einsum calculation graph, the set of possible storage formats of all intermediate results generated by the deep neural network from layer to layer in the reasoning calculation process, wherein the intermediate results are the non-final output results generated by the deep neural network from layer to layer in the reasoning calculation process; searching the possible storage format sets of all the intermediate results with a local greedy search strategy to obtain the optimal storage format combination that makes the reasoning calculation time of the deep neural network shortest; and deploying the deep neural network according to the optimal storage format combination. The program instructions 61 may be stored in the storage medium as a software product and include several instructions causing a device (which may be a personal computer, a server, a network device, etc.) or a processor to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program instructions, or a terminal device such as a computer, server, mobile phone or tablet. The server may be an independent server, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, Content Delivery Networks (CDN), and basic cloud computing services such as big data and artificial intelligence platforms.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the system embodiments described above are merely illustrative, e.g., the partitioning of elements is merely a logical functional partitioning, and there may be additional partitioning in actual implementation, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not implemented. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated units may be implemented in hardware or as software functional units. The foregoing describes only embodiments of the present application and does not thereby limit the patent scope of the present application; all equivalent structures or equivalent processes that use the contents of the specification and the accompanying drawings of the present application, whether applied directly or indirectly in other related technical fields, are likewise included in the patent protection scope of the present application.

Claims (10)

1. A deep neural network reasoning acceleration method, characterized by comprising the following steps:
constructing an Einsum calculation graph for representing the model structure of a deep neural network;
acquiring a set of possible storage formats of all intermediate results generated by the deep neural network in the reasoning calculation process from layer to layer based on the Einsum calculation graph, wherein the intermediate results are non-final output results generated by the deep neural network in the reasoning calculation process from layer to layer;
searching the possible storage format sets of all the intermediate results by adopting a local greedy search strategy to obtain an optimal storage format combination which enables the inference calculation time of the deep neural network to be shortest;
and deploying the deep neural network according to the optimal storage format combination.
2. The deep neural network reasoning acceleration method of claim 1, wherein the constructing an Einsum calculation graph representing a deep neural network model structure comprises:
converting the model structure of the deep neural network into the form of a calculation graph G, and replacing the multiply-accumulate operators in the calculation graph with Einsum operators to generate the Einsum calculation graph G_E.
3. The deep neural network reasoning acceleration method of claim 2, wherein the converting of the model structure of the deep neural network into the form of a calculation graph G and replacing the multiply-accumulate operators in the calculation graph with Einsum operators to generate the Einsum calculation graph G_E specifically comprises:
defining possible storage formats of all tensor data in the calculation graph G as a search space;
analyzing the cache hit rates of all operators involved in the calculation graph G, recording the storage format of the tensor data passing through each operator, and removing from the search space the top k% of storage formats that give the operators the worst cache hit rates;
replacing the operators of the calculation graph G, after the worst storage formats are removed, with Einsum operators, so as to convert the calculation graph G into the Einsum calculation graph G_E.
4. The deep neural network reasoning acceleration method according to any one of claims 1-3, wherein the acquiring, based on the Einsum calculation graph, of the set of possible storage formats of all intermediate results generated by the deep neural network from layer to layer during the reasoning calculation is specifically:
acquiring all intermediate results V generated by the deep neural network in the reasoning calculation process from layer to layer:
V = {IV_1, IV_2, ..., IV_k}
where IV_1 denotes the 1st intermediate result generated in the reasoning calculation process, and s_i^{IV_1} denotes the i-th storage format of the first intermediate result IV_1;
the possible storage formats of each intermediate result in the Einsum calculation graph are defined as a set S:
S = {s_1, s_2, ..., s_k}
each storage format in the set S of each intermediate result is extracted, giving the set C of possible storage format combinations of all intermediate results:
C = S_{IV_1} × S_{IV_2} × ... × S_{IV_k}
5. The deep neural network reasoning acceleration method of claim 4, wherein the searching of the possible storage format sets of all intermediate results with a local greedy search strategy to obtain the optimal storage format combination that makes the reasoning calculation time of the deep neural network shortest is specifically:
for each intermediate result in the Einsum calculation graph, feeding the combinations of possible storage formats of the current intermediate result and the n intermediate results following it into the deep neural network for calculation, obtaining the reasoning calculation speed of the deep neural network under each storage format in those combinations, selecting the optimal storage format that makes the reasoning calculation speed of the deep neural network fastest, and adding it to the optimal storage format combination C*; and so on, until the storage formats of all intermediate results have been processed, obtaining the optimal storage format combination C* containing the optimal storage format of every intermediate result.
6. A deep neural network reasoning acceleration apparatus, comprising:
a calculation graph construction module: used for constructing an Einsum calculation graph representing the model structure of the deep neural network;
a storage format acquisition module: used for acquiring, based on the Einsum calculation graph, the set of possible storage formats of all intermediate results generated by the deep neural network from layer to layer in the reasoning calculation process, wherein the intermediate results are the non-final output results generated by the deep neural network from layer to layer in the reasoning calculation process;
a storage format optimization module: used for searching the possible storage format sets of all intermediate results with a local greedy search strategy to obtain the optimal storage format combination that makes the reasoning calculation time of the deep neural network shortest;
a network deployment module: used for deploying the deep neural network according to the optimal storage format combination.
7. The deep neural network reasoning acceleration apparatus of claim 6, wherein the storage format acquisition module acquiring, based on the Einsum calculation graph, the set of possible storage formats of all intermediate results generated by the deep neural network from layer to layer in the reasoning calculation process is specifically:
acquiring all intermediate results V generated by the deep neural network in the reasoning calculation process from layer to layer:
V = {IV_1, IV_2, ..., IV_k}
where IV_1 denotes the 1st intermediate result generated in the reasoning calculation process, and s_i^{IV_1} denotes the i-th storage format of the first intermediate result IV_1;
the possible storage formats of each intermediate result in the Einsum calculation graph are defined as a set S:
S = {s_1, s_2, ..., s_k}
each storage format in the set S of each intermediate result is extracted, giving the set C of possible storage format combinations of all intermediate results:
C = S_{IV_1} × S_{IV_2} × ... × S_{IV_k}
8. The deep neural network reasoning acceleration apparatus of claim 7, wherein the storage format optimization module searching the possible storage format sets of all intermediate results with a local greedy search strategy to obtain the optimal storage format combination that makes the reasoning calculation time of the deep neural network shortest is specifically:
for each intermediate result in the Einsum calculation graph, feeding the combinations of possible storage formats of the current intermediate result and the n intermediate results following it into the deep neural network for calculation, obtaining the reasoning calculation speed of the deep neural network under each storage format in those combinations, selecting the optimal storage format that makes the reasoning calculation speed of the deep neural network fastest, and adding it to the optimal storage format combination C*; and so on, until the storage formats of all intermediate results have been processed, obtaining the optimal storage format combination C* containing the optimal storage format of every intermediate result.
9. An apparatus comprising a processor, a memory coupled to the processor, wherein,
the memory stores program instructions for implementing the deep neural network reasoning acceleration method of any of claims 1-5;
the processor is configured to execute the program instructions stored in the memory to perform the deep neural network reasoning acceleration method.
10. A storage medium storing program instructions executable by a processor for performing the deep neural network reasoning acceleration method of any one of claims 1 to 5.
CN202310320784.5A 2023-03-28 2023-03-28 Deep neural network reasoning acceleration method, device, equipment and storage medium Pending CN116415622A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310320784.5A CN116415622A (en) 2023-03-28 2023-03-28 Deep neural network reasoning acceleration method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310320784.5A CN116415622A (en) 2023-03-28 2023-03-28 Deep neural network reasoning acceleration method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116415622A 2023-07-11

Family

ID=87049114

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310320784.5A Pending CN116415622A (en) 2023-03-28 2023-03-28 Deep neural network reasoning acceleration method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116415622A (en)

Similar Documents

Publication Publication Date Title
JP7076648B2 (en) Image processing methods and equipment, computer equipment and computer storage media
CN111357051B (en) Speech emotion recognition method, intelligent device and computer readable storage medium
CN110413812A (en) Training method, device, electronic equipment and the storage medium of neural network model
US20200175022A1 (en) Data retrieval
Zhang et al. OMCBIR: Offline mobile content-based image retrieval with lightweight CNN optimization
CN112070550A (en) Keyword determination method, device and equipment based on search platform and storage medium
US11763204B2 (en) Method and apparatus for training item coding model
WO2023236576A1 (en) Image feature processing method and apparatus, product, medium, and device
WO2023087914A1 (en) Method and apparatus for selecting recommended content, and device, storage medium and program product
JP7504192B2 (en) Method and apparatus for searching images - Patents.com
CN117648495B (en) Data pushing method and system based on cloud primary vector data
CN117370488A (en) Data processing method, device, electronic equipment and computer readable storage medium
CN113535912A (en) Text association method based on graph convolution network and attention mechanism and related equipment
CN113761933A (en) Retrieval method, retrieval device, electronic equipment and readable storage medium
CN116932935A (en) Address matching method, device, equipment, medium and program product
US12067046B2 (en) Exploration of large-scale data sets
CN116415622A (en) Deep neural network reasoning acceleration method, device, equipment and storage medium
CN116957006A (en) Training method, device, equipment, medium and program product of prediction model
CN116957041A (en) Method, device and computing equipment for compressing neural network model
CN115203378A (en) Retrieval enhancement method, system and storage medium based on pre-training language model
CN113760380B (en) Method, device, equipment and storage medium for determining running code of network model
CN113901278A (en) Data search method and device based on global multi-detection and adaptive termination
CN111858881A (en) Mass data question-answering system design method, system and electronic equipment
CN111090743A (en) Thesis recommendation method and device based on word embedding and multi-valued form concept analysis
CN113033819B (en) Heterogeneous model-based federated learning method, device and medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination