CN108960414B - Method for realizing single broadcast multiple operations based on deep learning accelerator - Google Patents

Method for realizing single broadcast multiple operations based on deep learning accelerator

Info

Publication number
CN108960414B
CN108960414B (application CN201810804165.2A)
Authority
CN
China
Prior art keywords
value
intermediate value
calculation
register
input feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810804165.2A
Other languages
Chinese (zh)
Other versions
CN108960414A (en)
Inventor
陈书明
杨超
李斌
陈海燕
扈啸
张军阳
陈伟文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN201810804165.2A
Publication of CN108960414A
Application granted
Publication of CN108960414B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F 7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F 7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation, using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F 7/50 Adding; Subtracting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F 7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F 7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation, using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F 7/52 Multiplying; Dividing
    • G06F 7/523 Multiplying only
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Mathematical Optimization (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a method for realizing single-broadcast multiple operations based on a deep learning accelerator, which comprises the following steps: configuring, for a given multiplier in the multiplier array of the accelerator, a plurality of intermediate value registers for storing intermediate results; and, during deep learning calculation, whenever an input feature value needs to be multiplied by a corresponding weight value, storing the product of the input feature value and the weight value in the corresponding intermediate value register for use in the next calculation, until the calculation of the output feature value is completed. The method is simple to implement and offers low cost, high data utilization, and low energy consumption, and it realizes single-broadcast multiple operations on a deep learning accelerator.

Description

Method for realizing single broadcast multiple operations based on deep learning accelerator
Technical Field
The invention relates to the technical field of deep learning accelerators, and in particular to a method for realizing single-broadcast multiple operations based on a deep learning accelerator.
Background
Deep neural networks (DNNs) are the foundation of artificial intelligence applications, including autonomous driving, cancer detection, computer vision, speech recognition, robotics, complex games, and more. DNNs achieve very high accuracy on artificial intelligence tasks and can even exceed human accuracy. This outstanding performance stems from their ability to extract high-level features from raw sensory data by statistical learning, obtaining an effective representation of the input space from large amounts of data; the price is the high complexity of deep learning. The number of layers in a neural network is large, currently ranging from 5 to as many as 1000, and these many layers greatly increase the required energy consumption, storage space, and computational complexity, while the embedded platforms that execute the DNN inference process face strict limits on energy consumption, computation, and storage cost. Applications such as speech recognition also impose strong latency requirements when DNN inference is performed in the cloud. How to make DNN processing efficient, improving efficiency and throughput without compromising accuracy or increasing hardware cost, has therefore become key to deploying DNNs widely in artificial intelligence systems.
The superior accuracy of DNNs comes at the cost of high computational complexity. To improve computational efficiency, a compute engine such as a GPU is usually employed, exploiting parallelism across data. However, current deep learning accelerators typically fetch a required input feature value once, discard the data after the calculation completes, and fetch the value again the next time it is needed. In deep learning algorithms, the reuse rate of input feature values in convolution operations is very high, and a single data fetch in accelerator hardware is expensive; fetching the data anew for every use and discarding it after each calculation therefore wastes a large amount of energy, fails to make full use of the data already in the calculation process, and keeps the energy consumption and cost of deep learning computation high.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: in view of the above technical problems in the prior art, the invention provides a method for realizing single-broadcast multiple operations based on a deep learning accelerator that is simple to implement and offers low cost, high data utilization, and low energy consumption.
In order to solve the above technical problems, the technical solution provided by the invention is as follows:
a method for realizing single broadcast multiple operations based on a deep learning accelerator comprises the following steps: configuring a plurality of intermediate value registers for storing intermediate results for a given multiplier in a multiplier array of an accelerator; in the calculation process of executing the deep learning, when the input characteristic value needs to be multiplied by the corresponding weight value, the calculation result of the input characteristic value and the corresponding weight value is stored into the corresponding intermediate value register for use in the next calculation until the calculation of the current output characteristic value is completed.
As a further improvement of the invention: when convolution calculation is executed, after multiplication operation is carried out on input characteristic values and corresponding weight values each time, operation results are stored in the intermediate value registers, addition operation is carried out on the multiplication operation results and the last calculation results stored in the corresponding intermediate value registers so as to complete one multiplication and addition operation, and the addition operation results are stored back to the corresponding intermediate value registers so as to be used for the next addition operation until calculation of output characteristic values is completed.
As a further improvement of the invention: the method also comprises the steps of initializing each intermediate result register to be 0 in advance, and restoring the corresponding intermediate result register to be 0 again after completing the calculation and output of a complete output characteristic.
As a further improvement of the invention: specifically, a set of intermediate value registers is configured for each multiplier to be involved in the calculation.
As a further improvement of the invention: the multipliers that compute the respective input eigenvalues share a set of intermediate value registers.
As a further improvement of the present invention, the specific steps when performing the convolution calculation are:
S1, configuring intermediate value registers R0 to Rn for each multiplier in the multiplier array, and initializing the value of each intermediate value register to 0;
S2, inputting the first input feature value X0, multiplying it by the corresponding weight value, and storing the product in intermediate value register R0;
S3, inputting the i-th input feature value Xi, multiplying it by each of the corresponding weight values, adding the products to the values stored in intermediate value registers R0 to Ri respectively, and storing each sum back in the corresponding register, where i = 1, 2, ..., n;
S4, executing step S3 in a loop until the multiply-add operations for the input feature values X0 to Xn are completed, at which point the initialization of the convolution operation is finished;
S5, completing the calculation of the first output feature value, outputting the value of intermediate value register R0, and restoring the value of R0 to 0;
S6, inputting the next input feature value Xn+1, multiplying it by the corresponding weight values, storing one of the products in intermediate value register R0, adding the remaining products to the values in intermediate value registers R1 to Rn respectively, and storing each sum back in the corresponding register;
S7, completing the calculation of an output feature value, outputting the value of the corresponding intermediate value register, and restoring the value of that register to 0;
S8, taking n as n + 1 and returning to execute steps S6 and S7 until all the output feature values have been calculated.
Compared with the prior art, the invention has the advantages that:
1. In the method for realizing single-broadcast multiple operations based on a deep learning accelerator of the invention, intermediate value registers are configured for designated multipliers, and during calculation the intermediate results produced by an input feature value are stored in these registers, so that when an intermediate result is needed later it can be read directly from its register without fetching the input feature value again. Multiple operations can therefore be carried out after a single fetch of an input feature value, realizing single-broadcast multiple operations: one transmission of an input feature value supports multiple operations. This mode reduces accesses to external input feature values, avoids repeated input of the same feature value, raises the utilization of input feature values, allows data to be streamed from memory at low cost, and achieves efficient data reuse, while simultaneously reducing control complexity and the energy consumption required by calculation, effectively improving computational efficiency.
2. In the method for realizing single-broadcast multiple operations based on a deep learning accelerator, during convolution calculation each input feature value is multiplied by its corresponding weight values, the products are stored in the intermediate value registers, and once initialization is complete each product is added to the result already held in the corresponding register. After initialization, a single transmission of an input feature value thus completes multiple operations, effectively reducing the number of times input feature values must be transmitted and achieving efficient use of the data.
3. The method for realizing single-broadcast multiple operations based on a deep learning accelerator further allows the multipliers that compute corresponding input feature values to share one group of intermediate value registers: when an output feature value is calculated, the products of several multipliers are transmitted in turn and then added, so the total number of intermediate value registers in the multiplier array is reduced, shrinking the hardware design area.
Drawings
Fig. 1 is a schematic implementation flow diagram of a method for implementing single broadcast multiple operations based on a deep learning accelerator according to the embodiment.
Fig. 2 is a schematic diagram of the structural principle of the multiplier array in the deep learning accelerator employed in the present embodiment.
Fig. 3 is a schematic diagram of a single multiplier in an embodiment of the present invention.
FIG. 4 is a schematic diagram of an embodiment of the present invention in which a plurality of multipliers share a set of intermediate value registers.
Fig. 5 is a schematic diagram illustrating the implementation principle of performing convolution operation in the embodiment of the present invention (with a convolution kernel size of 5 × 5).
Detailed Description
The invention is further described below with reference to the drawings and specific preferred embodiments, without thereby limiting the scope of protection of the invention.
As shown in Fig. 1, the method for implementing single-broadcast multiple operations based on a deep learning accelerator in this embodiment comprises: configuring, for a given multiplier in the multiplier array of the accelerator, a plurality of intermediate value registers for storing intermediate results; and, during deep learning calculation, whenever an input feature value needs to be multiplied by a corresponding weight value, storing the product in the corresponding intermediate value register for use in the next calculation, until the calculation of the current output feature value is completed. In the convolution calculations of deep learning, one input feature value must be multiplied by several weight values; in this embodiment, the input feature value is multiplied by the corresponding weight values in turn, and the intermediate results it produces are placed in the intermediate value registers for use in the subsequent calculation process.
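As a concrete illustration of the idea, the following minimal Python sketch (ours, not code from the patent) shows the single-broadcast pattern: one input feature value is fetched once and drives several multiply-add operations against local intermediate value registers. All names are illustrative.

    # Minimal sketch of single-broadcast multiple operations: one fetch of an
    # input feature value x feeds several multiply-add operations, with the
    # partial sums held in local intermediate value registers.

    def broadcast_once(x, weights, registers):
        """Multiply the single fetched value x by each weight and accumulate
        each product into the matching intermediate value register."""
        for k, w in enumerate(weights):
            registers[k] += x * w          # multiply-add, no re-fetch of x
        return registers

    regs = [0.0, 0.0, 0.0]                 # intermediate value registers, init 0
    broadcast_once(2.0, [0.5, 1.0, 1.5], regs)
    print(regs)                            # [1.0, 2.0, 3.0]: three operations, one fetch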
The convolution algorithm in deep learning requires one input feature value to be multiplied by several weight values. Exploiting this characteristic, the embodiment configures intermediate value registers for designated multipliers and stores the intermediate results produced by each input feature value in those registers during calculation. When an intermediate result is needed later, it is read directly from its register, and the operation of fetching the input feature value does not have to be executed again; multiple operations can therefore follow a single fetch, realizing single-broadcast multiple operations in which one transmission of an input feature value supports multiple operations. This mode reduces accesses to external input feature values, avoids repeatedly inputting the same value, raises the utilization of input feature values, lets data streams be called from memory at low cost, and achieves efficient data reuse, while also reducing control complexity and the energy consumption required by calculation, effectively improving calculation efficiency.
In this embodiment, when convolution calculation is performed, each time an input feature value is multiplied by a corresponding weight value the product is added to the previous calculation result stored in the corresponding intermediate value register, completing one multiply-add operation, and the sum is stored back in that register for the next addition, until the calculation of an output feature value is completed.
In this embodiment, specifically, the hardware structure configures a plurality of intermediate value registers for each multiplier in the multiplier array. The number of registers can be determined from the size of the convolution kernel: convolving the kernel with the input feature map produces a number of intermediate values, and the registers are configured according to that number. If the kernel size is a × b (a rows and b columns) and single-broadcast multiple operations are implemented row by row in the manner described above, then each row has b elements and at least b intermediate value registers are needed. A software control program is then configured so that, during operation, each input feature value that arrives is multiplied in turn by the b required weight values, realizing the multiplication of one input feature value by several weight values; the products are stored in the corresponding intermediate value registers for subsequent calculation, and each product is added to the value already in its register, so that the intermediate results of the input feature value's operations are kept in the registers until an output feature value has been completely calculated and is output from its register. Only one fetch of an input feature value is needed for the whole sequence of operations, achieving efficient use of the input feature value and reducing energy consumption. The number of intermediate value registers can be set according to actual requirements.
The multiplier array of the deep learning accelerator adopted in this embodiment is shown in Fig. 2. The array has M rows and N columns; the arrows on the left indicate the flow of input feature values, the arrows at the top indicate the flow of weight values, and the bottom indicates the flow of output feature values after the operation is completed. When the array computes, each of the M row inputs is broadcast to the N multipliers of its row; after receiving the input feature value, each multiplier multiplies it by the weight value entering from above. The M multipliers of a column correspond to the inputs of M input feature maps, and in convolution the corresponding partial multiply-add results must be summed, so after the multiplications complete, the multiply-add results within a column are added and output at the bottom. After the multiply-add, further operations such as pooling, activation, and normalization can be performed, or the results can be buffered as intermediate values.
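As a rough model of this dataflow (a sketch under our own naming, not the patent's hardware description), the row broadcast and per-column summation can be expressed as follows.

    import numpy as np

    # Sketch of the M x N multiplier array dataflow: each row's input feature
    # value is broadcast to the N multipliers of that row, each multiplier
    # scales it by the weight entering its column from above, and each
    # column's M products are summed and leave at the bottom edge.

    M, N = 4, 3
    row_inputs = np.array([1.0, 2.0, 3.0, 4.0])   # one input feature value per row
    weights = np.arange(M * N, dtype=float).reshape(M, N)

    products = row_inputs[:, None] * weights      # row-wise broadcast: M*N multiplies
    column_out = products.sum(axis=0)             # per-column summation at the bottom

    expected = np.array([sum(row_inputs[m] * weights[m, n] for m in range(M))
                         for n in range(N)])
    assert np.allclose(column_out, expected)
    print(column_out)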
In the specific embodiment shown in Fig. 3, a single multiplier of the array is provided with five intermediate value registers R0 to R4 for storing intermediate multiply-add results; the inputs of the multiplier are an input feature value and a weight value. During operation, the multiplier fetches the input feature value and the corresponding weight value and multiplies them, the product is added to the value in the corresponding intermediate value register, the sum is put back in that register, and when the accumulation is complete the value in the corresponding register is output.
In this embodiment, the method further includes initializing each intermediate value register to 0 in advance; after the calculation of a complete output feature value has been completed and the value output, the corresponding intermediate value register is restored to 0 so that it can be used for the next calculation.
In deep learning convolution, one output feature value is usually the sum of the products of several input feature maps with their corresponding weight values: groups of input feature values are multiplied by the corresponding weights and the products are summed to give the final output feature value. By letting several multipliers in the array share one group of intermediate value registers, the total number of intermediate value registers in the array can be reduced, shrinking the hardware design area. As shown in Fig. 4, in this embodiment two multiply units share a group of intermediate value registers R0 to R4; each multiply unit fetches an input feature value and the corresponding weight value, the products are passed to the add unit, the add unit performs the addition together with the value of the selected intermediate value register, and registers R0 to R4 store the intermediate multiply-add results.
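A sketch of this sharing arrangement, under our own illustrative structure following Fig. 4, might look like this: two multiply units feed one add unit, which accumulates into a single shared register group.

    # Two multiply units share one group of intermediate value registers
    # R0-R4: each unit forms its own input-by-weight product, the add unit
    # sums both products with the chosen register's stored value, and the
    # result is written back, so one register group serves both units.

    class SharedRegisterGroup:
        def __init__(self, size=5):
            self.regs = [0.0] * size       # shared intermediate value registers

        def accumulate(self, idx, product_a, product_b):
            """Add unit: combine both multipliers' products with the partial
            sum held in register idx and store the result back."""
            self.regs[idx] += product_a + product_b
            return self.regs[idx]

    group = SharedRegisterGroup()
    # Two input feature maps contributing to the same output feature value:
    pa = 3.0 * 0.5                         # multiply unit A: input 3.0, weight 0.5
    pb = 4.0 * 0.25                        # multiply unit B: input 4.0, weight 0.25
    print(group.accumulate(0, pa, pb))     # R0 now holds the shared partial sum 2.5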
In this embodiment, the specific steps when performing convolution calculation are as follows:
S1, configuring intermediate value registers R0 to Rn for each multiplier in the multiplier array, and initializing the value of each intermediate value register to 0;
S2, inputting the first input feature value X0, multiplying it by the corresponding weight value, and storing the product in intermediate value register R0;
S3, inputting the i-th input feature value Xi, multiplying it by each of the corresponding weight values, adding the products to the values stored in intermediate value registers R0 to Ri respectively, and storing each sum back in the corresponding register, where i = 1, 2, ..., n;
S4, executing step S3 in a loop until the multiply-add operations for the input feature values X0 to Xn are completed, at which point the initialization of the convolution operation is finished;
S5, completing the calculation of the first output feature value, outputting the value of intermediate value register R0, and restoring the value of R0 to 0;
S6, inputting the next input feature value Xn+1, multiplying it by the corresponding weight values, storing one of the products in intermediate value register R0, adding the remaining products to the values in intermediate value registers R1 to Rn respectively, and storing each sum back in the corresponding register;
S7, completing the calculation of an output feature value, outputting the value of the corresponding intermediate value register, and restoring the value of that register to 0;
S8, taking n as n + 1 and returning to execute steps S6 and S7 until all the output feature values have been calculated.
Through the above process, one transmission of an input feature value after initialization completes multiple operations, effectively reducing the number of times input feature values must be transmitted and achieving efficient use of the data.
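The steps S1 to S8 can be checked with a short, self-contained simulation. The sketch below is our own reading of the scheme (not code from the patent), assuming a 1-D sliding window over one kernel row in which register Rj accumulates the output y_j = sum_k w[k]·x[j+k]; the function and variable names are illustrative, and the result is verified against a directly computed convolution.

    def single_broadcast_conv(x, w):
        n = len(w) - 1                      # S1: registers R0..Rn, one per weight
        regs = [0.0] * (n + 1)
        outputs = []
        last = len(x) - len(w)              # index of the final output value
        for i, xi in enumerate(x):          # each input feature value fetched once
            for k, wk in enumerate(w):      # S2/S3/S6: broadcast xi to all weights
                j = i - k                   # product wk*xi belongs to output y_j
                if 0 <= j <= last:
                    regs[j % (n + 1)] += wk * xi   # multiply-add into register
            j_done = i - n                  # y_{i-n} is complete once x_i arrives
            if 0 <= j_done <= last:
                r = j_done % (n + 1)
                outputs.append(regs[r])     # S5/S7: output the finished value
                regs[r] = 0.0               # restore the register to 0 for reuse
        return outputs

    x = [float(v) for v in range(10)]       # X0..X9, as in the 5x5 row example
    w = [1.0, 2.0, 3.0, 4.0, 5.0]           # one row of the convolution kernel
    ref = [sum(w[k] * x[j + k] for k in range(5)) for j in range(len(x) - 4)]
    assert single_broadcast_conv(x, w) == ref
    print(single_broadcast_conv(x, w))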
Taking a convolution kernel of size 5 × 5 as an example, the multiply-add operations and intermediate value storage for one row of the kernel are shown in Fig. 5, where X0 to X9 denote input feature values and R0 to R4 denote intermediate value registers with initial values of 0. Cycles C0 to C14 correspond to Fig. 5(a), the start of the operation, and the execution cycles are as follows:
in cycle C0, the value X0 is input and multiplied by the corresponding weight value, and the product is placed in R0;
in cycles C1 to C2, the value X1 is input and multiplied by the weight values, the products are added to the values in R0 and R1, and the sums are placed back in R0 and R1 respectively;
in the same manner as C1 to C2, multiply-add operations proceed cycle by cycle until C14, completing the initialization of the convolution operation;
in the cycles that follow, one intermediate value register at a time outputs its value and is then set to 0.
Fig. 5(b) corresponds to the cycles immediately after Fig. 5(a): the value previously accumulated in R0 has been fully calculated, output, and the register reset to 0, so R0 is then used to store the product of X5 and the corresponding weight value. During cycles C15 to C19, R1 completes a full output feature value, which can then be output.
Fig. 5(c) corresponds to the cycles immediately after Fig. 5(b): the input feature value X6 requires five multiply-add operations, and when the multiply-add corresponding to R2 completes, R2 holds a finished output feature value.
Each subsequent input feature value likewise undergoes five multiply-add operations in the same manner; whenever a register completes its accumulation, its output feature value is output and the register is set to 0, until finally all the output feature values have been calculated.
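Under the same assumptions as the sketch above, the cycle pattern of Fig. 5 can be reproduced by listing which intermediate value register each multiply-add touches. The trace below is illustrative only (one multiply-add per cycle, as in the C0 to C14 description), not a description of the actual hardware schedule.

    # Trace of per-cycle register usage for one 5-wide kernel row:
    # X0 needs 1 multiply-add, X1 needs 2, ..., X4 needs 5 (C0..C14 = 15
    # initialization cycles), and from X5 on each input needs 5, with one
    # completed output feature value emitted per input.

    n = 4                                   # intermediate value registers R0..R4
    cycle = 0
    for i in range(7):                      # inputs X0..X6
        regs_touched = [j % (n + 1) for j in range(max(0, i - n), i + 1)]
        for r in regs_touched:
            print(f"C{cycle:>2}: X{i} * w -> R{r}")
            cycle += 1
        if i >= n:                          # output y_{i-n} is now complete
            print(f"      output feature value from R{(i - n) % (n + 1)}, reset to 0")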
The foregoing describes only the preferred embodiments of the invention and is not to be construed as limiting the invention in any way. Although the invention has been disclosed above with reference to preferred embodiments, it is not limited thereto. Any simple modification, equivalent change, or variation made to the above embodiments in accordance with the technical spirit of the invention, without departing from the content of the technical solution of the invention, falls within the protection scope of the technical solution of the invention.

Claims (4)

1. A method for realizing single-broadcast multiple operations based on a deep learning accelerator, characterized by comprising the following steps: configuring, for a given multiplier in the multiplier array of the accelerator, a plurality of intermediate value registers for storing intermediate results; during deep learning calculation, whenever an input feature value needs to be multiplied by a corresponding weight value, storing the product of the input feature value and the weight value in the corresponding intermediate value register for use in the next calculation, until the calculation of the output feature value is completed;
wherein, when convolution calculation is executed, each time an input feature value is multiplied by a corresponding weight value, the product is added to the previous calculation result stored in the corresponding intermediate value register to complete one multiply-add operation, and the sum is stored back in that intermediate value register for the next addition, until the calculation of the output feature value is completed;
the specific steps when performing the convolution calculation are:
S1, configuring intermediate value registers R0 to Rn for each multiplier in the multiplier array, and initializing the value of each intermediate value register to 0;
S2, inputting the first input feature value X0, multiplying it by the corresponding weight value, and storing the product in intermediate value register R0;
S3, inputting the i-th input feature value Xi, multiplying it by each of the corresponding weight values, adding the products to the values stored in intermediate value registers R0 to Ri respectively, and storing each sum back in the corresponding register, where i = 1, 2, ..., n;
S4, executing step S3 in a loop until the multiply-add operations for the input feature values X0 to Xn are completed, at which point the initialization of the convolution operation is finished;
S5, completing the calculation of the first output feature value, outputting the value of intermediate value register R0, and restoring the value of R0 to 0;
S6, inputting the next input feature value Xn+1, multiplying it by the corresponding weight values, storing one of the products in intermediate value register R0, adding the remaining products to the values in intermediate value registers R1 to Rn respectively, and storing each sum back in the corresponding register;
S7, completing the calculation of an output feature value, outputting the value of the corresponding intermediate value register, and restoring the value of that register to 0;
S8, taking n as n + 1 and returning to execute steps S6 and S7 until all the output feature values have been calculated.
2. The method for realizing single-broadcast multiple operations based on a deep learning accelerator as claimed in claim 1, further comprising initializing each intermediate value register to 0 in advance, and restoring the corresponding intermediate value register to 0 after the calculation of a complete output feature value has been completed and the value output.
3. The method for realizing single-broadcast multiple operations based on a deep learning accelerator as claimed in claim 1 or 2, wherein a set of intermediate value registers is configured for each multiplier that is to take part in the calculation.
4. The method for realizing single-broadcast multiple operations based on a deep learning accelerator as claimed in claim 1 or 2, wherein the multipliers that compute the corresponding input feature values share a set of intermediate value registers.
CN201810804165.2A 2018-07-20 2018-07-20 Method for realizing single broadcast multiple operations based on deep learning accelerator Active CN108960414B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810804165.2A CN108960414B (en) 2018-07-20 2018-07-20 Method for realizing single broadcast multiple operations based on deep learning accelerator

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810804165.2A CN108960414B (en) 2018-07-20 2018-07-20 Method for realizing single broadcast multiple operations based on deep learning accelerator

Publications (2)

Publication Number Publication Date
CN108960414A 2018-12-07
CN108960414B 2022-06-07

Family

ID=64497953

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810804165.2A Active CN108960414B (en) 2018-07-20 2018-07-20 Method for realizing single broadcast multiple operations based on deep learning accelerator

Country Status (1)

Country Link
CN (1) CN108960414B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20200082613A (en) * 2018-12-31 2020-07-08 에스케이하이닉스 주식회사 Processing system
CN111695683B (en) * 2019-03-15 2023-09-01 华邦电子股份有限公司 Memory chip capable of executing artificial intelligent operation and operation method thereof
CN110458277B (en) * 2019-04-17 2021-11-16 上海酷芯微电子有限公司 Configurable precision convolution hardware architecture suitable for deep learning hardware accelerator
CN113033798B (en) * 2019-12-24 2023-11-24 北京灵汐科技有限公司 Device and method for reducing precision loss
CN111610963B (en) * 2020-06-24 2021-08-17 上海西井信息科技有限公司 Chip structure and multiply-add calculation engine thereof

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160379109A1 (en) * 2015-06-29 2016-12-29 Microsoft Technology Licensing, Llc Convolutional neural networks on hardware accelerators
CN106844294B (en) * 2016-12-29 2019-05-03 华为机器有限公司 Convolution algorithm chip and communication equipment
CN108205702B (en) * 2017-12-29 2020-12-01 中国人民解放军国防科技大学 Parallel processing method for multi-input multi-output matrix convolution

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4489393A (en) * 1981-12-02 1984-12-18 Trw Inc. Monolithic discrete-time digital convolution circuit
CN106066783A (en) * 2016-06-02 2016-11-02 华为技术有限公司 The neutral net forward direction arithmetic hardware structure quantified based on power weight
CN106875011A (en) * 2017-01-12 2017-06-20 南京大学 The hardware structure and its calculation process of two-value weight convolutional neural networks accelerator
CN107341544A (en) * 2017-06-30 2017-11-10 清华大学 A kind of reconfigurable accelerator and its implementation based on divisible array
CN107578095A (en) * 2017-09-01 2018-01-12 中国科学院计算技术研究所 Neural computing device and the processor comprising the computing device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Parallel computing method for two-dimensional matrix convolution; Zhang Junyang et al.; Journal of Zhejiang University (Engineering Science); 2018-03-31; Vol. 53, No. 3; abstract and pp. 519-520 *
Computation optimization of convolutional neural networks exploiting parameter sparsity and its FPGA accelerator design; Liu Qinrang et al.; Journal of Electronics & Information Technology; 2018-04-09; pp. 1368-1374 *

Also Published As

Publication number Publication date
CN108960414A (en) 2018-12-07

Similar Documents

Publication Publication Date Title
CN108960414B (en) Method for realizing single broadcast multiple operations based on deep learning accelerator
US11574195B2 (en) Operation method
EP3373210B1 (en) Transposing neural network matrices in hardware
CN107862374B (en) Neural network processing system and processing method based on assembly line
EP3539059B1 (en) Performing kernel striding in hardware
CN107301456B (en) Deep neural network multi-core acceleration implementation method based on vector processor
CN107844826B (en) Neural network processing unit and processing system comprising same
CN105892989B (en) Neural network accelerator and operational method thereof
US20180174036A1 (en) Hardware Accelerator for Compressed LSTM
CN109543830B (en) Splitting accumulator for convolutional neural network accelerator
JP6826181B2 (en) Computing device and calculation method
US8694451B2 (en) Neural network system
US20210241071A1 (en) Architecture of a computer for calculating a convolution layer in a convolutional neural network
Huynh Deep neural network accelerator based on FPGA
CN109144469B (en) Pipeline structure neural network matrix operation architecture and method
CN110766128A (en) Convolution calculation unit, calculation method and neural network calculation platform
CN109919312B (en) Operation method and device of convolutional neural network and DPU
CN110716751B (en) High-parallelism computing platform, system and computing implementation method
CN111178492B (en) Computing device, related product and computing method for executing artificial neural network model
CN112836793B (en) Floating point separable convolution calculation accelerating device, system and image processing method
CN111652365B (en) Hardware architecture for accelerating Deep Q-Network algorithm and design space exploration method thereof
Elsokkary et al. CNN acceleration through flexible distribution of computations between a hardwired processor and an FPGA
CN110765413A (en) Matrix summation structure and neural network computing platform
Zernov et al. Associative methods of fuzzy operations implementation
Zholondkovskiy et al. LSTM-type Neural Network Implementation on a Processor Based on Neuromatrix and RISC Cores for Resource-Limited Applications

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant