CN111275180A - Convolution operation structure for reducing data migration and power consumption of deep neural network - Google Patents

Convolution operation structure for reducing data migration and power consumption of deep neural network

Info

Publication number
CN111275180A
CN111275180A (application number CN202010130325.7A)
Authority
CN
China
Prior art keywords
input
multiplexer
adder
multiplier
power consumption
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010130325.7A
Other languages
Chinese (zh)
Other versions
CN111275180B (en)
Inventor
娄冕
苏若皓
杨靓
崔媛媛
张海金
郭娜娜
刘思源
黄九余
田超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Microelectronics Technology Institute
Original Assignee
Xian Microelectronics Technology Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Microelectronics Technology Institute
Priority to CN202010130325.7A
Publication of CN111275180A
Application granted
Publication of CN111275180B
Legal status: Active (current)
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a convolution operation structure for reducing data migration and power consumption in deep neural networks, comprising a multiplier and an adder. The inputs of the multiplier are connected to multiplexers MUX1 and MUX2 respectively; the output of the multiplier and the output of MUX1 are connected to one input of the adder through multiplexer MUX3; the other input of the adder is connected to multiplexer MUX4. The outputs of MUX1, MUX2, the multiplier, MUX3 and MUX4, and the inputs of the adder, are each connected to register reg1; the output of the adder is connected to register reg2, and the output of reg2 is connected to the input of MUX4, so as to implement the multiply-accumulate operation of the convolution. The structure is suitable for all current convolutional neural network models, effectively reduces the dynamic power consumption of the global computation while maximally preserving data parallelism, and has a simple control structure and strong generality.

Description

Convolution operation structure for reducing data migration and power consumption of deep neural network
Technical Field
The invention belongs to the technical field of integrated circuit design and dedicated hardware accelerators, and particularly relates to a convolution operation structure for reducing data migration and power consumption in deep neural networks.
Background
In recent years, with major breakthroughs of deep neural networks (DNNs) in speech and image recognition, they have become the basis of many modern artificial intelligence applications. The strong performance of DNNs comes from their ability to perform high-level feature extraction on large amounts of data and obtain efficient representations of data of the same type. One common form of DNN is the convolutional neural network (CNN), whose body consists of a number of convolutional layers, each being a higher-dimensional abstraction of the input feature map (ifmap). The number of convolutional layers in CNNs has grown from the early 2-layer LeNet to the present 53-layer ResNet, and the number of multiply-accumulate (MAC) operations has increased from 341K to 3.9G. The improved performance of neural networks therefore comes mainly at the cost of rapidly growing computation, which must be carried out by an efficient hardware platform. A traditional CPU is limited by its compute resources and cannot provide sufficient data throughput; a GPU can guarantee real-time computational output thanks to its huge compute array, but its high power consumption makes wide deployment difficult. The emergence and successful application of dedicated acceleration engines, represented by Cambricon's DianNao series, provides an effective solution for efficient artificial-intelligence computation and has become an important research field.
Currently, the main objective in designing dedicated acceleration engines is to reuse the feature map data (ifmap) and the convolution kernel weights (weight) as much as possible, which reduces frequent accesses to lower levels of the storage hierarchy (such as on-chip cache and off-chip DRAM) and effectively lowers access latency and power consumption. There are three classical data-reuse schemes. (1) Weight-stationary: each processing element (PE) holds one convolution kernel weight; every datum of the feature map is broadcast to all PEs cycle by cycle, multiplied by the weight inside the PE, and then sent to the next PE for accumulation; after several cycles, the last PE streams out the convolution results one by one. This scheme keeps the weights in registers inside the PEs, avoiding the latency and power of weight accesses, but it must frequently migrate partial sums between PEs, and the whole PE array outputs only one convolution result per cycle. (2) Partial-sum-stationary: unlike the weight-stationary scheme, where results flow between PEs, this scheme fixes each convolution output to one PE, and the feature data can be shared with horizontally and vertically adjacent PEs. It mainly reduces the power consumed by moving partial sums between PEs after multiplication, but it does not maximize the reuse of the feature data and does not consider weight reuse at the PE level. (3) Row-stationary: each PE reads one row of the input feature map and one row of the convolution kernel; unlike the first two schemes, which compute the output feature map data one by one, each PE produces intermediate results for several output feature map data in parallel per cycle. However, because feature map sizes differ widely between applications, this architecture must read in complete feature map information row by row, so the flexibility and adaptability of the storage structure inside the PE array are poor.
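To make the weight-stationary scheme (1) above concrete, the following one-dimensional Python sketch is given; the 1-D simplification and all names are assumptions made here for illustration, not a cited design. Each PE holds one fixed weight, every input datum is broadcast to all PEs, and the partial sums flow along the PE chain, so once the pipeline is full the last PE emits one convolution result per cycle.

```python
# Minimal 1-D weight-stationary sketch (illustrative only).
# Each PE holds one weight; inputs are broadcast; partial sums flow down the chain.
def weight_stationary_conv1d(feature, kernel):
    num_pe = len(kernel)                  # one PE per kernel weight
    psum = [0] * num_pe                   # partial sums held between PEs
    outputs = []
    for cycle, x in enumerate(feature):   # one feature datum broadcast per cycle
        for p in range(num_pe - 1, 0, -1):
            psum[p] = psum[p - 1]         # partial sums shift toward the last PE
        psum[0] = 0
        for p in range(num_pe):           # every PE multiplies the broadcast datum
            psum[p] += kernel[p] * x      # by its locally stored weight
        if cycle >= num_pe - 1:           # pipeline full: last PE streams results
            outputs.append(psum[-1])
    return outputs

print(weight_stationary_conv1d([1, 2, 3, 4, 5], [1, 0, -1]))   # [-2, -2, -2]
```

The sketch also shows the drawback noted above: the whole chain produces only one result per cycle, and the partial sums migrate between PEs every cycle.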
These different hardware acceleration engine schemes all aim to reduce repeated data accesses and strive to compress the huge power consumption incurred by the multiply-accumulate operations of the theoretical computation; however, each of them exploits reuse from only a single type of element, such as the weights or the input feature map, so the computational efficiency cannot be improved further.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a convolution operation structure for reducing data migration and power consumption in deep neural networks, which effectively combines the reusability of the two computation elements (weights and input feature data), greatly reduces access, computation and power overheads, and has strong application value.
The invention adopts the following technical scheme:
the convolution operation structure comprises a multiplier and an adder, wherein the input end of the multiplier is respectively connected with a multi-channel check MUX1 and a multi-channel check MUX2, the output end of the multiplier and the output end of the multi-channel check MUX1 are connected with the input end of the adder through the multi-channel check MUX3, the input end of the adder is further connected with the input end of the multi-channel check MUX4, the output ends of the multi-channel check MUX1, the multi-channel check MUX2, the multiplier, the multi-channel check MUX3, the multi-channel check MUX4 and the input end of the adder are respectively connected with a register reg1, the output end of the adder is connected with the register reg2, and the output end of the register reg2 is connected with the input end of the multi-channel check MUX4 and used for achieving multiplication.
Specifically, in the input direction, the input feature data f_xy and the feature data f_reuse reusable between PEs pass through multiplexer MUX1 and feed one input of the multiplier, while the input weight w_ij and the weight w_reuse multiplexed between PEs pass through multiplexer MUX2 and feed the other input of the multiplier.
Specifically, the output of the multiplier and the output of MUX1 pass through multiplexer MUX3 and feed one input of the adder; MUX3 is configured so that, when the input feature datum is 0, the multiplication result is 0, the multiplier is not activated, and the 0 value is bypassed directly to the adder.
Further, the other input of the adder is provided by multiplexer MUX4, whose inputs are the system initial value 0 and the PE output result inter_psum of the previous cycle. When the PE performs the first operation of each convolution, the initial value 0 is selected for the addition; thereafter inter_psum is selected to complete the accumulation.
Specifically, in the input direction, the control signals provide the selection signals of MUX1/MUX2/MUX3/MUX4 and the enable signals of the multiplier, the adder, and registers reg1 and reg2.
Further, in the output direction, registers reg1 and reg2 latch the input signature data and input weights, respectively.
Further, the number N_ifmap_opt of accesses of the input feature data f_xy is calculated by a formula that appears only as an image in the original document, where ⌊·⌋ denotes the floor (round-down) operation, R_iw and S_iw are the height and width of the convolution kernel, H_if and W_if are the height and width of the input feature map, and ΔT is the step size.
Compared with the prior art, the invention has at least the following beneficial effects:
the invention relates to a convolution operation structure for reducing data migration and power consumption of a deep neural network, which provides a hardware form for data multiplexing inside a PE array and among the PE arrays, and can realize the purpose of reducing data migration and dynamic power consumption on the premise of less resource expenditure by adding MUX (multiplexer) of characteristic data and weight in the input direction of the traditional PE and adding latch of the characteristic data and the weight in the output direction.
Further, by providing multiplexers MUX1 and MUX2 in the input direction, a second data-reuse channel is provided for each of the two operands of the multiplier. MUX1 can pass not only the original input feature data to the multiplier but also input feature data multiplexed from adjacent PEs; likewise, MUX2 can pass not only the original weights to the multiplier but also weights that have already been used, multiplexed once more. This selection structure in the input direction converts accesses to the low-level memory into data exchange between PEs, effectively reducing the dynamic power consumption caused by moving data across storage levels.
Further, for the multiplier, when one operand is 0 the result can be set to 0 directly without performing the multiplication, which reduces the dynamic power consumption of the multiplier; the role of MUX3 is to feed the input feature datum directly to the adder as the multiplier output when that datum is 0. Meanwhile, since the adder must complete the accumulation of partial sums, MUX4 provides the other operand of the adder, which is either the initial value 0 for the first operation or the latched historical addition result for subsequent additions, avoiding the power overhead of repeatedly writing the historical result to and reading it from the low-level memory.
Further, the control signals act in three ways. First, by controlling MUX1/MUX2/MUX3/MUX4, the operand sources of the multiplier and the adder are precisely controlled, so that the input feature data and weights already stored in the PE array are reused to the maximum extent. Second, by controlling the enables of the multiplier and the adder, they are switched off when no multiplication or addition is needed, saving the power of the arithmetic units. Finally, by controlling the enables of registers reg1 and reg2, the moments at which the input feature data and the addition result are latched can be controlled precisely, guaranteeing correct operand inputs for the multiplier and the adder in the next cycle.
Furthermore, an optimal computation mode under this structure is given: with a PE array of the same size as the convolution kernel, traversing the input feature map first vertically and then horizontally achieves the maximum degree of data reuse under this structure.
In conclusion, the method is suitable for all current convolutional neural network models, effectively reduces the dynamic power consumption of the global computation while maximally preserving data parallelism, and has a simple control structure and strong generality.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
FIG. 1 is a basic schematic diagram of a convolution operation;
FIG. 2 is a conventional calculation flow of convolution operations;
FIG. 3 is a flow chart of the optimization calculation proposed by the present invention;
FIG. 4 illustrates the variation trend of the access compression rate of the input feature data after optimization;
FIG. 5 shows a PE structure according to the present invention;
FIG. 6 is a diagram of the interconnection of PE arrays according to the present invention.
Detailed Description
The invention provides a convolution operation structure for reducing data migration and power consumption in deep neural networks. The mainstream data-reuse schemes for convolution computation are analyzed in depth, and a data-reuse method combining the weights and the input feature map is proposed. The computation process of the method is described qualitatively, a quantitative evaluation formula for the access compression rate of the feature data is given to confirm the validity of the scheme, and finally a concrete hardware implementation is presented. The invention reduces data migration between PEs and lowers power consumption without adding excessive hardware resources, and the scheme has good scalability, so it has high application value in deep neural networks dominated by convolution operations.
Referring to FIG. 1, the height, width and depth of the input feature map (input fmap) are H_if, W_if and C, and the height, width and depth of the convolution kernel (filter, or weight) are R_iw, S_iw and C. Starting from the corner of the input feature map, a region of the same size as the convolution kernel is selected and multiply-accumulated element by element; the window then slides with step ΔT and the same operation is repeated. The element O_00 of the output feature map is:

O_00 = Σ_{i=0}^{R_iw−1} Σ_{j=0}^{S_iw−1} f_ij · w_ij   (when the depth C is greater than 1, the sum is also taken over the C channels)
where f is a data element of the input feature map, w is a weight in the convolution kernel, and the subscripts denote the vertical and horizontal coordinates of the element;
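The sliding-window multiply-accumulate just defined can be written out directly. The short sketch below is illustrative only (the function and variable names are ours); it shows how each output element, O_00 included, is produced by multiply-accumulating one kernel-sized region and then sliding by the step ΔT.

```python
# Minimal sliding-window convolution sketch following the definition of O_00 above.
# f: input feature map (H_if x W_if), w: convolution kernel (R_iw x S_iw), dt: step ΔT.
def conv2d(f, w, dt=1):
    H_if, W_if = len(f), len(f[0])
    R_iw, S_iw = len(w), len(w[0])
    out = []
    for r in range(0, H_if - R_iw + 1, dt):        # slide vertically
        row = []
        for c in range(0, W_if - S_iw + 1, dt):    # slide horizontally
            acc = 0                                # multiply-accumulate one window
            for i in range(R_iw):
                for j in range(S_iw):
                    acc += f[r + i][c + j] * w[i][j]
            row.append(acc)
        out.append(row)
    return out

# 5x5 feature map, 3x3 kernel, step 1 -> 3x3 output feature map; out[0][0] is O_00
f = [[x + 5 * y for x in range(5)] for y in range(5)]
w = [[1, 0, -1]] * 3
print(conv2d(f, w))
```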
referring to fig. 2, generally, the convolution operation employs a data parallel method, that is, a plurality of PE computing units are used to perform data multiply-add operation simultaneously, so as to reduce the time consumption of the whole operation and ensure real-time performance.
The design structure provided by the invention adopts a number of PEs equal to the size of the convolution kernel, i.e., each PE unit executes one convolution operation in parallel with the others. FIG. 2 shows the operation sequence over 12 cycles (cycle1–cycle12) with 9 PE units (PE_0–PE_8), where the step ΔT is 1. In each cycle all PE units share the same weight w; for example, in cycle1 PE_0 executes f_00*w_00 and PE_8 executes f_22*w_00. After the 9th cycle (cycle9) finishes, one parallel convolution of the PE array is complete; thereafter, starting from cycle10, the convolution kernel weights are broadcast to each PE unit again to start a new round of convolution.
The number N_ifmap_ini of accesses of the input feature data to the underlying memory is calculated as:

N_ifmap_ini = R_iw · S_iw · (⌊(W_if − S_iw)/ΔT⌋ + 1) · (⌊(H_if − R_iw)/ΔT⌋ + 1)   (1)

where ⌊(W_if − S_iw)/ΔT⌋ + 1 and ⌊(H_if − R_iw)/ΔT⌋ + 1 respectively represent the numbers of sliding positions of the convolution kernel in the horizontal and vertical directions of the input feature map, with the results rounded down.
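A small helper can evaluate equation (1) for typical layer sizes; the function name, the default step and the example dimensions below are illustrative assumptions, not part of the patent.

```python
from math import floor

# Access count of the input feature data in the conventional flow, following
# equation (1) above: every window position re-reads its R_iw x S_iw elements.
def n_ifmap_ini(H_if, W_if, R_iw, S_iw, dT=1):
    slides_h = floor((W_if - S_iw) / dT) + 1   # horizontal sliding positions
    slides_v = floor((H_if - R_iw) / dT) + 1   # vertical sliding positions
    return R_iw * S_iw * slides_h * slides_v

print(n_ifmap_ini(H_if=224, W_if=224, R_iw=3, S_iw=3))   # 9 * 222 * 222 = 443556
```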
Through the calculation flow chart of fig. 2, it can be found that there is a multiplexing situation of the input feature data on the basis of weight sharing, as shown in fig. 3. The optimization calculation method comprises the following steps:
The f_10 used by PE_1 in cycle1 can be shifted left to PE_0 in cycle2, and the same applies to the remaining PE units. Through this reuse of the input feature data, information is exchanged directly between PE units instead of being fetched from a lower storage level with larger latency and power consumption, which reduces data migration across storage levels and saves power.
Similarly, in cycles 4 to 6, PE_0 to PE_2 can reuse the feature data used by PE_3 to PE_5 in cycles 1 to 3, following the data flow marked by the arrows in FIG. 3.
Using too few PEs increases computation time and greatly reduces data reusability; using too many PEs shortens computation time thanks to the larger compute capability, but the power overhead rises sharply.
The invention adopts a number of PEs equal to the size of the convolution kernel; since convolution kernels in deep neural networks are generally small, this balances computation time and power consumption. The feature data are multiplexed while the convolution kernel slides over the feature map first vertically and then horizontally because, as can be seen from the process of FIG. 3, each PE obtains its feature data from the PE adjacent to it on the right, following the vertical direction of the feature map, and new feature data enter the array only through the rightmost PE (PE_8 in the figure). As a result, each input feature datum needs to be brought into the PE array from the low-level memory only once, and the reuse of the input feature data is greatly improved.
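This fetch-once behaviour can be checked with a small enumeration. The sketch below is illustrative only: it assumes the 9 PEs cover a 3×3 block of output positions in column-major order and that the kernel weights are broadcast column by column (vertical then horizontal); this is one plausible reading of FIG. 2 and FIG. 3, not the patent's exact indexing.

```python
# Count feature-data uses vs. unique elements in one 9-cycle parallel round (3x3 kernel).
# Assumed indexing: PE p computes output (p % 3, p // 3); in cycle t the broadcast
# weight is w[t % 3][t // 3] (column-major, i.e. vertical-then-horizontal order).
R = S = 3
uses = []                                  # feature coordinate touched by each MAC
for t in range(R * S):                     # 9 cycles
    wi, wj = t % R, t // R
    for p in range(R * S):                 # 9 PE units
        orow, ocol = p % R, p // R
        uses.append((orow + wi, ocol + wj))

total_uses = len(uses)                     # 81 multiply-accumulate operands
unique_elems = len(set(uses))              # 25 distinct feature elements
print(total_uses, unique_elems)
```

Under these assumptions, the 81 operand uses of one round touch only 25 distinct feature elements, so if every element enters the array once and is then passed between PEs, most low-level memory accesses are replaced by inter-PE transfers.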
Building on the reuse of the feature data, the reusability of the weights is examined next. As can be seen from FIG. 3, starting from cycle10 the weights are broadcast repeatedly to all PEs again beginning with w_00; since the convolution kernel is small, a small register set can be placed in the PE array to latch the weights.
After the method is adopted, the number N_ifmap_opt of accesses of the input feature data to the underlying memory is given by equation (2), which appears only as an image in the original document; ⌊·⌋ denotes the floor (round-down) operation.
Dividing equation (2) by equation (1), selecting convolution kernel and feature map sizes commonly used in deep neural networks, and setting the step ΔT to 1, the variation trend of the access compression rate of the input feature data is obtained as shown in FIG. 4.
As can be seen from the figure, when the feature map size is fixed, the compression rate decreases as the convolution kernel becomes larger, because a larger kernel enlarges the scale of each convolution operation and thus raises the overall degree of reuse. Likewise, when the convolution kernel size is fixed, the compression rate also decreases as the feature map becomes larger, because the kernel slides more times in each direction and the degree of data reuse rises accordingly. In both trends the compression rate tends to saturate, because as the computation scale grows the proportion of reused data approaches its limit, so the curve flattens.
The power consumption of data migration differs greatly between storage levels: obtaining data within the PE array costs 1–2 orders of magnitude less power than obtaining it from off-chip DRAM. The compression rate therefore also reflects the proportion of off-chip DRAM accesses converted into inter-PE accesses, indirectly verifying the effect of the scheme in reducing global dynamic power consumption.
Referring to FIG. 5, the convolution operation structure of the invention for reducing data migration and power consumption in deep neural networks uses a multiplier and an adder as core components to implement the multiply-accumulate operations of the convolution.
The inputs of the multiplier are connected to multiplexers MUX1 and MUX2 respectively; the output of the multiplier and the output of MUX1 are connected to one input of the adder through multiplexer MUX3; the other input of the adder is connected to multiplexer MUX4. The outputs of MUX1, MUX2, the multiplier, MUX3 and MUX4, and the inputs of the adder, are each connected to register reg1; the output of the adder is connected to register reg2, and the output of reg2 is connected to the input of MUX4.
In the input direction, the input feature data f_xy and the feature data f_reuse reusable between PEs pass through multiplexer MUX1 and feed one input of the multiplier, while the input weight w_ij and the weight w_reuse multiplexed between PEs pass through multiplexer MUX2 and feed the other input of the multiplier.
The output of the multiplier and the output of MUX1 pass through multiplexer MUX3 and feed one input of the adder. The role of MUX3 is that, when the input feature datum is 0, the multiplication result is still 0, so the 0 value can be fed directly to the adder without activating the multiplier.
The other input of the adder is provided by multiplexer MUX4, whose inputs are the system initial value "0" and the PE output result inter_psum of the previous cycle. Only when the PE performs the first operation of each convolution is the initial value "0" selected for the addition; thereafter inter_psum is selected to complete the accumulation.
In the input direction, the control signals provide, on the one hand, the selection signals of MUX1/MUX2/MUX3/MUX4 and, on the other hand, the enable signals of the multiplier, the adder, and registers reg1 and reg2, achieving low-power control and precise control of data-reuse correctness.
In the output direction, registers reg1 and reg2 latch the input feature data and the input weights respectively, so that the information can be passed to adjacent PEs at the right moment, reducing data migration and power consumption.
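The datapath described above can be summarized in a cycle-level behavioural sketch. The Python model below is written here for illustration only: the signal names mirror the description, but the coding style is ours, reg2 is modelled as holding the accumulated partial sum inter_psum (following claim 1), and the weight latch mentioned for the output direction is omitted for brevity.

```python
# Behavioural sketch of one PE cycle, mirroring the MUX1-MUX4 datapath described above.
class PE:
    def __init__(self):
        self.reg1 = 0      # latches the input feature datum (exported as f_reuse)
        self.reg2 = 0      # latches the addition result (exported as inter_psum)

    def cycle(self, f_xy, f_reuse, w_ij, w_reuse, sel_mux1, sel_mux2, first_op):
        # MUX1: fresh feature datum, or datum reused from a neighbouring PE
        f = f_xy if sel_mux1 == 0 else f_reuse
        # MUX2: fresh weight, or weight already multiplexed between PEs
        w = w_ij if sel_mux2 == 0 else w_reuse
        # MUX3: a zero feature datum gives a zero product, so the multiplier is
        # not activated and the 0 value is bypassed straight to the adder
        product = 0 if f == 0 else f * w
        # MUX4: the first operation of a convolution starts from the initial value 0,
        # later operations accumulate onto the latched inter_psum
        addend = 0 if first_op else self.reg2
        result = product + addend
        self.reg1 = f          # a neighbour can pick this up as its f_reuse
        self.reg2 = result     # kept as inter_psum for the next cycle
        return result

pe = PE()
pe.cycle(f_xy=3, f_reuse=0, w_ij=2, w_reuse=0, sel_mux1=0, sel_mux2=0, first_op=True)
pe.cycle(f_xy=0, f_reuse=4, w_ij=0, w_reuse=2, sel_mux1=1, sel_mux2=1, first_op=False)
print(pe.reg2)   # 3*2 + 4*2 = 14 accumulated over two cycles
```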
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIG. 6, taking PE_0 and PE_1 as an example: PE_0 and PE_1 receive the same control signals and weight w_ij, while each PE receives different input feature data f_xy. In this structure, to realize weight reuse, a small-capacity FIFO is integrated to store the convolution kernel weights broadcast on the weight bus; after one convolution operation finishes, the FIFO, instead of the weight bus, broadcasts the weights to each PE unit.
To realize reuse of the input feature data, a multiplexer MUX5 is added between the PEs. Under the control bus and following FIG. 3, MUX5 selects whether the data reused by PE_0 comes from the f_reuse of PE_1 or of PE_3.
Although FIG. 6 only describes the interconnection between two adjacent PEs, the structure can be extended to the entire PE array.
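The inter-PE wiring of FIG. 6 can be sketched in the same behavioural style. In the illustrative snippet below, weight_fifo, weight_for_cycle and mux5 are hypothetical names standing in for the small-capacity FIFO and the multiplexer MUX5 described above.

```python
from collections import deque

# Sketch of the FIG. 6 interconnect (illustrative only): a small FIFO re-broadcasts
# the kernel weights after the first convolution, and MUX5 selects the source of
# the reused feature datum.
weight_fifo = deque()              # small-capacity FIFO integrated in the PE array

def weight_for_cycle(w_on_bus, first_round):
    """First round: take the weight from the weight bus and record it in the FIFO.
    Later rounds: the FIFO, not the weight bus, re-broadcasts the stored weights."""
    if first_round:
        weight_fifo.append(w_on_bus)
        return w_on_bus
    w = weight_fifo.popleft()
    weight_fifo.append(w)          # rotate so the kernel can be replayed again
    return w

def mux5(sel, f_reuse_pe1, f_reuse_pe3):
    """MUX5 between PEs: under the control bus, choose whether the datum reused by
    PE_0 comes from the f_reuse of PE_1 or of PE_3 (cf. FIG. 3)."""
    return f_reuse_pe1 if sel == 0 else f_reuse_pe3

print(weight_for_cycle(5, first_round=True))      # 5, taken from the weight bus
print(weight_for_cycle(None, first_round=False))  # 5, replayed from the FIFO
print(mux5(0, f_reuse_pe1=7, f_reuse_pe3=9))      # 7, reused from PE_1
```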
In conclusion, the convolution operation structure of the invention for reducing data migration and power consumption in deep neural networks has been realized as a neural-network coprocessor prototype by model building. The coprocessor is organized as a compute array, adopts a partial-sum-stationary scheme for data reuse, and effectively reduces the data migration rate and the dynamic power consumption by reusing feature data and weights within the array. The proposed structure guarantees the correctness of the convolution results while significantly reducing the number of accesses to the low-level memory, thereby improving the energy efficiency of the deep neural network; it has high practical value and generality.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above-mentioned contents are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modification made on the basis of the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims (7)

1. A convolution operation structure for reducing data migration and power consumption of a deep neural network, characterized by comprising a multiplier and an adder, wherein the inputs of the multiplier are connected to multiplexers MUX1 and MUX2 respectively; the output of the multiplier and the output of MUX1 are connected to one input of the adder through multiplexer MUX3; the other input of the adder is connected to multiplexer MUX4; the outputs of MUX1, MUX2, the multiplier, MUX3 and MUX4, and the inputs of the adder, are each connected to register reg1; the output of the adder is connected to register reg2; and the output of reg2 is connected to the input of MUX4, so as to implement the multiply-accumulate operation of the convolution.
2. The convolution operation structure for reducing data migration and power consumption of a deep neural network according to claim 1, characterized in that, in the input direction, the input feature data f_xy and the feature data f_reuse reusable between PEs pass through multiplexer MUX1 and feed one input of the multiplier, while the input weight w_ij and the weight w_reuse multiplexed between PEs pass through multiplexer MUX2 and feed the other input of the multiplier.
3. The convolution operation structure for reducing data migration and power consumption of a deep neural network according to claim 1, characterized in that the output of the multiplier and the output of MUX1 pass through multiplexer MUX3 and feed one input of the adder, and MUX3 is configured so that, when the input feature datum is 0, the multiplication result is 0, the multiplier is not activated, and the 0 value is bypassed to the adder.
4. The convolution operation structure for reducing data migration and power consumption of a deep neural network according to claim 3, characterized in that the other input of the adder is provided by multiplexer MUX4, whose inputs are the system initial value 0 and the PE output result inter_psum of the previous cycle; when the PE performs the first operation of each convolution, the initial value 0 is selected for the addition, and thereafter inter_psum is selected to complete the accumulation.
5. The convolution operation structure for reducing data migration and power consumption of a deep neural network according to claim 1, characterized in that, in the input direction, the control signals provide the selection signals of MUX1/MUX2/MUX3/MUX4 and the enable signals of the multiplier, the adder, and registers reg1 and reg2.
6. The convolution operation structure for reducing data migration and power consumption of a deep neural network according to claim 5, characterized in that, in the output direction, registers reg1 and reg2 latch the input feature data and the input weights respectively.
7. The convolution operation structure for reducing data migration and power consumption of a deep neural network according to claim 6, characterized in that the number N_ifmap_opt of accesses of the input feature data f_xy is calculated by a formula that appears only as an image in the original document, where ⌊·⌋ denotes the floor (round-down) operation, R_iw and S_iw are the height and width of the convolution kernel, H_if and W_if are the height and width of the input feature map, and ΔT is the step size.
CN202010130325.7A 2020-02-28 2020-02-28 Convolution operation structure for reducing data migration and power consumption of deep neural network Active CN111275180B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010130325.7A CN111275180B (en) 2020-02-28 2020-02-28 Convolution operation structure for reducing data migration and power consumption of deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010130325.7A CN111275180B (en) 2020-02-28 2020-02-28 Convolution operation structure for reducing data migration and power consumption of deep neural network

Publications (2)

Publication Number Publication Date
CN111275180A true CN111275180A (en) 2020-06-12
CN111275180B CN111275180B (en) 2023-04-07

Family

ID=70999267

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010130325.7A Active CN111275180B (en) 2020-02-28 2020-02-28 Convolution operation structure for reducing data migration and power consumption of deep neural network

Country Status (1)

Country Link
CN (1) CN111275180B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109190756A (en) * 2018-09-10 2019-01-11 中国科学院计算技术研究所 Arithmetic unit based on Winograd convolution and the neural network processor comprising the device
KR20190065144A (en) * 2017-12-01 2019-06-11 한국전자통신연구원 Processing element and operating method thereof in neural network
CN110659014A (en) * 2018-06-29 2020-01-07 赛灵思公司 Multiplier and neural network computing platform

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190065144A (en) * 2017-12-01 2019-06-11 한국전자통신연구원 Processing element and operating method thereof in neural network
CN110659014A (en) * 2018-06-29 2020-01-07 赛灵思公司 Multiplier and neural network computing platform
CN109190756A (en) * 2018-09-10 2019-01-11 中国科学院计算技术研究所 Arithmetic unit based on Winograd convolution and the neural network processor comprising the device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Junyang; Guo Yang: "Design and Implementation of Two-Dimensional Matrix Convolution in a Vector Processor" *

Also Published As

Publication number Publication date
CN111275180B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN106940815B (en) Programmable convolutional neural network coprocessor IP core
CN110378468B (en) Neural network accelerator based on structured pruning and low bit quantization
CN108805266B (en) Reconfigurable CNN high-concurrency convolution accelerator
Liang et al. Evaluating fast algorithms for convolutional neural networks on FPGAs
CN111459877B (en) Winograd YOLOv2 target detection model method based on FPGA acceleration
KR102443546B1 (en) matrix multiplier
CN108241890B (en) Reconfigurable neural network acceleration method and architecture
CN108205701B (en) System and method for executing convolution calculation
CN103984560B (en) Based on extensive coarseness imbedded reconfigurable system and its processing method
CN102122275A (en) Configurable processor
CN111414994B (en) FPGA-based Yolov3 network computing acceleration system and acceleration method thereof
CN111105023B (en) Data stream reconstruction method and reconfigurable data stream processor
CN110851779B (en) Systolic array architecture for sparse matrix operations
Que et al. Optimizing reconfigurable recurrent neural networks
CN110766128A (en) Convolution calculation unit, calculation method and neural network calculation platform
CN109144469A (en) Pipeline organization neural network matrix operation framework and method
Que et al. Recurrent neural networks with column-wise matrix–vector multiplication on FPGAs
CN110414672B (en) Convolution operation method, device and system
CN116710912A (en) Matrix multiplier and control method thereof
Xie et al. High throughput CNN accelerator design based on FPGA
Shrivastava et al. A survey of hardware architectures for generative adversarial networks
Huang et al. A high performance multi-bit-width booth vector systolic accelerator for NAS optimized deep learning neural networks
CN111275180B (en) Convolution operation structure for reducing data migration and power consumption of deep neural network
Shang et al. LACS: A high-computational-efficiency accelerator for CNNs
CN109948787B (en) Arithmetic device, chip and method for neural network convolution layer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant