CN111275180B - Convolution operation structure for reducing data migration and power consumption of deep neural network - Google Patents

Convolution operation structure for reducing data migration and power consumption of deep neural network

Info

Publication number
CN111275180B
Authority
CN
China
Prior art keywords
input
adder
multiplier
multiplexer
power consumption
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010130325.7A
Other languages
Chinese (zh)
Other versions
CN111275180A (en)
Inventor
娄冕
苏若皓
杨靓
崔媛媛
张海金
郭娜娜
刘思源
黄九余
田超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Microelectronics Technology Institute
Original Assignee
Xian Microelectronics Technology Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Microelectronics Technology Institute filed Critical Xian Microelectronics Technology Institute
Priority to CN202010130325.7A priority Critical patent/CN111275180B/en
Publication of CN111275180A publication Critical patent/CN111275180A/en
Application granted granted Critical
Publication of CN111275180B publication Critical patent/CN111275180B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a convolution operation structure for reducing data migration and power consumption in deep neural networks. The structure comprises a multiplier and an adder: the input ends of the multiplier are connected to multiplexers MUX1 and MUX2, respectively; the output end of the multiplier and the output end of MUX1 are connected to the input end of the adder through multiplexer MUX3; and the other input end of the adder is connected to multiplexer MUX4. The outputs of MUX1, MUX2, the multiplier, MUX3 and MUX4 and the input ends of the adder are each connected to a register reg1, the output end of the adder is connected to register reg2, and the output end of reg2 is connected to the input end of MUX4, thereby realizing the multiply-accumulate operations of the convolution. The structure is applicable to all current convolutional neural network models, effectively reduces the dynamic power consumption of the global computation while preserving data parallelism to the maximum extent, and has a simple control structure and strong generality.

Description

Convolution operation structure for reducing data migration and power consumption of deep neural network
Technical Field
The invention belongs to the technical field of integrated circuit design and special hardware accelerators, and particularly relates to a convolution operation structure for reducing deep neural network data migration and power consumption.
Background
In recent years, with important breakthroughs of Deep Neural Networks (DNNs) in speech and image recognition, they have become the basis of many modern artificial-intelligence applications. The superior performance of DNNs comes from their ability to perform high-level feature extraction on large amounts of data and obtain effective representations of data of the same type. One common form of DNN is the Convolutional Neural Network (CNN), whose body is composed of a number of convolutional layers, each of which is a higher-dimensional abstraction of the input feature map (ifmap). The number of convolutional layers in CNNs has grown from the early 2-layer LeNet to the present 53-layer ResNet, and the corresponding multiply-accumulate (MAC) count has increased from 341K to 3.9G. The improvement in network performance therefore comes mainly at the cost of a rapid increase in computation, which must be completed by an efficient hardware platform. A conventional CPU is limited by its computing resources and cannot provide sufficient data throughput; a GPU can guarantee real-time computation output by virtue of a huge computing array, but its high power consumption makes wide deployment difficult. The advent and successful application of special-purpose acceleration engines, represented by Cambricon's DianNao series, provides an effective solution for efficient artificial-intelligence computation and has become an important research field.
Currently, the main objective in designing a special-purpose acceleration engine is to reuse the feature-map data (ifmap) and the convolution-kernel weights (weight) as much as possible, so that frequent accesses to the lower levels of the memory hierarchy (such as on-chip cache and off-chip DRAM) are reduced and the access latency and power overhead are effectively lowered. There are three classical data-reuse schemes. (1) Weight-stationary: each processing element (PE) permanently stores one convolution-kernel weight; every datum of the feature map is broadcast to the whole PE array cycle by cycle, multiplied by the weight inside a PE and then passed to the next PE for accumulation; after several cycles, the last PE of the pipeline outputs the convolution results one by one. This structure keeps the weight in a register inside the PE, which avoids the latency and power overhead of weight accesses, but partial sums must be moved frequently between PEs, and the whole PE array can output only one convolution result per cycle. (2) Partial-sum-stationary: unlike the weight-stationary scheme, where results flow through the array, this scheme requires each convolution result to be produced by one fixed PE, while feature data can be reused by horizontally and vertically adjacent PEs. It mainly reduces the power consumed by moving partial sums between PEs after multiplication; however, it does not reach the maximum reuse of the feature data, and the reusability of the weights at the PE level is not considered. (3) Row-stationary: each PE reads one row of the input feature map and one row of the convolution kernel. Unlike the first two schemes, which compute the output feature map element by element, each PE generates intermediate results for several output elements per cycle in parallel; but because the feature-map sizes of different applications vary widely, the architecture must read in complete feature-map rows, so the flexibility and adaptability of the storage structure inside the PE array are poor.
These hardware-acceleration schemes all aim to reduce repeated data accesses and to compress the huge power overhead incurred by the multiply-accumulate operations of the theoretical computation; however, each of them exploits reusability from only a single element, either the weights or the input feature map, so the computing efficiency cannot be improved further.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a convolution operation structure for reducing data migration and power consumption in deep neural networks that effectively combines the reusability of both computational elements (weights and input feature data), greatly reduces the access, computation and power overheads, and has strong application value.
The invention adopts the following technical scheme:
the convolution operation structure comprises a multiplier and an adder, wherein the input end of the multiplier is connected with a multi-channel selector MUX1 and a multi-channel selector MUX2 respectively, the output end of the multiplier and the output end of the multi-channel selector MUX1 are connected with the input end of the adder through a multi-channel selector MUX3, the input end of the adder is further connected with the input end of a multi-channel selector MUX4, the multi-channel selector MUX1, the multi-channel selector MUX2, the multiplier, the multi-channel selector MUX3, the output end of the multi-channel selector MUX4 and the input end of the adder are connected with a register reg1 respectively, the output end of the adder is connected with a register reg2, and the output end of the register reg2 is connected with the input end of the multi-channel MUX4 and used for achieving multiplication and accumulation operations of convolution operation.
Specifically, in the input direction, the input feature datum f_xy and the inter-PE reusable feature datum f_reuse pass through multiplexer MUX1 and form one input of the multiplier, while the input weight w_ij and the inter-PE reusable weight w_reuse pass through multiplexer MUX2 and form the other input of the multiplier.
Specifically, the output of the multiplier and the output of MUX1 pass through multiplexer MUX3 and form one input of the adder; MUX3 serves to bypass a 0 value directly to the adder when the input feature datum is 0, since the multiplication result is then 0 and the multiplier need not be activated.
Furthermore, the other input of the adder is provided by multiplexer MUX4, whose inputs are the system initial value 0 and the PE output result of the previous cycle, inter_psum. When the PE executes the first operation of each convolution, the initial value 0 is selected for the addition; thereafter inter_psum is selected to complete the accumulation.
Specifically, in the input direction, the control signal provides the selection signals of MUX1/MUX2/MUX3/MUX4 and the enable signals of the multiplier, the adder and the registers reg1 and reg2.
Further, in the output direction, registers reg1 and reg2 respectively latch the input feature data and the input weight.
Further, the number of accesses N_ifmap_opt of the input feature data f_xy is calculated as:

N_ifmap_opt = [ R_iw · S_iw + ΔT · S_iw · ⌊(H_if − R_iw)/ΔT⌋ ] · ( ⌊(W_if − S_iw)/ΔT⌋ + 1 )

where ⌊·⌋ denotes the round-down (floor) operation, R_iw and S_iw are the height and width of the convolution kernel, H_if and W_if are the height and width of the input feature map, and ΔT is the sliding step.
Compared with the prior art, the invention has at least the following beneficial effects:
the invention relates to a convolution operation structure for reducing data migration and power consumption of a deep neural network, which provides a hardware form for data multiplexing inside a PE array and among the PE arrays, and can realize the purpose of reducing data migration and dynamic power consumption on the premise of less resource expenditure by adding MUX (multiplexer) of characteristic data and weight in the input direction of the traditional PE and adding latch of the characteristic data and the weight in the output direction.
Further, by providing the multiplexers MUX1 and MUX2 in the input direction, a second data-reuse channel is provided for each of the two operands of the multiplier. MUX1 can pass not only the original input feature datum to the multiplier but also a reused input feature datum from an adjacent PE; similarly, MUX2 can pass not only a fresh weight to the multiplier but also a previously used weight again. This selection structure in the input direction converts accesses to the lower-level memory into data exchanges between PEs, effectively reducing the dynamic power consumption caused by moving data across storage levels.
Further, for the multiplier, when one operand is 0 the result can be set to 0 directly without performing the multiplication, which reduces the dynamic power consumption of the multiplier; the role of MUX3 is to send the input feature datum directly to the adder as the multiplier's output when that datum is 0. Meanwhile, because the adder must complete the accumulation of partial sums, MUX4 provides the adder's other operand, which can be the initial value 0 for the first operation or the latched result of the previous addition for subsequent operations, avoiding the power overhead of repeatedly writing the intermediate result to and reading it back from the lower-level memory.
Furthermore, the control signal acts in three ways. First, by controlling MUX1/MUX2/MUX3/MUX4 it precisely selects the operand sources of the multiplier and the adder, so that the input feature data and weights already stored in the PE array are reused to the maximum extent. Second, by controlling the enables of the multiplier and the adder, they are switched off when no multiplication or addition is needed, saving the power of the arithmetic units. Finally, by controlling the enables of registers reg1 and reg2, the times at which the input feature data and the addition result are latched can be controlled precisely, ensuring correct operands for the multiplier and the adder in the next cycle.
Furthermore, an optimal computation mode under this structure is given: with a PE array of the same size as the convolution kernel, and with the input feature map traversed longitudinally first and then transversely, the maximum degree of data reuse under this structure can be achieved.
In conclusion, the method is applicable to all current convolutional neural network models, effectively reduces the dynamic power consumption of the global computation while preserving data parallelism to the maximum extent, and has a simple control structure and strong generality.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
FIG. 1 is a basic schematic diagram of a convolution operation;
FIG. 2 is a conventional calculation flow of convolution operations;
FIG. 3 is a flow chart of the optimization calculation proposed by the present invention;
FIG. 4 is a diagram illustrating the trend of the optimized variation of the access compression rate of the input feature data;
FIG. 5 shows a PE structure according to the present invention;
FIG. 6 is a diagram of the interconnection of PE arrays according to the present invention.
Detailed Description
The invention provides a convolution operation structure for reducing data migration and power consumption in deep neural networks. The current mainstream data-reuse schemes for convolution computation are analyzed in depth, and a data-reuse method that combines the weights and the input feature map is proposed. The computation process of the method is described qualitatively, a quantitative evaluation formula for the access compression rate of the feature data is given to confirm the validity of the scheme, and finally a concrete hardware implementation is presented. At the hardware level the invention achieves the goals of reducing data migration and power consumption through data exchange between PEs, without adding excessive resources, and the scheme has good scalability; it therefore has high application value in deep neural networks whose main characteristic is convolution operation.
Referring to FIG. 1, the height, width and depth of the input feature map (input fmaps) are H_if, W_if and C, respectively, and the height, width and depth of the convolution kernel (filter, or weights) are R_iw, S_iw and C. Starting from the vertex of the input feature map, a region of the same size as the convolution kernel is selected and multiplied and accumulated element by element; the window then slides by the step ΔT and the same operation is repeated. The output feature-map element O_00 is:

O_00 = Σ_{i=0}^{R_iw−1} Σ_{j=0}^{S_iw−1} f_ij · w_ij

where f is a data element of the input feature map, w is a weight in the convolution kernel, and the subscripts denote the horizontal and vertical coordinates of the element.
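As a concrete illustration of this multiply-accumulate, the following minimal Python sketch computes one output element for a single-channel window; the array contents and names are illustrative only and are not taken from the patent.

```python
# Minimal sketch of one output element of the convolution in FIG. 1 (single channel).
# R_iw x S_iw is the kernel size; f is the input feature map, w the kernel weights.
def conv_element(f, w, top, left):
    """Multiply-accumulate one kernel-sized window of f anchored at (top, left)."""
    R_iw, S_iw = len(w), len(w[0])
    acc = 0
    for i in range(R_iw):
        for j in range(S_iw):
            acc += f[top + i][left + j] * w[i][j]
    return acc

# Example: 4x4 feature map, 3x3 kernel, step dT = 1 -> O_00 = conv_element(f, w, 0, 0)
f = [[1, 2, 3, 0], [0, 1, 2, 3], [3, 0, 1, 2], [2, 3, 0, 1]]
w = [[1, 0, 1], [0, 1, 0], [1, 0, 1]]
print(conv_element(f, w, 0, 0))
```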
referring to fig. 2, generally, the convolution operation employs a data parallel method, that is, a plurality of PE computing units are used to perform data multiply-add operation simultaneously, so as to reduce the time consumption of the whole operation and ensure real-time performance.
The design proposed by the present invention uses the same number of PEs as the number of convolution-kernel elements, i.e., each PE unit executes one convolution operation at a time. FIG. 2 shows the operation order over 12 cycles (cycle1 to cycle12) for 9 PE units (PE_0 to PE_8) with step ΔT = 1. In each cycle all PE units share the same weight w; for example, in cycle1 PE_0 executes f_00 * w_00 and PE_8 executes f_22 * w_00. In every subsequent cycle each PE unit performs one multiplication and accumulates it with the result of the previous cycle. Taking FIG. 1 as an example, after cycle9 one parallel convolution of the PE array is finished; then, starting from cycle10, the convolution-kernel weights are broadcast repeatedly to every PE unit again to start a new round of convolution.
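The conventional schedule of FIG. 2 can be summarised in a short Python sketch; the PE-to-window mapping and the weight broadcast order below are assumptions chosen to be consistent with the cycle1 examples above (9 PEs, 3x3 kernel, step ΔT = 1), not a transcription of the patent's figure.

```python
# Sketch of the weight-broadcast schedule of FIG. 2 (assumed PE-to-window mapping):
# 9 PEs each handle one output window of a 3x3 output tile, traversed column-first
# (longitudinally, then transversely). Every cycle one weight is broadcast to all
# PEs; each PE multiplies it with its own feature element and accumulates, so one
# parallel convolution takes R_iw * S_iw = 9 cycles.
def conventional_schedule(f, w, dT=1):
    R, S = len(w), len(w[0])
    # PE_k anchored at output position (row, col), column-major: PE0=(0,0), PE1=(1,0), ...
    pe_origin = [(r * dT, c * dT) for c in range(3) for r in range(3)]
    psum = [0] * len(pe_origin)
    cycles = 0
    for j in range(S):                 # broadcast order: w00, w10, w20, w01, ...
        for i in range(R):
            cycles += 1
            for k, (top, left) in enumerate(pe_origin):
                psum[k] += f[top + i][left + j] * w[i][j]
    return cycles, psum                # 9 cycles, 9 convolution results

f = [[(r + c) % 5 for c in range(5)] for r in range(5)]
w = [[1, 0, 1], [0, 1, 0], [1, 0, 1]]
print(conventional_schedule(f, w))
```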
The number of accesses N_ifmap_ini of the input feature data to the underlying memory is calculated as:

N_ifmap_ini = R_iw · S_iw · ( ⌊(H_if − R_iw)/ΔT⌋ + 1 ) · ( ⌊(W_if − S_iw)/ΔT⌋ + 1 )    (1)

where ⌊(H_if − R_iw)/ΔT⌋ + 1 and ⌊(W_if − S_iw)/ΔT⌋ + 1 respectively denote the number of sliding positions of the convolution kernel in the vertical and horizontal directions of the input feature map, with the results rounded down.
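Assuming the reconstructed form of equation (1) above, the baseline access count can be evaluated with a few lines of Python; this is a sketch of that reconstruction, not a verbatim transcription of the patent's expression.

```python
import math

def n_ifmap_ini(H_if, W_if, R_iw, S_iw, dT=1):
    """Accesses to the underlying memory in the conventional flow (reconstructed Eq. (1)):
    every sliding window re-reads its full R_iw x S_iw block of input feature data."""
    pos_v = math.floor((H_if - R_iw) / dT) + 1   # vertical window positions
    pos_h = math.floor((W_if - S_iw) / dT) + 1   # horizontal window positions
    return R_iw * S_iw * pos_v * pos_h

# 32x32 feature map, 3x3 kernel, step 1: 9 * 30 * 30 = 8100 feature-data reads.
print(n_ifmap_ini(H_if=32, W_if=32, R_iw=3, S_iw=3))
```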
From the computation flow of FIG. 2 it can be found that, on top of the weight sharing, the input feature data can also be reused, as shown in FIG. 3. The optimized computation method is as follows:
In cycle1, PE_1 uses f_10, which can be shifted left to PE_0 in cycle2; the same applies to the remaining PE units. Therefore, the reuse of input feature data can be accomplished directly by exchanging information between PE units, without fetching the data from a lower storage level with larger delay and power consumption, which reduces data migration between storage levels and saves power.
The feature data used by PE_0 to PE_2 in cycles cycle4 to cycle6 are exactly the same as those used by PE_3 to PE_5 in cycles cycle1 to cycle3, so the feature data can be reused along the data-flow direction indicated by the arrows in FIG. 3.
On the one hand, using too few PEs increases the computation time and greatly reduces data reusability; on the other hand, using too many PEs reduces the computation time thanks to the larger computing power, but the power overhead rises sharply.
The invention therefore uses a number of PEs equal to the size of the convolution kernel; since convolution kernels in deep neural networks are generally small, this balances time consumption and power consumption. The reason why the feature-data reuse procedure slides the convolution kernel over the feature map longitudinally first and then transversely is that, as can be seen from FIG. 3, each PE obtains its reused feature data from its adjacent PE, whose computation corresponds to the neighbouring position along the longitudinal direction of the feature map, and newly fetched feature data enter the array only through the last PE (PE_8 in the figure); in this way each input feature datum needs to be loaded into the PE array from the lower-level memory only once, which greatly increases the reuse of the input feature data.
On top of the feature-data reuse, the reusability of the weights is also considered. As can be seen from FIG. 3, starting from cycle10 the weights are broadcast repeatedly to all PEs again, beginning with w_00; since the convolution kernel is small, a small register set can be placed in the PE array to latch the weights.
With the above method, the number of accesses N_ifmap_opt of the input feature data to the underlying memory is calculated as:

N_ifmap_opt = [ R_iw · S_iw + ΔT · S_iw · ⌊(H_if − R_iw)/ΔT⌋ ] · ( ⌊(W_if − S_iw)/ΔT⌋ + 1 )    (2)

where ⌊·⌋ denotes the round-down (floor) operation.
Dividing equation (2) by equation (1), selecting convolution-kernel and feature-map sizes commonly used in deep neural networks, and setting the step ΔT to 1, the resulting trend of the access compression rate of the input feature data is shown in FIG. 4.
As can be seen from the figure, with the feature-map size unchanged, the compression rate decreases as the convolution kernel grows, because a larger kernel enlarges the scale of each convolution operation and thus raises the overall reuse degree; likewise, with the kernel size unchanged, the compression rate also decreases as the feature map grows, because the kernel slides more times in each direction, which raises the data-reuse degree accordingly. In both trends the compression rate tends to saturate, because as the computation scale grows the proportion of reusable data approaches its limit and the curve flattens.
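A small Python sketch, assuming the reconstructed forms of equations (1) and (2) above, reproduces the qualitative trend described for FIG. 4 (larger kernels or larger feature maps give a lower compression rate); the specific sizes used below are illustrative only.

```python
import math

def n_ifmap_ini(H_if, W_if, R_iw, S_iw, dT=1):
    """Baseline accesses (reconstructed Eq. (1)): every window re-reads its full block."""
    pos_v = math.floor((H_if - R_iw) / dT) + 1
    pos_h = math.floor((W_if - S_iw) / dT) + 1
    return R_iw * S_iw * pos_v * pos_h

def n_ifmap_opt(H_if, W_if, R_iw, S_iw, dT=1):
    """Optimized accesses (reconstructed Eq. (2)): within one vertical sweep only the
    newly exposed dT rows of the window are fetched; horizontal strips are independent."""
    slides_v = math.floor((H_if - R_iw) / dT)        # downward slides per column strip
    pos_h = math.floor((W_if - S_iw) / dT) + 1       # horizontal strip positions
    return (R_iw * S_iw + dT * S_iw * slides_v) * pos_h

def compression_rate(H_if, W_if, R_iw, S_iw, dT=1):
    return n_ifmap_opt(H_if, W_if, R_iw, S_iw, dT) / n_ifmap_ini(H_if, W_if, R_iw, S_iw, dT)

# Larger kernels on a fixed 56x56 feature map -> lower (better) compression rate.
for k in (3, 5, 7):
    print(k, round(compression_rate(56, 56, k, k), 3))
```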
The power consumption of data movement differs greatly between storage levels: obtaining data from a neighbouring PE costs 1-2 orders of magnitude less power than obtaining it from off-chip DRAM. The compression rate therefore also reflects the fraction of off-chip DRAM accesses that are converted into inter-PE accesses, which indirectly verifies the effectiveness of the scheme in reducing the global dynamic power consumption.
Referring to fig. 5, the convolution operation structure for reducing deep neural network data migration and power consumption according to the present invention uses a multiplier and an adder as core components to implement multiply-accumulate operations of convolution operations.
The input ends of the multiplier are connected to multiplexer MUX1 and multiplexer MUX2, respectively; the output end of the multiplier and the output end of MUX1 are connected to the input end of the adder through multiplexer MUX3; the other input end of the adder is connected to multiplexer MUX4. The output ends of MUX1, MUX2, the multiplier, MUX3 and MUX4 and the input end of the adder are each connected to a register reg1; the output end of the adder is connected to register reg2, and the output end of reg2 is connected to the input end of MUX4.
In the input direction, the input feature datum f_xy and the inter-PE reusable feature datum f_reuse pass through multiplexer MUX1 and form one input of the multiplier, while the input weight w_ij and the inter-PE reusable weight w_reuse pass through multiplexer MUX2 and form the other input of the multiplier.
The output of the multiplier and the output of MUX1 pass through multiplexer MUX3 and form one input of the adder. The role of MUX3 is that, when the input feature datum is 0, the multiplication result is necessarily 0, so the 0 value can be bypassed directly to the adder without activating the multiplier.
The other input of the adder is provided by multiplexer MUX4, whose inputs are the system initial value "0" and the PE output result of the previous cycle, inter_psum. Only when the PE executes the first operation of each convolution is the initial value "0" selected for the addition; thereafter inter_psum is selected to complete the accumulation.
In the input direction, the control signal provides, on the one hand, the selection signals of MUX1/MUX2/MUX3/MUX4 and, on the other hand, the enable signals of the multiplier, the adder and the registers reg1 and reg2, achieving low-power control and precise control of data-reuse correctness.
In the output direction, reg1 and reg2 latch the input feature data and the input weight, respectively, so that the information can be passed to adjacent PEs at the correct time, reducing data migration and power consumption.
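The PE behaviour described for FIG. 5 can be summarised as a small cycle-level model; the class interface below is an illustrative assumption rather than the patent's circuit, while names such as f_reuse and inter_psum follow the description.

```python
# Behavioural sketch of one PE (assumed interface, following the FIG. 5 description):
# MUX1/MUX2 choose between fresh operands and reused ones, MUX3 bypasses the
# multiplier when the feature datum is 0, MUX4 feeds the adder either the
# initial value 0 or the latched partial sum from the previous cycle.
class PE:
    def __init__(self):
        self.reg1_f = 0      # latched input feature datum (forwarded to a neighbour PE)
        self.reg1_w = 0      # latched weight
        self.reg2 = 0        # latched addition result (inter_psum)

    def step(self, f=None, w=None, f_reuse=None, w_reuse=None, first=False):
        op_f = f if f is not None else f_reuse          # MUX1
        op_w = w if w is not None else w_reuse          # MUX2
        product = 0 if op_f == 0 else op_f * op_w       # MUX3: skip multiplier on zero input
        acc_in = 0 if first else self.reg2              # MUX4: initial value vs. inter_psum
        self.reg1_f, self.reg1_w = op_f, op_w           # latch operands for neighbour reuse
        self.reg2 = acc_in + product                    # adder result latched in reg2
        return self.reg2

# One 3x3 window accumulated on a single PE:
pe = PE()
window = [(1, 2), (0, 5), (3, 1), (2, 2), (0, 7), (1, 1), (4, 0), (2, 3), (1, 1)]
for k, (fv, wv) in enumerate(window):
    out = pe.step(f=fv, w=wv, first=(k == 0))
print(out)  # 2 + 0 + 3 + 4 + 0 + 1 + 0 + 6 + 1 = 17
```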
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIG. 6, taking PE_0 and PE_1 as an example, PE_0 and PE_1 receive the same control signals and weight w_ij, while each PE receives different input feature data f_xy. In this structure, to realize weight reuse, a small-capacity FIFO is integrated to store the convolution-kernel weights broadcast on the weight bus; after one convolution operation is finished, the FIFO, instead of the weight bus, broadcasts the weights to every PE unit.
To realize the reuse of input feature data, a multiplexer MUX5 is added between the PEs; under the control bus and according to FIG. 3, MUX5 selects whether the reused datum f_reuse of PE_0 comes from PE_1 or from PE_3.
Although FIG. 6 describes only the interconnection between two adjacent PEs, the structure can be extended to the whole PE array.
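As a toy illustration of the weight path in FIG. 6, the following sketch models a FIFO that captures the weights from the weight bus during the first convolution pass and re-broadcasts them in later passes; the function name and the transaction counting are assumptions for illustration, and the f_reuse path through MUX5 would be handled analogously.

```python
from collections import deque

# Toy model of the FIG. 6 weight path (assumed behaviour): the first convolution
# pass takes its weights from the weight bus and captures them in a small FIFO;
# later passes re-broadcast the weights from the FIFO, so the bus is not used again.
def run_passes(kernel_weights, num_passes):
    fifo = deque()
    bus_transactions = 0
    for p in range(num_passes):
        for idx in range(len(kernel_weights)):
            if p == 0:
                weight = kernel_weights[idx]   # fetched over the weight bus
                bus_transactions += 1
                fifo.append(weight)
            else:
                weight = fifo.popleft()        # re-broadcast from the FIFO
                fifo.append(weight)
            # ... here 'weight' would be broadcast to every PE unit ...
    return bus_transactions

# 3x3 kernel, 4 convolution passes: only the first pass (9 weights) uses the bus.
print(run_passes(list(range(9)), num_passes=4))   # -> 9
```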
In conclusion, the convolution operation structure for reducing data migration and power consumption in deep neural networks has been realized as a neural-network coprocessor prototype through modelling. The coprocessor is organized as a computing array, data reuse follows a partial-sum-stationary scheme, and the data migration rate and dynamic power consumption are effectively reduced by reusing feature data and weights within the array. The proposed structure guarantees the correctness of the convolution results while significantly reducing the number of accesses to the lower-level memory, thereby improving the energy efficiency of the deep neural network; it has high practical value and generality.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above-mentioned contents are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modification made on the basis of the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims (7)

1. A convolution operation structure for reducing deep neural network data migration and power consumption, characterized by comprising a multiplier and an adder, wherein the input ends of the multiplier are respectively connected to a multiplexer MUX1 and a multiplexer MUX2; the output end of the multiplier and the output end of the multiplexer MUX1 are connected to the input end of the adder through a multiplexer MUX3; the input end of the adder is also connected to the input end of a multiplexer MUX4; the output ends of the multiplexer MUX1, the multiplexer MUX2, the multiplier, the multiplexer MUX3 and the multiplexer MUX4 and the input end of the adder are respectively connected to a register reg1; the output end of the adder is connected to a register reg2, and the output end of the register reg2 is connected to the input end of the multiplexer MUX4, so as to realize the multiply-accumulate operations of the convolution operation; the height, width and depth of the input feature map (input fmaps) are H_if, W_if and C, respectively, and the height, width and depth of the convolution kernel (filter, or weights) are R_iw, S_iw and C; starting from the vertex of the input feature map, regions of the same size as the convolution kernel are selected one by one for multiply-accumulate operation, then the same operation is carried out while sliding with the step ΔT, and the output feature-map element O_00 is:

O_00 = Σ_{i=0}^{R_iw−1} Σ_{j=0}^{S_iw−1} f_ij · w_ij

where f is a data element of the input feature map, w is a weight in the convolution kernel, and the subscripts denote the horizontal and vertical coordinates.
2. The convolutional arithmetic structure for reducing data migration and power consumption of a deep neural network as claimed in claim 1, wherein, in the input direction, the input feature datum f_xy and the inter-PE reusable feature datum f_reuse pass through the multiplexer MUX1 and form one input of the multiplier, and the input weight w_ij and the inter-PE reusable weight w_reuse pass through the multiplexer MUX2 and form the other input of the multiplier.
3. The convolution operation structure for reducing data migration and power consumption of a deep neural network according to claim 1, wherein an output of the multiplier and an output of the MUX1 pass through the multiplexer MUX3 and form one input of the adder, and the multiplexer MUX3 is configured to bypass the 0 value to the adder when the input feature datum is 0, without activating the multiplier.
4. The structure of claim 3, wherein the other input of the adder is provided by the multiplexer MUX4, the inputs of the multiplexer MUX4 being a system initial value 0 and the PE output result inter_psum of the previous cycle; when the PE performs the first operation of each convolution operation, the initial value 0 is selected for the addition, and thereafter inter_psum is selected to complete the accumulation operation.
5. The convolutional arithmetic structure for reducing data migration and power consumption of a deep neural network as claimed in claim 1, wherein in the input direction, the control signal provides the selection signal of MUX1/MUX2/MUX3/MUX4 and provides the enable signals of the multiplier, the adder, and the registers reg1 and reg 2.
6. The convolutional arithmetic structure for reducing data migration and power consumption of a deep neural network as claimed in claim 5, wherein in the output direction, registers reg1 and reg2 respectively latch the input feature data and the input weight.
7. The convolutional arithmetic structure for reducing data migration and power consumption of a deep neural network as claimed in claim 6, wherein the number of accesses N_ifmap_opt of the input feature data f_xy is calculated as:

N_ifmap_opt = [ R_iw · S_iw + ΔT · S_iw · ⌊(H_if − R_iw)/ΔT⌋ ] · ( ⌊(W_if − S_iw)/ΔT⌋ + 1 )

where ⌊·⌋ denotes the round-down (floor) operation, R_iw and S_iw are the height and width of the convolution kernel, H_if and W_if are the height and width of the input feature map, and ΔT is the sliding step.
CN202010130325.7A 2020-02-28 2020-02-28 Convolution operation structure for reducing data migration and power consumption of deep neural network Active CN111275180B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010130325.7A CN111275180B (en) 2020-02-28 2020-02-28 Convolution operation structure for reducing data migration and power consumption of deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010130325.7A CN111275180B (en) 2020-02-28 2020-02-28 Convolution operation structure for reducing data migration and power consumption of deep neural network

Publications (2)

Publication Number Publication Date
CN111275180A CN111275180A (en) 2020-06-12
CN111275180B true CN111275180B (en) 2023-04-07

Family

ID=70999267

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010130325.7A Active CN111275180B (en) 2020-02-28 2020-02-28 Convolution operation structure for reducing data migration and power consumption of deep neural network

Country Status (1)

Country Link
CN (1) CN111275180B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113486200A (en) * 2021-07-12 2021-10-08 北京大学深圳研究生院 Data processing method, processor and system of sparse neural network
CN113791754A (en) * 2021-09-10 2021-12-14 中科寒武纪科技股份有限公司 Arithmetic circuit, chip and board card

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109190756A (en) * 2018-09-10 2019-01-11 中国科学院计算技术研究所 Arithmetic unit based on Winograd convolution and the neural network processor comprising the device
KR20190065144A (en) * 2017-12-01 2019-06-11 한국전자통신연구원 Processing element and operating method thereof in neural network
CN110659014A (en) * 2018-06-29 2020-01-07 赛灵思公司 Multiplier and neural network computing platform

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190065144A (en) * 2017-12-01 2019-06-11 한국전자통신연구원 Processing element and operating method thereof in neural network
CN110659014A (en) * 2018-06-29 2020-01-07 赛灵思公司 Multiplier and neural network computing platform
CN109190756A (en) * 2018-09-10 2019-01-11 中国科学院计算技术研究所 Arithmetic unit based on Winograd convolution and the neural network processor comprising the device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Junyang; Guo Yang. Design and implementation of two-dimensional matrix convolution in a vector processor. Journal of National University of Defense Technology, No. 03, full text. *

Also Published As

Publication number Publication date
CN111275180A (en) 2020-06-12

Similar Documents

Publication Publication Date Title
CN106940815B (en) Programmable convolutional neural network coprocessor IP core
CN111459877B (en) Winograd YOLOv2 target detection model method based on FPGA acceleration
CN110378468B (en) Neural network accelerator based on structured pruning and low bit quantization
KR102443546B1 (en) matrix multiplier
CN108805266B (en) Reconfigurable CNN high-concurrency convolution accelerator
CN111667051B (en) Neural network accelerator applicable to edge equipment and neural network acceleration calculation method
CN108241890B (en) Reconfigurable neural network acceleration method and architecture
US10698657B2 (en) Hardware accelerator for compressed RNN on FPGA
CN106875011B (en) Hardware architecture of binary weight convolution neural network accelerator and calculation flow thereof
CN108564168B (en) Design method for neural network processor supporting multi-precision convolution
CN111414994B (en) FPGA-based Yolov3 network computing acceleration system and acceleration method thereof
CN107993186A (en) 3D CNN acceleration method and system based on Winograd algorithm
CN102122275A (en) Configurable processor
CN111275180B (en) Convolution operation structure for reducing data migration and power consumption of deep neural network
CN111105023B (en) Data stream reconstruction method and reconfigurable data stream processor
CN110766128A (en) Convolution calculation unit, calculation method and neural network calculation platform
Que et al. Mapping large LSTMs to FPGAs with weight reuse
CN110414672B (en) Convolution operation method, device and system
CN116710912A (en) Matrix multiplier and control method thereof
Xie et al. High throughput CNN accelerator design based on FPGA
Huang et al. A high performance multi-bit-width booth vector systolic accelerator for NAS optimized deep learning neural networks
CN111079078A (en) Lower triangular equation parallel solving method for structural grid sparse matrix
Shang et al. LACS: A high-computational-efficiency accelerator for CNNs
CN113312285B (en) Convolutional neural network accelerator and working method thereof
CN115081600A (en) Conversion unit for executing Winograd convolution, integrated circuit device and board card

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant