CN111275180B - Convolution operation structure for reducing data migration and power consumption of deep neural network - Google Patents

Convolution operation structure for reducing data migration and power consumption of deep neural network

Info

Publication number
CN111275180B
Authority
CN
China
Prior art keywords
input
adder
multiplier
multiplexer
power consumption
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010130325.7A
Other languages
Chinese (zh)
Other versions
CN111275180A (en)
Inventor
娄冕
苏若皓
杨靓
崔媛媛
张海金
郭娜娜
刘思源
黄九余
田超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Microelectronics Technology Institute
Original Assignee
Xian Microelectronics Technology Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Microelectronics Technology Institute filed Critical Xian Microelectronics Technology Institute
Priority to CN202010130325.7A priority Critical patent/CN111275180B/en
Publication of CN111275180A publication Critical patent/CN111275180A/en
Application granted granted Critical
Publication of CN111275180B publication Critical patent/CN111275180B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a convolution operation structure for reducing data migration and power consumption in deep neural networks. The structure comprises a multiplier and an adder: the input ends of the multiplier are connected to multiplexers MUX1 and MUX2, respectively; the output end of the multiplier and the output end of MUX1 are connected to the input end of the adder through multiplexer MUX3; and the other input end of the adder is connected to multiplexer MUX4. The outputs of MUX1, MUX2, the multiplier, MUX3 and MUX4 and the input ends of the adder are each connected to a register reg1, the output end of the adder is connected to register reg2, and the output end of reg2 is connected to the input end of MUX4, thereby realizing the multiply-accumulate operations of the convolution. The structure is applicable to all current convolutional neural network models, effectively reduces the dynamic power consumption of the global computation while preserving data parallelism to the maximum extent, and has a simple control structure and strong generality.

Description

Convolution operation structure for reducing data migration and power consumption of deep neural network
Technical Field
The invention belongs to the technical field of integrated circuit design and special hardware accelerators, and particularly relates to a convolution operation structure for reducing deep neural network data migration and power consumption.
Background
In recent years, with important breakthroughs of Deep Neural Networks (DNNs) in speech and image recognition, they have become the basis of many modern artificial-intelligence applications. The superior performance of DNNs comes from their ability to perform high-level feature extraction on large amounts of data and obtain effective representations of data of the same type. One common form of DNN is the Convolutional Neural Network (CNN), whose body is composed of a number of convolutional layers, each of which is a higher-dimensional abstraction of the input feature map (ifmap). The number of convolutional layers in CNNs has grown from the early 2-layer LeNet to the present 53-layer ResNet, and the corresponding multiply-accumulate (MAC) count has increased from 341K to 3.9G. The improvement in network performance therefore comes mainly at the cost of a rapid increase in computation, which must be completed by an efficient hardware platform. A conventional CPU is limited by its computing resources and cannot provide sufficient data throughput; a GPU can guarantee real-time computation output by virtue of a huge computing array, but its high power consumption makes wide deployment difficult. The advent and successful application of special-purpose acceleration engines, represented by Cambricon's DianNao series, provides an effective solution for efficient artificial-intelligence computation and has become an important research field.
Currently, the main objective in designing a special-purpose acceleration engine is to reuse the feature-map data (ifmap) and the convolution-kernel weights (weight) as much as possible, so that frequent accesses to the lower levels of the memory hierarchy (such as on-chip cache and off-chip DRAM) are reduced and the access latency and power overhead are effectively lowered. There are three classical data-reuse schemes. (1) Weight-stationary: each processing element (PE) permanently stores one convolution-kernel weight; every datum of the feature map is broadcast to the whole PE array cycle by cycle, multiplied by the weight inside a PE and then passed to the next PE for accumulation; after several cycles, the last PE of the pipeline outputs the convolution results one by one. This structure keeps the weight in a register inside the PE, which avoids the latency and power overhead of weight accesses, but partial sums must be moved frequently between PEs, and the whole PE array can output only one convolution result per cycle. (2) Partial-sum-stationary: unlike the weight-stationary scheme, where results flow through the array, this scheme requires each convolution result to be produced by one fixed PE, while feature data can be reused by horizontally and vertically adjacent PEs. It mainly reduces the power consumed by moving partial sums between PEs after multiplication; however, it does not reach the maximum reuse of the feature data, and the reusability of the weights at the PE level is not considered. (3) Row-stationary: each PE reads one row of the input feature map and one row of the convolution kernel. Unlike the first two schemes, which compute the output feature map element by element, each PE generates intermediate results for several output elements per cycle in parallel; but because the feature-map sizes of different applications vary widely, the architecture must read in complete feature-map rows, so the flexibility and adaptability of the storage structure inside the PE array are poor.
These hardware-acceleration schemes all aim to reduce repeated data accesses and to compress the huge power overhead incurred by the multiply-accumulate operations of the theoretical computation; however, each of them exploits reusability from only a single element, either the weights or the input feature map, so the computing efficiency cannot be improved further.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a convolution operation structure for reducing data migration and power consumption in deep neural networks that effectively combines the reusability of both computational elements (weights and input feature data), greatly reduces the access, computation and power overheads, and has strong application value.
The invention adopts the following technical scheme:
the convolution operation structure comprises a multiplier and an adder, wherein the input end of the multiplier is connected with a multi-channel selector MUX1 and a multi-channel selector MUX2 respectively, the output end of the multiplier and the output end of the multi-channel selector MUX1 are connected with the input end of the adder through a multi-channel selector MUX3, the input end of the adder is further connected with the input end of a multi-channel selector MUX4, the multi-channel selector MUX1, the multi-channel selector MUX2, the multiplier, the multi-channel selector MUX3, the output end of the multi-channel selector MUX4 and the input end of the adder are connected with a register reg1 respectively, the output end of the adder is connected with a register reg2, and the output end of the register reg2 is connected with the input end of the multi-channel MUX4 and used for achieving multiplication and accumulation operations of convolution operation.
Specifically, in the input direction, the input feature datum f_xy and the inter-PE reusable feature datum f_reuse pass through multiplexer MUX1 and form one input of the multiplier, while the input weight w_ij and the inter-PE reusable weight w_reuse pass through multiplexer MUX2 and form the other input of the multiplier.
Specifically, the output of the multiplier and the output of MUX1 pass through multiplexer MUX3 and form one input of the adder; MUX3 serves to bypass a 0 value directly to the adder when the input feature datum is 0, since the multiplication result is then 0 and the multiplier need not be activated.
Furthermore, the other input of the adder is provided by multiplexer MUX4, whose inputs are the system initial value 0 and the PE output result of the previous cycle, inter_psum. When the PE executes the first operation of each convolution, the initial value 0 is selected for the addition; thereafter inter_psum is selected to complete the accumulation.
Specifically, in the input direction, the control signal provides the selection signals of MUX1/MUX2/MUX3/MUX4 and the enable signals of the multiplier, the adder and the registers reg1 and reg2.
Further, in the output direction, registers reg1 and reg2 respectively latch the input feature data and the input weight.
Further, the number of accesses N_ifmap_opt of the input feature data f_xy is calculated as:

N_ifmap_opt = [ R_iw · S_iw + ΔT · S_iw · ⌊(H_if − R_iw)/ΔT⌋ ] · ( ⌊(W_if − S_iw)/ΔT⌋ + 1 )

where ⌊·⌋ denotes the round-down (floor) operation, R_iw and S_iw are the height and width of the convolution kernel, H_if and W_if are the height and width of the input feature map, and ΔT is the sliding step.
Compared with the prior art, the invention has at least the following beneficial effects:
the invention relates to a convolution operation structure for reducing data migration and power consumption of a deep neural network, which provides a hardware form for data multiplexing inside a PE array and among the PE arrays, and can realize the purpose of reducing data migration and dynamic power consumption on the premise of less resource expenditure by adding MUX (multiplexer) of characteristic data and weight in the input direction of the traditional PE and adding latch of the characteristic data and the weight in the output direction.
Further, by providing the multiplexers MUX1 and MUX2 in the input direction, a second data-reuse channel is provided for each of the two operands of the multiplier. MUX1 can pass not only the original input feature datum to the multiplier but also a reused input feature datum from an adjacent PE; similarly, MUX2 can pass not only a fresh weight to the multiplier but also a previously used weight again. This selection structure in the input direction converts accesses to the lower-level memory into data exchanges between PEs, effectively reducing the dynamic power consumption caused by moving data across storage levels.
Further, for the multiplier, when one operand is 0 the result can be set to 0 directly without performing the multiplication, which reduces the dynamic power consumption of the multiplier; the role of MUX3 is to send the input feature datum directly to the adder as the multiplier's output when that datum is 0. Meanwhile, because the adder must complete the accumulation of partial sums, MUX4 provides the adder's other operand, which can be the initial value 0 for the first operation or the latched result of the previous addition for subsequent operations, avoiding the power overhead of repeatedly writing the intermediate result to and reading it back from the lower-level memory.
Furthermore, the control signal acts in three ways. First, by controlling MUX1/MUX2/MUX3/MUX4 it precisely selects the operand sources of the multiplier and the adder, so that the input feature data and weights already stored in the PE array are reused to the maximum extent. Second, by controlling the enables of the multiplier and the adder, they are switched off when no multiplication or addition is needed, saving the power of the arithmetic units. Finally, by controlling the enables of registers reg1 and reg2, the times at which the input feature data and the addition result are latched can be controlled precisely, ensuring correct operands for the multiplier and the adder in the next cycle.
Furthermore, an optimal computation mode under this structure is given: with a PE array of the same size as the convolution kernel, and with the input feature map traversed longitudinally first and then transversely, the maximum degree of data reuse under this structure can be achieved.
In conclusion, the method is applicable to all current convolutional neural network models, effectively reduces the dynamic power consumption of the global computation while preserving data parallelism to the maximum extent, and has a simple control structure and strong generality.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
FIG. 1 is a basic schematic diagram of a convolution operation;
FIG. 2 is a conventional calculation flow of convolution operations;
FIG. 3 is a flow chart of the optimization calculation proposed by the present invention;
FIG. 4 is a diagram illustrating the trend of the optimized variation of the access compression rate of the input feature data;
FIG. 5 shows a PE structure according to the present invention;
FIG. 6 is a diagram of the interconnection of PE arrays according to the present invention.
Detailed Description
The invention provides a convolution operation structure for reducing data migration and power consumption in deep neural networks. The current mainstream data-reuse schemes for convolution computation are analyzed in depth, and a data-reuse method that combines the weights and the input feature map is proposed. The computation process of the method is described qualitatively, a quantitative evaluation formula for the access compression rate of the feature data is given to confirm the validity of the scheme, and finally a concrete hardware implementation is presented. At the hardware level the invention achieves the goals of reducing data migration and power consumption through data exchange between PEs, without adding excessive resources, and the scheme has good scalability; it therefore has high application value in deep neural networks whose main characteristic is convolution operation.
Referring to FIG. 1, the height, width and depth of the input feature map (input fmaps) are H_if, W_if and C, respectively, and the height, width and depth of the convolution kernel (filter, or weights) are R_iw, S_iw and C. Starting from the vertex of the input feature map, a region of the same size as the convolution kernel is selected and multiplied and accumulated element by element; the window then slides by the step ΔT and the same operation is repeated. The output feature-map element O_00 is:

O_00 = Σ_{i=0}^{R_iw−1} Σ_{j=0}^{S_iw−1} f_ij · w_ij

where f is a data element of the input feature map, w is a weight in the convolution kernel, and the subscripts denote the horizontal and vertical coordinates of the element.
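As a concrete illustration of this multiply-accumulate, the following minimal Python sketch computes one output element for a single-channel window; the array contents and names are illustrative only and are not taken from the patent.

```python
# Minimal sketch of one output element of the convolution in FIG. 1 (single channel).
# R_iw x S_iw is the kernel size; f is the input feature map, w the kernel weights.
def conv_element(f, w, top, left):
    """Multiply-accumulate one kernel-sized window of f anchored at (top, left)."""
    R_iw, S_iw = len(w), len(w[0])
    acc = 0
    for i in range(R_iw):
        for j in range(S_iw):
            acc += f[top + i][left + j] * w[i][j]
    return acc

# Example: 4x4 feature map, 3x3 kernel, step dT = 1 -> O_00 = conv_element(f, w, 0, 0)
f = [[1, 2, 3, 0], [0, 1, 2, 3], [3, 0, 1, 2], [2, 3, 0, 1]]
w = [[1, 0, 1], [0, 1, 0], [1, 0, 1]]
print(conv_element(f, w, 0, 0))
```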
referring to fig. 2, generally, the convolution operation employs a data parallel method, that is, a plurality of PE computing units are used to perform data multiply-add operation simultaneously, so as to reduce the time consumption of the whole operation and ensure real-time performance.
The design proposed by the present invention uses the same number of PEs as the number of convolution-kernel elements, i.e., each PE unit executes one convolution operation at a time. FIG. 2 shows the operation order over 12 cycles (cycle1 to cycle12) for 9 PE units (PE_0 to PE_8) with step ΔT = 1. In each cycle all PE units share the same weight w; for example, in cycle1 PE_0 executes f_00 * w_00 and PE_8 executes f_22 * w_00. In every subsequent cycle each PE unit performs one multiplication and accumulates it with the result of the previous cycle. Taking FIG. 1 as an example, after cycle9 one parallel convolution of the PE array is finished; then, starting from cycle10, the convolution-kernel weights are broadcast repeatedly to every PE unit again to start a new round of convolution.
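The conventional schedule of FIG. 2 can be summarised in a short Python sketch; the PE-to-window mapping and the weight broadcast order below are assumptions chosen to be consistent with the cycle1 examples above (9 PEs, 3x3 kernel, step ΔT = 1), not a transcription of the patent's figure.

```python
# Sketch of the weight-broadcast schedule of FIG. 2 (assumed PE-to-window mapping):
# 9 PEs each handle one output window of a 3x3 output tile, traversed column-first
# (longitudinally, then transversely). Every cycle one weight is broadcast to all
# PEs; each PE multiplies it with its own feature element and accumulates, so one
# parallel convolution takes R_iw * S_iw = 9 cycles.
def conventional_schedule(f, w, dT=1):
    R, S = len(w), len(w[0])
    # PE_k anchored at output position (row, col), column-major: PE0=(0,0), PE1=(1,0), ...
    pe_origin = [(r * dT, c * dT) for c in range(3) for r in range(3)]
    psum = [0] * len(pe_origin)
    cycles = 0
    for j in range(S):                 # broadcast order: w00, w10, w20, w01, ...
        for i in range(R):
            cycles += 1
            for k, (top, left) in enumerate(pe_origin):
                psum[k] += f[top + i][left + j] * w[i][j]
    return cycles, psum                # 9 cycles, 9 convolution results

f = [[(r + c) % 5 for c in range(5)] for r in range(5)]
w = [[1, 0, 1], [0, 1, 0], [1, 0, 1]]
print(conventional_schedule(f, w))
```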
The number of accesses N_ifmap_ini of the input feature data to the underlying memory is calculated as:

N_ifmap_ini = R_iw · S_iw · ( ⌊(H_if − R_iw)/ΔT⌋ + 1 ) · ( ⌊(W_if − S_iw)/ΔT⌋ + 1 )    (1)

where ⌊(H_if − R_iw)/ΔT⌋ + 1 and ⌊(W_if − S_iw)/ΔT⌋ + 1 respectively denote the number of sliding positions of the convolution kernel in the vertical and horizontal directions of the input feature map, with the results rounded down.
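Assuming the reconstructed form of equation (1) above, the baseline access count can be evaluated with a few lines of Python; this is a sketch of that reconstruction, not a verbatim transcription of the patent's expression.

```python
import math

def n_ifmap_ini(H_if, W_if, R_iw, S_iw, dT=1):
    """Accesses to the underlying memory in the conventional flow (reconstructed Eq. (1)):
    every sliding window re-reads its full R_iw x S_iw block of input feature data."""
    pos_v = math.floor((H_if - R_iw) / dT) + 1   # vertical window positions
    pos_h = math.floor((W_if - S_iw) / dT) + 1   # horizontal window positions
    return R_iw * S_iw * pos_v * pos_h

# 32x32 feature map, 3x3 kernel, step 1: 9 * 30 * 30 = 8100 feature-data reads.
print(n_ifmap_ini(H_if=32, W_if=32, R_iw=3, S_iw=3))
```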
From the computation flow of FIG. 2 it can be found that, on top of the weight sharing, the input feature data can also be reused, as shown in FIG. 3. The optimized computation method is as follows:
In cycle1, PE_1 uses f_10, which can be shifted left to PE_0 in cycle2; the same applies to the remaining PE units. Therefore, the reuse of input feature data can be accomplished directly by exchanging information between PE units, without fetching the data from a lower storage level with larger delay and power consumption, which reduces data migration between storage levels and saves power.
The feature data used by PE_0 to PE_2 in cycles cycle4 to cycle6 are exactly the same as those used by PE_3 to PE_5 in cycles cycle1 to cycle3, so the feature data can be reused along the data-flow direction indicated by the arrows in FIG. 3.
On the one hand, using too few PEs increases the computation time and greatly reduces data reusability; on the other hand, using too many PEs reduces the computation time thanks to the larger computing power, but the power overhead rises sharply.
The invention therefore uses a number of PEs equal to the size of the convolution kernel; since convolution kernels in deep neural networks are generally small, this balances time consumption and power consumption. The reason why the feature-data reuse procedure slides the convolution kernel over the feature map longitudinally first and then transversely is that, as can be seen from FIG. 3, each PE obtains its reused feature data from its adjacent PE, whose computation corresponds to the neighbouring position along the longitudinal direction of the feature map, and newly fetched feature data enter the array only through the last PE (PE_8 in the figure); in this way each input feature datum needs to be loaded into the PE array from the lower-level memory only once, which greatly increases the reuse of the input feature data.
On top of the feature-data reuse, the reusability of the weights is also considered. As can be seen from FIG. 3, starting from cycle10 the weights are broadcast repeatedly to all PEs again, beginning with w_00; since the convolution kernel is small, a small register set can be placed in the PE array to latch the weights.
With the above method, the number of accesses N_ifmap_opt of the input feature data to the underlying memory is calculated as:

N_ifmap_opt = [ R_iw · S_iw + ΔT · S_iw · ⌊(H_if − R_iw)/ΔT⌋ ] · ( ⌊(W_if − S_iw)/ΔT⌋ + 1 )    (2)

where ⌊·⌋ denotes the round-down (floor) operation.
Dividing equation (2) by equation (1), selecting convolution-kernel and feature-map sizes commonly used in deep neural networks, and setting the step ΔT to 1, the resulting trend of the access compression rate of the input feature data is shown in FIG. 4.
As can be seen from the figure, with the feature-map size unchanged, the compression rate decreases as the convolution kernel grows, because a larger kernel enlarges the scale of each convolution operation and thus raises the overall reuse degree; likewise, with the kernel size unchanged, the compression rate also decreases as the feature map grows, because the kernel slides more times in each direction, which raises the data-reuse degree accordingly. In both trends the compression rate tends to saturate, because as the computation scale grows the proportion of reusable data approaches its limit and the curve flattens.
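A small Python sketch, assuming the reconstructed forms of equations (1) and (2) above, reproduces the qualitative trend described for FIG. 4 (larger kernels or larger feature maps give a lower compression rate); the specific sizes used below are illustrative only.

```python
import math

def n_ifmap_ini(H_if, W_if, R_iw, S_iw, dT=1):
    """Baseline accesses (reconstructed Eq. (1)): every window re-reads its full block."""
    pos_v = math.floor((H_if - R_iw) / dT) + 1
    pos_h = math.floor((W_if - S_iw) / dT) + 1
    return R_iw * S_iw * pos_v * pos_h

def n_ifmap_opt(H_if, W_if, R_iw, S_iw, dT=1):
    """Optimized accesses (reconstructed Eq. (2)): within one vertical sweep only the
    newly exposed dT rows of the window are fetched; horizontal strips are independent."""
    slides_v = math.floor((H_if - R_iw) / dT)        # downward slides per column strip
    pos_h = math.floor((W_if - S_iw) / dT) + 1       # horizontal strip positions
    return (R_iw * S_iw + dT * S_iw * slides_v) * pos_h

def compression_rate(H_if, W_if, R_iw, S_iw, dT=1):
    return n_ifmap_opt(H_if, W_if, R_iw, S_iw, dT) / n_ifmap_ini(H_if, W_if, R_iw, S_iw, dT)

# Larger kernels on a fixed 56x56 feature map -> lower (better) compression rate.
for k in (3, 5, 7):
    print(k, round(compression_rate(56, 56, k, k), 3))
```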
The power consumption of data movement differs greatly between storage levels: obtaining data from a neighbouring PE costs 1-2 orders of magnitude less power than obtaining it from off-chip DRAM. The compression rate therefore also reflects the fraction of off-chip DRAM accesses that are converted into inter-PE accesses, which indirectly verifies the effectiveness of the scheme in reducing the global dynamic power consumption.
Referring to fig. 5, the convolution operation structure for reducing deep neural network data migration and power consumption according to the present invention uses a multiplier and an adder as core components to implement multiply-accumulate operations of convolution operations.
The input ends of the multiplier are connected to multiplexer MUX1 and multiplexer MUX2, respectively; the output end of the multiplier and the output end of MUX1 are connected to the input end of the adder through multiplexer MUX3; the other input end of the adder is connected to multiplexer MUX4. The output ends of MUX1, MUX2, the multiplier, MUX3 and MUX4 and the input end of the adder are each connected to a register reg1; the output end of the adder is connected to register reg2, and the output end of reg2 is connected to the input end of MUX4.
In the input direction, the input feature datum f_xy and the inter-PE reusable feature datum f_reuse pass through multiplexer MUX1 and form one input of the multiplier, while the input weight w_ij and the inter-PE reusable weight w_reuse pass through multiplexer MUX2 and form the other input of the multiplier.
The output of the multiplier and the output of MUX1 pass through multiplexer MUX3 and form one input of the adder. The role of MUX3 is that, when the input feature datum is 0, the multiplication result is necessarily 0, so the 0 value can be bypassed directly to the adder without activating the multiplier.
The other input of the adder is provided by multiplexer MUX4, whose inputs are the system initial value "0" and the PE output result of the previous cycle, inter_psum. Only when the PE executes the first operation of each convolution is the initial value "0" selected for the addition; thereafter inter_psum is selected to complete the accumulation.
In the input direction, the control signal provides, on the one hand, the selection signals of MUX1/MUX2/MUX3/MUX4 and, on the other hand, the enable signals of the multiplier, the adder and the registers reg1 and reg2, achieving low-power control and precise control of data-reuse correctness.
In the output direction, reg1 and reg2 latch the input feature data and the input weight, respectively, so that the information can be passed to adjacent PEs at the correct time, reducing data migration and power consumption.
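The PE behaviour described for FIG. 5 can be summarised as a small cycle-level model; the class interface below is an illustrative assumption rather than the patent's circuit, while names such as f_reuse and inter_psum follow the description.

```python
# Behavioural sketch of one PE (assumed interface, following the FIG. 5 description):
# MUX1/MUX2 choose between fresh operands and reused ones, MUX3 bypasses the
# multiplier when the feature datum is 0, MUX4 feeds the adder either the
# initial value 0 or the latched partial sum from the previous cycle.
class PE:
    def __init__(self):
        self.reg1_f = 0      # latched input feature datum (forwarded to a neighbour PE)
        self.reg1_w = 0      # latched weight
        self.reg2 = 0        # latched addition result (inter_psum)

    def step(self, f=None, w=None, f_reuse=None, w_reuse=None, first=False):
        op_f = f if f is not None else f_reuse          # MUX1
        op_w = w if w is not None else w_reuse          # MUX2
        product = 0 if op_f == 0 else op_f * op_w       # MUX3: skip multiplier on zero input
        acc_in = 0 if first else self.reg2              # MUX4: initial value vs. inter_psum
        self.reg1_f, self.reg1_w = op_f, op_w           # latch operands for neighbour reuse
        self.reg2 = acc_in + product                    # adder result latched in reg2
        return self.reg2

# One 3x3 window accumulated on a single PE:
pe = PE()
window = [(1, 2), (0, 5), (3, 1), (2, 2), (0, 7), (1, 1), (4, 0), (2, 3), (1, 1)]
for k, (fv, wv) in enumerate(window):
    out = pe.step(f=fv, w=wv, first=(k == 0))
print(out)  # 2 + 0 + 3 + 4 + 0 + 1 + 0 + 6 + 1 = 17
```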
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIG. 6, taking PE_0 and PE_1 as an example, PE_0 and PE_1 receive the same control signals and weight w_ij, while each PE receives different input feature data f_xy. In this structure, to realize weight reuse, a small-capacity FIFO is integrated to store the convolution-kernel weights broadcast on the weight bus; after one convolution operation is finished, the FIFO, instead of the weight bus, broadcasts the weights to every PE unit.
To realize the reuse of input feature data, a multiplexer MUX5 is added between the PEs; under the control bus and according to FIG. 3, MUX5 selects whether the reused datum f_reuse of PE_0 comes from PE_1 or from PE_3.
Although FIG. 6 describes only the interconnection between two adjacent PEs, the structure can be extended to the whole PE array.
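As a toy illustration of the weight path in FIG. 6, the following sketch models a FIFO that captures the weights from the weight bus during the first convolution pass and re-broadcasts them in later passes; the function name and the transaction counting are assumptions for illustration, and the f_reuse path through MUX5 would be handled analogously.

```python
from collections import deque

# Toy model of the FIG. 6 weight path (assumed behaviour): the first convolution
# pass takes its weights from the weight bus and captures them in a small FIFO;
# later passes re-broadcast the weights from the FIFO, so the bus is not used again.
def run_passes(kernel_weights, num_passes):
    fifo = deque()
    bus_transactions = 0
    for p in range(num_passes):
        for idx in range(len(kernel_weights)):
            if p == 0:
                weight = kernel_weights[idx]   # fetched over the weight bus
                bus_transactions += 1
                fifo.append(weight)
            else:
                weight = fifo.popleft()        # re-broadcast from the FIFO
                fifo.append(weight)
            # ... here 'weight' would be broadcast to every PE unit ...
    return bus_transactions

# 3x3 kernel, 4 convolution passes: only the first pass (9 weights) uses the bus.
print(run_passes(list(range(9)), num_passes=4))   # -> 9
```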
In conclusion, the convolution operation structure for reducing data migration and power consumption in deep neural networks has been realized as a neural-network coprocessor prototype through modelling. The coprocessor is organized as a computing array, data reuse follows a partial-sum-stationary scheme, and the data migration rate and dynamic power consumption are effectively reduced by reusing feature data and weights within the array. The proposed structure guarantees the correctness of the convolution results while significantly reducing the number of accesses to the lower-level memory, thereby improving the energy efficiency of the deep neural network; it has high practical value and generality.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above-mentioned contents are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modification made on the basis of the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims (7)

1. A convolution operation structure for reducing deep neural network data migration and power consumption, characterized by comprising a multiplier and an adder, wherein the input ends of the multiplier are respectively connected to a multiplexer MUX1 and a multiplexer MUX2; the output end of the multiplier and the output end of the multiplexer MUX1 are connected to the input end of the adder through a multiplexer MUX3; the input end of the adder is also connected to the input end of a multiplexer MUX4; the output ends of the multiplexer MUX1, the multiplexer MUX2, the multiplier, the multiplexer MUX3 and the multiplexer MUX4 and the input end of the adder are respectively connected to a register reg1; the output end of the adder is connected to a register reg2, and the output end of the register reg2 is connected to the input end of the multiplexer MUX4, so as to realize the multiply-accumulate operations of the convolution operation; the height, width and depth of the input feature map (input fmaps) are H_if, W_if and C, respectively, and the height, width and depth of the convolution kernel (filter, or weights) are R_iw, S_iw and C; starting from the vertex of the input feature map, regions of the same size as the convolution kernel are selected one by one for multiply-accumulate operation, then the same operation is carried out while sliding with the step ΔT, and the output feature-map element O_00 is:

O_00 = Σ_{i=0}^{R_iw−1} Σ_{j=0}^{S_iw−1} f_ij · w_ij

where f is a data element of the input feature map, w is a weight in the convolution kernel, and the subscripts denote the horizontal and vertical coordinates.
2. The convolutional arithmetic structure for reducing data migration and power consumption of a deep neural network as claimed in claim 1, wherein, in the input direction, the input feature datum f_xy and the inter-PE reusable feature datum f_reuse pass through the multiplexer MUX1 and form one input of the multiplier, and the input weight w_ij and the inter-PE reusable weight w_reuse pass through the multiplexer MUX2 and form the other input of the multiplier.
3. The convolution operation structure for reducing data migration and power consumption of a deep neural network according to claim 1, wherein an output of the multiplier and an output of the MUX1 pass through the multiplexer MUX3 and form one input of the adder, and the multiplexer MUX3 is configured to bypass the 0 value to the adder when the input feature datum is 0, without activating the multiplier.
4. The structure of claim 3, wherein the other input of the adder is provided by the multiplexer MUX4, the inputs of the multiplexer MUX4 being a system initial value 0 and the PE output result inter_psum of the previous cycle; when the PE performs the first operation of each convolution operation, the initial value 0 is selected for the addition, and thereafter inter_psum is selected to complete the accumulation operation.
5. The convolutional arithmetic structure for reducing data migration and power consumption of a deep neural network as claimed in claim 1, wherein in the input direction, the control signal provides the selection signal of MUX1/MUX2/MUX3/MUX4 and provides the enable signals of the multiplier, the adder, and the registers reg1 and reg 2.
6. The convolutional arithmetic structure for reducing data migration and power consumption of a deep neural network as claimed in claim 5, wherein in the output direction, registers reg1 and reg2 respectively latch the input feature data and the input weight.
7. The convolutional arithmetic structure for reducing data migration and power consumption of a deep neural network as claimed in claim 6, wherein the number of accesses N_ifmap_opt of the input feature data f_xy is calculated as:

N_ifmap_opt = [ R_iw · S_iw + ΔT · S_iw · ⌊(H_if − R_iw)/ΔT⌋ ] · ( ⌊(W_if − S_iw)/ΔT⌋ + 1 )

where ⌊·⌋ denotes the round-down (floor) operation, R_iw and S_iw are the height and width of the convolution kernel, H_if and W_if are the height and width of the input feature map, and ΔT is the sliding step.
CN202010130325.7A 2020-02-28 2020-02-28 Convolution operation structure for reducing data migration and power consumption of deep neural network Active CN111275180B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010130325.7A CN111275180B (en) 2020-02-28 2020-02-28 Convolution operation structure for reducing data migration and power consumption of deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010130325.7A CN111275180B (en) 2020-02-28 2020-02-28 Convolution operation structure for reducing data migration and power consumption of deep neural network

Publications (2)

Publication Number Publication Date
CN111275180A CN111275180A (en) 2020-06-12
CN111275180B true CN111275180B (en) 2023-04-07

Family

ID=70999267

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010130325.7A Active CN111275180B (en) 2020-02-28 2020-02-28 Convolution operation structure for reducing data migration and power consumption of deep neural network

Country Status (1)

Country Link
CN (1) CN111275180B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113486200A (en) * 2021-07-12 2021-10-08 北京大学深圳研究生院 Data processing method, processor and system of sparse neural network
CN113791754A (en) * 2021-09-10 2021-12-14 中科寒武纪科技股份有限公司 Arithmetic circuit, chip and board card

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109190756A (en) * 2018-09-10 2019-01-11 中国科学院计算技术研究所 Arithmetic unit based on Winograd convolution and the neural network processor comprising the device
KR20190065144A (en) * 2017-12-01 2019-06-11 한국전자통신연구원 Processing element and operating method thereof in neural network
CN110659014A (en) * 2018-06-29 2020-01-07 赛灵思公司 Multiplier and neural network computing platform

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190065144A (en) * 2017-12-01 2019-06-11 한국전자통신연구원 Processing element and operating method thereof in neural network
CN110659014A (en) * 2018-06-29 2020-01-07 赛灵思公司 Multiplier and neural network computing platform
CN109190756A (en) * 2018-09-10 2019-01-11 中国科学院计算技术研究所 Arithmetic unit based on Winograd convolution and the neural network processor comprising the device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Junyang; Guo Yang. Design and implementation of two-dimensional matrix convolution in a vector processor. Journal of National University of Defense Technology, No. 03, full text. *

Also Published As

Publication number Publication date
CN111275180A (en) 2020-06-12

Similar Documents

Publication Publication Date Title
CN106940815B (en) Programmable convolutional neural network coprocessor IP core
CN111459877B (en) Winograd YOLOv2 target detection model method based on FPGA acceleration
CN110378468B (en) Neural network accelerator based on structured pruning and low bit quantization
KR102443546B1 (en) matrix multiplier
CN108805266B (en) Reconfigurable CNN high-concurrency convolution accelerator
CN111667051B (en) Neural network accelerator applicable to edge equipment and neural network acceleration calculation method
CN108241890B (en) Reconfigurable neural network acceleration method and architecture
US10698657B2 (en) Hardware accelerator for compressed RNN on FPGA
CN106875011B (en) Hardware architecture of binary weight convolution neural network accelerator and calculation flow thereof
CN108564168B (en) Design method for neural network processor supporting multi-precision convolution
CN111414994B (en) FPGA-based Yolov3 network computing acceleration system and acceleration method thereof
CN107993186A (en) 3D CNN acceleration method and system based on Winograd algorithm
CN102122275A (en) Configurable processor
CN111275180B (en) Convolution operation structure for reducing data migration and power consumption of deep neural network
CN111105023B (en) Data stream reconstruction method and reconfigurable data stream processor
CN110766128A (en) Convolution calculation unit, calculation method and neural network calculation platform
Que et al. Mapping large LSTMs to FPGAs with weight reuse
CN110414672B (en) Convolution operation method, device and system
CN116710912A (en) Matrix multiplier and control method thereof
Xie et al. High throughput CNN accelerator design based on FPGA
Huang et al. A high performance multi-bit-width booth vector systolic accelerator for NAS optimized deep learning neural networks
CN111079078A (en) Lower triangular equation parallel solving method for structural grid sparse matrix
Shang et al. LACS: A high-computational-efficiency accelerator for CNNs
CN113312285B (en) Convolutional neural network accelerator and working method thereof
CN115081600A (en) Conversion unit for executing Winograd convolution, integrated circuit device and board card

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant