CN111275180A - Convolution operation structure for reducing data migration and power consumption of deep neural network - Google Patents

Convolution operation structure for reducing data migration and power consumption of deep neural network

Info

Publication number
CN111275180A
CN111275180A (application number CN202010130325.7A)
Authority
CN
China
Prior art keywords
input
multiplexer
adder
multiplier
power consumption
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010130325.7A
Other languages
Chinese (zh)
Other versions
CN111275180B (en)
Inventor
娄冕
苏若皓
杨靓
崔媛媛
张海金
郭娜娜
刘思源
黄九余
田超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Microelectronics Technology Institute
Original Assignee
Xian Microelectronics Technology Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Microelectronics Technology Institute
Priority to CN202010130325.7A
Publication of CN111275180A
Application granted
Publication of CN111275180B
Legal status: Active (current)
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a convolution operation structure for reducing data migration and power consumption in deep neural networks, comprising a multiplier and an adder. The inputs of the multiplier are connected to multiplexers MUX1 and MUX2 respectively; the output of the multiplier and the output of MUX1 are connected to one input of the adder through multiplexer MUX3; the other input of the adder is connected to multiplexer MUX4. The outputs of MUX1, MUX2, the multiplier, MUX3 and MUX4, and the inputs of the adder, are each connected to register reg1; the output of the adder is connected to register reg2, and the output of reg2 is connected to the input of MUX4, so as to implement the multiply-accumulate operation of the convolution. The structure is suitable for all current convolutional neural network models, effectively reduces the dynamic power consumption of the global computation while maximally preserving data parallelism, and has a simple control structure and strong generality.

Description

Convolution operation structure for reducing data migration and power consumption of deep neural network
Technical Field
The invention belongs to the technical field of integrated circuit design and dedicated hardware accelerators, and particularly relates to a convolution operation structure for reducing data migration and power consumption in deep neural networks.
Background
In recent years, with major breakthroughs of deep neural networks (DNNs) in speech and image recognition, they have become the basis of many modern artificial intelligence applications. The strong performance of DNNs comes from their ability to perform high-level feature extraction on large amounts of data and obtain efficient representations of data of the same type. One common form of DNN is the convolutional neural network (CNN), whose body consists of a number of convolutional layers, each being a higher-dimensional abstraction of the input feature map (ifmap). The number of convolutional layers in CNNs has grown from the early 2-layer LeNet to the present 53-layer ResNet, and the number of multiply-accumulate (MAC) operations has increased from 341K to 3.9G. The improved performance of neural networks therefore comes mainly at the cost of rapidly growing computation, which must be carried out by an efficient hardware platform. A traditional CPU is limited by its compute resources and cannot provide sufficient data throughput; a GPU can guarantee real-time computational output thanks to its huge compute array, but its high power consumption makes wide deployment difficult. The emergence and successful application of dedicated acceleration engines, represented by Cambricon's DianNao series, provides an effective solution for efficient artificial-intelligence computation and has become an important research field.
Currently, the main objective in designing dedicated acceleration engines is to reuse the feature map data (ifmap) and the convolution kernel weights (weight) as much as possible, which reduces frequent accesses to lower levels of the storage hierarchy (such as on-chip cache and off-chip DRAM) and effectively lowers access latency and power consumption. There are three classical data-reuse schemes. (1) Weight-stationary: each processing element (PE) holds one convolution kernel weight; every datum of the feature map is broadcast to all PEs cycle by cycle, multiplied by the weight inside the PE, and then sent to the next PE for accumulation; after several cycles, the last PE streams out the convolution results one by one. This scheme keeps the weights in registers inside the PEs, avoiding the latency and power of weight accesses, but it must frequently migrate partial sums between PEs, and the whole PE array outputs only one convolution result per cycle. (2) Partial-sum-stationary: unlike the weight-stationary scheme, where results flow between PEs, this scheme fixes each convolution output to one PE, and the feature data can be shared with horizontally and vertically adjacent PEs. It mainly reduces the power consumed by moving partial sums between PEs after multiplication, but it does not maximize the reuse of the feature data and does not consider weight reuse at the PE level. (3) Row-stationary: each PE reads one row of the input feature map and one row of the convolution kernel; unlike the first two schemes, which compute the output feature map data one by one, each PE produces intermediate results for several output feature map data in parallel per cycle. However, because feature map sizes differ widely between applications, this architecture must read in complete feature map information row by row, so the flexibility and adaptability of the storage structure inside the PE array are poor.
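To make the weight-stationary scheme (1) above concrete, the following one-dimensional Python sketch is given; the 1-D simplification and all names are assumptions made here for illustration, not a cited design. Each PE holds one fixed weight, every input datum is broadcast to all PEs, and the partial sums flow along the PE chain, so once the pipeline is full the last PE emits one convolution result per cycle.

```python
# Minimal 1-D weight-stationary sketch (illustrative only).
# Each PE holds one weight; inputs are broadcast; partial sums flow down the chain.
def weight_stationary_conv1d(feature, kernel):
    num_pe = len(kernel)                  # one PE per kernel weight
    psum = [0] * num_pe                   # partial sums held between PEs
    outputs = []
    for cycle, x in enumerate(feature):   # one feature datum broadcast per cycle
        for p in range(num_pe - 1, 0, -1):
            psum[p] = psum[p - 1]         # partial sums shift toward the last PE
        psum[0] = 0
        for p in range(num_pe):           # every PE multiplies the broadcast datum
            psum[p] += kernel[p] * x      # by its locally stored weight
        if cycle >= num_pe - 1:           # pipeline full: last PE streams results
            outputs.append(psum[-1])
    return outputs

print(weight_stationary_conv1d([1, 2, 3, 4, 5], [1, 0, -1]))   # [-2, -2, -2]
```

The sketch also shows the drawback noted above: the whole chain produces only one result per cycle, and the partial sums migrate between PEs every cycle.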
These different hardware acceleration engine schemes all aim to reduce repeated data accesses and strive to compress the huge power consumption incurred by the multiply-accumulate operations of the theoretical computation; however, each of them exploits reuse from only a single type of element, such as the weights or the input feature map, so the computational efficiency cannot be improved further.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a convolution operation structure for reducing data migration and power consumption in deep neural networks, which effectively combines the reusability of the two computation elements (weights and input feature data), greatly reduces access, computation and power overheads, and has strong application value.
The invention adopts the following technical scheme:
the convolution operation structure comprises a multiplier and an adder, wherein the input end of the multiplier is respectively connected with a multi-channel check MUX1 and a multi-channel check MUX2, the output end of the multiplier and the output end of the multi-channel check MUX1 are connected with the input end of the adder through the multi-channel check MUX3, the input end of the adder is further connected with the input end of the multi-channel check MUX4, the output ends of the multi-channel check MUX1, the multi-channel check MUX2, the multiplier, the multi-channel check MUX3, the multi-channel check MUX4 and the input end of the adder are respectively connected with a register reg1, the output end of the adder is connected with the register reg2, and the output end of the register reg2 is connected with the input end of the multi-channel check MUX4 and used for achieving multiplication.
Specifically, in the input direction, the input feature data f_xy and the feature data f_reuse reusable between PEs pass through multiplexer MUX1 and feed one input of the multiplier, while the input weight w_ij and the weight w_reuse multiplexed between PEs pass through multiplexer MUX2 and feed the other input of the multiplier.
Specifically, the output of the multiplier and the output of MUX1 pass through multiplexer MUX3 and feed one input of the adder; MUX3 is configured so that, when the input feature datum is 0, the multiplication result is 0, the multiplier is not activated, and the 0 value is bypassed directly to the adder.
Further, the other input of the adder is provided by multiplexer MUX4, whose inputs are the system initial value 0 and the PE output result inter_psum of the previous cycle. When the PE performs the first operation of each convolution, the initial value 0 is selected for the addition; thereafter inter_psum is selected to complete the accumulation.
Specifically, in the input direction, the control signals provide the selection signals of MUX1/MUX2/MUX3/MUX4 and the enable signals of the multiplier, the adder, and registers reg1 and reg2.
Further, in the output direction, registers reg1 and reg2 latch the input signature data and input weights, respectively.
Further, the number N_ifmap_opt of accesses of the input feature data f_xy is calculated by a formula that appears only as an image in the original document, where ⌊·⌋ denotes the floor (round-down) operation, R_iw and S_iw are the height and width of the convolution kernel, H_if and W_if are the height and width of the input feature map, and ΔT is the step size.
Compared with the prior art, the invention has at least the following beneficial effects:
the invention relates to a convolution operation structure for reducing data migration and power consumption of a deep neural network, which provides a hardware form for data multiplexing inside a PE array and among the PE arrays, and can realize the purpose of reducing data migration and dynamic power consumption on the premise of less resource expenditure by adding MUX (multiplexer) of characteristic data and weight in the input direction of the traditional PE and adding latch of the characteristic data and the weight in the output direction.
Further, by providing multiplexers MUX1 and MUX2 in the input direction, a second data-reuse channel is provided for each of the two operands of the multiplier. MUX1 can pass not only the original input feature data to the multiplier but also input feature data multiplexed from adjacent PEs; likewise, MUX2 can pass not only the original weights to the multiplier but also weights that have already been used, multiplexed once more. This selection structure in the input direction converts accesses to the low-level memory into data exchange between PEs, effectively reducing the dynamic power consumption caused by moving data across storage levels.
Further, for the multiplier, when one operand is 0 the result can be set to 0 directly without performing the multiplication, which reduces the dynamic power consumption of the multiplier; the role of MUX3 is to feed the input feature datum directly to the adder as the multiplier output when that datum is 0. Meanwhile, since the adder must complete the accumulation of partial sums, MUX4 provides the other operand of the adder, which is either the initial value 0 for the first operation or the latched historical addition result for subsequent additions, avoiding the power overhead of repeatedly writing the historical result to and reading it from the low-level memory.
Further, the control signals act in three ways. First, by controlling MUX1/MUX2/MUX3/MUX4, the operand sources of the multiplier and the adder are precisely controlled, so that the input feature data and weights already stored in the PE array are reused to the maximum extent. Second, by controlling the enables of the multiplier and the adder, they are switched off when no multiplication or addition is needed, saving the power of the arithmetic units. Finally, by controlling the enables of registers reg1 and reg2, the moments at which the input feature data and the addition result are latched can be controlled precisely, guaranteeing correct operand inputs for the multiplier and the adder in the next cycle.
Furthermore, an optimal computation mode under this structure is given: with a PE array of the same size as the convolution kernel, traversing the input feature map first vertically and then horizontally achieves the maximum degree of data reuse under this structure.
In conclusion, the method is suitable for all current convolutional neural network models, effectively reduces the dynamic power consumption of the global computation while maximally preserving data parallelism, and has a simple control structure and strong generality.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
FIG. 1 is a basic schematic diagram of a convolution operation;
FIG. 2 is a conventional calculation flow of convolution operations;
FIG. 3 is a flow chart of the optimization calculation proposed by the present invention;
FIG. 4 illustrates the variation trend of the access compression rate of the input feature data after optimization;
FIG. 5 shows a PE structure according to the present invention;
FIG. 6 is a diagram of the interconnection of PE arrays according to the present invention.
Detailed Description
The invention provides a convolution operation structure for reducing data migration and power consumption in deep neural networks. The mainstream data-reuse schemes for convolution computation are analyzed in depth, and a data-reuse method combining the weights and the input feature map is proposed. The computation process of the method is described qualitatively, a quantitative evaluation formula for the access compression rate of the feature data is given to confirm the validity of the scheme, and finally a concrete hardware implementation is presented. The invention reduces data migration between PEs and lowers power consumption without adding excessive hardware resources, and the scheme has good scalability, so it has high application value in deep neural networks dominated by convolution operations.
Referring to FIG. 1, the height, width and depth of the input feature map (input fmap) are H_if, W_if and C, and the height, width and depth of the convolution kernel (filter, or weight) are R_iw, S_iw and C. Starting from the corner of the input feature map, a region of the same size as the convolution kernel is selected and multiply-accumulated element by element; the window then slides with step ΔT and the same operation is repeated. The element O_00 of the output feature map is:

O_00 = Σ_{i=0}^{R_iw−1} Σ_{j=0}^{S_iw−1} f_ij · w_ij   (when the depth C is greater than 1, the sum is also taken over the C channels)
where f is a data element of the input feature map, w is a weight in the convolution kernel, and the subscripts denote the vertical and horizontal coordinates of the element;
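The sliding-window multiply-accumulate just defined can be written out directly. The short sketch below is illustrative only (the function and variable names are ours); it shows how each output element, O_00 included, is produced by multiply-accumulating one kernel-sized region and then sliding by the step ΔT.

```python
# Minimal sliding-window convolution sketch following the definition of O_00 above.
# f: input feature map (H_if x W_if), w: convolution kernel (R_iw x S_iw), dt: step ΔT.
def conv2d(f, w, dt=1):
    H_if, W_if = len(f), len(f[0])
    R_iw, S_iw = len(w), len(w[0])
    out = []
    for r in range(0, H_if - R_iw + 1, dt):        # slide vertically
        row = []
        for c in range(0, W_if - S_iw + 1, dt):    # slide horizontally
            acc = 0                                # multiply-accumulate one window
            for i in range(R_iw):
                for j in range(S_iw):
                    acc += f[r + i][c + j] * w[i][j]
            row.append(acc)
        out.append(row)
    return out

# 5x5 feature map, 3x3 kernel, step 1 -> 3x3 output feature map; out[0][0] is O_00
f = [[x + 5 * y for x in range(5)] for y in range(5)]
w = [[1, 0, -1]] * 3
print(conv2d(f, w))
```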
referring to fig. 2, generally, the convolution operation employs a data parallel method, that is, a plurality of PE computing units are used to perform data multiply-add operation simultaneously, so as to reduce the time consumption of the whole operation and ensure real-time performance.
The design structure provided by the invention adopts a number of PEs equal to the size of the convolution kernel, i.e., each PE unit executes one convolution operation in parallel with the others. FIG. 2 shows the operation sequence over 12 cycles (cycle1–cycle12) with 9 PE units (PE_0–PE_8), where the step ΔT is 1. In each cycle all PE units share the same weight w; for example, in cycle1 PE_0 executes f_00*w_00 and PE_8 executes f_22*w_00. After the 9th cycle (cycle9) finishes, one parallel convolution of the PE array is complete; thereafter, starting from cycle10, the convolution kernel weights are broadcast to each PE unit again to start a new round of convolution.
The number N_ifmap_ini of accesses of the input feature data to the underlying memory is calculated as:

N_ifmap_ini = R_iw · S_iw · (⌊(W_if − S_iw)/ΔT⌋ + 1) · (⌊(H_if − R_iw)/ΔT⌋ + 1)   (1)

where ⌊(W_if − S_iw)/ΔT⌋ + 1 and ⌊(H_if − R_iw)/ΔT⌋ + 1 respectively represent the numbers of sliding positions of the convolution kernel in the horizontal and vertical directions of the input feature map, with the results rounded down.
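A small helper can evaluate equation (1) for typical layer sizes; the function name, the default step and the example dimensions below are illustrative assumptions, not part of the patent.

```python
from math import floor

# Access count of the input feature data in the conventional flow, following
# equation (1) above: every window position re-reads its R_iw x S_iw elements.
def n_ifmap_ini(H_if, W_if, R_iw, S_iw, dT=1):
    slides_h = floor((W_if - S_iw) / dT) + 1   # horizontal sliding positions
    slides_v = floor((H_if - R_iw) / dT) + 1   # vertical sliding positions
    return R_iw * S_iw * slides_h * slides_v

print(n_ifmap_ini(H_if=224, W_if=224, R_iw=3, S_iw=3))   # 9 * 222 * 222 = 443556
```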
Through the calculation flow chart of fig. 2, it can be found that there is a multiplexing situation of the input feature data on the basis of weight sharing, as shown in fig. 3. The optimization calculation method comprises the following steps:
The f_10 used by PE_1 in cycle1 can be shifted left to PE_0 in cycle2, and the same applies to the remaining PE units. Through this reuse of the input feature data, information is exchanged directly between PE units instead of being fetched from a lower storage level with larger latency and power consumption, which reduces data migration across storage levels and saves power.
Similarly, in cycles 4 to 6, PE_0 to PE_2 can reuse the feature data used by PE_3 to PE_5 in cycles 1 to 3, following the data flow marked by the arrows in FIG. 3.
Using too few PEs increases computation time and greatly reduces data reusability; using too many PEs shortens computation time thanks to the larger compute capability, but the power overhead rises sharply.
The invention adopts a number of PEs equal to the size of the convolution kernel; since convolution kernels in deep neural networks are generally small, this balances computation time and power consumption. The feature data are multiplexed while the convolution kernel slides over the feature map first vertically and then horizontally because, as can be seen from the process of FIG. 3, each PE obtains its feature data from the PE adjacent to it on the right, following the vertical direction of the feature map, and new feature data enter the array only through the rightmost PE (PE_8 in the figure). As a result, each input feature datum needs to be brought into the PE array from the low-level memory only once, and the reuse of the input feature data is greatly improved.
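This fetch-once behaviour can be checked with a small enumeration. The sketch below is illustrative only: it assumes the 9 PEs cover a 3×3 block of output positions in column-major order and that the kernel weights are broadcast column by column (vertical then horizontal); this is one plausible reading of FIG. 2 and FIG. 3, not the patent's exact indexing.

```python
# Count feature-data uses vs. unique elements in one 9-cycle parallel round (3x3 kernel).
# Assumed indexing: PE p computes output (p % 3, p // 3); in cycle t the broadcast
# weight is w[t % 3][t // 3] (column-major, i.e. vertical-then-horizontal order).
R = S = 3
uses = []                                  # feature coordinate touched by each MAC
for t in range(R * S):                     # 9 cycles
    wi, wj = t % R, t // R
    for p in range(R * S):                 # 9 PE units
        orow, ocol = p % R, p // R
        uses.append((orow + wi, ocol + wj))

total_uses = len(uses)                     # 81 multiply-accumulate operands
unique_elems = len(set(uses))              # 25 distinct feature elements
print(total_uses, unique_elems)
```

Under these assumptions, the 81 operand uses of one round touch only 25 distinct feature elements, so if every element enters the array once and is then passed between PEs, most low-level memory accesses are replaced by inter-PE transfers.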
Building on the reuse of the feature data, the reusability of the weights is examined next. As can be seen from FIG. 3, starting from cycle10 the weights are broadcast repeatedly to all PEs again beginning with w_00; since the convolution kernel is small, a small register set can be placed in the PE array to latch the weights.
After the method is adopted, the number N_ifmap_opt of accesses of the input feature data to the underlying memory is given by equation (2), which appears only as an image in the original document; ⌊·⌋ denotes the floor (round-down) operation.
Dividing equation (2) by equation (1), selecting convolution kernel and feature map sizes commonly used in deep neural networks, and setting the step ΔT to 1, the variation trend of the access compression rate of the input feature data is obtained as shown in FIG. 4.
As can be seen from the figure, when the feature map size is fixed, the compression rate decreases as the convolution kernel becomes larger, because a larger kernel enlarges the scale of each convolution operation and thus raises the overall degree of reuse. Likewise, when the convolution kernel size is fixed, the compression rate also decreases as the feature map becomes larger, because the kernel slides more times in each direction and the degree of data reuse rises accordingly. In both trends the compression rate tends to saturate, because as the computation scale grows the proportion of reused data approaches its limit, so the curve flattens.
The power consumption of data migration differs greatly between storage levels: obtaining data within the PE array costs 1–2 orders of magnitude less power than obtaining it from off-chip DRAM. The compression rate therefore also reflects the proportion of off-chip DRAM accesses converted into inter-PE accesses, indirectly verifying the effect of the scheme in reducing global dynamic power consumption.
Referring to FIG. 5, the convolution operation structure of the invention for reducing data migration and power consumption in deep neural networks uses a multiplier and an adder as core components to implement the multiply-accumulate operations of the convolution.
The inputs of the multiplier are connected to multiplexers MUX1 and MUX2 respectively; the output of the multiplier and the output of MUX1 are connected to one input of the adder through multiplexer MUX3; the other input of the adder is connected to multiplexer MUX4. The outputs of MUX1, MUX2, the multiplier, MUX3 and MUX4, and the inputs of the adder, are each connected to register reg1; the output of the adder is connected to register reg2, and the output of reg2 is connected to the input of MUX4.
In the input direction, the input feature data f_xy and the feature data f_reuse reusable between PEs pass through multiplexer MUX1 and feed one input of the multiplier, while the input weight w_ij and the weight w_reuse multiplexed between PEs pass through multiplexer MUX2 and feed the other input of the multiplier.
The output of the multiplier and the output of MUX1 pass through multiplexer MUX3 and feed one input of the adder. The role of MUX3 is that, when the input feature datum is 0, the multiplication result is still 0, so the 0 value can be fed directly to the adder without activating the multiplier.
The other input of the adder is provided by multiplexer MUX4, whose inputs are the system initial value "0" and the PE output result inter_psum of the previous cycle. Only when the PE performs the first operation of each convolution is the initial value "0" selected for the addition; thereafter inter_psum is selected to complete the accumulation.
In the input direction, the control signals provide, on the one hand, the selection signals of MUX1/MUX2/MUX3/MUX4 and, on the other hand, the enable signals of the multiplier, the adder, and registers reg1 and reg2, achieving low-power control and precise control of data-reuse correctness.
In the output direction, registers reg1 and reg2 latch the input feature data and the input weights respectively, so that the information can be passed to adjacent PEs at the right moment, reducing data migration and power consumption.
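The datapath described above can be summarized in a cycle-level behavioural sketch. The Python model below is written here for illustration only: the signal names mirror the description, but the coding style is ours, reg2 is modelled as holding the accumulated partial sum inter_psum (following claim 1), and the weight latch mentioned for the output direction is omitted for brevity.

```python
# Behavioural sketch of one PE cycle, mirroring the MUX1-MUX4 datapath described above.
class PE:
    def __init__(self):
        self.reg1 = 0      # latches the input feature datum (exported as f_reuse)
        self.reg2 = 0      # latches the addition result (exported as inter_psum)

    def cycle(self, f_xy, f_reuse, w_ij, w_reuse, sel_mux1, sel_mux2, first_op):
        # MUX1: fresh feature datum, or datum reused from a neighbouring PE
        f = f_xy if sel_mux1 == 0 else f_reuse
        # MUX2: fresh weight, or weight already multiplexed between PEs
        w = w_ij if sel_mux2 == 0 else w_reuse
        # MUX3: a zero feature datum gives a zero product, so the multiplier is
        # not activated and the 0 value is bypassed straight to the adder
        product = 0 if f == 0 else f * w
        # MUX4: the first operation of a convolution starts from the initial value 0,
        # later operations accumulate onto the latched inter_psum
        addend = 0 if first_op else self.reg2
        result = product + addend
        self.reg1 = f          # a neighbour can pick this up as its f_reuse
        self.reg2 = result     # kept as inter_psum for the next cycle
        return result

pe = PE()
pe.cycle(f_xy=3, f_reuse=0, w_ij=2, w_reuse=0, sel_mux1=0, sel_mux2=0, first_op=True)
pe.cycle(f_xy=0, f_reuse=4, w_ij=0, w_reuse=2, sel_mux1=1, sel_mux2=1, first_op=False)
print(pe.reg2)   # 3*2 + 4*2 = 14 accumulated over two cycles
```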
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIG. 6, taking PE_0 and PE_1 as an example: PE_0 and PE_1 receive the same control signals and weight w_ij, while each PE receives different input feature data f_xy. In this structure, to realize weight reuse, a small-capacity FIFO is integrated to store the convolution kernel weights broadcast on the weight bus; after one convolution operation finishes, the FIFO, instead of the weight bus, broadcasts the weights to each PE unit.
To realize reuse of the input feature data, a multiplexer MUX5 is added between the PEs. Under the control bus and following FIG. 3, MUX5 selects whether the data reused by PE_0 comes from the f_reuse of PE_1 or of PE_3.
Although FIG. 6 only describes the interconnection between two adjacent PEs, the structure can be extended to the entire PE array.
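The inter-PE wiring of FIG. 6 can be sketched in the same behavioural style. In the illustrative snippet below, weight_fifo, weight_for_cycle and mux5 are hypothetical names standing in for the small-capacity FIFO and the multiplexer MUX5 described above.

```python
from collections import deque

# Sketch of the FIG. 6 interconnect (illustrative only): a small FIFO re-broadcasts
# the kernel weights after the first convolution, and MUX5 selects the source of
# the reused feature datum.
weight_fifo = deque()              # small-capacity FIFO integrated in the PE array

def weight_for_cycle(w_on_bus, first_round):
    """First round: take the weight from the weight bus and record it in the FIFO.
    Later rounds: the FIFO, not the weight bus, re-broadcasts the stored weights."""
    if first_round:
        weight_fifo.append(w_on_bus)
        return w_on_bus
    w = weight_fifo.popleft()
    weight_fifo.append(w)          # rotate so the kernel can be replayed again
    return w

def mux5(sel, f_reuse_pe1, f_reuse_pe3):
    """MUX5 between PEs: under the control bus, choose whether the datum reused by
    PE_0 comes from the f_reuse of PE_1 or of PE_3 (cf. FIG. 3)."""
    return f_reuse_pe1 if sel == 0 else f_reuse_pe3

print(weight_for_cycle(5, first_round=True))      # 5, taken from the weight bus
print(weight_for_cycle(None, first_round=False))  # 5, replayed from the FIFO
print(mux5(0, f_reuse_pe1=7, f_reuse_pe3=9))      # 7, reused from PE_1
```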
In conclusion, the convolution operation structure of the invention for reducing data migration and power consumption in deep neural networks has been realized as a neural-network coprocessor prototype by model building. The coprocessor is organized as a compute array, adopts a partial-sum-stationary scheme for data reuse, and effectively reduces the data migration rate and the dynamic power consumption by reusing feature data and weights within the array. The proposed structure guarantees the correctness of the convolution results while significantly reducing the number of accesses to the low-level memory, thereby improving the energy efficiency of the deep neural network; it has high practical value and generality.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above-mentioned contents are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modification made on the basis of the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims (7)

1. A convolution operation structure for reducing data migration and power consumption of a deep neural network, characterized by comprising a multiplier and an adder, wherein the inputs of the multiplier are connected to multiplexers MUX1 and MUX2 respectively; the output of the multiplier and the output of MUX1 are connected to one input of the adder through multiplexer MUX3; the other input of the adder is connected to multiplexer MUX4; the outputs of MUX1, MUX2, the multiplier, MUX3 and MUX4, and the inputs of the adder, are each connected to register reg1; the output of the adder is connected to register reg2; and the output of reg2 is connected to the input of MUX4, so as to implement the multiply-accumulate operation of the convolution.
2. The convolution operation structure for reducing data migration and power consumption of a deep neural network according to claim 1, characterized in that, in the input direction, the input feature data f_xy and the feature data f_reuse reusable between PEs pass through multiplexer MUX1 and feed one input of the multiplier, while the input weight w_ij and the weight w_reuse multiplexed between PEs pass through multiplexer MUX2 and feed the other input of the multiplier.
3. The convolution operation structure for reducing data migration and power consumption of a deep neural network according to claim 1, characterized in that the output of the multiplier and the output of MUX1 pass through multiplexer MUX3 and feed one input of the adder, and MUX3 is configured so that, when the input feature datum is 0, the multiplication result is 0, the multiplier is not activated, and the 0 value is bypassed to the adder.
4. The convolution operation structure for reducing data migration and power consumption of a deep neural network according to claim 3, characterized in that the other input of the adder is provided by multiplexer MUX4, whose inputs are the system initial value 0 and the PE output result inter_psum of the previous cycle; when the PE performs the first operation of each convolution, the initial value 0 is selected for the addition, and thereafter inter_psum is selected to complete the accumulation.
5. The convolution operation structure for reducing data migration and power consumption of a deep neural network according to claim 1, characterized in that, in the input direction, the control signals provide the selection signals of MUX1/MUX2/MUX3/MUX4 and the enable signals of the multiplier, the adder, and registers reg1 and reg2.
6. The convolution operation structure for reducing data migration and power consumption of a deep neural network according to claim 5, characterized in that, in the output direction, registers reg1 and reg2 latch the input feature data and the input weights respectively.
7. The convolution operation structure for reducing data migration and power consumption of a deep neural network according to claim 6, characterized in that the number N_ifmap_opt of accesses of the input feature data f_xy is calculated by a formula that appears only as an image in the original document, where ⌊·⌋ denotes the floor (round-down) operation, R_iw and S_iw are the height and width of the convolution kernel, H_if and W_if are the height and width of the input feature map, and ΔT is the step size.
CN202010130325.7A 2020-02-28 2020-02-28 Convolution operation structure for reducing data migration and power consumption of deep neural network Active CN111275180B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010130325.7A CN111275180B (en) 2020-02-28 2020-02-28 Convolution operation structure for reducing data migration and power consumption of deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010130325.7A CN111275180B (en) 2020-02-28 2020-02-28 Convolution operation structure for reducing data migration and power consumption of deep neural network

Publications (2)

Publication Number Publication Date
CN111275180A true CN111275180A (en) 2020-06-12
CN111275180B CN111275180B (en) 2023-04-07

Family

ID=70999267

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010130325.7A Active CN111275180B (en) 2020-02-28 2020-02-28 Convolution operation structure for reducing data migration and power consumption of deep neural network

Country Status (1)

Country Link
CN (1) CN111275180B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109190756A (en) * 2018-09-10 2019-01-11 中国科学院计算技术研究所 Arithmetic unit based on Winograd convolution and the neural network processor comprising the device
KR20190065144A (en) * 2017-12-01 2019-06-11 한국전자통신연구원 Processing element and operating method thereof in neural network
CN110659014A (en) * 2018-06-29 2020-01-07 赛灵思公司 Multiplier and neural network computing platform

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190065144A (en) * 2017-12-01 2019-06-11 한국전자통신연구원 Processing element and operating method thereof in neural network
CN110659014A (en) * 2018-06-29 2020-01-07 赛灵思公司 Multiplier and neural network computing platform
CN109190756A (en) * 2018-09-10 2019-01-11 中国科学院计算技术研究所 Arithmetic unit based on Winograd convolution and the neural network processor comprising the device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Junyang; Guo Yang: "Design and Implementation of Two-Dimensional Matrix Convolution in a Vector Processor" *

Also Published As

Publication number Publication date
CN111275180B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN106940815B (en) Programmable convolutional neural network coprocessor IP core
CN110378468B (en) Neural network accelerator based on structured pruning and low bit quantization
CN108805266B (en) Reconfigurable CNN high-concurrency convolution accelerator
Liang et al. Evaluating fast algorithms for convolutional neural networks on FPGAs
CN111459877B (en) Winograd YOLOv2 target detection model method based on FPGA acceleration
KR102443546B1 (en) matrix multiplier
CN108241890B (en) Reconfigurable neural network acceleration method and architecture
CN108205701B (en) System and method for executing convolution calculation
CN103984560B (en) Based on extensive coarseness imbedded reconfigurable system and its processing method
CN102122275A (en) Configurable processor
CN111414994B (en) FPGA-based Yolov3 network computing acceleration system and acceleration method thereof
CN111105023B (en) Data stream reconstruction method and reconfigurable data stream processor
CN110851779B (en) Systolic array architecture for sparse matrix operations
Que et al. Optimizing reconfigurable recurrent neural networks
CN110766128A (en) Convolution calculation unit, calculation method and neural network calculation platform
CN109144469A (en) Pipeline organization neural network matrix operation framework and method
Que et al. Recurrent neural networks with column-wise matrix–vector multiplication on FPGAs
CN110414672B (en) Convolution operation method, device and system
CN116710912A (en) Matrix multiplier and control method thereof
Xie et al. High throughput CNN accelerator design based on FPGA
Shrivastava et al. A survey of hardware architectures for generative adversarial networks
Huang et al. A high performance multi-bit-width booth vector systolic accelerator for NAS optimized deep learning neural networks
CN111275180B (en) Convolution operation structure for reducing data migration and power consumption of deep neural network
Shang et al. LACS: A high-computational-efficiency accelerator for CNNs
CN109948787B (en) Arithmetic device, chip and method for neural network convolution layer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant