CN112070217B - Internal storage bandwidth optimization method of convolutional neural network accelerator - Google Patents

Internal storage bandwidth optimization method of convolutional neural network accelerator


Publication number
CN112070217B
CN112070217B (application number CN202011102647.7A)
Authority
CN
China
Prior art keywords
data
cache
row
addr
vector
Prior art date
Legal status
Active
Application number
CN202011102647.7A
Other languages
Chinese (zh)
Other versions
CN112070217A
Inventor
李幼萌
王亚博
Current Assignee
Tianjin University
Original Assignee
Tianjin University
Priority date
Filing date
Publication date
Application filed by Tianjin University
Priority to CN202011102647.7A
Publication of CN112070217A
Application granted
Publication of CN112070217B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Complex Calculations (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention provides an internal memory bandwidth optimization method for a convolutional neural network accelerator, comprising the following steps. Step 1: a CACHE (cache memory) of the same size as the number of computing units is placed between the internal memory RAM (random access memory) storing the feature data to be computed and the computing units; in each computing period the CACHE data are first transmitted into the computing units. Step 2: the memory read-write control unit reads and prepares the data used in the next period: the cache block moves up and down preferentially relative to the memory area; each time, the one row of data in the cache block that differs from the previous period is discarded, the other data shift forward in sequence, and the data newly covered by the cache block are added into the emptied row. Step 3: when the cache block has moved vertically to the boundary, the data newly covered by the cache block are added into an emptied column. With data multiplexing at its core, this data reading method, which transfers memory addresses according to a special rule, matches all computing units with as little bandwidth and memory as possible and maximizes computing-unit efficiency.

Description

Internal storage bandwidth optimization method of convolutional neural network accelerator
Technical Field
The invention belongs to the technical field of convolutional neural networks, and particularly relates to an internal storage bandwidth optimization method of a convolutional neural network accelerator.
Background
A multi-layer neural network used for convolution can be divided into three parts: an input layer, hidden layers, and an output layer. The input layer comprises only one layer; it directly receives the two-dimensional image input and is responsible for passing the data to be processed to the hidden layers in matrix form. The output layer mainly outputs classification labels using a logistic function or a normalization function (softmax function). The hidden layers generally comprise three common structures: convolutional layers, pooling layers, and fully connected layers, of which the convolutional and pooling layers are unique to convolutional neural networks. The convolution kernels in a convolutional layer contain weight coefficients, whereas a pooling layer can be considered to contain none, so in the literature a pooling layer is sometimes not counted as a separate layer. Taking LeNet-5 as an example, the order of the hidden layers is: convolutional layer - pooling layer - convolutional layer - pooling layer - convolutional layer (which can also be regarded as a fully connected layer) - fully connected layer.
As convolutional neural networks see ever wider use, several constraints arise in real scenarios, such as extremely high computational load and the high latency and energy consumption caused by massive data accesses at the embedded end. These constraints greatly limit the application of convolutional neural networks in daily life, even though the technology itself is mature enough to bring convenience in many situations. Designing and implementing convolutional neural network accelerators on FPGAs (field programmable gate arrays) can address these problems. A convolutional neural network can be viewed as being composed of numerous neurons, which in turn can be built from basic logic resources. Owing to the characteristics of convolutional neural networks and the advantages of FPGAs, a well-designed accelerator can directly address core problems such as the mismatch between computational load and computing units, and between data volume and bandwidth. Existing convolutional neural network accelerators parallelize the computation of the convolutional neural network as much as possible; for example, more computing units are introduced into the extremely computation-heavy convolutional layers to speed them up, but memory and bandwidth must be expanded correspondingly to support them, and because adjacent convolution operations share much repeated data, bandwidth and access energy are wasted. This problem is mainly caused by the inherent characteristics of the on-chip memory (BRAM).
Generally, in the design of a convolutional neural network accelerator, weight data and intermediate data are temporarily stored in BRAM (Block RAM), which offers high-speed read-write performance, but a BRAM can read data from only one address per clock cycle. The read bandwidth must therefore be increased to parallelize more operations within one cycle, and increasing bandwidth requires more BRAM resources. Because of the particular way the convolutional layer operates, adjacent convolution operations use many identical data, so data multiplexing can be used to parallelize more computation. Doing so naively, however, can stall the computation when data cannot be multiplexed: for example, when data are read row by row from a two-dimensional data block, the data of two adjacent computations are not identical, and the computation must wait for all data to be reloaded.
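To make the scale of this data overlap concrete, the following Python sketch (for illustration only, not part of the patented design; the 5x5 kernel size is an assumption) counts how many inputs two horizontally adjacent convolution windows share at stride 1:

    # Illustrative sketch: overlap between two adjacent 5x5 convolution
    # windows at stride 1 on a two-dimensional feature map.
    N = 5  # assumed convolution kernel size (CoreLen in this document)

    window_a = {(r, c) for r in range(N) for c in range(N)}      # window at (0, 0)
    window_b = {(r, c + 1) for r in range(N) for c in range(N)}  # window shifted right by 1

    shared = window_a & window_b
    print(f"shared inputs: {len(shared)} of {N * N}")  # 20 of 25 -> only 5 new values

Only one column of new values is needed per shift, which is exactly the reuse opportunity the method below exploits.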
Disclosure of Invention
To relieve the bandwidth pressure caused by adding computing units to a convolutional neural network accelerator, the invention provides a data reading method, centered on data multiplexing, that transfers memory addresses according to a special rule. It minimizes the number of memory accesses while maximizing the reuse rate of data in the buffer, so that all computing units are matched with as little bandwidth and memory as possible and computing-unit efficiency is maximized.
The invention is implemented by adopting the following technology:
an internal storage bandwidth optimization method of a convolutional neural network accelerator, wherein an optimization module for reducing bandwidth requirements of the accelerator is arranged in a convolutional layer of the neural network, and the optimization module comprises the following steps:
step 1, setting a CACHE of the same size as the number of computing units between the internal storage RAM storing the feature data to be computed and the computing units, wherein each datum in the CACHE is directly connected to a computing unit, and in each computing period the data in the CACHE are first transmitted into the computing units;
step 2, using the memory read-write control unit to read and prepare the data used in the next period: the cache block moves up and down preferentially relative to the memory area; each time, the one row of data in the cache block that differs from the previous period is discarded, the other data shift forward in sequence, and the data newly covered by the cache block are added into the emptied row;
and step 3, when the cache block has moved vertically to the boundary, moving the cache block horizontally by one step, discarding the one column of data in the cache block that differs from the previous period, shifting the other data sideways in sequence, and adding the data newly covered by the cache block into the emptied column (a back-of-envelope comparison of the resulting memory traffic follows these steps).
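As a rough illustration of why this serpentine sliding saves memory traffic, the following Python sketch (an illustration only, with assumed 14x14 feature data and a 5x5 cache; not the patented RTL) compares the total RAM reads of one full scan against reloading the whole window at every position:

    # Back-of-envelope comparison: total RAM reads for one full scan of an
    # R x C feature map with an N x N cache, serpentine sliding vs. reloading
    # the whole window at every position.
    R, C, N = 14, 14, 5
    windows = (R - N + 1) * (C - N + 1)            # 100 window positions
    serpentine_reads = N * N + (windows - 1) * N   # init + N new words per move
    naive_reads = windows * N * N                  # reload everything each time
    print(serpentine_reads, "vs", naive_reads)     # 520 vs 2500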
Further, the optimization module realizes two-dimensional characteristic data caching by adopting the following steps:
step 0, initializing the CACHE and preparing the data used in the first computing period: when the process starts, the memory read-write control unit transmits to the internal storage RAM the address vector of the data that should be stored in the first row of the CACHE; in the next period it submits the address vector of the second row and puts the received data vector of the first row into the first row of the CACHE; this repeats until the CACHE is full and initialization is flagged complete, at which point the data in the CACHE correspond one-to-one to the upper-left-corner data of the internal storage RAM; wherein: a variable AddrBase maintaining the base address is set, and its first value equals CoreLen minus 1; the last element of each address vector equals the value of AddrBase and the other elements decrement in sequence; afterwards the value of AddrBase is changed to Addr0, calculated as shown in formula (1), where InMapCol is a hyperparameter, the number of columns of the two-dimensional feature data.
Addr0 = AddrBase + InMapCol (1)
Step 1: after initialization is completed, the computing unit starts to work, the first operation is completed by using the data in the CACHE, the memory read-write control unit continues to execute the computation and transmission of the addresses, the received data vector is placed in the last line of the CACHE, the original last line is transferred to the next-to-last line, the data is moved upwards according to the rule, the original first line data is discarded, and the Counter is used at the same time 1 Starting to work, and automatically increasing each time the transfer of the address vector is executed until the address vector is not smaller than InMapRow minus CoreLen;
step 2: when step 1 ends, the sliding of the CACHE relative to the two-dimensional feature data has reached the bottom of the column, and the CACHE now moves one column to the right; the calculation of the address vector changes: the value of AddrBase is incremented by 1 and used as the last element of the address vector, and each earlier element decreases by InMapCol in sequence; the data vector obtained with this address vector is put into the CACHE in a new way: each row of the CACHE moves forward by one element position, the head element of each row is discarded, the first element of the new data vector is filled into the last element position of the first row of the CACHE, the second element of the data vector into the last element position of the second row, and so on; afterwards the value of AddrBase is corrected to Addr1, calculated as shown in formula (2); the variable Counter2 increments, and step 3 is entered:
Addr1 = AddrBase - CoreLen × InMapCol (2)
step 3: this step is similar to step 1, except that after each address-vector calculation the value of the variable AddrBase is changed to Addr2, as shown in formula (3); the received data vector is placed in the first row of the CACHE, the original first row moves down to become the new second row, the other rows move down in sequence, and the data of the last row are discarded; meanwhile Counter1 decrements by 1 each time until it is not greater than 1, and then step 4 is entered.
Addr2 = AddrBase - InMapCol (3)
Step 4: the step is similar to step 2, the calculation mode of the address vector in the step is changed into the value of AddrBase plus 1 as the last element of the address vector, other elements are sequentially increased by InMapCol forward, then each row of the CACHE is moved forward by one element position, the row head element of each row is discarded, then the last element of the new data vector is filled in the position of the last element of the first row of the CACHE, the last element of the data vector is filled in the position of the last element of the second row of the CACHE, and the steps are sequentially carried out, and after the filling is completed, the Counter is used 2 Make a judgment if Counter 2 Smaller than Counter, then Counter 2 Self-increment 1, correcting the value of AddBase to Addr 3 ,Addr 3 The calculation of (2) is shown as a formula (5), and the step (1) is carried out; otherwise, the characteristic data is scanned once, whether other convolution kernels need to be convolved with the characteristic data is judged, the ScanIndex variable records the scanned times of a group of two-dimensional characteristic data, if the ScanIndex is not smaller than CoreGroup, the two-dimensional characteristic data is used, related variables need to be initialized, otherwise, the AddBase value is corrected to be Addr 4 ,Addr 4 As shown in formula (6), and proceeds to step 5:
Counter = InMapCol - CoreLen (4)
Addr3 = AddrBase + CoreLen × InMapCol (5)
Addr4 = AddrBase - CoreLen + CoreLen × InMapCol (6)
step 5: to eliminate a re-initialization of the CACHE, the CACHE can move back in reverse along its sliding path; since one convolution kernel corresponds to one group of two-dimensional feature data, the starting position of the sliding may be anywhere; at this point the CACHE corresponds to the upper-right-corner data of the two-dimensional feature data; this step is generally similar to step 1, except that the first element of the address vector equals the value of the variable AddrBase and the other elements increment in sequence; Counter1 behaves under the same form and condition as in step 1, and when it is not smaller than Counter′, step 6 is entered, with Counter′ calculated as shown in formula (7):
Counter′ = InMapRow - CoreLen (7)
step 6: this step is similar to step 2, except that the value of AddrBase decrements by 1 and is used as the first element of the address vector, and each later element decreases by InMapCol in sequence; each row of the CACHE moves backward by one element position and the tail element of each row is discarded; the first element of the new data vector is filled into the first element position of the last row of the CACHE, the second element into the first element position of the second-to-last row, and so on; afterwards the value of AddrBase is corrected to Addr1, Counter2 decrements by 1, and step 7 is entered;
step 7: this step is basically the same as step 3, except that the address vector is calculated as in step 5: the first element of the address vector equals the value of the variable AddrBase and the other elements increment in sequence; when Counter1 is not greater than 1, step 8 is entered;
step 8: this step is similar to step 4; the value of AddrBase decrements by 1 and is used as the first element of the address vector, and each later element decreases by InMapCol in sequence; each row of the CACHE moves backward by one element position, the tail element of each row is discarded, the first element of the new data vector is filled into the first element position of the first row of the CACHE, and the second element into the first element position of the second row; after filling, Counter2 is checked: if Counter2 is greater than 1, Counter2 decrements by 1, the value of AddrBase is corrected to Addr3, and step 5 is entered; otherwise the feature data have been scanned once; if ScanIndex is not smaller than CoreGroup, the two-dimensional feature data are exhausted and the related variables are initialized, otherwise the value of the variable AddrBase is corrected to Addr5, calculated as shown in formula (8), and step 1 is entered;
Addr5 = AddrBase + CoreLen + CoreLen × InMapCol (8)
Advantageous effects
In the application of a convolutional neural network accelerator, the method meets the demand of the added computing units with as little bandwidth as possible; in the computation of a convolutional layer it achieves the smallest buffer size for a given data reuse rate, and for a given buffer size it minimizes the time needed to complete the data buffering.
Drawings
FIG. 1 is a schematic diagram of a memory architecture in accordance with the present invention;
FIG. 2 is a schematic diagram of step transition and partial variables;
FIG. 3 is a step transition flow diagram;
FIG. 4 is a schematic diagram of a LeNet-5 network instantiation model;
FIG. 5 is a schematic diagram of a convolution layer module overall;
FIG. 6 is a schematic diagram of an initialization state;
FIG. 7 is a schematic diagram of State 1;
FIG. 8 is a state 2 schematic;
FIG. 9 is a state 3 schematic;
FIG. 10 is a state 4 schematic diagram.
Detailed description of the embodiments:
the technology and method of the present invention will be described in detail below with reference to the following examples and drawings, which are provided to illustrate the constitution of the present invention, but are not intended to limit the scope of the present invention.
The main content of the method is as follows: the feature data and weight coefficients stored in the external DRAM are first cached in the internal storage RAM connected to each layer of the convolutional neural network accelerator, and optimization processing is then performed between the internal storage RAM and the computing units in the following way; the overall memory structure is shown in FIG. 1.
The specific optimization scheme is as follows. A CACHE of the same size as the number of computing units is implemented between the internal storage RAM storing the feature data to be computed and the computing units, and each datum in the CACHE is wired directly to a computing unit. In each computing period the data in the CACHE are first transmitted into the computing units, and the memory read-write control unit then reads and prepares the data used in the next period by the following steps: the cache block moves up and down preferentially relative to the memory area; each time, the one row of data in the cache block that differs from the previous period is discarded, the other data shift forward in sequence, and the data newly covered by the cache block are added into the emptied row. When the block has moved vertically to the boundary, it moves one step to the right (or to the left; the direction stays consistent while scanning one set of two-dimensional feature data), the one column of data in the cache block that differs from the previous period is discarded, the other data shift sideways in sequence, and the data newly covered by the cache block are added into the emptied column. This movement pattern ensures the lowest bandwidth requirement and the highest data multiplexing rate for the same computing-unit data demand. The transitions between the steps and some of the variables are shown in FIG. 2.
Step 0: this is the initial stage, which initializes the CACHE and prepares the data used in the first computing period. When the process starts, the memory read-write control unit transmits to the internal storage RAM the address vector of the data that should be stored in the first row of the CACHE; in the next period it submits the address vector of the second row and puts the received data vector of the first row into the first row of the CACHE. This repeats until the CACHE is full and initialization is flagged complete, at which point the data in the CACHE correspond one-to-one to the upper-left-corner data of the internal storage RAM. A variable AddrBase maintaining the base address is set; its first value equals CoreLen (a hyperparameter, the number of columns of the CACHE) minus 1. The last element of each address vector equals the value of AddrBase and the other elements decrement in sequence; the value of AddrBase is then changed to Addr0, calculated as shown in formula (1), where InMapCol is a hyperparameter, the number of columns of the two-dimensional feature data.
Addr0 = AddrBase + InMapCol (1)
Step 1: after initialization is completed, the computing unit starts to work and completes the first operation using the data in the CACHE. The memory read-write control unit continues to compute and transmit addresses; the received data vector is placed in the last row of the CACHE, the original last row moves up to the second-to-last row, the other data move up by the same rule, and the original first row of data is discarded. Meanwhile Counter1 starts working and increments each time an address vector is transmitted, until it is not smaller than InMapRow (a hyperparameter, the number of rows of the two-dimensional feature data) minus CoreLen; step 2 is then entered.
Step 2: at the end of step 1, the sliding of the CACHE relative to the two-dimensional feature data has reached the bottom of the column, and the CACHE now moves one column to the right. The calculation of the address vector changes: the value of AddrBase is incremented by 1 and used as the last element of the address vector, and each earlier element decreases by InMapCol in sequence. The data vector obtained with this address vector is put into the CACHE in a new way: each row of the CACHE moves forward by one element position and the head element of each row is discarded; the first element of the new data vector is filled into the last element position of the first row of the CACHE, the second element of the data vector into the last element position of the second row, and so on. Afterwards the value of AddrBase is corrected to Addr1, calculated as shown in formula (2); the variable Counter2 increments, and step 3 is entered.
Addr1 = AddrBase - CoreLen × InMapCol (2)
Step 3: this step is similar to step 1, except that after each address-vector calculation the value of the variable AddrBase is changed to Addr2, as shown in formula (3). The received data vector is placed in the first row of the CACHE, the original first row moves down to become the new second row, the other rows move down in sequence, and the data of the last row are discarded. Meanwhile Counter1 decrements by 1 each time until it is not greater than 1, and then step 4 is entered.
Addr2 = AddrBase - InMapCol (3)
Step 4: this step is similar to step 2. The calculation of the address vector changes: the value of AddrBase plus 1 is used as the last element of the address vector, and each earlier element increases by InMapCol in sequence. Each row of the CACHE then moves forward by one element position and the head element of each row is discarded; the last element of the new data vector is filled into the last element position of the first row of the CACHE, the second-to-last element into the last element position of the second row, and so on. After filling is completed, Counter2 is checked: if Counter2 is smaller than Counter, Counter2 increments by 1, the value of AddrBase is corrected to Addr3, calculated as shown in formula (5), and step 1 is entered. Otherwise the feature data have been scanned once, and it must be judged whether other convolution kernels still need to perform convolution calculations with them; the ScanIndex variable records how many times a group of two-dimensional feature data has been scanned. If ScanIndex is not smaller than CoreGroup (a hyperparameter, the number of groups of convolution kernels in the layer), the two-dimensional feature data are exhausted and the related variables must be initialized; otherwise the value of AddrBase is corrected to Addr4, calculated as shown in formula (6), and step 5 is entered.
Counter = InMapCol - CoreLen (4)
Addr3 = AddrBase + CoreLen × InMapCol (5)
Addr4 = AddrBase - CoreLen + CoreLen × InMapCol (6)
Step 5: to eliminate a re-initialization of the CACHE, the CACHE can move back in reverse along its sliding path; since one convolution kernel corresponds to one group of two-dimensional feature data, the starting position of the sliding may be anywhere. At this point the CACHE corresponds to the upper-right-corner data of the two-dimensional feature data. This step is generally similar to step 1, except that the first element of the address vector equals the value of the variable AddrBase and the other elements increment in sequence. Counter1 behaves under the same form and condition as in step 1; when it is not smaller than Counter′, step 6 is entered, with Counter′ calculated as shown in formula (7).
Counter′ = InMapRow - CoreLen (7)
Step 6: this step is similar to step 2, except that the value of AddrBase decrements by 1 and is used as the first element of the address vector, and each later element decreases by InMapCol in sequence. Each row of the CACHE moves backward by one element position and the tail element of each row is discarded; the first element of the new data vector is filled into the first element position of the last row of the CACHE, the second element into the first element position of the second-to-last row, and so on. Afterwards the value of AddrBase is corrected to Addr1; Counter2 decrements by 1, and step 7 is entered.
Step 7: this step is basically the same as step 3, except that the address vector is calculated as in step 5: the first element of the address vector equals the value of the variable AddrBase and the other elements increment in sequence. When Counter1 is not greater than 1, step 8 is entered.
Step 8: this step is similar to step 4. The value of AddrBase decrements by 1 and is used as the first element of the address vector, and each later element decreases by InMapCol in sequence; each row of the CACHE moves backward by one element position, the tail element of each row is discarded, the first element of the new data vector is filled into the first element position of the first row of the CACHE, and the second element into the first element position of the second row. After filling, Counter2 is checked: if Counter2 is greater than 1, Counter2 decrements by 1, the value of AddrBase is corrected to Addr3, and step 5 is entered. Otherwise the feature data have been scanned once; if ScanIndex is not smaller than CoreGroup, the two-dimensional feature data are exhausted and the related variables are initialized, otherwise the value of the variable AddrBase is corrected to Addr5, calculated as shown in formula (8), and step 1 is entered.
Addr5 = AddrBase + CoreLen + CoreLen × InMapCol (8)
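For concreteness, the following Python fragment (illustrative only; the variable names follow the text above, and the parameter values are taken from the C2-layer embodiment described later) expresses the correction formulas (1) to (8) and evaluates the first one:

    # Illustrative helper: the AddrBase correction formulas (1)-(8) as
    # functions, with the C2-layer parameters used later in this document
    # (InMapCol = InMapRow = 14, CoreLen = 5).
    InMapCol = InMapRow = 14
    CoreLen = 5

    addr0 = lambda base: base + InMapCol                      # formula (1)
    addr1 = lambda base: base - CoreLen * InMapCol            # formula (2)
    addr2 = lambda base: base - InMapCol                      # formula (3)
    Counter = InMapCol - CoreLen                              # formula (4)
    addr3 = lambda base: base + CoreLen * InMapCol            # formula (5)
    addr4 = lambda base: base - CoreLen + CoreLen * InMapCol  # formula (6)
    CounterPrime = InMapRow - CoreLen                         # formula (7)
    addr5 = lambda base: base + CoreLen + CoreLen * InMapCol  # formula (8)

    # Example: after initialization AddrBase = CoreLen - 1 = 4, and the first
    # downward slide applies formula (1): the next base address is 4 + 14 = 18.
    print(addr0(CoreLen - 1))  # 18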
The above steps are what one feature map undergoes during computation in a given convolutional layer. The computation results must be temporarily stored in another internal storage RAM to await accumulation with other data for output; when the data are output, the relative positions between the data must be kept correct, and the addresses corresponding to the output data are generated with the same method as above. The transition flow between the steps is shown in FIG. 3. Let the bit width of each datum be W and the convolution kernel size be N×N. Buffering data with this flow, after the N-cycle initialization only N data in the CACHE need to be replaced per cycle; that is, a bandwidth of only (N×W) bits satisfies the data demand of (N×N×W) bits of computation per cycle. This keeps every computing unit busy in every cycle, maximizes utilization of the computing resources, maximizes the data reuse rate, reduces the bandwidth needed to match the computing resources as much as possible, and avoids the resource waste of repeatedly reading the same data.
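As a worked example of this bandwidth figure, the following sketch assumes W = 16 bits (the bit width is not fixed by this document) and N = 5:

    # Worked example of the bandwidth claim; W = 16 is an assumed bit width.
    N, W = 5, 16
    per_cycle_bandwidth = N * W   # bits fetched from the RAM each cycle: 80
    data_consumed = N * N * W     # bits entering the compute units each cycle: 400
    print(f"required bandwidth: {per_cycle_bandwidth} bit/cycle; "
          f"data consumed: {data_consumed} bit/cycle; "
          f"reduction: {data_consumed // per_cycle_bandwidth}x")  # 5x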
The common symbols used herein are interpreted in Table 1:
Table 1. Common symbol interpretation
Symbol      Meaning
CoreLen     Hyperparameter; the number of rows/columns of the CACHE (the convolution kernel size)
InMapCol    Hyperparameter; the number of columns of the two-dimensional feature data
InMapRow    Hyperparameter; the number of rows of the two-dimensional feature data
CoreGroup   Hyperparameter; the number of groups of convolution kernels in the layer
AddrBase    Variable maintaining the base address for address-vector generation
Counter1    Counter of vertical sliding steps within one column strip
Counter2    Counter of horizontal (column) shifts within one scan
ScanIndex   Number of times a group of two-dimensional feature data has been scanned
The method is used in a convolutional neural network accelerator that realizes vertical expansion and horizontal folding: vertical expansion maps each layer of the convolutional neural network to its own hardware, while horizontal folding reuses resources within the same layer, completing all the work of the layer by changing the parameters fed to those resources. Since the basic building blocks of the hidden layers are the convolutional layer, the pooling layer, and the fully connected layer, each is implemented as a parameterizable module, and the required convolutional neural network accelerator is formed by instantiating the corresponding modules in the appropriate positions. FIG. 4 is an exemplary model diagram of a LeNet-5 network under this scheme.
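The following Python sketch (a structural illustration with assumed class and parameter names; the actual design is FPGA hardware, not software) shows the idea of one parameterizable module per layer type, instantiated once per layer for a LeNet-5 style network:

    # Illustrative model of "vertical expansion": one parameterizable module
    # per layer type, one instance per network layer. Within each instance,
    # "horizontal folding" reuses the same compute resources across kernel
    # groups. Class and field names are assumptions for illustration.
    from dataclasses import dataclass

    @dataclass
    class ConvLayer:
        in_maps: int
        out_maps: int
        kernel: int

    @dataclass
    class PoolLayer:
        size: int

    @dataclass
    class FCLayer:
        in_features: int
        out_features: int

    lenet5 = [
        ConvLayer(1, 6, 5), PoolLayer(2),    # C1, S2
        ConvLayer(6, 16, 5), PoolLayer(2),   # C3, S4
        ConvLayer(16, 120, 5),               # C5 (acts as a fully connected layer)
        FCLayer(120, 84), FCLayer(84, 10),   # F6, output
    ]
    print(len(lenet5), "layer instances")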
The above technical solution is used in the design of the convolutional layer module; a schematic diagram of the module is shown in FIG. 5:
the principles and flows in the operation of the instantiated convolutional layer are consistent, so the following description will be made only with reference to the C2 layer parameters. In the beginning stage of the layer, the BRAM at the front end of the module already receives 6 groups of 14x14 two-dimensional characteristic data transmitted by the upper layer, the ending signal finish of the previous layer is set high, and the beginning signal start of the layer is activated. When the start signal is set high, the memory read-write control module starts to work, addr_reg is equal to 4, the addr_reg continuously updates the address vector addr_vector, the updating mode is the same as that in the following state 1, the address vector addr_vector is transmitted to the BRAM, the returned data vector data_vector is put into the cache_reg until the cache_reg of 5x5 is filled, as shown in fig. 6, and then the init_finish signal is set high, so that the next stage is ready to enter. Since the addr_vector is 5 address lengths, each address length is considered as an element, and thus only 5 address vectors need to be updated. And then waits (25-5) clock cycles to ensure that the initialization of the convolution kernel is complete and begins to enter the first state of the state machine. The state of the state machine is represented by a 3-bit 2-system number state, which can represent 8 numbers, 0 to 7, and the 8 numbers represent 8 states respectively corresponding to 8 steps in the technical scheme.
State 1 (state=0): the value of the register addr_reg maintaining the address information is increased by 14, i.e. the number of columns of the two-dimensional feature data, per clock cycle, and then
addr_vector[4]=addr_reg,
addr_vector[3]=addr_reg-1,
addr_vector[2]=addr_reg-2,
addr_vector[1]=addr_reg-3,
addr_vector[0]=addr_reg-4
forms a new addr_vector; the BRAM then returns a new data_vector, so that
cache_reg[0]=cache_reg[1],
cache_reg[1]=cache_reg[2],
cache_reg[2]=cache_reg[3],
cache_reg[3]=cache_reg[4],
cache_reg[4]=data_vector.
The above actions can be regarded as sliding the two-dimensional cache_reg from top to bottom in a column of the two-dimensional feature data in the BRAM, continuously acquiring the data of the corresponding positions, as shown in FIG. 7. After each update count1 = count1 + 1; when count1 is not smaller than (14-5), state = 1 and state 2 is entered.
State 2 (state=1): this state does not cycle itself, let addr_reg=addr_reg+1, then let
addr_vector[4]=addr_reg,
addr_vector[3]=addr_reg-14,
addr_vector[2]=addr_reg-14x2,
addr_vector[1]=addr_reg-14x3,
addr_vector[0]=addr_reg-14x4
a new addr_vector is formed; the BRAM then returns a new data_vector, so that
cache_reg[0]={cache_reg[0][1],cache_reg[0][2],cache_reg[0][3],cache_reg[0][4],data_vector[0]},
cache_reg[1]={cache_reg[1][1],cache_reg[1][2],cache_reg[1][3],cache_reg[1][4],data_vector[1]},
cache_reg[2]={cache_reg[2][1],cache_reg[2][2],cache_reg[2][3],cache_reg[2][4],data_vector[2]},
cache_reg[3]={cache_reg[3][1],cache_reg[3][2],cache_reg[3][3],cache_reg[3][4],data_vector[3]},
cache_reg[4]={cache_reg[4][1],cache_reg[4][2],cache_reg[4][3],cache_reg[4][4],data_vector[4]}
The above actions can be regarded as sliding the two-dimensional cache_reg one step from left to right along the lowest row of the two-dimensional feature data in the BRAM, acquiring the data of the corresponding positions, as shown in FIG. 8.
After the above actions are completed, count2 = count2 + 1 and addr_reg = addr_reg - 5×14; then state = 2 and state 3 is entered.
State 3 (state=2): the value of the register addr_reg maintaining the address information is increased or decreased by 14 per clock cycle, i.e. the number of columns of the two-dimensional characteristic data, and then is made
addr_vector[4]=addr_reg,
addr_vector[3]=addr_reg-1,
addr_vector[2]=addr_reg-2,
addr_vector[1]=addr_reg-3,
addr_vector[0]=addr_reg-4
a new addr_vector is formed; the BRAM then returns a new data_vector, so that
cache_reg[4]=cache_reg[3],
cache_reg[3]=cache_reg[2],
cache_reg[2]=cache_reg[1],
cache_reg[1]=cache_reg[0],
cache_reg[0]=data_vector,
The above actions can be regarded as sliding the two-dimensional cache_reg from bottom to top in a column of the two-dimensional feature data in the BRAM, continuously acquiring the data of the corresponding positions, as shown in FIG. 9.
After the above actions are executed, it is judged whether count1 is greater than 1; if so, count1 = count1 - 1; otherwise state = 3 and state 4 is entered.
State 4 (state=3): this state does not cycle itself, let addr_reg=addr_reg+1, then let
addr_vector[4]=addr_reg,
addr_vector[3]=addr_reg+14,
addr_vector[2]=addr_reg+14x2,
addr_vector[1]=addr_reg+14x3,
addr_vector[0]=addr_reg+14x4
a new addr_vector is formed; the BRAM then returns a new data_vector, so that
cache_reg[0]={cache_reg[0][1],cache_reg[0][2],cache_reg[0][3],cache_reg[0][4],data_vector[4]},
cache_reg[1]={cache_reg[1][1],cache_reg[1][2],cache_reg[1][3],cache_reg[1][4],data_vector[3]},
cache_reg[2]={cache_reg[2][1],cache_reg[2][2],cache_reg[2][3],cache_reg[2][4],data_vector[2]},
cache_reg[3]={cache_reg[3][1],cache_reg[3][2],cache_reg[3][3],cache_reg[3][4],data_vector[1]},
cache_reg[4]={cache_reg[4][1],cache_reg[4][2],cache_reg[4][3],cache_reg[4][4],data_vector[0]}
The above actions can be regarded as sliding the two-dimensional cache_reg one step from left to right along the uppermost row of the two-dimensional feature data in the BRAM, acquiring the data of the corresponding positions, as shown in FIG. 10.
After the above actions are executed, it is judged whether count2 is smaller than (14-5): if so, count2 = count2 + 1 and addr_reg = addr_reg + 5x14, then state = 0 and state 1 is entered. If not, the register one_pic = one_pic + 1, and it is judged whether one_pic is smaller than 16: if so, addr_reg = addr_reg - 5 + 5x14 and state = 4, entering state 5; otherwise the signal round_finish is set high, the related data are initialized, and the finish signal is transmitted so that the next stage begins to execute.
State 5 (state=4), state 6 (state=5), state 7 (state=6), and state 8 (state=7) are mirror-symmetric to states 1 to 4, and their flows are identical.
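The following Python model (illustrative only; it mirrors the register-shift semantics of the states at a high level, not the RTL itself) scans 14x14 feature data with a 5x5 cache_reg and checks that after every move the cache exactly matches the corresponding window while fetching only 5 new words:

    # High-level model of the serpentine scan for the C2 parameters.
    import numpy as np

    InMap, N = 14, 5
    bram = np.arange(InMap * InMap).reshape(InMap, InMap)  # stand-in feature data

    r = c = 0
    cache = bram[r:r + N, c:c + N].copy()   # state 0: initialization
    down, moves = True, 0

    while True:
        if down and r < InMap - N:                    # state 1: slide down one row
            r += 1
            cache = np.roll(cache, -1, axis=0)        # old top row drops out
            cache[-1] = bram[r + N - 1, c:c + N]      # fetch only the new bottom row
        elif (not down) and r > 0:                    # state 3 mirror: slide up one row
            r -= 1
            cache = np.roll(cache, 1, axis=0)         # old bottom row drops out
            cache[0] = bram[r, c:c + N]               # fetch only the new top row
        elif c < InMap - N:                           # states 2/4: shift right one column
            c += 1
            down = not down
            cache = np.roll(cache, -1, axis=1)        # old leftmost column drops out
            cache[:, -1] = bram[r:r + N, c + N - 1]   # fetch only the new right column
        else:
            break
        assert np.array_equal(cache, bram[r:r + N, c:c + N])
        moves += 1

    print(f"{moves} moves after initialization; each move fetched exactly {N} words")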
The invention is not limited to the embodiments described above. The above description of specific embodiments is intended to describe and illustrate the technical solution of the invention and is illustrative rather than limiting. Those skilled in the art can make numerous specific modifications without departing from the spirit of the invention and the scope of the claims, and all such modifications fall within the protection scope of the invention.

Claims (1)

1. The internal storage bandwidth optimization method of the convolutional neural network accelerator is characterized in that an optimization module for reducing the bandwidth requirement of the accelerator is arranged in a convolutional layer of the neural network, and the optimization module adopts the following steps:
step 1, setting a CACHE of the same size as the number of computing units between the internal storage RAM storing the feature data to be computed and the computing units, wherein each datum in the CACHE is directly connected to a computing unit, and in each computing period the data in the CACHE are first transmitted into the computing units;
step 2, using the memory read-write control unit to read and prepare the data used in the next period: the cache block moves up and down preferentially relative to the memory area; each time, the one row of data in the cache block that differs from the previous period is discarded, the other data shift forward in sequence, and the data newly covered by the cache block are added into the emptied row;
and step 3, when the cache block has moved vertically to the boundary, moving the cache block horizontally by one step, discarding the one column of data in the cache block that differs from the previous period, shifting the other data sideways in sequence, and adding the data newly covered by the cache block into the emptied column; wherein the optimization module adopts the following steps to realize two-dimensional feature data caching:
step 0, initializing the CACHE and preparing the data used in the first computing period: when the process starts, the memory read-write control unit transmits to the internal storage RAM the address vector of the data that should be stored in the first row of the CACHE; in the next period it submits the address vector of the second row and puts the received data vector of the first row into the first row of the CACHE; this repeats until the CACHE is full and initialization is flagged complete, at which point the data in the CACHE correspond one-to-one to the upper-left-corner data of the internal storage RAM; wherein: a variable AddrBase maintaining the base address is set, and its first value equals CoreLen minus 1; the last element of each address vector equals the value of AddrBase and the other elements decrement in sequence; afterwards the value of AddrBase is changed to Addr0, calculated as shown in formula (1), wherein InMapCol is a hyperparameter, the number of columns of the two-dimensional feature data;
Addr0 = AddrBase + InMapCol (1)
step 1: after initialization is completed, the computing unit starts to work and completes the first operation using the data in the CACHE; the memory read-write control unit continues to compute and transmit addresses; the received data vector is placed in the last row of the CACHE, the original last row moves up to the second-to-last row, the other data move up by the same rule, and the original first row of data is discarded; meanwhile Counter1 starts working and increments each time an address vector is transmitted, until it is not smaller than InMapRow minus CoreLen;
step 2: when step 1 ends, the sliding of the CACHE relative to the two-dimensional feature data has reached the bottom of the column, and the CACHE now moves one column to the right; the calculation of the address vector changes: the value of AddrBase is incremented by 1 and used as the last element of the address vector, and each earlier element decreases by InMapCol in sequence; the data vector obtained with this address vector is put into the CACHE in a new way: each row of the CACHE moves forward by one element position, the head element of each row is discarded, the first element of the new data vector is filled into the last element position of the first row of the CACHE, the second element of the data vector into the last element position of the second row, and so on; afterwards the value of AddrBase is corrected to Addr1, calculated as shown in formula (2); the variable Counter2 increments, and step 3 is entered:
Addr1 = AddrBase - CoreLen × InMapCol (2)
step 3: this step is similar to step 1, except that after each address-vector calculation the value of the variable AddrBase is changed to Addr2, as shown in formula (3); the received data vector is placed in the first row of the CACHE, the original first row moves down to become the new second row, the other rows move down in sequence, and the data of the last row are discarded; meanwhile Counter1 decrements by 1 each time until it is not greater than 1, and then step 4 is entered;
Addr2 = AddrBase - InMapCol (3)
step 4: this step is similar to step 2; the calculation of the address vector changes: the value of AddrBase plus 1 is used as the last element of the address vector, and each earlier element increases by InMapCol in sequence; each row of the CACHE then moves forward by one element position and the head element of each row is discarded; the last element of the new data vector is filled into the last element position of the first row of the CACHE, the second-to-last element into the last element position of the second row, and so on; after filling is completed, Counter2 is checked: if Counter2 is smaller than Counter, Counter2 increments by 1, the value of AddrBase is corrected to Addr3, calculated as shown in formula (5), and step 1 is entered; otherwise the feature data have been scanned once, and it must be judged whether other convolution kernels still need to be convolved with these feature data; the ScanIndex variable records how many times a group of two-dimensional feature data has been scanned; if ScanIndex is not smaller than CoreGroup, the two-dimensional feature data are exhausted and the related variables are initialized, otherwise the value of AddrBase is corrected to Addr4, calculated as shown in formula (6), and step 5 is entered:
Counter = InMapCol - CoreLen (4)
Addr3 = AddrBase + CoreLen × InMapCol (5)
Addr4 = AddrBase - CoreLen + CoreLen × InMapCol (6)
step 5: to omit a re-initialization of the CACHE, the CACHE can move back in reverse along its sliding path; since one convolution kernel corresponds to one group of two-dimensional feature data, the starting position of the sliding may be anywhere; at this point the CACHE corresponds one-to-one to the upper-right-corner data of the two-dimensional feature data; this step is generally similar to step 1, except that the first element of the address vector equals the value of the variable AddrBase and the other elements increment in sequence; Counter1 behaves under the same form and condition as in step 1, and when it is not smaller than Counter′, step 6 is entered, with Counter′ calculated as shown in formula (7):
Counter′ = InMapRow - CoreLen (7)
step 6: this step is similar to step 2, except that the value of AddrBase decrements by 1 and is used as the first element of the address vector, and each later element decreases by InMapCol in sequence; each row of the CACHE moves backward by one element position and the tail element of each row is discarded; the first element of the new data vector is filled into the first element position of the last row of the CACHE, the second element into the first element position of the second-to-last row, and so on; afterwards the value of AddrBase is corrected to Addr1; Counter2 decrements by 1, and step 7 is entered;
step 7: this step is basically the same as step 3, except that the address vector is calculated as in step 5: the first element of the address vector equals the value of the variable AddrBase and the other elements increment in sequence; when Counter1 is not greater than 1, step 8 is entered;
step 8: this step is similar to step 4; the value of AddrBase decrements by 1 and is used as the first element of the address vector, and each later element decreases by InMapCol in sequence; each row of the CACHE moves backward by one element position, the tail element of each row is discarded, the first element of the new data vector is filled into the first element position of the first row of the CACHE, and the second element into the first element position of the second row; after filling, Counter2 is checked: if Counter2 is greater than 1, Counter2 decrements by 1, the value of AddrBase is corrected to Addr3, and step 5 is entered; otherwise the feature data have been scanned once; if ScanIndex is not smaller than CoreGroup, the two-dimensional feature data are exhausted and the related variables are initialized, otherwise the value of the variable AddrBase is corrected to Addr5, calculated as shown in formula (8), and step 1 is entered;
Addr5 = AddrBase + CoreLen + CoreLen × InMapCol (8).
CN202011102647.7A 2020-10-15 2020-10-15 Internal storage bandwidth optimization method of convolutional neural network accelerator Active CN112070217B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011102647.7A CN112070217B (en) 2020-10-15 2020-10-15 Internal storage bandwidth optimization method of convolutional neural network accelerator


Publications (2)

Publication Number Publication Date
CN112070217A CN112070217A (en) 2020-12-11
CN112070217B 2023-06-06

Family

ID=73655691

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011102647.7A Active CN112070217B (en) 2020-10-15 2020-10-15 Internal storage bandwidth optimization method of convolutional neural network accelerator

Country Status (1)

Country Link
CN (1) CN112070217B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113435570B (en) * 2021-05-07 2024-05-31 西安电子科技大学 Programmable convolutional neural network processor, method, device, medium and terminal

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108171317A (en) * 2017-11-27 2018-06-15 北京时代民芯科技有限公司 A kind of data-reusing convolutional neural networks accelerator based on SOC
CN109104197A (en) * 2018-11-12 2018-12-28 合肥工业大学 The coding and decoding circuit and its coding and decoding method of non-reduced sparse data applied to convolutional neural networks
WO2019127838A1 (en) * 2017-12-29 2019-07-04 国民技术股份有限公司 Method and apparatus for realizing convolutional neural network, terminal, and storage medium
CN110705702A (en) * 2019-09-29 2020-01-17 东南大学 Dynamic extensible convolutional neural network accelerator
CN111340198A (en) * 2020-03-26 2020-06-26 上海大学 Neural network accelerator with highly-multiplexed data based on FPGA (field programmable Gate array)


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Optimization method of an FPGA convolutional neural network accelerator based on improved dynamic configuration; Chen Peng et al.; High Technology Letters; Vol. 30, No. 3; full text *

Also Published As

Publication number Publication date
CN112070217A (en) 2020-12-11


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant