CN111582451B - Image recognition interlayer parallel pipeline type binary convolution neural network array architecture


Info

Publication number
CN111582451B
Authority
CN
China
Prior art keywords
layer
calculation
convolution
convolutional
layers
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010383601.0A
Other languages
Chinese (zh)
Other versions
CN111582451A (en)
Inventor
陈松
刘百成
康一
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202010383601.0A
Publication of CN111582451A
Application granted
Publication of CN111582451B
Current legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses an interlayer-parallel, pipelined, binarized convolutional neural network array architecture for image recognition, comprising five calculation layers, M1, M2, M3, M4 and M5, arranged in sequence and forming an interlayer pipeline, wherein: the M1, M2 and M3 layers each contain the calculation of two convolutional layers, each forming a two-stage internal pipeline, with a maximum pooling layer at the end of each layer to complete the pooling calculation; the M4 and M5 layers contain one and two fully connected layer calculations respectively; and each convolutional layer and each fully connected layer is provided with a control unit connected to the global controller and a memory storing the weight parameters and the binary coding parameters. The architecture can improve the parallelism of image recognition calculation, reduce the weight storage requirement, effectively avoid multiplication, reduce power consumption and improve energy efficiency.

Description

Image recognition interlayer parallel pipeline type binary convolution neural network array architecture
Technical Field
The invention relates to the field of binarized convolutional neural networks, in particular to an interlayer-parallel pipelined binarized convolutional neural network array architecture for image recognition.
Background
Biology holds that the neurons and synapses of an organism's brain form a network that generates awareness and drives thinking and action. Inspired by this, researchers of artificial neural networks abstracted a mathematical model: neurons of the human brain are modeled from the viewpoint of information processing, simple mathematical units are established, and networks are formed according to different connection patterns. Artificial neural networks are now widely applied, with uses in fields such as speech recognition, image recognition and object detection. In the course of this research, the concept of the convolutional neural network was proposed: an artificial neural network with a deep structure, consisting of a feedforward pass and a negative-feedback (backward) pass, where only the feedforward computation is performed during recognition and the backward computation is additionally required during training. The study of convolutional neural networks was inspired by research on visual cells: neurons in the primary visual cortex respond to simple features in the visual environment; the visual cortex contains simple cells and complex cells, where simple cells respond strongly to specific spatial positions and preferred orientations, and spatial invariance can be achieved by pooling the inputs of simple cells. It follows that in convolutional neural networks the basic computations are convolution and pooling. Convolution uses a convolution kernel of a specific size to extract features in a specific region, mainly through multiply-accumulate operations. Pooling is a down-sampling process: down-sampling removes unimportant feature elements, reduces the scale of the feature map and the number of calculation parameters, while retaining the important features of the feature map so that subsequent computation is not affected.
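As a concrete illustration of these two basic operations, the following minimal Python sketch (illustrative only, not part of the patent) performs a plain multiply-accumulate convolution followed by 2 × 2 maximum pooling on a toy feature map:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Plain multiply-accumulate convolution (no padding, stride 1)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2x2(fmap):
    """2x2 max pooling: keep the largest value in each 2x2 region."""
    h, w = fmap.shape[0] // 2 * 2, fmap.shape[1] // 2 * 2
    return fmap[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

image = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 input
kernel = np.ones((3, 3))                            # toy 3x3 kernel
fmap = conv2d_valid(image, kernel)                  # 2x2 feature map
pooled = max_pool2x2(fmap)                          # 1x1 after pooling
```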
As research has advanced, the scale of convolutional neural networks has grown, so they need more storage resources and consume ever more computing resources. Reducing the storage and computation requirements of convolutional neural networks has therefore become a research hot spot. The mainstream methods at present include pruning, singular value decomposition, quantization and spiking neural networks. Pruning finds relatively unimportant connections between adjacent layers during training and resets their weights to 0, which is equivalent to cutting those connections, thereby reducing the number of weight parameters that must be stored and computed. Singular value decomposition is generally applied to fully connected layers: the multiplication of two large matrices can be converted into three smaller matrix multiplications, reducing both the storage and the computation requirement. Quantized neural networks use fewer bits to represent the original floating-point values, generally 11 bits, 8 bits, 5 bits, 3 bits, 2 bits or 1 bit; a network that uses 1 bit and completes its calculation with the two states +1 and -1 is called a binarized convolutional neural network. Spiking neural networks are closer to the working mode of biological neural networks: if the membrane potential of a presynaptic neuron exceeds a preset voltage threshold during calculation, a pulse is emitted backwards; otherwise the corresponding postsynaptic neuron stays idle because it receives no input pulse. In hardware acceleration, no pulse means no dynamic power consumption, only static power consumption, so the total power consumption can be reduced.
To achieve real-time image processing, researchers generally design accelerators on GPUs, FPGAs and ASICs. However, limited by the large storage and computation requirements of convolutional neural networks, image recognition consumes many resources: much hardware cannot meet the storage requirement, the computation parallelism is low, and high energy efficiency cannot be achieved. It is therefore very important to design an interlayer-parallel pipelined array architecture for image recognition based on the binarized convolutional neural network.
Disclosure of Invention
The invention aims to provide an interlayer-parallel pipelined binarized convolutional neural network array architecture for image recognition, which can improve the parallelism of image recognition calculation, reduce the weight storage requirement, effectively avoid multiplication, reduce power consumption and improve energy efficiency.
The purpose of the invention is realized by the following technical scheme:
an interlayer-parallel pipelined binarized convolutional neural network array architecture for image recognition comprises five calculation layers, M1, M2, M3, M4 and M5, arranged in sequence and forming an interlayer pipeline, wherein:
the M1, M2 and M3 layers each contain the calculation of two convolutional layers, each forming a two-stage internal pipeline, with a maximum pooling layer at the end of each layer to complete the pooling calculation; the M4 and M5 layers contain one and two fully connected layer calculations respectively; and each convolutional layer and each fully connected layer is provided with a control unit connected to the global controller and a memory storing the weight parameters and the binary coding parameters.
According to the technical scheme provided by the invention, hardware-accelerated computation of the binarized convolutional neural network for image recognition can reduce the hardware storage requirement, avoid multiplication, reduce energy consumption and improve parallelism, thereby improving recognition speed and energy efficiency.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an image recognition interlayer parallel pipelined binarization convolutional neural network array architecture according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an interlayer parallel pipeline calculation according to an embodiment of the present invention;
fig. 3 is a schematic diagram of the first type of C structure (convolution calculation part) of a PE unit according to an embodiment of the present invention;
fig. 4 is a schematic diagram of the second type of C structure (convolution calculation part) of a PE unit according to an embodiment of the present invention;
FIG. 5 is a schematic diagram illustrating the conversion of binarized multiply-accumulate calculation into XNOR-accumulate calculation according to an embodiment of the present invention;
fig. 6 is a schematic diagram of a PE unit for a convolution kernel of 3 × 3 size according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings. It is obvious that the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The embodiment of the invention provides an interlayer-parallel pipelined binarized convolutional neural network array architecture for image recognition, which mainly comprises five calculation layers, M1, M2, M3, M4 and M5, arranged in sequence and forming an interlayer pipeline, wherein:
the M1, M2 and M3 layers each contain the calculation of two convolutional layers, each forming a two-stage internal pipeline, with a maximum pooling layer at the end of each layer to complete the pooling calculation; the M4 and M5 layers contain one and two fully connected layer calculations respectively; and each convolutional layer and each fully connected layer is provided with a control unit connected to the global controller and a memory storing the weight parameters and the binary coding parameters.
As shown in fig. 1, the M1 layer contains the two convolutional layers C1 and C2 and forms a two-stage pipeline; the M2 layer contains the two convolutional layers C3 and C4 and forms a two-stage pipeline; the M3 layer contains the two convolutional layers C5 and C6 and forms a two-stage pipeline; the M4 layer contains the fully connected layer F1; and the M5 layer contains the fully connected layers F2 and F3.
As shown in fig. 1, the whole binarized convolutional neural network array architecture is divided into 9 blocks, corresponding to the 6 convolutional layers (C1-C6) and the 3 fully connected layers (F1-F3). Global Control is the global controller used to implement global control. In fig. 1, Images in convolutional layer C1 stores the pictures to be recognized, Weights && K_H stores the weight parameters and the K_H parameters required for binary coding, Control is the control unit of convolutional layer C1 (it receives the control signals issued by the global controller), and the ping-pong buffer unit contains two identical storage units and stores the array calculation results of convolutional layer C1. Convolutional layer C1 uses 24 PEs (processing units) to complete its calculation; the inputs of the PEs are connected to the control unit and to the memory storing the weight parameters and binary coding parameters. Convolutional layer C2 likewise contains control signals, weights and binary coding parameters; its computing array uses 256 PEs, its input is fed by the ping-pong buffer of convolutional layer C1, and its output is stored in the ping-pong buffer of convolutional layer C2. Convolutional layer C3 contains control signals, weights and binary coding parameters; its array uses 256 PEs and its input is fed by the ping-pong buffer of convolutional layer C2. Convolutional layer C4 contains control signals, weights and binary coding parameters; its array uses 512 PEs, its input is fed by the ping-pong buffer of convolutional layer C3, and its output is stored in the ping-pong buffer of convolutional layer C4. Convolutional layer C5 contains control signals, weights and binary coding parameters; its array uses 256 PEs, its input is fed by the ping-pong buffer of convolutional layer C4, and its output is stored in the ping-pong buffer of convolutional layer C5. Convolutional layer C6 contains control signals, weights and binary coding parameters; its array uses 512 PEs, its input is fed by the ping-pong buffer of convolutional layer C5, and its output is stored in the ping-pong buffer of convolutional layer C6. Fully connected layer F1 contains control signals, weight parameters and binary coding parameters; its array calculation does not use convolution PEs but instead completes the operation with 512 XNOR calculation units and an adder tree; its input is fed by the ping-pong buffer of convolutional layer C6 and its output is stored in the ping-pong buffer of fully connected layer F1. Fully connected layer F2 contains control signals, weight parameters and binary coding parameters; it uses 86 XNOR calculation units and an adder tree to complete the calculation; its input comes from the ping-pong buffer of fully connected layer F1, and its output is passed directly to fully connected layer F3 for calculation, without being stored in a ping-pong buffer. Fully connected layer F3 contains control parameters, weight parameters and binary coding parameters; its input is the array calculation result of fully connected layer F2, and its calculations are completed and accumulated one by one.
In the above description and in the structure shown in fig. 1, the number of PEs in each convolutional layer is given by way of example and not limitation; in practical applications, users can adjust it to match actual conditions.
The binarized convolutional neural network array architecture provided by the embodiment of the invention forms a five-stage pipeline. The first convolutional layer C1 in the M1 layer starts computing first; once part of its results are ready, the first convolutional layer C1 and the second convolutional layer C2 in the M1 layer compute simultaneously. The M1 layer takes N clock cycles in total: the M1 layer computes during the first N clock cycles, the M1 and M2 layers work simultaneously during the second N clock cycles, and so on, until the M1, M2, M3, M4 and M5 layers all work simultaneously during the fifth N clock cycles, forming a five-stage pipeline.
As shown in fig. 2, there are 9 rows from top to bottom: C1, C2, C3, C4, C5, C6, F1, F2 and F3. n1 represents the time required for C1 to complete its calculation, n2 the time for C2, and n3 the time for the whole M1 layer. Since the calculation result of C1 is needed as the input of C2, C2 starts calculating once C1 has completed part of its calculation, which guarantees the correctness of the result while improving parallelism; C3 and C4 in the M2 layer start calculating in sequence only after C2 has completely finished. C5 and C6 in M3 complete their calculation in the same way. M4 contains only the single layer F1, while M5 contains the two layers F2 and F3; this grouping is chosen because the operating speed of a pipeline structure is limited by its slowest stage.
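The timing described above can be summarized with a small behavioural model. The following Python sketch is an assumption for illustration: the patent does not give per-layer cycle counts here, so a uniform N cycles per macro-layer is assumed, and each macro-layer is treated as atomic. It shows which macro-layers are busy in each block of N clock cycles:

```python
# Hypothetical model of the five-stage interlayer pipeline: image t occupies
# stage s during cycles [(t + s) * N, (t + s + 1) * N). Names are illustrative.
N = 100                      # assumed cycles per macro-layer
stages = ["M1", "M2", "M3", "M4", "M5"]

def busy_stages(cycle, num_images):
    """Return which macro-layers are active at a given clock cycle."""
    active = []
    for s in range(len(stages)):
        t = cycle // N - s   # index of the image occupying stage s
        if 0 <= t < num_images:
            active.append(stages[s])
    return active

# During the fifth block of N cycles all five stages work simultaneously:
assert busy_stages(4 * N, num_images=10) == ["M1", "M2", "M3", "M4", "M5"]
```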
In the embodiment of the invention, the first convolutional layer C1 receives a different type of input data from the subsequent convolutional layers. Specifically: the input of the first convolutional layer C1 is floating-point data, while the input of the second convolutional layer C2 in the M1 layer is the binarized data output by the first convolutional layer C1; similarly, the inputs of the convolutional layers in the M2 and M3 layers are all binarized data. Accordingly, the structure of the convolution calculation part of the first convolutional layer C1 in the M1 layer differs from that of the second convolutional layer C2 in the M1 layer and of the convolutional layers of the M2 and M3 layers.
As shown in fig. 1, the first convolutional layer C1 takes images as input. Some input pictures are RGB color images and cannot be binarized; therefore the first convolutional layer C1 uses the C unit shown in fig. 3, in which a gating unit selects the input data, the data is then stored in a register, and the subsequent accumulation step is completed. This unit eliminates multiplication. The formula is:
$$y = \sum_{i=1}^{k}\sum_{j=1}^{k} in_{i,j} \times w_{i,j}$$
where in is an input value (floating-point data) and w is a weight (binarized data, generally in one of the two states +1 or -1, so that 1 bit can represent it in hardware); y is the result of the convolution for one pixel. Generally one convolution kernel accumulates over several pixels, e.g. 3 × 3, 5 × 5 or 7 × 7; illustratively, a convolution kernel of 3 × 3 size can be used. Since the input picture is floating-point data, the C unit of fig. 3 uses a 16-bit register to store the data, with 1 sign bit, 5 integer bits and 10 fractional bits.
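The behaviour of this C unit can be sketched as follows. This Python model is inferred from the text (gating between +in and -in, then accumulation, with 16-bit storage in a 1-sign, 5-integer, 10-fraction format); it is an illustrative approximation under those assumptions, not the patent's circuit. The bit convention 1 for +1 and 0 for -1 is also an assumption:

```python
SCALE = 1 << 10                      # Q5.10 format: 10 fractional bits

def to_q5_10(x: float) -> int:
    """Quantize a float to the 16-bit register format (saturating)."""
    v = int(round(x * SCALE))
    return max(-(1 << 15), min((1 << 15) - 1, v))

def c_unit(inputs, weight_bits):
    """Accumulate +in or -in according to each 1-bit weight (1 -> +1, 0 -> -1)."""
    acc = 0
    for x, w in zip(inputs, weight_bits):
        q = to_q5_10(x)
        acc += q if w else -q        # the gating unit replaces multiplication
    return acc / SCALE

print(c_unit([0.5, -1.25, 2.0], [1, 0, 1]))   # 0.5 + 1.25 + 2.0 = 3.75
```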
Unlike the structure of fig. 3, the structure of fig. 4 is applied to the calculation of convolutional layers C2 to C6. For the second convolutional layer C2 in the M1 layer and the convolutional layers in the M2 and M3 layers, the input is binarized data, meaning that both the input and the weight are binarized, so an XNOR (exclusive-nor) operation is used. The XNOR operation can be expressed by the following formula:
$$y_{xnor} = \sum_{i=1}^{k}\sum_{j=1}^{k} \left( in_{i,j} \odot w_{i,j} \right)$$
In the above equation, k represents the size of the convolution kernel (for a 3 × 3 convolution, k = 3); i and j represent the position coordinates of the feature-map pixel and of the weight, respectively; in represents the input feature map, w represents the weight, ⊙ is the XNOR operator, and y_xnor is the result of the XNOR operation. Through this formula, multiplication in the binarized convolutional neural network can be replaced by XNOR. Referring to fig. 5, for a convolution of 3 × 3 size, the result of the multiply-accumulate operation in the left diagram is -3, while after changing to XNOR-accumulate the result in the right diagram becomes +3. To keep the calculation consistent in the hardware circuit, the final convolution result must be obtained by converting the result according to the following formula:
$$y' = 2 \times y_{xnor} - L_{conv}$$
where y_xnor is the result of the convolution computed with XNOR; L_conv is the convolution kernel size (assuming 3 × 3 is used, and since one neuron usually corresponds to more than one convolution kernel, L_conv is typically a multiple of 9); y' is the final convolution result. For fully connected layer calculation, L_conv is the size of the weight-matrix column. As described above, the multiply-accumulate result in the left diagram of fig. 5 is -3 and the XNOR-accumulate result in the right diagram is 3; substituting the XNOR-accumulate result into the above equation gives y' = -3, so the convolution result remains correct after converting multiply-accumulate into XNOR-accumulate.
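The equivalence of the two computations can be checked with a short sketch (an independent demonstration on random data, not taken from the patent):

```python
# Verify y' = 2 * y_xnor - L_conv for a 3x3 window of +1/-1 values.
import random

k = 3
L_conv = k * k
inputs  = [random.choice([+1, -1]) for _ in range(L_conv)]
weights = [random.choice([+1, -1]) for _ in range(L_conv)]

# Reference: multiply-accumulate on the +1/-1 values.
y_mac = sum(i * w for i, w in zip(inputs, weights))

# Hardware view: encode +1 -> 1 and -1 -> 0, XNOR, then count ones (popcount).
bits_in = [(i + 1) // 2 for i in inputs]
bits_w  = [(w + 1) // 2 for w in weights]
y_xnor  = sum(1 - (a ^ b) for a, b in zip(bits_in, bits_w))

assert y_mac == 2 * y_xnor - L_conv   # the two computations agree
```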
If the calculation is performed directly according to that formula, one multiplication still remains. However, in the embodiment of the invention, the output of each convolutional layer (C1-C6) in the M1, M2 and M3 layers, of the fully connected layer in the M4 layer, and of the first fully connected layer in the M5 layer all undergo a batch normalization (BN) operation, which is used to accelerate convergence during training. The formula of the batch normalization operation is:
$$Y = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^{2} + \epsilon}} + \beta$$
where μ is the expectation (mean) and σ² is the variance; γ and β are the weight and bias of the batch normalization operation; and ε is a constant (a positive number much less than 1) added to prevent σ² from being equal to 0. x represents the output of the convolutional layer or the fully connected layer.
Combining the multiply-accumulate-to-XNOR-accumulate formula with the BN formula, the following can be derived:
$$Y = k_f \times (X - h_f)$$

$$k_f = \frac{2\gamma}{\sqrt{\sigma^{2} + \epsilon}}$$

$$h_f = \frac{\mu + L_{conv} - bias}{2} - \frac{\beta\sqrt{\sigma^{2} + \epsilon}}{2\gamma}$$
In the above equations, X is the result of the convolution calculated without bias, bias is the bias, and k_f and h_f are two groups of calculation parameters derived by combining the convolution calculation with the BN calculation formula; both are floating-point numbers.
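A small numeric sketch can confirm the folding. The parameter values below are made up for illustration, and X is taken as the XNOR-accumulate result y_xnor; the formulas follow the derivation above:

```python
import math

gamma, beta, mu, sigma2, eps = 1.5, 0.2, 4.0, 2.25, 1e-5
L_conv, bias = 9, 1.0

# Offline-computable folded parameters.
k_f = 2 * gamma / math.sqrt(sigma2 + eps)
h_f = (mu + L_conv - bias) / 2 - beta * math.sqrt(sigma2 + eps) / (2 * gamma)

X = 7                                    # XNOR-accumulate result (y_xnor)
conv = 2 * X - L_conv + bias             # biased convolution output
bn = gamma * (conv - mu) / math.sqrt(sigma2 + eps) + beta
assert abs(bn - k_f * (X - h_f)) < 1e-9  # folded form matches explicit BN
```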
The convolutional layers C1-C6 and the fully connected layers F1-F2 all produce binarized outputs, so binary coding is required. Combining the above formula with the activation function, the binary coding is expressed as:
$$sign(x) = \begin{cases} +1, & x \geq 0 \\ -1, & x < 0 \end{cases}$$
In the above formula, sign(x) is the sign function, and x refers to the input, i.e. the Y produced by batch normalization.
In conjunction with the activation function, the derived function can be simplified to:
$$Y = k_i \odot (x \geq h_f)$$
In the above formula, k_i represents the sign of k_f encoded as one bit: 1 when k_f is positive and 0 when it is negative. h_f can be processed offline before hardware acceleration, imported into the accelerator, and then input directly into the hardware circuit to participate in the calculation. In this way the binary coding can be completed while multiplication is avoided, simplifying the calculation process.
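The one-bit encoding can be sketched as follows. Names follow the text; the output convention (bit 1 encodes +1, bit 0 encodes -1) is an assumption consistent with the XNOR coding used earlier:

```python
def encode(x: float, k_f: float, h_f: float) -> int:
    """Binary encoding Y = k_i XNOR (x >= h_f)."""
    k_i = 1 if k_f >= 0 else 0        # sign of k_f as one bit
    cmp = 1 if x >= h_f else 0        # threshold comparison
    return 1 - (k_i ^ cmp)            # XNOR: 1 encodes +1, 0 encodes -1

# sign(k_f * (x - h_f)) is +1 exactly when k_i and (x >= h_f) agree:
assert encode(6.0, k_f=2.0, h_f=5.9) == 1    # positive activation -> bit 1
assert encode(5.0, k_f=2.0, h_f=5.9) == 0    # negative activation -> bit 0
```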
As shown in fig. 1, the convolutional layers C1 to C6 each contain a number of PE units (processing units), and all processing units work in parallel. Fig. 6 illustrates an exemplary structure of a PE unit for a 3 × 3 convolution, which mainly consists of three parts. The first part is an input buffer. The second part is a set of convolution calculation units, here the 9 C units of fig. 3; that is, fig. 6 shows the PE unit of convolutional layer C1, and for the other convolutional layers C2-C6 the C unit is replaced by the XNOR unit shown in fig. 4. The third part is an adder tree that accumulates the results output by the second part. Because the convolution kernel in this example is of 3 × 3 size, the input buffer must buffer 3 lines; the weight registers of the XNOR calculation part are loaded by broadcast before the calculation starts, and the adder tree begins working and outputs the result after the XNOR part completes its calculation.
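The dataflow of such a PE unit can be modelled roughly as below. This is a behavioural sketch under assumptions (bit-level XNOR units, a 3-row buffer, an adder tree modelled as a plain sum); it is not the patent's circuit and all names are illustrative:

```python
class PE3x3:
    """Dataflow model: 3-row input buffer -> 9 XNOR units -> adder tree."""

    def __init__(self, weight_bits):
        assert len(weight_bits) == 9
        self.w = list(weight_bits)   # loaded once by broadcast before compute
        self.rows = []               # buffers the 3 most recent input rows

    def push_row(self, row_bits):
        """Shift a new feature-map row into the line buffer."""
        self.rows.append(list(row_bits))
        if len(self.rows) > 3:
            self.rows.pop(0)

    def compute(self, col):
        """XNOR-accumulate over the 3x3 window whose right edge is `col`."""
        win = [self.rows[r][col - 2 + c] for r in range(3) for c in range(3)]
        return sum(1 - (a ^ b) for a, b in zip(win, self.w))

pe = PE3x3([1, 0, 1, 0, 1, 0, 1, 0, 1])
for row in ([1, 1, 0, 1], [0, 1, 1, 0], [1, 0, 1, 1]):
    pe.push_row(row)
print(pe.compute(2), pe.compute(3))   # y_xnor for the two 3x3 windows
```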
To better illustrate the calculation process of the embodiment, the specific structure of the neural network used in the embodiment is given in Table 1 below:
Layer | Input      | Padding | Convolution kernel | Output     | Weight size | Output size
------|------------|---------|--------------------|------------|-------------|------------
C1    | 3×32×32    | 1 (-1)  | 64×3×3×3           | 64×32×32   | 1728 b      | 64 kb
C2    | 64×32×32   | 1 (-1)  | 64×64×3×3          | 64×32×32   | 36 kb       | 64 kb
MP1   | 64×32×32   | -       | -                  | 64×16×16   | -           | 16 kb
C3    | 64×16×16   | 1 (-1)  | 128×64×3×3         | 128×16×16  | 72 kb       | 32 kb
C4    | 128×16×16  | 1 (-1)  | 128×128×3×3        | 128×16×16  | 144 kb      | 32 kb
MP2   | 128×16×16  | -       | -                  | 128×8×8    | -           | 8 kb
C5    | 128×8×8    | 1 (-1)  | 256×128×3×3        | 256×8×8    | 288 kb      | 16 kb
C6    | 256×8×8    | 1 (-1)  | 256×256×3×3        | 256×8×8    | 576 kb      | 16 kb
MP3   | 256×8×8    | -       | -                  | 256×4×4    | -           | 4 kb
F1    | 4096       | -       | 4096               | 1024       | 4 Mb        | 1 kb
F2    | 1024       | -       | 1024               | 1024       | 1 Mb        | 1 kb
F3    | 1024       | -       | 1024               | 10         | 10 kb       | 10 b

Table 1: Neural network structure
In the above table, C1-C6 are the 6 convolutional layers, F1-F3 are the 3 fully connected layers, and MP1-MP3 are the 3 pooling layers; the pooling layers keep the maximum value in each 2 × 2 region, i.e. maximum pooling is adopted. In particular, when accelerating with this hardware circuit architecture, the padding value is changed from 0 to -1 during software training: 0 is usually used for padding in ordinary training, but modifying it to -1 avoids the ternary states +1, 0 and -1 when XNOR replaces multiplication in the hardware circuit. In the "convolution kernel" column, the first value is the number of convolution kernels, the second value is the number of input channels, and the last two values are the height and width of the kernel.
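The effect of the -1 padding can be illustrated briefly (a toy sketch, not from the patent): padding with 0 would introduce a third value that the 1-bit XNOR coding cannot represent, while a -1 border keeps the feature map strictly binary:

```python
import numpy as np

fmap = np.random.choice([-1, 1], size=(4, 4))     # binary feature map
padded = np.pad(fmap, 1, constant_values=-1)      # -1 border, width 1
assert set(np.unique(padded)) <= {-1, 1}          # still strictly two-valued
```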
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (7)

1. An image recognition interlayer-parallel pipelined binarized convolutional neural network array architecture, characterized by comprising five calculation layers, M1, M2, M3, M4 and M5, arranged in sequence and forming an interlayer pipeline, wherein:
the M1, M2 and M3 layers each contain the calculation of two convolutional layers, each forming a two-stage internal pipeline, with a maximum pooling layer at the end of each layer to complete the pooling calculation; the M4 and M5 layers contain one and two fully connected layer calculations respectively; and each convolutional layer and each fully connected layer is provided with a control unit connected to a global controller and a memory storing the weight parameters and binary coding parameters;
wherein the first convolutional layer C1 in the M1 layer is calculated first, and when part of its results are ready, the first convolutional layer C1 and the second convolutional layer C2 in the M1 layer are calculated simultaneously; the calculation of the M1 layer needs N clock cycles in total: the M1 layer calculates during the first N clock cycles, the M1 and M2 layers work simultaneously during the second N clock cycles, and so on, until the M1, M2, M3, M4 and M5 layers all work simultaneously during the fifth N clock cycles, forming a five-stage pipeline.
2. The array architecture of claim 1, wherein a plurality of processing units are disposed in each convolutional layer of the M1, M2 and M3 layers, and the input of each processing unit is connected to a control unit and to a memory storing the weight parameters and binary coding parameters; each processing unit comprises three parts: the first part is an input buffer, the second part is a set of convolution calculation units, and the third part is an adder tree for accumulating the results output by the second part.
3. The array architecture of claim 1 or 2, wherein the input of the first convolutional layer C1 in the M1 layer is floating-point data, and the input of the second convolutional layer C2 in the M1 layer is the binarized data output by the first convolutional layer C1; similarly, the inputs of the convolutional layers in the M2 and M3 layers are all binarized data;
and wherein the structure of the convolution calculation part of the processing unit in the first convolutional layer C1 of the M1 layer differs from that in the second convolutional layer C2 of the M1 layer and in the convolutional layers of the M2 and M3 layers.
4. The image recognition interlayer parallel pipelined binarization convolutional neural network array architecture of claim 3,
wherein the first convolutional layer C1 selects input data through a gating unit, stores the data in a register, and then completes the subsequent accumulation step, according to the formula:

$$y = \sum_{i=1}^{k}\sum_{j=1}^{k} in_{i,j} \times w_{i,j}$$

where in is an input value (floating-point data), w is a weight (binarized data), and y is the result of the convolution for one pixel.
5. The array architecture of claim 4, wherein the inputs of the second convolutional layer C2 in the M1 layer and of the convolutional layers in the M2 and M3 layers are binarized data, meaning that both the input and the weight are binarized; the convolution is computed with an XNOR-accumulate operation, and the final convolution result is obtained by converting the result with the following formula:

$$y' = 2 \times y_{xnor} - L_{conv}$$

where y_xnor is the result of the XNOR-accumulate operation, L_conv is the convolution kernel size, and y' is the final convolution result.
6. The array architecture of claim 1, wherein ping-pong buffer units are disposed at the ends of the convolutional layers of the M1 layer, the M2 layer and the M3 layer and the fully connected layer of the M4 layer for storing the computation results of the corresponding layers.
7. The array architecture of claim 1, wherein the outputs of each convolutional layer of the M1, M2 and M3 layers, of the fully connected layer of the M4 layer, and of the first fully connected layer of the M5 layer are batch normalized and binarized according to the following formula:

$$Y = k_i \odot (x \geq h_f)$$

where k_i represents the sign of k_f as one bit (1 when k_f is positive, 0 when it is negative), k_f is a floating-point number, and ⊙ is the XNOR operator.
CN202010383601.0A 2020-05-08 2020-05-08 Image recognition interlayer parallel pipeline type binary convolution neural network array architecture Active CN111582451B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010383601.0A CN111582451B (en) 2020-05-08 2020-05-08 Image recognition interlayer parallel pipeline type binary convolution neural network array architecture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010383601.0A CN111582451B (en) 2020-05-08 2020-05-08 Image recognition interlayer parallel pipeline type binary convolution neural network array architecture

Publications (2)

Publication Number Publication Date
CN111582451A (en) 2020-08-25
CN111582451B (en) 2022-09-06

Family

ID=72125412

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010383601.0A Active CN111582451B (en) 2020-05-08 2020-05-08 Image recognition interlayer parallel pipeline type binary convolution neural network array architecture

Country Status (1)

Country Link
CN (1) CN111582451B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112396176B (en) * 2020-11-11 2022-05-20 华中科技大学 Hardware neural network batch normalization system
WO2022160310A1 (en) * 2021-01-30 2022-08-04 华为技术有限公司 Data processing method and processor
CN113254206B (en) * 2021-05-25 2021-09-28 北京一流科技有限公司 Data processing system and method thereof
CN113344179B (en) * 2021-05-31 2022-06-14 哈尔滨理工大学 IP core of binary convolution neural network algorithm based on FPGA
CN113688983A (en) * 2021-08-09 2021-11-23 上海新氦类脑智能科技有限公司 Convolution operation implementation method, circuit and terminal for reducing weight storage in impulse neural network
CN115185588A (en) * 2022-06-15 2022-10-14 奥比中光科技集团股份有限公司 Method and device for solving pipeline calculation conflict

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6148101A (en) * 1995-11-27 2000-11-14 Canon Kabushiki Kaisha Digital image processor
CN106909970A (en) * 2017-01-12 2017-06-30 南京大学 A kind of two-value weight convolutional neural networks hardware accelerator computing module based on approximate calculation
CN107066239A (en) * 2017-03-01 2017-08-18 智擎信息系统(上海)有限公司 A kind of hardware configuration for realizing convolutional neural networks forward calculation
CN108647773A (en) * 2018-04-20 2018-10-12 复旦大学 A kind of hardwired interconnections framework of restructural convolutional neural networks
CN108665063A (en) * 2018-05-18 2018-10-16 南京大学 Two-way simultaneous for BNN hardware accelerators handles convolution acceleration system
CN109784489A (en) * 2019-01-16 2019-05-21 北京大学软件与微电子学院 Convolutional neural networks IP kernel based on FPGA
CN110780923A (en) * 2019-10-31 2020-02-11 合肥工业大学 Hardware accelerator applied to binary convolution neural network and data processing method thereof
CN110782022A (en) * 2019-10-31 2020-02-11 福州大学 Method for implementing small neural network for programmable logic device mobile terminal
CN111008691A (en) * 2019-11-06 2020-04-14 北京中科胜芯科技有限公司 Convolutional neural network accelerator architecture with weight and activation value both binarized

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Hierarchical Convolutional Neural Network for Malware Classification; Daniel Gibert et al.; 2019 International Joint Conference on Neural Networks (IJCNN); 2019-09-30; pp. 1-8 *
FPGA Parallel Structure Design of the Convolutional Neural Network (CNN) Algorithm; Wang Wei et al.; Microelectronics & Computer; April 2014; pp. 57-62, 66 *

Also Published As

Publication number Publication date
CN111582451A (en) 2020-08-25

Similar Documents

Publication Publication Date Title
CN111582451B (en) Image recognition interlayer parallel pipeline type binary convolution neural network array architecture
You et al. Gate decorator: Global filter pruning method for accelerating deep convolutional neural networks
Liang et al. FP-BNN: Binarized neural network on FPGA
CN108427990B (en) Neural network computing system and method
CN107832082B (en) Device and method for executing artificial neural network forward operation
EP3407266B1 (en) Artificial neural network calculating device and method for sparse connection
CN110163359B (en) Computing device and method
Cai et al. Low bit-width convolutional neural network on RRAM
CN110383300A (en) A kind of computing device and method
US11983616B2 (en) Methods and apparatus for constructing digital circuits for performing matrix operations
CN113010213B (en) Simplified instruction set storage and calculation integrated neural network coprocessor based on resistance change memristor
KR20190089685A (en) Method and apparatus for processing data
Geng et al. CQNN: a CGRA-based QNN framework
Zhang et al. A practical highly paralleled ReRAM-based DNN accelerator by reusing weight pattern repetitions
CN111275167A (en) High-energy-efficiency pulse array framework for binary convolutional neural network
CN111178492B (en) Computing device, related product and computing method for executing artificial neural network model
Adel et al. Accelerating deep neural networks using FPGA
CN112836793A (en) Floating point separable convolution calculation accelerating device, system and image processing method
CN110765413A (en) Matrix summation structure and neural network computing platform
Kong et al. A high efficient architecture for convolution neural network accelerator
US20240028869A1 (en) Reconfigurable processing elements for artificial intelligence accelerators and methods for operating the same
US20240013052A1 (en) Bit Sparse Neural Network Optimization
CN112749799B (en) Hardware accelerator, acceleration method and image classification method of full-frequency-domain convolutional neural network based on self-adaptive ReLU
US20230222315A1 (en) Systems and methods for energy-efficient data processing
CN117421703A (en) Depth sign regression accelerator and depth sign regression method

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant