CN112862079B - Design method of a pipelined convolution computing architecture and residual network acceleration system - Google Patents

Design method of a pipelined convolution computing architecture and residual network acceleration system

Info

Publication number
CN112862079B
Authority
CN
China
Prior art keywords
convolution
output
processing array
convolution processing
buffer area
Prior art date
Legal status
Active
Application number
CN202110262425.XA
Other languages
Chinese (zh)
Other versions
CN112862079A (en)
Inventor
黄以华
黄俊源
陈志炜
Current Assignee
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date
Filing date
Publication date
Application filed by Sun Yat Sen University
Priority to CN202110262425.XA
Publication of CN112862079A
Application granted
Publication of CN112862079B
Legal status: Active

Classifications

    • G06N 3/045: Computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology; combinations of networks
    • G06N 3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N 3/08: Learning methods
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a design method for a pipelined convolution computing architecture and a residual network acceleration system. The method divides the hardware acceleration architecture into an on-chip buffer, convolution processing arrays, and a point-by-point addition module. The main path of the hardware acceleration architecture consists of three serially arranged convolution processing arrays, with two pipeline buffers inserted between them to realize inter-layer pipelining of the three main-path convolution layers. A fourth convolution processing array processes a convolution layer with a kernel size of 1×1 in parallel; by configuring a register inside the fourth convolution processing array, its working mode can be changed so that it can also compute the head convolution layer or the fully connected layer of the residual network, and when the shortcut branch of the residual building block contains no convolution, the fourth convolution processing array is skipped and performs no convolution. A point-by-point addition module adds, element by element, the corresponding output feature pixels of the main-path output features and the shortcut-branch output features of the residual block.

Description

Design method of a pipelined convolution computing architecture and residual network acceleration system
Technical Field
The invention relates to the field of computer vision processing methods, and in particular to a pipelined convolution computing architecture design method and a residual network acceleration system.
Background
Convolutional Neural Networks (CNNs) are widely used in a variety of computer vision scenarios and exhibit superior performance. However, because of their complex, intensive computational requirements and huge storage requirements, deploying and accelerating convolutional neural networks on power-sensitive mobile devices and embedded platforms with real-time requirements remains a challenge.
In a convolutional neural network, the convolution layers account for more than 90% of the network's total computation time, so accelerating the convolution operations is the most important part of accelerating the network. An accelerator design should therefore make full use of the parallelism available within and between the convolution layers of the network, while customizing the convolution operation modules to the characteristics of the network model.
The Field Programmable Gate Array (FPGA) is a semi-custom, programmable logic device. With the continuous advance of semiconductor technology, mainstream FPGAs now contain abundant logic, storage, and routing resources and have the advantage of low power consumption, giving researchers enough design space to build dedicated convolutional neural network acceleration hardware that fully exploits the parallelism of CNN computation to accelerate the operation process.
The residual network is a convolutional neural network model that has attracted wide attention in the computer vision field in recent years. Unlike the simple layer-by-layer stacking of traditional convolutional neural networks, the residual network uses branch shortcut connections to construct residual building blocks, which effectively alleviates the degradation of training and testing accuracy as the network deepens, so network performance can be improved more easily by stacking more layers. However, relatively few studies have deployed residual networks on FPGAs. Because a residual network has more layers, the layers differ in size, and shortcut connections accumulate feature maps across every two or three adjacent layers, the network structure is highly irregular; compared with a traditional CNN, deploying a residual network on an FPGA is therefore more difficult. Many existing studies design a single convolution processing array module to handle the convolution operations of the residual network: the array computes one layer of the network at a time, and a central processor repeatedly invokes it to compute all the convolution layers of the residual network layer by layer.
The structure of a residual network consists mainly of a stack of residual building blocks with branch shortcuts (Fig. 1). The main path is usually composed of three convolutions with kernel sizes 1×1, 3×3, and 1×1 (hence it is also called a bottleneck building block). The branch shortcut has two cases: 1) the input features are computed by a convolution layer with a kernel size of 1×1, and the result is added point by point to the corresponding pixels of the main-path output features; 2) the input feature data are added point by point to the corresponding pixels of the main-path output features directly, without any processing.
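For reference, this bottleneck structure can be restated in a few lines of PyTorch. The sketch below only illustrates the dataflow of Fig. 1 and is not part of the claimed hardware; the channel parameters and the omission of strides and batch normalization are simplifying assumptions.

    import torch.nn as nn

    class BottleneckBlock(nn.Module):
        """Bottleneck residual building block: a 1x1 -> 3x3 -> 1x1 main
        path, plus a shortcut that is a 1x1 convolution or an identity."""
        def __init__(self, in_ch, mid_ch, out_ch, project_shortcut):
            super().__init__()
            self.conv1 = nn.Conv2d(in_ch, mid_ch, kernel_size=1)
            self.conv2 = nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1)
            self.conv3 = nn.Conv2d(mid_ch, out_ch, kernel_size=1)
            # Case 1: shortcut carries a 1x1 convolution; case 2: identity.
            self.shortcut = (nn.Conv2d(in_ch, out_ch, kernel_size=1)
                             if project_shortcut else nn.Identity())
            self.relu = nn.ReLU()

        def forward(self, x):
            main = self.relu(self.conv1(x))
            main = self.relu(self.conv2(main))
            main = self.conv3(main)
            # Point-by-point addition of main-path and shortcut pixels.
            return self.relu(main + self.shortcut(x))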
Fig. 2 shows the conventional execution flow of computing one layer of a residual network with a single convolution processing array; each pass through the flow completes the computation of one layer. This existing solution, which accelerates the convolutional neural network with a single convolution processing array module, suits CNN models with the traditional, simple layer-by-layer stacking structure and has a certain generality. However, the external memory must be accessed before and after the computation of every convolution layer, and a residual network usually has many layers, which brings considerable energy consumption and memory access latency. Because of the particular structure of the residual network, a single convolution processing array can only execute the convolution layers of the main path and the shortcut branch of a residual building block serially before performing the point-by-point addition, so the parallelism of the structure cannot be fully exploited. Meanwhile, the convolution layers of a residual network come in various sizes, and processing convolutions of different sizes with a single array generally cannot achieve high hardware resource utilization.
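In pseudocode, the conventional flow of Fig. 2 amounts to one external-memory round trip per layer. The plain-Python sketch below makes that cost explicit; the layer dictionary fields and the conv_array callable are hypothetical names used only for illustration.

    def run_network_single_array(layers, external_memory, conv_array):
        """Conventional layer-by-layer execution (Fig. 2): every layer
        pays one external-memory read for features and weights and one
        write for results, which dominates latency and energy for deep
        residual networks."""
        for layer in layers:
            # DMA read: off-chip memory -> on-chip buffers.
            features = external_memory[layer["input_addr"]]
            weights = external_memory[layer["weight_addr"]]
            # Reconfigure the single shared array, then compute one layer.
            output = conv_array(features, weights, layer["config"])
            # DMA write: on-chip buffer -> off-chip memory.
            external_memory[layer["output_addr"]] = output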
Disclosure of Invention
The invention provides a design method for a pipelined convolution computing architecture with higher hardware utilization.
A further object of the present invention is to design a residual network acceleration system using the pipelined convolution computing architecture design method.
In order to achieve the above technical effects, the technical solution of the invention is as follows:
a design method of a running water type convolution computing architecture comprises the following steps:
s1: dividing the hardware acceleration architecture into an on-chip buffer area, a convolution processing array and a point-by-point addition module;
s2: the main route of the hardware acceleration architecture is composed of three serially arranged convolution processing arrays, two assembly line buffer areas are inserted between the three serially arranged convolution processing arrays and used for realizing interlayer pipelining of three layers of convolution of the main route, and the assembly line buffer areas are arranged in an on-chip buffer area;
s3: setting a fourth convolution processing array for parallel processing of a convolution layer with the kernel size of 1 multiplied by 1, changing the working mode of the fourth convolution processing array by configuring a register in the fourth convolution processing array, so that the fourth convolution processing array can be used for calculating a head convolution layer or a full connection layer of a residual network, and skipping the fourth convolution processing array to execute no convolution when the branches of the residual building block have no convolution;
s4: and setting a point-by-point addition module to add the pixels of the corresponding output characteristics by elements of the output characteristics of the main path of the residual block and the output characteristics of the branch quick connection.
Further, the on-chip buffer comprises an input buffer, pipeline buffers, output buffers, and weight buffers. The input buffer caches the feature data slices read from the off-chip memory and is shared by the first convolution processing array of the residual-block main path and the fourth convolution processing array to provide feature input. Pipeline buffers are applied at the outputs of the first and second convolution processing arrays, which compute the main-path convolutions of the residual building block; each pipeline buffer caches the output features of the first convolution processing array, which are the input features of the second. A first output buffer is placed at the output of the third convolution processing array of the residual-block main path, and a second output buffer at the output of the fourth convolution processing array of the shortcut-branch portion, to store the convolution output feature results; depending on the subsequent operation, data in the output buffers can be sent to the point-by-point addition module or the pooling unit, or written back to the external memory through the direct memory access module. The weight buffers cache the weight data slices corresponding to each convolution layer. Because the three main-path convolution layers are processed as a pipeline, and in order to let the next pipeline stage start sooner while minimizing the pipeline buffer size, the loop order of the convolution computation is designed to first compute all output channels corresponding to one output feature, then move to the next output feature, and so on for all output features. This order avoids repeatedly loading input feature slices into the buffer, but causes the weight buffers to be reloaded repeatedly as weight slices are replaced; for this reason each convolution processing array is designed with two weight buffers, weight buffer a and weight buffer b, which ping-pong-buffer the weight slices so as to overlap convolution computation latency with weight loading latency. The point-by-point addition module performs element-by-element addition of the corresponding output feature pixels of the main-path output features and the shortcut-branch output features.
The point-by-point addition module first reads the corresponding output features from the first and second output buffers and adds them, then applies the activation operation, and sends the result back to the first output buffer of the main path; depending on the subsequent operation, the data in the first output buffer can be sent to the pooling unit for a pooling operation or written back to the external memory through the direct memory access module.
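The loop order described above (all output channels of one output feature first) and the ping-pong weight buffers a/b can be sketched as follows. Here fetch_slice and compute are hypothetical stand-ins for the weight DMA load and the multiply-accumulate work, and in hardware the prefetch and the compute run concurrently rather than sequentially as in this plain-Python model.

    def conv_loop_with_pingpong(num_pixels, num_out_channels,
                                fetch_slice, compute):
        """Output-feature-major loop with ping-pong weight buffers:
        finish all output channels of one output feature before moving
        on, so the next pipeline stage starts sooner and pipeline
        buffers stay small; prefetch the next weight slice into the
        idle buffer to hide the reload latency this order causes."""
        # Visit (output feature, output channel) pairs in the order the
        # architecture uses.
        schedule = [(p, c) for p in range(num_pixels)
                    for c in range(num_out_channels)]
        buffers = [fetch_slice(schedule[0][1]), None]  # buffer a preloaded
        active = 0
        for i, (pixel, ch) in enumerate(schedule):
            if i + 1 < len(schedule):
                # Load the next slice into the idle (standby) buffer; in
                # hardware this DMA transfer overlaps the compute below.
                buffers[1 - active] = fetch_slice(schedule[i + 1][1])
            compute(pixel, ch, buffers[active])
            active = 1 - active                        # swap ping and pong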
Further, the register configuration module in each of the first to fourth convolution processing arrays receives and registers the parameters of that array, including the size of the convolution layer and the working mode. According to the register values in the register configuration module, the logic control module feeds the weight and feature data streams into the multiply-accumulate unit, bias unit, or activation unit of the convolution processing array in the specified manner, and emits the computation result in the specified data stream format.
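Concretely, the register contents could be modeled as a small record. The field names below are assumptions (the patent only specifies that the registers hold the convolution layer size and the working mode), and the three compute units are passed in as callables.

    from dataclasses import dataclass

    @dataclass
    class ConvArrayConfig:
        """Assumed register-configuration contents of one convolution
        processing array; actual register names/widths are not given."""
        kernel_size: int    # e.g. 1 or 3
        in_channels: int
        out_channels: int
        feature_size: int   # input feature map width/height
        mode: str           # e.g. 'main', 'branch', 'head_conv', 'fc', 'bypass'

    def logic_control(cfg, weights, features, mac_unit, bias_unit, act_unit):
        """Route the weight and feature streams through the array's
        compute units in the manner the configuration registers specify."""
        if cfg.mode == "bypass":
            return features                  # array skipped: no convolution
        result = mac_unit(features, weights, cfg)  # multiply-accumulate unit
        result = bias_unit(result, cfg)            # bias calculation unit
        return act_unit(result, cfg)               # activation unit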
A residual network acceleration system comprises: a direct memory access module, a pipelined convolution computing architecture module, a pooling unit, and a global control logic unit.
the direct memory access module sends a read data command to the off-chip memory, so that data in the off-chip memory is transmitted to an input buffer area on the chip; transmitting a data writing command to an off-chip memory, and writing the final output characteristics calculated by the current residual block into the external memory from the data of the output buffer;
the pooling operation unit is used for carrying out average pooling operation or maximum pooling operation; when the pooling operation is required to be executed, the pooling operation unit reads the characteristic data from the output buffer zone of the running water convolution computing architecture module to execute corresponding pooling operation, and then writes the executed result back to the output buffer zone;
the global control logic unit is used for controlling the starting, execution sequence and data flow of each module of the whole system; tracking the number of layers executed by the current network; transferring parameters required by the direct memory access module; the data in the off-chip memory comprises characteristic data for identification of a convolutional neural network and corresponding weight data; the global control logic unit is also used for configuring the working mode of the running water type convolution computing architecture and loading the kernel size and characteristic size parameters of the current computing layer into the register configuration modules of the convolution processing arrays.
Compared with the prior art, the technical solution of the invention has the following beneficial effects:
the special central computing unit is designed according to the characteristics of the residual building block module, so that the central computing unit can complete the computation of a plurality of convolution layers through one-time configuration; by combining the design idea of a pipeline type design neural network accelerator, the pipeline buffer is inserted between three layers of convolution layers of a main path in a central computing unit, so that the computing parallelism is enhanced, and the computing delay is reduced; multiple accesses to an external memory are avoided, memory access delay is reduced, and power consumption is reduced; by using the design idea of hardware parallelism, a special convolution processing array is arranged for the branch convolution of the residual block, so that the branch convolution and the main convolution can be operated in parallel, and the calculation delay is reduced; the branched convolution processing array 4 can be used for calculating the head convolution and the full connection layer through configuration of the working mode; ping-pong buffers are designed for weights. The circulation sequence of the convolution calculation of the accelerator is designed to finish the calculation of all output channels corresponding to a certain output feature, and then finish the calculation of all output features, which can cause frequent weight slice replacement, so that a ping-pong buffer area is designed for weights to overlap memory access delay and calculation delay.
Drawings
FIG. 1 is a schematic diagram of a residual block of a residual network of the prior art;
FIG. 2 is a flow chart of calculating one layer of a residual network in the prior art;
FIG. 3 is a flow chart of the design method of the present invention;
FIG. 4 is a block diagram of a convolution processing array;
FIG. 5 is a block diagram of the residual network acceleration system;
FIG. 6 is a flow chart of the acceleration system computing a residual network.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the present patent. For the purpose of better illustrating the embodiments, certain elements of the drawings may be omitted, enlarged, or reduced, and do not represent actual product dimensions. It will be appreciated by those skilled in the art that certain well-known structures in the drawings, and descriptions thereof, may be omitted.
The technical scheme of the invention is further described below with reference to the accompanying drawings and examples.
Example 1
As shown in Fig. 3, a design method for a pipelined convolution computing architecture includes the following steps:
s1: dividing the hardware acceleration architecture into an on-chip buffer area, a convolution processing array and a point-by-point addition module;
s2: the main route of the hardware acceleration architecture is composed of three serially arranged convolution processing arrays, two assembly line buffer areas are inserted between the three serially arranged convolution processing arrays and used for realizing interlayer pipelining of three layers of convolution of the main route, and the assembly line buffer areas are arranged in an on-chip buffer area;
s3: setting a fourth convolution processing array for parallel processing of a convolution layer with the kernel size of 1 multiplied by 1, changing the working mode of the fourth convolution processing array by configuring a register in the fourth convolution processing array, so that the fourth convolution processing array can be used for calculating a head convolution layer or a full connection layer of a residual network, and skipping the fourth convolution processing array to execute no convolution when the branches of the residual building block have no convolution;
s4: and setting a point-by-point addition module to add the pixels of the corresponding output characteristics by elements of the output characteristics of the main path of the residual block and the output characteristics of the branch quick connection.
The on-chip buffer comprises an input buffer, pipeline buffers, output buffers, and weight buffers. The input buffer caches the feature data slices read from the off-chip memory and is shared by the first convolution processing array of the residual-block main path and the fourth convolution processing array to provide feature input. Pipeline buffers are applied at the outputs of the first and second convolution processing arrays, which compute the main-path convolutions of the residual building block; each pipeline buffer caches the output features of the first convolution processing array, which are the input features of the second. A first output buffer is placed at the output of the third convolution processing array of the residual-block main path, and a second output buffer at the output of the fourth convolution processing array of the shortcut-branch portion, to store the convolution output feature results; depending on the subsequent operation, data in the output buffers can be sent to the point-by-point addition module or the pooling unit, or written back to the external memory through the direct memory access module. The weight buffers cache the weight data slices corresponding to each convolution layer. Because the three main-path convolution layers are processed as a pipeline, and in order to let the next pipeline stage start sooner while minimizing the pipeline buffer size, the loop order of the convolution computation is designed to first compute all output channels corresponding to one output feature, then move to the next output feature, and so on for all output features. This order avoids repeatedly loading input feature slices into the buffer, but causes the weight buffers to be reloaded repeatedly as weight slices are replaced; for this reason each convolution processing array is designed with two weight buffers, weight buffer a and weight buffer b, which ping-pong-buffer the weight slices so as to overlap convolution computation latency with weight loading latency. The point-by-point addition module performs element-by-element addition of the corresponding output feature pixels of the main-path output features and the shortcut-branch output features.
The point-by-point addition module first reads the corresponding output features from the first and second output buffers and adds them, then applies the activation operation, and sends the result back to the first output buffer of the main path; depending on the subsequent operation, the data in the first output buffer can be sent to the pooling unit for a pooling operation or written back to the external memory through the direct memory access module.
As shown in Fig. 4, the register configuration module in each of the first to fourth convolution processing arrays receives and registers the parameters of that array, including the size of the convolution layer and the working mode; according to the register values in the register configuration module, the logic control module feeds the weight and feature data streams into the multiply-accumulate unit, bias unit, or activation unit of the convolution processing array in the specified manner, and emits the computation result in the specified data stream format.
Example 2
As shown in Fig. 5, a residual network acceleration system is designed using the pipelined convolution computing architecture design method and comprises: a direct memory access module, a pipelined convolution computing architecture module, a pooling unit, and a global control logic unit.
the direct memory access module sends a read data command to the off-chip memory, so that data in the off-chip memory is transmitted to an input buffer area on the chip; transmitting a data writing command to an off-chip memory, and writing the final output characteristics calculated by the current residual block into the external memory from the data of the output buffer;
the pooling operation unit is used for carrying out average pooling operation or maximum pooling operation; when the pooling operation is required to be executed, the pooling operation unit reads the characteristic data from the output buffer zone of the running water convolution computing architecture module to execute corresponding pooling operation, and then writes the executed result back to the output buffer zone;
the global control logic unit is used for controlling the starting, execution sequence and data flow of each module of the whole system; tracking the number of layers executed by the current network; transferring parameters required by the direct memory access module; the data in the off-chip memory comprises characteristic data for identification of a convolutional neural network and corresponding weight data; the global control logic unit is also used for configuring the working mode of the running water type convolution computing architecture and loading the kernel size and characteristic size parameters of the current computing layer into the register configuration modules of the convolution processing arrays.
The execution flow of the residual network acceleration system is shown in Fig. 6.
A dedicated central computing unit is designed according to the characteristics of the residual building block, so that the central computing unit can complete the computation of several convolution layers with a single configuration. Following the idea of pipelined neural network accelerator design, pipeline buffers are inserted between the three main-path convolution layers inside the central computing unit, which increases computational parallelism and reduces computation latency; repeated accesses to the external memory are avoided, reducing memory access latency and power consumption. Using hardware parallelism, a dedicated convolution processing array is provided for the shortcut-branch convolution of the residual block, so that the branch convolution and the main-path convolutions can run in parallel, further reducing computation latency; through configuration of its working mode, the fourth convolution processing array can also compute the head convolution layer and the fully connected layer. Finally, ping-pong buffers are designed for the weights: the loop order of the accelerator's convolution computation first completes all output channels corresponding to one output feature and then completes the remaining output features, which causes frequent replacement of weight slices, so a ping-pong weight buffer is used to overlap memory access latency with computation latency.
The same or similar reference numerals correspond to the same or similar components. The positional relationships depicted in the drawings are for illustrative purposes only and are not to be construed as limiting the present patent.
it is to be understood that the above examples of the present invention are provided by way of illustration only and not by way of limitation of the embodiments of the present invention. Other variations or modifications of the above teachings will be apparent to those of ordinary skill in the art. It is not necessary here nor is it exhaustive of all embodiments. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the invention are desired to be protected by the following claims.

Claims (9)

1. A design method for a pipelined convolution computing architecture, characterized by comprising the following steps:
S1: dividing the pipelined convolution computing architecture into an on-chip buffer, convolution processing arrays, and a point-by-point addition module;
S2: forming the main path of the pipelined convolution computing architecture from three serially arranged convolution processing arrays, with two pipeline buffers inserted between the three arrays to realize inter-layer pipelining of the three main-path convolution layers, the pipeline buffers being placed in the on-chip buffer;
S3: providing a fourth convolution processing array for processing a convolution layer with a kernel size of 1×1 in parallel, and changing the working mode of the fourth convolution processing array by configuring a register inside it, so that it can be used to compute the head convolution layer or the fully connected layer of a residual network, and is skipped and performs no convolution when the shortcut branch of the residual building block contains no convolution;
S4: providing a point-by-point addition module that adds, element by element, the corresponding output feature pixels of the main-path output features and the shortcut-branch output features of the residual block;
wherein the on-chip buffer comprises an input buffer, pipeline buffers, output buffers, and weight buffers; the input buffer caches the feature data slices read from the off-chip memory and is shared by the first convolution processing array of the residual-block main path and the fourth convolution processing array to provide feature input; and the pipeline buffers are applied at the outputs of the first and second convolution processing arrays, which compute the main-path convolutions of the residual building block.
2. The design method for a pipelined convolution computing architecture according to claim 1, wherein each pipeline buffer caches the output features of the first convolution processing array, which are the input features of the second convolution processing array.
3. The design method for a pipelined convolution computing architecture according to claim 2, wherein a first output buffer is provided at the output of the third convolution processing array of the residual-block main path, and a second output buffer is provided at the output of the fourth convolution processing array of the shortcut-branch portion, for storing the convolution output feature results; and wherein, depending on the subsequent operation, data in the output buffers can be sent to the point-by-point addition module or the pooling unit, or written back to the external memory through the direct memory access module.
4. The design method for a pipelined convolution computing architecture according to claim 3, wherein the weight buffers cache the weight data slices corresponding to each convolution layer; because the three main-path convolution layers are processed as a pipeline, and in order to let the next pipeline stage start sooner while minimizing the pipeline buffer size, the loop order of the convolution computation is designed to first compute all output channels corresponding to one output feature, then move to the next output feature, and so on for all output features, which avoids repeatedly loading input feature slices into the buffer but causes the weight buffers to be reloaded repeatedly as weight slices are replaced; and wherein, for this reason, each convolution processing array is designed with two weight buffers, weight buffer a and weight buffer b, which ping-pong-buffer the weight slices so as to overlap convolution computation latency with weight loading latency.
5. The design method for a pipelined convolution computing architecture according to claim 4, wherein the point-by-point addition module is configured to perform element-by-element addition of the corresponding output feature pixels of the main-path output features and the shortcut-branch output features;
the module first reads the corresponding output features from the first and second output buffers and adds them, then applies the activation operation, and sends the result back to the first output buffer of the main path; depending on the subsequent operation, the data in the first output buffer can be sent to the pooling unit for a pooling operation or written back to the external memory through the direct memory access module.
6. The design method for a pipelined convolution computing architecture according to any one of claims 1-4, wherein the register configuration module in each of the first to fourth convolution processing arrays is configured to receive and register the parameters of that array, including the size of the convolution layer and the working mode; and wherein, according to the register values in the register configuration module, the logic control module feeds the weight and feature data streams into the multiply-accumulate unit, bias unit, or activation unit of the convolution processing array in the specified manner, and emits the computation result in the specified data stream format.
7. A residual network acceleration system designed using the design method of claim 6, comprising: a direct memory access module, a pipelined convolution computing architecture module, a pooling unit, and a global control logic unit;
wherein the direct memory access module sends read commands to the off-chip memory so that data in the off-chip memory are transferred to the on-chip input buffer, and sends write commands to the off-chip memory so that the final output features computed for the current residual block are written from the output buffer to the external memory;
the pooling unit is configured to perform average pooling or max pooling; when a pooling operation is required, the pooling unit reads feature data from an output buffer of the pipelined convolution computing architecture module, performs the corresponding pooling operation, and writes the result back to the output buffer; and
the global control logic unit is configured to control the start-up, execution order, and data flow of each module of the whole system, to track the number of network layers currently executed, and to pass the parameters required by the direct memory access module.
8. The residual network acceleration system of claim 7, wherein the data in the off-chip memory comprise the feature data to be recognized by the convolutional neural network and the corresponding weight data.
9. The residual network acceleration system of claim 8, wherein the global control logic unit is further configured to configure the working mode of the pipelined convolution computing architecture and to load the kernel-size and feature-size parameters of the current computation layer into the register configuration modules of the convolution processing arrays.
CN202110262425.XA 2021-03-10 2021-03-10 Design method of a pipelined convolution computing architecture and residual network acceleration system Active CN112862079B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110262425.XA CN112862079B (en) 2021-03-10 2021-03-10 Design method of a pipelined convolution computing architecture and residual network acceleration system

Publications (2)

Publication Number Publication Date
CN112862079A (en) 2021-05-28
CN112862079B (en) 2023-04-28

Family

ID=75993917

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110262425.XA Active CN112862079B (en) 2021-03-10 2021-03-10 Design method of a pipelined convolution computing architecture and residual network acceleration system

Country Status (1)

Country Link
CN (1) CN112862079B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114202071B * 2022-02-17 2022-05-27 浙江光珀智能科技有限公司 Deep convolutional neural network inference acceleration method based on a dataflow mode

Citations (1)

Publication number Priority date Publication date Assignee Title
CN109447254A * 2018-11-01 2019-03-08 济南浪潮高新科技投资发展有限公司 Hardware acceleration method and device for convolutional neural network inference

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
US10678508B2 (en) * 2018-03-23 2020-06-09 Amazon Technologies, Inc. Accelerated quantized multiply-and-add operations
CN110163215B (en) * 2018-06-08 2022-08-23 腾讯科技(深圳)有限公司 Image processing method, image processing device, computer readable medium and electronic equipment
CN109032781A (en) * 2018-07-13 2018-12-18 重庆邮电大学 FPGA parallel system for a convolutional neural network algorithm
CN109934339B (en) * 2019-03-06 2023-05-16 东南大学 General convolutional neural network accelerator based on a one-dimensional systolic array
US11797345B2 (en) * 2019-04-30 2023-10-24 Prakash C R J Naidu Hardware accelerator for efficient convolution processing
CN112200302B (en) * 2020-09-27 2021-08-17 四川翼飞视科技有限公司 Construction method of a weighted residual neural network for image classification

Also Published As

Publication number Publication date
CN112862079A (en) 2021-05-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant