CN112052941B - Efficient memory computing system applied to CNN (convolutional neural network) convolution layer and operation method thereof - Google Patents

Info

Publication number: CN112052941B
Authority: CN (China)
Prior art keywords: data, result, weight, convolution, module
Legal status: Active (assumed; Google has not performed a legal analysis)
Application number: CN202010947798.6A
Other languages: Chinese (zh)
Other versions: CN112052941A
Inventors: 李丽 (Li Li), 陈铠 (Chen Kai), 傅玉祥 (Fu Yuxiang), 宋文清 (Song Wenqing), 何国强 (He Guoqiang), 陈辉 (Chen Hui), 何书专 (He Shuzhuan)
Current Assignee: Nanjing University
Original Assignee: Nanjing University
Priority/Filing date: 2020-09-10
Application filed by Nanjing University
Publication of CN112052941A: 2020-12-08
Application granted; publication of CN112052941B: 2024-02-20

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10: Complex mathematical operations
    • G06F 17/15: Correlation function computation including computation of convolution operations
    • G06F 17/153: Multidimensional correlation or convolution
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00: Network arrangements or protocols for supporting network services or applications
    • H04L 67/50: Network services
    • H04L 67/56: Provisioning of proxy services
    • H04L 67/568: Storing data temporarily at an intermediate stage, e.g. caching
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention provides an efficient memory computing system applied to a CNN network convolution layer and an operation method thereof. The architecture comprises: a data caching module for caching result data; an operation array for performing highly parallel, fully pipelined convolution operations to obtain convolution results; a source data distribution module for reading image source data from the data cache and sending it to the operation array; a weight sharing module for reading weight data from the data cache, copying and regrouping it, and sending it to the operation array; and a result data writing module for storing the convolution results of the operation array into the data caching module. The architecture builds its operation array from fully pipelined parallel operation clusters and pairs it with a matching data buffer and a high-bandwidth data supply channel, so that high-performance operation of the dense convolution algorithm of CNN networks is achieved with low hardware complexity, giving the system good application prospects.

Description

Efficient memory computing system applied to CNN (convolutional neural network) convolution layer and operation method thereof
Technical Field
The invention relates to the field of artificial intelligence algorithms, and in particular to a hardware implementation method for high-density convolution operations in CNN network convolution layers.
Background
Neural networks are a branch of artificial intelligence research, of which the convolutional neural network (CNN) is currently the most popular; a basic CNN consists of three kinds of structure: convolution, activation, and pooling. In 1998, LeCun proposed the classical LeNet-5 network for the visual task of handwritten digit recognition, forming the prototype of the contemporary convolutional neural network. In recent years, with the rise of deep learning theory and the continuous improvement of the floating-point performance of GPUs used for heterogeneous training, deep convolutional neural networks have developed rapidly, producing a large number of high-accuracy networks such as AlexNet, VGG, GoogLeNet and MobileNet. In 2015, He Kaiming et al. proposed ResNet (the residual neural network) and successfully trained a 152-layer convolutional neural network, reducing the visual recognition error rate to 4.94%, below the 5.1% error rate of human recognition. Deep convolutional neural networks are now widely applied in speech recognition, image segmentation, natural language processing and other fields.
As network depth increases and the number of channels (i.e., the number of convolution kernels) grows, the convolution workload increases explosively and accounts for more than 80% of the computation of the whole CNN network, putting enormous pressure on highly real-time edge-side applications. Existing general-purpose processors based on the von Neumann architecture cannot meet the real-time inference requirements of neural network algorithms; a dedicated, efficient computing architecture must therefore be developed to raise convolution processing performance to the level demanded by real-time edge-side deployment of deep neural networks.
Disclosure of Invention
The purpose of the invention: the invention aims to improve the performance of CNN convolution layer implementations and to match storage, data supply, and operation resources efficiently. The invention provides an efficient memory computing system applied to a CNN network convolution layer, and further provides an operation method based on this architecture. The operation array is designed around fully pipelined parallel operation clusters, and a data buffer and a high-bandwidth data supply channel matched to the array are designed, so that high-performance operation of the dense convolution algorithm of CNN networks is realized with low hardware complexity, better meeting the performance requirements of practical convolutional neural network applications.
The technical scheme is as follows: an efficient memory computing system applied to a CNN network convolution layer comprises the following modules:
the data caching module, which stores the CNN image source data set, stores the weight data (including convolution kernels and biases), caches the result data, and provides a read-write interface to peripheral devices;
the operation array, which performs convolution on the input data and weight data to obtain the convolution results;
the source data distribution module, which generates source data read addresses, reads the image source data from the source data BANKs, and sends it to the operation array;
the weight sharing module, which reads the weight data (including convolution kernels and biases) required by the convolution operation, copies and regroups it through 1-to-many drivers, and sends it to the operation array;
and the result data writing module, which generates result write addresses and stores the convolution results of the operation array into the data caching module.
In a further embodiment, the data caching module comprises a source data buffer area, a weight buffer area and a result buffer area, and provides a read-write interface for communication with peripheral devices; the source data buffer area stores the image source data set, the weight buffer area stores the weight data, and the result buffer area caches the result data.
In a further embodiment, the operation array comprises n identical operation clusters, where n is the image-level parallelism: the array can process the convolutions of n images simultaneously.
In a further embodiment, each operation cluster comprises m 16-bit fixed-point multiply-accumulators, at least one 16-bit fixed-point adder, a source data sharing module and a data rearrangement module, where m is the operation parallelism within a single cluster; the total convolution parallelism is n×m, whose value depends on the available operation resources.
In a further embodiment, the parallelism of the operation array when performing convolution is n×m; the source data sharing module is further configured to copy the 1-way input data into m copies through a 1-to-m driver and distribute them to the m multiply-accumulators, where m is the operation parallelism within a single cluster.
In a further embodiment, the data rearrangement module is further configured to take the results of the m multiply-accumulators that arrive at the same instant and, through delay units and a multiplexer (MUX), multiplex them onto 1 output port to form a single result output; where m is the operation parallelism within a single cluster.
In a further embodiment, the operation cluster has m+1 weight inputs and 1 source data input, and forms an m-way parallel pipeline structure through internal interconnection. The source data sharing module produces m copies of the source data, which are sent to the 1st inputs of the m multiply-accumulators; weights 1 to m are sent to the 2nd inputs of multiply-accumulators 1 to m, which operate simultaneously. The calculation results are sent to the data rearrangement module, whose output feeds input 1 of the adder; weight m+1 feeds input 2 of the adder; and the adder output, as the convolution result of the operation cluster, is sent to the result data writing module.
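For illustration only (a behavioral model, not the claimed hardware), the dataflow of one operation cluster may be sketched in Python as follows; the function names are hypothetical, Python integers stand in for the 16-bit fixed-point arithmetic, and per-kernel biases stand in for weight m+1:

```python
from typing import List

def multiply_accumulate(window: List[int], weights: List[int]) -> int:
    """One 16-bit fixed-point multiply-accumulator, modeled as a dot
    product over the current input window."""
    return sum(x * w for x, w in zip(window, weights))

def operation_cluster(window: List[int],
                      kernel_weights: List[List[int]],  # weights 1..m
                      biases: List[int]) -> List[int]:  # biases (weight m+1)
    """Behavioral model of one cluster: the source window is copied m
    times (1-to-m driver), the m multiply-accumulators run in parallel,
    the data rearrangement module serializes their results, and the adder
    applies each kernel's bias, yielding m output-channel points."""
    m = len(kernel_weights)
    shared = [window] * m                       # source data sharing
    mac_results = [multiply_accumulate(shared[i], kernel_weights[i])
                   for i in range(m)]           # m MACs in parallel
    # rearrangement serializes the m simultaneous results; the adder then
    # adds the bias corresponding to each kernel
    return [r + b for r, b in zip(mac_results, biases)]
```

A call such as operation_cluster(window, kernels, biases) then corresponds to producing the m output-channel points of one output pixel.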
In a further embodiment, the weight sharing module has m+1 weight inputs; each weight is copied n times through a 1-to-n driver, and n groups of weight outputs are generated through the multiplexer MUX, each group containing weights 1 to m+1; the n groups are sent to the n operation clusters respectively.
In a further embodiment, the data cache is divided into a source data buffer area, a weight buffer area and a result buffer area: the source data buffer area contains n data BANKs (source data BANK1 to source data BANKn), the weight buffer area contains m+1 weight BANKs (weight BANK1 to weight BANKm+1), and the result buffer area contains n result BANKs (result BANK1 to result BANKn).
In a further embodiment, each BANK has a set of independent read ports and a set of independent write ports.
In a further embodiment, the mapping of the source data buffer area is defined as follows: BANK1 stores the 1st, (n+1)th, (2n+1)th, … image data, which is sent through the source data distribution module and the source data sharing module of operation cluster 1 to the 1st inputs of multiply-accumulators 1 to m of operation cluster 1; BANK2 stores the 2nd, (n+2)th, (2n+2)th, … image data, which is sent through the source data distribution module and the source data sharing module of operation cluster 2 to the 1st inputs of multiply-accumulators 1 to m of operation cluster 2; and so on, until BANKn stores the nth, 2nth, 3nth, … image data, which is sent through the source data distribution module and the source data sharing module of operation cluster n to the 1st inputs of multiply-accumulators 1 to m of operation cluster n.
In a further embodiment, the mapping of the weight buffer area is defined as follows: BANK1 stores the 1st, (m+1)th, (2m+1)th, … convolution kernel data, which is sent through the weight sharing module to the 2nd input of multiply-accumulator 1 of each of the n operation clusters; BANK2 stores the 2nd, (m+2)th, (2m+2)th, … convolution kernel data, which is sent through the weight sharing module to the 2nd input of multiply-accumulator 2 of each of the n operation clusters; and so on, until BANKm stores the mth, 2mth, 3mth, … convolution kernel data, which is sent through the weight sharing module to the 2nd input of multiply-accumulator m of each of the n operation clusters; BANKm+1 stores the bias parameters, which are sent through the weight sharing module to the 2nd input of the adder of each of the n operation clusters.
In a further embodiment, the mapping of the result buffer area is defined as follows: BANK1 stores the result data of operation cluster 1, i.e., the 1st, (n+1)th, (2n+1)th, … image convolution results; BANK2 stores the result data of operation cluster 2, i.e., the 2nd, (n+2)th, (2n+2)th, … image convolution results; and so on, until BANKn stores the result data of operation cluster n, i.e., the nth, 2nth, 3nth, … image convolution results.
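The three mappings above reduce to modular index arithmetic. The following sketch is illustrative only (1-based indices; the function names are hypothetical):

```python
def source_bank(image_idx: int, n: int) -> int:
    """Image i is stored in source data BANK ((i-1) mod n) + 1 and is
    processed by the operation cluster with the same index."""
    return (image_idx - 1) % n + 1

def weight_bank(kernel_idx: int, m: int) -> int:
    """Convolution kernel j is stored in weight BANK ((j-1) mod m) + 1 and
    feeds the multiply-accumulator of the same index in every cluster; the
    bias parameters live separately in weight BANK m+1."""
    return (kernel_idx - 1) % m + 1

def result_bank(image_idx: int, n: int) -> int:
    """Result BANKs mirror the source mapping: cluster k writes BANK k."""
    return (image_idx - 1) % n + 1

# Example: with n = 4 clusters, image 6 lives in source data BANK 2, is
# processed by cluster 2, and its results land in result BANK 2.
assert source_bank(6, 4) == 2 == result_bank(6, 4)
```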
Based on the above efficient memory computing architecture, the invention further proposes a specific procedure for implementing the CNN convolution layer algorithm:
Step 1) Write source data: the image source data is stored through the source data write port into the source data buffer area of the data cache, and the weight data required by the convolution operation is stored into the weight buffer area of the data cache.
Step 2) Data transmission: the source data distribution module generates read addresses, reads source data of different images from the n source data BANKs, and sends it to operation clusters 1 to n respectively; meanwhile, the weight sharing module generates read addresses, reads weight data from weight BANK1 to weight BANKm+1 to form n identical groups of weight data, and sends them to operation clusters 1 to n respectively.
Step 3) Convolution operation: the operation clusters perform convolution in parallel pipeline fashion; clusters 1 to n process the convolutions of n images in parallel, m convolution kernels are computed in parallel within each cluster, the total convolution parallelism reaches n×m, and each cluster generates m result points in the output-channel dimension.
Step 4) Result caching: the result data writing module receives the convolution results of the n operation clusters, i.e., the convolution results of n images, and stores them into the result buffer area of the data cache.
Step 5) Repeat steps 2) to 4) until the convolution of all result points of the images is complete.
Step 6) Read results: the image convolution results are read out through the result data read port.
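For illustration only, steps 2) to 5) may be summarized by the following behavioral loop; the function names are hypothetical, images are single-channel for brevity, and Python integers stand in for the 16-bit fixed-point arithmetic:

```python
def dot(xs, ws):
    """One multiply-accumulate pass over a flattened window."""
    return sum(x * w for x, w in zip(xs, ws))

def sliding_windows(image, k, stride):
    """Yield flattened k*k windows of a 2-D image, row-major over output points."""
    rows, cols = len(image), len(image[0])
    for r in range(0, rows - k + 1, stride):
        for c in range(0, cols - k + 1, stride):
            yield [image[r + dr][c + dc] for dr in range(k) for dc in range(k)]

def run_convolution_layer(images, kernels, biases, n, m, k, stride=1):
    """n: image parallelism (clusters); m: kernel parallelism per cluster.
    Returns, per image, one list of m output-channel points per window."""
    results = [[] for _ in images]                     # result buffer area
    for g in range(0, len(images), n):                 # step 2: n images per pass
        for kg in range(0, len(kernels), m):           # m kernels per pass
            for cl in range(min(n, len(images) - g)):  # cluster cl gets image g+cl
                i = g + cl
                for window in sliding_windows(images[i], k, stride):  # step 5
                    points = [dot(window, kernels[kg + j]) + biases[kg + j]
                              for j in range(min(m, len(kernels) - kg))]  # step 3
                    results[i].append(points)          # step 4: result caching
    return results                                     # step 6: read out
```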
Beneficial effects:
(1) The invention provides an efficient memory computing architecture that achieves high-performance operation of the CNN convolution layer algorithm.
(2) The invention designs an operation cluster structure whose main computing units are multiply-accumulators and adders, which are easy to implement in hardware; a fully pipelined, multi-way parallel operation mode shortens the operation period of the convolution algorithm.
(3) Using only 2n+m+1 storage BANK read ports, the invention supplies, through the weight sharing module and the in-cluster source data sharing modules, the 2×n×m input ports of the n×m multiply-accumulators and the n input ports of the n adders, achieving high-bandwidth data supply to the operation components of the array with low hardware complexity and thereby sustaining fully pipelined operation of every operation cluster (a worked example follows this list).
(4) The invention designs a data buffer divided into three independent areas, namely a source data buffer area, a weight buffer area and a result buffer area, achieving conflict-free access while the convolution algorithm runs.
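As a worked illustration of effect (3), take example design parameters n = 4 and m = 16 (illustrative values, not prescribed by the invention):

$$\underbrace{2n+m+1}_{\text{BANK read ports}} = 2\cdot 4+16+1 = 25 \quad\text{feed}\quad \underbrace{2nm}_{\text{MAC inputs}} + \underbrace{n}_{\text{adder inputs}} = 2\cdot 4\cdot 16 + 4 = 132$$

operand ports, a roughly fivefold fan-out obtained from the 1-to-m and 1-to-n sharing drivers rather than from additional memory ports.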
In conclusion, the invention can effectively improve the operation performance of the CNN network convolution layer and has good application value.
Drawings
Fig. 1 is a schematic diagram of the efficient memory computing system applied to a CNN network convolution layer according to the present invention.
FIG. 2 is a schematic diagram of an operation cluster structure according to the present invention.
Fig. 3 is a schematic diagram of the multiply-accumulator architecture of the present invention.
Fig. 4 is a schematic diagram of a source data sharing module according to the present invention.
Fig. 5 is a schematic diagram of a data rearrangement module according to the present invention.
FIG. 6 is a diagram illustrating a source data buffer mapping according to the present invention.
FIG. 7 is a diagram illustrating a weight buffer mapping according to the present invention.
FIG. 8 is a diagram illustrating a result buffer map according to the present invention.
Fig. 9 is a schematic diagram of a weight sharing module according to the present invention.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a more thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the invention may be practiced without one or more of these details. In other instances, well-known features have not been described in detail in order to avoid obscuring the invention.
The technical solution for realizing the purpose of the invention is as follows: the operation array is designed based on fully pipelined parallel operation clusters, and a data buffer and a high-bandwidth data supply channel matched to the array are designed, forming an efficient memory computing system design applied to the CNN network convolution layer and realizing high-performance hardware acceleration of the CNN convolution layer.
Example 1
As shown in FIG. 1, the efficient memory computing system applied to the CNN network convolution layer provided by the invention comprises an operation array, a data cache, a source data distribution module, a weight sharing module and a result data writing module. The external interfaces are a source data write interface, a weight write interface and a result data read interface.
(1) Operation array
The operation array consists of n identical operation clusters and can process the convolutions of n images simultaneously; the clusters work independently and have independent input and output data interfaces. Each operation cluster consists of m 16-bit fixed-point multiply-accumulators, 1 16-bit fixed-point adder, a source data sharing module and a data rearrangement module; the total convolution parallelism of the array is n×m, whose value depends on the available operation resources. As shown in fig. 3, a multiply-accumulator has 2 inputs and 1 output and is internally composed of 1 16-bit fixed-point multiplier and 1 16-bit fixed-point adder. As shown in fig. 4, the source data sharing module consists mainly of a 1-to-m driver, which copies the source data m times and distributes the copies to the m multiply-accumulators. As shown in fig. 5, the data rearrangement module multiplexes the m multiply-accumulator results arriving at the same instant onto 1 output port through delay units and a multiplexer MUX, forming a single result output. As shown in fig. 2, the operation cluster has m+1 weight inputs and 1 source data input and forms an m-way parallel pipeline structure through internal interconnection: the source data sharing module produces m copies of the source data, which are sent to the 1st inputs of the m multiply-accumulators; weights 1 to m are sent to the 2nd inputs of multiply-accumulators 1 to m, which operate simultaneously; the calculation results are sent to the data rearrangement module, whose output feeds input 1 of the adder; weight m+1 feeds input 2 of the adder; and the adder output, as the convolution result of the cluster, is sent to the result data writing module.
(2) Data caching
The data cache stores the CNN inference image data set, stores the inference weight data (including convolution kernels and biases), caches the result data, and provides a flexible read-write interface to peripheral devices. It is divided into a source data buffer area, a weight buffer area and a result buffer area: the source data buffer area contains n data BANKs (source data BANK1 to source data BANKn), the weight buffer area contains m+1 weight BANKs (weight BANK1 to weight BANKm+1), and the result buffer area contains n result BANKs (result BANK1 to result BANKn). Each BANK has a set of independent read ports and a set of independent write ports.
The mapping of the source data buffer area is shown in fig. 6: BANK1 stores the 1st, (n+1)th, (2n+1)th, … image data, which is sent through the source data distribution module and the source data sharing module of operation cluster 1 to the 1st inputs of multiply-accumulators 1 to m of operation cluster 1; BANK2 stores the 2nd, (n+2)th, (2n+2)th, … image data, which is sent through the source data distribution module and the source data sharing module of operation cluster 2 to the 1st inputs of multiply-accumulators 1 to m of operation cluster 2; and so on, until BANKn stores the nth, 2nth, 3nth, … image data, which is sent through the source data distribution module and the source data sharing module of operation cluster n to the 1st inputs of multiply-accumulators 1 to m of operation cluster n.
The mapping of the weight buffer area is shown in fig. 7: BANK1 stores the 1st, (m+1)th, (2m+1)th, … convolution kernel data, which is sent through the weight sharing module to the 2nd input of multiply-accumulator 1 of each of the n operation clusters; BANK2 stores the 2nd, (m+2)th, (2m+2)th, … convolution kernel data, which is sent through the weight sharing module to the 2nd input of multiply-accumulator 2 of each of the n operation clusters; and so on, until BANKm stores the mth, 2mth, 3mth, … convolution kernel data, which is sent through the weight sharing module to the 2nd input of multiply-accumulator m of each of the n operation clusters; BANKm+1 stores the bias parameters, which are sent through the weight sharing module to the 2nd input of the adder of each of the n operation clusters.
The mapping of the result buffer area is shown in fig. 8: BANK1 stores the result data of operation cluster 1, i.e., the 1st, (n+1)th, (2n+1)th, … image convolution results; BANK2 stores the result data of operation cluster 2, i.e., the 2nd, (n+2)th, (2n+2)th, … image convolution results; and so on, until BANKn stores the result data of operation cluster n, i.e., the nth, 2nth, 3nth, … image convolution results.
(3) Source data distribution
The source data distribution module generates source data read addresses, reads the image source data to be inferred from the source data BANKs, and sends it to the operation array. During distribution, the source data of source data BANK1 is sent to the source data input interface of operation cluster 1, the source data of source data BANK2 to the source data input interface of operation cluster 2, and so on, until the source data of source data BANKn is sent to the source data input interface of operation cluster n.
(4) Weight sharing
The weight sharing module reads the weight data required for inference (including convolution kernels and biases), copies and regroups it through 1-to-many drivers, and sends it to the operation array. As shown in fig. 9, the weight sharing module has m+1 weight inputs; each weight is copied n times through a 1-to-n driver, and n groups of weight outputs are generated through the multiplexer MUX, each group containing weights 1 to m+1; the n groups are sent to the n operation clusters respectively.
(5) Result data writing
The result data writing module generates result data write addresses and stores the convolution results of the operation array into the data cache. During storage, the convolution result of operation cluster 1 is stored into result BANK1, the convolution result of operation cluster 2 into result BANK2, and so on, until the convolution result of operation cluster n is stored into result BANKn.
Example 2
This embodiment realizes high-performance hardware acceleration of the CNN convolution layer algorithm on the basis of the efficient memory computing architecture described in Example 1.
A convolution computation involves the following parameters: input image size (In×In), number of input image channels (Ch), convolution kernel size (K×K×Ch), number of convolution kernels (Nu), step size (St), and output image size (Ou×Ou); the number of output image channels equals the number of convolution kernels Nu.
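Assuming no padding (padding is not mentioned in this embodiment), these parameters satisfy the usual size relation:

$$Ou = \left\lfloor \frac{In - K}{St} \right\rfloor + 1,\qquad \text{output channels} = Nu.$$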
The formula for a single convolution operation, which produces 1 point of the output image, is:

$$\text{out} = \sum_{H=1}^{K}\sum_{W=1}^{K}\sum_{C=1}^{Ch} X(H, W, C)\cdot WT(H, W, C) + b$$

where H denotes the row of the two-dimensional matrix, W the column, and C the image channel; (H, W, C) of X(H, W, C) determines the position of the input data within the current K×K×Ch window of the image source data, (H, W, C) of WT(H, W, C) determines the position of the weight data within the convolution kernel, and b is the bias parameter.
The steps for realizing the convolution algorithm on the high-efficiency memory architecture are as follows:
Step 1) Write source data: through the source data write port, multiple images of source data are stored into the source data buffer area of the data cache in the manner of fig. 6, and the convolution kernels (WT) and biases (b) required by the convolution operation are stored into the weight buffer area in the manner of fig. 7. Within each BANK the data is stored in (H, W, C) order, i.e., addresses advance first along the channel dimension, then the column dimension, then the row dimension.
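For illustration, such (H, W, C) ordering corresponds to a linear address function like the following (a hypothetical helper with 0-based indices; the actual address generation is done in hardware):

```python
def bank_address(h: int, w: int, c: int, In: int, Ch: int) -> int:
    """Linear BANK address of sample (h, w, c) for an In x In x Ch image
    laid out in (H, W, C) order: the channel index varies fastest, then
    the column index, then the row index."""
    return (h * In + w) * Ch + c

# Example: in a 224 x 224 x 3 image, the first sample of row 1 sits at
# address 224 * 3 = 672.
assert bank_address(1, 0, 0, In=224, Ch=3) == 672
```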
Step 2) data transmission:
The source data distribution module generates read addresses and reads K×K×Ch points of source data of different images from the n source data BANKs, sending them to operation clusters 1 to n respectively; inside each cluster, the data travels over the data path shown in fig. 2, through the cluster's source data sharing module, to the 1st inputs of multiply-accumulators 1 to m. Operation cluster 1 processes the source data of image 1, operation cluster 2 that of image 2, and so on, until operation cluster n processes the source data of image n.
Meanwhile, the weight sharing module generates read addresses, reads m convolution kernels (K×K×Ch each) from weight BANK1 to weight BANKm, and reads the bias parameters corresponding to those m kernels from weight BANKm+1. The m kernels correspond to outputs weight 1 to weight m in fig. 9 and are sent over the in-cluster data paths of fig. 2 to the 2nd inputs of multiply-accumulators 1 to m respectively; the bias parameters correspond to output weight m+1 in fig. 9 and are sent over the in-cluster data path of fig. 2 to the 2nd input of the adder. The weights 1 to m+1 fed to the different clusters are identical.
Step 3) convolution operation:
The operation clusters perform convolution in parallel pipeline fashion: clusters 1 to n process the convolutions of n images in parallel, m convolution kernels are computed in parallel within each cluster, and the total convolution parallelism reaches n×m. As shown in fig. 9, m multiply-accumulate operations proceed in parallel within each cluster, each over K×K×Ch data points; the m multiply-accumulate results arriving at the same instant pass through the data rearrangement module, are converted into a single-way result stream, and are sent to the adder, where weight m+1, i.e., the bias parameter (b), is added to complete the convolution and form m result points in the output-channel dimension.
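The rearrangement step in particular can be modeled as time-multiplexing the m simultaneous results onto one port (an illustrative sketch with hypothetical names; the hardware delay units become list offsets here):

```python
from typing import Iterable, List

def rearrange(mac_results_per_instant: Iterable[List[int]]) -> List[int]:
    """Behavioral model of the data rearrangement module: at each arrival
    instant, m multiply-accumulator results appear simultaneously; delay
    units stagger them so the MUX can emit them one per cycle on the
    single output port feeding the adder."""
    serialized: List[int] = []
    for results in mac_results_per_instant:  # one entry per arrival instant
        serialized.extend(results)           # MUX walks outputs 1..m in order
    return serialized

# Example: two arrival instants with m = 4 results each become 8
# sequential values on the single result port.
assert rearrange([[1, 2, 3, 4], [5, 6, 7, 8]]) == [1, 2, 3, 4, 5, 6, 7, 8]
```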
Step 4) result caching:
The result data writing module receives the convolution results of the n operation clusters, i.e., the convolution results of n images, and stores them into the data cache as shown in fig. 8.
Step 5) Repeat steps 2) to 4) until the convolution of all result points of the images is complete.
Step 6) reading the result: the image convolution result is read out through the result data read-out port.
In summary, the efficient memory computing system design applied to the CNN network convolution layer provided by this example builds its operation array from fully pipelined parallel operation clusters and provides a matching data buffer and high-bandwidth data supply channel. Using only 2n+m+1 storage BANK read ports, the weight sharing module and the in-cluster source data sharing modules supply the 2×n×m input ports of the n×m multiply-accumulators and the n input ports of the n adders, realizing high-performance operation of the CNN convolution layer algorithm with low hardware complexity; the design therefore has good application prospects.
As described above, although the present invention has been shown and described with reference to certain preferred embodiments, it is not to be construed as limiting the invention itself. Various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (4)

1. An efficient memory computing system applied to a CNN network convolution layer, characterized by comprising the following modules:
the data caching module is used for caching result data;
the operation array, which performs highly parallel, fully pipelined convolution operations to obtain convolution results; the operation array comprises n identical operation clusters, where n represents the image-level parallelism; each operation cluster comprises m 16-bit fixed-point multiply-accumulators, 1 16-bit fixed-point adder, a source data sharing module and a data rearrangement module, where m represents the operation parallelism within a single cluster;
the data rearrangement module is further configured to take the results of the m multiply-accumulators arriving at the same instant and multiplex them onto 1 output port through delay units and a multiplexer MUX, forming a single result output; where m represents the operation parallelism within a single cluster;
the operation cluster has m+1 weight inputs and 1 source data input and forms an m-way parallel pipeline structure through internal interconnection; the source data is sent through the source data sharing module to the 1st inputs of the m multiply-accumulators; weights 1 to m are sent to the 2nd inputs of multiply-accumulators 1 to m, which operate simultaneously; the calculation results are sent to the data rearrangement module, whose output feeds input 1 of the adder; weight m+1 feeds input 2 of the adder; and the output of the adder serves as the convolution result of the operation cluster; where n represents the image-level parallelism of the operation array and m the operation parallelism within a single cluster;
the source data distribution module is used for reading the image source data in the data cache and sending the image source data to the operation array;
the weight sharing module is used for reading weight data in the data cache, copying and regrouping the data and sending the data to the operation array;
and the result data writing module is used for storing the convolution calculation result of the operation array into the data caching module.
2. The efficient memory computing system applied to a CNN network convolution layer of claim 1, wherein the data caching module further comprises a source data buffer area, a weight buffer area and a result buffer area, and provides a read-write interface for communication with peripheral devices; the source data buffer area stores the image source data set, the weight buffer area stores the weight data, and the result buffer area caches the result data.
3. The efficient memory computing system applied to a CNN network convolution layer of claim 1, wherein the parallelism of the operation array when performing convolution is n×m; the source data sharing module is further configured to copy the 1-way input data into m copies through a 1-to-m driver and distribute them to the m multiply-accumulators, where m represents the operation parallelism within a single cluster.
4. The efficient memory computing system applied to a CNN network convolution layer of claim 1, wherein the weight sharing module comprises m+1 weight inputs; each weight is copied n times through a 1-to-n driver, and n groups of weight outputs are generated through a multiplexer MUX, each group comprising weights 1 to m+1; the n groups are sent to the n operation clusters respectively; where n represents the image-level parallelism of the operation array and m the operation parallelism within a single cluster.
CN202010947798.6A (priority date 2020-09-10, filing date 2020-09-10): Efficient memory computing system applied to CNN (convolutional neural network) convolution layer and operation method thereof. Status: Active. Granted as CN112052941B.

Priority Applications (1)

Application Number: CN202010947798.6A; Priority date: 2020-09-10; Filing date: 2020-09-10; Title: Efficient memory computing system applied to CNN (convolutional neural network) convolution layer and operation method thereof

Publications (2)

Publication Number: CN112052941A; Publication Date: 2020-12-08
Publication Number: CN112052941B; Publication Date: 2024-02-20

Family

ID: 73610422

Family Applications (1)

Application Number: CN202010947798.6A; Title: Efficient memory computing system applied to CNN (convolutional neural network) convolution layer and operation method thereof; Priority date: 2020-09-10; Filing date: 2020-09-10; Status: granted as CN112052941B

Country Status (1)

Country: CN; Publication: CN112052941B

Families Citing this family (1)

* Cited by examiner, † Cited by third party

CN113191493B *: priority 2021-04-27, published 2024-05-28, Beijing University of Technology (北京工业大学): Convolutional neural network accelerator based on FPGA parallelism self-adaptation


Family Cites Families (1)

* Cited by examiner, † Cited by third party

CN107679621B *: priority 2017-04-19, published 2020-12-08, Xilinx, Inc. (赛灵思公司): Artificial neural network processing device

Patent Citations (7)

* Cited by examiner, † Cited by third party

GB789486A *: priority 1955-01-27, published 1958-01-22, NCR Co.: Digital computer electrical data classifying system
US7843907B1 *: priority 2004-02-13, published 2010-11-30, Habanero Holdings, Inc.: Storage gateway target for fabric-backplane enterprise servers
CN108805792A *: priority 2017-04-28, published 2018-11-13, Intel Corporation (英特尔公司): Programmable coarse-grained and sparse matrix computing hardware with advanced scheduling
CN109886400A *: priority 2019-02-19, published 2019-06-14, Hefei University of Technology (合肥工业大学): Convolutional neural network hardware accelerator system based on convolution kernel splitting and calculation method thereof
CN109993297A *: priority 2019-04-02, published 2019-07-09, Nanjing Jixiang Sensing and Imaging Technology Research Institute Co., Ltd. (南京吉相传感成像技术研究院有限公司): Load-balanced sparse convolutional neural network accelerator and acceleration method thereof
CN110738308A *: priority 2019-09-23, published 2020-01-31, Chen Xiaobai (陈小柏): Neural network accelerator
CN111445012A *: priority 2020-04-28, published 2020-07-24, Nanjing University (南京大学): FPGA-based grouped convolution hardware accelerator and method thereof

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party

Yuxiang Fu et al., "A DSP-Purposed Reconfigurable Acceleration Machine (DREAM) for High Energy Efficiency MIMO Signal Processing," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 70, no. 2, pp. 952-965 *
Vivienne Sze et al., "Efficient Processing of Deep Neural Networks: A Tutorial and Survey," Proceedings of the IEEE, vol. 105, no. 12, pp. 2295-2329 *
Cen Chen et al., "GFlink: An In-Memory Computing Architecture on Heterogeneous CPU-GPU Clusters for Big Data," IEEE Transactions on Parallel and Distributed Systems, vol. 29, no. 6, pp. 1275-1288 *
Mei Zhiwei et al., "Design of a convolutional neural network acceleration module based on FPGA" (基于FPGA的卷积神经网络加速模块设计), Journal of Nanjing University (Natural Science), vol. 56, no. 4, pp. 581-590 *
Li Honglong et al., "High-speed programmable vision chip for real-time target detection" (用于实时目标检测的高速可编程视觉芯片), Infrared and Laser Engineering, vol. 49, no. 5, pp. 193-202 *

Also Published As

Publication Number: CN112052941A; Publication Date: 2020-12-08


Legal Events

Code: PB01; Title: Publication
Code: SE01; Title: Entry into force of request for substantive examination
Code: GR01; Title: Patent grant