CN112052941B - Efficient memory computing system applied to CNN (convolutional neural network) convolution layer and operation method thereof - Google Patents

Info

Publication number: CN112052941B
Authority: CN (China)
Prior art keywords: data, result, weight, convolution, module
Legal status: Active (assumed; Google has not performed a legal analysis)
Application number: CN202010947798.6A
Other languages: Chinese (zh)
Other versions: CN112052941A
Inventors: 李丽 (Li Li), 陈铠 (Chen Kai), 傅玉祥 (Fu Yuxiang), 宋文清 (Song Wenqing), 何国强 (He Guoqiang), 陈辉 (Chen Hui), 何书专 (He Shuzhuan)
Current Assignee: Nanjing University
Original Assignee: Nanjing University
Priority/Filing date: 2020-09-10
Application filed by Nanjing University
Publication of CN112052941A: 2020-12-08
Application granted; publication of CN112052941B: 2024-02-20

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10: Complex mathematical operations
    • G06F 17/15: Correlation function computation including computation of convolution operations
    • G06F 17/153: Multidimensional correlation or convolution
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00: Network arrangements or protocols for supporting network services or applications
    • H04L 67/50: Network services
    • H04L 67/56: Provisioning of proxy services
    • H04L 67/568: Storing data temporarily at an intermediate stage, e.g. caching
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention provides an efficient memory computing system applied to a CNN network convolution layer and an operation method thereof. The architecture comprises: a data caching module for caching result data; an operation array for performing highly parallel, fully pipelined convolution operations to obtain convolution results; a source data distribution module for reading image source data from the data cache and sending it to the operation array; a weight sharing module for reading weight data from the data cache, copying and regrouping it, and sending it to the operation array; and a result data writing module for storing the convolution results of the operation array into the data caching module. The architecture builds its operation array from fully pipelined parallel operation clusters and pairs it with a matching data buffer and a high-bandwidth data supply channel, so that high-performance operation of the dense convolution algorithm of CNN networks is achieved with low hardware complexity, giving the system good application prospects.

Description

Efficient memory computing system applied to CNN (convolutional neural network) convolution layer and operation method thereof
Technical Field
The invention relates to the field of artificial intelligence algorithms, and in particular to a hardware implementation method for high-density convolution operations in CNN network convolution layers.
Background
Neural networks are a branch of artificial intelligence research, of which the convolutional neural network (CNN) is currently the most popular; a basic CNN consists of three kinds of structure: convolution, activation, and pooling. In 1998, LeCun proposed the classical LeNet-5 network for the visual task of handwritten digit recognition, forming the prototype of the contemporary convolutional neural network. In recent years, with the rise of deep learning theory and the continuous improvement of the floating-point performance of GPUs used for heterogeneous training, deep convolutional neural networks have developed rapidly, producing a large number of high-accuracy networks such as AlexNet, VGG, GoogLeNet and MobileNet. In 2015, He Kaiming et al. proposed ResNet (the residual neural network) and successfully trained a 152-layer convolutional neural network, reducing the visual recognition error rate to 4.94%, below the 5.1% error rate of human recognition. Deep convolutional neural networks are now widely applied in speech recognition, image segmentation, natural language processing and other fields.
As network depth increases and the number of channels (i.e., the number of convolution kernels) grows, the convolution workload increases explosively and accounts for more than 80% of the computation of the whole CNN network, putting enormous pressure on highly real-time edge-side applications. Existing general-purpose processors based on the von Neumann architecture cannot meet the real-time inference requirements of neural network algorithms; a dedicated, efficient computing architecture must therefore be developed to raise convolution processing performance to the level demanded by real-time edge-side deployment of deep neural networks.
Disclosure of Invention
The purpose of the invention: the invention aims to improve the performance of CNN convolution layer implementations and to match storage, data supply, and operation resources efficiently. The invention provides an efficient memory computing system applied to a CNN network convolution layer, and further provides an operation method based on this architecture. The operation array is designed around fully pipelined parallel operation clusters, and a data buffer and a high-bandwidth data supply channel matched to the array are designed, so that high-performance operation of the dense convolution algorithm of CNN networks is realized with low hardware complexity, better meeting the performance requirements of practical convolutional neural network applications.
The technical scheme is as follows: an efficient memory computing system applied to a CNN network convolution layer comprises the following modules:
the data caching module, which stores the CNN image source data set, stores the weight data (including convolution kernels and biases), caches the result data, and provides a read-write interface to peripheral devices;
the operation array, which performs convolution on the input data and weight data to obtain the convolution results;
the source data distribution module, which generates source data read addresses, reads the image source data from the source data BANKs, and sends it to the operation array;
the weight sharing module, which reads the weight data (including convolution kernels and biases) required by the convolution operation, copies and regroups it through 1-to-many drivers, and sends it to the operation array;
and the result data writing module, which generates result write addresses and stores the convolution results of the operation array into the data caching module.
In a further embodiment, the data caching module comprises a source data buffer area, a weight buffer area and a result buffer area, and provides a read-write interface for communication with peripheral devices; the source data buffer area stores the image source data set, the weight buffer area stores the weight data, and the result buffer area caches the result data.
In a further embodiment, the operation array comprises n identical operation clusters, where n is the image-level parallelism: the array can process the convolutions of n images simultaneously.
In a further embodiment, each operation cluster comprises m 16-bit fixed-point multiply-accumulators, at least one 16-bit fixed-point adder, a source data sharing module and a data rearrangement module, where m is the operation parallelism within a single cluster; the total convolution parallelism is n×m, whose value depends on the available operation resources.
In a further embodiment, the parallelism of the operation array when performing convolution is n×m; the source data sharing module is further configured to copy the 1-way input data into m copies through a 1-to-m driver and distribute them to the m multiply-accumulators, where m is the operation parallelism within a single cluster.
In a further embodiment, the data rearrangement module is further configured to take the results of the m multiply-accumulators that arrive at the same instant and, through delay units and a multiplexer (MUX), multiplex them onto 1 output port to form a single result output; where m is the operation parallelism within a single cluster.
In a further embodiment, the operation cluster has m+1 weight inputs and 1 source data input, and forms an m-way parallel pipeline structure through internal interconnection. The source data sharing module produces m copies of the source data, which are sent to the 1st inputs of the m multiply-accumulators; weights 1 to m are sent to the 2nd inputs of multiply-accumulators 1 to m, which operate simultaneously. The calculation results are sent to the data rearrangement module, whose output feeds input 1 of the adder; weight m+1 feeds input 2 of the adder; and the adder output, as the convolution result of the operation cluster, is sent to the result data writing module.
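For illustration only (a behavioral model, not the claimed hardware), the dataflow of one operation cluster may be sketched in Python as follows; the function names are hypothetical, Python integers stand in for the 16-bit fixed-point arithmetic, and per-kernel biases stand in for weight m+1:

```python
from typing import List

def multiply_accumulate(window: List[int], weights: List[int]) -> int:
    """One 16-bit fixed-point multiply-accumulator, modeled as a dot
    product over the current input window."""
    return sum(x * w for x, w in zip(window, weights))

def operation_cluster(window: List[int],
                      kernel_weights: List[List[int]],  # weights 1..m
                      biases: List[int]) -> List[int]:  # biases (weight m+1)
    """Behavioral model of one cluster: the source window is copied m
    times (1-to-m driver), the m multiply-accumulators run in parallel,
    the data rearrangement module serializes their results, and the adder
    applies each kernel's bias, yielding m output-channel points."""
    m = len(kernel_weights)
    shared = [window] * m                       # source data sharing
    mac_results = [multiply_accumulate(shared[i], kernel_weights[i])
                   for i in range(m)]           # m MACs in parallel
    # rearrangement serializes the m simultaneous results; the adder then
    # adds the bias corresponding to each kernel
    return [r + b for r, b in zip(mac_results, biases)]
```

A call such as operation_cluster(window, kernels, biases) then corresponds to producing the m output-channel points of one output pixel.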
In a further embodiment, the weight sharing module has m+1 weight inputs; each weight is copied n times through a 1-to-n driver, and n groups of weight outputs are generated through the multiplexer MUX, each group containing weights 1 to m+1; the n groups are sent to the n operation clusters respectively.
In a further embodiment, the data cache is divided into a source data buffer area, a weight buffer area and a result buffer area: the source data buffer area contains n data BANKs (source data BANK1 to source data BANKn), the weight buffer area contains m+1 weight BANKs (weight BANK1 to weight BANKm+1), and the result buffer area contains n result BANKs (result BANK1 to result BANKn).
In a further embodiment, each BANK has a set of independent read ports and a set of independent write ports.
In a further embodiment, the mapping of the source data buffer area is defined as follows: BANK1 stores the 1st, (n+1)th, (2n+1)th, … image data, which is sent through the source data distribution module and the source data sharing module of operation cluster 1 to the 1st inputs of multiply-accumulators 1 to m of operation cluster 1; BANK2 stores the 2nd, (n+2)th, (2n+2)th, … image data, which is sent through the source data distribution module and the source data sharing module of operation cluster 2 to the 1st inputs of multiply-accumulators 1 to m of operation cluster 2; and so on, until BANKn stores the nth, 2nth, 3nth, … image data, which is sent through the source data distribution module and the source data sharing module of operation cluster n to the 1st inputs of multiply-accumulators 1 to m of operation cluster n.
In a further embodiment, the mapping of the weight buffer area is defined as follows: BANK1 stores the 1st, (m+1)th, (2m+1)th, … convolution kernel data, which is sent through the weight sharing module to the 2nd input of multiply-accumulator 1 of each of the n operation clusters; BANK2 stores the 2nd, (m+2)th, (2m+2)th, … convolution kernel data, which is sent through the weight sharing module to the 2nd input of multiply-accumulator 2 of each of the n operation clusters; and so on, until BANKm stores the mth, 2mth, 3mth, … convolution kernel data, which is sent through the weight sharing module to the 2nd input of multiply-accumulator m of each of the n operation clusters; BANKm+1 stores the bias parameters, which are sent through the weight sharing module to the 2nd input of the adder of each of the n operation clusters.
In a further embodiment, the mapping of the result buffer area is defined as follows: BANK1 stores the result data of operation cluster 1, i.e., the 1st, (n+1)th, (2n+1)th, … image convolution results; BANK2 stores the result data of operation cluster 2, i.e., the 2nd, (n+2)th, (2n+2)th, … image convolution results; and so on, until BANKn stores the result data of operation cluster n, i.e., the nth, 2nth, 3nth, … image convolution results.
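The three mappings above reduce to modular index arithmetic. The following sketch is illustrative only (1-based indices; the function names are hypothetical):

```python
def source_bank(image_idx: int, n: int) -> int:
    """Image i is stored in source data BANK ((i-1) mod n) + 1 and is
    processed by the operation cluster with the same index."""
    return (image_idx - 1) % n + 1

def weight_bank(kernel_idx: int, m: int) -> int:
    """Convolution kernel j is stored in weight BANK ((j-1) mod m) + 1 and
    feeds the multiply-accumulator of the same index in every cluster; the
    bias parameters live separately in weight BANK m+1."""
    return (kernel_idx - 1) % m + 1

def result_bank(image_idx: int, n: int) -> int:
    """Result BANKs mirror the source mapping: cluster k writes BANK k."""
    return (image_idx - 1) % n + 1

# Example: with n = 4 clusters, image 6 lives in source data BANK 2, is
# processed by cluster 2, and its results land in result BANK 2.
assert source_bank(6, 4) == 2 == result_bank(6, 4)
```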
Based on the above efficient memory computing architecture, the invention further proposes a specific procedure for implementing the CNN convolution layer algorithm:
Step 1) Write source data: the image source data is stored through the source data write port into the source data buffer area of the data cache, and the weight data required by the convolution operation is stored into the weight buffer area of the data cache.
Step 2) Data transmission: the source data distribution module generates read addresses, reads source data of different images from the n source data BANKs, and sends it to operation clusters 1 to n respectively; meanwhile, the weight sharing module generates read addresses, reads weight data from weight BANK1 to weight BANKm+1 to form n identical groups of weight data, and sends them to operation clusters 1 to n respectively.
Step 3) Convolution operation: the operation clusters perform convolution in parallel pipeline fashion; clusters 1 to n process the convolutions of n images in parallel, m convolution kernels are computed in parallel within each cluster, the total convolution parallelism reaches n×m, and each cluster generates m result points in the output-channel dimension.
Step 4) Result caching: the result data writing module receives the convolution results of the n operation clusters, i.e., the convolution results of n images, and stores them into the result buffer area of the data cache.
Step 5) Repeat steps 2) to 4) until the convolution of all result points of the images is complete.
Step 6) Read results: the image convolution results are read out through the result data read port.
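For illustration only, steps 2) to 5) may be summarized by the following behavioral loop; the function names are hypothetical, images are single-channel for brevity, and Python integers stand in for the 16-bit fixed-point arithmetic:

```python
def dot(xs, ws):
    """One multiply-accumulate pass over a flattened window."""
    return sum(x * w for x, w in zip(xs, ws))

def sliding_windows(image, k, stride):
    """Yield flattened k*k windows of a 2-D image, row-major over output points."""
    rows, cols = len(image), len(image[0])
    for r in range(0, rows - k + 1, stride):
        for c in range(0, cols - k + 1, stride):
            yield [image[r + dr][c + dc] for dr in range(k) for dc in range(k)]

def run_convolution_layer(images, kernels, biases, n, m, k, stride=1):
    """n: image parallelism (clusters); m: kernel parallelism per cluster.
    Returns, per image, one list of m output-channel points per window."""
    results = [[] for _ in images]                     # result buffer area
    for g in range(0, len(images), n):                 # step 2: n images per pass
        for kg in range(0, len(kernels), m):           # m kernels per pass
            for cl in range(min(n, len(images) - g)):  # cluster cl gets image g+cl
                i = g + cl
                for window in sliding_windows(images[i], k, stride):  # step 5
                    points = [dot(window, kernels[kg + j]) + biases[kg + j]
                              for j in range(min(m, len(kernels) - kg))]  # step 3
                    results[i].append(points)          # step 4: result caching
    return results                                     # step 6: read out
```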
Beneficial effects:
(1) The invention provides an efficient memory computing architecture that achieves high-performance operation of the CNN convolution layer algorithm.
(2) The invention designs an operation cluster structure whose main computing units are multiply-accumulators and adders, which are easy to implement in hardware; a fully pipelined, multi-way parallel operation mode shortens the operation period of the convolution algorithm.
(3) Using only 2n+m+1 storage BANK read ports, the invention supplies, through the weight sharing module and the in-cluster source data sharing modules, the 2×n×m input ports of the n×m multiply-accumulators and the n input ports of the n adders, achieving high-bandwidth data supply to the operation components of the array with low hardware complexity and thereby sustaining fully pipelined operation of every operation cluster (a worked example follows this list).
(4) The invention designs a data buffer divided into three independent areas, namely a source data buffer area, a weight buffer area and a result buffer area, achieving conflict-free access while the convolution algorithm runs.
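As a worked illustration of effect (3), take example design parameters n = 4 and m = 16 (illustrative values, not prescribed by the invention):

$$\underbrace{2n+m+1}_{\text{BANK read ports}} = 2\cdot 4+16+1 = 25 \quad\text{feed}\quad \underbrace{2nm}_{\text{MAC inputs}} + \underbrace{n}_{\text{adder inputs}} = 2\cdot 4\cdot 16 + 4 = 132$$

operand ports, a roughly fivefold fan-out obtained from the 1-to-m and 1-to-n sharing drivers rather than from additional memory ports.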
In conclusion, the invention can effectively improve the operation performance of the CNN network convolution layer and has good application value.
Drawings
Fig. 1 is a schematic diagram of the efficient memory computing system applied to a CNN network convolution layer according to the present invention.
FIG. 2 is a schematic diagram of an operation cluster structure according to the present invention.
Fig. 3 is a schematic diagram of the multiply-accumulator architecture of the present invention.
Fig. 4 is a schematic diagram of a source data sharing module according to the present invention.
Fig. 5 is a schematic diagram of a data rearrangement module according to the present invention.
FIG. 6 is a diagram illustrating a source data buffer mapping according to the present invention.
FIG. 7 is a diagram illustrating a weight buffer mapping according to the present invention.
FIG. 8 is a diagram illustrating a result buffer map according to the present invention.
Fig. 9 is a schematic diagram of a weight sharing module according to the present invention.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a more thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the invention may be practiced without one or more of these details. In other instances, well-known features have not been described in detail in order to avoid obscuring the invention.
The technical solution for realizing the purpose of the invention is as follows: the operation array is designed based on fully pipelined parallel operation clusters, and a data buffer and a high-bandwidth data supply channel matched to the array are designed, forming an efficient memory computing system design applied to the CNN network convolution layer and realizing high-performance hardware acceleration of the CNN convolution layer.
Example 1
As shown in FIG. 1, the efficient memory computing system applied to the CNN network convolution layer provided by the invention comprises an operation array, a data cache, a source data distribution module, a weight sharing module and a result data writing module. The external interfaces are a source data write interface, a weight write interface and a result data read interface.
(1) Operation array
The operation array consists of n identical operation clusters and can process the convolutions of n images simultaneously; the clusters work independently and have independent input and output data interfaces. Each operation cluster consists of m 16-bit fixed-point multiply-accumulators, 1 16-bit fixed-point adder, a source data sharing module and a data rearrangement module; the total convolution parallelism of the array is n×m, whose value depends on the available operation resources. As shown in fig. 3, a multiply-accumulator has 2 inputs and 1 output and is internally composed of 1 16-bit fixed-point multiplier and 1 16-bit fixed-point adder. As shown in fig. 4, the source data sharing module consists mainly of a 1-to-m driver, which copies the source data m times and distributes the copies to the m multiply-accumulators. As shown in fig. 5, the data rearrangement module multiplexes the m multiply-accumulator results arriving at the same instant onto 1 output port through delay units and a multiplexer MUX, forming a single result output. As shown in fig. 2, the operation cluster has m+1 weight inputs and 1 source data input and forms an m-way parallel pipeline structure through internal interconnection: the source data sharing module produces m copies of the source data, which are sent to the 1st inputs of the m multiply-accumulators; weights 1 to m are sent to the 2nd inputs of multiply-accumulators 1 to m, which operate simultaneously; the calculation results are sent to the data rearrangement module, whose output feeds input 1 of the adder; weight m+1 feeds input 2 of the adder; and the adder output, as the convolution result of the cluster, is sent to the result data writing module.
(2) Data caching
The data cache stores the CNN inference image data set, stores the inference weight data (including convolution kernels and biases), caches the result data, and provides a flexible read-write interface to peripheral devices. It is divided into a source data buffer area, a weight buffer area and a result buffer area: the source data buffer area contains n data BANKs (source data BANK1 to source data BANKn), the weight buffer area contains m+1 weight BANKs (weight BANK1 to weight BANKm+1), and the result buffer area contains n result BANKs (result BANK1 to result BANKn). Each BANK has a set of independent read ports and a set of independent write ports.
The mapping of the source data buffer area is shown in fig. 6: BANK1 stores the 1st, (n+1)th, (2n+1)th, … image data, which is sent through the source data distribution module and the source data sharing module of operation cluster 1 to the 1st inputs of multiply-accumulators 1 to m of operation cluster 1; BANK2 stores the 2nd, (n+2)th, (2n+2)th, … image data, which is sent through the source data distribution module and the source data sharing module of operation cluster 2 to the 1st inputs of multiply-accumulators 1 to m of operation cluster 2; and so on, until BANKn stores the nth, 2nth, 3nth, … image data, which is sent through the source data distribution module and the source data sharing module of operation cluster n to the 1st inputs of multiply-accumulators 1 to m of operation cluster n.
The mapping of the weight buffer area is shown in fig. 7: BANK1 stores the 1st, (m+1)th, (2m+1)th, … convolution kernel data, which is sent through the weight sharing module to the 2nd input of multiply-accumulator 1 of each of the n operation clusters; BANK2 stores the 2nd, (m+2)th, (2m+2)th, … convolution kernel data, which is sent through the weight sharing module to the 2nd input of multiply-accumulator 2 of each of the n operation clusters; and so on, until BANKm stores the mth, 2mth, 3mth, … convolution kernel data, which is sent through the weight sharing module to the 2nd input of multiply-accumulator m of each of the n operation clusters; BANKm+1 stores the bias parameters, which are sent through the weight sharing module to the 2nd input of the adder of each of the n operation clusters.
The mapping of the result buffer area is shown in fig. 8: BANK1 stores the result data of operation cluster 1, i.e., the 1st, (n+1)th, (2n+1)th, … image convolution results; BANK2 stores the result data of operation cluster 2, i.e., the 2nd, (n+2)th, (2n+2)th, … image convolution results; and so on, until BANKn stores the result data of operation cluster n, i.e., the nth, 2nth, 3nth, … image convolution results.
(3) Source data distribution
The source data distribution module generates source data read addresses, reads the image source data to be inferred from the source data BANKs, and sends it to the operation array. During distribution, the source data of source data BANK1 is sent to the source data input interface of operation cluster 1, the source data of source data BANK2 to the source data input interface of operation cluster 2, and so on, until the source data of source data BANKn is sent to the source data input interface of operation cluster n.
(4) Weight sharing
The weight sharing module reads the weight data required for inference (including convolution kernels and biases), copies and regroups it through 1-to-many drivers, and sends it to the operation array. As shown in fig. 9, the weight sharing module has m+1 weight inputs; each weight is copied n times through a 1-to-n driver, and n groups of weight outputs are generated through the multiplexer MUX, each group containing weights 1 to m+1; the n groups are sent to the n operation clusters respectively.
(5) Result data writing
The result data writing module generates result data write addresses and stores the convolution results of the operation array into the data cache. During storage, the convolution result of operation cluster 1 is stored into result BANK1, the convolution result of operation cluster 2 into result BANK2, and so on, until the convolution result of operation cluster n is stored into result BANKn.
Example 2
This embodiment realizes high-performance hardware acceleration of the CNN convolution layer algorithm on the basis of the efficient memory computing architecture described in Example 1.
A convolution computation involves the following parameters: input image size (In×In), number of input image channels (Ch), convolution kernel size (K×K×Ch), number of convolution kernels (Nu), step size (St), and output image size (Ou×Ou); the number of output image channels equals the number of convolution kernels Nu.
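Assuming no padding (padding is not mentioned in this embodiment), these parameters satisfy the usual size relation:

$$Ou = \left\lfloor \frac{In - K}{St} \right\rfloor + 1,\qquad \text{output channels} = Nu.$$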
The formula for a single convolution operation, which produces 1 point of the output image, is:

$$\text{out} = \sum_{H=1}^{K}\sum_{W=1}^{K}\sum_{C=1}^{Ch} X(H, W, C)\cdot WT(H, W, C) + b$$

where H denotes the row of the two-dimensional matrix, W the column, and C the image channel; (H, W, C) of X(H, W, C) determines the position of the input data within the current K×K×Ch window of the image source data, (H, W, C) of WT(H, W, C) determines the position of the weight data within the convolution kernel, and b is the bias parameter.
The steps for realizing the convolution algorithm on the high-efficiency memory architecture are as follows:
Step 1) Write source data: through the source data write port, multiple images of source data are stored into the source data buffer area of the data cache in the manner of fig. 6, and the convolution kernels (WT) and biases (b) required by the convolution operation are stored into the weight buffer area in the manner of fig. 7. Within each BANK the data is stored in (H, W, C) order, i.e., addresses advance first along the channel dimension, then the column dimension, then the row dimension.
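For illustration, such (H, W, C) ordering corresponds to a linear address function like the following (a hypothetical helper with 0-based indices; the actual address generation is done in hardware):

```python
def bank_address(h: int, w: int, c: int, In: int, Ch: int) -> int:
    """Linear BANK address of sample (h, w, c) for an In x In x Ch image
    laid out in (H, W, C) order: the channel index varies fastest, then
    the column index, then the row index."""
    return (h * In + w) * Ch + c

# Example: in a 224 x 224 x 3 image, the first sample of row 1 sits at
# address 224 * 3 = 672.
assert bank_address(1, 0, 0, In=224, Ch=3) == 672
```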
Step 2) data transmission:
The source data distribution module generates read addresses and reads K×K×Ch points of source data of different images from the n source data BANKs, sending them to operation clusters 1 to n respectively; inside each cluster, the data travels over the data path shown in fig. 2, through the cluster's source data sharing module, to the 1st inputs of multiply-accumulators 1 to m. Operation cluster 1 processes the source data of image 1, operation cluster 2 that of image 2, and so on, until operation cluster n processes the source data of image n.
Meanwhile, the weight sharing module generates read addresses, reads m convolution kernels (K×K×Ch each) from weight BANK1 to weight BANKm, and reads the bias parameters corresponding to those m kernels from weight BANKm+1. The m kernels correspond to outputs weight 1 to weight m in fig. 9 and are sent over the in-cluster data paths of fig. 2 to the 2nd inputs of multiply-accumulators 1 to m respectively; the bias parameters correspond to output weight m+1 in fig. 9 and are sent over the in-cluster data path of fig. 2 to the 2nd input of the adder. The weights 1 to m+1 fed to the different clusters are identical.
Step 3) convolution operation:
The operation clusters perform convolution in parallel pipeline fashion: clusters 1 to n process the convolutions of n images in parallel, m convolution kernels are computed in parallel within each cluster, and the total convolution parallelism reaches n×m. As shown in fig. 9, m multiply-accumulate operations proceed in parallel within each cluster, each over K×K×Ch data points; the m multiply-accumulate results arriving at the same instant pass through the data rearrangement module, are converted into a single-way result stream, and are sent to the adder, where weight m+1, i.e., the bias parameter (b), is added to complete the convolution and form m result points in the output-channel dimension.
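The rearrangement step in particular can be modeled as time-multiplexing the m simultaneous results onto one port (an illustrative sketch with hypothetical names; the hardware delay units become list offsets here):

```python
from typing import Iterable, List

def rearrange(mac_results_per_instant: Iterable[List[int]]) -> List[int]:
    """Behavioral model of the data rearrangement module: at each arrival
    instant, m multiply-accumulator results appear simultaneously; delay
    units stagger them so the MUX can emit them one per cycle on the
    single output port feeding the adder."""
    serialized: List[int] = []
    for results in mac_results_per_instant:  # one entry per arrival instant
        serialized.extend(results)           # MUX walks outputs 1..m in order
    return serialized

# Example: two arrival instants with m = 4 results each become 8
# sequential values on the single result port.
assert rearrange([[1, 2, 3, 4], [5, 6, 7, 8]]) == [1, 2, 3, 4, 5, 6, 7, 8]
```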
Step 4) result caching:
The result data writing module receives the convolution results of the n operation clusters, i.e., the convolution results of n images, and stores them into the data cache as shown in fig. 8.
Step 5) Repeat steps 2) to 4) until the convolution of all result points of the images is complete.
Step 6) reading the result: the image convolution result is read out through the result data read-out port.
In summary, the efficient memory computing system design applied to the CNN network convolution layer provided by this example builds its operation array from fully pipelined parallel operation clusters and provides a matching data buffer and high-bandwidth data supply channel. Using only 2n+m+1 storage BANK read ports, the weight sharing module and the in-cluster source data sharing modules supply the 2×n×m input ports of the n×m multiply-accumulators and the n input ports of the n adders, realizing high-performance operation of the CNN convolution layer algorithm with low hardware complexity; the design therefore has good application prospects.
As described above, although the present invention has been shown and described with reference to certain preferred embodiments, it is not to be construed as limiting the invention itself. Various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (4)

1. An efficient memory computing system applied to a CNN network convolution layer, characterized by comprising the following modules:
the data caching module is used for caching result data;
the operation array, which performs highly parallel, fully pipelined convolution operations to obtain convolution results; the operation array comprises n identical operation clusters, where n represents the image-level parallelism; each operation cluster comprises m 16-bit fixed-point multiply-accumulators, 1 16-bit fixed-point adder, a source data sharing module and a data rearrangement module, where m represents the operation parallelism within a single cluster;
the data rearrangement module is further configured to take the results of the m multiply-accumulators arriving at the same instant and multiplex them onto 1 output port through delay units and a multiplexer MUX, forming a single result output; where m represents the operation parallelism within a single cluster;
the operation cluster has m+1 weight inputs and 1 source data input and forms an m-way parallel pipeline structure through internal interconnection; the source data is sent through the source data sharing module to the 1st inputs of the m multiply-accumulators; weights 1 to m are sent to the 2nd inputs of multiply-accumulators 1 to m, which operate simultaneously; the calculation results are sent to the data rearrangement module, whose output feeds input 1 of the adder; weight m+1 feeds input 2 of the adder; and the output of the adder serves as the convolution result of the operation cluster; where n represents the image-level parallelism of the operation array and m the operation parallelism within a single cluster;
the source data distribution module is used for reading the image source data in the data cache and sending the image source data to the operation array;
the weight sharing module is used for reading weight data in the data cache, copying and regrouping the data and sending the data to the operation array;
and the result data writing module is used for storing the convolution calculation result of the operation array into the data caching module.
2. The efficient memory computing system applied to a CNN network convolution layer of claim 1, wherein the data caching module further comprises a source data buffer area, a weight buffer area and a result buffer area, and provides a read-write interface for communication with peripheral devices; the source data buffer area stores the image source data set, the weight buffer area stores the weight data, and the result buffer area caches the result data.
3. The efficient memory computing system applied to a CNN network convolution layer of claim 1, wherein the parallelism of the operation array when performing convolution is n×m; the source data sharing module is further configured to copy the 1-way input data into m copies through a 1-to-m driver and distribute them to the m multiply-accumulators, where m represents the operation parallelism within a single cluster.
4. The efficient memory computing system applied to a CNN network convolution layer of claim 1, wherein the weight sharing module comprises m+1 weight inputs; each weight is copied n times through a 1-to-n driver, and n groups of weight outputs are generated through a multiplexer MUX, each group comprising weights 1 to m+1; the n groups are sent to the n operation clusters respectively; where n represents the image-level parallelism of the operation array and m the operation parallelism within a single cluster.
CN202010947798.6A (priority date 2020-09-10, filing date 2020-09-10): Efficient memory computing system applied to CNN (convolutional neural network) convolution layer and operation method thereof. Status: Active. Granted as CN112052941B.

Priority Applications (1)

Application Number: CN202010947798.6A; Priority date: 2020-09-10; Filing date: 2020-09-10; Title: Efficient memory computing system applied to CNN (convolutional neural network) convolution layer and operation method thereof

Publications (2)

Publication Number: CN112052941A; Publication Date: 2020-12-08
Publication Number: CN112052941B; Publication Date: 2024-02-20

Family

ID: 73610422

Family Applications (1)

Application Number: CN202010947798.6A; Title: Efficient memory computing system applied to CNN (convolutional neural network) convolution layer and operation method thereof; Priority date: 2020-09-10; Filing date: 2020-09-10; Status: granted as CN112052941B

Country Status (1)

Country: CN; Publication: CN112052941B

Families Citing this family (1)

* Cited by examiner, † Cited by third party

CN113191493B *: priority 2021-04-27, published 2024-05-28, Beijing University of Technology (北京工业大学): Convolutional neural network accelerator based on FPGA parallelism self-adaptation


Family Cites Families (1)

* Cited by examiner, † Cited by third party

CN107679621B *: priority 2017-04-19, published 2020-12-08, Xilinx, Inc. (赛灵思公司): Artificial neural network processing device

Patent Citations (7)

* Cited by examiner, † Cited by third party

GB789486A *: priority 1955-01-27, published 1958-01-22, NCR Co.: Digital computer electrical data classifying system
US7843907B1 *: priority 2004-02-13, published 2010-11-30, Habanero Holdings, Inc.: Storage gateway target for fabric-backplane enterprise servers
CN108805792A *: priority 2017-04-28, published 2018-11-13, Intel Corporation (英特尔公司): Programmable coarse-grained and sparse matrix computing hardware with advanced scheduling
CN109886400A *: priority 2019-02-19, published 2019-06-14, Hefei University of Technology (合肥工业大学): Convolutional neural network hardware accelerator system based on convolution kernel splitting and calculation method thereof
CN109993297A *: priority 2019-04-02, published 2019-07-09, Nanjing Jixiang Sensing and Imaging Technology Research Institute Co., Ltd. (南京吉相传感成像技术研究院有限公司): Load-balanced sparse convolutional neural network accelerator and acceleration method thereof
CN110738308A *: priority 2019-09-23, published 2020-01-31, Chen Xiaobai (陈小柏): Neural network accelerator
CN111445012A *: priority 2020-04-28, published 2020-07-24, Nanjing University (南京大学): FPGA-based grouped convolution hardware accelerator and method thereof

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party

Yuxiang Fu et al., "A DSP-Purposed Reconfigurable Acceleration Machine (DREAM) for High Energy Efficiency MIMO Signal Processing," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 70, no. 2, pp. 952-965 *
Vivienne Sze et al., "Efficient Processing of Deep Neural Networks: A Tutorial and Survey," Proceedings of the IEEE, vol. 105, no. 12, pp. 2295-2329 *
Cen Chen et al., "GFlink: An In-Memory Computing Architecture on Heterogeneous CPU-GPU Clusters for Big Data," IEEE Transactions on Parallel and Distributed Systems, vol. 29, no. 6, pp. 1275-1288 *
Mei Zhiwei et al., "Design of a convolutional neural network acceleration module based on FPGA" (基于FPGA的卷积神经网络加速模块设计), Journal of Nanjing University (Natural Science), vol. 56, no. 4, pp. 581-590 *
Li Honglong et al., "High-speed programmable vision chip for real-time target detection" (用于实时目标检测的高速可编程视觉芯片), Infrared and Laser Engineering, vol. 49, no. 5, pp. 193-202 *

Also Published As

Publication Number: CN112052941A; Publication Date: 2020-12-08


Legal Events

Code: PB01; Title: Publication
Code: SE01; Title: Entry into force of request for substantive examination
Code: GR01; Title: Patent grant