CN110188869B - Method and system for integrated circuit accelerated calculation based on convolutional neural network algorithm
- Publication number: CN110188869B (application CN201910368448.1A)
- Authority: CN (China)
- Prior art keywords: data, convolution kernel, unit, input, multiplication
- Legal status: Active
Classifications
- G06N3/045—Combinations of networks
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
Abstract
The invention belongs to the technical field of artificial intelligence, and in particular relates to a method and a system for accelerated computation of an integrated circuit based on a convolutional neural network algorithm. The system comprises a queue of multiply accumulator units into which convolution kernel data and external data are input in parallel from different directions; each multiply accumulator unit in the queue simultaneously performs, in parallel, the corresponding multiply-accumulate processing on the convolution kernel data and external data flowing through it, and outputs the results to a data storage unit. The invention addresses the prior-art problems that the computation load of a convolutional neural network is huge, that real-time operation on an integrated circuit or embedded device is difficult, and that conventional processors based mainly on serial architectures cannot easily meet the requirements, so that completing convolutional neural network operations quickly is an important problem to be solved. The method features few memory reads, high operation throughput, and low bandwidth requirements, and greatly improves the real-time performance of convolutional neural network operation.
Description
Technical Field
The invention belongs to the technical field of artificial intelligence and particularly relates to a method for accelerated computation of an integrated circuit based on a convolutional neural network algorithm; it also provides a corresponding system for accelerated computation of an integrated circuit based on a convolutional neural network algorithm.
Background
A convolutional neural network is a feedforward neural network, commonly applied to image recognition, and generally comprises convolutional layers, pooling layers, and fully connected layers. The convolution operation of a convolutional layer multiplies each weight in the convolution kernel point-to-point with its corresponding input datum and accumulates the products to obtain one output datum; the convolution kernel is then slid according to the stride setting of the convolutional layer and the operation is repeated. A disadvantage of the prior art is that the computation load of a convolutional neural network is huge: it is difficult to compute in real time on an integrated circuit or embedded device, and conventional processors based mainly on serial architectures have difficulty meeting the requirement. How to complete convolutional neural network operations quickly is therefore an important problem to be solved.
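For concreteness, the sliding-window convolution just described can be sketched in a few lines of Python. This is an illustrative reference model only; the function name, the list-of-lists representation, and the single-channel input are our own assumptions, not part of the patent:

```python
def conv2d(image, kernel, stride):
    """Slide `kernel` over `image`; each output value is the sum of the
    point-to-point products of the kernel weights and the input patch."""
    n = len(kernel)                                  # kernel is n x n
    rows = (len(image) - n) // stride + 1
    cols = (len(image[0]) - n) // stride + 1
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):                            # slide vertically
        for j in range(cols):                        # slide horizontally
            acc = 0
            for u in range(n):
                for v in range(n):
                    acc += kernel[u][v] * image[i * stride + u][j * stride + v]
            out[i][j] = acc                          # one accumulated output
    return out
```

On a serial processor every one of these multiply-accumulates executes in turn, which is precisely the bottleneck described above.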
Disclosure of Invention
The invention provides a method and a system for accelerated computation of an integrated circuit based on a convolutional neural network algorithm, aiming at the prior-art problems that the computation load of a convolutional neural network is huge, that real-time operation on an integrated circuit or embedded device is difficult, and that conventional processors based mainly on serial architectures cannot easily meet the requirements, so that completing convolutional neural network operations quickly is an important problem to be solved.
The technical problem solved by the invention is realized by adopting the following technical scheme: a method for accelerating computation of an integrated circuit based on a convolutional neural network algorithm comprises the following steps:
inputting convolution kernel data and external data from different directions to a multiplication accumulator unit queue in parallel;
and each multiply accumulator unit in the multiply accumulator unit queue simultaneously performs, in parallel, the corresponding multiply-accumulate processing on the convolution kernel data and external data flowing through it, and outputs the results respectively to the data storage unit.
Further, the method further comprises:
inputting at least one item of convolution kernel data into each corresponding multiply accumulator unit;
inputting the external data of the input queues into the corresponding multiply accumulator units in queue order;
and performing the convolution operation on the convolution kernel data and external data in each multiply accumulator unit of the column simultaneously and in parallel through a multi-stage pipeline technique.
Further, the method further comprises:
the convolution kernel data is a pre-designed convolution kernel matrix, and the external data is data continuously generated by an external input device;
reading in the pre-designed convolution kernel matrix and the data continuously generated by the external input device through an external memory reading engine; distributing, through data distribution, the convolution kernels and the external data in the read data among a convolution kernel data input queue, a header-data preprocessing buffer, and the corresponding data input queues, wherein the header data of the external data is allocated to the header-data preprocessing buffer and the non-header data is allocated to the corresponding data input queues; the convolution kernel data input queue allocating each convolution kernel to the corresponding multiply accumulator unit of the column of multiply accumulator units; the header-data preprocessing buffer outputting the header data of the external data to the corresponding multiply accumulator units of the queue; the data input queues cyclically outputting the non-header data of the external data to the corresponding multiply accumulator units of the queue; and the corresponding multiply accumulators of the queue performing the corresponding multiply-accumulate operations and respectively outputting the results to the output data storage.
Further, if the data continuously generated by the external input device is taken as the external data, the operation function of the convolutional neural network (CNN) is:
convolution operation result = convolution of the convolution kernel matrix with the external data matrix;
convolution kernel matrix: a linear convolution kernel data matrix;
external data matrix: an external data matrix having a two-dimensional data structure and containing M × N external data;
convolution operation result: a convolution operation result matrix having a two-dimensional data structure and containing M × N convolution operation results, each result being the accumulated sum of the products of the convolution kernel data and the corresponding external data.
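Written out in our notation (the patent gives only the verbal definition above), with an n × n convolution kernel K, sliding step s, external data X, and activation function f, each element of the convolution operation result is:

```latex
y_{i,j} = f\left( \sum_{u=1}^{n} \sum_{v=1}^{n} K_{u,v} \, X_{(i-1)s+u,\,(j-1)s+v} \right)
```

Here n and s are the kernel width and sliding step written elsewhere in this document as N and M; the symbols are renamed only to avoid clashing with the M × N matrix dimensions used just above.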
Further, the multiply accumulator units form only a single queue of multiply accumulator units.
Further, the multiply accumulator units process the data matrix in a sequential data processing mode, and the sequential data processing mode comprises:
processing the data in top-to-bottom order; or
processing the data in bottom-to-top order; or
processing the data in left-to-right order; or
processing the data in right-to-left order.
Further, the input processing of the multiply accumulator unit comprises an initialization phase and a subsequent phase; the initialization phase covers the 1st through the N × N-th clock cycle, and the subsequent phase covers the clock cycles after the N × N-th clock cycle.
The initialization phase adopts a first input rule of the external data flow closed loop, and the subsequent phase adopts a second input rule of the external data flow closed loop;
the first input rule comprises, before a multiply accumulator unit operates:
if this is the 1st multiply accumulator unit and the read is the 1st through (N-M)-th, reading the external data header preprocessing buffer, whose data come from columns 1 through N and rows 1 through N-M;
if this is the 1st multiply accumulator unit and the read is the (N-M+1)-th through N-th, reading data input queue 1, cycling repeatedly in this order;
if this is not the 1st multiply accumulator and the read is the 1st through (N-M)-th, reading the data output by the previous-stage multiply accumulator unit, and suspending operation if no such data is available;
if this is not the 1st multiply accumulator and the read is the (N-M+1)-th through N-th, reading the corresponding data input queue, cycling repeatedly in this order.
The second input rule comprises, before a multiply accumulator unit operates:
if this is the 1st multiply accumulator unit and the read is the 1st through (N-M)-th, reading the data output by the last multiply accumulator unit;
if this is the 1st multiply accumulator unit and the read is the (N-M+1)-th through N-th, reading data input queue 1, cycling repeatedly in this order;
if this is not the 1st multiply accumulator and the read is the 1st through (N-M)-th, reading the data output by the previous-stage multiply accumulator unit, and suspending operation if no such data is available;
if this is not the 1st multiply accumulator and the read is the (N-M+1)-th through N-th, reading the corresponding data input queue, cycling repeatedly in this order.
If a multiply accumulator unit finishes an operation, the currently input data is processed according to the operation-completion rule of the data flow closed loop:
the completion rule comprises:
if the datum was input for the 1st through M-th time, discarding it directly;
if the datum was input for the (M+1)-th through N-th time and this is not the last multiply accumulator, moving it down in sequence to the adjacent multiply accumulator unit;
if the datum was input for the (M+1)-th through N-th time and this is the last multiply accumulator, moving it to the first multiply accumulator unit.
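A minimal Python sketch of this completion rule, assuming the units are held in a list ordered from the first (top) to the last (bottom); the class and function names are hypothetical, not taken from the patent:

```python
class Unit:
    """Placeholder for one multiply accumulator unit's input register."""
    def __init__(self):
        self.pending = None          # datum waiting to be consumed next

def route_after_operation(units, unit_index, input_count, datum, M):
    """Apply the data flow closed-loop completion rule to `datum` after
    units[unit_index] has finished operating on it.  `input_count` is
    1-based: how many times this datum has been input to the unit."""
    last = len(units) - 1
    if input_count <= M:
        return                                  # inputs 1..M: discard directly
    if unit_index < last:
        units[unit_index + 1].pending = datum   # move down to the adjacent unit
    else:
        units[0].pending = datum                # the last unit wraps around to
                                                # the first, closing the loop
```

The wrap-around in the final branch is what turns the column of units into the data flow closed loop that lets each datum be reused many times without being re-read from external memory.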
Meanwhile, the invention also provides a system for accelerated computation of an integrated circuit based on a convolutional neural network algorithm, comprising: an external memory, an external memory reading engine, an input data distribution controller, an input data header preprocessing buffer unit, a column of multiply accumulator units forming a unit queue, a convolution kernel data input queue unit, data input queue units, and an output data storage unit;
the external memory is used for storing a pre-designed convolution kernel matrix and data continuously generated by external input equipment;
the external memory reading engine is used for reading and outputting the convolution kernel matrix and the external data in the external memory to the input data distribution controller;
the input data distribution controller is used for distributing the convolution kernel matrix and the external data in the data read by the reading engine to a convolution kernel data input queue, a head data preprocessing buffer area and a corresponding data input queue through data distribution respectively, wherein the head data of the external data is distributed to the head data preprocessing buffer area, and the non-head data of the external data is distributed to the corresponding data input queue;
the input data header preprocessing buffer unit is used for outputting header data of external data to corresponding multiply accumulator units of the multiply accumulator unit queue;
the convolution kernel data input queue unit is used for correspondingly distributing each convolution kernel to a corresponding multiplication accumulator unit of a row of multiplication accumulator unit queues;
the data input queue unit is used for circularly outputting non-head data of the external data to corresponding multiply accumulator units of the multiply accumulator unit queue;
the column of multiply accumulator units comprises multiply accumulator units for performing the corresponding multiply-accumulate operations and outputting the respective multiply-accumulate results to the output data storage unit.
Further, the output of the external memory is connected to the external memory reading engine, and the output of the reading engine is connected to the input data distribution controller. One output of the input data distribution controller is connected to the convolution kernel data input queue unit; the other output is connected to the input data header preprocessing buffer unit and the corresponding data input queue units. The output of the convolution kernel data input queue unit is connected to the convolution kernel inputs of the column of multiply accumulator units, and the multiply accumulator units of the column are connected to one another in sequence. The input data header preprocessing buffer unit is connected to the data inputs of the column; the corresponding multiply accumulator units of the column are connected to the corresponding data input queue units; and each multiply accumulator unit of the column outputs to a corresponding input of the output data storage unit.
Furthermore, the multiplication accumulator unit comprises a convolution kernel register, an external data register, a multiplier, an adder and an activation function module;
the convolution kernel register and one input of the multiplier jointly receive each convolution kernel datum; the external data register and the other input of the multiplier jointly receive each external datum; the multiplier outputs to the adder; the adder outputs to the activation function module; the activation function module outputs to the corresponding port of the output data storage unit; the convolution kernel register outputs to the convolution kernel data input of the next multiply accumulator; and the external data register outputs to the external data input of the next multiply accumulator.
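As a behavioural illustration of this structure, a software model of one unit might look as follows. This is a sketch under our own naming, and the concrete activation function (ReLU) is our assumption; the actual unit is hardware:

```python
class MultiplyAccumulatorUnit:
    """Behavioural model: kernel and data registers feed a multiplier,
    an adder accumulates the products, and an activation function module
    produces the value sent to the output data storage unit."""

    def __init__(self, activation=lambda x: x if x > 0 else 0):  # ReLU assumed
        self.kernel_reg = None       # convolution kernel register
        self.data_reg = None         # external data register
        self.acc = 0                 # the adder's running sum
        self.activation = activation

    def clock(self, kernel_in, data_in):
        """One cycle: latch both inputs, multiply, accumulate; the latched
        registers also drive the inputs of the next unit in the queue."""
        self.kernel_reg = kernel_in
        self.data_reg = data_in
        self.acc += self.kernel_reg * self.data_reg
        return self.kernel_reg, self.data_reg

    def finish(self):
        """Once the kernel has been fully applied, activate and reset."""
        result = self.activation(self.acc)
        self.acc = 0
        return result
```

Feeding the pair returned by one unit's clock() into the next unit's clock() on the following cycle reproduces the register chaining described above.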
The invention has the following beneficial effects:
Convolution kernel data and external data are input to the multiply accumulator unit queue in parallel from different directions, and all multiply accumulator units simultaneously process, in parallel, the data flowing through them and output the results. The invention discloses an integrated circuit design method for accelerating the convolutional neural network (CNN) operations that arise when processing large data volumes in real time (such as image processing and sound data processing). The input data is reused many times in a constructed data flow closed loop, which greatly reduces the number of external memory reads, and the multi-stage pipeline technique greatly raises the utilization of the multiply accumulator units. The circuit can therefore greatly reduce the bandwidth required for reading data from the external memory during computation and accelerate the CNN operations on the data, so it features few reads, high operation throughput, and low bandwidth requirements, and greatly improves the real-time performance of convolutional neural network operation.
Drawings
FIG. 1 is a general flow diagram of a method for integrated circuit accelerated computation based on convolutional neural network algorithm of the present invention;
FIG. 2 is a flowchart illustrating a method for accelerating computation of an integrated circuit based on a convolutional neural network algorithm according to an embodiment of the present invention;
FIG. 3 is a detailed flow chart of a method for accelerating computation of an integrated circuit based on a convolutional neural network algorithm according to the present invention;
FIG. 4 is a system block diagram of an integrated circuit accelerated computing system based on convolutional neural network algorithm of the present invention;
FIG. 5 is an exemplary architecture diagram of a single multiply accumulator unit of a system for accelerated computation of an integrated circuit based on a convolutional neural network algorithm according to the present invention;
FIGS. 6 to 36 are exemplary architecture diagrams of a system for accelerated computation of an integrated circuit based on a convolutional neural network algorithm according to the present invention at the 1st through 31st clock cycles, respectively.
Detailed Description
The invention is further described below with reference to the accompanying drawings:
In the figures:
S101 - inputting convolution kernel data and external data in parallel from different directions into the multiply accumulator unit queue;
S102 - each multiply accumulator unit in the queue simultaneously performing, in parallel, the corresponding multiply-accumulate processing on the convolution kernel data and external data flowing through it;
S103 - outputting the results respectively to the data storage units;
S201 - inputting at least one item of convolution kernel data into each corresponding multiply accumulator unit;
S202 - inputting the external data of the input queues into the corresponding multiply accumulator units in queue order;
S203 - performing the convolution operation on the convolution kernel data and external data in each multiply accumulator unit of the column simultaneously and in parallel through a multi-stage pipeline technique;
S301 - reading in the pre-designed convolution kernel matrix and the data continuously generated by the external input device through the external memory reading engine;
S302 - distributing the convolution kernels and external data in the read data among the convolution kernel data input queue, the header-data preprocessing buffer, and the corresponding data input queues;
S303 - wherein the header data of the external data is allocated to the header-data preprocessing buffer, and the non-header data is allocated to the corresponding data input queues;
S304 - the convolution kernel data input queue allocating each convolution kernel to the corresponding multiply accumulator unit of the column of multiply accumulator units;
S305 - the data preprocessing buffer outputting the header data of the external data to the corresponding multiply accumulator units of the queue;
S306 - the corresponding data input queues cyclically outputting the non-header data of the external data to the corresponding multiply accumulator units of the queue;
S307 - the corresponding multiply accumulators of the queue performing the corresponding multiply-accumulate operations and respectively outputting the results to the output data storage.
Examples:
Example one: As shown in fig. 1, a method for accelerated computation of an integrated circuit based on a convolutional neural network algorithm comprises: inputting convolution kernel data and external data in parallel from different directions into a multiply accumulator unit queue (S101); each multiply accumulator unit in the queue simultaneously performing, in parallel, the corresponding multiply-accumulate processing on the convolution kernel data and external data flowing through it (S102); and outputting the results respectively to the data storage unit (S103).
Because the convolution kernel data and external data are input to the multiply accumulator unit queue in parallel from different directions and all multiply accumulator units simultaneously process the data flowing through them in parallel, the input data is reused many times in the constructed data flow closed loop, greatly reducing the number of external memory reads, while the multi-stage pipeline technique greatly raises the utilization of the multiply accumulator units. As stated above, the circuit therefore features few reads, high operation throughput, and low bandwidth requirements, and greatly improves the real-time performance of convolutional neural network operation.
As shown in fig. 2, the method further comprises:
inputting at least one item of convolution kernel data into each corresponding multiply accumulator unit (S201);
inputting the external data of the input queues into the corresponding multiply accumulator units in queue order (S202);
performing the convolution operation on the convolution kernel data and external data in each multiply accumulator unit of the column simultaneously and in parallel through a multi-stage pipeline technique (S203).
Because at least one item of convolution kernel data is input into each corresponding multiply accumulator unit, the external data of the input queues is input into the corresponding multiply accumulator units in queue order, and the convolution kernel data and external data in each multiply accumulator unit of the column are convolved simultaneously and in parallel by the multi-stage pipeline technique, data reading is accelerated by compressing the data as a sparse matrix containing many zeros, a column of multiply accumulator units is constructed, and multiple data input queues are constructed so that data is input to the multiply accumulator units simultaneously in a specific order, ensuring that the multiply accumulator units perform the convolution operation simultaneously and in parallel.
As shown in fig. 3, the method further comprises:
the convolution kernel data is a pre-designed convolution kernel matrix, and the external data is data continuously generated by an external input device. The pre-designed convolution kernel matrix and the continuously generated data are read in through the external memory reading engine (S301); the convolution kernels and external data in the read data are distributed, through data distribution, among the convolution kernel data input queue, the header-data preprocessing buffer, and the corresponding data input queues (S302), wherein the header data of the external data is allocated to the header-data preprocessing buffer and the non-header data is allocated to the corresponding data input queues (S303); the convolution kernel data input queue allocates each convolution kernel to the corresponding multiply accumulator unit of the column of multiply accumulator units (S304); the data preprocessing buffer outputs the header data of the external data to the corresponding multiply accumulator units of the queue (S305); the corresponding data input queues cyclically output the non-header data of the external data to the corresponding multiply accumulator units of the queue (S306); and the corresponding multiply accumulators of the queue perform the corresponding multiply-accumulate operations and respectively output the results to the output data storage (S307).
Through this process, convolution kernel data and input data are input to the multiply accumulator unit queue in parallel from different directions, and the data passing through the multiply accumulator units is processed simultaneously and in parallel.
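A compact sketch of that distribution step, assuming the external data arrives as a list of columns; the function name, the use of Python lists, and the deque-based queues are our own choices:

```python
from collections import deque

def distribute(columns, N, M):
    """Mimic the input data distribution controller for kernel width N and
    step M: rows 1..N-M of columns 1..N form the header preprocessing
    buffer, and each later row of columns 1..N becomes one data input queue."""
    header = [col[r] for col in columns[:N] for r in range(N - M)]
    queues = [deque(col[r] for col in columns[:N])
              for r in range(N - M, len(columns[0]))]
    return header, queues
```

With the image example given later (N = 3, M = 1), the header buffer receives pixels 1, 2, 11, 12, 21, 22 and data input queue 1 receives pixels 3, 13, 23, matching steps 5 and 6 of the second example.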
If the data continuously generated by the external input device is taken as the external data, the operation function of the convolutional neural network (CNN) is:
convolution operation result = convolution of the convolution kernel matrix with the external data matrix;
convolution kernel matrix: a linear convolution kernel data matrix;
external data matrix: an external data matrix having a two-dimensional data structure and containing M × N external data;
convolution operation result: a convolution operation result matrix having a two-dimensional data structure and containing M × N convolution operation results, each result being the accumulated sum of the products of the convolution kernel data and the corresponding external data.
Because the data continuously generated by the external input device is taken as the external data, the CNN operation function is: convolution operation result = convolution of the convolution kernel matrix with the external data matrix.
By way of example:
For the convolutional neural network (CNN) operation on an image, the formula is: convolution operation result = convolution of the convolution kernel data matrix with the pixel matrix, wherein:
Convolution kernel data:

| Convolution kernel data 1 | Convolution kernel data 4 | Convolution kernel data 7 |
| Convolution kernel data 2 | Convolution kernel data 5 | Convolution kernel data 8 |
| Convolution kernel data 3 | Convolution kernel data 6 | Convolution kernel data 9 |
Pixel matrix:

| Pixel 1 | Pixel 11 | Pixel 21 |
| Pixel 2 | Pixel 12 | Pixel 22 |
| Pixel 3 | Pixel 13 | Pixel 23 |
The convolution operation accumulates the multiplication results of the following table and then outputs the result through the activation function.
| Convolution kernel data 1 × Pixel 1 | Convolution kernel data 4 × Pixel 11 | Convolution kernel data 7 × Pixel 21 |
| Convolution kernel data 2 × Pixel 2 | Convolution kernel data 5 × Pixel 12 | Convolution kernel data 8 × Pixel 22 |
| Convolution kernel data 3 × Pixel 3 | Convolution kernel data 6 × Pixel 13 | Convolution kernel data 9 × Pixel 23 |
The multiply accumulator units are limited to a single queue of multiply accumulator units.
Because only one column of multiply accumulator units is used, and only one such queue exists, the circuit implementation is simple and clear.
The multiply accumulator units form a data matrix processed sequentially. The sequential data processing mode comprises:
processing the data in top-to-bottom order; or
processing the data in bottom-to-top order; or
processing the data in left-to-right order; or
processing the data in right-to-left order.
Because the multiply accumulator units form a data matrix processed sequentially, the data can be processed in top-to-bottom, bottom-to-top, left-to-right, or right-to-left order.
The input processing of the multiply accumulator unit comprises an initialization phase and a subsequent phase; the initialization phase covers the 1st through the N × N-th clock cycle, and the subsequent phase covers the clock cycles after the N × N-th clock cycle.
The initialization phase adopts a first input rule of the input data flow closed loop, and the subsequent phase adopts a second input rule of the input data flow closed loop.
The first input rule comprises, before a multiply accumulator unit operates:
rule (a): if this is the 1st multiply accumulator unit and the read is the 1st through (N-M)-th, reading the input data header preprocessing buffer, whose data come from columns 1 through N and rows 1 through N-M;
rule (b): if this is the 1st multiply accumulator unit and the read is the (N-M+1)-th through N-th, reading data input queue 1, cycling repeatedly in this order;
rule (c): if this is not the 1st multiply accumulator and the read is the 1st through (N-M)-th, reading the data output by the previous-stage multiply accumulator unit, and suspending operation if no such data is available;
rule (d): if this is not the 1st multiply accumulator and the read is the (N-M+1)-th through N-th, reading the corresponding data input queue, cycling repeatedly in this order.
The 1st through N × N-th clock cycles thus serve as the initialization phase, for which this specific input rule of the image data flow closed loop is designed; before a multiply accumulator unit operates, data is read according to the four rules above.
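The four rules, together with their post-initialization variant below, reduce to a single read-source selector. The following sketch uses string tags of our own invention to name the sources; the unit-to-queue correspondence is our reading of 'the corresponding data input queue':

```python
def read_source(unit_no, read_no, N, M, initializing):
    """Select where a multiply accumulator unit takes its next external
    datum from.  `unit_no` and `read_no` are 1-based; `read_no` cycles
    through 1..N and then repeats."""
    if unit_no == 1:
        if read_no <= N - M:
            # rule (a): the header buffer during initialization; afterwards
            # the last unit's output feeds back, closing the data flow loop
            return "header_buffer" if initializing else "last_unit_output"
        return "data_input_queue_1"           # rule (b)
    if read_no <= N - M:
        return "previous_unit_output"         # rule (c); suspend if absent
    return "data_input_queue_%d" % unit_no    # rule (d)
```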
The second input rule comprises, before a multiply accumulator unit operates:
rule (a): if this is the 1st multiply accumulator unit and the read is the 1st through (N-M)-th, reading the data output by the last multiply accumulator unit;
rule (b): if this is the 1st multiply accumulator unit and the read is the (N-M+1)-th through N-th, reading data input queue 1, cycling repeatedly in this order;
rule (c): if this is not the 1st multiply accumulator and the read is the 1st through (N-M)-th, reading the data output by the previous-stage multiply accumulator unit, and suspending operation if no such data is available;
rule (d): if this is not the 1st multiply accumulator and the read is the (N-M+1)-th through N-th, reading the corresponding data input queue, cycling repeatedly in this order.
This specific closed-loop input rule of the data flow is designed for the clock cycles after the N × N-th clock cycle, when the output of the last multiply accumulator unit feeds back to the first; before a multiply accumulator unit operates, data is read according to the four rules above.
If a multiply accumulator unit finishes an operation, the currently input data is processed according to the operation-completion rule of the data flow closed loop:
the completion rule comprises:
rule (a): if the datum was input for the 1st through M-th time, discarding it directly;
rule (b): if the datum was input for the (M+1)-th through N-th time and this is not the last multiply accumulator, moving it down in sequence to the adjacent multiply accumulator unit;
rule (c): if the datum was input for the (M+1)-th through N-th time and this is the last multiply accumulator, moving it to the first multiply accumulator unit.
Here the convolution kernel is assumed to be an N × N square matrix, so the kernel width is N, the sliding step is M, and N > M. Assuming the units in the multiply accumulator unit queue are arranged from top to bottom, with the first multiply accumulator at the top and the last at the bottom, processing the currently input data according to these three rules after each unit finishes its operation constructs the data flow closed loop.
Meanwhile, the invention also provides a system for accelerated computation of an integrated circuit based on a convolutional neural network algorithm, comprising: an external memory, an external memory reading engine, an input data distribution controller, an input data header preprocessing buffer unit, a column of multiply accumulator units forming a unit queue, a convolution kernel data input queue unit, data input queue units, and an output data storage unit;
the external memory is used for storing a pre-designed convolution kernel matrix and data continuously generated by external input equipment;
the external memory reading engine is used for reading and outputting the convolution kernel matrix and the external data in the external memory to the input data distribution controller;
the input data distribution controller is used for distributing the convolution kernel matrix and the external data in the data read by the reading engine to a convolution kernel data input queue, a head data preprocessing buffer area and a corresponding data input queue through data distribution respectively, wherein the head data of the external data is distributed to the head data preprocessing buffer area, and the non-head data of the external data is distributed to the corresponding data input queue;
the input data header preprocessing buffer unit is used for outputting header data of external data to corresponding multiply accumulator units of the multiply accumulator unit queue;
the convolution kernel data input queue unit is used for correspondingly distributing each convolution kernel to a corresponding multiplication accumulator unit of a row of multiplication accumulator unit queues;
the data input queue unit is used for circularly outputting non-head data of the external data to corresponding multiply accumulator units of the multiply accumulator unit queue;
the column of multiply accumulator units comprises multiply accumulator units for performing the corresponding multiply-accumulate operations and outputting the respective multiply-accumulate results to the output data storage unit.
As shown in fig. 4, the output of the external memory is connected to the external memory reading engine, and the output of the reading engine is connected to the input data distribution controller. One output of the input data distribution controller is connected to the convolution kernel data input queue unit; the other output is connected to the input data header preprocessing buffer unit and the corresponding data input queue units. The output of the convolution kernel data input queue unit is connected to the convolution kernel inputs of the column of multiply accumulator units, and the multiply accumulator units of the column are connected to one another in sequence. The input data header preprocessing buffer unit is connected to the data inputs of the column; the corresponding multiply accumulator units of the column are connected to the corresponding data input queue units; and each multiply accumulator unit of the column outputs to a corresponding input of the output data storage unit.
The system thus comprises an external memory, an external memory reading engine, an input data distribution controller, an input data header preprocessing buffer, a column of multiply accumulator units, a convolution kernel data input queue, input queues of input data, and an output data storage unit. The external memory stores the pre-designed convolution kernel matrix and the data continuously generated by the external input device; the external memory reading engine reads the convolution kernels and external data from the external memory and outputs them to the input data distribution controller; the input data distribution controller distributes the convolution kernels and external data to the corresponding convolution kernel data input queues and input data header preprocessing buffers; and each multiply accumulator performs the corresponding multiply-accumulate operation and outputs the result to the output data storage unit. In the circuit: 1 - external memory reading engine; 2 - input data distribution controller; 3 - output data storage unit; 4 - input data header preprocessing buffer; 5 - column of multiply accumulator units; 6 - convolution kernel data input queue; 7 - input queues of data. Convolution kernel data and data are thereby input to the multiply accumulator unit queue in parallel from different directions, and each multiply accumulator unit processes the data flowing through it simultaneously and in parallel, achieving accelerated computation for an integrated circuit that processes large data volumes in real time (such as image processing and sound data processing).
The multiply accumulator units form a single column of multiply accumulator units (one queue).
As shown in fig. 5, the multiply accumulator unit includes a convolution kernel register, an external data register, a multiplier, an adder, and an activation function module;
the convolution kernel register and one input of the multiplier jointly receive each convolution kernel datum; the external data register and the other input of the multiplier jointly receive each external datum; the multiplier outputs to the adder; the adder outputs to the activation function module; the activation function module outputs to the corresponding port of the output data storage unit; the convolution kernel register outputs to the convolution kernel data input of the next multiply accumulator; and the external data register outputs to the external data input of the next multiply accumulator.
Because multiple data input queues are constructed and data is input to multiple multiply accumulator units simultaneously in a specific order through the register chain described above, the multiply accumulator units are guaranteed to perform the convolution operation simultaneously and in parallel.
Example two:
The following describes the method and system for accelerated computation of an integrated circuit based on a convolutional neural network algorithm, taking image processing as an example.
Step 1: Assume the convolution kernel data matrix is N × N and the sliding step is M. For example: N = 3, M = 1.
The pre-designed convolution kernel data matrix is placed in the external memory. For example:
| Convolution kernel data 1 | Convolution kernel data 4 | Convolution kernel data 7 |
| Convolution kernel data 2 | Convolution kernel data 5 | Convolution kernel data 8 |
| Convolution kernel data 3 | Convolution kernel data 6 | Convolution kernel data 9 |
Step 2: processing image data continuously output by a camera, subtracting pixels of adjacent images to generate more zeros, compressing a sparse matrix, and putting the sparse matrix into an external memory. For example, the camera outputs an image, which is the first 3 columns of data:
| Pixel 1 | Pixel 11 | Pixel 21 |
| Pixel 2 | Pixel 12 | Pixel 22 |
| Pixel 3 | Pixel 13 | Pixel 23 |
| Pixel 4 | Pixel 14 | Pixel 24 |
| Pixel 5 | Pixel 15 | Pixel 25 |
| Pixel 6 | Pixel 16 | Pixel 26 |
| Pixel 7 | Pixel 17 | Pixel 27 |
| Pixel 8 | Pixel 18 | Pixel 28 |
| Pixel 9 | Pixel 19 | Pixel 29 |
| Pixel 10 | Pixel 20 | Pixel 30 |

and, for the next 3 columns:

| Pixel 31 | Pixel 41 | Pixel 51 |
| Pixel 32 | Pixel 42 | Pixel 52 |
| Pixel 33 | Pixel 43 | Pixel 53 |
| Pixel 34 | Pixel 44 | Pixel 54 |
| Pixel 35 | Pixel 45 | Pixel 55 |
| Pixel 36 | Pixel 46 | Pixel 56 |
Step 3:
The reading engine reads the data of the external memory into the buffer area and decompresses the sparse matrix.
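Steps 2 and 3 can be sketched as a frame difference followed by zero run-length coding; the patent specifies only that adjacent images are subtracted and the sparse matrix compressed, so the concrete coding scheme below is our assumption:

```python
def delta_frame(prev, curr):
    """Subtract adjacent images pixel by pixel; unchanged regions become
    zeros, yielding the sparse matrix that is written to external memory."""
    return [[c - p for p, c in zip(prow, crow)]
            for prow, crow in zip(prev, curr)]

def rle_compress(flat):
    """Collapse runs of zeros into (0, run_length) pairs."""
    out, i = [], 0
    while i < len(flat):
        if flat[i] == 0:
            j = i
            while j < len(flat) and flat[j] == 0:
                j += 1
            out.append((0, j - i))
            i = j
        else:
            out.append((flat[i], 1))
            i += 1
    return out

def rle_decompress(pairs):
    """Inverse transform, as performed after the reading engine fetches
    the compressed data into the buffer area."""
    flat = []
    for value, count in pairs:
        flat.extend([value] * count)
    return flat
```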
Step 4:
The input data distribution controller places the convolution kernel data in a queue. For example:

| Convolution kernel data 9 |
| Convolution kernel data 8 |
| Convolution kernel data 7 |
| Convolution kernel data 6 |
| Convolution kernel data 5 |
| Convolution kernel data 4 |
| Convolution kernel data 3 |
| Convolution kernel data 2 |
| Convolution kernel data 1 |
Step 5:
The input data distribution controller places the image data of columns 1 through N, rows 1 through N-1, into the header preprocessing buffer. For example:

| Pixel 1 | Pixel 2 | Pixel 11 | Pixel 12 | Pixel 21 | Pixel 22 |
Step 6:
The input data distribution controller places the image data of the 1st to Nth columns, Nth row, into input queue 1 of the image data. For example:

| Pixel 3 | Pixel 13 | Pixel 23 |

Step 7:
The input data distribution controller puts the image data of the 1st to Nth columns, (N+1)th row, into input queue 2 of the image data. For example:

| Pixel 4 | Pixel 14 | Pixel 24 |

Step 8:
The input data distribution controller places the image data of the 1st to Nth columns, (N+2)th row, into input queue 3 of the image data. For example:

| Pixel 5 | Pixel 15 | Pixel 25 |

Step 9:
The input data distribution controller puts the image data of the 1st to Nth columns, (N+3)th row, into input queue 4 of the image data. For example:

| Pixel 6 | Pixel 16 | Pixel 26 |

Step 10:
The input data distribution controller puts the image data of the 1st to Nth columns, (N+4)th row, into input queue 5 of the image data. For example (a consolidated sketch of steps 5 to 10 follows):

| Pixel 7 | Pixel 17 | Pixel 27 |
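Steps 5 to 10 follow a single distribution rule, consolidated in the sketch below for N = 3; the function name distribute, the 0-indexed image[row][column] addressing, and NUM_QUEUES = 5 are assumptions for illustration:

```python
N, NUM_QUEUES = 3, 5  # kernel width and number of data input queues

def distribute(image):
    """Split the incoming image per steps 5-10: the first N-1 rows of
    columns 1..N seed the head preprocessing buffer (read column by
    column), and each following row of those columns fills one queue."""
    head_buffer = [image[r][c] for c in range(N) for r in range(N - 1)]
    queues = [[image[N - 1 + k][c] for c in range(N)]
              for k in range(NUM_QUEUES)]
    return head_buffer, queues
```

For the example image above, distribute would return the head buffer (pixels 1, 2, 11, 12, 21, 22) and queues 1 to 5 exactly as listed in steps 5 to 10.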
Step 11:
As shown in fig. 6, which illustrates the exemplary architecture at the 1st clock cycle after initialization;
Step 12:
As shown in FIG. 7, which is the 2nd clock cycle;
unit 1 reads convolution kernel data 1 from the convolution kernel data queue;
After a multiply accumulator unit completes an operation, the currently input image data is processed according to the following 3 rules,
assuming that the convolution kernel is a 3 × 3 square matrix, so that the width of the convolution kernel is 3 and the shift step size is 1:
Rule (a): image data input for the 1st time is directly discarded.
Rule (b): for image data input for the 2nd to 3rd time, if the unit is not the last multiply accumulator, the data moves down in sequence to the adjacent multiply accumulator unit.
Rule (c): for image data input for the 2nd to 3rd time, if the unit is the last multiply accumulator, the data moves to the first multiply accumulator unit, thus constructing a closed loop of image data flow.
The number 1 on the arrow in the figure indicates the 1st-time input, so pixel 1 is discarded directly after the operation completes.
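The three rules condense into one routing function. In the sketch below, remaining_uses plays the role of the number drawn on the arrows in the figures (pixel 1 carries 1, pixel 2 carries 2, pixel 3 carries 3); the function name and the 0-based unit indices are editorial assumptions:

```python
def route_after_use(remaining_uses, unit_idx, num_units):
    """Destination of a datum once the current unit has consumed it;
    None means rule (a) discards it."""
    if remaining_uses == 1:
        return None              # rule (a): final use, discard
    if unit_idx < num_units - 1:
        return unit_idx + 1      # rule (b): shift down to the adjacent unit
    return 0                     # rule (c): wrap to the first unit,
                                 # closing the image data flow loop
```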
Step 13:
As shown in FIG. 8, which is the 3rd clock cycle;
convolution kernel data 1 moves to the buffer register;
unit 1 reads in convolution kernel data 2;
the number 2 on the arrow in the figure indicates the 2nd-time input, so after the operation completes, pixel 2 moves to multiply accumulator unit 2.
Step 14:
As shown in fig. 9, which is the 4th clock cycle, convolution kernel data 2 moves to the buffer register;
unit 1 reads convolution kernel data 3, and unit 2 reads convolution kernel data 1;
unit 1 reads in pixel 3, and unit 2 reads in pixel 2;
units 1 and 2 start multiplication and accumulate the results;
Step 15:
As shown in fig. 10, which is the 5th clock cycle, convolution kernel data 1 and 3 move to the buffer register;
unit 1 reads convolution kernel data 4, and unit 2 reads convolution kernel data 2;
unit 1 reads in pixel 11, and unit 2 reads in pixel 3;
units 1 and 2 start multiplication and accumulate the results:
convolution kernel data 4 × pixel 11, convolution kernel data 2 × pixel 3;
Step 16:
As shown in fig. 11, which is the 6th clock cycle, convolution kernel data 2 and 4 move to the buffer register;
units 1, 2 and 3 start multiplication and accumulate the results:
convolution kernel data 5 × pixel 12, convolution kernel data 3 × pixel 4, convolution kernel data 1 × pixel 3;
Step 17:
As shown in fig. 12, which is the 7th clock cycle, convolution kernel data 1, 3 and 5 move to the buffer register;
unit 1 reads convolution kernel data 6, unit 2 reads convolution kernel data 4, and unit 3 reads convolution kernel data 2;
units 1, 2 and 3 start multiplication and accumulate the results:
convolution kernel data 6 × pixel 13, convolution kernel data 4 × pixel 12, convolution kernel data 2 × pixel 4;
Step 18:
As shown in fig. 13, which is the 8th clock cycle, convolution kernel data 2, 4 and 6 move to the buffer register;
unit 1 reads convolution kernel data 7, unit 2 reads convolution kernel data 5, unit 3 reads convolution kernel data 3, and unit 4 reads convolution kernel data 1;
units 1, 2, 3 and 4 start multiplication and accumulate the results:
convolution kernel data 7 × pixel 21, convolution kernel data 5 × pixel 13, convolution kernel data 3 × pixel 5, convolution kernel data 1 × pixel 4;
Step 19:
As shown in fig. 14, which is the 9th clock cycle, convolution kernel data 1, 3, 5 and 7 move to the buffer register;
unit 1 reads convolution kernel data 8, unit 2 reads convolution kernel data 6, unit 3 reads convolution kernel data 4, and unit 4 reads convolution kernel data 2;
units 1, 2, 3 and 4 start multiplication and accumulate the results:
convolution kernel data 8 × pixel 22, convolution kernel data 6 × pixel 14, convolution kernel data 4 × pixel 13, convolution kernel data 2 × pixel 5;
Step 20:
As shown in fig. 15, which is the 10th clock cycle, convolution kernel data 2, 4, 6 and 8 move to the buffer register;
unit 1 reads convolution kernel data 9, unit 2 reads convolution kernel data 7, unit 3 reads convolution kernel data 5, unit 4 reads convolution kernel data 3, and unit 5 reads convolution kernel data 1;
units 1, 2, 3, 4 and 5 start multiplication, and the accumulation results are: convolution kernel data 9 × pixel 23, convolution kernel data 7 × pixel 22, convolution kernel data 5 × pixel 14, convolution kernel data 3 × pixel 6, convolution kernel data 1 × pixel 5;
Step 21:
As shown in fig. 16, which is the 11th clock cycle, convolution kernel data 1, 3, 5, 7 and 9 move to the buffer register;
unit 2 reads convolution kernel data 8, unit 3 reads convolution kernel data 6, unit 4 reads convolution kernel data 4, and unit 5 reads convolution kernel data 2;
the accumulated result of unit 1 is output through the activation function;
units 2, 3, 4 and 5 start multiplication, and the accumulation results are: convolution kernel data 8 × pixel 23, convolution kernel data 6 × pixel 15, convolution kernel data 4 × pixel 14, convolution kernel data 2 × pixel 6;
Step 22:
As shown in fig. 17, which is the 12th clock cycle, convolution kernel data 2, 4, 6 and 8 move to the buffer register;
unit 1 reads convolution kernel data 1, unit 2 reads convolution kernel data 9, unit 3 reads convolution kernel data 7, unit 4 reads convolution kernel data 5, and unit 5 reads convolution kernel data 3;
units 1, 2, 3, 4 and 5 start multiplication, and the accumulation results are: convolution kernel data 1 × pixel 6, convolution kernel data 9 × pixel 24, convolution kernel data 7 × pixel 23, convolution kernel data 5 × pixel 15, convolution kernel data 3 × pixel 7;
Step 23:
As shown in fig. 18, which is the 13th clock cycle, convolution kernel data 1, 3, 5, 7 and 9 move to the buffer register;
unit 1 reads convolution kernel data 2, unit 3 reads convolution kernel data 8, unit 4 reads convolution kernel data 6, and unit 5 reads convolution kernel data 4;
the accumulated result of unit 2 is output through the activation function;
units 1, 3, 4 and 5 start multiplication, and the accumulation results are: convolution kernel data 2 × pixel 7, convolution kernel data 8 × pixel 24, convolution kernel data 6 × pixel 16, convolution kernel data 4 × pixel 15;
Step 24:
As shown in fig. 19, which is the 14th clock cycle, convolution kernel data 2, 4, 6 and 8 move to the buffer register;
unit 1 reads convolution kernel data 3, unit 2 reads convolution kernel data 1, unit 3 reads convolution kernel data 9, unit 4 reads convolution kernel data 7, and unit 5 reads convolution kernel data 5;
units 1, 2, 3, 4 and 5 start multiplication, and the accumulation results are: convolution kernel data 3 × pixel 8, convolution kernel data 1 × pixel 7, convolution kernel data 9 × pixel 25, convolution kernel data 7 × pixel 24, convolution kernel data 5 × pixel 16;
Step 25:
As shown in fig. 20, which is the 15th clock cycle, convolution kernel data 1, 3, 5, 7 and 9 move to the buffer register;
unit 1 reads convolution kernel data 4, unit 2 reads convolution kernel data 2, unit 4 reads convolution kernel data 8, and unit 5 reads convolution kernel data 6;
the accumulated result of unit 3 is output through the activation function;
units 1, 2, 4 and 5 start multiplication, and the accumulation results are: convolution kernel data 4 × pixel 16, convolution kernel data 2 × pixel 8, convolution kernel data 8 × pixel 25, convolution kernel data 6 × pixel 17;
Step 26:
As shown in fig. 21, which is the 16th clock cycle, convolution kernel data 2, 4, 6 and 8 move to the buffer register;
unit 1 reads convolution kernel data 5, unit 2 reads convolution kernel data 3, unit 3 reads convolution kernel data 1, unit 4 reads convolution kernel data 9, and unit 5 reads convolution kernel data 7;
units 1, 2, 3, 4 and 5 start multiplication, and the accumulation results are: convolution kernel data 5 × pixel 17, convolution kernel data 3 × pixel 9, convolution kernel data 1 × pixel 8, convolution kernel data 9 × pixel 26, convolution kernel data 7 × pixel 25;
Step 27:
As shown in fig. 22, which is the 17th clock cycle, convolution kernel data 1, 3, 5, 7 and 9 move to the buffer register;
unit 1 reads convolution kernel data 6, unit 2 reads convolution kernel data 4, unit 3 reads convolution kernel data 2, and unit 5 reads convolution kernel data 8;
the accumulated result of unit 4 is output through the activation function;
units 1, 2, 3 and 5 start multiplication, and the accumulation results are: convolution kernel data 6 × pixel 18, convolution kernel data 4 × pixel 17, convolution kernel data 2 × pixel 9, convolution kernel data 8 × pixel 26;
Step 28:
As shown in fig. 23, which is the 18th clock cycle, convolution kernel data 2, 4, 6 and 8 move to the buffer register;
unit 1 reads convolution kernel data 7, unit 2 reads convolution kernel data 5, unit 3 reads convolution kernel data 3, unit 4 reads convolution kernel data 1, and unit 5 reads convolution kernel data 9;
units 1, 2, 3, 4 and 5 start multiplication, and the accumulation results are: convolution kernel data 7 × pixel 26, convolution kernel data 5 × pixel 18, convolution kernel data 3 × pixel 10, convolution kernel data 1 × pixel 9, convolution kernel data 9 × pixel 27;
Step 29:
As shown in fig. 24, which is the 19th clock cycle, convolution kernel data 1, 3, 5, 7 and 9 move to the buffer register;
unit 1 reads convolution kernel data 8, unit 2 reads convolution kernel data 6, unit 3 reads convolution kernel data 4, and unit 4 reads convolution kernel data 2;
the accumulated result of unit 5 is output through the activation function;
units 1, 2, 3 and 4 start multiplication, and the accumulation results are: convolution kernel data 8 × pixel 27, convolution kernel data 6 × pixel 19, convolution kernel data 4 × pixel 18, convolution kernel data 2 × pixel 10;
Step 30:
As shown in fig. 25, which is the 20th clock cycle, convolution kernel data 2, 4, 6 and 8 move to the buffer register;
unit 1 reads convolution kernel data 9, unit 2 reads convolution kernel data 7, unit 3 reads convolution kernel data 5, unit 4 reads convolution kernel data 3, and unit 5 reads convolution kernel data 1;
units 1, 2, 3, 4 and 5 start multiplication, and the accumulation results are: convolution kernel data 9 × pixel 28, convolution kernel data 7 × pixel 27, convolution kernel data 5 × pixel 19, convolution kernel data 3 × pixel 31, convolution kernel data 1 × pixel 10;
Step 31:
As shown in fig. 26, which is the 21st clock cycle, convolution kernel data 1, 3, 5, 7 and 9 move to the buffer register;
unit 2 reads convolution kernel data 8, unit 3 reads convolution kernel data 6, unit 4 reads convolution kernel data 4, and unit 5 reads convolution kernel data 2;
the accumulated result of unit 1 is output through the activation function;
units 2, 3, 4 and 5 start multiplication, and the accumulation results are: convolution kernel data 8 × pixel 28, convolution kernel data 6 × pixel 20, convolution kernel data 4 × pixel 19, convolution kernel data 2 × pixel 31;
Step 32:
As shown in fig. 27, which is the 22nd clock cycle, convolution kernel data 2, 4, 6 and 8 move to the buffer register;
unit 1 reads convolution kernel data 1, unit 2 reads convolution kernel data 9, unit 3 reads convolution kernel data 7, unit 4 reads convolution kernel data 5, and unit 5 reads convolution kernel data 3;
units 1, 2, 3, 4 and 5 start multiplication, and the accumulation results are: convolution kernel data 1 × pixel 31, convolution kernel data 9 × pixel 29, convolution kernel data 7 × pixel 28, convolution kernel data 5 × pixel 20, convolution kernel data 3 × pixel 32;
Step 33:
As shown in fig. 28, which is the 23rd clock cycle, convolution kernel data 1, 3, 5, 7 and 9 move to the buffer register;
unit 1 reads convolution kernel data 2, unit 3 reads convolution kernel data 8, unit 4 reads convolution kernel data 6, and unit 5 reads convolution kernel data 4;
the accumulated result of unit 2 is output through the activation function;
units 1, 3, 4 and 5 start multiplication, and the accumulation results are: convolution kernel data 2 × pixel 32, convolution kernel data 8 × pixel 29, convolution kernel data 6 × pixel 41, convolution kernel data 4 × pixel 20;
Step 34:
As shown in fig. 29, which is the 24th clock cycle, convolution kernel data 2, 4, 6 and 8 move to the buffer register;
unit 1 reads convolution kernel data 3, unit 2 reads convolution kernel data 1, unit 3 reads convolution kernel data 9, unit 4 reads convolution kernel data 7, and unit 5 reads convolution kernel data 5;
units 1, 2, 3, 4 and 5 start multiplication, and the accumulation results are: convolution kernel data 3 × pixel 33, convolution kernel data 1 × pixel 32, convolution kernel data 9 × pixel 30, convolution kernel data 7 × pixel 29, convolution kernel data 5 × pixel 41;
Step 35:
As shown in fig. 30, which is the 25th clock cycle, convolution kernel data 1, 3, 5, 7 and 9 move to the buffer register;
unit 1 reads convolution kernel data 4, unit 2 reads convolution kernel data 2, unit 4 reads convolution kernel data 8, and unit 5 reads convolution kernel data 6;
the accumulated result of unit 3 is output through the activation function;
units 1, 2, 4 and 5 start multiplication, and the accumulation results are: convolution kernel data 4 × pixel 41, convolution kernel data 2 × pixel 33, convolution kernel data 8 × pixel 30, convolution kernel data 6 × pixel 42;
Step 36:
As shown in fig. 31, which is the 26th clock cycle, convolution kernel data 2, 4, 6 and 8 move to the buffer register;
unit 1 reads convolution kernel data 5, unit 2 reads convolution kernel data 3, unit 3 reads convolution kernel data 1, unit 4 reads convolution kernel data 9, and unit 5 reads convolution kernel data 7;
units 1, 2, 3, 4 and 5 start multiplication, and the accumulation results are: convolution kernel data 5 × pixel 42, convolution kernel data 3 × pixel 34, convolution kernel data 1 × pixel 33, convolution kernel data 9 × pixel 51, convolution kernel data 7 × pixel 30;
Step 37:
As shown in fig. 32, which is the 27th clock cycle, convolution kernel data 1, 3, 5, 7 and 9 move to the buffer register;
unit 1 reads convolution kernel data 6, unit 2 reads convolution kernel data 4, unit 3 reads convolution kernel data 2, and unit 5 reads convolution kernel data 8;
the accumulated result of unit 4 is output through the activation function;
units 1, 2, 3 and 5 start multiplication, and the accumulation results are: convolution kernel data 6 × pixel 43, convolution kernel data 4 × pixel 42, convolution kernel data 2 × pixel 34, convolution kernel data 8 × pixel 51;
Step 38:
As shown in fig. 33, which is the 28th clock cycle, convolution kernel data 2, 4, 6 and 8 move to the buffer register;
unit 1 reads convolution kernel data 7, unit 2 reads convolution kernel data 5, unit 3 reads convolution kernel data 3, unit 4 reads convolution kernel data 1, and unit 5 reads convolution kernel data 9;
units 1, 2, 3, 4 and 5 start multiplication, and the accumulation results are: convolution kernel data 7 × pixel 51, convolution kernel data 5 × pixel 43, convolution kernel data 3 × pixel 35, convolution kernel data 1 × pixel 34, convolution kernel data 9 × pixel 52;
Step 39:
As shown in fig. 34, which is the 29th clock cycle, convolution kernel data 1, 3, 5, 7 and 9 move to the buffer register;
unit 1 reads convolution kernel data 8, unit 2 reads convolution kernel data 6, unit 3 reads convolution kernel data 4, and unit 4 reads convolution kernel data 2;
the accumulated result of unit 5 is output through the activation function;
units 1, 2, 3 and 4 start multiplication, and the accumulation results are: convolution kernel data 8 × pixel 52, convolution kernel data 6 × pixel 44, convolution kernel data 4 × pixel 43, convolution kernel data 2 × pixel 35;
Step 40:
As shown in fig. 35, which is the 30th clock cycle, convolution kernel data 2, 4, 6 and 8 move to the buffer register;
unit 1 reads convolution kernel data 9, unit 2 reads convolution kernel data 7, unit 3 reads convolution kernel data 5, unit 4 reads convolution kernel data 3, and unit 5 reads convolution kernel data 1;
units 1, 2, 3, 4 and 5 start multiplication, and the accumulation results are: convolution kernel data 9 × pixel 53, convolution kernel data 7 × pixel 52, convolution kernel data 5 × pixel 44, convolution kernel data 3 × pixel 36, convolution kernel data 1 × pixel 35;
Step 41:
As shown in fig. 36, which is the 31st clock cycle, convolution kernel data 1, 3, 5, 7 and 9 move to the buffer register;
unit 2 reads convolution kernel data 8, unit 3 reads convolution kernel data 6, unit 4 reads convolution kernel data 4, and unit 5 reads convolution kernel data 2;
the accumulated result of unit 1 is output through the activation function;
units 2, 3, 4 and 5 start multiplication, and the accumulation results are: convolution kernel data 8 × pixel 53, convolution kernel data 6 × pixel 45, convolution kernel data 4 × pixel 44, convolution kernel data 2 × pixel 36.
The working principle is as follows:
The invention discloses an integrated circuit design method for accelerating convolutional neural network (CNN) operations on large volumes of data, and belongs to the field of integrated circuit accelerated calculation for real-time processing of large data volumes (for example, image processing and sound data processing). Convolution kernel data and external data are input in parallel, from different directions, into the multiplication accumulator unit queue, and all multiplication accumulator units simultaneously process, in parallel, the data flowing through them and output the results. Because the input data is reused many times through the constructed closed loop of data flow, the number of reads of the external memory is greatly reduced; and the multi-stage pipeline technique greatly improves the utilization rate of the multiplication accumulator units. The circuit therefore greatly reduces the bandwidth requirement for reading data from the external memory during calculation and accelerates the CNN computation.
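As a software intuition for this dataflow — an editorial sketch, not the patented circuit — each unit accumulates the dot product of the kernel, read column by column, with its own sliding window, so several units produce adjacent outputs in parallel while each datum is fetched from external memory only once. The function name simulate_unit_queue, the 0-indexed windows, and the ReLU activation are assumptions:

```python
import numpy as np

def simulate_unit_queue(kernel, image, num_units=5):
    """Model the result (not the cycle-level timing) of the unit queue:
    unit k accumulates the kernel (column-major, matching the
    kernel-queue order 1..9) against window rows k..k+n-1 of the
    first n columns of the image."""
    n = kernel.shape[0]
    outputs = []
    for k in range(num_units):
        window = image[k:k + n, 0:n]
        acc = 0.0
        for c in range(n):          # columns first: kernel data 1, 2, 3
            for r in range(n):      # are column 1, just as the units read them
                acc += kernel[r, c] * window[r, c]
        outputs.append(max(0.0, acc))  # assumed ReLU activation module
    return outputs
```

For the running example (N = 3), unit 1 accumulates convolution kernel data 1 to 9 against pixels 1-3, 11-13 and 21-23, which matches the products listed for unit 1 in steps 12 to 20.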
Technical solutions of the present invention, and similar technical solutions designed by those skilled in the art based on the teachings of the present invention to achieve the above technical effects, all fall within the protection scope of the present invention.
Claims (10)
1. A method for accelerating computation of an integrated circuit based on a convolutional neural network algorithm is characterized by comprising the following steps:
inputting convolution kernel data and external data from different directions to a multiplication accumulator unit queue in parallel;
each multiplication accumulator unit in the multiplication accumulator unit queue simultaneously and in parallel carries out corresponding multiply-accumulate processing on the convolution kernel data and the external data flowing through it, and respectively outputs the results to the data storage unit;
the corresponding multiply-accumulate processing carried out simultaneously and in parallel by each multiplication accumulator unit in the multiplication accumulator unit queue further comprises the following steps:
inputting at least one convolution kernel data into corresponding multiplication accumulator units respectively;
respectively inputting external data in the input queue into corresponding multiplication accumulator units according to the queue sequence;
performing convolution operation on convolution kernel data and external data in each row of multiplication accumulator units simultaneously in parallel by a multi-stage pipeline technology;
the convolution kernel data is a pre-designed convolution kernel matrix, and the external data is data continuously generated by external input equipment;
reading in the pre-designed convolution kernel matrix and the data continuously generated by the external input device through an external storage reading engine; distributing, through data distribution, the convolution kernels and the external data read in by the reading engine to a convolution kernel data input queue, a head data preprocessing buffer area and corresponding data input queues, wherein the head data of the external data is distributed to the head data preprocessing buffer area and the non-head data of the external data is distributed to the corresponding data input queues; correspondingly distributing, by the convolution kernel data input queue, each convolution kernel to a corresponding multiplication accumulator unit of a row of multiplication accumulator unit queues; correspondingly outputting the head data of the external data to corresponding multiplication accumulator units of the multiplication accumulator unit queue; circularly outputting, by the data input queues, the non-head data of the external data to corresponding multiplication accumulator units of the multiplication accumulator unit queue; and performing, by the corresponding multiplication accumulators of the multiplication accumulator unit queue, the corresponding multiply-accumulate operations and respectively outputting the results of the corresponding multiply-accumulate operations to the output data storage unit.
2. The method according to claim 1, wherein if the data continuously generated by the external input device is external data, the CNN operation function of the convolutional neural network is:
the result of the convolution operation = convolution kernel matrix × external data matrix;
the convolution kernel matrix: a linear convolution kernel data matrix;
the external data matrix: an external data matrix having a two-dimensional data structure and containing M × N external data;
the result of the convolution operation: a convolution operation result matrix having a two-dimensional data structure and containing M × N convolution operation results, each convolution operation result being the accumulated sum of the products of the convolution kernel data and the corresponding external data.
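As an editorial illustration of this operation function — each result element read as the accumulated sum over a sliding N × N window — here is a hedged numpy sketch; the name conv2d_valid and the valid-padding choice are assumptions, not claim language:

```python
import numpy as np

def conv2d_valid(kernel, data, step=1):
    """Each output element is the sum of elementwise products between
    the kernel and the window of external data it currently covers."""
    n = kernel.shape[0]
    rows = (data.shape[0] - n) // step + 1
    cols = (data.shape[1] - n) // step + 1
    out = np.empty((rows, cols))
    for r in range(rows):
        for c in range(cols):
            window = data[r * step:r * step + n, c * step:c * step + n]
            out[r, c] = np.sum(kernel * window)
    return out
```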
3. The method of claim 1, wherein there is only one queue of multiply accumulator units.
4. The method of claim 1, wherein the multiply accumulator unit queue processes the data matrix sequentially, the sequential data processing comprising:
processing data in top-to-bottom order; or
processing data in bottom-to-top order; or
processing data in left-to-right order; or
processing data in right-to-left order.
5. The method of claim 1, wherein the input processing of the multiply accumulator unit comprises an initialization phase and a subsequent phase, the initialization phase comprising the 1st through Nth clock cycles and the subsequent phase comprising the clock cycles after the Nth clock cycle.
6. The method of claim 5, wherein the initialization phase employs a first input rule of an external data flow closed loop, the first input rule comprising:
before a multiplication accumulator unit performs an operation:
if it is the 1st to (N-M)th input of the 1st multiplication accumulator unit, the external data head preprocessing buffer is read, the data of the preprocessing buffer coming from the 1st to Nth columns and the 1st to (N-M)th rows;
if it is the (N-M+1)th to Nth input of the 1st multiplication accumulator unit, the first data input queue is read and cycled through repeatedly in order;
if it is the 1st to (N-M)th input of a multiplication accumulator unit other than the 1st, the data output by the previous-stage multiplication accumulator unit is read, operation being stopped while no such data is available;
if it is the (N-M+1)th to Nth input of a multiplication accumulator unit other than the 1st, the corresponding data input queue is read and cycled through repeatedly in order.
7. The method of claim 5, wherein the subsequent stage employs a second input rule of the external data flow closed loop, the second input rule comprising:
before a multiplication accumulator unit performs an operation:
if it is the 1st to (N-M)th input of the 1st multiplication accumulator unit, the data output by the last multiplication accumulator unit is read;
if it is the (N-M+1)th to Nth input of the 1st multiplication accumulator unit, input queue 1 of the data is read and cycled through repeatedly in order;
if it is the 1st to (N-M)th input of a multiplication accumulator unit other than the 1st, the data output by the previous-stage multiplication accumulator unit is read, operation being stopped while no such data is available;
if it is the (N-M+1)th to Nth input of a multiplication accumulator unit other than the 1st, the corresponding data input queue is read and cycled through repeatedly in order.
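Claims 6 and 7 can be condensed into one source-selection function. The sketch below is an editorial restatement using the claims' symbols N and M; the function name, the string labels, and the omission of the later units' pipeline-fill idling are assumptions:

```python
N, M = 3, 1  # kernel width and sliding step, as in the running example

def input_source(t, unit_idx, initialized):
    """Which source feeds a unit's t-th input (t is 1-based); unit_idx 0
    is the 1st multiplication accumulator. Pipeline-fill idling of the
    later units during initialization is deliberately elided."""
    if unit_idx == 0:
        if t <= N - M:
            # claim 6: head preprocessing buffer while initializing;
            # claim 7: the last unit's forwarded data afterwards
            return "head_buffer" if not initialized else "last_unit"
        return "input_queue_1"     # inputs N-M+1 .. N, cycled in order
    if t <= N - M:
        return "previous_unit"
    return f"input_queue_{unit_idx + 1}"
```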
8. The method of claim 7, wherein if a multiply accumulator unit completes the operation, then processing the current input data according to a data flow closed loop completion operation rule;
the completion operation rule comprises:
if the data is input for the 1st to Mth time, it is directly discarded;
if the data is input for the (M+1)th to Nth time and the unit is not the last multiply accumulator, it moves down in sequence to the adjacent multiply accumulator unit;
if the data is input for the (M+1)th to Nth time and the unit is the last multiply accumulator, it moves to the first multiply accumulator unit.
9. A system for integrated circuit accelerated computing based on convolutional neural network algorithm, comprising: the device comprises an external memory, an external memory reading engine, an input data distribution controller, an input data header preprocessing buffer unit, a row of multiplication accumulator unit queues, a convolution kernel data input queue unit, a data input queue unit and an output data storage unit;
the external memory is used for storing a pre-designed convolution kernel matrix and data continuously generated by external input equipment;
the external memory reading engine is used for reading and outputting the convolution kernel matrix and the external data in the external memory to the input data distribution controller;
the input data distribution controller is used for distributing the convolution kernel matrix and the external data in the data read by the reading engine to a convolution kernel data input queue, a head data preprocessing buffer area and a corresponding data input queue through data distribution respectively, wherein the head data of the external data is distributed to the head data preprocessing buffer area, and the non-head data of the external data is distributed to the corresponding data input queue;
the input data header preprocessing buffer unit is used for outputting header data of external data to corresponding multiply accumulator units of the multiply accumulator unit queue;
the convolution kernel data input queue unit is used for correspondingly distributing each convolution kernel to a corresponding multiplication accumulator unit of a row of multiplication accumulator unit queues;
the data input queue unit is used for circularly outputting non-head data of the external data to corresponding multiply accumulator units of the multiply accumulator unit queue;
the row of multiplication accumulator unit queues comprises multiplication accumulator units, the multiplication accumulator units being used for the corresponding multiply-accumulate operations and respectively outputting the results of the corresponding multiply-accumulate operations to the output data storage unit;
the output of the external memory is connected with the external memory reading engine, the output of the external memory reading engine is connected with the input data distribution controller, one output end of the input data distribution controller is connected with the convolution kernel data input queue unit, the other output end of the input data distribution controller is connected with the input data head preprocessing buffer unit and the corresponding data input queue units, the output of the convolution kernel data input queue unit is connected with the convolution kernel input end of the row of multiplication accumulator unit queues, the multiplication accumulator units of the row of multiplication accumulator unit queues are connected in turn, the input data head preprocessing buffer unit is connected with the data input end of the row of multiplication accumulator unit queues, the corresponding multiplication accumulator units of the row of multiplication accumulator unit queues are respectively connected with the corresponding data input queue units, and each multiplication accumulator unit of the row of multiplication accumulator unit queues outputs to a corresponding input of the output data storage unit.
10. The system of claim 9, wherein the multiply accumulator unit comprises a convolution kernel register, an external data register, a multiplier, an adder, an activation function module;
the convolution kernel register and one input end of the multiplier jointly receive each convolution kernel data, the external data register and the other input end of the multiplier jointly receive each external data, the multiplier outputs to the adder, the adder outputs to the activation function module, the activation function module outputs to the corresponding port of the output data storage unit, the convolution kernel register outputs to the convolution kernel data input end of the next multiplication accumulator, and the external data register outputs to the external data input end of the next multiplication accumulator.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910368448.1A CN110188869B (en) | 2019-05-05 | 2019-05-05 | Method and system for integrated circuit accelerated calculation based on convolutional neural network algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910368448.1A CN110188869B (en) | 2019-05-05 | 2019-05-05 | Method and system for integrated circuit accelerated calculation based on convolutional neural network algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110188869A CN110188869A (en) | 2019-08-30 |
CN110188869B true CN110188869B (en) | 2021-08-10 |
Family
ID=67715675
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910368448.1A Active CN110188869B (en) | 2019-05-05 | 2019-05-05 | Method and system for integrated circuit accelerated calculation based on convolutional neural network algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110188869B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110796250A (en) * | 2019-10-11 | 2020-02-14 | 浪潮电子信息产业股份有限公司 | Convolution processing method and system applied to convolutional neural network and related components |
CN110807521B (en) * | 2019-10-29 | 2022-06-24 | 中昊芯英(杭州)科技有限公司 | Processing device, chip, electronic equipment and method supporting vector operation |
CN112784207B (en) * | 2019-11-01 | 2024-02-02 | 中科寒武纪科技股份有限公司 | Operation method and related product |
TWI733334B (en) * | 2020-02-15 | 2021-07-11 | 財團法人工業技術研究院 | Convolutional neural-network calculating apparatus and operation methods thereof |
CN112051981B (en) * | 2020-09-15 | 2023-09-01 | 厦门壹普智慧科技有限公司 | Data pipeline calculation path structure and single-thread data pipeline system |
CN112328962B (en) * | 2020-11-27 | 2021-12-31 | 深圳致星科技有限公司 | Matrix operation optimization method, device and equipment and readable storage medium |
CN118519613A (en) * | 2024-07-23 | 2024-08-20 | 珠海皓泽科技有限公司 | Multiply-accumulator operation cluster and data processing method |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102662623A (en) * | 2012-04-28 | 2012-09-12 | 电子科技大学 | Parallel matrix multiplier based on single FPGA and implementation method thereof |
CN104915322A (en) * | 2015-06-09 | 2015-09-16 | 中国人民解放军国防科学技术大学 | Method for accelerating convolution neutral network hardware and AXI bus IP core thereof |
CN105589677A (en) * | 2014-11-17 | 2016-05-18 | 沈阳高精数控智能技术股份有限公司 | Systolic structure matrix multiplier based on FPGA (Field Programmable Gate Array) and implementation method thereof |
CN107704921A (en) * | 2017-10-19 | 2018-02-16 | 北京智芯原动科技有限公司 | The algorithm optimization method and device of convolutional neural networks based on Neon instructions |
CN107844826A (en) * | 2017-10-30 | 2018-03-27 | 中国科学院计算技术研究所 | Neural-network processing unit and the processing system comprising the processing unit |
CN107862374A (en) * | 2017-10-30 | 2018-03-30 | 中国科学院计算技术研究所 | Processing with Neural Network system and processing method based on streamline |
CN108182471A (en) * | 2018-01-24 | 2018-06-19 | 上海岳芯电子科技有限公司 | A kind of convolutional neural networks reasoning accelerator and method |
US10073816B1 (en) * | 2017-05-11 | 2018-09-11 | NovuMind Limited | Native tensor processor, and partitioning of tensor contractions |
CN108537330A (en) * | 2018-03-09 | 2018-09-14 | 中国科学院自动化研究所 | Convolutional calculation device and method applied to neural network |
CN108665059A (en) * | 2018-05-22 | 2018-10-16 | 中国科学技术大学苏州研究院 | Convolutional neural networks acceleration system based on field programmable gate array |
CN108764466A (en) * | 2018-03-07 | 2018-11-06 | 东南大学 | Convolutional neural networks hardware based on field programmable gate array and its accelerated method |
CN109086867A (en) * | 2018-07-02 | 2018-12-25 | 武汉魅瞳科技有限公司 | A kind of convolutional neural networks acceleration system based on FPGA |
CN109146067A (en) * | 2018-11-19 | 2019-01-04 | 东北大学 | A kind of Policy convolutional neural networks accelerator based on FPGA |
CN109190756A (en) * | 2018-09-10 | 2019-01-11 | 中国科学院计算技术研究所 | Arithmetic unit based on Winograd convolution and the neural network processor comprising the device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170344876A1 (en) * | 2016-05-31 | 2017-11-30 | Samsung Electronics Co., Ltd. | Efficient sparse parallel winograd-based convolution scheme |
Patent Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102662623A (en) * | 2012-04-28 | 2012-09-12 | 电子科技大学 | Parallel matrix multiplier based on single FPGA and implementation method thereof |
CN105589677A (en) * | 2014-11-17 | 2016-05-18 | 沈阳高精数控智能技术股份有限公司 | Systolic structure matrix multiplier based on FPGA (Field Programmable Gate Array) and implementation method thereof |
CN104915322A (en) * | 2015-06-09 | 2015-09-16 | 中国人民解放军国防科学技术大学 | Method for accelerating convolution neutral network hardware and AXI bus IP core thereof |
US10073816B1 (en) * | 2017-05-11 | 2018-09-11 | NovuMind Limited | Native tensor processor, and partitioning of tensor contractions |
CN107704921A (en) * | 2017-10-19 | 2018-02-16 | 北京智芯原动科技有限公司 | The algorithm optimization method and device of convolutional neural networks based on Neon instructions |
CN107844826A (en) * | 2017-10-30 | 2018-03-27 | 中国科学院计算技术研究所 | Neural-network processing unit and the processing system comprising the processing unit |
CN107862374A (en) * | 2017-10-30 | 2018-03-30 | 中国科学院计算技术研究所 | Processing with Neural Network system and processing method based on streamline |
CN108182471A (en) * | 2018-01-24 | 2018-06-19 | 上海岳芯电子科技有限公司 | A kind of convolutional neural networks reasoning accelerator and method |
CN108764466A (en) * | 2018-03-07 | 2018-11-06 | 东南大学 | Convolutional neural networks hardware based on field programmable gate array and its accelerated method |
CN108537330A (en) * | 2018-03-09 | 2018-09-14 | 中国科学院自动化研究所 | Convolutional calculation device and method applied to neural network |
CN108665059A (en) * | 2018-05-22 | 2018-10-16 | 中国科学技术大学苏州研究院 | Convolutional neural networks acceleration system based on field programmable gate array |
CN109086867A (en) * | 2018-07-02 | 2018-12-25 | 武汉魅瞳科技有限公司 | A kind of convolutional neural networks acceleration system based on FPGA |
CN109190756A (en) * | 2018-09-10 | 2019-01-11 | 中国科学院计算技术研究所 | Arithmetic unit based on Winograd convolution and the neural network processor comprising the device |
CN109146067A (en) * | 2018-11-19 | 2019-01-04 | 东北大学 | A kind of Policy convolutional neural networks accelerator based on FPGA |
Non-Patent Citations (2)
Title |
---|
Real-time meets approximate computing: An elastic CNN inference accelerator with adaptive trade-off between QoS and QoR; Ying Wang et al.; 2017 54th ACM/EDAC/IEEE Design Automation Conference; 2017-10-31; pp. 1-6 *
A concise and efficient method for accelerating convolutional neural networks (《一种简洁高效的加速卷积神经网络的方法》); Liu Jinfeng et al.; Science Technology and Engineering (《科学技术与工程》); 2014-11-30; Vol. 14, No. 33; pp. 240-244 *
Also Published As
Publication number | Publication date |
---|---|
CN110188869A (en) | 2019-08-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110188869B (en) | Method and system for integrated circuit accelerated calculation based on convolutional neural network algorithm | |
CN106970896B (en) | Vector processor-oriented vectorization implementation method for two-dimensional matrix convolution | |
Yepez et al. | Stride 2 1-D, 2-D, and 3-D Winograd for convolutional neural networks | |
EP3659051B1 (en) | Accelerated mathematical engine | |
US20190095776A1 (en) | Efficient data distribution for parallel processing | |
CN110458279B (en) | FPGA-based binary neural network acceleration method and system | |
US10585621B2 (en) | Statically-schedulable feed and drain structure for systolic array architecture | |
CN108537330B (en) | Convolution computing device and method applied to neural network | |
CN111758107B (en) | System and method for hardware-based pooling | |
CN112292694A (en) | Method for accelerating operation and accelerator device | |
CN108629406B (en) | Arithmetic device for convolutional neural network | |
US20240265234A1 (en) | Digital Processing Circuits and Methods of Matrix Operations in an Artificially Intelligent Environment | |
CN109844738A (en) | Arithmetic processing circuit and identifying system | |
CN108170640B (en) | Neural network operation device and operation method using same | |
CN110989920A (en) | Energy efficient memory system and method | |
TW202123093A (en) | Method and system for performing convolution operation | |
CN110738308A (en) | neural network accelerators | |
CN110766128A (en) | Convolution calculation unit, calculation method and neural network calculation platform | |
CN102411558A (en) | Vector processor oriented large matrix multiplied vectorization realizing method | |
CN110851779B (en) | Systolic array architecture for sparse matrix operations | |
CN110705703A (en) | Sparse neural network processor based on systolic array | |
CN110674927A (en) | Data recombination method for pulse array structure | |
CN107680028B (en) | Processor and method for scaling an image | |
CN110580519B (en) | Convolution operation device and method thereof | |
CN115310037A (en) | Matrix multiplication computing unit, acceleration unit, computing system and related method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||