CN110188869B - Method and system for integrated circuit accelerated calculation based on convolutional neural network algorithm
- Publication number: CN110188869B (application CN201910368448.1A)
- Authority: CN (China)
- Prior art keywords: data, convolution kernel, unit, input, multiplication
- Legal status: Active
Classifications
- G06N3/045—Combinations of networks
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
Abstract
The invention belongs to the technical field of artificial intelligence, and in particular relates to a method and a system for accelerated computation of an integrated circuit based on a convolutional neural network algorithm. The system comprises a queue of multiply accumulator units into which convolution kernel data and external data are input in parallel from different directions; each multiply accumulator unit in the queue simultaneously performs, in parallel, the corresponding multiply-accumulate processing on the convolution kernel data and external data flowing through it, and outputs the results to a data storage unit. The invention addresses the prior-art problems that the computation load of a convolutional neural network is huge, that real-time operation on an integrated circuit or embedded device is difficult, and that conventional processors based mainly on serial architectures cannot easily meet the requirements, so that completing convolutional neural network operations quickly is an important problem to be solved. The method features few memory reads, high operation throughput, and low bandwidth requirements, and greatly improves the real-time performance of convolutional neural network operation.
Description
Technical Field
The invention belongs to the technical field of artificial intelligence and particularly relates to a method for accelerated computation of an integrated circuit based on a convolutional neural network algorithm; it also provides a corresponding system for accelerated computation of an integrated circuit based on a convolutional neural network algorithm.
Background
A convolutional neural network is a feedforward neural network, commonly applied to image recognition, and generally comprises convolutional layers, pooling layers, and fully connected layers. The convolution operation of a convolutional layer multiplies each weight in the convolution kernel point-to-point with its corresponding input datum and accumulates the products to obtain one output datum; the convolution kernel is then slid according to the stride setting of the convolutional layer and the operation is repeated. A disadvantage of the prior art is that the computation load of a convolutional neural network is huge: it is difficult to compute in real time on an integrated circuit or embedded device, and conventional processors based mainly on serial architectures have difficulty meeting the requirement. How to complete convolutional neural network operations quickly is therefore an important problem to be solved.
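For concreteness, the sliding-window convolution just described can be sketched in a few lines of Python. This is an illustrative reference model only; the function name, the list-of-lists representation, and the single-channel input are our own assumptions, not part of the patent:

```python
def conv2d(image, kernel, stride):
    """Slide `kernel` over `image`; each output value is the sum of the
    point-to-point products of the kernel weights and the input patch."""
    n = len(kernel)                                  # kernel is n x n
    rows = (len(image) - n) // stride + 1
    cols = (len(image[0]) - n) // stride + 1
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):                            # slide vertically
        for j in range(cols):                        # slide horizontally
            acc = 0
            for u in range(n):
                for v in range(n):
                    acc += kernel[u][v] * image[i * stride + u][j * stride + v]
            out[i][j] = acc                          # one accumulated output
    return out
```

On a serial processor every one of these multiply-accumulates executes in turn, which is precisely the bottleneck described above.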
Disclosure of Invention
The invention provides a method and a system for accelerated computation of an integrated circuit based on a convolutional neural network algorithm, aiming at the prior-art problems that the computation load of a convolutional neural network is huge, that real-time operation on an integrated circuit or embedded device is difficult, and that conventional processors based mainly on serial architectures cannot easily meet the requirements, so that completing convolutional neural network operations quickly is an important problem to be solved.
The technical problem solved by the invention is realized by adopting the following technical scheme: a method for accelerating computation of an integrated circuit based on a convolutional neural network algorithm comprises the following steps:
inputting convolution kernel data and external data from different directions to a multiplication accumulator unit queue in parallel;
and each multiply accumulator unit in the multiply accumulator unit queue simultaneously performs, in parallel, the corresponding multiply-accumulate processing on the convolution kernel data and external data flowing through it, and outputs the results respectively to the data storage unit.
Further, the method further comprises:
inputting at least one item of convolution kernel data into each corresponding multiply accumulator unit;
inputting the external data of the input queues into the corresponding multiply accumulator units in queue order;
and performing the convolution operation on the convolution kernel data and external data in each multiply accumulator unit of the column simultaneously and in parallel through a multi-stage pipeline technique.
Further, the method further comprises:
the convolution kernel data is a pre-designed convolution kernel matrix, and the external data is data continuously generated by an external input device;
reading in the pre-designed convolution kernel matrix and the data continuously generated by the external input device through an external memory reading engine; distributing, through data distribution, the convolution kernels and the external data in the read data among a convolution kernel data input queue, a header-data preprocessing buffer, and the corresponding data input queues, wherein the header data of the external data is allocated to the header-data preprocessing buffer and the non-header data is allocated to the corresponding data input queues; the convolution kernel data input queue allocating each convolution kernel to the corresponding multiply accumulator unit of the column of multiply accumulator units; the header-data preprocessing buffer outputting the header data of the external data to the corresponding multiply accumulator units of the queue; the data input queues cyclically outputting the non-header data of the external data to the corresponding multiply accumulator units of the queue; and the corresponding multiply accumulators of the queue performing the corresponding multiply-accumulate operations and respectively outputting the results to the output data storage.
Further, if the data continuously generated by the external input device is taken as the external data, the operation function of the convolutional neural network (CNN) is:
convolution operation result = convolution of the convolution kernel matrix with the external data matrix;
convolution kernel matrix: a linear convolution kernel data matrix;
external data matrix: an external data matrix having a two-dimensional data structure and containing M × N external data;
convolution operation result: a convolution operation result matrix having a two-dimensional data structure and containing M × N convolution operation results, each result being the accumulated sum of the products of the convolution kernel data and the corresponding external data.
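Written out in our notation (the patent gives only the verbal definition above), with an n × n convolution kernel K, sliding step s, external data X, and activation function f, each element of the convolution operation result is:

```latex
y_{i,j} = f\left( \sum_{u=1}^{n} \sum_{v=1}^{n} K_{u,v} \, X_{(i-1)s+u,\,(j-1)s+v} \right)
```

Here n and s are the kernel width and sliding step written elsewhere in this document as N and M; the symbols are renamed only to avoid clashing with the M × N matrix dimensions used just above.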
Further, the multiply accumulator units form only a single queue of multiply accumulator units.
Further, the multiply accumulator units process the data matrix in a sequential data processing mode, and the sequential data processing mode comprises:
processing the data in top-to-bottom order; or
processing the data in bottom-to-top order; or
processing the data in left-to-right order; or
processing the data in right-to-left order.
Further, the input processing of the multiply accumulator unit comprises an initialization phase and a subsequent phase; the initialization phase covers the 1st through the N × N-th clock cycle, and the subsequent phase covers the clock cycles after the N × N-th clock cycle.
The initialization phase adopts a first input rule of the external data flow closed loop, and the subsequent phase adopts a second input rule of the external data flow closed loop;
the first input rule comprises, before a multiply accumulator unit operates:
if this is the 1st multiply accumulator unit and the read is the 1st through (N-M)-th, reading the external data header preprocessing buffer, whose data come from columns 1 through N and rows 1 through N-M;
if this is the 1st multiply accumulator unit and the read is the (N-M+1)-th through N-th, reading data input queue 1, cycling repeatedly in this order;
if this is not the 1st multiply accumulator and the read is the 1st through (N-M)-th, reading the data output by the previous-stage multiply accumulator unit, and suspending operation if no such data is available;
if this is not the 1st multiply accumulator and the read is the (N-M+1)-th through N-th, reading the corresponding data input queue, cycling repeatedly in this order.
The second input rule comprises, before a multiply accumulator unit operates:
if this is the 1st multiply accumulator unit and the read is the 1st through (N-M)-th, reading the data output by the last multiply accumulator unit;
if this is the 1st multiply accumulator unit and the read is the (N-M+1)-th through N-th, reading data input queue 1, cycling repeatedly in this order;
if this is not the 1st multiply accumulator and the read is the 1st through (N-M)-th, reading the data output by the previous-stage multiply accumulator unit, and suspending operation if no such data is available;
if this is not the 1st multiply accumulator and the read is the (N-M+1)-th through N-th, reading the corresponding data input queue, cycling repeatedly in this order.
If a multiply accumulator unit finishes an operation, the currently input data is processed according to the operation-completion rule of the data flow closed loop:
the completion rule comprises:
if the datum was input for the 1st through M-th time, discarding it directly;
if the datum was input for the (M+1)-th through N-th time and this is not the last multiply accumulator, moving it down in sequence to the adjacent multiply accumulator unit;
if the datum was input for the (M+1)-th through N-th time and this is the last multiply accumulator, moving it to the first multiply accumulator unit.
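A minimal Python sketch of this completion rule, assuming the units are held in a list ordered from the first (top) to the last (bottom); the class and function names are hypothetical, not taken from the patent:

```python
class Unit:
    """Placeholder for one multiply accumulator unit's input register."""
    def __init__(self):
        self.pending = None          # datum waiting to be consumed next

def route_after_operation(units, unit_index, input_count, datum, M):
    """Apply the data flow closed-loop completion rule to `datum` after
    units[unit_index] has finished operating on it.  `input_count` is
    1-based: how many times this datum has been input to the unit."""
    last = len(units) - 1
    if input_count <= M:
        return                                  # inputs 1..M: discard directly
    if unit_index < last:
        units[unit_index + 1].pending = datum   # move down to the adjacent unit
    else:
        units[0].pending = datum                # the last unit wraps around to
                                                # the first, closing the loop
```

The wrap-around in the final branch is what turns the column of units into the data flow closed loop that lets each datum be reused many times without being re-read from external memory.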
Meanwhile, the invention also provides a system for accelerated computation of an integrated circuit based on a convolutional neural network algorithm, comprising: an external memory, an external memory reading engine, an input data distribution controller, an input data header preprocessing buffer unit, a column of multiply accumulator units forming a unit queue, a convolution kernel data input queue unit, data input queue units, and an output data storage unit;
the external memory is used for storing a pre-designed convolution kernel matrix and data continuously generated by external input equipment;
the external memory reading engine is used for reading and outputting the convolution kernel matrix and the external data in the external memory to the input data distribution controller;
the input data distribution controller is used for distributing the convolution kernel matrix and the external data in the data read by the reading engine to a convolution kernel data input queue, a head data preprocessing buffer area and a corresponding data input queue through data distribution respectively, wherein the head data of the external data is distributed to the head data preprocessing buffer area, and the non-head data of the external data is distributed to the corresponding data input queue;
the input data header preprocessing buffer unit is used for outputting header data of external data to corresponding multiply accumulator units of the multiply accumulator unit queue;
the convolution kernel data input queue unit is used for correspondingly distributing each convolution kernel to a corresponding multiplication accumulator unit of a row of multiplication accumulator unit queues;
the data input queue unit is used for circularly outputting non-head data of the external data to corresponding multiply accumulator units of the multiply accumulator unit queue;
the column of multiply accumulator units comprises multiply accumulator units for performing the corresponding multiply-accumulate operations and outputting the respective multiply-accumulate results to the output data storage unit.
Further, the output of the external memory is connected to the external memory reading engine, and the output of the reading engine is connected to the input data distribution controller. One output of the input data distribution controller is connected to the convolution kernel data input queue unit; the other output is connected to the input data header preprocessing buffer unit and the corresponding data input queue units. The output of the convolution kernel data input queue unit is connected to the convolution kernel inputs of the column of multiply accumulator units, and the multiply accumulator units of the column are connected to one another in sequence. The input data header preprocessing buffer unit is connected to the data inputs of the column; the corresponding multiply accumulator units of the column are connected to the corresponding data input queue units; and each multiply accumulator unit of the column outputs to a corresponding input of the output data storage unit.
Furthermore, the multiplication accumulator unit comprises a convolution kernel register, an external data register, a multiplier, an adder and an activation function module;
the convolution kernel register and one input of the multiplier jointly receive each convolution kernel datum; the external data register and the other input of the multiplier jointly receive each external datum; the multiplier outputs to the adder; the adder outputs to the activation function module; the activation function module outputs to the corresponding port of the output data storage unit; the convolution kernel register outputs to the convolution kernel data input of the next multiply accumulator; and the external data register outputs to the external data input of the next multiply accumulator.
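As a behavioural illustration of this structure, a software model of one unit might look as follows. This is a sketch under our own naming, and the concrete activation function (ReLU) is our assumption; the actual unit is hardware:

```python
class MultiplyAccumulatorUnit:
    """Behavioural model: kernel and data registers feed a multiplier,
    an adder accumulates the products, and an activation function module
    produces the value sent to the output data storage unit."""

    def __init__(self, activation=lambda x: x if x > 0 else 0):  # ReLU assumed
        self.kernel_reg = None       # convolution kernel register
        self.data_reg = None         # external data register
        self.acc = 0                 # the adder's running sum
        self.activation = activation

    def clock(self, kernel_in, data_in):
        """One cycle: latch both inputs, multiply, accumulate; the latched
        registers also drive the inputs of the next unit in the queue."""
        self.kernel_reg = kernel_in
        self.data_reg = data_in
        self.acc += self.kernel_reg * self.data_reg
        return self.kernel_reg, self.data_reg

    def finish(self):
        """Once the kernel has been fully applied, activate and reset."""
        result = self.activation(self.acc)
        self.acc = 0
        return result
```

Feeding the pair returned by one unit's clock() into the next unit's clock() on the following cycle reproduces the register chaining described above.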
The invention has the following beneficial effects:
Convolution kernel data and external data are input to the multiply accumulator unit queue in parallel from different directions, and all multiply accumulator units simultaneously process, in parallel, the data flowing through them and output the results. The invention discloses an integrated circuit design method for accelerating the convolutional neural network (CNN) operations that arise when processing large data volumes in real time (such as image processing and sound data processing). The input data is reused many times in a constructed data flow closed loop, which greatly reduces the number of external memory reads, and the multi-stage pipeline technique greatly raises the utilization of the multiply accumulator units. The circuit can therefore greatly reduce the bandwidth required for reading data from the external memory during computation and accelerate the CNN operations on the data, so it features few reads, high operation throughput, and low bandwidth requirements, and greatly improves the real-time performance of convolutional neural network operation.
Drawings
FIG. 1 is a general flow diagram of a method for integrated circuit accelerated computation based on convolutional neural network algorithm of the present invention;
FIG. 2 is a flowchart illustrating a method for accelerating computation of an integrated circuit based on a convolutional neural network algorithm according to an embodiment of the present invention;
FIG. 3 is a detailed flow chart of a method for accelerating computation of an integrated circuit based on a convolutional neural network algorithm according to the present invention;
FIG. 4 is a system block diagram of an integrated circuit accelerated computing system based on convolutional neural network algorithm of the present invention;
FIG. 5 is an exemplary architecture diagram of a single multiply accumulator unit of a system for accelerated computation of an integrated circuit based on a convolutional neural network algorithm according to the present invention;
FIGS. 6 to 36 are exemplary architecture diagrams of a system for accelerated computation of an integrated circuit based on a convolutional neural network algorithm according to the present invention at the 1st through 31st clock cycles, respectively.
Detailed Description
The invention is further described below with reference to the accompanying drawings:
In the figures:
S101 - inputting convolution kernel data and external data in parallel from different directions into the multiply accumulator unit queue;
S102 - each multiply accumulator unit in the queue simultaneously performing, in parallel, the corresponding multiply-accumulate processing on the convolution kernel data and external data flowing through it;
S103 - outputting the results respectively to the data storage units;
S201 - inputting at least one item of convolution kernel data into each corresponding multiply accumulator unit;
S202 - inputting the external data of the input queues into the corresponding multiply accumulator units in queue order;
S203 - performing the convolution operation on the convolution kernel data and external data in each multiply accumulator unit of the column simultaneously and in parallel through a multi-stage pipeline technique;
S301 - reading in the pre-designed convolution kernel matrix and the data continuously generated by the external input device through the external memory reading engine;
S302 - distributing the convolution kernels and external data in the read data among the convolution kernel data input queue, the header-data preprocessing buffer, and the corresponding data input queues;
S303 - wherein the header data of the external data is allocated to the header-data preprocessing buffer, and the non-header data is allocated to the corresponding data input queues;
S304 - the convolution kernel data input queue allocating each convolution kernel to the corresponding multiply accumulator unit of the column of multiply accumulator units;
S305 - the data preprocessing buffer outputting the header data of the external data to the corresponding multiply accumulator units of the queue;
S306 - the corresponding data input queues cyclically outputting the non-header data of the external data to the corresponding multiply accumulator units of the queue;
S307 - the corresponding multiply accumulators of the queue performing the corresponding multiply-accumulate operations and respectively outputting the results to the output data storage.
Examples:
Example one: As shown in fig. 1, a method for accelerated computation of an integrated circuit based on a convolutional neural network algorithm comprises: inputting convolution kernel data and external data in parallel from different directions into a multiply accumulator unit queue (S101); each multiply accumulator unit in the queue simultaneously performing, in parallel, the corresponding multiply-accumulate processing on the convolution kernel data and external data flowing through it (S102); and outputting the results respectively to the data storage unit (S103).
Because the convolution kernel data and external data are input to the multiply accumulator unit queue in parallel from different directions and all multiply accumulator units simultaneously process the data flowing through them in parallel, the input data is reused many times in the constructed data flow closed loop, greatly reducing the number of external memory reads, while the multi-stage pipeline technique greatly raises the utilization of the multiply accumulator units. As stated above, the circuit therefore features few reads, high operation throughput, and low bandwidth requirements, and greatly improves the real-time performance of convolutional neural network operation.
As shown in fig. 2, the method further comprises:
inputting at least one item of convolution kernel data into each corresponding multiply accumulator unit (S201);
inputting the external data of the input queues into the corresponding multiply accumulator units in queue order (S202);
performing the convolution operation on the convolution kernel data and external data in each multiply accumulator unit of the column simultaneously and in parallel through a multi-stage pipeline technique (S203).
Because at least one item of convolution kernel data is input into each corresponding multiply accumulator unit, the external data of the input queues is input into the corresponding multiply accumulator units in queue order, and the convolution kernel data and external data in each multiply accumulator unit of the column are convolved simultaneously and in parallel by the multi-stage pipeline technique, data reading is accelerated by compressing the data as a sparse matrix containing many zeros, a column of multiply accumulator units is constructed, and multiple data input queues are constructed so that data is input to the multiply accumulator units simultaneously in a specific order, ensuring that the multiply accumulator units perform the convolution operation simultaneously and in parallel.
As shown in fig. 3, the method further comprises:
the convolution kernel data is a pre-designed convolution kernel matrix, and the external data is data continuously generated by an external input device. The pre-designed convolution kernel matrix and the continuously generated data are read in through the external memory reading engine (S301); the convolution kernels and external data in the read data are distributed, through data distribution, among the convolution kernel data input queue, the header-data preprocessing buffer, and the corresponding data input queues (S302), wherein the header data of the external data is allocated to the header-data preprocessing buffer and the non-header data is allocated to the corresponding data input queues (S303); the convolution kernel data input queue allocates each convolution kernel to the corresponding multiply accumulator unit of the column of multiply accumulator units (S304); the data preprocessing buffer outputs the header data of the external data to the corresponding multiply accumulator units of the queue (S305); the corresponding data input queues cyclically output the non-header data of the external data to the corresponding multiply accumulator units of the queue (S306); and the corresponding multiply accumulators of the queue perform the corresponding multiply-accumulate operations and respectively output the results to the output data storage (S307).
Through this process, convolution kernel data and input data are input to the multiply accumulator unit queue in parallel from different directions, and the data passing through the multiply accumulator units is processed simultaneously and in parallel.
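A compact sketch of that distribution step, assuming the external data arrives as a list of columns; the function name, the use of Python lists, and the deque-based queues are our own choices:

```python
from collections import deque

def distribute(columns, N, M):
    """Mimic the input data distribution controller for kernel width N and
    step M: rows 1..N-M of columns 1..N form the header preprocessing
    buffer, and each later row of columns 1..N becomes one data input queue."""
    header = [col[r] for col in columns[:N] for r in range(N - M)]
    queues = [deque(col[r] for col in columns[:N])
              for r in range(N - M, len(columns[0]))]
    return header, queues
```

With the image example given later (N = 3, M = 1), the header buffer receives pixels 1, 2, 11, 12, 21, 22 and data input queue 1 receives pixels 3, 13, 23, matching steps 5 and 6 of the second example.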
If the data continuously generated by the external input device is taken as the external data, the operation function of the convolutional neural network (CNN) is:
convolution operation result = convolution of the convolution kernel matrix with the external data matrix;
convolution kernel matrix: a linear convolution kernel data matrix;
external data matrix: an external data matrix having a two-dimensional data structure and containing M × N external data;
convolution operation result: a convolution operation result matrix having a two-dimensional data structure and containing M × N convolution operation results, each result being the accumulated sum of the products of the convolution kernel data and the corresponding external data.
Because the data continuously generated by the external input device is taken as the external data, the CNN operation function is: convolution operation result = convolution of the convolution kernel matrix with the external data matrix.
By way of example:
For the convolutional neural network (CNN) operation on an image, the formula is: convolution operation result = convolution of the convolution kernel data matrix with the pixel matrix, wherein:
Convolution kernel data:

| Convolution kernel data 1 | Convolution kernel data 4 | Convolution kernel data 7 |
| Convolution kernel data 2 | Convolution kernel data 5 | Convolution kernel data 8 |
| Convolution kernel data 3 | Convolution kernel data 6 | Convolution kernel data 9 |
Pixel matrix:

| Pixel 1 | Pixel 11 | Pixel 21 |
| Pixel 2 | Pixel 12 | Pixel 22 |
| Pixel 3 | Pixel 13 | Pixel 23 |
The convolution operation accumulates the multiplication results of the following table and then outputs the result through the activation function.
| Convolution kernel data 1 × Pixel 1 | Convolution kernel data 4 × Pixel 11 | Convolution kernel data 7 × Pixel 21 |
| Convolution kernel data 2 × Pixel 2 | Convolution kernel data 5 × Pixel 12 | Convolution kernel data 8 × Pixel 22 |
| Convolution kernel data 3 × Pixel 3 | Convolution kernel data 6 × Pixel 13 | Convolution kernel data 9 × Pixel 23 |
The multiply accumulator units are limited to a single queue of multiply accumulator units.
Because only one column of multiply accumulator units is used, and only one such queue exists, the circuit implementation is simple and clear.
The multiply accumulator units form a data matrix processed sequentially. The sequential data processing mode comprises:
processing the data in top-to-bottom order; or
processing the data in bottom-to-top order; or
processing the data in left-to-right order; or
processing the data in right-to-left order.
Because the multiply accumulator units form a data matrix processed sequentially, the data can be processed in top-to-bottom, bottom-to-top, left-to-right, or right-to-left order.
The input processing of the multiply accumulator unit comprises an initialization phase and a subsequent phase; the initialization phase covers the 1st through the N × N-th clock cycle, and the subsequent phase covers the clock cycles after the N × N-th clock cycle.
The initialization phase adopts a first input rule of the input data flow closed loop, and the subsequent phase adopts a second input rule of the input data flow closed loop.
The first input rule comprises, before a multiply accumulator unit operates:
rule (a): if this is the 1st multiply accumulator unit and the read is the 1st through (N-M)-th, reading the input data header preprocessing buffer, whose data come from columns 1 through N and rows 1 through N-M;
rule (b): if this is the 1st multiply accumulator unit and the read is the (N-M+1)-th through N-th, reading data input queue 1, cycling repeatedly in this order;
rule (c): if this is not the 1st multiply accumulator and the read is the 1st through (N-M)-th, reading the data output by the previous-stage multiply accumulator unit, and suspending operation if no such data is available;
rule (d): if this is not the 1st multiply accumulator and the read is the (N-M+1)-th through N-th, reading the corresponding data input queue, cycling repeatedly in this order.
The 1st through N × N-th clock cycles thus serve as the initialization phase, for which this specific input rule of the image data flow closed loop is designed; before a multiply accumulator unit operates, data is read according to the four rules above.
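The four rules, together with their post-initialization variant below, reduce to a single read-source selector. The following sketch uses string tags of our own invention to name the sources; the unit-to-queue correspondence is our reading of 'the corresponding data input queue':

```python
def read_source(unit_no, read_no, N, M, initializing):
    """Select where a multiply accumulator unit takes its next external
    datum from.  `unit_no` and `read_no` are 1-based; `read_no` cycles
    through 1..N and then repeats."""
    if unit_no == 1:
        if read_no <= N - M:
            # rule (a): the header buffer during initialization; afterwards
            # the last unit's output feeds back, closing the data flow loop
            return "header_buffer" if initializing else "last_unit_output"
        return "data_input_queue_1"           # rule (b)
    if read_no <= N - M:
        return "previous_unit_output"         # rule (c); suspend if absent
    return "data_input_queue_%d" % unit_no    # rule (d)
```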
The second input rule comprises, before a multiply accumulator unit operates:
rule (a): if this is the 1st multiply accumulator unit and the read is the 1st through (N-M)-th, reading the data output by the last multiply accumulator unit;
rule (b): if this is the 1st multiply accumulator unit and the read is the (N-M+1)-th through N-th, reading data input queue 1, cycling repeatedly in this order;
rule (c): if this is not the 1st multiply accumulator and the read is the 1st through (N-M)-th, reading the data output by the previous-stage multiply accumulator unit, and suspending operation if no such data is available;
rule (d): if this is not the 1st multiply accumulator and the read is the (N-M+1)-th through N-th, reading the corresponding data input queue, cycling repeatedly in this order.
This specific closed-loop input rule of the data flow is designed for the clock cycles after the N × N-th clock cycle, when the output of the last multiply accumulator unit feeds back to the first; before a multiply accumulator unit operates, data is read according to the four rules above.
If a multiply accumulator unit finishes an operation, the currently input data is processed according to the operation-completion rule of the data flow closed loop:
the completion rule comprises:
rule (a): if the datum was input for the 1st through M-th time, discarding it directly;
rule (b): if the datum was input for the (M+1)-th through N-th time and this is not the last multiply accumulator, moving it down in sequence to the adjacent multiply accumulator unit;
rule (c): if the datum was input for the (M+1)-th through N-th time and this is the last multiply accumulator, moving it to the first multiply accumulator unit.
Here the convolution kernel is assumed to be an N × N square matrix, so the kernel width is N, the sliding step is M, and N > M. Assuming the units in the multiply accumulator unit queue are arranged from top to bottom, with the first multiply accumulator at the top and the last at the bottom, processing the currently input data according to these three rules after each unit finishes its operation constructs the data flow closed loop.
Meanwhile, the invention also provides a system for accelerated computation of an integrated circuit based on a convolutional neural network algorithm, comprising: an external memory, an external memory reading engine, an input data distribution controller, an input data header preprocessing buffer unit, a column of multiply accumulator units forming a unit queue, a convolution kernel data input queue unit, data input queue units, and an output data storage unit;
the external memory is used for storing a pre-designed convolution kernel matrix and data continuously generated by external input equipment;
the external memory reading engine is used for reading and outputting the convolution kernel matrix and the external data in the external memory to the input data distribution controller;
the input data distribution controller is used for distributing the convolution kernel matrix and the external data in the data read by the reading engine to a convolution kernel data input queue, a head data preprocessing buffer area and a corresponding data input queue through data distribution respectively, wherein the head data of the external data is distributed to the head data preprocessing buffer area, and the non-head data of the external data is distributed to the corresponding data input queue;
the input data header preprocessing buffer unit is used for outputting header data of external data to corresponding multiply accumulator units of the multiply accumulator unit queue;
the convolution kernel data input queue unit is used for correspondingly distributing each convolution kernel to a corresponding multiplication accumulator unit of a row of multiplication accumulator unit queues;
the data input queue unit is used for circularly outputting non-head data of the external data to corresponding multiply accumulator units of the multiply accumulator unit queue;
the column of multiply accumulator units comprises multiply accumulator units for performing the corresponding multiply-accumulate operations and outputting the respective multiply-accumulate results to the output data storage unit.
As shown in fig. 4, the output of the external memory is connected to the external memory reading engine, and the output of the reading engine is connected to the input data distribution controller. One output of the input data distribution controller is connected to the convolution kernel data input queue unit; the other output is connected to the input data header preprocessing buffer unit and the corresponding data input queue units. The output of the convolution kernel data input queue unit is connected to the convolution kernel inputs of the column of multiply accumulator units, and the multiply accumulator units of the column are connected to one another in sequence. The input data header preprocessing buffer unit is connected to the data inputs of the column; the corresponding multiply accumulator units of the column are connected to the corresponding data input queue units; and each multiply accumulator unit of the column outputs to a corresponding input of the output data storage unit.
The system thus comprises an external memory, an external memory reading engine, an input data distribution controller, an input data header preprocessing buffer, a column of multiply accumulator units, a convolution kernel data input queue, input queues of input data, and an output data storage unit. The external memory stores the pre-designed convolution kernel matrix and the data continuously generated by the external input device; the external memory reading engine reads the convolution kernels and external data from the external memory and outputs them to the input data distribution controller; the input data distribution controller distributes the convolution kernels and external data to the corresponding convolution kernel data input queues and input data header preprocessing buffers; and each multiply accumulator performs the corresponding multiply-accumulate operation and outputs the result to the output data storage unit. In the circuit: 1 - external memory reading engine; 2 - input data distribution controller; 3 - output data storage unit; 4 - input data header preprocessing buffer; 5 - column of multiply accumulator units; 6 - convolution kernel data input queue; 7 - input queues of data. Convolution kernel data and data are thereby input to the multiply accumulator unit queue in parallel from different directions, and each multiply accumulator unit processes the data flowing through it simultaneously and in parallel, achieving accelerated computation for an integrated circuit that processes large data volumes in real time (such as image processing and sound data processing).
The multiply accumulator units form a single column of multiply accumulator units (one queue).
As shown in fig. 5, the multiply accumulator unit includes a convolution kernel register, an external data register, a multiplier, an adder, and an activation function module;
the convolution kernel register and one input of the multiplier jointly receive each convolution kernel datum; the external data register and the other input of the multiplier jointly receive each external datum; the multiplier outputs to the adder; the adder outputs to the activation function module; the activation function module outputs to the corresponding port of the output data storage unit; the convolution kernel register outputs to the convolution kernel data input of the next multiply accumulator; and the external data register outputs to the external data input of the next multiply accumulator.
Because multiple data input queues are constructed and data is input to multiple multiply accumulator units simultaneously in a specific order through the register chain described above, the multiply accumulator units are guaranteed to perform the convolution operation simultaneously and in parallel.
Example two:
The following describes the method and system for accelerated computation of an integrated circuit based on a convolutional neural network algorithm, taking image processing as an example.
Step 1: Assume the convolution kernel data matrix is N × N and the sliding step is M. For example: N = 3, M = 1.
The pre-designed convolution kernel data matrix is placed in the external memory. For example:
| Convolution kernel data 1 | Convolution kernel data 4 | Convolution kernel data 7 |
| Convolution kernel data 2 | Convolution kernel data 5 | Convolution kernel data 8 |
| Convolution kernel data 3 | Convolution kernel data 6 | Convolution kernel data 9 |
Step 2: processing image data continuously output by a camera, subtracting pixels of adjacent images to generate more zeros, compressing a sparse matrix, and putting the sparse matrix into an external memory. For example, the camera outputs an image, which is the first 3 columns of data:
| Pixel 1 | Pixel 11 | Pixel 21 |
| Pixel 2 | Pixel 12 | Pixel 22 |
| Pixel 3 | Pixel 13 | Pixel 23 |
| Pixel 4 | Pixel 14 | Pixel 24 |
| Pixel 5 | Pixel 15 | Pixel 25 |
| Pixel 6 | Pixel 16 | Pixel 26 |
| Pixel 7 | Pixel 17 | Pixel 27 |
| Pixel 8 | Pixel 18 | Pixel 28 |
| Pixel 9 | Pixel 19 | Pixel 29 |
| Pixel 10 | Pixel 20 | Pixel 30 |

and, for the next 3 columns:

| Pixel 31 | Pixel 41 | Pixel 51 |
| Pixel 32 | Pixel 42 | Pixel 52 |
| Pixel 33 | Pixel 43 | Pixel 53 |
| Pixel 34 | Pixel 44 | Pixel 54 |
| Pixel 35 | Pixel 45 | Pixel 55 |
| Pixel 36 | Pixel 46 | Pixel 56 |
Step 3:
The reading engine reads the data of the external memory into the buffer area and decompresses the sparse matrix.
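Steps 2 and 3 can be sketched as a frame difference followed by zero run-length coding; the patent specifies only that adjacent images are subtracted and the sparse matrix compressed, so the concrete coding scheme below is our assumption:

```python
def delta_frame(prev, curr):
    """Subtract adjacent images pixel by pixel; unchanged regions become
    zeros, yielding the sparse matrix that is written to external memory."""
    return [[c - p for p, c in zip(prow, crow)]
            for prow, crow in zip(prev, curr)]

def rle_compress(flat):
    """Collapse runs of zeros into (0, run_length) pairs."""
    out, i = [], 0
    while i < len(flat):
        if flat[i] == 0:
            j = i
            while j < len(flat) and flat[j] == 0:
                j += 1
            out.append((0, j - i))
            i = j
        else:
            out.append((flat[i], 1))
            i += 1
    return out

def rle_decompress(pairs):
    """Inverse transform, as performed after the reading engine fetches
    the compressed data into the buffer area."""
    flat = []
    for value, count in pairs:
        flat.extend([value] * count)
    return flat
```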
Step 4:
The input data distribution controller places the convolution kernel data in a queue. For example:

| Convolution kernel data 9 |
| Convolution kernel data 8 |
| Convolution kernel data 7 |
| Convolution kernel data 6 |
| Convolution kernel data 5 |
| Convolution kernel data 4 |
| Convolution kernel data 3 |
| Convolution kernel data 2 |
| Convolution kernel data 1 |
Step 5:
The input data distribution controller places the image data of columns 1 through N, rows 1 through N-1, into the header preprocessing buffer. For example:

| Pixel 1 | Pixel 2 | Pixel 11 | Pixel 12 | Pixel 21 | Pixel 22 |
Step 6:
The input data distribution controller places the image data of the 1st to Nth columns, Nth row, into input queue 1 of the image data. For example:

| Pixel 3 | Pixel 13 | Pixel 23 |

Step 7:
The input data distribution controller puts the image data of the 1st to Nth columns, (N+1)th row, into input queue 2 of the image data. For example:

| Pixel 4 | Pixel 14 | Pixel 24 |

Step 8:
The input data distribution controller places the image data of the 1st to Nth columns, (N+2)th row, into input queue 3 of the image data. For example:

| Pixel 5 | Pixel 15 | Pixel 25 |

Step 9:
The input data distribution controller puts the image data of the 1st to Nth columns, (N+3)th row, into input queue 4 of the image data. For example:

| Pixel 6 | Pixel 16 | Pixel 26 |

Step 10:
The input data distribution controller puts the image data of the 1st to Nth columns, (N+4)th row, into input queue 5 of the image data. For example (a consolidated sketch of steps 5 to 10 follows):

| Pixel 7 | Pixel 17 | Pixel 27 |
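Steps 5 to 10 follow a single distribution rule, consolidated in the sketch below for N = 3; the function name distribute, the 0-indexed image[row][column] addressing, and NUM_QUEUES = 5 are assumptions for illustration:

```python
N, NUM_QUEUES = 3, 5  # kernel width and number of data input queues

def distribute(image):
    """Split the incoming image per steps 5-10: the first N-1 rows of
    columns 1..N seed the head preprocessing buffer (read column by
    column), and each following row of those columns fills one queue."""
    head_buffer = [image[r][c] for c in range(N) for r in range(N - 1)]
    queues = [[image[N - 1 + k][c] for c in range(N)]
              for k in range(NUM_QUEUES)]
    return head_buffer, queues
```

For the example image above, distribute would return the head buffer (pixels 1, 2, 11, 12, 21, 22) and queues 1 to 5 exactly as listed in steps 5 to 10.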
Step 11:
As shown in fig. 6, which illustrates the exemplary architecture at the 1st clock cycle after initialization;
Step 12:
As shown in FIG. 7, which is the 2nd clock cycle;
unit 1 reads convolution kernel data 1 from the convolution kernel data queue;
After a multiply accumulator unit completes an operation, the currently input image data is processed according to the following 3 rules,
assuming that the convolution kernel is a 3 × 3 square matrix, so that the width of the convolution kernel is 3 and the shift step size is 1:
Rule (a): image data input for the 1st time is directly discarded.
Rule (b): for image data input for the 2nd to 3rd time, if the unit is not the last multiply accumulator, the data moves down in sequence to the adjacent multiply accumulator unit.
Rule (c): for image data input for the 2nd to 3rd time, if the unit is the last multiply accumulator, the data moves to the first multiply accumulator unit, thus constructing a closed loop of image data flow.
The number 1 on the arrow in the figure indicates the 1st-time input, so pixel 1 is discarded directly after the operation completes.
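The three rules condense into one routing function. In the sketch below, remaining_uses plays the role of the number drawn on the arrows in the figures (pixel 1 carries 1, pixel 2 carries 2, pixel 3 carries 3); the function name and the 0-based unit indices are editorial assumptions:

```python
def route_after_use(remaining_uses, unit_idx, num_units):
    """Destination of a datum once the current unit has consumed it;
    None means rule (a) discards it."""
    if remaining_uses == 1:
        return None              # rule (a): final use, discard
    if unit_idx < num_units - 1:
        return unit_idx + 1      # rule (b): shift down to the adjacent unit
    return 0                     # rule (c): wrap to the first unit,
                                 # closing the image data flow loop
```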
Step 13:
As shown in FIG. 8, which is the 3rd clock cycle;
convolution kernel data 1 moves to the buffer register;
unit 1 reads in convolution kernel data 2;
the number 2 on the arrow in the figure indicates the 2nd-time input, so after the operation completes, pixel 2 moves to multiply accumulator unit 2.
Step 14:
As shown in fig. 9, which is the 4th clock cycle, convolution kernel data 2 moves to the buffer register;
unit 1 reads convolution kernel data 3, and unit 2 reads convolution kernel data 1;
unit 1 reads in pixel 3, and unit 2 reads in pixel 2;
units 1 and 2 start multiplication and accumulate the results;
Step 15:
As shown in fig. 10, which is the 5th clock cycle, convolution kernel data 1 and 3 move to the buffer register;
unit 1 reads convolution kernel data 4, and unit 2 reads convolution kernel data 2;
unit 1 reads in pixel 11, and unit 2 reads in pixel 3;
units 1 and 2 start multiplication and accumulate the results:
convolution kernel data 4 × pixel 11, convolution kernel data 2 × pixel 3;
Step 16:
As shown in fig. 11, which is the 6th clock cycle, convolution kernel data 2 and 4 move to the buffer register;
units 1, 2 and 3 start multiplication and accumulate the results:
convolution kernel data 5 × pixel 12, convolution kernel data 3 × pixel 4, convolution kernel data 1 × pixel 3;
Step 17:
As shown in fig. 12, which is the 7th clock cycle, convolution kernel data 1, 3 and 5 move to the buffer register;
unit 1 reads convolution kernel data 6, unit 2 reads convolution kernel data 4, and unit 3 reads convolution kernel data 2;
units 1, 2 and 3 start multiplication and accumulate the results:
convolution kernel data 6 × pixel 13, convolution kernel data 4 × pixel 12, convolution kernel data 2 × pixel 4;
Step 18:
As shown in fig. 13, which is the 8th clock cycle, convolution kernel data 2, 4 and 6 move to the buffer register;
unit 1 reads convolution kernel data 7, unit 2 reads convolution kernel data 5, unit 3 reads convolution kernel data 3, and unit 4 reads convolution kernel data 1;
units 1, 2, 3 and 4 start multiplication and accumulate the results:
convolution kernel data 7 × pixel 21, convolution kernel data 5 × pixel 13, convolution kernel data 3 × pixel 5, convolution kernel data 1 × pixel 4;
Step 19:
As shown in fig. 14, which is the 9th clock cycle, convolution kernel data 1, 3, 5 and 7 move to the buffer register;
unit 1 reads convolution kernel data 8, unit 2 reads convolution kernel data 6, unit 3 reads convolution kernel data 4, and unit 4 reads convolution kernel data 2;
units 1, 2, 3 and 4 start multiplication and accumulate the results:
convolution kernel data 8 × pixel 22, convolution kernel data 6 × pixel 14, convolution kernel data 4 × pixel 13, convolution kernel data 2 × pixel 5;
Step 20:
As shown in fig. 15, which is the 10th clock cycle, convolution kernel data 2, 4, 6 and 8 move to the buffer register;
unit 1 reads convolution kernel data 9, unit 2 reads convolution kernel data 7, unit 3 reads convolution kernel data 5, unit 4 reads convolution kernel data 3, and unit 5 reads convolution kernel data 1;
units 1, 2, 3, 4 and 5 start multiplication, and the accumulation results are: convolution kernel data 9 × pixel 23, convolution kernel data 7 × pixel 22, convolution kernel data 5 × pixel 14, convolution kernel data 3 × pixel 6, convolution kernel data 1 × pixel 5;
Step 21:
As shown in fig. 16, which is the 11th clock cycle, convolution kernel data 1, 3, 5, 7 and 9 move to the buffer register;
unit 2 reads convolution kernel data 8, unit 3 reads convolution kernel data 6, unit 4 reads convolution kernel data 4, and unit 5 reads convolution kernel data 2;
the accumulated result of unit 1 is output through the activation function;
units 2, 3, 4 and 5 start multiplication, and the accumulation results are: convolution kernel data 8 × pixel 23, convolution kernel data 6 × pixel 15, convolution kernel data 4 × pixel 14, convolution kernel data 2 × pixel 6;
Step 22:
As shown in fig. 17, which is the 12th clock cycle, convolution kernel data 2, 4, 6 and 8 move to the buffer register;
unit 1 reads convolution kernel data 1, unit 2 reads convolution kernel data 9, unit 3 reads convolution kernel data 7, unit 4 reads convolution kernel data 5, and unit 5 reads convolution kernel data 3;
units 1, 2, 3, 4 and 5 start multiplication, and the accumulation results are: convolution kernel data 1 × pixel 6, convolution kernel data 9 × pixel 24, convolution kernel data 7 × pixel 23, convolution kernel data 5 × pixel 15, convolution kernel data 3 × pixel 7;
Step 23:
As shown in fig. 18, which is the 13th clock cycle, convolution kernel data 1, 3, 5, 7 and 9 move to the buffer register;
unit 1 reads convolution kernel data 2, unit 3 reads convolution kernel data 8, unit 4 reads convolution kernel data 6, and unit 5 reads convolution kernel data 4;
the accumulated result of unit 2 is output through the activation function;
units 1, 3, 4 and 5 start multiplication, and the accumulation results are: convolution kernel data 2 × pixel 7, convolution kernel data 8 × pixel 24, convolution kernel data 6 × pixel 16, convolution kernel data 4 × pixel 15;
Step 24:
As shown in fig. 19, which is the 14th clock cycle, convolution kernel data 2, 4, 6 and 8 move to the buffer register;
unit 1 reads convolution kernel data 3, unit 2 reads convolution kernel data 1, unit 3 reads convolution kernel data 9, unit 4 reads convolution kernel data 7, and unit 5 reads convolution kernel data 5;
units 1, 2, 3, 4 and 5 start multiplication, and the accumulation results are: convolution kernel data 3 × pixel 8, convolution kernel data 1 × pixel 7, convolution kernel data 9 × pixel 25, convolution kernel data 7 × pixel 24, convolution kernel data 5 × pixel 16;
Step 25:
As shown in fig. 20, which is the 15th clock cycle, convolution kernel data 1, 3, 5, 7 and 9 move to the buffer register;
unit 1 reads convolution kernel data 4, unit 2 reads convolution kernel data 2, unit 4 reads convolution kernel data 8, and unit 5 reads convolution kernel data 6;
the accumulated result of unit 3 is output through the activation function;
units 1, 2, 4 and 5 start multiplication, and the accumulation results are: convolution kernel data 4 × pixel 16, convolution kernel data 2 × pixel 8, convolution kernel data 8 × pixel 25, convolution kernel data 6 × pixel 17;
Step 26:
As shown in fig. 21, which is the 16th clock cycle, convolution kernel data 2, 4, 6 and 8 move to the buffer register;
unit 1 reads convolution kernel data 5, unit 2 reads convolution kernel data 3, unit 3 reads convolution kernel data 1, unit 4 reads convolution kernel data 9, and unit 5 reads convolution kernel data 7;
units 1, 2, 3, 4 and 5 start multiplication, and the accumulation results are: convolution kernel data 5 × pixel 17, convolution kernel data 3 × pixel 9, convolution kernel data 1 × pixel 8, convolution kernel data 9 × pixel 26, convolution kernel data 7 × pixel 25;
Step 27:
As shown in fig. 22, which is the 17th clock cycle, convolution kernel data 1, 3, 5, 7 and 9 move to the buffer register;
unit 1 reads convolution kernel data 6, unit 2 reads convolution kernel data 4, unit 3 reads convolution kernel data 2, and unit 5 reads convolution kernel data 8;
the accumulated result of unit 4 is output through the activation function;
units 1, 2, 3 and 5 start multiplication, and the accumulation results are: convolution kernel data 6 × pixel 18, convolution kernel data 4 × pixel 17, convolution kernel data 2 × pixel 9, convolution kernel data 8 × pixel 26;
Step 28:
As shown in fig. 23, which is the 18th clock cycle, convolution kernel data 2, 4, 6 and 8 move to the buffer register;
unit 1 reads convolution kernel data 7, unit 2 reads convolution kernel data 5, unit 3 reads convolution kernel data 3, unit 4 reads convolution kernel data 1, and unit 5 reads convolution kernel data 9;
units 1, 2, 3, 4 and 5 start multiplication, and the accumulation results are: convolution kernel data 7 × pixel 26, convolution kernel data 5 × pixel 18, convolution kernel data 3 × pixel 10, convolution kernel data 1 × pixel 9, convolution kernel data 9 × pixel 27;
Step 29:
As shown in fig. 24, which is the 19th clock cycle, convolution kernel data 1, 3, 5, 7 and 9 move to the buffer register;
unit 1 reads convolution kernel data 8, unit 2 reads convolution kernel data 6, unit 3 reads convolution kernel data 4, and unit 4 reads convolution kernel data 2;
the accumulated result of unit 5 is output through the activation function;
units 1, 2, 3 and 4 start multiplication, and the accumulation results are: convolution kernel data 8 × pixel 27, convolution kernel data 6 × pixel 19, convolution kernel data 4 × pixel 18, convolution kernel data 2 × pixel 10;
Step 30:
As shown in fig. 25, which is the 20th clock cycle, convolution kernel data 2, 4, 6 and 8 move to the buffer register;
unit 1 reads convolution kernel data 9, unit 2 reads convolution kernel data 7, unit 3 reads convolution kernel data 5, unit 4 reads convolution kernel data 3, and unit 5 reads convolution kernel data 1;
units 1, 2, 3, 4 and 5 start multiplication, and the accumulation results are: convolution kernel data 9 × pixel 28, convolution kernel data 7 × pixel 27, convolution kernel data 5 × pixel 19, convolution kernel data 3 × pixel 31, convolution kernel data 1 × pixel 10;
Step 31:
As shown in fig. 26, which is the 21st clock cycle, convolution kernel data 1, 3, 5, 7 and 9 move to the buffer register;
unit 2 reads convolution kernel data 8, unit 3 reads convolution kernel data 6, unit 4 reads convolution kernel data 4, and unit 5 reads convolution kernel data 2;
the accumulated result of unit 1 is output through the activation function;
units 2, 3, 4 and 5 start multiplication, and the accumulation results are: convolution kernel data 8 × pixel 28, convolution kernel data 6 × pixel 20, convolution kernel data 4 × pixel 19, convolution kernel data 2 × pixel 31;
Step 32:
As shown in fig. 27, which is the 22nd clock cycle, convolution kernel data 2, 4, 6 and 8 move to the buffer register;
unit 1 reads convolution kernel data 1, unit 2 reads convolution kernel data 9, unit 3 reads convolution kernel data 7, unit 4 reads convolution kernel data 5, and unit 5 reads convolution kernel data 3;
units 1, 2, 3, 4 and 5 start multiplication, and the accumulation results are: convolution kernel data 1 × pixel 31, convolution kernel data 9 × pixel 29, convolution kernel data 7 × pixel 28, convolution kernel data 5 × pixel 20, convolution kernel data 3 × pixel 32;
Step 33:
As shown in fig. 28, which is the 23rd clock cycle, convolution kernel data 1, 3, 5, 7 and 9 move to the buffer register;
unit 1 reads convolution kernel data 2, unit 3 reads convolution kernel data 8, unit 4 reads convolution kernel data 6, and unit 5 reads convolution kernel data 4;
the accumulated result of unit 2 is output through the activation function;
units 1, 3, 4 and 5 start multiplication, and the accumulation results are: convolution kernel data 2 × pixel 32, convolution kernel data 8 × pixel 29, convolution kernel data 6 × pixel 41, convolution kernel data 4 × pixel 20;
Step 34:
As shown in fig. 29, which is the 24th clock cycle, convolution kernel data 2, 4, 6 and 8 move to the buffer register;
unit 1 reads convolution kernel data 3, unit 2 reads convolution kernel data 1, unit 3 reads convolution kernel data 9, unit 4 reads convolution kernel data 7, and unit 5 reads convolution kernel data 5;
units 1, 2, 3, 4 and 5 start multiplication, and the accumulation results are: convolution kernel data 3 × pixel 33, convolution kernel data 1 × pixel 32, convolution kernel data 9 × pixel 30, convolution kernel data 7 × pixel 29, convolution kernel data 5 × pixel 41;
Step 35:
As shown in fig. 30, which is the 25th clock cycle, convolution kernel data 1, 3, 5, 7 and 9 move to the buffer register;
unit 1 reads convolution kernel data 4, unit 2 reads convolution kernel data 2, unit 4 reads convolution kernel data 8, and unit 5 reads convolution kernel data 6;
the accumulated result of unit 3 is output through the activation function;
units 1, 2, 4 and 5 start multiplication, and the accumulation results are: convolution kernel data 4 × pixel 41, convolution kernel data 2 × pixel 33, convolution kernel data 8 × pixel 30, convolution kernel data 6 × pixel 42;
Step 36:
As shown in fig. 31, which is the 26th clock cycle, convolution kernel data 2, 4, 6 and 8 move to the buffer register;
unit 1 reads convolution kernel data 5, unit 2 reads convolution kernel data 3, unit 3 reads convolution kernel data 1, unit 4 reads convolution kernel data 9, and unit 5 reads convolution kernel data 7;
units 1, 2, 3, 4 and 5 start multiplication, and the accumulation results are: convolution kernel data 5 × pixel 42, convolution kernel data 3 × pixel 34, convolution kernel data 1 × pixel 33, convolution kernel data 9 × pixel 51, convolution kernel data 7 × pixel 30;
Step 37:
As shown in fig. 32, which is the 27th clock cycle, convolution kernel data 1, 3, 5, 7 and 9 move to the buffer register;
unit 1 reads convolution kernel data 6, unit 2 reads convolution kernel data 4, unit 3 reads convolution kernel data 2, and unit 5 reads convolution kernel data 8;
the accumulated result of unit 4 is output through the activation function;
units 1, 2, 3 and 5 start multiplication, and the accumulation results are: convolution kernel data 6 × pixel 43, convolution kernel data 4 × pixel 42, convolution kernel data 2 × pixel 34, convolution kernel data 8 × pixel 51;
Step 38:
As shown in fig. 33, which is the 28th clock cycle, convolution kernel data 2, 4, 6 and 8 move to the buffer register;
unit 1 reads convolution kernel data 7, unit 2 reads convolution kernel data 5, unit 3 reads convolution kernel data 3, unit 4 reads convolution kernel data 1, and unit 5 reads convolution kernel data 9;
units 1, 2, 3, 4 and 5 start multiplication, and the accumulation results are: convolution kernel data 7 × pixel 51, convolution kernel data 5 × pixel 43, convolution kernel data 3 × pixel 35, convolution kernel data 1 × pixel 34, convolution kernel data 9 × pixel 52;
Step 39:
As shown in fig. 34, which is the 29th clock cycle, convolution kernel data 1, 3, 5, 7 and 9 move to the buffer register;
unit 1 reads convolution kernel data 8, unit 2 reads convolution kernel data 6, unit 3 reads convolution kernel data 4, and unit 4 reads convolution kernel data 2;
the accumulated result of unit 5 is output through the activation function;
units 1, 2, 3 and 4 start multiplication, and the accumulation results are: convolution kernel data 8 × pixel 52, convolution kernel data 6 × pixel 44, convolution kernel data 4 × pixel 43, convolution kernel data 2 × pixel 35;
Step 40:
As shown in fig. 35, which is the 30th clock cycle, convolution kernel data 2, 4, 6 and 8 move to the buffer register;
unit 1 reads convolution kernel data 9, unit 2 reads convolution kernel data 7, unit 3 reads convolution kernel data 5, unit 4 reads convolution kernel data 3, and unit 5 reads convolution kernel data 1;
units 1, 2, 3, 4 and 5 start multiplication, and the accumulation results are: convolution kernel data 9 × pixel 53, convolution kernel data 7 × pixel 52, convolution kernel data 5 × pixel 44, convolution kernel data 3 × pixel 36, convolution kernel data 1 × pixel 35;
Step 41:
As shown in fig. 36, which is the 31st clock cycle, convolution kernel data 1, 3, 5, 7 and 9 move to the buffer register;
unit 2 reads convolution kernel data 8, unit 3 reads convolution kernel data 6, unit 4 reads convolution kernel data 4, and unit 5 reads convolution kernel data 2;
the accumulated result of unit 1 is output through the activation function;
units 2, 3, 4 and 5 start multiplication, and the accumulation results are: convolution kernel data 8 × pixel 53, convolution kernel data 6 × pixel 45, convolution kernel data 4 × pixel 44, convolution kernel data 2 × pixel 36.
The working principle is as follows:
The invention discloses an integrated circuit design method for accelerating convolutional neural network (CNN) operations on large volumes of data, and belongs to the field of integrated circuit accelerated calculation for real-time processing of large data volumes (for example, image processing and sound data processing). Convolution kernel data and external data are input in parallel, from different directions, into the multiplication accumulator unit queue, and all multiplication accumulator units simultaneously process, in parallel, the data flowing through them and output the results. Because the input data is reused many times through the constructed closed loop of data flow, the number of reads of the external memory is greatly reduced; and the multi-stage pipeline technique greatly improves the utilization rate of the multiplication accumulator units. The circuit therefore greatly reduces the bandwidth requirement for reading data from the external memory during calculation and accelerates the CNN computation.
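As a software intuition for this dataflow — an editorial sketch, not the patented circuit — each unit accumulates the dot product of the kernel, read column by column, with its own sliding window, so several units produce adjacent outputs in parallel while each datum is fetched from external memory only once. The function name simulate_unit_queue, the 0-indexed windows, and the ReLU activation are assumptions:

```python
import numpy as np

def simulate_unit_queue(kernel, image, num_units=5):
    """Model the result (not the cycle-level timing) of the unit queue:
    unit k accumulates the kernel (column-major, matching the
    kernel-queue order 1..9) against window rows k..k+n-1 of the
    first n columns of the image."""
    n = kernel.shape[0]
    outputs = []
    for k in range(num_units):
        window = image[k:k + n, 0:n]
        acc = 0.0
        for c in range(n):          # columns first: kernel data 1, 2, 3
            for r in range(n):      # are column 1, just as the units read them
                acc += kernel[r, c] * window[r, c]
        outputs.append(max(0.0, acc))  # assumed ReLU activation module
    return outputs
```

For the running example (N = 3), unit 1 accumulates convolution kernel data 1 to 9 against pixels 1-3, 11-13 and 21-23, which matches the products listed for unit 1 in steps 12 to 20.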
Technical solutions of the present invention, and similar technical solutions designed by those skilled in the art based on the teachings of the present invention to achieve the above technical effects, all fall within the protection scope of the present invention.
Claims (10)
1. A method for accelerating computation of an integrated circuit based on a convolutional neural network algorithm is characterized by comprising the following steps:
inputting convolution kernel data and external data from different directions to a multiplication accumulator unit queue in parallel;
each multiplication accumulator unit in the multiplication accumulator unit queue simultaneously and in parallel carries out corresponding multiply-accumulate processing on the convolution kernel data and the external data flowing through it, and respectively outputs the results to the data storage unit;
the corresponding multiply-accumulate processing carried out simultaneously and in parallel by each multiplication accumulator unit in the multiplication accumulator unit queue further comprises the following steps:
inputting at least one convolution kernel data into corresponding multiplication accumulator units respectively;
respectively inputting external data in the input queue into corresponding multiplication accumulator units according to the queue sequence;
performing convolution operation on convolution kernel data and external data in each row of multiplication accumulator units simultaneously in parallel by a multi-stage pipeline technology;
the convolution kernel data is a pre-designed convolution kernel matrix, and the external data is data continuously generated by external input equipment;
reading in the pre-designed convolution kernel matrix and the data continuously generated by the external input device through an external storage reading engine; distributing, through data distribution, the convolution kernels and the external data read in by the reading engine to a convolution kernel data input queue, a head data preprocessing buffer area and corresponding data input queues, wherein the head data of the external data is distributed to the head data preprocessing buffer area and the non-head data of the external data is distributed to the corresponding data input queues; correspondingly distributing, by the convolution kernel data input queue, each convolution kernel to a corresponding multiplication accumulator unit of a row of multiplication accumulator unit queues; correspondingly outputting the head data of the external data to corresponding multiplication accumulator units of the multiplication accumulator unit queue; circularly outputting, by the data input queues, the non-head data of the external data to corresponding multiplication accumulator units of the multiplication accumulator unit queue; and performing, by the corresponding multiplication accumulators of the multiplication accumulator unit queue, the corresponding multiply-accumulate operations and respectively outputting the results of the corresponding multiply-accumulate operations to the output data storage unit.
2. The method according to claim 1, wherein if the data continuously generated by the external input device is external data, the CNN operation function of the convolutional neural network is:
the result of the convolution operation = convolution kernel matrix × external data matrix;
the convolution kernel matrix: a linear convolution kernel data matrix;
the external data matrix: an external data matrix having a two-dimensional data structure and containing M × N external data;
the result of the convolution operation: a convolution operation result matrix having a two-dimensional data structure and containing M × N convolution operation results, each convolution operation result being the accumulated sum of the products of the convolution kernel data and the corresponding external data.
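As an editorial illustration of this operation function — each result element read as the accumulated sum over a sliding N × N window — here is a hedged numpy sketch; the name conv2d_valid and the valid-padding choice are assumptions, not claim language:

```python
import numpy as np

def conv2d_valid(kernel, data, step=1):
    """Each output element is the sum of elementwise products between
    the kernel and the window of external data it currently covers."""
    n = kernel.shape[0]
    rows = (data.shape[0] - n) // step + 1
    cols = (data.shape[1] - n) // step + 1
    out = np.empty((rows, cols))
    for r in range(rows):
        for c in range(cols):
            window = data[r * step:r * step + n, c * step:c * step + n]
            out[r, c] = np.sum(kernel * window)
    return out
```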
3. The method of claim 1, wherein there is only one queue of multiply accumulator units.
4. The method of claim 1, wherein the multiply accumulator unit queue processes the data matrix sequentially, the sequential data processing comprising:
processing data in top-to-bottom order; or
processing data in bottom-to-top order; or
processing data in left-to-right order; or
processing data in right-to-left order.
5. The method of claim 1, wherein the input processing of the multiply accumulator unit comprises an initialization phase and a subsequent phase, the initialization phase comprising the 1st through Nth clock cycles and the subsequent phase comprising the clock cycles after the Nth clock cycle.
6. The method of claim 5, wherein the initialization phase employs a first input rule of an external data flow closed loop, the first input rule comprising:
before a multiplication accumulator unit performs an operation:
if it is the 1st to (N-M)th input of the 1st multiplication accumulator unit, the external data head preprocessing buffer is read, the data of the preprocessing buffer coming from the 1st to Nth columns and the 1st to (N-M)th rows;
if it is the (N-M+1)th to Nth input of the 1st multiplication accumulator unit, the first data input queue is read and cycled through repeatedly in order;
if it is the 1st to (N-M)th input of a multiplication accumulator unit other than the 1st, the data output by the previous-stage multiplication accumulator unit is read, operation being stopped while no such data is available;
if it is the (N-M+1)th to Nth input of a multiplication accumulator unit other than the 1st, the corresponding data input queue is read and cycled through repeatedly in order.
7. The method of claim 5, wherein the subsequent stage employs a second input rule of the external data flow closed loop, the second input rule comprising:
before a multiplication accumulator unit performs an operation:
if it is the 1st to (N-M)th input of the 1st multiplication accumulator unit, the data output by the last multiplication accumulator unit is read;
if it is the (N-M+1)th to Nth input of the 1st multiplication accumulator unit, input queue 1 of the data is read and cycled through repeatedly in order;
if it is the 1st to (N-M)th input of a multiplication accumulator unit other than the 1st, the data output by the previous-stage multiplication accumulator unit is read, operation being stopped while no such data is available;
if it is the (N-M+1)th to Nth input of a multiplication accumulator unit other than the 1st, the corresponding data input queue is read and cycled through repeatedly in order.
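Claims 6 and 7 can be condensed into one source-selection function. The sketch below is an editorial restatement using the claims' symbols N and M; the function name, the string labels, and the omission of the later units' pipeline-fill idling are assumptions:

```python
N, M = 3, 1  # kernel width and sliding step, as in the running example

def input_source(t, unit_idx, initialized):
    """Which source feeds a unit's t-th input (t is 1-based); unit_idx 0
    is the 1st multiplication accumulator. Pipeline-fill idling of the
    later units during initialization is deliberately elided."""
    if unit_idx == 0:
        if t <= N - M:
            # claim 6: head preprocessing buffer while initializing;
            # claim 7: the last unit's forwarded data afterwards
            return "head_buffer" if not initialized else "last_unit"
        return "input_queue_1"     # inputs N-M+1 .. N, cycled in order
    if t <= N - M:
        return "previous_unit"
    return f"input_queue_{unit_idx + 1}"
```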
8. The method of claim 7, wherein if a multiply accumulator unit completes the operation, then processing the current input data according to a data flow closed loop completion operation rule;
the completion operation rule comprises:
if the data is input for the 1st to Mth time, it is directly discarded;
if the data is input for the (M+1)th to Nth time and the unit is not the last multiply accumulator, it moves down in sequence to the adjacent multiply accumulator unit;
if the data is input for the (M+1)th to Nth time and the unit is the last multiply accumulator, it moves to the first multiply accumulator unit.
9. A system for integrated circuit accelerated computing based on convolutional neural network algorithm, comprising: the device comprises an external memory, an external memory reading engine, an input data distribution controller, an input data header preprocessing buffer unit, a row of multiplication accumulator unit queues, a convolution kernel data input queue unit, a data input queue unit and an output data storage unit;
the external memory is used for storing a pre-designed convolution kernel matrix and data continuously generated by external input equipment;
the external memory reading engine is used for reading and outputting the convolution kernel matrix and the external data in the external memory to the input data distribution controller;
the input data distribution controller is used for distributing the convolution kernel matrix and the external data in the data read by the reading engine to a convolution kernel data input queue, a head data preprocessing buffer area and a corresponding data input queue through data distribution respectively, wherein the head data of the external data is distributed to the head data preprocessing buffer area, and the non-head data of the external data is distributed to the corresponding data input queue;
the input data header preprocessing buffer unit is used for outputting header data of external data to corresponding multiply accumulator units of the multiply accumulator unit queue;
the convolution kernel data input queue unit is used for correspondingly distributing each convolution kernel to a corresponding multiplication accumulator unit of a row of multiplication accumulator unit queues;
the data input queue unit is used for circularly outputting non-head data of the external data to corresponding multiply accumulator units of the multiply accumulator unit queue;
the row of multiplication accumulator unit queues comprises multiplication accumulator units, the multiplication accumulator units being used for the corresponding multiply-accumulate operations and respectively outputting the results of the corresponding multiply-accumulate operations to the output data storage unit;
the output of the external memory is connected with the external memory reading engine, the output of the external memory reading engine is connected with the input data distribution controller, one output end of the input data distribution controller is connected with the convolution kernel data input queue unit, the other output end of the input data distribution controller is connected with the input data head preprocessing buffer unit and the corresponding data input queue units, the output of the convolution kernel data input queue unit is connected with the convolution kernel input end of the row of multiplication accumulator unit queues, the multiplication accumulator units of the row of multiplication accumulator unit queues are connected in turn, the input data head preprocessing buffer unit is connected with the data input end of the row of multiplication accumulator unit queues, the corresponding multiplication accumulator units of the row of multiplication accumulator unit queues are respectively connected with the corresponding data input queue units, and each multiplication accumulator unit of the row of multiplication accumulator unit queues outputs to a corresponding input of the output data storage unit.
10. The system of claim 9, wherein the multiply accumulator unit comprises a convolution kernel register, an external data register, a multiplier, an adder, an activation function module;
the convolution kernel register and one input end of the multiplier jointly receive each convolution kernel data, the external data register and the other input end of the multiplier jointly receive each external data, the multiplier outputs to the adder, the adder outputs to the activation function module, the activation function module outputs to the corresponding port of the output data storage unit, the convolution kernel register outputs to the convolution kernel data input end of the next multiplication accumulator, and the external data register outputs to the external data input end of the next multiplication accumulator.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910368448.1A CN110188869B (en) | 2019-05-05 | 2019-05-05 | Method and system for integrated circuit accelerated calculation based on convolutional neural network algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910368448.1A CN110188869B (en) | 2019-05-05 | 2019-05-05 | Method and system for integrated circuit accelerated calculation based on convolutional neural network algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110188869A CN110188869A (en) | 2019-08-30 |
CN110188869B true CN110188869B (en) | 2021-08-10 |
Family
ID=67715675
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910368448.1A Active CN110188869B (en) | 2019-05-05 | 2019-05-05 | Method and system for integrated circuit accelerated calculation based on convolutional neural network algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110188869B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110796250A (en) * | 2019-10-11 | 2020-02-14 | 浪潮电子信息产业股份有限公司 | Convolution processing method and system applied to convolutional neural network and related components |
CN110807521B (en) * | 2019-10-29 | 2022-06-24 | 中昊芯英(杭州)科技有限公司 | Processing device, chip, electronic equipment and method supporting vector operation |
CN112784207B (en) * | 2019-11-01 | 2024-02-02 | 中科寒武纪科技股份有限公司 | Operation method and related product |
TWI733334B (en) * | 2020-02-15 | 2021-07-11 | 財團法人工業技術研究院 | Convolutional neural-network calculating apparatus and operation methods thereof |
CN112051981B (en) * | 2020-09-15 | 2023-09-01 | 厦门壹普智慧科技有限公司 | Data pipeline calculation path structure and single-thread data pipeline system |
CN112328962B (en) * | 2020-11-27 | 2021-12-31 | 深圳致星科技有限公司 | Matrix operation optimization method, device and equipment and readable storage medium |
CN118519613A (en) * | 2024-07-23 | 2024-08-20 | 珠海皓泽科技有限公司 | Multiply-accumulator operation cluster and data processing method |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102662623A (en) * | 2012-04-28 | 2012-09-12 | 电子科技大学 | Parallel matrix multiplier based on single FPGA and implementation method thereof |
CN104915322A (en) * | 2015-06-09 | 2015-09-16 | 中国人民解放军国防科学技术大学 | Method for accelerating convolution neutral network hardware and AXI bus IP core thereof |
CN105589677A (en) * | 2014-11-17 | 2016-05-18 | 沈阳高精数控智能技术股份有限公司 | Systolic structure matrix multiplier based on FPGA (Field Programmable Gate Array) and implementation method thereof |
CN107704921A (en) * | 2017-10-19 | 2018-02-16 | 北京智芯原动科技有限公司 | The algorithm optimization method and device of convolutional neural networks based on Neon instructions |
CN107844826A (en) * | 2017-10-30 | 2018-03-27 | 中国科学院计算技术研究所 | Neural-network processing unit and the processing system comprising the processing unit |
CN107862374A (en) * | 2017-10-30 | 2018-03-30 | 中国科学院计算技术研究所 | Processing with Neural Network system and processing method based on streamline |
CN108182471A (en) * | 2018-01-24 | 2018-06-19 | 上海岳芯电子科技有限公司 | A kind of convolutional neural networks reasoning accelerator and method |
US10073816B1 (en) * | 2017-05-11 | 2018-09-11 | NovuMind Limited | Native tensor processor, and partitioning of tensor contractions |
CN108537330A (en) * | 2018-03-09 | 2018-09-14 | 中国科学院自动化研究所 | Convolutional calculation device and method applied to neural network |
CN108665059A (en) * | 2018-05-22 | 2018-10-16 | 中国科学技术大学苏州研究院 | Convolutional neural networks acceleration system based on field programmable gate array |
CN108764466A (en) * | 2018-03-07 | 2018-11-06 | 东南大学 | Convolutional neural networks hardware based on field programmable gate array and its accelerated method |
CN109086867A (en) * | 2018-07-02 | 2018-12-25 | 武汉魅瞳科技有限公司 | A kind of convolutional neural networks acceleration system based on FPGA |
CN109146067A (en) * | 2018-11-19 | 2019-01-04 | 东北大学 | A kind of Policy convolutional neural networks accelerator based on FPGA |
CN109190756A (en) * | 2018-09-10 | 2019-01-11 | 中国科学院计算技术研究所 | Arithmetic unit based on Winograd convolution and the neural network processor comprising the device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170344876A1 (en) * | 2016-05-31 | 2017-11-30 | Samsung Electronics Co., Ltd. | Efficient sparse parallel winograd-based convolution scheme |
Patent Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102662623A (en) * | 2012-04-28 | 2012-09-12 | 电子科技大学 | Parallel matrix multiplier based on single FPGA and implementation method thereof |
CN105589677A (en) * | 2014-11-17 | 2016-05-18 | 沈阳高精数控智能技术股份有限公司 | Systolic structure matrix multiplier based on FPGA (Field Programmable Gate Array) and implementation method thereof |
CN104915322A (en) * | 2015-06-09 | 2015-09-16 | 中国人民解放军国防科学技术大学 | Method for accelerating convolution neutral network hardware and AXI bus IP core thereof |
US10073816B1 (en) * | 2017-05-11 | 2018-09-11 | NovuMind Limited | Native tensor processor, and partitioning of tensor contractions |
CN107704921A (en) * | 2017-10-19 | 2018-02-16 | 北京智芯原动科技有限公司 | The algorithm optimization method and device of convolutional neural networks based on Neon instructions |
CN107844826A (en) * | 2017-10-30 | 2018-03-27 | 中国科学院计算技术研究所 | Neural-network processing unit and the processing system comprising the processing unit |
CN107862374A (en) * | 2017-10-30 | 2018-03-30 | 中国科学院计算技术研究所 | Processing with Neural Network system and processing method based on streamline |
CN108182471A (en) * | 2018-01-24 | 2018-06-19 | 上海岳芯电子科技有限公司 | A kind of convolutional neural networks reasoning accelerator and method |
CN108764466A (en) * | 2018-03-07 | 2018-11-06 | 东南大学 | Convolutional neural networks hardware based on field programmable gate array and its accelerated method |
CN108537330A (en) * | 2018-03-09 | 2018-09-14 | 中国科学院自动化研究所 | Convolutional calculation device and method applied to neural network |
CN108665059A (en) * | 2018-05-22 | 2018-10-16 | 中国科学技术大学苏州研究院 | Convolutional neural networks acceleration system based on field programmable gate array |
CN109086867A (en) * | 2018-07-02 | 2018-12-25 | 武汉魅瞳科技有限公司 | A kind of convolutional neural networks acceleration system based on FPGA |
CN109190756A (en) * | 2018-09-10 | 2019-01-11 | 中国科学院计算技术研究所 | Arithmetic unit based on Winograd convolution and the neural network processor comprising the device |
CN109146067A (en) * | 2018-11-19 | 2019-01-04 | 东北大学 | A kind of Policy convolutional neural networks accelerator based on FPGA |
Non-Patent Citations (2)
Title |
---|
Real-time meets approximate computing: An elastic CNN inference accelerator with adaptive trade-off between QoS and QoR; Ying Wang et al.; 2017 54th ACM/EDAC/IEEE Design Automation Conference; 2017-10-31; pp. 1-6 *
A concise and efficient method for accelerating convolutional neural networks (《一种简洁高效的加速卷积神经网络的方法》); Liu Jinfeng et al.; Science Technology and Engineering (《科学技术与工程》); 2014-11-30; Vol. 14, No. 33; pp. 240-244 *
Also Published As
Publication number | Publication date |
---|---|
CN110188869A (en) | 2019-08-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110188869B (en) | Method and system for integrated circuit accelerated calculation based on convolutional neural network algorithm | |
CN106970896B (en) | Vector processor-oriented vectorization implementation method for two-dimensional matrix convolution | |
Yepez et al. | Stride 2 1-D, 2-D, and 3-D Winograd for convolutional neural networks | |
EP3659051B1 (en) | Accelerated mathematical engine | |
US20190095776A1 (en) | Efficient data distribution for parallel processing | |
CN110458279B (en) | FPGA-based binary neural network acceleration method and system | |
US10585621B2 (en) | Statically-schedulable feed and drain structure for systolic array architecture | |
CN108537330B (en) | Convolution computing device and method applied to neural network | |
CN111758107B (en) | System and method for hardware-based pooling | |
CN112292694A (en) | Method for accelerating operation and accelerator device | |
CN108629406B (en) | Arithmetic device for convolutional neural network | |
US20240265234A1 (en) | Digital Processing Circuits and Methods of Matrix Operations in an Artificially Intelligent Environment | |
CN109844738A (en) | Arithmetic processing circuit and identifying system | |
CN108170640B (en) | Neural network operation device and operation method using same | |
CN110989920A (en) | Energy efficient memory system and method | |
TW202123093A (en) | Method and system for performing convolution operation | |
CN110738308A (en) | neural network accelerators | |
CN110766128A (en) | Convolution calculation unit, calculation method and neural network calculation platform | |
CN102411558A (en) | Vector processor oriented large matrix multiplied vectorization realizing method | |
CN110851779B (en) | Systolic array architecture for sparse matrix operations | |
CN110705703A (en) | Sparse neural network processor based on systolic array | |
CN110674927A (en) | Data recombination method for pulse array structure | |
CN107680028B (en) | Processor and method for scaling an image | |
CN110580519B (en) | Convolution operation device and method thereof | |
CN115310037A (en) | Matrix multiplication computing unit, acceleration unit, computing system and related method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||