CN114675806B - Pulsation matrix unit and pulsation matrix calculation device - Google Patents
Pulsation matrix unit and pulsation matrix calculation device
- Publication number
- CN114675806B
- Authority
- CN
- China
- Prior art keywords
- matrix
- input data
- accumulator
- register
- systolic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/544—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
- G06F7/5443—Sum of products
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
Abstract
The invention relates to a systolic matrix unit and a systolic matrix calculation device, and belongs to the field of artificial intelligence. In the unit, the multiplier is connected to the weight register and the data register and multiplies the weight by the input data. The accumulator is connected to the multiplier and the 2-to-1 multiplexer; it adds the product to the accumulated result of the previous clock cycle and sends its output value to the 2-to-1 multiplexer. The 2-to-1 multiplexer is connected to the partial sum register; before the input of the input data is finished, it outputs the output value of the accumulator according to a first control signal, and after the input of the input data is finished, it stops outputting the output value of the accumulator according to a second control signal. The partial sum register stores the output value of the accumulator before the input of the input data is completed and outputs it after the input of the input data is completed. The invention saves both time and hardware cost.
Description
Technical Field
The invention relates to the field of artificial intelligence, and in particular to a systolic matrix unit and a systolic matrix calculation device.
Background
With the rise of artificial intelligence, deep learning is increasingly applied in various fields. Deep learning relies very heavily on operations related to matrix multiplication. At present, a relatively convenient and fast way to perform matrix operations is the systolic array. The basic idea of the systolic array is as follows: in a matrix multiplication A × B = Y, either matrix B is held fixed, A flows through the systolic matrix units, and Y is output continuously; or both A and B flow through the systolic matrix units and the result Y is stored in the systolic matrix units.
One calculation method of the systolic array is as follows: during matrix multiplication, the input data is passed from left to right through the systolic matrix units, the weight data is passed from top to bottom through the systolic matrix units, and the final calculation result is stored in each systolic matrix unit. When the matrix is large, extracting the final results over a bus takes little time but incurs a large hardware cost for the bus; shifting the results out of the array one by one after the calculation is finished, in the same way as the weight data and input data are moved, takes a long time.
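The calculation method described above can be illustrated with a small cycle-level sketch. It is only a software model under stated assumptions, not the circuit of the invention: the function name `systolic_matmul`, the one-hop-per-cycle shifting, and the skewed feeding of rows and columns are illustrative choices that do not come from the patent text.

```python
def systolic_matmul(A, B):
    """Cycle-level sketch of a systolic array that keeps each result in its own cell.

    Assumptions (not from the patent): values move exactly one cell per cycle,
    row i of A is fed i cycles late and column j of B is fed j cycles late.
    """
    n, k, m = len(A), len(B), len(B[0])
    acc = [[0] * m for _ in range(n)]    # accumulator of each cell
    a_reg = [[0] * m for _ in range(n)]  # input data currently held by each cell
    b_reg = [[0] * m for _ in range(n)]  # weight currently held by each cell
    cycles = n + m + k - 2               # cycles until the bottom-right cell finishes

    for t in range(cycles):
        # Input data moves one cell to the right (update the far side first).
        for i in range(n):
            for j in range(m - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        # Weight data moves one cell down.
        for j in range(m):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # Feed the left and top edges with skewed data, zero outside the valid window.
        for i in range(n):
            s = t - i
            a_reg[i][0] = A[i][s] if 0 <= s < k else 0
        for j in range(m):
            s = t - j
            b_reg[0][j] = B[s][j] if 0 <= s < k else 0
        # Every cell multiplies what it holds and accumulates the product.
        for i in range(n):
            for j in range(m):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]

    return acc, cycles
```

Calling `systolic_matmul(A, B)` on two 3 × 3 matrices returns the product together with the number of cycles the model needs before the last cell holds its final value. The sketch stops there on purpose: how the stored results are then brought out of the array is the problem the invention addresses.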
Disclosure of Invention
The invention aims to provide a systolic matrix unit and a systolic matrix calculation device that save both time and hardware cost.
In order to achieve the purpose, the invention provides the following scheme:
a systolic matrix unit comprising: a weight register, a data register, a multiplier, an accumulator, a 2-to-1 multiplexer and a partial sum register;
the weight register is used for storing weights;
the data register is used for storing input data;
the multiplier is connected to the weight register and to the data register; the multiplier is used for multiplying the weight by the input data;
the accumulator is connected to the multiplier and to the 2-to-1 multiplexer; the accumulator is used for adding the product to the accumulated result of the previous clock cycle and sending the output value of the accumulator to the 2-to-1 multiplexer;
the 2-to-1 multiplexer is connected to the partial sum register; the 2-to-1 multiplexer is used for outputting the output value of the accumulator according to a first control signal before the input of the input data is finished; the 2-to-1 multiplexer is also used for stopping the output of the output value of the accumulator according to a second control signal after the input of the input data is finished;
the partial sum register is used for storing the output value of the accumulator before the input of the input data is completed and outputting the output value of the accumulator after the input of the input data is completed.
Optionally, the weight is a 3 x 3 matrix.
Optionally, the input data is a 3 x 3 matrix.
A systolic matrix calculation device for implementing the above systolic matrix unit, comprising: an array controller, a weight storage unit, an output data storage unit, an input data storage unit, a systolic array and a plurality of delay elements; the systolic array comprises a plurality of systolic matrix units;
the weight storage unit is connected to the array controller and to the systolic array;
the array controller is connected to the output data storage unit, the input data storage unit and the systolic array;
the systolic array is connected to the input data storage unit and to the output data storage unit;
a delay element is arranged between every two adjacent systolic matrix units;
the input data storage unit generates a sending signal according to whether the input data of each systolic matrix unit has been completely sent;
the array controller is used for generating the first control signal and the second control signal according to the sending signal; the array controller is also used for controlling the output data storage unit to receive the output value of the accumulator of the corresponding systolic matrix unit once the input data of that systolic matrix unit has been completely sent.
Optionally, the systolic array is a 3 x 3 matrix.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
with the systolic matrix unit and the systolic matrix calculation device provided by the invention, the 2-to-1 multiplexer and the partial sum register allow the data in the systolic matrix units that finish first to be taken out before the whole calculation has finished. This saves time compared with taking the data out of the systolic matrix units one by one after the whole calculation has finished, and saves hardware cost compared with taking the data out over a bus.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of a systolic matrix unit structure provided in the present invention;
FIG. 2 is a schematic structural diagram of a systolic matrix computing device according to the present invention;
FIG. 3 is a diagram illustrating the transmission of output results between systolic array elements in the same column;
FIGS. 4-14 are schematic operation flow diagrams of the systolic matrix calculation device provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a systolic matrix unit and a systolic matrix calculation device that save both time and hardware cost.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
FIG. 1 is a schematic structural diagram of the systolic matrix unit provided by the present invention. As shown in FIG. 1, the systolic matrix unit provided by the present invention includes: a weight register, a data register, a multiplier, an accumulator, a 2-to-1 multiplexer and a partial sum register;
the weight register is used for storing weights;
the data register is used for storing input data;
the multiplier is connected to the weight register and to the data register; the multiplier is used for multiplying the weight by the input data;
the accumulator is connected to the multiplier and to the 2-to-1 multiplexer; the accumulator is used for adding the product to the accumulated result of the previous clock cycle and sending the output value of the accumulator to the 2-to-1 multiplexer;
the 2-to-1 multiplexer is connected to the partial sum register; the 2-to-1 multiplexer is used for outputting the output value of the accumulator according to a first control signal before the input of the input data is finished; the 2-to-1 multiplexer is also used for stopping the output of the output value of the accumulator according to a second control signal after the input of the input data is finished;
the partial sum register is used for storing the output value of the accumulator before the input of the input data is completed and outputting the output value of the accumulator after the input of the input data is completed.
The first control signal and the second control signal are each a high-level signal or a low-level signal.
As shown in FIG. 1, the weight data is passed from top to bottom and the input data is passed from left to right; the weight and the input data are multiplied, and the accumulator adds the product to the sum of the previous clock cycle. These features are the same as those of a conventional systolic array unit; the difference lies in the structure on the right side. Before all the input data of the first row has been input, the array controller uses the first control signal to make the 2-to-1 multiplexer select the output value of the accumulator in the same unit, and the result is stored in the partial sum register.
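The behaviour of a single unit can be sketched in software as follows. This is a simplified model, not the circuit of FIG. 1: the class name `SystolicCell`, the `step` method, the Boolean `select_acc` flag standing in for the first and second control signals, and the `psum_below` argument standing in for the partial sum arriving from the unit in the next row are all hypothetical names chosen for illustration, and the model updates the partial sum register in the same step as the accumulator.

```python
class SystolicCell:
    """Simplified software model of one systolic matrix unit (illustrative only)."""

    def __init__(self):
        self.weight_reg = 0  # weight register
        self.data_reg = 0    # data register
        self.acc = 0         # accumulator
        self.psum_reg = 0    # partial sum register

    def step(self, weight_in, data_in, select_acc, psum_below=0):
        """Advance the unit by one clock cycle and return the partial sum register value.

        select_acc=True models the first control signal (pass the accumulator);
        select_acc=False models the second control signal (stop passing the
        accumulator and take the value offered by the unit below instead).
        """
        # Latch the incoming weight and input data.
        self.weight_reg = weight_in
        self.data_reg = data_in
        # Multiply and add to the accumulated result of the previous cycle.
        self.acc += self.weight_reg * self.data_reg
        # 2-to-1 multiplexer in front of the partial sum register.
        self.psum_reg = self.acc if select_acc else psum_below
        return self.psum_reg
```

For example, stepping one cell with weights 1, 2, 3 and input data 4, 5, 6 while `select_acc` is true leaves the dot product in both `acc` and `psum_reg`; a further step with `select_acc=False` replaces `psum_reg` with whatever is passed as `psum_below` while `acc` keeps its value.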
As a specific example, the weight is a 3 x 3 matrix.
As a specific example, the input data is a 3 x 3 matrix.
As shown in FIG. 2, the systolic matrix calculation device provided by the present invention is configured to implement the above systolic matrix unit, and includes: an array controller, a weight storage unit, an output data storage unit, an input data storage unit, a systolic array and a plurality of delay elements; the systolic array comprises a plurality of systolic matrix units;
the weight storage unit is connected to the array controller and to the systolic array;
the array controller is connected to the output data storage unit, the input data storage unit and the systolic array;
the systolic array is connected to the input data storage unit and to the output data storage unit;
a delay element is arranged between every two adjacent systolic matrix units; the calculation result is available one cycle later for each unit further to the right or further down, so the 2-to-1 multiplexer should also pass the accumulator value upward one cycle later.
The input data storage unit generates a sending signal according to whether the input data of each systolic matrix unit has been completely sent;
the array controller is used for generating the first control signal and the second control signal according to the sending signal; the array controller is also used for controlling the output data storage unit to receive the output value of the accumulator of the corresponding systolic matrix unit once the input data of that systolic matrix unit has been completely sent.
As a specific example, the systolic array is a 3 x 3 matrix.
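The one-cycle lag per unit to the right or downward can be made concrete with a short sketch. It assumes, beyond what the text states, that data advances exactly one unit per cycle, that row i and column j are fed i and j cycles late, and that cycles are counted from 1; the function name `completion_cycles` is hypothetical.

```python
def completion_cycles(rows=3, cols=3, depth=3):
    """Cycle (counted from 1) in which each unit accumulates its last product.

    Assumption: one hop per cycle with skewed feeding, so the unit in row i,
    column j (both counted from 0) finishes i + j cycles after the top-left unit.
    depth is the number of products each unit accumulates.
    """
    return [[i + j + depth for j in range(cols)] for i in range(rows)]
```

Under these assumptions a 3 x 3 array gives the table `[[3, 4, 5], [4, 5, 6], [5, 6, 7]]`: every step to the right or downward adds one cycle, which is why the select signal for each following unit is passed through a delay element.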
As shown in FIG. 2, the partial sum registers of the systolic matrix units in the first row are connected to the output storage unit, but the output storage unit does not receive the values of the first-row partial sum registers until all the input data of the first row has been input. The partial sum register of each systolic matrix unit in the other rows is connected to the 2-to-1 multiplexer of the unit in the row above. When all the input data of the first row has been input, the array controller receives the signal from the input storage unit and then changes the control signal of the 2-to-1 multiplexer so that it selects the partial sum of the next systolic matrix unit. At the same time, the control signal makes the output storage unit start receiving the partial sum register value of the first column, and so on.
As shown in FIG. 3, take three units in the same column as an example. A, B and C denote the final calculation results of the first, second and third systolic matrix units, respectively. B-1 denotes that the second unit is still one calculation cycle short of its final result, C-1 denotes that the third unit is still one calculation cycle short, and C-2 denotes that the third unit is still two calculation cycles short. Time t1 is defined as the time at which the final calculation result A of the first unit enters the partial sum register of the first row.
As shown in FIG. 3, at time t1, A, B-1 and C-2 enter the partial sum registers of the first, second and third rows, respectively. At time t2, the array controller makes the output storage unit receive the value A from the partial sum register of the first row, while the values in the partial sum registers of the second and third rows change to B and C-1. At time t3, the array controller toggles the select signal of the 2-to-1 multiplexer so that B enters the partial sum register of the first row, and the value in the partial sum register of the third row changes to C. At time t4, B enters the output register; the select signal, having passed through a delay element, changes the select signal of the second-row 2-to-1 multiplexer, and C enters the partial sum register of the second row. At time t5, C enters the partial sum register of the first row. At time t6, C enters the output register. Note that, because A, B and C enter the output register at times t2, t4 and t6 respectively, the clock period of the output register is chosen to be twice the clock period of the systolic array.
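The t1 to t6 sequence above can be replayed with a small symbolic sketch. It is an interpretation of the description, not the actual control logic: the function name `drain_column`, the string labels such as "B-1", the rule that row r switches its multiplexer at t3 plus r cycles, and the output register capturing on even ticks the value the first row held one tick earlier are assumptions made to match the described timing for three rows.

```python
def drain_column(labels=("A", "B", "C")):
    """Replay the column read-out of FIG. 3 symbolically and return what the
    output register receives, keyed by time index (1 means t1, 2 means t2, ...)."""
    rows = len(labels)
    t1 = 1  # time at which the first row's final result reaches its partial sum register

    def acc(r, t):
        # Row r reaches its final value r cycles after the first row; before
        # that, its value is written as label-k, meaning k cycles short.
        lag = (t1 + r) - t
        return labels[r] if lag <= 0 else f"{labels[r]}-{lag}"

    psum_prev = [None] * rows  # partial sum registers as they were one tick earlier
    received = {}
    for t in range(t1, t1 + 2 * rows):
        psum = []
        for r in range(rows):
            # The select signal reaches row r through r delay elements.
            switch_time = t1 + 2 + r
            take_below = t >= switch_time and r + 1 < rows
            psum.append(psum_prev[r + 1] if take_below else acc(r, t))
        # Output register runs at half the array clock rate (even ticks only).
        if t % 2 == 0:
            received[t] = psum_prev[0]
        psum_prev = psum
    return received
```

With the default labels the function returns `{2: "A", 4: "B", 6: "C"}`, matching the statement that A, B and C enter the output register at t2, t4 and t6.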
FIGS. 4-14 are schematic diagrams of the operation flow. Taking the multiplication of two 3 × 3 matrices, an input data matrix and a weight matrix (the matrix values are not reproduced here), as an example, the eleven diagrams in FIGS. 4 to 14 show the operation flow of the method. As shown in FIGS. 4 to 14, when the two 3 × 3 matrices are multiplied, all results have been calculated and stored in the output register for subsequent calculation after 11 calculation cycles, which greatly reduces the calculation time.
The embodiments in this description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments may be referred to one another.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.
Claims (5)
1. A systolic matrix unit, comprising: a weight register, a data register, a multiplier, an accumulator, a 2-to-1 multiplexer and a partial sum register;
the weight register is used for storing weights;
the data register is used for storing input data;
the multiplier is connected to the weight register and to the data register; the multiplier is used for multiplying the weight by the input data;
the accumulator is connected to the multiplier and to the 2-to-1 multiplexer; the accumulator is used for adding the product to the accumulated result of the previous clock cycle and sending the output value of the accumulator to the 2-to-1 multiplexer;
the 2-to-1 multiplexer is connected to the partial sum register; the 2-to-1 multiplexer is used for outputting the output value of the accumulator according to a first control signal before the input of the input data is completed; the 2-to-1 multiplexer is also used for stopping the output of the output value of the accumulator according to a second control signal after the input of the input data is completed;
the partial sum register is used for storing the output value of the accumulator before the input of the input data is completed and outputting the output value of the accumulator after the input of the input data is completed;
a delay element is arranged between two adjacent systolic matrix units; the calculation result is delayed by one cycle for each systolic matrix unit further to the right or further down, and the 2-to-1 multiplexer is correspondingly delayed by one cycle before passing the value of the accumulator upward.
2. The systolic matrix unit of claim 1, wherein the weight is a 3 x 3 matrix.
3. The systolic matrix unit of claim 1, wherein the input data is a 3 x 3 matrix.
4. A systolic matrix calculation device for implementing the systolic matrix unit of any one of claims 1-3, comprising: an array controller, a weight storage unit, an output data storage unit, an input data storage unit, a systolic array and a plurality of delay elements; the systolic array comprises a plurality of systolic matrix units;
the weight storage unit is connected to the array controller and to the systolic array;
the array controller is connected to the output data storage unit, the input data storage unit and the systolic array;
the systolic array is connected to the input data storage unit and to the output data storage unit;
a delay element is arranged between every two adjacent systolic matrix units;
the input data storage unit generates a sending signal according to whether the input data of each systolic matrix unit has been completely sent;
the array controller is used for generating the first control signal and the second control signal according to the sending signal; the array controller is also used for controlling the output data storage unit to receive the output value of the accumulator of the corresponding systolic matrix unit once the input data of that systolic matrix unit has been completely sent.
5. The systolic matrix calculation device of claim 4, wherein the systolic array is a 3 x 3 matrix.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210595479.2A CN114675806B (en) | 2022-05-30 | 2022-05-30 | Pulsation matrix unit and pulsation matrix calculation device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210595479.2A CN114675806B (en) | 2022-05-30 | 2022-05-30 | Pulsation matrix unit and pulsation matrix calculation device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114675806A CN114675806A (en) | 2022-06-28 |
CN114675806B CN114675806B (en) | 2022-09-23 |
Family
ID=82080063
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210595479.2A Active CN114675806B (en) | 2022-05-30 | 2022-05-30 | Pulsation matrix unit and pulsation matrix calculation device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114675806B (en) |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107578098B (en) * | 2017-09-01 | 2020-10-30 | 中国科学院计算技术研究所 | Neural network processor based on systolic array |
US11188814B2 (en) * | 2018-04-05 | 2021-11-30 | Arm Limited | Systolic convolutional neural network |
CN111291323B (en) * | 2020-02-17 | 2023-12-12 | 南京大学 | Matrix multiplication processor based on systolic array and data processing method thereof |
KR20220015813A (en) * | 2020-07-31 | 2022-02-08 | 삼성전자주식회사 | Method and apparatus for performing deep learning operations. |
-
2022
- 2022-05-30 CN CN202210595479.2A patent/CN114675806B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN114675806A (en) | 2022-06-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112711394B (en) | Circuit based on digital domain memory computing | |
CN111291323B (en) | Matrix multiplication processor based on systolic array and data processing method thereof | |
US5333119A (en) | Digital signal processor with delayed-evaluation array multipliers and low-power memory addressing | |
CN113419705A (en) | Memory multiply-add calculation circuit, chip and calculation device | |
US20210256360A1 (en) | Calculation circuit and deep learning system including the same | |
CN113807509B (en) | Neural network acceleration device, method and communication equipment | |
CN113869498A (en) | Convolution operation circuit and operation method thereof | |
US5297069A (en) | Finite impulse response filter | |
US11556614B2 (en) | Apparatus and method for convolution operation | |
CN114675806B (en) | Pulsation matrix unit and pulsation matrix calculation device | |
US5422836A (en) | Circuit arrangement for calculating matrix operations in signal processing | |
CN111581595A (en) | Matrix multiplication calculation method and calculation circuit | |
CN113885831A (en) | Storage and calculation integrated circuit based on mixed data input, chip and calculation device | |
CN212112470U (en) | Matrix multiplication circuit | |
US20230253032A1 (en) | In-memory computation device and in-memory computation method to perform multiplication operation in memory cell array according to bit orders | |
CN107368459B (en) | Scheduling method of reconfigurable computing structure based on arbitrary dimension matrix multiplication | |
CN116702851A (en) | Pulsation array unit and pulsation array structure suitable for weight multiplexing neural network | |
CN110673824B (en) | Matrix vector multiplication circuit and circular neural network hardware accelerator | |
CN115495152A (en) | Memory computing circuit with variable length input | |
US5163018A (en) | Digital signal processing circuit for carrying out a convolution computation using circulating coefficients | |
CN113346895B (en) | Simulation and storage integrated structure based on pulse cut-off circuit | |
CN101840322B (en) | The arithmetic system of the method that filter arithmetic element is multiplexing and wave filter | |
CN110751263B (en) | High-parallelism convolution operation access method and circuit | |
CN103293373A (en) | Electric energy metering device and electric energy metering chip thereof | |
CN114911453B (en) | Multi-bit multiply-accumulate full-digital memory computing device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||