CN114675806B - Pulsation matrix unit and pulsation matrix calculation device - Google Patents

Publication number
CN114675806B
CN114675806B CN202210595479.2A CN202210595479A CN114675806B CN 114675806 B CN114675806 B CN 114675806B CN 202210595479 A CN202210595479 A CN 202210595479A CN 114675806 B CN114675806 B CN 114675806B
Authority
CN
China
Prior art keywords
matrix
input data
accumulator
register
systolic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210595479.2A
Other languages
Chinese (zh)
Other versions
CN114675806A (en)
Inventor
乔树山
张默寒
尚德龙
周玉梅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongke Nanjing Intelligent Technology Research Institute
Original Assignee
Zhongke Nanjing Intelligent Technology Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongke Nanjing Intelligent Technology Research Institute
Priority to CN202210595479.2A
Publication of CN114675806A
Application granted
Publication of CN114675806B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F7/5443 Sum of products
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization

Abstract

The invention relates to a systolic matrix unit and a systolic matrix calculation device, and belongs to the field of artificial intelligence. In the unit, the multiplier is connected to the weight register and the data register and multiplies the weight by the input data. The accumulator is connected to the multiplier and the 2-to-1 selector; it adds the product to the accumulated result of the previous clock cycle and sends its output value to the 2-to-1 selector. The 2-to-1 selector is connected to the partial sum register; before input of the input data is complete, it outputs the accumulator's output value in response to a first control signal, and after input of the input data is complete, it stops outputting the accumulator's output value in response to a second control signal. The partial sum register stores the accumulator's output value before input of the input data is complete and outputs it after input of the input data is complete. The invention saves both time and hardware cost.

Description

Systolic matrix unit and systolic matrix calculation device
Technical Field
The invention relates to the field of artificial intelligence, and in particular to a systolic matrix unit and a systolic matrix calculation device.
Background
With the rise of artificial intelligence, deep learning is being applied in more and more fields. Matrix multiplication accounts for a very large share of the operations in deep learning, and the systolic array is currently a convenient and fast way to perform it. The basic idea of the systolic array is as follows: in a matrix multiplication A × B = Y, either matrix B is held fixed, A flows through the systolic matrix units, and Y is output continuously; or both A and B flow through the systolic matrix units and the result Y is stored in the units.
One calculation method of the systolic array is as follows: during matrix multiplication, input data are passed from left to right through the systolic array units, weight data are passed from top to bottom, and the final calculation result is stored in each systolic array unit. When the matrix is large, extracting the final results over a bus takes little time but incurs a large hardware cost for the bus; shifting the results out of the array one by one after the calculation has finished, in the same way as the weight data and the input data, takes a long time.
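The second calculation method described above (inputs flowing left to right, weights flowing top to bottom, results staying in the units) can be illustrated with a small cycle-by-cycle simulation. This is only a sketch under stated assumptions: the function name `systolic_matmul`, the zero padding, and the skew of row i and column j by i and j cycles are illustrative choices that do not come from the patent text.

```python
def systolic_matmul(a, b):
    """Simulate a result-stationary systolic array computing a x b.

    a (n x k) flows left to right, b (k x m) flows top to bottom.
    Row i of a enters i cycles late and column j of b enters j cycles late
    (an assumed skew), so matching operands meet in unit (i, j).
    """
    n, k, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]      # one accumulator per unit
    a_reg = [[0] * m for _ in range(n)]    # data registers
    b_reg = [[0] * m for _ in range(n)]    # weight registers
    cycles = k + n + m - 2                 # cycles until the last unit finishes
    for t in range(cycles):
        new_a = [[0] * m for _ in range(n)]
        new_b = [[0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                if j == 0:                 # left edge: fetch from input storage
                    s = t - i
                    new_a[i][j] = a[i][s] if 0 <= s < k else 0
                else:                      # otherwise take the left neighbour's value
                    new_a[i][j] = a_reg[i][j - 1]
                if i == 0:                 # top edge: fetch from weight storage
                    s = t - j
                    new_b[i][j] = b[s][j] if 0 <= s < k else 0
                else:                      # otherwise take the value from above
                    new_b[i][j] = b_reg[i - 1][j]
        a_reg, b_reg = new_a, new_b
        for i in range(n):
            for j in range(m):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]   # multiply-accumulate
    return acc, cycles
```

In this model, two 3 × 3 matrices need 7 cycles before the last unit holds its final value; the results still have to be read out of the units afterwards, which is the cost the invention addresses.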
Disclosure of Invention
The invention aims to provide a systolic matrix unit and a systolic matrix calculation device that save both time and hardware cost.
To achieve this aim, the invention provides the following scheme:
a systolic matrix unit comprising: a weight register, a data register, a multiplier, an accumulator, a 2-to-1 selector and a partial sum register;
the weight register is used for storing weights;
the data register is used for storing input data;
the multiplier is connected to the weight register and to the data register; the multiplier is used for multiplying the weight by the input data;
the accumulator is connected to the multiplier and to the 2-to-1 selector; the accumulator is used for adding the product to the accumulated result of the previous clock cycle and sending the output value of the accumulator to the 2-to-1 selector;
the 2-to-1 selector is connected to the partial sum register; the 2-to-1 selector is used for outputting the output value of the accumulator in response to a first control signal before input of the input data is complete; the 2-to-1 selector is also used for stopping the output of the output value of the accumulator in response to a second control signal after input of the input data is complete;
the partial sum register is used for storing the output value of the accumulator before input of the input data is complete and for outputting the output value of the accumulator after input of the input data is complete.
Optionally, the weight is a 3 x 3 matrix.
Optionally, the input data is a 3 x 3 matrix.
A systolic matrix calculation device for implementing said systolic matrix unit, comprising: an array controller, a weight storage unit, an output data storage unit, an input data storage unit, a systolic array and a plurality of delay elements; the systolic array comprises a plurality of systolic matrix units;
the weight storage unit is connected to the array controller and to the systolic array;
the array controller is connected to the output data storage unit, the input data storage unit and the systolic array;
the systolic array is connected to the input data storage unit and to the output data storage unit;
a delay element is arranged between every two adjacent systolic matrix units;
the input data storage unit generates a sending signal according to whether input of the input data to each systolic matrix unit is complete;
the array controller is used for generating the first control signal and the second control signal according to the sending signal; the array controller is also used for controlling the output data storage unit to receive the output value of the accumulator of the corresponding systolic matrix unit once all the input data of that systolic matrix unit have been transmitted.
Optionally, the systolic array is a 3 x 3 matrix.
According to the specific embodiments provided by the invention, the invention discloses the following technical effects:
with the systolic matrix unit and the systolic matrix calculation device provided by the invention, the 2-to-1 selector and the partial sum register allow the data in the systolic array units that finish first to be taken out before the whole calculation is complete. This saves time compared with taking the data out of the systolic array units one by one after the whole calculation is complete, and saves hardware cost compared with taking the data out of the systolic array units over a bus.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention; a person skilled in the art can derive other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of a systolic matrix unit structure provided in the present invention;
FIG. 2 is a schematic structural diagram of a systolic matrix computing device according to the present invention;
FIG. 3 is a diagram illustrating the transmission of output results between systolic array elements in the same column;
FIGS. 4-14 are schematic operation flow diagrams of a systolic matrix calculation device according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort fall within the protection scope of the present invention.
The invention aims to provide a systolic matrix unit and a systolic matrix calculation device that save both time and hardware cost.
To make the above objects, features and advantages of the present invention easier to understand, the invention is described in further detail below with reference to the drawings and specific embodiments.
FIG. 1 is a schematic structural diagram of a systolic matrix unit provided by the present invention. As shown in FIG. 1, the systolic matrix unit comprises: a weight register, a data register, a multiplier, an accumulator, a 2-to-1 selector and a partial sum register;
the weight register is used for storing weights;
the data register is used for storing input data;
the multiplier is connected to the weight register and to the data register; the multiplier is used for multiplying the weight by the input data;
the accumulator is connected to the multiplier and to the 2-to-1 selector; the accumulator is used for adding the product to the accumulated result of the previous clock cycle and sending the output value of the accumulator to the 2-to-1 selector;
the 2-to-1 selector is connected to the partial sum register; the 2-to-1 selector is used for outputting the output value of the accumulator in response to a first control signal before input of the input data is complete; the 2-to-1 selector is also used for stopping the output of the output value of the accumulator in response to a second control signal after input of the input data is complete;
the partial sum register is used for storing the output value of the accumulator before input of the input data is complete and for outputting the output value of the accumulator after input of the input data is complete.
The first control signal and the second control signal are each a high-level signal or a low-level signal.
As shown in FIG. 1, the weight data are passed from top to bottom and the input data from left to right, and the accumulator adds the product of the weight and the input data to the sum from the previous clock cycle. These features are the same as in a conventional systolic array unit; the difference lies in the structure on the right-hand side. Before all the input data of the first row have been input, the array controller uses the first control signal to make the 2-to-1 selector pass the output value of the accumulator in the same unit, and the result is stored in the partial sum register.
As a specific example, the weight is a 3 x 3 matrix.
As a specific example, the input data is a 3 x 3 matrix.
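The unit structure described above can be sketched as a small behavioural model. The class name `SystolicUnit`, the method `step`, and the parameters `select_own` and `psum_below` are hypothetical names introduced for illustration. The sketch also assumes that the selector's second input is the partial sum register of the unit below, which is how the device description connects the rows; the patent does not prescribe this software form.

```python
from dataclasses import dataclass


@dataclass
class SystolicUnit:
    """Behavioural sketch of one systolic matrix unit (names are illustrative)."""
    weight_reg: int = 0   # weight register
    data_reg: int = 0     # data register
    acc: int = 0          # accumulator
    psum_reg: int = 0     # partial sum register

    def step(self, weight_in, data_in, select_own, psum_below=0):
        """Advance one clock cycle.

        select_own=True  models the first control signal: the 2-to-1 selector
                         passes this unit's accumulator value.
        select_own=False models the second control signal: the selector stops
                         passing the accumulator and forwards psum_below instead.
        Returns the previous register contents, which a caller would hand to
        the unit below (weight) and the unit to the right (data).
        """
        weight_out, data_out = self.weight_reg, self.data_reg
        self.weight_reg, self.data_reg = weight_in, data_in
        self.acc += self.weight_reg * self.data_reg          # multiply-accumulate
        self.psum_reg = self.acc if select_own else psum_below
        return weight_out, data_out
```

For example, feeding the pairs (1, 4), (2, 5), (3, 6) with `select_own=True` leaves 32 in `psum_reg`; a further step with `select_own=False` replaces it with whatever value arrives from below while the accumulator keeps its own result.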
As shown in FIG. 2, a systolic matrix calculation device provided by the present invention is used to implement the above systolic matrix unit and comprises: an array controller, a weight storage unit, an output data storage unit, an input data storage unit, a systolic array and a plurality of delay elements; the systolic array comprises a plurality of systolic matrix units;
the weight storage unit is connected to the array controller and to the systolic array;
the array controller is connected to the output data storage unit, the input data storage unit and the systolic array;
the systolic array is connected to the input data storage unit and to the output data storage unit;
a delay element is arranged between every two adjacent systolic matrix units; each step to the right or downward delays a unit's calculation result by one cycle, so the 2-to-1 selector of that unit must likewise wait one more cycle before passing the accumulator value upward.
The input data storage unit generates a sending signal according to whether input of the input data to each systolic matrix unit is complete;
the array controller is used for generating the first control signal and the second control signal according to the sending signal; the array controller is also used for controlling the output data storage unit to receive the output value of the accumulator of the corresponding systolic matrix unit once all the input data of that systolic matrix unit have been transmitted.
As a specific example, the systolic array is a 3 x 3 matrix.
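The role of the delay elements can be sketched as a chain of one-cycle delays on the selector control line, so that each row of a column sees the toggle one cycle after the row above it. The function name `select_switch_cycles` and its parameters are hypothetical, and the model assumes the control signal toggles once and then stays asserted.

```python
def select_switch_cycles(n_rows, toggle_at, n_cycles):
    """Return the first cycle at which each row's selector sees the toggle.

    The controller asserts the select line for the top row at cycle
    `toggle_at`; one delay element between adjacent rows postpones it by
    one cycle per row (an illustrative model of the delay elements).
    """
    select = [False] * n_rows            # select line as seen by each row
    first_switch = [None] * n_rows
    for t in range(1, n_cycles + 1):
        # Every delay element forwards the value the row above saw last cycle.
        select = [t >= toggle_at] + select[:-1]
        for row, asserted in enumerate(select):
            if asserted and first_switch[row] is None:
                first_switch[row] = t
    return first_switch


print(select_switch_cycles(n_rows=3, toggle_at=3, n_cycles=6))  # [3, 4, 5]
```

A single control line plus per-row delays is enough to stagger the selectors in step with the staggered completion of the units, so no per-unit control wiring is needed in this model.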
As shown in FIG. 2, the partial sum registers of the systolic array units in the first row are connected to the output data storage unit, but the output data storage unit does not receive the values of the first-row partial sum registers until all the input data of the first row have been input. The partial sum register of each systolic array unit in the other rows feeds the 2-to-1 selector of the unit in the row above. When all the input data of the first row have been input, the array controller receives the signal from the input data storage unit and changes the control signal of the 2-to-1 selector so that it selects the partial sum of the next systolic array unit. At the same time, the control signal directs the output data storage unit to begin receiving the partial sum register values of the first column, and so on.
As shown in FIG. 3, take three units in the same column as an example. A, B and C denote the final calculation results of the first, second and third systolic array units respectively. B-1 denotes the value of the second unit when it is still one calculation cycle short of its final result; C-1 and C-2 denote the value of the third unit when it is still one and two calculation cycles short, respectively. Time t1 is defined as the time at which the final calculation result A of the first unit enters the partial sum register of the first row.
As shown in FIG. 3, at time t1, A, B-1 and C-2 enter the partial sum registers of the first, second and third rows respectively. At time t2, the array controller directs the output data storage unit to receive the value A from the partial sum register of the first row, while the values in the partial sum registers of the second and third rows change to B and C-1. At time t3, the array controller toggles the select signal of the 2-to-1 selector so that B enters the partial sum register of the first row, and the value in the partial sum register of the third row changes to C. At time t4, B enters the output register; the select signal, having passed through a delay element, switches the 2-to-1 selector of the second row, and C enters the partial sum register of the second row. At time t5, C enters the partial sum register of the first row. At time t6, C enters the output register. Note that, because A, B and C enter the output register at times t2, t4 and t6 respectively, the clock period of the output register is chosen to be twice the clock period of the systolic array.
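The t1 to t6 walk-through can be reproduced with a short simulation of one column. This is a sketch of one reading of the timing, not the patent's circuit: the function name `drain_column`, the switch times (two cycles after the top unit finishes, plus one cycle per row), and the output register sampling the first-row partial sum register on every second cycle are assumptions chosen to match the sequence in the text.

```python
def drain_column(finals, first_done=1):
    """Simulate reading one column's results out through the first row.

    finals[r] is the final result of row r. Row r finishes r cycles after the
    top row; before that its value is labelled "X-k" (k cycles short of X).
    """
    n = len(finals)

    def accumulator(row, t):
        short = first_done + row - t
        return finals[row] if short <= 0 else f"{finals[row]}-{short}"

    # Assumed cycle at which each row's 2-to-1 selector switches to the row below.
    switch_at = [first_done + 2 + row for row in range(n)]
    psum = [None] * n                    # partial sum registers, top row first
    captured, trace = [], []
    for t in range(first_done, first_done + 2 * n):
        prev = psum[:]
        # Output register runs at half the array clock rate (t2, t4, t6, ...).
        if (t - first_done) % 2 == 1:
            captured.append((t, prev[0]))
        for row in range(n):
            if t >= switch_at[row]:      # selector passes the value from below
                psum[row] = prev[row + 1] if row + 1 < n else None
            else:                        # selector passes this unit's accumulator
                psum[row] = accumulator(row, t)
        trace.append((t, psum[:]))
    return captured, trace


captured, trace = drain_column(["A", "B", "C"])
print(captured)  # [(2, 'A'), (4, 'B'), (6, 'C')]
```

The trace returned alongside the captured values lists the partial sum registers per cycle, starting with `['A', 'B-1', 'C-2']` at t1 and reaching `['B', 'C', 'C']` at t4, which matches the sequence described for FIG. 3.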
FIGS. 4-14 are schematic diagrams of the operation flow for a given input data matrix and weight matrix (both matrices are shown only as images in the original). Taking the multiplication of two 3 × 3 matrices as an example, the eleven diagrams in FIGS. 4 to 14 show the operation flow of the method: after 11 calculation cycles, all results have been calculated and stored in the output register for subsequent calculation, which greatly reduces the calculation time.
The embodiments in this description are described progressively: each embodiment focuses on its differences from the others, and for the parts that are the same or similar the embodiments may be referred to one another.
The principles and embodiments of the present invention have been explained herein using specific examples, which are provided only to help in understanding the method and core concept of the invention. A person skilled in the art may, following the idea of the invention, vary the specific embodiments and the scope of application. Accordingly, the content of this description should not be construed as limiting the invention.

Claims (5)

1. A systolic matrix unit, comprising: a weight register, a data register, a multiplier, an accumulator, a 2-to-1 selector and a partial sum register;
the weight register is used for storing weights;
the data register is used for storing input data;
the multiplier is connected to the weight register and to the data register; the multiplier is used for multiplying the weight by the input data;
the accumulator is connected to the multiplier and to the 2-to-1 selector; the accumulator is used for adding the product to the accumulated result of the previous clock cycle and sending the output value of the accumulator to the 2-to-1 selector;
the 2-to-1 selector is connected to the partial sum register; the 2-to-1 selector is used for outputting the output value of the accumulator in response to a first control signal before input of the input data is complete; the 2-to-1 selector is also used for stopping the output of the output value of the accumulator in response to a second control signal after input of the input data is complete;
the partial sum register is used for storing the output value of the accumulator before input of the input data is complete and for outputting the output value of the accumulator after input of the input data is complete;
a delay element is arranged between two adjacent systolic matrix units; the calculation result is delayed by one cycle for each systolic matrix unit moved to the right or downward, and the 2-to-1 selector is correspondingly delayed by one cycle before passing the value of the accumulator upward.
2. The systolic matrix unit of claim 1, where the weights are a 3 x 3 matrix.
3. The systolic matrix unit of claim 1, where the input data is a 3 x 3 matrix.
4. A systolic matrix calculation device for implementing a systolic matrix unit as claimed in any one of claims 1-3, characterized in that it includes: an array controller, a weight storage unit, an output data storage unit, an input data storage unit, a systolic array and a plurality of delay elements; the systolic array comprises a plurality of systolic matrix units;
the weight storage unit is connected to the array controller and to the systolic array;
the array controller is connected to the output data storage unit, the input data storage unit and the systolic array;
the systolic array is connected to the input data storage unit and to the output data storage unit;
a delay element is arranged between every two adjacent systolic matrix units;
the input data storage unit generates a sending signal according to whether input of the input data to each systolic matrix unit is complete;
the array controller is used for generating the first control signal and the second control signal according to the sending signal; the array controller is also used for controlling the output data storage unit to receive the output value of the accumulator of the corresponding systolic matrix unit once all the input data of that systolic matrix unit have been transmitted.
5. The systolic matrix calculation device of claim 4, where the systolic array is a 3 x 3 matrix.
CN202210595479.2A 2022-05-30 2022-05-30 Pulsation matrix unit and pulsation matrix calculation device Active CN114675806B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210595479.2A CN114675806B (en) 2022-05-30 2022-05-30 Pulsation matrix unit and pulsation matrix calculation device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210595479.2A CN114675806B (en) 2022-05-30 2022-05-30 Pulsation matrix unit and pulsation matrix calculation device

Publications (2)

Publication Number Publication Date
CN114675806A 2022-06-28
CN114675806B 2022-09-23

Family

ID=82080063

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210595479.2A Active CN114675806B (en) 2022-05-30 2022-05-30 Pulsation matrix unit and pulsation matrix calculation device

Country Status (1)

Country Link
CN (1) CN114675806B (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107578098B (en) * 2017-09-01 2020-10-30 中国科学院计算技术研究所 Neural network processor based on systolic array
US11188814B2 (en) * 2018-04-05 2021-11-30 Arm Limited Systolic convolutional neural network
CN111291323B (en) * 2020-02-17 2023-12-12 南京大学 Matrix multiplication processor based on systolic array and data processing method thereof
KR20220015813A (en) * 2020-07-31 2022-02-08 삼성전자주식회사 Method and apparatus for performing deep learning operations.

Also Published As

Publication number Publication date
CN114675806A (en) 2022-06-28

Similar Documents

Publication Publication Date Title
CN112711394B (en) Circuit based on digital domain memory computing
CN111291323B (en) Matrix multiplication processor based on systolic array and data processing method thereof
US5333119A (en) Digital signal processor with delayed-evaluation array multipliers and low-power memory addressing
CN113419705A (en) Memory multiply-add calculation circuit, chip and calculation device
US20210256360A1 (en) Calculation circuit and deep learning system including the same
CN113807509B (en) Neural network acceleration device, method and communication equipment
CN113869498A (en) Convolution operation circuit and operation method thereof
US5297069A (en) Finite impulse response filter
US11556614B2 (en) Apparatus and method for convolution operation
CN114675806B (en) Pulsation matrix unit and pulsation matrix calculation device
US5422836A (en) Circuit arrangement for calculating matrix operations in signal processing
CN111581595A (en) Matrix multiplication calculation method and calculation circuit
CN113885831A (en) Storage and calculation integrated circuit based on mixed data input, chip and calculation device
CN212112470U (en) Matrix multiplication circuit
US20230253032A1 (en) In-memory computation device and in-memory computation method to perform multiplication operation in memory cell array according to bit orders
CN107368459B (en) Scheduling method of reconfigurable computing structure based on arbitrary dimension matrix multiplication
CN116702851A (en) Pulsation array unit and pulsation array structure suitable for weight multiplexing neural network
CN110673824B (en) Matrix vector multiplication circuit and circular neural network hardware accelerator
CN115495152A (en) Memory computing circuit with variable length input
US5163018A (en) Digital signal processing circuit for carrying out a convolution computation using circulating coefficients
CN113346895B (en) Simulation and storage integrated structure based on pulse cut-off circuit
CN101840322B (en) The arithmetic system of the method that filter arithmetic element is multiplexing and wave filter
CN110751263B (en) High-parallelism convolution operation access method and circuit
CN103293373A (en) Electric energy metering device and electric energy metering chip thereof
CN114911453B (en) Multi-bit multiply-accumulate full-digital memory computing device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant