CN114675806B - Systolic matrix unit and systolic matrix calculation device

Info

Publication number: CN114675806B
Application number: CN202210595479.2A
Authority: CN
Inventors: 乔树山; 张默寒; 尚德龙; 周玉梅
Original assignee: Zhongke Nanjing Intelligent Technology Research Institute
Current assignee: Zhongke Nanjing Intelligent Technology Research Institute
Priority date: 2022-05-30
Filing date: 2022-05-30
Publication date: 2022-09-23
Anticipated expiration: 2042-05-30
Also published as: CN114675806A

Abstract

The invention relates to a systolic matrix unit and a systolic matrix calculation device, and belongs to the field of artificial intelligence. In the unit, the multiplier is connected to the weight register and the data register and multiplies the weight by the input data. The accumulator is connected to the multiplier and the two-to-one selector; it adds the multiplication result to the accumulated result of the previous clock cycle and sends its output value to the two-to-one selector. The two-to-one selector is connected to the partial sum register; before input of the input data is finished, it outputs the output value of the accumulator according to the first control signal, and after input of the input data is finished, it stops outputting the output value of the accumulator according to the second control signal. The partial sum register stores the output value of the accumulator before input of the input data is finished, and outputs the output value of the accumulator after input of the input data is finished. The invention saves both time and hardware cost.

Description

Systolic matrix unit and systolic matrix calculation device

Technical Field

The invention relates to the field of artificial intelligence, in particular to a systolic matrix unit and a systolic matrix calculation device.

Background

With the rise of artificial intelligence, deep learning is increasingly applied in various fields. Operations related to matrix multiplication account for a very large share of deep learning computation. At present, a relatively convenient and fast way of performing matrix operations is the systolic array. The basic idea of the systolic array is as follows: in a matrix multiplication A × B = Y, either matrix B is held fixed, A flows through the systolic matrix units, and Y is output continuously; or both A and B flow through the systolic matrix units and the result Y is stored in the systolic matrix units.

One calculation method of the systolic array is as follows: during matrix multiplication, input data are passed from left to right through the systolic array units, weight data are passed from top to bottom through the systolic array units, and the final calculation result is stored in each systolic array unit. When the matrix is large, extracting the final calculation results over a bus takes little time but incurs a large hardware cost for the bus; shifting the calculation results out of the array one by one after the calculation has finished, in the same way the weight data and input data are moved, consumes much time.
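The data flow described above can be sketched in a few lines of Python. This is a minimal behavioural model under stated assumptions, not the circuit of the invention: the function name `systolic_matmul`, the skewed feeding of rows and columns (row i and column j are each delayed by i and j cycles), and the zero padding outside the valid range are choices of this sketch. Extraction of the results is not modelled here.

```python
def systolic_matmul(A, B):
    """Multiply two n x n matrices on a simulated result-stationary array.

    Input data (A) move left to right, weights (B) move top to bottom,
    and each unit accumulates its own element of the result.
    """
    n = len(A)
    a_reg = [[0] * n for _ in range(n)]  # data registers
    b_reg = [[0] * n for _ in range(n)]  # weight registers
    acc = [[0] * n for _ in range(n)]    # accumulators (results stay here)
    cycles = 3 * n - 2                   # cycles until the last unit is done
    for t in range(cycles):
        new_a = [[0] * n for _ in range(n)]
        new_b = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if j == 0:               # left edge: feed row i, skewed by i
                    k = t - i
                    new_a[i][j] = A[i][k] if 0 <= k < n else 0
                else:                    # take data from the left neighbour
                    new_a[i][j] = a_reg[i][j - 1]
                if i == 0:               # top edge: feed column j, skewed by j
                    k = t - j
                    new_b[i][j] = B[k][j] if 0 <= k < n else 0
                else:                    # take weight from the unit above
                    new_b[i][j] = b_reg[i - 1][j]
        a_reg, b_reg = new_a, new_b
        for i in range(n):
            for j in range(n):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, cycles


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
Y, cycles = systolic_matmul(A, B)
```

In this model the unit in row i and column j finishes later the further it is from the top-left corner, which is why the results in the array do not all become available at the same time.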

Disclosure of Invention

The invention aims to provide a systolic matrix unit and a systolic matrix calculation device that save both time and hardware cost.

In order to achieve the purpose, the invention provides the following scheme:

a systolic matrix unit, comprising: a weight register, a data register, a multiplier, an accumulator, a two-to-one selector and a partial sum register;

the weight register is used for storing weights;

the data register is used for storing input data;

the multiplier is connected to the weight register and the data register, respectively; the multiplier is used for multiplying the weight and the input data;

the accumulator is connected to the multiplier and the two-to-one selector, respectively; the accumulator is used for adding the multiplication result to the accumulated result of the previous clock cycle and sending the output value of the accumulator to the two-to-one selector;

the two-to-one selector is connected to the partial sum register; the two-to-one selector is used for outputting the output value of the accumulator according to a first control signal before input of the input data is finished; the two-to-one selector is also used for stopping the output of the output value of the accumulator according to a second control signal after input of the input data is finished;

the partial sum register is used for storing the output value of the accumulator before input of the input data is finished and outputting the output value of the accumulator after input of the input data is finished.

Optionally, the weight is a 3 x 3 matrix.

Optionally, the input data is a 3 x 3 matrix.

A systolic matrix calculation device for implementing said systolic matrix unit, comprising: an array controller, a weight storage unit, an output data storage unit, an input data storage unit, a systolic array and a plurality of delayers; the systolic array comprises a plurality of systolic matrix units;

the weight storage unit is connected to the array controller and the systolic array, respectively;

the array controller is connected to the output data storage unit, the input data storage unit and the systolic array, respectively;

the systolic array is connected to the input data storage unit and the output data storage unit, respectively;

the delayer is arranged between two adjacent systolic matrix units;

the input data storage unit generates a sending signal according to whether input of the input data of each systolic matrix unit is finished;

the array controller is used for generating the first control signal and the second control signal according to the sending signal; the array controller is also used for controlling the output data storage unit to receive the output value of the accumulator of the corresponding systolic matrix unit when the input data of that systolic matrix unit have been completely transmitted.

Optionally, the systolic array is a 3 x 3 matrix.

According to the specific embodiment provided by the invention, the invention discloses the following technical effects:

with the systolic matrix unit and the systolic matrix calculation device provided by the invention, the two-to-one selector and the partial sum register allow the data in the systolic array units that finish calculating first to be taken out before the overall calculation has finished. This saves time compared with taking the data out of the systolic array units one by one after the calculation has completely finished, and saves hardware cost compared with taking the data out of the systolic array units over a bus.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a schematic diagram of a systolic matrix unit structure provided in the present invention;

FIG. 2 is a schematic structural diagram of a systolic matrix computing device according to the present invention;

FIG. 3 is a diagram illustrating the transmission of output results between systolic array elements in the same column;

FIGS. 4-14 are schematic operation flow diagrams of a systolic matrix calculation device according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

FIG. 1 is a schematic structural diagram of the systolic matrix unit provided by the present invention. As shown in FIG. 1, the systolic matrix unit provided by the present invention includes: a weight register, a data register, a multiplier, an accumulator, a two-to-one selector and a partial sum register;

the weight register is used for storing weights;

the data register is used for storing input data;

The first control signal and the second control signal are high-level signals or low-level signals;

as shown in FIG. 1, the weight data are passed from top to bottom and the input data are passed from left to right; the weight and the input data are multiplied, and the accumulator adds the product to the sum of the previous clock cycle. These features are the same as in a conventional systolic array unit; the difference lies in the structure on the right side. Before all the input data of the first row have been input, the array control unit uses the first control signal to make the two-to-one selector select the output value of the accumulator in the same unit, and the result is stored in the partial sum register.
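The behaviour of a single unit can be illustrated with a small Python model. It is only a sketch under stated assumptions: the class name `SystolicUnit`, the method `step`, the boolean `select_accumulator` (standing in for the first and second control signals) and the argument `below` (the value arriving from the unit underneath) are hypothetical names introduced here, and all registers are assumed to update on the same clock edge.

```python
class SystolicUnit:
    """Behavioural sketch of one unit; all names are illustrative."""

    def __init__(self):
        self.weight = 0       # weight register
        self.data = 0         # data register
        self.acc = 0          # accumulator
        self.partial_sum = 0  # partial sum register

    def step(self, weight_in, data_in, select_accumulator, below=0):
        """Advance one clock cycle.

        Returns the weight handed on to the unit below and the data
        handed on to the unit on the right.
        """
        weight_out, data_out = self.weight, self.data
        self.weight, self.data = weight_in, data_in
        self.acc += self.weight * self.data  # multiplier and accumulator
        # Selector: take the unit's own accumulator while inputs are
        # still arriving, otherwise the value coming from the unit below.
        self.partial_sum = self.acc if select_accumulator else below
        return weight_out, data_out


unit = SystolicUnit()
for w, x in [(1, 4), (2, 5), (3, 6)]:
    unit.step(w, x, select_accumulator=True)
own_result = unit.partial_sum  # 1*4 + 2*5 + 3*6
unit.step(0, 0, select_accumulator=False, below=99)
forwarded = unit.partial_sum   # value passed up from the unit below
```

Once the selector is switched, the partial sum register no longer follows the unit's own accumulator and instead acts as one stage of a vertical shift path.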

As a specific example, the weight is a 3 x 3 matrix.

As a specific example, the input data is a 3 x 3 matrix.

As shown in FIG. 2, the systolic matrix calculation device provided by the present invention is used for implementing the above systolic matrix unit and includes: an array controller, a weight storage unit, an output data storage unit, an input data storage unit, a systolic array and a plurality of delayers; the systolic array comprises a plurality of systolic matrix units;

the delayer is arranged between two adjacent systolic matrix units; the calculation result becomes available one cycle later for each unit further to the right or further down, so the two-to-one selector must likewise be switched one cycle later before passing the accumulator value upward.

As a specific example, the systolic array is a 3 x 3 matrix.

As shown in FIG. 2, the partial sum registers of the systolic array units in the first row are connected to the output storage unit, but the output storage unit does not receive the values of the first-row partial sum registers until all the input data of the first row have been input. The partial sum registers of the systolic array units in the other rows are connected to the two-to-one selectors of the row above. When all the input data of the first row have been input, the control unit receives the signal from the input storage unit and then changes the control signal of the two-to-one selector so that it selects the partial sum of the next systolic array unit. At the same time, the control signal makes the output storage unit start receiving the partial sum register value of the first column, and so on.

As shown in FIG. 3, take three units in the same column as an example. A, B and C denote the final calculation results of the first, second and third systolic array units, respectively. B-1 denotes that the second systolic array unit is still one calculation cycle short of its final result, C-1 denotes that the third systolic array unit is still one calculation cycle short, and C-2 denotes that the third systolic array unit is still two calculation cycles short. Time t1 is taken as the time at which the final calculation result A of the first systolic array unit enters the partial sum register of the first row.

As shown in FIG. 3, at time t1, A, B-1 and C-2 enter the partial sum registers of the first, second and third rows, respectively. At time t2, the control unit makes the output storage unit receive the value A held in the partial sum register of the first row, while the values in the partial sum registers of the second and third rows change to B and C-1. At time t3, the control unit toggles the select signal of the two-to-one selector so that B enters the partial sum register of the first row, and the value in the partial sum register of the third row changes to C. At time t4, B enters the output register; the select signal, having passed through a delayer, changes the select signal of the two-to-one selector of the second row, and C enters the partial sum register of the second row. At time t5, C enters the partial sum register of the first row. At time t6, C enters the output register. Note that because A, B and C enter the output register at times t2, t4 and t6, respectively, the clock period of the output register is chosen to be twice the clock period of the systolic array.
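The timeline for one column can be replayed with a short symbolic simulation in Python. This is a sketch under assumptions that go beyond the text: the function name `drain_column` is hypothetical, the accumulator of a finished unit is assumed to hold its final value, the selector of row 1 is assumed to switch at t3 with each lower row switching one cycle later, the last row is assumed never to switch, and the output register is assumed to latch the first-row partial sum register on every even time step.

```python
def drain_column(labels):
    """Replay the column read-out; returns {time: value in output register}."""
    n = len(labels)

    def accumulator(i, t):
        # Row i (0-based) reaches its final value at time t = i + 1.
        lag = (i + 1) - t
        return labels[i] if lag <= 0 else f"{labels[i]}-{lag}"

    partial_sum = [None] * n
    received = {}
    for t in range(1, 2 * n + 1):
        # Output register runs at half the array clock rate and latches
        # the first-row partial sum register from the previous cycle.
        if t % 2 == 0:
            received[t] = partial_sum[0]
        nxt = []
        for i in range(n):
            has_below = i + 1 < n
            # Select signal reaches row i one cycle later per row.
            select_below = has_below and t >= i + 3
            if select_below:
                nxt.append(partial_sum[i + 1])
            else:
                nxt.append(accumulator(i, t))
        partial_sum = nxt
    return received


timeline = drain_column(["A", "B", "C"])
```

Under these assumptions the three results reach the output register at t2, t4 and t6, matching the sequence described for FIG. 3.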

FIGS. 4-14 are schematic diagrams of the operation flow for the input data [matrix not reproduced] and the weight [matrix not reproduced]. The eleven diagrams in FIGS. 4 to 14 show the operation flow of the method, taking the multiplication of two 3 × 3 matrices as an example. As shown in FIGS. 4 to 14, when two 3 × 3 matrices are multiplied, all results are calculated and stored in the output register for subsequent calculation after 11 calculation cycles, which greatly reduces the calculation time.

The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims

1. A systolic matrix unit, comprising: a weight register, a data register, a multiplier, an accumulator, a two-to-one selector and a partial sum register;

the weight register is used for storing weights;

the data register is used for storing input data;

the two-to-one selector is connected to the partial sum register; the two-to-one selector is used for outputting the output value of the accumulator according to a first control signal before input of the input data is finished; the two-to-one selector is also used for stopping the output of the output value of the accumulator according to a second control signal after input of the input data is finished;

the partial sum register is used for storing the output value of the accumulator before input of the input data is finished and outputting the output value of the accumulator after input of the input data is finished;

a delayer is arranged between two adjacent systolic matrix units; the calculation result is delayed by one cycle for each systolic matrix unit further to the right or further down, and the two-to-one selector is correspondingly delayed by one cycle before passing the value of the accumulator upward.

2. The systolic matrix unit of claim 1, wherein the weight is a 3 x 3 matrix.

3. The systolic matrix unit of claim 1, wherein the input data is a 3 x 3 matrix.

4. A systolic matrix calculation device for implementing the systolic matrix unit of any one of claims 1-3, comprising: an array controller, a weight storage unit, an output data storage unit, an input data storage unit, a systolic array and a plurality of delayers; the systolic array comprises a plurality of systolic matrix units;

the delayer is arranged between two adjacent systolic matrix units;

5. The systolic matrix calculation device of claim 4, wherein the systolic array is a 3 x 3 matrix.