RU1837321C

RU1837321C - Device for multiplying matrices

Info

Publication number: RU1837321C
Application number: SU904807587A
Authority: RU
Inventors: Лариса Дмитриевна Елфимова; Владимир Викторович Коломейко; Игорь Григорьевич Мороз-Подворчан; Валерий Дисанович Петущак
Original assignee: Институт кибернетики им.В.М.Глушкова
Priority date: 1990-02-19
Filing date: 1990-02-19
Publication date: 1993-08-30

Abstract

The invention relates to computer engineering and can be used in real-time data processing systems for multiplying matrices in pipeline mode. The aim of the invention is to simplify the implementation of the device by reducing the number of external pins and reducing hardware costs. This aim is achieved in that, in a device containing a linear array of processor units and a control unit, in which each processor unit contains a multiplication unit and an output register, a multiplexer is introduced into each processor unit. 1 dependent claim, 4 figures.

Description

The invention relates to computer engineering and can be used in computers and real-time data processing systems for multiplying matrices in pipeline mode.

FIG. 1 shows a functional diagram of the prototype device; FIG. 2 shows a functional diagram of the multiplication unit.

The device of FIG. 1 contains n processor units 1.1-1.n connected in series and a control unit 2, which is a microprogram control unit. Each processor unit contains: a multiplication unit 3, which performs multiply-accumulate operations on the m-bit elements of the source matrices A and B; a delay register 4 for the column elements of matrix B, which holds in each clock cycle the operands of matrix B arriving from the neighboring processor unit; an output register 5, which receives the result produced in multiplication unit 3; a memory unit 6, which is the local memory of the i-th processor unit and stores the elements of the i-th row of the source matrix A, where i = 1, ..., n and n is the order of the matrices; and a delay register 7 for the stream of addresses of the row elements of matrix A, which passes through all processor units and determines for each of them, in each clock cycle, the address of one of the operands selected from memory units 6.1-6.n. In addition, the device includes an address unit 8, which generates the addresses of the cells of memory units 6 from which the row elements of matrix A are read, and a memory unit 9, which stores the elements of all columns of matrix B.

The prototype device has information inputs 10.1-10.n for writing the elements of the n rows of matrix A; an information input 11 for writing the elements of all columns of matrix B; inputs 12, 13, and 14 of control unit 2, connected respectively to the control bus for the start-of-input-matrices flag, the count bus, and the clock bus of the device; an output 15 of the control unit, connected to the control bus for the start-of-result-matrix flag; and information outputs 16.1-16.n for the resulting matrix C.

Multiplication unit 3 of the i-th processor unit, shown in FIG. 2, contains a pipelined multiplier 18 and an accumulating adder 19. Multiplier 18 contains a group of registers 20, which is a rectangular array of single-bit registers (the pipeline stages); a group of registers 21, which is a triangular array of single-bit registers; and a group of adders 22.

The prototype device operates as follows.

The device multiplies the source matrices A = (a_ij) and B = (b_ij) of order n, obtaining the resulting matrix C = (c_ij) according to relation (1):

$$c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj} \qquad (1)$$

The elements of the n rows of matrix A arrive at information inputs 10.1-10.n of the device, respectively. The elements of matrix B arrive column by column at information input 11 of the device. The elements of the resulting matrix C are taken row by row from outputs 16.1-16.n of the device.

When the start-of-input-matrices flag and the clock pulses are applied to inputs 12 and 14 of control unit 2, respectively, control unit 2 uses its first output to write the elements of the n rows of matrix A, arriving simultaneously from inputs 10.1-10.n of the device, into memory units 6 of the corresponding processor units 1.1-1.n. It also forms the addresses of these row elements in address unit 8 and writes the elements of all columns of matrix B from input 11 of the device into memory unit 9. When the count signal is applied to input 13 of control unit 2, the stream of addresses read from address unit 8 passes through registers 7 of the processor units under the clock signal from the second output of control unit 2 and determines, for each processor unit in each clock cycle, the address of the corresponding row element of matrix A. There is a unit delay between the addressing of the rows of matrix A and the arrival of the column elements of matrix B. The outputs of delay registers 7 are clocked by a single clock signal from the second output of control unit 2, so that in each clock cycle the row elements of matrix A are passed to the processor units with a unit delay; FIG. 1 shows this schematically as a skewed arrangement of the elements of matrix A. The outputs of delay registers 4 are clocked by a single clock pulse from the third output of control unit 2, so that in each clock cycle the column elements of matrix B are passed from one processor unit to the next with a unit delay. Thus the elements of the first row of matrix A and the elements of the first column of matrix B, read from memory units 6 and 9, arrive without delay at the first and second inputs, respectively, of multiplication unit 3 of the first processor unit 1.1, under the clock signal supplied from the third output of the control unit to the first control input of multiplication unit 3. The second row of matrix A and the first column of matrix B entering processor unit 1.2 are delayed by one time unit, the third row of matrix A and the first column of matrix B entering processor unit 1.3 are delayed by two time units, and so on.

The elements c_i1-c_in of the resulting matrix C, as they are formed in multiplication unit 3, where the pairwise products a_ik b_kj are accumulated, pass to output registers 5, which are controlled by the clock signals from the fourth, fifth, ..., (n+3)-th outputs of control unit 2. Accumulating adder 19 is then cleared. To obtain all row elements of the resulting matrix C, each row of matrix A is read from memory unit 6 n times. When the first elements of the n rows of the resulting matrix C arrive at outputs 16.1-16.n of the device from the outputs of registers 5, control unit 2 generates the start-of-result-matrix flag, which is delivered to output 15 of the device. The row elements of the resulting matrix C are also delivered from outputs 16.1-16.n with unit delays; FIG. 1 shows this schematically as a skewed arrangement of the elements.
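The skewed timing of the prototype can be illustrated with a small scheduling sketch. It assumes a simplified model that the text does not spell out in this form: indices are 0-based, matrix B is streamed column by column at one element per clock tick, and unit i sees that stream i ticks after the first unit. The function name `prototype_schedule` is hypothetical.

```python
def prototype_schedule(n):
    """Map (i, j, k) to the clock tick at which unit i multiplies a[i][k] by b[k][j].

    Assumed model (0-based indices): B is streamed column by column, one element
    per tick, and unit i sees the stream i ticks after unit 0.
    """
    return {
        (i, j, k): j * n + k + i
        for i in range(n)
        for j in range(n)
        for k in range(n)
    }


schedule = prototype_schedule(3)
print(schedule[(0, 0, 0)])  # 0: unit 1.1 starts without delay
print(schedule[(1, 0, 0)])  # 1: unit 1.2 is one tick behind
print(schedule[(2, 0, 0)])  # 2: unit 1.3 is two ticks behind
```

Under this model the last product is issued at tick n*n - 1 + (n - 1), which is the skew visible in FIG. 1.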

Known analogues implemented as linear arrays of processor units, and the prototype, have the following disadvantages:

Claims

1. Low reliability, because implementing the analogues and the prototype on serial microcircuits produced by domestic industry requires a large number of chips, owing to the large number of external pins for data input and output. The existing analogues and the prototype use n m-bit channels for input of the rows of matrix A and n m-bit channels for output of the results, where n is the order of matrix A. A large number of external pins increases the probability of failures, because the external pins of microcircuits are known to be the least reliable elements. In addition, distributing the processor units over a large number of microcircuits requires a correspondingly large number of signal lines, which in turn reduces the reliability of the equipment.

2. Low manufacturability and fault tolerance, owing to the need to use a large number of microcircuits in the implementation.

3. High cost.

4. High hardware costs, because each processor unit must have a RAM unit. In addition, the device requires an address unit and a memory unit to store all columns of matrix B.

The purpose of the invention is to simplify the device by reducing the number of output contacts and reducing hardware costs.

This goal is achieved in that, in a device for multiplying matrices containing a linear array of n processor units (n is the order of the matrix) and a control unit, in which each processor unit contains a multiplication unit and an output register; in each processor unit the first control input of the multiplication unit is connected to the (n+1)-th output of the control unit and the second output of the multiplication unit is connected to the first information input of the output register; the second-factor input of the multiplication unit of the first processor unit is connected to the second information input of the device; the second control input of the multiplication unit of the first processor unit and the first control input of the output register of the first processor unit are connected to the (n+2)-th output of the control unit; the output of the output register of the first processor unit is connected to the information output of the device; the (n+4)-th output of the control unit is connected to the output for the start-of-result-matrix flag; and the first and second inputs of the control unit are connected respectively to the input for the start-of-input-matrices flag and to the clock input of the device, a multiplexer is introduced into each processor unit. The first and second information inputs of the multiplexer are connected respectively to the first output of the multiplication unit and to the first information input of the device, and the output of the multiplexer is connected to the first-factor input of the multiplication unit. The control inputs of the multiplexers of the first to n-th processor units are connected respectively to the first to n-th outputs of the control unit. The third output of the multiplication unit of each preceding processor unit is connected to the second information input of the multiplication unit of the subsequent processor unit, and the output of the output register of each subsequent processor unit is connected to the second information input of the output register of the preceding processor unit. In each processor unit the third control input of the multiplication unit is connected to the (n+3)-th output of the control unit and the second control input of the output register is connected to the (n+1)-th output of the control unit. The second control inputs of the multiplication units of the second to n-th processor units and the first control inputs of the output registers of the second to n-th processor units are connected to the (n+2)-th output of the control unit.

The multiplication unit of each processor unit contains a pipelined multiplier, an accumulating adder, and a delay element. The pipelined multiplier contains first and second groups of registers and a group of adders. The information inputs of the registers of the first and second groups are respectively the first-factor and second-factor inputs of the multiplication unit. The control inputs of the registers of the first and second groups and the first control input of the accumulating adder are connected to the first control input of the multiplication unit. The first outputs of the registers of the first and second groups are connected respectively to the first and second inputs of the corresponding adders of the group, and the second outputs of the registers of the first and second groups are connected respectively to the first and third outputs of the multiplication unit. The outputs of the adders of the group are connected to the corresponding inputs of the accumulating adder, whose output is connected to the information input of the delay element, whose output is connected to the second output of the multiplication unit. The second control input of the accumulating adder and the strobe input of the delay element are connected respectively to the second and third control inputs of the multiplication unit.

Comparative analysis with the prototype shows that the claimed matrix multiplication device differs in that the i-th processor unit contains a multiplexer for feeding the elements of the i-th row of the first matrix into the multiplication unit; the multiplexer switches the source of these elements between the device input and the first output of the multiplication unit. The multiplexer creates a feedback loop, owing to which the i-th row of the first matrix A circulates in the i-th processor unit, interacting with all columns of the second matrix B and forming the i-th row of the resulting matrix C. This makes it possible to use one input channel for the rows of matrix A instead of the n input channels of the prototype. In addition, each multiplication unit contains a rectangular array of single-bit registers, which provides sequential passage of the columns of the second matrix through all processor units, and a block of delay registers connected to a two-input output register, which makes it possible to use one output channel instead of the n channels of the prototype.

The combination of these features ensures that the technical solution meets the novelty criterion. A comparison of the claimed solution not only with the prototype but also with other known technical solutions revealed no solutions with similar features. These features make it possible to eliminate the memory units in each processor unit, the memory unit for the columns of matrix B, and the address unit, thereby reducing hardware costs; to use one input channel for the first matrix instead of the n input channels of existing devices; and to use one output channel for all rows of the resulting matrix instead of the n output channels of the prototype and analogues. This in turn makes it possible to implement the device on a single chip, to increase reliability, manufacturability, and fault tolerance, and to lower the cost of the equipment.

This allows us to conclude that the solution meets the criterion of significant differences.

FIG. 3 shows a functional diagram of the proposed device; FIG. 4 shows a functional diagram of the multiplication unit of the i-th processor unit, where i = 1, ..., n and n is the order of the source matrices A and B.

The device of FIG. 3 comprises n processor units 1.1-1.n connected in series and a control unit 2, which is a microprogram control unit.

The i-th processor unit contains a multiplication unit 3, which performs multiply-accumulate operations on the m-bit elements of the source matrices A and B; a multiplexer 4 for feeding the elements of the i-th row of matrix A into multiplication unit 3, which switches between the elements arriving from input 6 of the device and those arriving from the first output of multiplication unit 3; and an output register 5, which stores the result of the operation obtained in multiplication unit 3 and, in each clock cycle, passes the results from one processor unit to the next toward output 11 of the device. The device has information inputs 6 and 7 for the elements of the input matrices A and B, respectively; inputs 8 and 9 of control unit 2, connected respectively to the control bus for the start-of-input-matrices flag and to the clock bus of the device; an output 10 of the control unit, connected to the control bus for the start-of-result-matrix flag; and an information output 11 for the resulting matrix C.

Multiplication unit 3 of the i-th processor unit, shown in FIG. 4, contains a pipelined multiplier 12, an accumulating adder 13, and a delay element 14. The pipelined multiplier contains first and second groups of registers 15 and 16 and a group of adders 17. Delay element 14 contains (n - i) series-connected registers, which equalize the time delays of the results obtained in the processor units so that they can be loaded into output registers 5 simultaneously. All registers of element 14 are clocked by a single clock signal applied to the registers every n clock cycles; on each such pulse the data move from one register to the next.
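The role of the (n - i) delay registers can be checked with a short timing sketch. It rests on an assumption taken from the operating description rather than stated at this point in the text: time is counted in phases of n clock cycles starting at 0, and unit i (1-based) finishes element c_ij in phase (i - 1) + (j - 1), because column j of B enters the first unit in phase j - 1 and needs i - 1 further hops to reach unit i. The function name `ready_phase` is hypothetical.

```python
def ready_phase(i, j, n):
    """Phase in which c_ij leaves delay element 14 of unit i (i, j are 1-based)."""
    compute_phase = (i - 1) + (j - 1)  # assumed: column j reaches unit i after i-1 hops
    delay_registers = n - i            # length of delay element 14 in unit i
    return compute_phase + delay_registers


n = 4
aligned = []
for j in range(1, n + 1):
    phases = {ready_phase(i, j, n) for i in range(1, n + 1)}
    assert len(phases) == 1            # every unit delivers c_ij in the same phase
    aligned.append(phases.pop())
print(aligned)  # [3, 4, 5, 6]
```

In this model the sum simplifies to n + j - 2, which does not depend on i, so all n results for a given column of B become available to the output registers together.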

The units of the device are connected as follows. Information inputs 6 and 7 of the device are connected respectively to the second inputs of multiplexers 4 of each processor unit 1.1-1.n and to the second input of multiplication unit 3 of the first processor unit 1.1. Inputs 8 and 9 and output 10 of control unit 2 are connected respectively to the control bus for the start-of-input-matrices flag, the clock bus of the device, and the bus for the start-of-result-matrix flag. Information output 11 of the device is connected to the output of output register 5 of the first processor unit 1.1. The first, second, ..., n-th outputs of control unit 2 are connected respectively to the control inputs of input multiplexers 4 of the first, second, ..., n-th processor units. The (n+1)-th output of control unit 2 is connected to the first control input of multiplication unit 3 and to the second control input of register 5 of each processor unit. The (n+2)-th output of control unit 2 is connected to the second control input of multiplication unit 3 and to the first control input of output register 5 of each processor unit. The (n+3)-th output of control unit 2 is connected to the third control input of multiplication unit 3 of each processor unit. The third output of multiplication unit 3 of each preceding processor unit is connected to the second information input of multiplication unit 3 of each subsequent processor unit, and the output of output register 5 of each subsequent processor unit is connected to the second information input of output register 5 of each preceding processor unit. In each processor unit the first output of multiplication unit 3 is connected to the first input of the multiplexer, whose output is connected to the first information input of multiplication unit 3, whose second output is connected to the first information input of output register 5.

The elements of multiplication unit 3 are connected as follows.

The information inputs of the registers of the first and second groups 15 and 16 are connected respectively to the first-factor and second-factor inputs of multiplication unit 3. The control inputs of the registers of the first and second groups 15 and 16 and the first control input of accumulating adder 13 are connected to the first control input of the multiplication unit. The first outputs of the registers of the first and second groups 15 and 16 are connected respectively to the first and second inputs of the adders of group 17, and the second outputs of the registers of the first and second groups 15 and 16 are connected respectively to the first and third outputs of multiplication unit 3. The outputs of the adders of group 17 are connected to the corresponding inputs of accumulating adder 13, whose output is connected to the information input of delay element 14, whose output is connected to the second output of multiplication unit 3. The second control input of accumulating adder 13 and the strobe input of delay element 14 are connected respectively to the second and third control inputs of multiplication unit 3.

The proposed device operates as follows.

The device implements the pipeline principle of real-time information processing: it multiplies pairs of matrices A × B, F × K, and so on, which arrive sequentially one after another at inputs 6 and 7 of the device, and this keeps all processor units uniformly loaded.

Matrix pairs may arrive continuously or at time intervals that are multiples of n. The device computes the product of matrices A and B of dimension n × n with m-bit elements, obtaining the resulting matrix C = (c_ij) with elements c_ij according to the relation:

$$c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj} \qquad (1)$$

The elements of matrices A and B form streams that move through the line of processor units synchronously in the same direction, from left to right. The elements of matrix A are entered sequentially by rows and the elements of matrix B sequentially by columns; each element of matrices A and B enters the device once, through inputs 6 and 7 respectively, and advances in pipeline mode.

Each of the n processor units performs a multiply-accumulate operation; that is, the final result according to (1) is formed by computing intermediate terms, which accumulate as partial sums in each processor unit. The i-th processor unit operates on the i-th row of matrix A and forms the i-th row of matrix C.
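The work of a single processor unit can be sketched in a few lines of Python. This is a functional illustration only, not a model of the hardware: it assumes one multiply-accumulate per step and an accumulator that is cleared before each new element of C. The name `processor_unit_row` is hypothetical.

```python
def processor_unit_row(a_row, b_columns):
    """Form row i of C from row i of A and the columns of B, as one processor unit does."""
    c_row = []
    for b_col in b_columns:
        acc = 0                          # accumulator cleared before each element c_ij
        for a_ik, b_kj in zip(a_row, b_col):
            acc += a_ik * b_kj           # partial sum of relation (1)
        c_row.append(acc)
    return c_row


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
b_columns = list(zip(*B))                # B is consumed column by column
C = [processor_unit_row(row, b_columns) for row in A]
print(C)  # [[19, 22], [43, 50]]
```

Each call corresponds to one unit of the line; the n units differ only in which row of A they hold.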

The elements c_ij of the resulting matrix C leave multiplication units 3 of the corresponding processor units at the moments when they are fully formed and are passed through output registers 5 from one processor unit to the next, forming a stream of row elements of matrix C that moves from right to left toward output 11 of the device. When the start-of-input-matrices synchronization flag and the clock pulses are applied to inputs 8 and 9 of control unit 2, respectively, control unit 2 uses its first output to set the control signal of multiplexer 4 of the first processor unit 1.1 to state 1, opening the second input of multiplexer 4, and uses its (n+1)-th output to send a write signal to the control input of multiplication unit 3. The elements of the first row of matrix A and the elements of the first column of matrix B, arriving continuously from the sensors at one-clock intervals at inputs 6 and 7 of the device, respectively, then pass in the first processor unit 1.1 to the first and second inputs of multiplication unit 3, respectively. The first row of matrix A is blocked from passing from input 6 to the remaining units, because the control signals of multiplexers 4 of the remaining processor units 1.2-1.n are set to 0 at this time. After n clock cycles of multiplication unit 3 of the first processor unit, when the elements of the first row of matrix A and of the first column of matrix B have filled the first group of registers 15 and the second group of registers 16 of multiplication unit 3, respectively, the first output of control unit 2 sets the control signal of multiplexer 4 of the first processor unit 1.1 to 0, closing the second input of this multiplexer and opening its first input, and the second output of control unit 2 sets the control signal of multiplexer 4 of the second processor unit to 1. In the first processor unit 1.1 the first row of matrix A then passes again from the first output of multiplication unit 3 through the first input of multiplexer 4 to the first input of the same multiplication unit, whose second input receives the second column of matrix B from input 7 of the device. At the same time, in the second processor unit the first and second inputs of multiplication unit 3 receive, respectively, the second row of matrix A from input 6 of the device and the first column of matrix B from the third output of multiplication unit 3 of the first processor unit. Passage from input 6 to the remaining processor units 1.3-1.n is blocked by the zero control signals of their multiplexers 4.

Similarly, the third row of matrix A goes to the third processor unit 1.3, and so on, and the n-th row of matrix A goes to the n-th processor unit 1.n. Owing to the feedback loop in the multiplication unit of each processor unit, the i-th row of matrix A circulates in the i-th processor unit (in register group 15) n times, interacting with all columns of matrix B, which shift successively along the line of processor units 1.1, 1.2, ..., 1.n, passing in turn through register groups 16 of the multiplication units of all processor units. The i-th row of the resulting matrix C is thereby formed in the i-th processor unit.
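The interplay of the multiplexers, the circulating rows of A, and the shifting columns of B can be sketched as a coarse simulation. This is a simplified model under stated assumptions, not the circuit itself: time advances in phases of n clock cycles, each register group is represented by a whole row or column rather than by individual bits, and the multiply-accumulate of a phase is collapsed into one `sum`. The function name `simulate_array` and the variable names are hypothetical.

```python
def simulate_array(A, B):
    """Phase-level model of the line of n processor units (0-based unit index)."""
    n = len(A)
    columns = [list(col) for col in zip(*B)]   # B is streamed column by column
    row_regs = [None] * n    # register group 15 of each unit: circulating row of A
    col_regs = [None] * n    # register group 16 of each unit: (j, column j of B)
    C = [[None] * n for _ in range(n)]

    for phase in range(2 * n - 1):             # one phase = n clock cycles
        # Columns of B shift one unit to the right; the first unit reads input 7.
        col_regs = [None] + col_regs[:-1]
        if phase < n:
            col_regs[0] = (phase, columns[phase])

        # Multiplexer 4: only unit number `phase` selects input 6 (control = 1);
        # every other unit keeps its own row via the feedback loop (control = 0).
        if phase < n:
            row_regs[phase] = list(A[phase])

        # Multiply-accumulate in every unit that holds both operands.
        for i in range(n):
            if row_regs[i] is not None and col_regs[i] is not None:
                j, column = col_regs[i]
                C[i][j] = sum(a * b for a, b in zip(row_regs[i], column))
    return C


print(simulate_array([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

In this model each row of A is read from input 6 exactly once and each column of B enters the line exactly once, and a single matrix pair takes 2n - 1 phases to pass through.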

The elements c_ij of the resulting matrix C, as they are formed in multiplication units 3 of the processor units, where the pairwise products a_ik b_kj are accumulated, pass to output registers 5 through delay elements 14. All delay elements 14 are clocked by a single clock signal from the (n+3)-th output of control unit 2, so that data in element 14 move from one register to the next every n clock cycles. The i-th processor unit has (n - i) delay registers in element 14. The n-th processor unit therefore has no delay registers, and as soon as an element of the resulting matrix is formed in the n-th processor unit, the results obtained in all processor units are loaded simultaneously into output registers 5 by a single clock signal from the (n+2)-th output of the control unit, which opens the first information input of register 5, and a single clock signal from the (n+1)-th output of control unit 2, which writes the value into the register. On the next clock cycle the second information input of register 5 is opened, and the data held in these registers are passed through their second inputs from one processor unit to the next toward information output 11 of the device. The accumulating adder is cleared by the (n+2)-th output of control unit 2. When the first element of the resulting matrix C arrives at output 11 of the device from the output of register 5 of the first processor unit 1.1, control unit 2 generates the start-of-result-matrix flag, which is delivered to output 10 of the device.
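The chain of output registers 5 behaves like a parallel-load shift register, which the following sketch illustrates. It assumes that one value is visible on output 11 per clock cycle and that each register takes the value of its right-hand neighbor on every shift; the function name `shift_out` is hypothetical.

```python
def shift_out(loaded_values):
    """Parallel-load the output registers, then shift toward unit 1 and read output 11."""
    regs = list(loaded_values)        # regs[0] is the register of processor unit 1.1
    stream = []
    for _ in range(len(regs)):
        stream.append(regs[0])        # value currently visible on output 11
        regs = regs[1:] + [None]      # each register takes its right neighbor's value
    return stream


# Results c_1j, c_2j, c_3j loaded simultaneously from the three units:
print(shift_out([19, 43, 7]))  # [19, 43, 7]
```

Draining n registers takes n clock cycles, and the next simultaneous load arrives n clock cycles after the previous one, so in this model a single output channel is enough and loads do not collide.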

In pipelined mode, the next pair of matrices F and K can enter the device immediately after matrices A and B, provided that a new synchronization signal marking the start of the input matrices is present at the input of control unit 2.

When new pairs of matrices to be multiplied are supplied continuously to inputs 6 and 7, the device operates as a synchronous pipeline.

A distinctive feature of the proposed device is the feedback loop in the multiplication node of each processor unit, owing to which the i-th row of matrix A circulates in the i-th processor unit as many times as there are columns in matrix B, interacting with every column of matrix B and thereby forming the i-th row of the resulting matrix C. This feature makes it possible to use a single input channel for the n rows of matrix A, instead of the n input channels for the rows of matrix A in the prototype, and eliminates the need for n memory units to store the rows of matrix A.

Let us compare the prototype and the proposed device, taking as an example their implementation on serial microcircuits produced by domestic industry.

As an example, let us take the word length of the multiplied numbers to be 16 bits, since information from sensors is usually encoded with no more than 16 bits, the analog-to-digital converter typically having 10 to 12 bits. Thus, at n = 16, implementing the prototype, which contains 16 processor units, a control unit, and 16 memory units, requires 16 x 16 bits = 256 external microcircuit pins for input of the 16 rows of matrix A, 16 external pins for input of matrix B, and 16 x 16 bits = 256 external pins for output of all rows of the resulting matrix C, for a total of 528 external input-output pins.

To implement the prototype, at least 9 microcircuits with a number of pins per microcircuit equal to 60 will be required.

The proposed device, containing the processor units and a control unit, has lower hardware costs, since it does not contain the n memory units for storing the rows of matrix A, the memory unit for storing the columns of matrix B, or the address block. In addition, the use of two 16-bit input channels for matrices A and B and one 16-bit output channel for the resulting matrix C makes it possible to implement the device on a single chip, since it requires 48 data pins for input/output, which is 480 pins fewer than in the prototype. Thus, the proposed structure makes it possible to obtain a VLSI chip designed for solving matrix multiplication problems by the real-time pipelined processing method, achieving, thanks to the efficient loading of all processor units, a pipeline performance of up to 100 million multiply-add operations per second on 16-bit words at an input data arrival rate of - MHz, with 53 pins. The reduction in the number of external pins and the placement of the device on a single chip provide, compared with the prototype, a sharp decrease in overall dimensions, a cost reduction, high reliability and fault tolerance, and high adaptability of the processor.
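The pin counts quoted in this comparison can be checked with a few lines of Python. This is only an arithmetic sketch: the constant names are hypothetical, and the split of the 48 data pins into two 16-bit input channels and one 16-bit output channel is an interpretation of the text.

```python
import math

WORD_BITS = 16       # word length of the multiplied numbers
N_UNITS = 16         # number of processor units in the example
PINS_PER_CHIP = 60   # pins per microcircuit assumed for the prototype

# Prototype: n parallel channels for rows of A, one for B, n for rows of C.
prototype_pins = N_UNITS * WORD_BITS + WORD_BITS + N_UNITS * WORD_BITS
prototype_chips = math.ceil(prototype_pins / PINS_PER_CHIP)

# Proposed device: one channel each for A, B, and C (interpretation).
proposed_pins = 3 * WORD_BITS
pins_saved = prototype_pins - proposed_pins

print(prototype_pins, prototype_chips, proposed_pins, pins_saved)
```

The values agree with the figures in the text: 528 pins and at least 9 microcircuits for the prototype, 48 data pins for the proposed device, and a difference of 480 pins.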

Claims

1. A device for multiplying matrices, containing a linear matrix of n processor units (n being the order of the matrix) and a control unit, each processor unit containing a multiplication unit and an output register, wherein in each processor unit the first control input of the multiplication unit is connected to the (n + 1)-th output of the control unit, the second output of the multiplication unit is connected to the first information input of the output register, the second factor input of the multiplication unit of the first processor unit is connected to the second information input of the device, the second control input of the multiplication unit of the first processor unit and the first control input of the output register of the first processor unit are connected to the (n + 2)-th output of the control unit, the output of the output register of the first processor unit is connected to the information output of the device, the (n + 4)-th output of the control unit is connected to the output for the start-of-resulting-matrix flag, and the first and second inputs of the control unit are connected respectively to the input for the start-of-input-matrices flag and to the clock input of the device, characterized in that, in order to simplify the device by reducing the number of output contacts and the hardware costs, a multiplexer is introduced into each processor unit, the first and second information inputs of which are connected respectively to the first output of the multiplication unit and to the first information input of the device, the output of the multiplexer is connected to the first factor input of the multiplication unit, the control inputs of the multiplexers of the first to n-th processor units are connected respectively to the first to n-th outputs of the control unit, the third output of the multiplication unit of each preceding processor unit is connected to the second information input of the multiplication unit of the following processor unit, the output of the output register of each following processor unit is connected to the second information input of the output register of the preceding processor unit, and in each processor unit the third control input of the multiplication unit is connected to the (n + 3)-th output of the control unit, the second control input of the output register is connected to the (n + 1)-th output of the control unit, and the second control inputs of the multiplication units of the second to n-th processor units and the first control inputs of the output registers of the second to n-th processor units are connected to the (n + 2)-th output of the control unit.

2. The device according to claim 1, characterized in that the multiplication unit of each processor unit comprises a pipelined multiplier, an accumulating adder, and a delay element, the pipelined multiplier containing first and second groups of registers and a group of adders, wherein the information inputs of the registers of the first and second groups are respectively the first and second factor inputs of the multiplication unit, the control inputs of the registers of the first and second groups and the first control input of the accumulating adder are connected to the first control input of the multiplication unit, the first outputs of the registers of the first and second groups are connected respectively to the first and second inputs of the corresponding adders of the group, the second outputs of the registers of the first and second groups are connected respectively to the first and third outputs of the multiplication unit, the outputs of the adders of the group are connected to the corresponding inputs of the accumulating adder, the output of which is connected to the information input of the delay element, the output of which is connected to the second output of the multiplication unit, and the second control input of the accumulating adder and the gate input of the delay element are connected respectively to the second and third control inputs of the multiplication unit.

Fig. 2

Fig. 3 (labels: multiplicand, multiplier)

Compiled by L. Elfimova. Technical editor M. Morgenthal. Proofreader A. Obruchar.