CN117786293A - Matrix device and method of operating the same - Google Patents


Info

Publication number
CN117786293A
Authority
CN
China
Prior art keywords
matrix
elements
string
memory
native
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211274537.8A
Other languages
Chinese (zh)
Inventor
郭皇志
阮郁善
陈建文
骆子仁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chuangxin Wisdom Co ltd
Original Assignee
Chuangxin Wisdom Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chuangxin Wisdom Co ltd filed Critical Chuangxin Wisdom Co ltd
Publication of CN117786293A publication Critical patent/CN117786293A/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization

Abstract

The invention provides a matrix device and a method of operating the same. A transpose circuit receives a first element string representing a native matrix from a matrix source, wherein all elements of the native matrix are arranged in the first element string in one of a "row-major" and a "column-major" manner. The transpose circuit transposes the first element string into a second element string, where the second element string is equivalent to an element string in which all elements of the native matrix are arranged in the other of the "row-major" and "column-major" manners. A memory is coupled to the transpose circuit to receive the second element string.

Description

Matrix device and method of operating the same
Technical Field
The invention relates to a matrix device and a method of operating the same.
Background
Matrix multiplication is a fundamental operation in computer systems. After an arithmetic circuit completes a previous matrix operation, the elements of the result matrix are written sequentially into a dynamic random access memory (DRAM) in the order in which the previous matrix operation generated them. For example, the matrix may be stored in the DRAM in "column-major" or "row-major" order. However, the placement order of the elements of the previous matrix operation in the DRAM may be unfavorable for fetching by the next matrix operation. For example, the result matrix of the previous matrix operation may be stored in the DRAM in the column-major manner while the operand (operand) matrix of the next matrix operation is consumed in the row-major manner. For the next matrix operation, the elements of the operand matrix are then scattered across discrete addresses of the DRAM.
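The address arithmetic behind this problem can be sketched in a few lines (an illustrative model, not part of the patent; the function names are ours): in row-major storage, element (r, c) of a matrix with n_cols columns sits at offset r*n_cols + c, while in column-major storage it sits at offset c*n_rows + r, so elements that are adjacent in one order are separated in the other.

```python
# Illustrative sketch: flattened address offsets of a matrix element
# under the two storage orders discussed above.

def row_major_index(r, c, n_cols):
    # Offset of element (r, c) when the matrix is stored row-major.
    return r * n_cols + c

def col_major_index(r, c, n_rows):
    # Offset of element (r, c) when the matrix is stored column-major.
    return c * n_rows + r

# For a 2x2 matrix, the two elements of row 0 are adjacent in row-major
# storage (offsets 0 and 1) but discrete in column-major storage
# (offsets 0 and 2) -- the situation the background describes.
print(row_major_index(0, 0, 2), row_major_index(0, 1, 2))  # 0 1
print(col_major_index(0, 0, 2), col_major_index(0, 1, 2))  # 0 2
```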
When the elements fetched in one batch for the next matrix operation lie at consecutive addresses in the DRAM, the arithmetic circuit can read them from the DRAM all at once using a single burst read command. When the elements fetched for the next matrix operation lie at discrete addresses in the DRAM, the arithmetic circuit must issue multiple read commands to read the elements from the DRAM over multiple accesses. In general, the number of DRAM reads is proportional to power consumption. How to store the matrix generated by the previous matrix operation in the DRAM so that the next matrix operation can fetch it efficiently is therefore an important issue. If the number of DRAM accesses made while fetching the matrix can be reduced, the performance of matrix operations is effectively improved and circuit power consumption is effectively reduced.
It should be noted that the content of the "Background" section is intended to aid understanding of the invention. Some (or all) of the content disclosed in the "Background" section may not be known to persons skilled in the art. The disclosure in the "Background" section is not an admission of what was known to persons of ordinary skill in the art before the filing of the present application.
Disclosure of Invention
The invention provides a matrix device and a method of operating the same that improve performance.
The invention provides a matrix device including a transpose circuit and a memory. The transpose circuit receives a first element string representing a native matrix from a matrix source and transposes the first element string into a second element string, wherein all elements of the native matrix are arranged in the first element string in one of a "row-major" and a "column-major" manner, and the second element string is equivalent to an element string in which all elements of the native matrix are arranged in the other of the "row-major" and "column-major" manners. The memory is coupled to the transpose circuit to receive the second element string.
In an embodiment of the invention, the method of operating the matrix device includes: receiving, by a transpose circuit of the matrix device, a first element string representing a native matrix from a matrix source; transposing, by the transpose circuit, the first element string into a second element string, wherein all elements of the native matrix are arranged in the first element string in one of a "row-major" and a "column-major" manner, and the second element string is equivalent to an element string in which all elements of the native matrix are arranged in the other manner; and receiving, by a memory of the matrix device, the second element string.
Based on the above, the transpose circuit of the embodiments of the invention can, through transposition, make the element arrangement in the memory match the access characteristics of the computation. The performance of the matrix device can therefore be effectively improved.
In order to make the above features and advantages of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.
Drawings
Fig. 1 is a schematic circuit block diagram of a matrix device according to an embodiment of the present invention.
Fig. 2 is a schematic circuit block diagram of a matrix device according to another embodiment of the invention.
FIG. 3 is a schematic diagram illustrating the element storage locations in the memory when the transpose circuit does not transpose.
Fig. 4 is a schematic diagram illustrating the element storage locations in the memory when the transpose circuit 210 transposes.
FIG. 5 is a schematic diagram showing the storage of elements in a static random access memory.
Fig. 6 is a flow chart of a method of operating a matrix device according to an embodiment of the invention.
Description of the reference numerals
100, 200: matrix device
110, 210: transpose circuit
120, 220, 240: memory
230: matrix multiplication circuit
A0, A1, A2, A3, B0, B1, B2, B3, C0, C1: addresses
ES1, ES2, ES3: element strings
S601, S602, S603: steps
X00, X01, X10, X11, Y00, Y01, Y10, Y11: elements
Detailed Description
Reference will now be made in detail to the exemplary embodiments of the present invention, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers are used in the drawings and the description to refer to the same or like parts.
The term "coupled" as used throughout this specification (including the claims) may refer to any direct or indirect connection. For example, if a first device is coupled (or connected) to a second device, the connection may be direct, or indirect through other devices and connections. The terms "first", "second", and the like in the description and claims are used to name elements or to distinguish between different embodiments or ranges, not to limit the number or order of elements. In addition, wherever possible, the same reference numbers are used in the drawings and the description to refer to the same or like parts. Components/elements/steps that use the same reference numerals or the same terminology in different embodiments may be cross-referenced.
Fig. 1 is a schematic circuit block diagram of a matrix device 100 according to an embodiment of the invention. The matrix device 100 shown in fig. 1 includes a transpose circuit 110 and a memory 120. According to various design requirements, in some embodiments the transpose circuit 110 may be implemented as a hardware circuit. In other embodiments, the transpose circuit 110 may be implemented as firmware, software, or a combination of the two. In still other embodiments, the transpose circuit 110 may be implemented as a combination of two or more of hardware, firmware, and software.
In hardware, the transpose circuit 110 may be implemented as logic circuits on an integrated circuit. For example, the relevant functions of the transpose circuit 110 may be implemented in various logic blocks, modules, and circuits in one or more controllers, microcontrollers, microprocessors, application-specific integrated circuits (ASICs), digital signal processors (DSPs), field programmable gate arrays (FPGAs), and/or other processing units. The relevant functions of the matrix device, the transpose circuit, and/or the memory described above may be implemented as hardware circuits, such as various logic blocks, modules, and circuits in an integrated circuit, using a hardware description language (e.g., Verilog HDL or VHDL) or another suitable programming language.
In software and/or firmware, the relevant functions of the transpose circuit 110 may be implemented as programming code. For example, the transpose circuit 110 is implemented using a general-purpose programming language (e.g., C, C++, or assembly language) or another suitable programming language. The programming code may be recorded/stored on a non-transitory computer readable medium. In some embodiments, the non-transitory computer readable medium includes, for example, a semiconductor memory and/or a storage device. The semiconductor memory includes a memory card, a read only memory (ROM), a flash memory, a programmable logic circuit, or other semiconductor memory. The storage device includes a tape, a disk, a hard disk drive (HDD), a solid-state drive (SSD), or other storage device. An electronic device, such as a central processing unit (CPU), a controller, a microcontroller, or a microprocessor, can read and execute the programming code from the non-transitory computer readable medium to realize the relevant functions of the transpose circuit 110.
The transpose circuit 110 may receive an element string ES1 representing a native matrix from a matrix source (not shown in fig. 1). This embodiment does not limit the matrix source. For example, in some embodiments the matrix source may include a storage device, a network, a matrix multiplication circuit, or another source that provides an operand (operand) matrix. In some embodiments, the matrix multiplication circuit may include a multiply-accumulate (MAC) array.
The transpose circuit 110 may transpose the element string ES1 into the element string ES2, where all elements of the native matrix are arranged in the element string ES1 in one of the "row-major" and "column-major" manners, and the element string ES2 is equivalent to an element string in which all elements of the native matrix are arranged in the other of the two manners. For example, assume the content of the native matrix A is as shown in equation 1 below:

A = | X00  X01 |
    | X10  X11 |    (equation 1)

The content of the element string ES1, in which the native matrix A is arranged in the "row-major" manner, is {X00, X01, X10, X11}. After transposition by the transpose circuit 110, the native matrix A becomes the element string ES2 arranged in the "column-major" manner, and the content of the element string ES2 is {X00, X10, X01, X11}.
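The reordering performed here can be sketched in a few lines of Python (an illustrative model of the data movement only, not the circuit itself; the function name is ours):

```python
# Sketch of the transpose described above: reorder the flat "row-major"
# element string of an n_rows x n_cols matrix into the equivalent
# "column-major" element string.

def transpose_element_string(es, n_rows, n_cols):
    # es[r * n_cols + c] is element (r, c); emit the elements column by column.
    return [es[r * n_cols + c] for c in range(n_cols) for r in range(n_rows)]

ES1 = ["X00", "X01", "X10", "X11"]        # row-major string of matrix A
ES2 = transpose_element_string(ES1, 2, 2)
print(ES2)  # ['X00', 'X10', 'X01', 'X11']
```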
The memory 120 is coupled to the transpose circuit 110. The transpose circuit 110 transposes the element string ES1 of the native matrix to obtain the element string ES2 and transmits the element string ES2 to the memory 120. The memory 120 may be any type of memory according to the actual design. For example, in some embodiments the memory 120 may be a static random access memory (SRAM), a dynamic random access memory (DRAM), a magnetoresistive random access memory (MRAM), a flash memory, or another memory. The memory 120 receives and stores the element string ES2 as an operand (operand) matrix for the next matrix operation.
For example, fig. 2 is a schematic circuit block diagram of a matrix device 200 according to another embodiment of the invention. The matrix device 200 shown in fig. 2 includes a transpose circuit 210, a memory 220, a matrix multiplication circuit 230, and a memory 240. The matrix device 200, the transpose circuit 210, and the memory 220 of fig. 2 can be described with reference to the matrix device 100, the transpose circuit 110, and the memory 120 of fig. 1, and so forth, and thus are not described in detail herein. The matrix device 200 shown in fig. 2 may be used as one of many embodiments of the matrix device 100 shown in fig. 1, and thus the matrix device 100, the transpose circuit 110, and the memory 120 shown in fig. 1 may be described with reference to the matrix device 200, the transpose circuit 210, and the memory 220 shown in fig. 2.
The matrix multiplication circuit 230 is coupled to the transpose circuit 210, the memory 220, and the memory 240. The matrix multiplication circuit 230 may perform a front-layer computation of a neural network computation to generate the native matrix. The matrix multiplication circuit 230 may serve as the matrix source and provide the element string ES1 of the native matrix to the transpose circuit 210. The transpose circuit 210 may transpose the element string ES1 into the element string ES2. The memory 220 is coupled to the transpose circuit 210 to receive and store the element string ES2. The matrix multiplication circuit 230 may read the element string ES3 (matrix A) from the memory 240 as a weight matrix and read the element string ES2 (matrix B) from the memory 220 as an input matrix for the next-layer computation of the neural network computation. Generally, the weight matrix is a pre-trained parameter.
For example, assume the memory 220 includes a dynamic random access memory (DRAM). Based on the transpose operation of the transpose circuit 210, all elements of the same column of the native matrix (the result of the front-layer computation) may be stored at multiple consecutive addresses of the memory 220. The memory 220 can provide all elements of the same column of the native matrix to the matrix multiplication circuit 230 in burst mode, so that the matrix multiplication circuit 230 performs the next-layer computation of the neural network computation.
This embodiment does not limit the matrix operations of the matrix multiplication circuit 230. In some applications, the matrix operations may include matrix addition, matrix multiplication, multiply-accumulate (MAC) operations, and/or other matrix operations. For example, assume the content of the native matrix A is as shown in equation 1 above, and the content of the native matrix B is as shown in equation 2 below. Multiplying the two 2x2 matrices A and B yields the matrix Z shown in equation 3 below:

B = | Y00  Y01 |
    | Y10  Y11 |    (equation 2)

Z = A x B = | X00·Y00+X01·Y10  X00·Y01+X01·Y11 |
            | X10·Y00+X11·Y10  X10·Y01+X11·Y11 |    (equation 3)
The matrix multiplication performed by the matrix multiplication circuit 230 may include four steps. Step one: the matrix multiplication circuit 230 may fetch the elements [X00, X01] of matrix A from the memory 240, fetch the elements [Y00, Y10] of matrix B from the memory 220, and compute X00·Y00+X01·Y10. Step two: the matrix multiplication circuit 230 may retain the elements [X00, X01] of matrix A, fetch the elements [Y01, Y11] of matrix B from the memory 220, and compute X00·Y01+X01·Y11. Step three: the matrix multiplication circuit 230 may fetch the elements [X10, X11] of matrix A from the memory 240, fetch the elements [Y00, Y10] of matrix B from the memory 220, and compute X10·Y00+X11·Y10. Step four: the matrix multiplication circuit 230 may retain the elements [X10, X11] of matrix A, fetch the elements [Y01, Y11] of matrix B from the memory 220, and compute X10·Y01+X11·Y11. At this point, the matrix multiplication circuit 230 has obtained the matrix Z shown in equation 3.
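The four steps can be checked with ordinary numbers standing in for the memory fetches (a verification sketch only; the values are arbitrary):

```python
# Verification sketch of the four-step 2x2 multiplication described above,
# with arbitrary numbers in place of the fetched elements.
A = [[1, 2], [3, 4]]   # X00, X01 / X10, X11
B = [[5, 6], [7, 8]]   # Y00, Y01 / Y10, Y11

Z = [[0, 0], [0, 0]]
Z[0][0] = A[0][0] * B[0][0] + A[0][1] * B[1][0]  # step one
Z[0][1] = A[0][0] * B[0][1] + A[0][1] * B[1][1]  # step two
Z[1][0] = A[1][0] * B[0][0] + A[1][1] * B[1][0]  # step three
Z[1][1] = A[1][0] * B[0][1] + A[1][1] * B[1][1]  # step four
print(Z)  # [[19, 22], [43, 50]]
```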
The four-step matrix multiplication described in the preceding paragraph performs six reads on the memory 220. If the computation exploits data reuse, the matrix multiplication can be simplified from four steps to two optimization steps. Optimization step one: the matrix multiplication circuit 230 may fetch the elements [X00, X10] of matrix A from the memory 240, fetch the elements [Y00, Y01] of matrix B from the memory 220, and compute X00·Y00, X00·Y01, X10·Y00, and X10·Y01. Optimization step two: the matrix multiplication circuit 230 may fetch the elements [X01, X11] of matrix A from the memory 240, fetch the elements [Y10, Y11] of matrix B from the memory 220, and compute X01·Y10, X01·Y11, X11·Y10, and X11·Y11. At this point, the matrix multiplication circuit 230 can use the products X00·Y00, X00·Y01, X10·Y00, X10·Y01, X01·Y10, X01·Y11, X11·Y10, and X11·Y11 of optimization steps one and two to obtain the matrix Z shown in equation 3.
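The two optimization steps amount to accumulating one rank-1 (outer-product) update per fetched column of A and row of B — that reformulation is our reading of the steps, not the patent's wording. A sketch with arbitrary stand-in numbers:

```python
# Sketch (our reformulation) of the two optimization steps: each step
# fetches one column of A and one row of B, forms the four partial
# products, and accumulates them into Z.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
Z = [[0, 0], [0, 0]]

for k in range(2):                  # k = 0: optimization step one; k = 1: step two
    col_a = [A[0][k], A[1][k]]      # e.g. [X00, X10] for k = 0
    row_b = [B[k][0], B[k][1]]      # e.g. [Y00, Y01] for k = 0
    for i in range(2):
        for j in range(2):
            Z[i][j] += col_a[i] * row_b[j]

print(Z)  # [[19, 22], [43, 50]] -- same result with fewer element fetches
```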
As a comparison with fig. 4, fig. 3 shows the element storage locations in the memories 220 and 240 when the transpose circuit 210 does not transpose (i.e., the element string ES2 is identical to the element string ES1). It is assumed here that matrix A is stored in the memory 240 in the column-major manner and that all elements of matrix B are likewise arranged in the column-major manner in the element string ES1. That is, matrix B is stored in the memory 220 in the column-major manner. In optimization step one, the matrix multiplication circuit 230 can fetch the elements [X00, X10] of matrix A from the consecutive addresses A0 and A1 of the memory 240 in burst mode. Because the elements [Y00, Y01] of matrix B are located at discrete addresses B0 and B2 of the memory 220, they cannot be fetched in a single burst, so the matrix multiplication circuit 230 fetches element [Y00] and element [Y01] from the memory 220 in two reads. In optimization step two, the matrix multiplication circuit 230 can fetch the elements [X01, X11] of matrix A from the consecutive addresses A2 and A3 of the memory 240 in burst mode. Because the elements [Y10, Y11] of matrix B are located at discrete addresses B1 and B3 of the memory 220, they cannot be fetched in a single burst, so the matrix multiplication circuit 230 fetches element [Y10] and element [Y11] from the memory 220 in two reads.
Fig. 4 is a schematic diagram illustrating the element storage locations in the memories 220 and 240 when the transpose circuit 210 transposes. It is assumed here that matrix A is stored in the memory 240 in the column-major manner and that all elements of matrix B are likewise arranged in the column-major manner in the element string ES1. Based on the transpose operation of the transpose circuit 210, the element string ES2 is equivalent to an element string in which all elements of the native matrix B are arranged in the row-major manner. The element string ES2 is stored sequentially and consecutively in the memory 220. That is, matrix B is stored in the memory 220 in the row-major manner, as shown in fig. 4. In optimization step one, the matrix multiplication circuit 230 can fetch the elements [X00, X10] of matrix A from the consecutive addresses A0 and A1 of the memory 240 in burst mode, and fetch the elements [Y00, Y01] of matrix B from the consecutive addresses B0 and B1 of the memory 220 in burst mode. In optimization step two, the matrix multiplication circuit 230 can fetch the elements [X01, X11] of matrix A from the consecutive addresses A2 and A3 of the memory 240 in burst mode, and fetch the elements [Y10, Y11] of matrix B from the consecutive addresses B2 and B3 of the memory 220 in burst mode.
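The benefit of the transpose can be made concrete with a rough read-transaction model (our assumption, not a claim of the patent: one burst covers each maximal run of consecutive addresses):

```python
# Rough model: count read transactions needed to fetch a list of
# addresses, where each maximal run of consecutive addresses costs
# one (burst) read.

def read_transactions(addresses):
    reads = 0
    prev = None
    for a in addresses:
        if prev is None or a != prev + 1:
            reads += 1          # start of a new burst
        prev = a
    return reads

# Without transpose (B stored column-major): one optimization step
# fetches [Y00, Y01] at discrete addresses B0 and B2.
print(read_transactions([0, 2]))  # 2
# With transpose (B stored row-major): the same step fetches [Y00, Y01]
# at consecutive addresses B0 and B1.
print(read_transactions([0, 1]))  # 1
```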
FIG. 5 is a schematic diagram showing the storage of elements in a static random access memory (SRAM). In the embodiment shown in fig. 5, the memory 220 may be an SRAM whose depth is 2 (two addresses) and whose data width is 2 (two elements). It is assumed here that all elements of matrix B are arranged in the column-major manner in the element string ES1. Based on the transpose operation of the transpose circuit 210, all elements of matrix B are arranged in the row-major manner in the element string ES2. That is, matrix B is stored in the memory 220 (SRAM) in the row-major manner, as shown in fig. 5. In optimization step one, the matrix multiplication circuit 230 may fetch the elements [X00, X10] of matrix A from consecutive addresses of the memory 240 (e.g., a DRAM) in burst mode, and fetch the elements [Y00, Y01] of matrix B from the address C0 of the memory 220 (SRAM). In optimization step two, the matrix multiplication circuit 230 may fetch the elements [X01, X11] of matrix A from consecutive addresses of the memory 240 (DRAM) in burst mode, and fetch the elements [Y10, Y11] of matrix B from the address C1 of the memory 220 (SRAM).
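A toy model of this SRAM organization (depth 2, data width 2; the layout is assumed from fig. 5) shows why each optimization step needs only one access to fetch a row of B:

```python
# Toy model of the SRAM in fig. 5: two addresses, each holding a word
# of two elements (row-major layout of matrix B after the transpose).
sram = {
    0: ["Y00", "Y01"],  # address C0
    1: ["Y10", "Y11"],  # address C1
}

def read_word(addr):
    # A single SRAM access returns the whole word at `addr`.
    return sram[addr]

print(read_word(0))  # one access covers optimization step one's row of B
print(read_word(1))  # one access covers optimization step two's row of B
```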
Fig. 6 is a flow chart of a method of operating a matrix device according to an embodiment of the invention. Please refer to fig. 1 and fig. 6. In step S601, the transpose circuit 110 of the matrix device 100 receives an element string ES1 (first element string) representing a native matrix from a matrix source, wherein all elements of the native matrix are arranged in the element string ES1 in one of the "row-major" and "column-major" manners. In step S602, the transpose circuit 110 transposes the element string ES1 into an element string ES2 (second element string), wherein the element string ES2 is equivalent to an element string in which all elements of the native matrix are arranged in the other of the "row-major" and "column-major" manners. In step S603, the memory 120 of the matrix device 100 receives and stores the element string ES2 as an operand matrix for the next matrix operation.
In summary, the transpose circuit of the above embodiments can, through transposition, make the element arrangement in the memory conform to the access characteristics of the computation. The matrix device can therefore reduce the energy and time consumed by memory accesses and reads, effectively improving the performance of the matrix device.
Although the invention has been described with reference to the above embodiments, the invention is not limited thereto; modifications and variations may be made without departing from the spirit and scope of the invention.

Claims (14)

1. A matrix device, comprising:
a transpose circuit to receive a first element string representing a native matrix from a matrix source and transpose the first element string into a second element string, wherein all elements of the native matrix are arranged in the first element string in one of a row-major manner and a column-major manner, and the second element string is equivalent to an element string in which all elements of the native matrix are arranged in the other of the row-major manner and the column-major manner; and
a memory coupled to the transpose circuit to receive the second element string.
2. The matrix device of claim 1, wherein the matrix source comprises a storage device, a network, or a matrix multiplication circuit.
3. The matrix device of claim 2 wherein the matrix multiplication circuit comprises a multiply accumulator array.
4. The matrix device of claim 1, further comprising:
a matrix multiplication circuit coupled to the transpose circuit and the memory, wherein the matrix multiplication circuit performs a front-level computation of a neural network computation to generate the native matrix, the matrix multiplication circuit acts as the matrix source to provide the first element string of the native matrix to the transpose circuit, and the matrix multiplication circuit reads the second element string from the memory to perform a next-level computation of the neural network computation.
5. The matrix device of claim 4, wherein the memory comprises a dynamic random access memory that provides all elements of a column of the native matrix in burst mode to the matrix multiplication circuit for the next level of computation of the neural network computation.
6. The matrix device of claim 5, wherein all elements of a column of the native matrix are stored at a plurality of consecutive addresses of the memory.
7. The matrix device of claim 1, wherein all elements of the native matrix are arranged in the column-major manner in the first element string, the second element string is equivalent to the element string in which all elements of the native matrix are arranged in the row-major manner, and the second element string is stored sequentially and consecutively in the memory.
8. A method of operating a matrix device, comprising:
receiving, by a transpose circuit of the matrix apparatus, a first string of elements representing a native matrix from a matrix source;
transposing, by the transpose circuit, the first element string into a second element string, wherein all elements of the native matrix are arranged in the first element string in one of a row-major manner and a column-major manner, and the second element string is equivalent to the element string in which all elements of the native matrix are arranged in the other of the row-major manner and the column-major manner; and
receiving, by a memory of the matrix device, the second element string.
9. The method of claim 8, wherein the matrix source comprises a storage device, a network, or a matrix multiplication circuit.
10. The method of operation of claim 9 wherein the matrix multiplication circuit comprises a multiply accumulator array.
11. The method of operation of claim 8, further comprising:
performing a front-layer computation of a neural network computation by a matrix multiplication circuit of the matrix device to generate the native matrix, wherein the matrix multiplication circuit acts as the matrix source to provide the first string of elements of the native matrix to the transpose circuit; and
and reading the second element string from the memory by the matrix multiplication circuit to perform the next-layer calculation of the neural network calculation.
12. The method of operation of claim 11, wherein the memory comprises dynamic random access memory, the method of operation further comprising:
providing all elements of a column of the native matrix to the matrix multiplication circuit in burst mode by the memory for the next level of computation of the neural network computation.
13. The method of claim 12, wherein all elements of a column of the native matrix are stored at a plurality of consecutive addresses of the memory.
14. The method of claim 8, wherein all elements of the native matrix are arranged in the column-major manner in the first element string, the second element string is equivalent to the element string in which all elements of the native matrix are arranged in the row-major manner, and the second element string is stored sequentially and consecutively in the memory.
CN202211274537.8A 2022-09-20 2022-10-18 Matrix device and method of operating the same Pending CN117786293A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
TW111135607A TWI808000B (en) 2022-09-20 2022-09-20 Matrix device and operation method thereof
TW111135607 2022-09-20

Publications (1)

Publication Number Publication Date
CN117786293A true CN117786293A (en) 2024-03-29

Family

ID=88149144

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211274537.8A Pending CN117786293A (en) 2022-09-20 2022-10-18 Matrix device and method of operating the same

Country Status (3)

Country Link
US (1) US20240111827A1 (en)
CN (1) CN117786293A (en)
TW (1) TWI808000B (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI570573B (en) * 2014-07-08 2017-02-11 財團法人工業技術研究院 Circuit for matrix transpose
US10909447B2 (en) * 2017-03-09 2021-02-02 Google Llc Transposing neural network matrices in hardware
TWI769810B (en) * 2017-05-17 2022-07-01 美商谷歌有限責任公司 Special purpose neural network training chip
US10768899B2 (en) * 2019-01-29 2020-09-08 SambaNova Systems, Inc. Matrix normal/transpose read and a reconfigurable data processor including same

Also Published As

Publication number Publication date
TWI808000B (en) 2023-07-01
US20240111827A1 (en) 2024-04-04

Similar Documents

Publication Publication Date Title
CN109992743B (en) Matrix multiplier
US9639458B2 (en) Reducing memory accesses for enhanced in-memory parallel operations
CN114391135A (en) Method for performing in-memory processing operations on contiguously allocated data, and related memory device and system
US20100106692A1 (en) Circuit for compressing data and a processor employing same
US9110778B2 (en) Address generation in an active memory device
CN111915001B (en) Convolution calculation engine, artificial intelligent chip and data processing method
US11675624B2 (en) Inference engine circuit architecture
CN101689105A (en) A processor exploiting trivial arithmetic operations
KR20220051006A (en) Method of performing PIM (PROCESSING-IN-MEMORY) operation, and related memory device and system
US9146696B2 (en) Multi-granularity parallel storage system and storage
CN112416433A (en) Data processing device, data processing method and related product
US9171593B2 (en) Multi-granularity parallel storage system
US10942889B2 (en) Bit string accumulation in memory array periphery
CN117786293A (en) Matrix device and method of operating the same
US10942890B2 (en) Bit string accumulation in memory array periphery
US11487699B2 (en) Processing of universal number bit strings accumulated in memory array periphery
US20220108203A1 (en) Machine learning hardware accelerator
EP3066583A1 (en) Fft device and method for performing a fast fourier transform
US11226740B2 (en) Selectively performing inline compression based on data entropy
US9582473B1 (en) Instruction set to enable efficient implementation of fixed point fast fourier transform (FFT) algorithms
TW202414245A (en) Matrix device and operation method thereof
CN112766471A (en) Arithmetic device and related product
US11275562B2 (en) Bit string accumulation
US20230177106A1 (en) Computational circuit with hierarchical accumulator
US11669489B2 (en) Sparse systolic array design

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination