CN117574036A - Computing device, method of operation, and machine-readable storage medium - Google Patents

Computing device, method of operation, and machine-readable storage medium

Info

Publication number
CN117574036A
Authority
CN
China
Prior art keywords
matrix
input
fill
elements
buffer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410056538.8A
Other languages
Chinese (zh)
Other versions
CN117574036B (en)
Inventor
Name withheld on request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bi Ren Technology Co ltd
Beijing Bilin Technology Development Co ltd
Original Assignee
Shanghai Bi Ren Technology Co ltd
Beijing Bilin Technology Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Bi Ren Technology Co ltd, Beijing Bilin Technology Development Co ltd
Priority to CN202410056538.8A
Publication of CN117574036A
Application granted
Publication of CN117574036B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10: Complex mathematical operations
    • G06F17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00: Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38: Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48: Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52: Multiplying; Dividing
    • G06F7/523: Multiplying only

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides an arithmetic device, an operation method, and a machine-readable storage medium. The arithmetic device performs multiplication of an input matrix and a weight matrix. The arithmetic device comprises an input buffer, a ring buffer, a control circuit, and an arithmetic circuit. All elements of the input matrix are stored at a plurality of consecutive addresses of the input buffer. The control circuit moves a plurality of elements of the input matrix from a segment of consecutive addresses of the input buffer to the ring buffer. The control circuit defines boundary positions in the ring buffer based on the dimensions of the input matrix and defines a fill range at each boundary position based on the current weight element of the weight matrix. The control circuit changes the element values in the fill range of the ring buffer to implement a padding operation on the input matrix. The arithmetic circuit then uses the elements in the ring buffer to perform the multiplication.

Description

Computing device, method of operation, and machine-readable storage medium
Technical Field
The present invention relates to an arithmetic device, an operating method and a machine-readable storage medium.
Background
A computing device may multiply one matrix A (the multiplicand) by another matrix B (the multiplier). In many matrix multiplication applications, a padding operation is applied to matrix A before the multiplication in order to adjust the dimensions of the multiplication result matrix C (the product). The padding operation adds one or more columns of padding elements on the left and right sides of matrix A and/or one or more rows of padding elements on the upper and lower sides of matrix A. The padding elements all share a single padding value, which is independent of the original matrix A. Consequently, the padded matrix A contains many rows/columns of padding elements, and this large number of padding elements occupies space in the input buffer of the computing device.
Disclosure of Invention
The invention provides an arithmetic device, an operation method, and a machine-readable storage medium that reduce the input buffer space occupied by padding elements in application scenarios where a padding operation is applied to the input matrix.
In an embodiment according to the invention, the arithmetic device is arranged to perform a multiplication of an input matrix and a weight matrix. The arithmetic device comprises an input buffer, a ring buffer, a control circuit, and an arithmetic circuit. The input buffer temporarily stores the input matrix, wherein all elements of the input matrix are stored at a plurality of consecutive addresses of the input buffer. The control circuit is coupled to the input buffer and the ring buffer. The control circuit moves a plurality of elements of the input matrix from a segment of consecutive addresses of the input buffer to the ring buffer. The control circuit defines at least one boundary position in the ring buffer based on the dimensions of the input matrix. The control circuit defines at least one fill range at the at least one boundary position based on a current weight element of the weight matrix. The control circuit changes the element values of the elements in the at least one fill range of the ring buffer to implement a padding operation on the input matrix. The arithmetic circuit is coupled to the ring buffer to read the plurality of elements after the padding operation. The arithmetic circuit performs the multiplication using the plurality of elements after the padding operation.
In an embodiment according to the present invention, the operation method of the arithmetic device includes: storing the input matrix at a plurality of consecutive addresses of an input buffer of the arithmetic device; moving, by a control circuit of the arithmetic device, a plurality of elements of the input matrix from a segment of consecutive addresses of the input buffer to a ring buffer of the arithmetic device; defining, by the control circuit, at least one boundary position in the ring buffer based on the dimensions of the input matrix; defining, by the control circuit, at least one fill range at the at least one boundary position based on a current weight element of the weight matrix; changing, by the control circuit, the element values of the elements in the at least one fill range of the ring buffer to implement a padding operation on the input matrix; reading, by an arithmetic circuit of the arithmetic device, the plurality of elements after the padding operation from the ring buffer; and performing, by the arithmetic circuit, the multiplication using the plurality of elements after the padding operation.
In an embodiment according to the invention, the machine-readable storage medium is for storing non-transitory machine-readable instructions. The method of operation may be implemented when the non-transitory machine-readable instructions are executed by a computer.
Based on the above, the arithmetic device according to the embodiments of the present invention can store all elements of the original input matrix (the matrix in which no padding element exists) at a plurality of consecutive addresses of its input buffer. In each iteration of the matrix multiplication, a plurality of corresponding elements of the input matrix (a portion of the input matrix) are moved from a segment of consecutive addresses of the input buffer to the ring buffer, and the control circuit then selectively changes the values of elements in the ring buffer according to the current iteration. The changed elements can be regarded as padding elements, which is equivalent to performing a padding operation on the input matrix. Therefore, the arithmetic device can reduce the input buffer space occupied by padding elements in application scenarios where a padding operation is applied to the input matrix.
Drawings
Fig. 1 is a schematic circuit block diagram of an arithmetic device according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of a scenario in which a complete matrix multiplication is performed after an input matrix is filled in, according to an embodiment.
FIG. 3 is a schematic diagram of a scenario in which elements of the input matrix after the fill operation are moved to a circular buffer at different iterations in matrix multiplication, according to an embodiment.
Fig. 4 is a flowchart illustrating an operation method of an operation device according to an embodiment of the invention.
FIG. 5 is a schematic diagram illustrating the operation of moving matrix elements from an input buffer to a circular buffer and then selectively changing portions of the elements in the circular buffer in different iterations of matrix multiplication, according to an embodiment of the present invention.
Description of the reference numerals
100: computing device
110: storage unit
120: arithmetic device
130: next stage circuit
C1: multiplication result matrix
CC1: arithmetic circuit
CONT1: control circuit
IB1: input buffer
IN1: input matrix
IN1': input matrix after padding operation
RB1: ring buffer
t31, t32, t33, t34, t35, t50, t51, t52, t53, t54, t55, t56, t57, t58, t59: time
W: weight matrix
S410, S420, S430, S440, S450, S460, S470: step
Detailed Description
Reference will now be made in detail to the exemplary embodiments of the present invention, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers are used in the drawings and the description to refer to the same or like parts.
The term "coupled" as used throughout this specification (including the claims) may refer to any direct or indirect connection. For example, if a first device couples (or connects) to a second device, that connection may be through a direct connection, or through an indirect connection via other devices and connections. The terms first, second and the like in the description (including the claims) are used for naming the components or distinguishing between different embodiments or ranges and not for limiting the upper or lower limit of the number of components or the order of the components. In addition, wherever possible, the same reference numbers will be used throughout the drawings and the description to refer to the same or like parts. The components/elements/steps in different embodiments using the same reference numerals or using the same terminology may be referred to with respect to each other.
Fig. 1 is a schematic circuit block diagram of a computing device 100 according to an embodiment of the present invention. The computing device 100 shown in fig. 1 includes a storage unit 110, an arithmetic device 120, and a next stage circuit 130. Depending on the actual design and application, in some embodiments the storage unit 110 may include any type of memory, such as a high bandwidth memory (HBM) or other dynamic random-access memory (DRAM). In other embodiments, the storage unit 110 may include any storage device, such as a solid-state drive (SSD) or other storage device. The arithmetic device 120 may acquire all elements of the input matrix IN1 from the external storage unit 110. The input matrix IN1 is the original input matrix, on which no padding operation has been performed (a matrix in which no padding element exists). The arithmetic device 120 may perform multiplication of the input matrix IN1 and the weight matrix W. In each iteration of the matrix multiplication, the arithmetic device 120 may implement the padding operation on a local part of the input matrix (the plurality of elements corresponding to the current iteration). Therefore, the arithmetic device 120 can reduce the space that padding elements occupy in its input buffer IB1 in application scenarios where a padding operation is applied to the input matrix IN1. After completing the multiplication of the input matrix IN1 and the weight matrix W, the arithmetic device 120 may provide the result of the matrix multiplication to the next stage circuit 130. The next stage circuit 130 is, for example, a memory or another arithmetic device, depending on the actual design and application. In some applications, the arithmetic device 120 may be a tensor core, and the other arithmetic device may be a vector core or other arithmetic circuit.
In the embodiment shown in fig. 1, the arithmetic device 120 includes an input buffer IB1, a ring buffer RB1, a control circuit CONT1, and an arithmetic circuit CC1. Depending on the actual design, the arithmetic circuit CC1 includes, for example, a general matrix multiplication (General Matrix Multiply, GEMM) circuit or other arithmetic circuits. The input buffer IB1 includes, for example, a GEMM input buffer (GIB) or other buffer. The input buffer IB1 temporarily stores the input matrix IN1. The control circuit CONT1 is coupled to the input buffer IB1, the ring buffer RB1, and the arithmetic circuit CC1. The arithmetic circuit CC1 is coupled to the ring buffer RB1. Depending on the design, in some embodiments the arithmetic device 120, the arithmetic circuit CC1, and/or the control circuit CONT1 may be implemented as hardware circuits. In other embodiments, they may be implemented as firmware, software, or a combination of the two. In still other embodiments, they may be implemented as a combination of hardware, firmware, and software.
In terms of hardware, the arithmetic device 120, the arithmetic circuit CC1, and/or the control circuit CONT1 may be implemented as logic circuits on an integrated circuit. For example, the relevant functions of the arithmetic device 120, the arithmetic circuit CC1, and/or the control circuit CONT1 may be implemented in various logic blocks, modules, and circuits in one or more hardware controllers, microcontrollers, hardware processors, microprocessors, application-specific integrated circuits (ASICs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs), central processing units (CPUs), and/or other processing units. These functions may be implemented as hardware circuits, such as various logic blocks, modules, and circuits in an integrated circuit, using a hardware description language (such as Verilog HDL or VHDL) or another suitable programming language.
In terms of software and/or firmware, the functions of the arithmetic device 120, the arithmetic circuit CC1, and/or the control circuit CONT1 may be implemented as programming code. For example, they may be implemented using a general-purpose programming language (e.g., C, C++, or assembly language) or another suitable programming language. The programming code may be recorded/stored on a non-transitory machine-readable storage medium. In some embodiments, the non-transitory machine-readable storage medium includes, for example, a semiconductor memory and/or a storage device. The semiconductor memory includes a memory card, a read-only memory (ROM), a flash memory, a programmable logic circuit, or other semiconductor memory. The storage device includes a tape, a disk, a hard disk drive (HDD), a solid-state drive (SSD), or other storage device. An electronic device (e.g., a computer, CPU, hardware controller, microcontroller, hardware processor, or microprocessor) may read and execute the programming code from the non-transitory machine-readable storage medium to implement the relevant functions of the arithmetic device 120, the arithmetic circuit CC1, and/or the control circuit CONT1.
Fig. 2 is a schematic diagram of a scenario in which the complete matrix multiplication is performed after the input matrix IN1 has been padded, according to an embodiment. Referring to fig. 1 and 2, the arithmetic device 120 may read the input matrix IN1 from the storage unit 110. The actual dimensions of the input matrix IN1 may be determined according to the actual design. In the scenario shown in fig. 2, the input matrix IN1 is assumed to be a 4×10 matrix, where A00, A01, …, A39 represent elements at different positions of the input matrix IN1. The arithmetic device 120 may perform a padding operation on the input matrix IN1 and then store the padded input matrix IN1' in the input buffer IB1. The padding operation adds one or more columns of padding elements on the left and right sides of the input matrix IN1 and/or one or more rows of padding elements on the upper and lower sides of the input matrix IN1.
The arithmetic device 120 may first perform the padding operation and then perform the multiplication of the input matrix IN1 and the weight matrix W. In the scenario shown in fig. 2, the weight matrix W is assumed to be a 1×5 matrix, where W00, W01, …, W04 represent elements at different positions of the weight matrix W. Based on the dimensions of the weight matrix W, in order for the multiplication result matrix C1 to have the same dimensions as the input matrix IN1, the number of columns to be added on each of the left and right sides of the input matrix IN1 is (w_column_number - 1) / 2 = (5 - 1) / 2 = 2, and the number of rows to be added on each of the upper and lower sides is (w_row_number - 1) / 2 = (1 - 1) / 2 = 0, where w_column_number is the number of columns of the weight matrix W and w_row_number is the number of rows of the weight matrix W. Accordingly, the arithmetic device 120 adds two columns of padding elements (all having the same padding value P0) on each of the left and right sides of the input matrix IN1, thereby forming the padded input matrix IN1' in the input buffer IB1. The padding value P0 may be 0 or another value, depending on the actual design. After the padding of the input matrix IN1 is complete, the arithmetic device 120 performs the multiplication of the input matrix IN1' and the weight matrix W to obtain the multiplication result matrix C1 (where C00, C01, …, C39 represent elements at different positions of the multiplication result matrix C1). After completing the matrix multiplication, the arithmetic device 120 may provide the multiplication result matrix C1 to the next stage circuit 130.
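The padding arithmetic of this scenario can be checked with a short Python sketch. The helper names `pad_counts` and `pad_matrix` are hypothetical and do not come from the patent. Integer division is assumed, which matches the formulas in the text only for weight matrices with odd dimensions. The strings "A00" and "P0" are symbolic labels standing in for actual element values.

```python
def pad_counts(w_row_number, w_column_number):
    """Return (columns per side, rows per side) of padding, as in the text."""
    # Assumption: weight dimensions are odd, so integer division is exact.
    return (w_column_number - 1) // 2, (w_row_number - 1) // 2


def pad_matrix(matrix, pad_cols, pad_rows, fill):
    """Return a padded copy of `matrix` (a list of rows)."""
    width = len(matrix[0]) + 2 * pad_cols
    body = [[fill] * pad_cols + list(row) + [fill] * pad_cols for row in matrix]
    top = [[fill] * width for _ in range(pad_rows)]
    bottom = [[fill] * width for _ in range(pad_rows)]
    return top + body + bottom


# 4x10 input matrix IN1 with symbolic element labels A00 .. A39.
IN1 = [[f"A{r}{c}" for c in range(10)] for r in range(4)]
pad_cols, pad_rows = pad_counts(w_row_number=1, w_column_number=5)
IN1_padded = pad_matrix(IN1, pad_cols, pad_rows, fill="P0")

print(pad_cols, pad_rows)                   # 2 0
print(len(IN1_padded), len(IN1_padded[0]))  # 4 14
print(IN1_padded[0])
```

Under these assumptions the padded matrix holds 4 × 14 = 56 elements, 16 of which are padding elements.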
FIG. 3 is a schematic diagram illustrating a scenario in which elements of the padded input matrix IN1' are moved to the ring buffer RB1 in different iterations of the matrix multiplication, according to an embodiment. Referring to fig. 1 and 3, the matrix multiplication has 5 iterations based on the dimensions of the weight matrix W. At time t31 (first iteration), the 1st to 10th columns of the input matrix IN1' are moved from the input buffer IB1 to the ring buffer RB1, and the arithmetic circuit CC1 may perform the first iteration using the elements of the ring buffer RB1 and the element W00 of the weight matrix W. At time t32 (second iteration), the 2nd to 11th columns of the input matrix IN1' are moved from the input buffer IB1 to the ring buffer RB1, and the arithmetic circuit CC1 may perform the second iteration using the elements of the ring buffer RB1 and the element W01 of the weight matrix W. At time t33 (third iteration), the 3rd to 12th columns of the input matrix IN1' are moved from the input buffer IB1 to the ring buffer RB1, and the arithmetic circuit CC1 may perform the third iteration using the elements of the ring buffer RB1 and the element W02 of the weight matrix W. At time t34 (fourth iteration), the 4th to 13th columns of the input matrix IN1' are moved from the input buffer IB1 to the ring buffer RB1, and the arithmetic circuit CC1 may perform the fourth iteration using the elements of the ring buffer RB1 and the element W03 of the weight matrix W. At time t35 (fifth iteration), the 5th to 14th columns of the input matrix IN1' are moved from the input buffer IB1 to the ring buffer RB1, and the arithmetic circuit CC1 may perform the fifth iteration using the elements of the ring buffer RB1 and the element W04 of the weight matrix W.
The arithmetic circuit CC1 may perform the matrix multiplication over a plurality of iterations. After completing the multiplication of the input matrix IN1' and the weight matrix W, the arithmetic circuit CC1 may provide the multiplication result matrix C1 to the next stage circuit 130. The present embodiment does not limit the specific matrix multiplication algorithm performed by the arithmetic circuit CC1. Depending on the actual design, the arithmetic circuit CC1 may perform a well-known matrix multiplication algorithm or another algorithm. As can be seen from fig. 2 and 3, the padding elements of the input matrix IN1' all share the same padding value P0, and this padding value P0 is independent of the original input matrix IN1. The padded input matrix IN1' therefore has multiple columns of padding elements, and these padding elements occupy space in the input buffer IB1.
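The iteration pattern of fig. 3 can be sketched in Python as follows. The sketch assumes that each iteration multiplies the current column window by the current weight element and accumulates the products into the result. The text does not spell out this per-iteration operation, so it is only one plausible reading. The function name `multiply_padded` is hypothetical.

```python
def multiply_padded(padded, weights, out_cols):
    """Accumulate one shifted column window of `padded` per weight element."""
    rows = len(padded)
    result = [[0] * out_cols for _ in range(rows)]
    for k, w in enumerate(weights):             # one iteration per weight element
        for r in range(rows):
            window = padded[r][k:k + out_cols]  # columns k+1 .. k+out_cols (1-based)
            for c, value in enumerate(window):
                # Assumed per-iteration operation: scale and accumulate.
                result[r][c] += w * value
    return result


# Small numeric example: one row [1, 2, 3] padded with two zeros per side.
padded = [[0, 0, 1, 2, 3, 0, 0]]
print(multiply_padded(padded, [1, 2, 3, 4, 5], out_cols=3))  # [[26, 20, 14]]
```

With a 1×5 weight matrix the loop runs five times, and the window start advances by one column per iteration, as at times t31 to t35.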
Fig. 4 is a flowchart illustrating an operation method of the computing device 120 according to an embodiment of the invention. In some embodiments, the method of operation illustrated in FIG. 4 may be implemented in firmware or software (i.e., a program). For example, the operations associated with the method of operation illustrated in FIG. 4 may be implemented as non-transitory machine-readable instructions (programming code or program) that may be stored on a machine-readable storage medium. The method of operation illustrated in fig. 4 may be implemented when non-transitory machine readable instructions are executed by a computer. In other embodiments, the method of operation shown in FIG. 4 may be implemented in hardware, such as the computing device 120 shown in FIG. 1.
FIG. 5 is a schematic diagram illustrating the operation of moving matrix elements from the input buffer IB1 to the ring buffer RB1 and then selectively changing some elements in the ring buffer RB1 in different iterations of the matrix multiplication, according to an embodiment of the present invention. Referring to fig. 1, 4 and 5, the input buffer IB1 can obtain all elements of the input matrix IN1 from the storage unit 110 outside the arithmetic device 120. In step S410, the input buffer IB1 stores all elements of the original, unpadded input matrix IN1 at a plurality of consecutive addresses of the input buffer IB1. In the embodiment shown in fig. 5, two padding elements (having the same padding value P0) are placed in front of the first element A00 of the input matrix IN1. No padding elements of the padding operation are present among the elements A00 to A39 of the input matrix IN1 stored at the plurality of consecutive addresses of the input buffer IB1. Compared with the embodiments shown in fig. 2 and 3, the embodiments shown in fig. 4 and 5 can effectively reduce the number of padding elements (padding value P0) in the input buffer IB1. Because a large number of padding elements no longer occupy the input buffer IB1, its space utilization can be effectively improved.
In step S420, the control circuit CONT1 moves a plurality of elements of the input matrix IN1 from a segment of consecutive addresses of the input buffer IB1 to the ring buffer RB1. For example, the middle part of fig. 5 shows the storage contents of the ring buffer RB1 at different times (different iterations). As shown in fig. 5, the matrix multiplication has 5 iterations (based on the dimensions of the weight matrix W). At time t50 (first iteration, corresponding to element W00 of the weight matrix W), the 1st to 40th consecutive elements "P0, P0, A00, A01, …, A37" of the input buffer IB1 are moved from the input buffer IB1 to the ring buffer RB1. At time t52 (second iteration, corresponding to element W01), the 2nd to 41st consecutive elements "P0, A00, A01, …, A37, A38" of the input buffer IB1 are moved to the ring buffer RB1. At time t54 (third iteration, corresponding to element W02), the 3rd to 42nd consecutive elements "A00, A01, …, A37, A38, A39" of the input buffer IB1 are moved to the ring buffer RB1. At time t56 (fourth iteration, corresponding to element W03), the 4th to 42nd consecutive elements "A01, …, A37, A38, A39" of the input buffer IB1 are moved to the ring buffer RB1. At time t58 (fifth iteration, corresponding to element W04), the 5th to 42nd consecutive elements "A02, …, A37, A38, A39" of the input buffer IB1 are moved to the ring buffer RB1.
In step S430, the control circuit CONT1 defines at least one boundary position in the ring buffer RB1 based on the dimensions of the input matrix IN1. In some embodiments, the control circuit CONT1 may define the boundary positions based on the number of columns of the input matrix IN1. For example (but not limited thereto), the distance between two adjacent boundary positions is the number of columns of the input matrix IN1. Taking the input matrix IN1 shown in fig. 2 as an example, the number of columns of the input matrix IN1 is 10, so the distance between two adjacent boundary positions is 10 elements. The control circuit CONT1 may thus define at least one boundary position in the ring buffer RB1, at the virtual straight lines shown in fig. 5.
In step S440, the control circuit CONT1 defines a fill range at each boundary position based on the current weight element of the weight matrix W. In some embodiments, the control circuit CONT1 may calculate the fill length of the fill range based on the column position of the current weight element in the weight matrix W, wherein the fill range extends from the boundary position over the fill length. For example (but not limited thereto), the control circuit CONT1 may evaluate equation A below to obtain the fill length, where mask_offset represents the fill length, w_id represents the column position of the current weight element in the weight matrix W, and pad_number represents the number of columns that the padding operation should add on one side of the input matrix IN1. The control circuit CONT1 may evaluate equation B below to obtain pad_number, where w_column_number represents the number of columns of the weight matrix W.
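The storage saving described here can be illustrated with a minimal Python sketch of the input buffer layout. The function name `build_input_buffer` is hypothetical, and the comparison assumes the 4×10 input matrix and the two padding columns per side of the fig. 2 scenario.

```python
def build_input_buffer(matrix, pad_number, fill):
    """Flatten `matrix` row by row and prepend `pad_number` padding elements."""
    flat = [value for row in matrix for value in row]
    return [fill] * pad_number + flat


# 4x10 input matrix IN1 with symbolic element labels A00 .. A39.
IN1 = [[f"A{r}{c}" for c in range(10)] for r in range(4)]
IB1 = build_input_buffer(IN1, pad_number=2, fill="P0")

# Size of the fully padded matrix IN1' (two padding columns on each side).
padded_size = len(IN1) * (len(IN1[0]) + 2 * 2)

print(len(IB1), padded_size)  # 42 56
print(IB1[:4])                # ['P0', 'P0', 'A00', 'A01']
```

In this example the input buffer holds 42 elements instead of the 56 needed for the fully padded matrix.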
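The per-iteration move of step S420 amounts to copying one run of consecutive elements whose start advances by one element per iteration. A minimal Python sketch follows. It assumes a ring buffer capacity of 40 elements (rows × columns) and an input buffer that holds two leading padding elements followed by A00 to A39. The function name `move_segment` is hypothetical.

```python
def move_segment(input_buffer, start, length):
    """Copy up to `length` consecutive elements beginning at 0-based `start`."""
    # Slicing stops at the end of the input buffer, so later iterations
    # may deliver fewer than `length` elements.
    return input_buffer[start:start + length]


# Input buffer: two leading padding elements, then A00 .. A39 row by row.
IB1 = ["P0", "P0"] + [f"A{r}{c}" for r in range(4) for c in range(10)]

for iteration in range(5):
    RB1 = move_segment(IB1, iteration, 40)
    # Print iteration number, element count, first and last element moved.
    print(iteration + 1, len(RB1), RB1[0], RB1[-1])
```

No element of the input buffer is rewritten between iterations. Only the start address of the copied segment changes.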
mask_offset = w_id - pad_number    (equation A)
pad_number = (w_column_number - 1) / 2    (equation B)
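Equations A and B can be written as a small Python sketch. The function `fill_range` and the half-open index convention are assumptions made for illustration. The worked examples in the text cover only negative and zero values of mask_offset, where the fill range extends to the right of the boundary position. Treating positive values as extending to the left is an assumption of this sketch.

```python
def pad_number(w_column_number):
    """Equation B: padding columns per side (odd column count assumed)."""
    return (w_column_number - 1) // 2


def mask_offset(w_id, w_column_number):
    """Equation A: signed fill length for 0-based weight column `w_id`."""
    return w_id - pad_number(w_column_number)


def fill_range(boundary, offset):
    """Return the half-open index range [start, stop) of one fill range."""
    if offset < 0:
        # Negative offset: |offset| elements to the right of the boundary.
        return boundary, boundary - offset
    # Assumption: positive offset covers `offset` elements left of the boundary.
    return boundary - offset, boundary


offsets = [mask_offset(w_id, 5) for w_id in range(5)]
print(offsets)              # [-2, -1, 0, 1, 2]
print(fill_range(10, -2))   # (10, 12)
print(fill_range(10, 0))    # (10, 10)
```

An empty range, as for mask_offset = 0, means that no element is overwritten in that iteration.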
IN step S450, the control circuit CONT1 changes the element values of the elements IN the filling range of the ring buffer RB1 to realize the filling operation of the input matrix IN1. In step S460, the arithmetic circuit CC1 may read the plurality of elements after the filling operation from the ring buffer RB1. In step S470, the arithmetic circuit CC1 may perform matrix multiplication using the elements after the padding operation. Fig. 5 will be used as an illustrative example.
As shown in fig. 5, the matrix multiplication has 5 iterations (based on the dimension of the weight matrix W). The middle part of fig. 5 shows the memory contents of the ring buffer RB1 at different times (different iterations). At time t50 (first iteration corresponding to element W00 in weight matrix W), the 1 st through 40 th consecutive elements "P0, A00, A01, …, A37" of input buffer IB1 are moved from input buffer IB1 to circular buffer RB1. The column position of the current weight element W00 in the weight matrix W is "1 st" (column position w_id is 0). The number of columns pad_number to be filled on one side of the input matrix IN1 is (w_column_number-1)/2= (5-1)/2=2. Accordingly, the control circuit CONT1 may calculate the fill length mask_offset=w_id_pad_number=0-2= -2 of the fill range in step S440. Based on the fill length mask_offset= -2, the control circuit CONT1 may define "a range of two elements from the boundary position (at the virtual straight line shown in fig. 5) in the ring buffer RB1 to the right" as the fill range in step S440. As shown in fig. 5, elements a08 to a09 of the ring buffer RB1 are defined as one filling range, elements a18 to a19 of the ring buffer RB1 are defined as another filling range, and elements a28 to a29 of the ring buffer RB1 are defined as yet another filling range at time t 50.
The upper part of fig. 5 shows the memory contents of the ring buffer RB1 after being changed at different times (different iterations). The control circuit CONT1 may change the element values of the elements in the filling range of the ring buffer RB1 in step S450. For example, at time t51 (first iteration corresponding to element W00 IN the weight matrix W) after time t50, the control circuit CONT1 may reset all elements IN the plurality of filling ranges (elements a08 to a09, elements a18 to a19, and elements a28 to a 29) of the ring buffer RB1 to the same filling value P0 to implement the filling operation on the input matrix IN1. The same padding value P0 may be 0 or other values based on the actual design. After resetting all elements in the fill range to the same fill value P0, the arithmetic circuit CC1 may perform a first iterative operation using elements of the circular buffer RB1 and the element W00 in the weight matrix W.
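The first iteration (times t50 and t51) can be traced with symbolic labels in Python. The sketch assumes boundary positions at ring buffer indices 10, 20, and 30 (multiples of the column count) and an input buffer that starts with two padding elements. It then checks that the ring buffer, read as rows of 10, matches columns 1 to 10 of the padded matrix used at time t31 in fig. 3.

```python
ROWS, COLS, PAD = 4, 10, 2

# Symbolic input matrix IN1 and the flat input buffer IB1 (two leading P0).
IN1 = [[f"A{r}{c}" for c in range(COLS)] for r in range(ROWS)]
IB1 = ["P0"] * PAD + [value for row in IN1 for value in row]

# Time t50: move the 1st to 40th elements into the ring buffer.
RB1 = IB1[0:ROWS * COLS]
print(RB1[10:12])  # ['A08', 'A09']

# Time t51: reset each fill range to the padding value P0.
boundaries = [r * COLS for r in range(1, ROWS)]  # assumed positions 10, 20, 30
mask_offset = 0 - PAD                            # w_id = 0 for W00
for boundary in boundaries:
    for index in range(boundary, boundary - mask_offset):
        RB1[index] = "P0"

# Compare with columns 1..10 of the fully padded matrix.
rows_after_fill = [RB1[r * COLS:(r + 1) * COLS] for r in range(ROWS)]
expected = [["P0"] * PAD + row[:COLS - PAD] for row in IN1]
print(rows_after_fill == expected)  # True
```

The two leading padding elements stored in the input buffer cover the first row, so only the three interior boundaries need to be overwritten.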
At time t52 after time t51 (the second iteration, corresponding to element W01 in the weight matrix W), the 2nd through 41st consecutive elements "P0, A00, A01, …, A37, A38" of the input buffer IB1 are moved from the input buffer IB1 to the ring buffer RB1. Since the column position of the current weight element W01 in the weight matrix W is "2nd" (the column position w_id is 1), the control circuit CONT1 may calculate the fill length of the fill range as mask_offset = 1 - 2 = -1. Based on the fill length mask_offset = -1, the control circuit CONT1 may define "a range of one element extending to the right from the boundary position (at the virtual straight line shown in FIG. 5) in the ring buffer RB1" as the fill range in step S440. As shown in FIG. 5, at time t52, element A09 of the ring buffer RB1 is defined as one fill range, element A19 of the ring buffer RB1 is defined as another fill range, and element A29 of the ring buffer RB1 is defined as yet another fill range. At time t53 after time t52 (still the second iteration, corresponding to element W01 in the weight matrix W), the control circuit CONT1 may reset all elements in the plurality of fill ranges (elements A09, A19, and A29) of the ring buffer RB1 to the same fill value P0 to implement the fill operation on the input matrix IN1. After all elements in the fill ranges have been reset to the same fill value P0, the arithmetic circuit CC1 may perform the second iterative operation using the elements of the ring buffer RB1 and the element W01 in the weight matrix W.
At time t54 after time t53 (the third iteration, corresponding to element W02 in the weight matrix W), the 3rd through 42nd consecutive elements "A00, A01, …, A37, A38, A39" of the input buffer IB1 are moved from the input buffer IB1 to the ring buffer RB1. The column position of the current weight element W02 in the weight matrix W is "3rd" (the column position w_id is 2), so the control circuit CONT1 may calculate the fill length of the fill range as mask_offset = 2 - 2 = 0. Based on the fill length mask_offset = 0, the control circuit CONT1 knows that the length of the current fill range is 0, that is, no element needs to be set to the fill value P0. At time t55 after time t54 (still the third iteration, corresponding to element W02 in the weight matrix W), the arithmetic circuit CC1 may perform the third iterative operation using the elements of the ring buffer RB1 and the element W02 in the weight matrix W.
At time t56 after time t55 (the fourth iteration, corresponding to element W03 in the weight matrix W), the 4th through 42nd consecutive elements "A01, …, A37, A38, A39" of the input buffer IB1 are moved from the input buffer IB1 to the ring buffer RB1. The column position of the current weight element W03 in the weight matrix W is "4th" (the column position w_id is 3), so the control circuit CONT1 may calculate the fill length of the fill range as mask_offset = 3 - 2 = 1. Based on the fill length mask_offset = 1, the control circuit CONT1 may define "a range of one element extending to the left from the boundary position (at the virtual straight line shown in FIG. 5) in the ring buffer RB1" as the fill range in step S440. As shown in FIG. 5, at time t56, element A10 of the ring buffer RB1 is defined as one fill range, element A20 of the ring buffer RB1 is defined as another fill range, and element A30 of the ring buffer RB1 is defined as yet another fill range. At time t57 after time t56 (still the fourth iteration, corresponding to element W03 in the weight matrix W), the control circuit CONT1 may reset all elements in the plurality of fill ranges (elements A10, A20, and A30) of the ring buffer RB1 to the same fill value P0 to implement the fill operation on the input matrix IN1. After all elements in the fill ranges have been reset to the same fill value P0, the arithmetic circuit CC1 may perform the fourth iterative operation using the elements of the ring buffer RB1 and the element W03 in the weight matrix W.
At time t58 after time t57 (the fifth iteration, corresponding to element W04 in the weight matrix W), the 5th through 42nd consecutive elements "A02, …, A37, A38, A39" of the input buffer IB1 are moved from the input buffer IB1 to the ring buffer RB1. The column position of the current weight element W04 in the weight matrix W is "5th" (the column position w_id is 4), so the control circuit CONT1 may calculate the fill length of the fill range as mask_offset = 4 - 2 = 2. Based on the fill length mask_offset = 2, the control circuit CONT1 may define "a range of two elements extending to the left from the boundary position (at the virtual straight line shown in FIG. 5) in the ring buffer RB1" as the fill range in step S440. As shown in FIG. 5, at time t58, elements A10 to A11 of the ring buffer RB1 are defined as one fill range, elements A20 to A21 of the ring buffer RB1 are defined as another fill range, and elements A30 to A31 of the ring buffer RB1 are defined as yet another fill range. At time t59 after time t58 (still the fifth iteration, corresponding to element W04 in the weight matrix W), the control circuit CONT1 may reset all elements in the plurality of fill ranges (elements A10 to A11, elements A20 to A21, and elements A30 to A31) of the ring buffer RB1 to the same fill value P0 to implement the fill operation on the input matrix IN1. After all elements in the fill ranges have been reset to the same fill value P0, the arithmetic circuit CC1 may perform the fifth iterative operation using the elements of the ring buffer RB1 and the element W04 in the weight matrix W.
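The five iterations described above can be replayed with a short Python sketch. It is only a model of the example, under stated assumptions: the input buffer is taken to hold two leading P0 slots followed by A00 to A39 (42 slots in total), the ring buffer is taken to hold 40 slots with boundary positions at slot indices 10, 20 and 30, and any ring-buffer slots left unused by the shorter windows of the fourth and fifth iterations are assumed to hold P0. The names iteration and walkthrough are hypothetical.

```python
W_COLUMN_NUMBER = 5
RING_SIZE = 40              # assumed: 4 rows x 10 columns
BOUNDARIES = [10, 20, 30]   # assumed slot indices of the boundary positions
FILL = "P0"

# Assumed input buffer layout: two leading P0 slots, then A00..A39.
input_buffer = [FILL, FILL] + [f"A{i:02d}" for i in range(40)]


def iteration(w_id):
    """Return (mask_offset, names of the elements reset to P0) for one iteration."""
    offset = w_id - (W_COLUMN_NUMBER - 1) // 2
    # Move a run of consecutive input-buffer elements into the ring buffer.
    ring = input_buffer[w_id:w_id + RING_SIZE]
    ring += [FILL] * (RING_SIZE - len(ring))  # assumption: unused tail slots hold P0
    reset = []
    for b in BOUNDARIES:
        # Negative offset: fill to the right of the boundary; positive: to the left.
        lo, hi = (b, b - offset) if offset < 0 else (b - offset, b)
        reset.extend(ring[lo:hi])
    return offset, reset


walkthrough = {w_id: iteration(w_id) for w_id in range(W_COLUMN_NUMBER)}
for w_id, (offset, reset) in walkthrough.items():
    print(f"W0{w_id}: mask_offset={offset:+d}, reset={reset}")
```

Under these assumptions the sketch reproduces the fill ranges named in the text for each weight element: A08 to A09, A18 to A19 and A28 to A29 for W00; A09, A19 and A29 for W01; none for W02; A10, A20 and A30 for W03; and A10 to A11, A20 to A21 and A30 to A31 for W04.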
As described above, the arithmetic circuit CC1 may perform matrix multiplication having a plurality of iterations. After completing the matrix multiplication, the arithmetic circuit CC1 may supply the multiplication result matrix C1 to the next stage circuit 130.
In summary, the computing device 120 may store all elements of the original input matrix IN1 (the matrix without fill elements) at a plurality of consecutive addresses of the input buffer IB1. In each iteration of the matrix multiplication, a plurality of corresponding elements of the input matrix IN1 (a portion of the input matrix IN1) are moved from a segment of consecutive addresses of the input buffer IB1 to the ring buffer RB1, and the control circuit CONT1 then selectively changes the element values of elements in the ring buffer RB1 according to the current iteration (the changed elements may be regarded as fill elements, which is equivalent to performing a fill operation on the input matrix IN1). Therefore, in an application scenario in which a fill operation on the input matrix IN1 is to be implemented, the computing device 120 may reduce the space that fill elements occupy in the input buffer IB1.
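The equivalence stated in this summary can be checked numerically with a small Python sketch. Several points are assumptions made for the sake of a runnable example and are not taken from the description: each iteration is modelled as multiplying every ring-buffer slot by the current weight element and accumulating the products, the fill value is 0, the weight and input values are arbitrary, and the compact buffer carries pad_number zero slots at its head and tail so that every window has full length. The reference computation pads every row explicitly on both sides.

```python
ROWS, COLS = 4, 10
W = [1, 2, 3, 4, 5]            # hypothetical 1 x 5 weight row
PAD = (len(W) - 1) // 2        # pad_number = 2

# Arbitrary input values standing in for A00..A39.
A = [[r * COLS + c + 1 for c in range(COLS)] for r in range(ROWS)]


def reference(A, W):
    """Explicit approach: every row is stored with PAD zeros on each side."""
    out = []
    for row in A:
        padded = [0] * PAD + row + [0] * PAD
        out.extend(sum(W[k] * padded[c + k] for k in range(len(W)))
                   for c in range(COLS))
    return out


def ring_buffer_version(A, W):
    """Compact approach: rows stored back to back, fill applied per iteration."""
    flat = [x for row in A for x in row]
    ib = [0] * PAD + flat + [0] * PAD      # assumed head/tail slots only
    n = ROWS * COLS
    boundaries = range(COLS, n, COLS)      # one boundary per row transition
    acc = [0] * n
    for w_id, w in enumerate(W):
        ring = ib[w_id:w_id + n]           # consecutive elements moved to the ring
        offset = w_id - PAD                # mask_offset = w_id - pad_number
        for b in boundaries:
            lo, hi = (b, b - offset) if offset < 0 else (b - offset, b)
            for i in range(lo, hi):
                ring[i] = 0                # reset the fill range to the fill value
        for i in range(n):
            acc[i] += w * ring[i]
    return acc


same = reference(A, W) == ring_buffer_version(A, W)
explicit_size = ROWS * (COLS + 2 * PAD)    # slots if every row were padded
compact_size = ROWS * COLS + 2 * PAD       # slots used by this sketch's buffer
print(same, explicit_size, compact_size)
```

In this model the two approaches produce identical results, while the explicitly padded layout needs 56 slots and the compact layout of the sketch needs 44; the gap grows with the number of rows, since explicit padding adds 2 × pad_number slots per row.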
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.

Claims (31)

1. A computing device for performing multiplication of an input matrix and a weight matrix, the computing device comprising:
an input buffer for buffering the input matrix, wherein all elements of the input matrix are stored at a plurality of consecutive addresses of the input buffer;
a ring buffer;
a control circuit coupled to the input buffer and the ring buffer, wherein the control circuit moves a plurality of elements of the input matrix to the ring buffer from a segment of consecutive addresses of the input buffer, the control circuit defines at least one boundary position in the ring buffer based on a dimension of the input matrix, the control circuit defines at least one fill range at the at least one boundary position based on a current weight element of the weight matrix, and the control circuit changes element values of the plurality of elements in the at least one fill range of the ring buffer to achieve a fill operation on the input matrix; and
an arithmetic circuit coupled to the ring buffer to read the plurality of elements after the fill operation, wherein the arithmetic circuit uses the plurality of elements after the fill operation to perform the multiplication.
2. The computing device of claim 1, wherein the input buffer retrieves all elements of the input matrix from a storage unit external to the computing device.
3. The computing device of claim 2, wherein the input buffer comprises a general matrix multiplication (GEMM) input buffer.
4. The computing device of claim 2, wherein the storage unit comprises a high bandwidth memory.
5. The computing device of claim 1, wherein no fill elements of the fill operation are present in all elements of the input matrix that are deposited at the plurality of consecutive addresses of the input buffer.
6. The computing device of claim 1, wherein the arithmetic circuit provides the result of the multiplication to a memory or to another computing device.
7. The computing device of claim 6, wherein the computing device is a tensor core and the arithmetic circuit comprises a general matrix multiplication (GEMM) circuit.
8. The computing device of claim 6, wherein the other computing device comprises a vector core.
9. The computing device of claim 1, wherein the control circuit defines the at least one boundary position based on a number of columns of the input matrix.
10. The computing device of claim 9, wherein a distance between two adjacent boundary positions among the at least one boundary position is the number of columns of the input matrix.
11. The computing device of claim 1, wherein the control circuit calculates a fill length of the at least one fill range based on a column position of the current weight element in the weight matrix, and the at least one fill range is a range that starts at the at least one boundary position and extends for the fill length.
12. The computing device of claim 11, wherein the control circuit calculates mask_offset = w_id - pad_number to obtain the fill length, wherein mask_offset represents the fill length, w_id represents the column position of the current weight element in the weight matrix, and pad_number represents the number of columns that the fill operation should fill on one side of the input matrix.
13. The computing device of claim 12, wherein the number of columns pad_number to be filled on the side of the input matrix is pad_number = (w_column_number - 1)/2, where w_column_number represents the number of columns of the weight matrix.
14. The computing device of claim 1, wherein the control circuit resets all of the plurality of elements in the at least one fill range of the ring buffer to one and the same fill value to effect the fill operation on the input matrix.
15. The computing device of claim 14, wherein the same fill value is 0.
16. A method of operation of a computing device for performing multiplication of an input matrix and a weight matrix, the method comprising:
storing the input matrix at a plurality of consecutive addresses of an input buffer of the computing device;
moving a plurality of elements of the input matrix from a segment of consecutive addresses of the input buffer to a ring buffer of the computing device;
defining at least one boundary position in the ring buffer based on a dimension of the input matrix;
defining at least one filling range at the at least one boundary position based on a current weight element of the weight matrix;
changing element values of the plurality of elements in the at least one fill range of the ring buffer to effect a fill operation on the input matrix;
reading, by an arithmetic circuit of the computing device, the plurality of elements from the ring buffer after the fill operation, wherein the arithmetic circuit is coupled to the ring buffer; and
performing, by the arithmetic circuit, the multiplication using the plurality of elements after the fill operation.
17. The method of operation of claim 16, further comprising:
retrieving, by the input buffer, all elements of the input matrix from a storage unit external to the computing device.
18. The method of operation of claim 17, wherein the input buffer comprises a general matrix multiplication (GEMM) input buffer.
19. The method of operation of claim 17, wherein the storage unit comprises a high bandwidth memory.
20. The method of operation of claim 16, wherein no fill elements of the fill operation are present in all elements of the input matrix deposited at the plurality of consecutive addresses of the input buffer.
21. The method of operation of claim 16, further comprising:
providing, by the arithmetic circuit, the result of the multiplication to a memory or to another computing device.
22. The method of operation of claim 21, wherein the computing device is a tensor core and the arithmetic circuit comprises a general matrix multiplication (GEMM) circuit.
23. The method of operation of claim 21, wherein the other computing device comprises a vector core.
24. The method of operation of claim 16, further comprising:
defining the at least one boundary position based on a number of columns of the input matrix.
25. The method of operation of claim 24, wherein a distance between two adjacent boundary positions among the at least one boundary position is the number of columns of the input matrix.
26. The method of operation of claim 16, further comprising:
calculating a fill length of the at least one fill range based on a column position of the current weight element in the weight matrix, wherein the at least one fill range is a range that starts at the at least one boundary position and extends for the fill length.
27. The method of operation of claim 26, further comprising:
calculating mask_offset = w_id - pad_number to obtain the fill length, wherein mask_offset represents the fill length, w_id represents the column position of the current weight element in the weight matrix, and pad_number represents the number of columns that the fill operation should fill on one side of the input matrix.
28. The method of operation of claim 27, wherein the number of columns pad_number to be filled on the side of the input matrix is pad_number = (w_column_number - 1)/2, where w_column_number represents the number of columns of the weight matrix.
29. The method of operation of claim 16, further comprising:
resetting all of the plurality of elements in the at least one fill range of the ring buffer to one and the same fill value to effect the fill operation on the input matrix.
30. The method of operation of claim 29, wherein the same fill value is 0.
31. A machine-readable storage medium storing non-transitory machine-readable instructions which, when executed by a computer, implement the method of operation of any one of claims 16-30.
CN202410056538.8A 2024-01-16 2024-01-16 Computing device, method of operation, and machine-readable storage medium Active CN117574036B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410056538.8A CN117574036B (en) 2024-01-16 2024-01-16 Computing device, method of operation, and machine-readable storage medium

Publications (2)

Publication Number Publication Date
CN117574036A (en) 2024-02-20
CN117574036B (en) 2024-04-12

Family

ID=89895869

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410056538.8A Active CN117574036B (en) 2024-01-16 2024-01-16 Computing device, method of operation, and machine-readable storage medium

Country Status (1)

Country Link
CN (1) CN117574036B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109313663A (en) * 2018-01-15 2019-02-05 深圳鲲云信息科技有限公司 Artificial intelligence calculates Auxiliary Processing Unit, method, storage medium and terminal
US20190340486A1 (en) * 2018-05-04 2019-11-07 Apple Inc. Performing multiply and accumulate operations in neural network processor
CN111201525A (en) * 2017-10-18 2020-05-26 三菱电机株式会社 Arithmetic circuit and arithmetic method
CN112214727A (en) * 2017-07-07 2021-01-12 华为技术有限公司 Operation accelerator
US10990650B1 (en) * 2018-03-22 2021-04-27 Amazon Technologies, Inc. Reducing computations for data including padding
CN114902179A (en) * 2019-12-30 2022-08-12 高通股份有限公司 Method and apparatus for performing matrix multiplication in a streaming processor
CN116861149A (en) * 2023-09-05 2023-10-10 之江实验室 Convolution operation optimization method, device and processor
CN117194867A (en) * 2023-09-28 2023-12-08 西安电子科技大学 Arbitrary dimension matrix multiplication arithmetic unit based on FPGA

Also Published As

Publication number Publication date
CN117574036B (en) 2024-04-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant