Detailed Description
Reference will now be made in detail to the exemplary embodiments of the present invention, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings and the description to refer to the same or like parts.
The term "coupled" as used throughout this specification (including the claims) may refer to any direct or indirect connection. For example, if a first device couples (or connects) to a second device, that connection may be a direct connection, or an indirect connection via other devices and connections. The terms "first", "second", and the like in the description (including the claims) are used for naming components; they are not used to limit the number of components or to imply any priority, precedence, or order among them. Components/elements/steps in different embodiments that use the same reference numerals or the same terminology may refer to the related descriptions of one another.
By way of example, the embodiments described below may be applied to data management of a register file for general-purpose computing on a graphics processing unit (GPGPU). By way of example, but not limitation, the following embodiments may be applied to artificial intelligence (AI) scenarios. The following embodiments describe a data management manner in which a pooling device uses a register file, which improves the utilization efficiency of the register file and reduces the access bandwidth of the register file.
Fig. 1 is a schematic diagram of a circuit block of a pooling device 100 according to an embodiment of the invention. The pooling device 100 may perform a pooling operation on a data matrix (e.g., a tensor, an image, etc.). The pooling device 100 shown in fig. 1 includes a register file 110 and an execution unit (EU) 120. Depending on the space of the register file 110, the data matrix to be pooled may be divided into a plurality of blocks, and a complete pooling operation may be divided into a plurality of batch operations. During a first batch of the pooling operation, at least one first block and at least one second block of the plurality of blocks of the data matrix from the system 10 may be loaded into the register file 110. Depending on the implementation, the source of the data elements of the data matrix may be a memory of the system 10 or another operation module (e.g., another execution unit).
The execution unit 120 is coupled to the register file 110. During the first batch, the execution unit 120 may perform a first batch operation of the pooling operation on the at least one first block and the at least one second block in the register file 110.
FIG. 2 is a flow chart of a pooling method according to an embodiment of the invention. Please refer to fig. 1 and fig. 2. In step S210, based on the requirements of the execution unit 120, the system 10 may load at least one first block and at least one second block of the data matrix into the register file 110 during the first batch of the pooling operation. The data matrix to be pooled may be divided into a plurality of blocks, and the size of the blocks may be determined according to the space of the register file 110. After the loading is completed, the execution unit 120 may perform a first batch operation of the pooling operation on the at least one first block and the at least one second block in the register file 110 during the first batch (step S220). After the first batch operation is completed, the register file 110 may discard the at least one first block and retain the at least one second block during a second batch of the pooling operation, and at least one third block of the data matrix may be loaded into the register file 110 during the second batch (step S230). In step S240, the execution unit 120 may perform a second batch operation of the pooling operation on the at least one second block and the at least one third block in the register file 110 during the second batch. If the complete pooling operation consists of two batches, the execution unit 120 may generate a pooled matrix after completing the second batch operation (i.e., the iteration is complete). If the iteration is not complete, the pooling device 100 may perform further batch operations of the pooling operation on other portions of the data matrix. These further batch operations may be deduced by analogy from the descriptions of steps S230 to S240 and will not be repeated.
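The batch flow of steps S210 to S240 can be sketched in Python. This is a minimal illustrative sketch, not the claimed hardware: the function name `pool_batches` and the `reuse` parameter are assumptions introduced here, and the batch operation itself is passed in as a callable.

```python
def pool_batches(blocks, batch_op, reuse=1):
    """Run a pooling operation in batches over a list of blocks.

    blocks:   the data matrix split into blocks; the first batch loads
              the first and second blocks (step S210)
    batch_op: performs one batch of the pooling operation on the blocks
              currently resident in the register file (steps S220/S240)
    reuse:    number of trailing blocks shared between consecutive batches
    """
    results = []
    # First batch: the first and second blocks are resident (step S210).
    resident = blocks[:reuse + 1]
    results.append(batch_op(resident))          # step S220
    for nxt in blocks[reuse + 1:]:
        # Discard the block that is no longer needed, retain the shared
        # block, and load the next block (step S230).
        resident = resident[-reuse:] + [nxt]
        results.append(batch_op(resident))      # step S240
    return results
```

With three blocks and `reuse=1`, the first batch operates on blocks 1 and 2 and the second batch on blocks 2 and 3, mirroring the description above: only one new block is loaded per batch instead of two.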
Based on the above, the data matrix may be divided into a plurality of blocks (each block containing one or more data elements) according to the limited space of the register file 110, and a complete pooling operation may be divided into a plurality of batch operations. In the first batch of the pooling operation, one or more blocks corresponding to the first batch may be loaded into the register file 110. Because different batch operations may use some of the same blocks, after each batch operation one or more blocks that have already been used may be retained in the register file 110. For example, after the first batch operation is completed, the second block, which is used by both the first batch operation and the second batch operation, is retained in the register file 110 to save data transmission bandwidth. One or more other blocks (third blocks) that will be used in the second batch operation may then be loaded into the register file 110. In the register file 110, the first block, which was used by the first batch operation but is no longer needed by the second batch operation, is discarded (e.g., replaced/overwritten by the third block) to save space in the register file 110. Thus, the pooling device 100 can efficiently manage and use the space of the register file 110 for the pooling operation.
FIG. 3 is a flow chart of a pooling method according to another embodiment of the invention. Please refer to fig. 1 and fig. 3. Before performing the pooling operation, the execution unit 120 may calculate a workspace size according to the size of the data matrix, the size of the register file 110, the size of the pooling window (or depth slice) of the pooling operation, and the stride of the pooling operation, and allocate the space of the register file 110 to the pooling operation according to the workspace size (step S310). The workspace size may be determined based on the actual design. For example, in some embodiments, the number of rows of the workspace is greater than or equal to the number of rows of the pooling window of the pooling operation, and the number of columns of the workspace is greater than the number of columns of the pooling window. In other embodiments, the number of columns of the workspace is greater than or equal to the number of columns of the pooling window, and the number of rows of the workspace is greater than the number of rows of the pooling window. In still other embodiments, the workspace size may be determined using Equation 1 below. The capacity of a single register depends on the hardware design specification, the data size of the padding area depends on the size of the pooling window and the computation stride, and the total data size (or the data size of a single load) may be planned reasonably according to the hardware resources. After the workspace size is calculated, the execution unit 120 may perform the pooling operation (steps S320 to S340).
Workspace size = … (Equation 1)
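The body of Equation 1 is not reproduced in this text, so the sketch below only illustrates one plausible sizing consistent with the surrounding description and with the example of registers R1 to R24 later in this section. The function name and the specific formula (blocks that are `stride` rows tall, with enough blocks resident to cover one pooling window) are assumptions, not the source's Equation 1.

```python
import math

def workspace_registers(matrix_cols, window_rows, stride):
    """One plausible workspace sizing (an assumption, not Equation 1):
    each block is `stride` rows tall, and enough blocks stay resident to
    cover one pooling window, so consecutive batches can share blocks."""
    block_rows = stride
    # Resident rows >= window rows, matching the constraint stated above.
    resident_blocks = math.ceil(window_rows / stride)
    return resident_blocks * block_rows * matrix_cols
```

Under these assumptions, a 6-column matrix with a 3-row pooling window and stride 2 needs 2 resident blocks of 2 rows, i.e. 4 rows × 6 columns = 24 registers, which matches the allocation of R1 to R24 in the example of fig. 5A to 5H.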
In step S320, the execution unit 120 may load data into the register file 110. For example, the execution unit 120 may load at least one first block and at least one second block of the data matrix into the register file 110. In step S330, the execution unit 120 may perform a pooling calculation on the data in the register file 110. For example, after completing the loading, the execution unit 120 may perform a first batch operation of the pooling operation on the first block and the second block in the register file 110 during the first batch of the pooling operation. In step S340, the execution unit 120 may determine whether the iteration is complete. For example, the execution unit 120 may check whether all batch operations of the pooling operation for one data matrix are complete. When the pooling operation is not yet complete ("no" in step S340), the execution unit 120 may return to step S320 to load the next block (or the next batch of blocks) into the register file 110. When all the batch operations of the pooling operation have been completed ("yes" in step S340), the execution unit 120 may output the result of the pooling operation (the pooled matrix) to the system 10 (step S350).
Fig. 4 is a flow chart of a pooling method according to yet another embodiment of the invention. The steps S410, S420, S440, and S450 shown in fig. 4 may be deduced by analogy from the steps S310, S320, S330, and S340 shown in fig. 3, and are therefore not repeated. For example, the execution unit 120 may load at least one first block and at least one second block of the data matrix into the register file 110. After the first block and the second block are loaded into the register file 110 (step S420), and before the first batch operation of the pooling operation is performed on the first block and the second block (step S440), the register file 110 may perform data rearrangement on the first block and the second block (step S430) to meet the requirement of the pooling operation. The pooling method shown in fig. 4 will be described below with a specific example. The requirement of the pooling operation is that, when the hardware performs the pooling operation, register numbers should increment monotonically in the row direction or the column direction (i.e., the rows are continuous or the columns are continuous), which facilitates hardware addressing. The data rearrangement that achieves this is specifically illustrated in fig. 5C and fig. 5G.
Fig. 5A to 5H are schematic diagrams illustrating the operation of the pooling device 100 pooling the data matrix in different steps of the flow shown in fig. 4 according to an embodiment of the present invention. The system 10 and the register file 110 shown in fig. 5A can refer to the related descriptions of the system 10 and the register file 110 shown in fig. 1, and thus will not be described in detail. Please refer to fig. 1, fig. 4 and fig. 5A. It is assumed here that the data matrix DM to be pooled in the system 10 is a 6×6 matrix. In the data matrix DM, each small rectangle represents a data element of the data matrix DM, and the numbers in these 6×6 small rectangles identify data elements at different positions of the data matrix DM. That is, the numbers 1 to 36 in these 6×6 small rectangles merely identify positions and do not limit the actual values of the data elements of the data matrix DM. The data matrix DM to be pooled may be divided into a plurality of blocks, and the size of the blocks may be determined according to the space of the register file 110. For example, the data matrix DM may be divided into a block DM1, a block DM2, and a block DM3, as shown in fig. 5A. The block DM1 includes the first and second rows of the data matrix DM, the block DM2 includes the third and fourth rows, and the block DM3 includes the fifth and sixth rows. However, the division of the data matrix DM should not be limited to the example shown in fig. 5A. For example, in other embodiments, the block DM1 includes the first and second columns of the data matrix DM, the block DM2 includes the third and fourth columns, and the block DM3 includes the fifth and sixth columns.
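The division of the 6×6 example matrix into row blocks DM1, DM2, and DM3 can be sketched as follows; the function name `split_into_row_blocks` is illustrative, not from the source.

```python
def split_into_row_blocks(matrix, rows_per_block):
    """Split a matrix (a list of rows) into consecutive row blocks."""
    return [matrix[i:i + rows_per_block]
            for i in range(0, len(matrix), rows_per_block)]

# The 6x6 example matrix, with positions numbered 1..36 as in fig. 5A.
dm = [[r * 6 + c + 1 for c in range(6)] for r in range(6)]
dm1, dm2, dm3 = split_into_row_blocks(dm, 2)
# dm1 holds rows 1-2, dm2 rows 3-4, dm3 rows 5-6 of the data matrix.
```

A column-wise division, as in the alternative embodiment, would split the transposed matrix the same way.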
Before performing the pooling operation, the execution unit 120 may calculate a workspace size according to the size of the data matrix DM, the size of the register file 110, the size of the pooling window of the pooling operation, and/or the stride of the pooling operation, and allocate the space of the register file 110 to the pooling operation according to the workspace size (step S410). In the embodiment shown in fig. 5A to 5H, it is assumed that the register file 110 allocates the registers R1 to R24 to the pooling operation according to the workspace size.
Please refer to fig. 1, fig. 4 and fig. 5B. In step S420, the execution unit 120 may load data into the register file 110. For example, the execution unit 120 may load at least one first block (e.g., the block DM1) and at least one second block (e.g., the block DM2) of the data matrix DM into the registers R1 to R24 of the register file 110, as shown in fig. 5B. It is assumed here that the data elements of the first row of the data matrix DM are loaded into the registers R1, R3, R9, R11, R17 and R19, respectively, the data elements of the second row are loaded into the registers R2, R4, R10, R12, R18 and R20, respectively, the data elements of the third row are loaded into the registers R5, R7, R13, R15, R21 and R23, respectively, and the data elements of the fourth row are loaded into the registers R6, R8, R14, R16, R22 and R24, respectively. Since the data placement of the registers R1 to R24 shown in fig. 5B does not meet the requirement of the pooling operation, the register file 110 may perform data rearrangement in step S430.
Please refer to fig. 1, fig. 4 and fig. 5C. The register file 110 may reorder the data of the blocks DM1 and DM2 in step S430 to meet the requirement of the pooling operation. It is assumed here that the data elements of the first row of the data matrix DM are rearranged to the register R1, the register R5, the register R9, the register R13, the register R17 and the register R21, respectively, the data elements of the second row of the data matrix DM are rearranged to the register R2, the register R6, the register R10, the register R14, the register R18 and the register R22, respectively, the data elements of the third row of the data matrix DM are rearranged to the register R3, the register R7, the register R11, the register R15, the register R19 and the register R23, respectively, and the data elements of the fourth row of the data matrix DM are rearranged to the register R4, the register R8, the register R12, the register R16, the register R20 and the register R24, respectively.
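The rearranged placement described above follows a simple pattern: within the four resident rows, register numbers increment continuously down each column, which satisfies the column-continuity requirement for hardware addressing. A minimal Python sketch of that mapping (the function name is illustrative, not from the source):

```python
def rearranged_register(row, col, resident_rows=4):
    """1-based register number after rearrangement: registers are numbered
    continuously down each column of the resident workspace, so the element
    at (row, col), both 0-based, lands in register resident_rows*col+row+1."""
    return resident_rows * col + row + 1
```

For example, the first resident row (row 0) maps to R1, R5, R9, R13, R17, R21 across the six columns, and the third resident row (row 2) maps to R3, R7, R11, R15, R19, R23, matching fig. 5C.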
The execution unit 120 shown in fig. 5D may refer to the related description of the execution unit 120 shown in fig. 1, and thus will not be described again. Please refer to fig. 1, fig. 4 and fig. 5D. In step S440, the execution unit 120 may perform a pooling calculation on the data in the registers R1 to R24 of the register file 110. For example, after completing the data rearrangement, the execution unit 120 may perform the first batch operation of the pooling operation on the blocks DM1 and DM2 in the registers R1 to R24 of the register file 110 during the first batch of the pooling operation. By way of example, but not limitation, it may be assumed that the pooling operation performed by the execution unit 120 is maximum pooling (max pooling), the pooling window PW of the pooling operation is a 3×3 window, and the stride of the pooling operation is 2. The execution unit 120 may perform the first batch operation on the blocks DM1 and DM2 in the registers R1 to R24 to generate a portion of the data elements of the pooled matrix PM (as shown in fig. 5D). After the first batch operation is completed, the block DM2, which is used by both the first batch operation and the second batch operation, is retained in the register file 110 to save data transmission bandwidth. In the register file 110, the block DM1, which was used by the first batch operation but is no longer needed by the second batch operation, is discarded (overwritten) to save space in the register file 110.
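With the position numbers 1 to 36 standing in for the data elements, the first batch of 3×3 max pooling at stride 2 can be worked through in Python. This is a sketch of the arithmetic only, not the hardware; the function name `max_pool_rows` is illustrative.

```python
def max_pool_rows(rows, window=3, stride=2):
    """Max pooling over one horizontal band of the matrix: slide the
    window across all column positions to produce one output row."""
    out = []
    for c in range(0, len(rows[0]) - window + 1, stride):
        out.append(max(rows[r][c + k]
                       for r in range(window) for k in range(window)))
    return out

# The 6x6 example matrix with position numbers 1..36.
dm = [[r * 6 + c + 1 for c in range(6)] for r in range(6)]
# First batch: blocks DM1 and DM2 (rows 1-4) are resident; the 3x3
# window only needs rows 1-3 for the first output row of PM.
first_row = max_pool_rows(dm[0:3])   # -> [15, 17]
```

The second batch, operating on rows 3 to 5 (blocks DM2 and DM3), yields `[27, 29]`, completing the 2×2 pooled matrix PM.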
After the first batch operation of the pooling operation on the blocks DM1 and DM2 (step S440), the execution unit 120 may determine whether the iteration is complete (step S450). For example, the execution unit 120 may check whether all batch operations of the pooling operation for one data matrix DM are completed. When the pooling of the data matrix DM is not yet complete ("no" in step S450), the execution unit 120 may proceed to step S460. After the first batch operation of the pooling operation on the block DM1 (first block) and the block DM2 (second block) (step S440), and before the block DM3 (third block) is loaded into the register file 110 (step S420), the register file 110 may rearrange the data of the block DM2 (step S460). Step S460 may discard the block DM1 and retain the block DM2.
Please refer to fig. 1, fig. 4 and fig. 5E. In step S460, the register file 110 may reorder the data of the block DM2, as shown in fig. 5E. It is assumed that the data elements of the third row of the data matrix DM are rearranged to the register R1, the register R3, the register R9, the register R11, the register R17 and the register R19, respectively, and the data elements of the fourth row of the data matrix DM are rearranged to the register R2, the register R4, the register R10, the register R12, the register R18 and the register R20, respectively. The registers R5, R7, R13, R15, R21 and R23 are used to load data elements of one new row of the data matrix DM, while the registers R6, R8, R14, R16, R22 and R24 are used to load data elements of another new row of the data matrix DM.
After the data rearrangement of the block DM2 (second block) (step S460), the register file 110 loads the block DM3 (third block) during the second batch of the pooling operation (step S420). Please refer to fig. 1, fig. 4 and fig. 5F. In step S420, the register file 110 may load the block DM3 of the data matrix DM into the register file 110, as shown in fig. 5F. It is assumed here that the data elements of the fifth row of the data matrix DM are loaded into the registers R5, R7, R13, R15, R21 and R23, respectively, and the data elements of the sixth row of the data matrix DM are loaded into the registers R6, R8, R14, R16, R22 and R24, respectively. Since the data placement of the registers R1 to R24 shown in fig. 5F does not meet the requirement of the pooling operation, the register file 110 may perform data rearrangement in step S430.
After the block DM3 (the third block) is loaded into the register file 110 (step S420), and before the second batch operation (step S440) of pooling operations are performed on the block DM2 (the second block) and the block DM3 (the third block), the register file 110 may perform data rearrangement (step S430) on the block DM2 and the block DM3 to meet the requirement of the pooling operation. Please refer to fig. 1, fig. 4 and fig. 5G. It is assumed here that the data elements of the third row of the data matrix DM are rearranged to the register R1, the register R5, the register R9, the register R13, the register R17 and the register R21, respectively, the data elements of the fourth row of the data matrix DM are rearranged to the register R2, the register R6, the register R10, the register R14, the register R18 and the register R22, respectively, the data elements of the fifth row of the data matrix DM are rearranged to the register R3, the register R7, the register R11, the register R15, the register R19 and the register R23, respectively, and the data elements of the sixth row of the data matrix DM are rearranged to the register R4, the register R8, the register R12, the register R16, the register R20 and the register R24, respectively.
Please refer to fig. 1, fig. 4 and fig. 5H. After completing the data rearrangement (step S430), the execution unit 120 may perform the second batch operation of the pooling operation on the blocks DM2 and DM3 in the registers R1 to R24 of the register file 110 during the second batch of the pooling operation (step S440). The execution unit 120 may perform the second batch operation on the blocks DM2 and DM3 in the registers R1 to R24 to generate another portion of the data elements of the pooled matrix PM (as shown in fig. 5H). In the embodiment shown in fig. 5A to 5H, all batch operations of the pooling operation have been completed after the second batch operation is completed. When all the batch operations of the pooling operation have been completed ("yes" in step S450), the execution unit 120 may output the pooled matrix PM (the result of the pooling operation) to the system 10 (step S470).
In summary, the data matrix DM may be divided into a plurality of blocks (e.g., blocks DM1, DM2, and DM 3). Depending on the limited space of the register file 110, a complete pooling operation may be divided into a plurality of batch operations. In the first batch operation of the pooling operation, the blocks DM1 and DM2 corresponding to the first batch operation may be loaded into the register file 110. After the first batch operation is completed for the blocks DM1 and DM2, the block DM2 used for both the first batch operation and the second batch operation is reserved in the register file 110 to save the data transmission bandwidth. Then, the block DM3 that would be used for the second batch operation may be loaded into the register file 110. In the register file 110, the block DM1 used by the first batch operation but no longer needed by the second batch operation is discarded (e.g., replaced/overwritten by the block DM 3) to save space in the register file 110. Thus, the pooling device 100 can efficiently manage and use the space of the register file 110 for the pooling operation.
Depending on design requirements, the execution unit 120 may be implemented in hardware, firmware, software, or any combination thereof. In hardware, the execution unit 120 may be implemented as logic circuitry on an integrated circuit. The relevant functions of the execution unit 120 may be implemented as hardware using a hardware description language (e.g., Verilog HDL or VHDL) or another suitable programming language. For example, the relevant functions of the execution unit 120 may be implemented in various logic blocks, modules, and circuits in one or more controllers, microcontrollers, microprocessors, application-specific integrated circuits (ASICs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs), and/or other processing units. The relevant functions of the execution unit 120 may also be implemented as programming code in software and/or firmware. For example, the execution unit 120 may be implemented using a general-purpose programming language (e.g., C, C++, or assembly language) or another suitable programming language. The programming code may be recorded/stored on a non-transitory computer-readable medium. In some embodiments, the non-transitory computer-readable medium includes, for example, a tape, a disk, a card, semiconductor memory, programmable logic circuitry, and/or a storage device. The storage device includes a hard disk drive (HDD), a solid-state drive (SSD), or another storage device. A central processing unit (CPU), controller, microcontroller, or microprocessor can read and execute the programming code from the non-transitory computer-readable medium to perform the relevant functions of the execution unit 120.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.