Detailed Description
Reference will now be made in detail to the exemplary embodiments of the present invention, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings and the description to refer to the same or like parts.
The term "coupled" as used throughout this specification (including the claims) may refer to any direct or indirect connection. For example, if a first device couples (or connects) to a second device, that connection may be a direct connection, or an indirect connection via other devices and connections. The terms "first", "second", and the like in the description (including the claims) are used for naming components; they are not used to limit the number of components or to imply any priority, precedence, or order among them. Components/elements/steps in different embodiments that use the same reference numerals or the same terminology may refer to the related descriptions of one another.
By way of example, the embodiments described below may be applied to data management of a register file for general-purpose computing on a graphics processing unit (GPGPU). By way of example, but not limitation, the following embodiments may be applied to artificial intelligence (AI) scenarios. The following embodiments describe a data management manner in which a pooling device uses a register file, which improves the utilization efficiency of the register file and reduces the access bandwidth of the register file.
Fig. 1 is a schematic diagram of a circuit block of a pooling device 100 according to an embodiment of the invention. The pooling device 100 may perform a pooling operation on a data matrix (e.g., a tensor, an image, etc.). The pooling device 100 shown in fig. 1 includes a register file 110 and an execution unit (EU) 120. Depending on the space of the register file 110, the data matrix to be pooled may be divided into a plurality of blocks, and a complete pooling operation may be divided into a plurality of batch operations. During a first batch of the pooling operation, at least one first block and at least one second block of the plurality of blocks of the data matrix from the system 10 may be loaded into the register file 110. Depending on the implementation, the source of the data elements of the data matrix may be a memory of the system 10 or another operation module (e.g., another execution unit).
The execution unit 120 is coupled to the register file 110. During the first batch, the execution unit 120 may perform a first batch operation of the pooling operation on the at least one first block and the at least one second block in the register file 110.
FIG. 2 is a flow chart of a pooling method according to an embodiment of the invention. Please refer to fig. 1 and fig. 2. In step S210, based on the requirements of the execution unit 120, the system 10 may load at least one first block and at least one second block of the data matrix into the register file 110 during the first batch of the pooling operation. The data matrix to be pooled may be divided into a plurality of blocks, and the size of the blocks may be determined according to the space of the register file 110. After the loading is completed, the execution unit 120 may perform a first batch operation of the pooling operation on the at least one first block and the at least one second block in the register file 110 during the first batch (step S220). After the first batch operation is completed, the register file 110 may discard the at least one first block and retain the at least one second block during a second batch of the pooling operation, and at least one third block of the data matrix may be loaded into the register file 110 during the second batch (step S230). In step S240, the execution unit 120 may perform a second batch operation of the pooling operation on the at least one second block and the at least one third block in the register file 110 during the second batch. If the complete pooling operation consists of two batches, the execution unit 120 may generate a pooled matrix after completing the second batch operation (i.e., the iteration is complete). If the iteration is not complete, the pooling device 100 may perform further batch operations of the pooling operation on other portions of the data matrix. These further batch operations may be deduced by analogy from the descriptions of steps S230 to S240 and will not be repeated.
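The batch flow of steps S210 to S240 can be sketched in Python. This is a minimal illustrative sketch, not the claimed hardware: the function name `pool_batches` and the `reuse` parameter are assumptions introduced here, and the batch operation itself is passed in as a callable.

```python
def pool_batches(blocks, batch_op, reuse=1):
    """Run a pooling operation in batches over a list of blocks.

    blocks:   the data matrix split into blocks; the first batch loads
              the first and second blocks (step S210)
    batch_op: performs one batch of the pooling operation on the blocks
              currently resident in the register file (steps S220/S240)
    reuse:    number of trailing blocks shared between consecutive batches
    """
    results = []
    # First batch: the first and second blocks are resident (step S210).
    resident = blocks[:reuse + 1]
    results.append(batch_op(resident))          # step S220
    for nxt in blocks[reuse + 1:]:
        # Discard the block that is no longer needed, retain the shared
        # block, and load the next block (step S230).
        resident = resident[-reuse:] + [nxt]
        results.append(batch_op(resident))      # step S240
    return results
```

With three blocks and `reuse=1`, the first batch operates on blocks 1 and 2 and the second batch on blocks 2 and 3, mirroring the description above: only one new block is loaded per batch instead of two.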
Based on the above, the data matrix may be divided into a plurality of blocks (each block containing one or more data elements) according to the limited space of the register file 110, and a complete pooling operation may be divided into a plurality of batch operations. In the first batch of the pooling operation, one or more blocks corresponding to the first batch may be loaded into the register file 110. Because different batch operations may use some of the same blocks, after each batch operation one or more blocks that have already been used may be retained in the register file 110. For example, after the first batch operation is completed, the second block, which is used by both the first batch operation and the second batch operation, is retained in the register file 110 to save data transmission bandwidth. One or more other blocks (third blocks) that will be used in the second batch operation may then be loaded into the register file 110. In the register file 110, the first block, which was used by the first batch operation but is no longer needed by the second batch operation, is discarded (e.g., replaced/overwritten by the third block) to save space in the register file 110. Thus, the pooling device 100 can efficiently manage and use the space of the register file 110 for the pooling operation.
FIG. 3 is a flow chart of a pooling method according to another embodiment of the invention. Please refer to fig. 1 and fig. 3. Before performing the pooling operation, the execution unit 120 may calculate a workspace size according to the size of the data matrix, the size of the register file 110, the size of the pooling window (or depth slice) of the pooling operation, and the stride of the pooling operation, and allocate the space of the register file 110 to the pooling operation according to the workspace size (step S310). The workspace size may be determined based on the actual design. For example, in some embodiments, the number of rows of the workspace is greater than or equal to the number of rows of the pooling window of the pooling operation, and the number of columns of the workspace is greater than the number of columns of the pooling window. In other embodiments, the number of columns of the workspace is greater than or equal to the number of columns of the pooling window, and the number of rows of the workspace is greater than the number of rows of the pooling window. In still other embodiments, the workspace size may be determined using Equation 1 below. The capacity of a single register depends on the hardware design specification, the data size of the padding area depends on the size of the pooling window and the computation stride, and the total data size (or the data size of a single load) may be planned reasonably according to the hardware resources. After the workspace size is calculated, the execution unit 120 may perform the pooling operation (steps S320 to S340).
Workspace size = … (Equation 1)
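The body of Equation 1 is not reproduced in this text, so the sketch below only illustrates one plausible sizing consistent with the surrounding description and with the example of registers R1 to R24 later in this section. The function name and the specific formula (blocks that are `stride` rows tall, with enough blocks resident to cover one pooling window) are assumptions, not the source's Equation 1.

```python
import math

def workspace_registers(matrix_cols, window_rows, stride):
    """One plausible workspace sizing (an assumption, not Equation 1):
    each block is `stride` rows tall, and enough blocks stay resident to
    cover one pooling window, so consecutive batches can share blocks."""
    block_rows = stride
    # Resident rows >= window rows, matching the constraint stated above.
    resident_blocks = math.ceil(window_rows / stride)
    return resident_blocks * block_rows * matrix_cols
```

Under these assumptions, a 6-column matrix with a 3-row pooling window and stride 2 needs 2 resident blocks of 2 rows, i.e. 4 rows × 6 columns = 24 registers, which matches the allocation of R1 to R24 in the example of fig. 5A to 5H.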
In step S320, the execution unit 120 may load data into the register file 110. For example, the execution unit 120 may load at least one first block and at least one second block of the data matrix into the register file 110. In step S330, the execution unit 120 may perform a pooling calculation on the data in the register file 110. For example, after completing the loading, the execution unit 120 may perform a first batch operation of the pooling operation on the first block and the second block in the register file 110 during the first batch of the pooling operation. In step S340, the execution unit 120 may determine whether the iteration is complete. For example, the execution unit 120 may check whether all batch operations of the pooling operation for one data matrix are complete. When the pooling operation is not yet complete ("no" in step S340), the execution unit 120 may return to step S320 to load the next block (or the next batch of blocks) into the register file 110. When all the batch operations of the pooling operation have been completed ("yes" in step S340), the execution unit 120 may output the result of the pooling operation (the pooled matrix) to the system 10 (step S350).
Fig. 4 is a flow chart of a pooling method according to yet another embodiment of the invention. The steps S410, S420, S440, and S450 shown in fig. 4 may be deduced by analogy from the steps S310, S320, S330, and S340 shown in fig. 3, and are therefore not repeated. For example, the execution unit 120 may load at least one first block and at least one second block of the data matrix into the register file 110. After the first block and the second block are loaded into the register file 110 (step S420), and before the first batch operation of the pooling operation is performed on the first block and the second block (step S440), the register file 110 may perform data rearrangement on the first block and the second block (step S430) to meet the requirement of the pooling operation. The pooling method shown in fig. 4 will be described below with a specific example. The requirement of the pooling operation is that, when the hardware performs the pooling operation, register numbers should increment monotonically in the row direction or the column direction (i.e., the rows are continuous or the columns are continuous), which facilitates hardware addressing. The data rearrangement that achieves this is specifically illustrated in fig. 5C and fig. 5G.
Fig. 5A to 5H are schematic diagrams illustrating the operation of the pooling device 100 pooling the data matrix in different steps of the flow shown in fig. 4 according to an embodiment of the present invention. The system 10 and the register file 110 shown in fig. 5A can refer to the related descriptions of the system 10 and the register file 110 shown in fig. 1, and thus will not be described in detail. Please refer to fig. 1, fig. 4 and fig. 5A. It is assumed here that the data matrix DM to be pooled in the system 10 is a 6×6 matrix. In the data matrix DM, each small rectangle represents a data element of the data matrix DM, and the numbers in these 6×6 small rectangles identify data elements at different positions of the data matrix DM. That is, the numbers 1 to 36 in these 6×6 small rectangles merely identify positions and do not limit the actual values of the data elements of the data matrix DM. The data matrix DM to be pooled may be divided into a plurality of blocks, and the size of the blocks may be determined according to the space of the register file 110. For example, the data matrix DM may be divided into a block DM1, a block DM2, and a block DM3, as shown in fig. 5A. The block DM1 includes the first and second rows of the data matrix DM, the block DM2 includes the third and fourth rows, and the block DM3 includes the fifth and sixth rows. However, the division of the data matrix DM should not be limited to the example shown in fig. 5A. For example, in other embodiments, the block DM1 includes the first and second columns of the data matrix DM, the block DM2 includes the third and fourth columns, and the block DM3 includes the fifth and sixth columns.
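The division of the 6×6 example matrix into row blocks DM1, DM2, and DM3 can be sketched as follows; the function name `split_into_row_blocks` is illustrative, not from the source.

```python
def split_into_row_blocks(matrix, rows_per_block):
    """Split a matrix (a list of rows) into consecutive row blocks."""
    return [matrix[i:i + rows_per_block]
            for i in range(0, len(matrix), rows_per_block)]

# The 6x6 example matrix, with positions numbered 1..36 as in fig. 5A.
dm = [[r * 6 + c + 1 for c in range(6)] for r in range(6)]
dm1, dm2, dm3 = split_into_row_blocks(dm, 2)
# dm1 holds rows 1-2, dm2 rows 3-4, dm3 rows 5-6 of the data matrix.
```

A column-wise division, as in the alternative embodiment, would split the transposed matrix the same way.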
Before performing the pooling operation, the execution unit 120 may calculate a workspace size according to the size of the data matrix DM, the size of the register file 110, the size of the pooling window of the pooling operation, and/or the stride of the pooling operation, and allocate the space of the register file 110 to the pooling operation according to the workspace size (step S410). In the embodiment shown in fig. 5A to 5H, it is assumed that the register file 110 allocates the registers R1 to R24 to the pooling operation according to the workspace size.
Please refer to fig. 1, fig. 4 and fig. 5B. In step S420, the execution unit 120 may load data into the register file 110. For example, the execution unit 120 may load at least one first block (e.g., the block DM1) and at least one second block (e.g., the block DM2) of the data matrix DM into the registers R1 to R24 of the register file 110, as shown in fig. 5B. It is assumed here that the data elements of the first row of the data matrix DM are loaded into the registers R1, R3, R9, R11, R17 and R19, respectively, the data elements of the second row are loaded into the registers R2, R4, R10, R12, R18 and R20, respectively, the data elements of the third row are loaded into the registers R5, R7, R13, R15, R21 and R23, respectively, and the data elements of the fourth row are loaded into the registers R6, R8, R14, R16, R22 and R24, respectively. Since the data placement of the registers R1 to R24 shown in fig. 5B does not meet the requirement of the pooling operation, the register file 110 may perform data rearrangement in step S430.
Please refer to fig. 1, fig. 4 and fig. 5C. The register file 110 may reorder the data of the blocks DM1 and DM2 in step S430 to meet the requirement of the pooling operation. It is assumed here that the data elements of the first row of the data matrix DM are rearranged to the register R1, the register R5, the register R9, the register R13, the register R17 and the register R21, respectively, the data elements of the second row of the data matrix DM are rearranged to the register R2, the register R6, the register R10, the register R14, the register R18 and the register R22, respectively, the data elements of the third row of the data matrix DM are rearranged to the register R3, the register R7, the register R11, the register R15, the register R19 and the register R23, respectively, and the data elements of the fourth row of the data matrix DM are rearranged to the register R4, the register R8, the register R12, the register R16, the register R20 and the register R24, respectively.
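The rearranged placement described above follows a simple pattern: within the four resident rows, register numbers increment continuously down each column, which satisfies the column-continuity requirement for hardware addressing. A minimal Python sketch of that mapping (the function name is illustrative, not from the source):

```python
def rearranged_register(row, col, resident_rows=4):
    """1-based register number after rearrangement: registers are numbered
    continuously down each column of the resident workspace, so the element
    at (row, col), both 0-based, lands in register resident_rows*col+row+1."""
    return resident_rows * col + row + 1
```

For example, the first resident row (row 0) maps to R1, R5, R9, R13, R17, R21 across the six columns, and the third resident row (row 2) maps to R3, R7, R11, R15, R19, R23, matching fig. 5C.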
The execution unit 120 shown in fig. 5D may refer to the related description of the execution unit 120 shown in fig. 1, and thus will not be described again. Please refer to fig. 1, fig. 4 and fig. 5D. In step S440, the execution unit 120 may perform a pooling calculation on the data in the registers R1 to R24 of the register file 110. For example, after completing the data rearrangement, the execution unit 120 may perform the first batch operation of the pooling operation on the blocks DM1 and DM2 in the registers R1 to R24 of the register file 110 during the first batch of the pooling operation. By way of example, but not limitation, it may be assumed that the pooling operation performed by the execution unit 120 is maximum pooling (max pooling), the pooling window PW of the pooling operation is a 3×3 window, and the stride of the pooling operation is 2. The execution unit 120 may perform the first batch operation on the blocks DM1 and DM2 in the registers R1 to R24 to generate a portion of the data elements of the pooled matrix PM (as shown in fig. 5D). After the first batch operation is completed, the block DM2, which is used by both the first batch operation and the second batch operation, is retained in the register file 110 to save data transmission bandwidth. In the register file 110, the block DM1, which was used by the first batch operation but is no longer needed by the second batch operation, is discarded (overwritten) to save space in the register file 110.
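With the position numbers 1 to 36 standing in for the data elements, the first batch of 3×3 max pooling at stride 2 can be worked through in Python. This is a sketch of the arithmetic only, not the hardware; the function name `max_pool_rows` is illustrative.

```python
def max_pool_rows(rows, window=3, stride=2):
    """Max pooling over one horizontal band of the matrix: slide the
    window across all column positions to produce one output row."""
    out = []
    for c in range(0, len(rows[0]) - window + 1, stride):
        out.append(max(rows[r][c + k]
                       for r in range(window) for k in range(window)))
    return out

# The 6x6 example matrix with position numbers 1..36.
dm = [[r * 6 + c + 1 for c in range(6)] for r in range(6)]
# First batch: blocks DM1 and DM2 (rows 1-4) are resident; the 3x3
# window only needs rows 1-3 for the first output row of PM.
first_row = max_pool_rows(dm[0:3])   # -> [15, 17]
```

The second batch, operating on rows 3 to 5 (blocks DM2 and DM3), yields `[27, 29]`, completing the 2×2 pooled matrix PM.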
After the first batch operation of the pooling operation on the blocks DM1 and DM2 (step S440), the execution unit 120 may determine whether the iteration is complete (step S450). For example, the execution unit 120 may check whether all batch operations of the pooling operation for one data matrix DM are completed. When the pooling of the data matrix DM is not yet complete ("no" in step S450), the execution unit 120 may proceed to step S460. After the first batch operation of the pooling operation on the block DM1 (first block) and the block DM2 (second block) (step S440), and before the block DM3 (third block) is loaded into the register file 110 (step S420), the register file 110 may rearrange the data of the block DM2 (step S460). Step S460 may discard the block DM1 and retain the block DM2.
Please refer to fig. 1, fig. 4 and fig. 5E. In step S460, the register file 110 may reorder the data of the block DM2, as shown in fig. 5E. It is assumed that the data elements of the third row of the data matrix DM are rearranged to the register R1, the register R3, the register R9, the register R11, the register R17 and the register R19, respectively, and the data elements of the fourth row of the data matrix DM are rearranged to the register R2, the register R4, the register R10, the register R12, the register R18 and the register R20, respectively. The registers R5, R7, R13, R15, R21 and R23 are used to load data elements of one new row of the data matrix DM, while the registers R6, R8, R14, R16, R22 and R24 are used to load data elements of another new row of the data matrix DM.
After the data rearrangement of the block DM2 (second block) (step S460), the register file 110 loads the block DM3 (third block) during the second batch of the pooling operation (step S420). Please refer to fig. 1, fig. 4 and fig. 5F. In step S420, the register file 110 may load the block DM3 of the data matrix DM into the register file 110, as shown in fig. 5F. It is assumed here that the data elements of the fifth row of the data matrix DM are loaded into the registers R5, R7, R13, R15, R21 and R23, respectively, and the data elements of the sixth row of the data matrix DM are loaded into the registers R6, R8, R14, R16, R22 and R24, respectively. Since the data placement of the registers R1 to R24 shown in fig. 5F does not meet the requirement of the pooling operation, the register file 110 may perform data rearrangement in step S430.
After the block DM3 (the third block) is loaded into the register file 110 (step S420), and before the second batch operation (step S440) of pooling operations are performed on the block DM2 (the second block) and the block DM3 (the third block), the register file 110 may perform data rearrangement (step S430) on the block DM2 and the block DM3 to meet the requirement of the pooling operation. Please refer to fig. 1, fig. 4 and fig. 5G. It is assumed here that the data elements of the third row of the data matrix DM are rearranged to the register R1, the register R5, the register R9, the register R13, the register R17 and the register R21, respectively, the data elements of the fourth row of the data matrix DM are rearranged to the register R2, the register R6, the register R10, the register R14, the register R18 and the register R22, respectively, the data elements of the fifth row of the data matrix DM are rearranged to the register R3, the register R7, the register R11, the register R15, the register R19 and the register R23, respectively, and the data elements of the sixth row of the data matrix DM are rearranged to the register R4, the register R8, the register R12, the register R16, the register R20 and the register R24, respectively.
Please refer to fig. 1, fig. 4 and fig. 5H. After completing the data rearrangement (step S430), the execution unit 120 may perform the second batch operation of the pooling operation on the blocks DM2 and DM3 in the registers R1 to R24 of the register file 110 during the second batch of the pooling operation (step S440). The execution unit 120 may perform the second batch operation on the blocks DM2 and DM3 in the registers R1 to R24 to generate another portion of the data elements of the pooled matrix PM (as shown in fig. 5H). In the embodiment shown in fig. 5A to 5H, all batch operations of the pooling operation have been completed after the second batch operation is completed. When all the batch operations of the pooling operation have been completed ("yes" in step S450), the execution unit 120 may output the pooled matrix PM (the result of the pooling operation) to the system 10 (step S470).
In summary, the data matrix DM may be divided into a plurality of blocks (e.g., blocks DM1, DM2, and DM 3). Depending on the limited space of the register file 110, a complete pooling operation may be divided into a plurality of batch operations. In the first batch operation of the pooling operation, the blocks DM1 and DM2 corresponding to the first batch operation may be loaded into the register file 110. After the first batch operation is completed for the blocks DM1 and DM2, the block DM2 used for both the first batch operation and the second batch operation is reserved in the register file 110 to save the data transmission bandwidth. Then, the block DM3 that would be used for the second batch operation may be loaded into the register file 110. In the register file 110, the block DM1 used by the first batch operation but no longer needed by the second batch operation is discarded (e.g., replaced/overwritten by the block DM 3) to save space in the register file 110. Thus, the pooling device 100 can efficiently manage and use the space of the register file 110 for the pooling operation.
Depending on design requirements, the execution unit 120 may be implemented in hardware, firmware, software, or any combination thereof. In hardware, the execution unit 120 may be implemented as logic circuitry on an integrated circuit. The relevant functions of the execution unit 120 may be implemented as hardware using a hardware description language (e.g., Verilog HDL or VHDL) or another suitable programming language. For example, the relevant functions of the execution unit 120 may be implemented in various logic blocks, modules, and circuits in one or more controllers, microcontrollers, microprocessors, application-specific integrated circuits (ASICs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs), and/or other processing units. The relevant functions of the execution unit 120 may also be implemented as programming code in software and/or firmware. For example, the execution unit 120 may be implemented using a general-purpose programming language (e.g., C, C++, or assembly language) or another suitable programming language. The programming code may be recorded/stored on a non-transitory computer-readable medium. In some embodiments, the non-transitory computer-readable medium includes, for example, a tape, a disk, a card, semiconductor memory, programmable logic circuitry, and/or a storage device. The storage device includes a hard disk drive (HDD), a solid-state drive (SSD), or another storage device. A central processing unit (CPU), controller, microcontroller, or microprocessor can read and execute the programming code from the non-transitory computer-readable medium to perform the relevant functions of the execution unit 120.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.