CN111459543B - Method for managing register file unit - Google Patents


Info

Publication number
CN111459543B
CN111459543B (application CN201910052633.XA)
Authority
CN
China
Prior art keywords
register
thread
sub
threads
register file
Prior art date
Legal status
Active
Application number
CN201910052633.XA
Other languages
Chinese (zh)
Other versions
CN111459543A (en)
Inventor
王刚
王震宇
王平
李晶晶
Current Assignee
Shanghai Denglin Technology Co ltd
Original Assignee
Shanghai Denglin Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Denglin Technology Co ltd filed Critical Shanghai Denglin Technology Co ltd
Priority to CN201910052633.XA
Publication of CN111459543A
Application granted
Publication of CN111459543B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • G06F9/30141Implementation provisions of register files, e.g. ports

Abstract

The invention provides a method for managing a register file unit. The register file unit is built from single-port memory, which serves as registers providing operands associated with threads. The method comprises: allocating associated registers to a plurality of threads and organizing the allocated registers into a plurality of register sets; evenly distributing the registers associated with each thread across the plurality of register sets and storing data associated with different threads at the same location of the plurality of register sets; and, for register read and write operations, scheduling the arrangement of the operands associated with the plurality of threads so that the register sets of the register file unit perform only one read operation or only one write operation in the same clock cycle. The method can emulate the function of a multi-port memory using single-port memory, thereby reducing the design cost of the register file unit and improving memory access performance.

Description

Method for managing register file unit
Technical Field
The present invention relates to the field of processor design, and more particularly, to a method for managing register file units.
Background
A register file unit (register file) is an array of registers in a processor such as a CPU or GPU, and may be implemented with flip-flops or with static random access memory (SRAM).
A General-Purpose Graphics Processing Unit (GPGPU) is a massively parallel processor that has been applied successfully in high-performance computing thanks to its extensive thread parallelism: it can process many threads in every clock cycle. In such parallel processing, each thread independently processes a different data set, and the data and intermediate results must be held temporarily in an on-chip register file unit. Because the number of supported threads is large and each thread needs many registers, a GPGPU implements the register file unit with static random access memory (SRAM) rather than flip-flops to reduce area and power consumption. The SRAM has dedicated read and write ports and can access different registers concurrently over multiple paths.
A GPGPU implements a given function by executing program instructions. A scalar instruction reads at most three source operands and writes back one destination operand, for example a multiply-accumulate instruction, so the SRAM would need three read ports and one write port. A three-read, one-write SRAM, however, requires a special custom design that is both time-consuming and expensive.
Vector memory-access instructions are even more demanding: each thread can fetch four 32-bit words at a time. If the write-back is to complete in a single cycle, the four 32-bit words require the SRAM to provide four write ports, which together with the three read ports makes three read and four write ports in total. The implementation cost and complexity of such a multi-port SRAM are prohibitive, while spreading the write-back over multiple cycles sacrifices performance.
Therefore, there is a need for improvements in the prior art to emulate a multi-port memory using a single-port memory, thereby reducing the design cost of the register file unit and improving the memory access performance.
Disclosure of Invention
It is an object of the present invention to overcome the above-mentioned drawbacks of the prior art and to provide a method for managing register file units.
According to a first aspect of the present invention, there is provided a method of managing a register file unit, the register file unit being constituted by a single-port memory, the single-port memory being a register for providing operands associated with a thread, the method comprising the steps of:
Step 1: allocating associated registers to a plurality of threads and organizing the allocated registers into a plurality of register sets, wherein the number of register sets equals the number of threads of the plurality of threads;
Step 2: evenly distributing the registers associated with each thread within the plurality of register sets and storing data associated with different threads in the same location of the plurality of register sets;
Step 3: for register read and write operations, scheduling the arrangement of the operands associated with the plurality of threads so that the register sets of the register file unit perform only one read operation or only one write operation in the same clock cycle.
In one embodiment, in step 3, for a read operation, operands associated with the plurality of threads are read from the plurality of register sets and the operands associated with each thread are scheduled to be respectively arranged into a corresponding set for dispatch execution.
In one embodiment, in step 3, for a write operation, the operands associated with each of the plurality of threads are evenly distributed within the plurality of register sets and placed in corresponding locations of the plurality of register sets.
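The three steps above can be sketched as a small behavioral model. This is an illustrative sketch only, not the patented hardware: the class and method names (`BankedRegFile`, `read_all`) are assumptions, and the skewed placement `(thread + reg) % B` follows the layout the description later attributes to FIG. 4.

```python
class BankedRegFile:
    """Toy model: B single-port register sets, one per thread (steps 1-2)."""

    def __init__(self, num_threads=4, regs_per_thread=8):
        self.b = num_threads
        self.banks = [[None] * regs_per_thread for _ in range(self.b)]

    def write(self, thread, reg, value):
        # Step 2: skewed placement spreads each thread's registers evenly.
        self.banks[(thread + reg) % self.b][reg] = value

    def read_all(self, reg):
        """Step 3 (read side): fetch register `reg` of every thread in one
        conceptual cycle, touching each single-port set exactly once."""
        return [self.banks[(t + reg) % self.b][reg] for t in range(self.b)]

rf = BankedRegFile()
for t in range(4):
    for k in range(8):
        rf.write(t, k, (t, k))

# One conceptual cycle reads R3 of all four threads, one access per set.
assert rf.read_all(3) == [(0, 3), (1, 3), (2, 3), (3, 3)]
```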
According to a second aspect of the invention, a register file unit is provided. The register file unit includes:
a plurality of register sets: registers for providing operands associated with a plurality of threads, wherein the registers are formed of single-port memory, the number of register sets equals the number of threads of the plurality of threads, the registers associated with each thread are evenly distributed within the plurality of register sets, and data associated with different threads is stored at the same location of the plurality of register sets;
a scheduling unit for scheduling, for register read and write operations, the arrangement of the operands associated with the plurality of threads, so that the register file unit performs only one read operation or only one write operation in the same clock cycle.
In one embodiment, the scheduling unit includes a read operation scheduling unit and a write operation scheduling unit, wherein:
the read operation scheduling unit, for a read operation, reads the operands associated with the plurality of threads from the plurality of register sets and schedules the operands associated with each thread into a corresponding set for dispatch and execution;
the write operation scheduling unit evenly distributes operands associated with each of the plurality of threads within the plurality of register sets and places the operands associated with each thread in corresponding locations of the plurality of register sets for a write operation.
According to a third aspect of the invention, a computing system is provided. The computing system comprises a plurality of register file units provided according to the invention, execution units, and a switching network, wherein:
each register file unit is configured to receive the requests of a sub-thread group and provide the associated operands;
the execution units are configured to execute the program instructions corresponding to the sub-thread groups;
the switching network is configured to distribute the program instructions and associated operands of each sub-thread group to the execution units and to return the results obtained by the execution units to the corresponding register file units.
In one embodiment, the computing system of the present invention further comprises a thread group management unit for dividing the task to be processed into a plurality of sub-thread groups and distributing to the plurality of register file units.
In one embodiment of the computing system of the invention, for the plurality of sub-thread groups, the source operands are read from the respective register file units in sequence with a phase shift, such that the register sets of each register file unit have only one read operation or only one write operation per cycle.
In one embodiment of the computing system of the present invention, for a vector memory-access instruction, the data fetched by a thread from four consecutive addresses is written back to the register file unit simultaneously in the same cycle.
In one embodiment of the computing system of the present invention, the execution unit is shared by the plurality of sub-thread groups in a time-multiplexed manner.
In one embodiment of the computing system of the present invention, where the number of sub-thread groups is set to 4, when three source operands need to be read, the following steps are performed:
in a first clock cycle, a first sub-thread group reads a first source operand;
in a second clock cycle, the first sub-thread group reads the second source operand, and the second sub-thread group reads the first source operand;
in a third clock cycle, the first sub-thread group reads a third source operand, the second sub-thread group reads a second source operand, and the third sub-thread group reads the first source operand;
on a fourth clock cycle, the second sub-thread group reads the third source operand, the third sub-thread group reads the second source operand, and the fourth sub-thread group reads the first source operand.
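The four-cycle walkthrough above can be generated mechanically. The sketch below is an assumption-level model, not the patent's hardware (the helper name `schedule` is hypothetical, and groups/operands are numbered from 0 rather than "first"): group g starts its reads at cycle g, and no register file unit ever serves two reads in the same cycle.

```python
NUM_GROUPS, NUM_SRCS = 4, 3

def schedule():
    """Cycle -> list of (group, source) reads; group g starts at cycle g."""
    sched = {}
    for g in range(NUM_GROUPS):
        for s in range(NUM_SRCS):
            sched.setdefault(g + s, []).append((g, s))
    return sched

sched = schedule()

# No register file unit (one per sub-thread group) serves two reads
# in the same cycle.
for ops in sched.values():
    groups = [g for g, _ in ops]
    assert len(groups) == len(set(groups))

# The first and fourth cycles reproduce the steps listed above.
assert sched[0] == [(0, 0)]                       # first group reads SRC0
assert sorted(sched[3]) == [(1, 2), (2, 1), (3, 0)]
```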
According to a fourth aspect of the present invention, there is provided an electronic device comprising the register file unit provided by the present invention.
Compared with the prior art, the invention has the advantages that: the register file unit is formed by using the single-port memory, and the function of the multi-port memory can be simulated by using the single-port memory by combining a proper scheduling strategy, so that the design complexity of the register file unit is reduced, and the memory access performance is improved.
Drawings
The invention is illustrated and described in the following drawings by way of example only and without limitation of its scope, in which:
FIG. 1 is a diagram illustrating the architecture of a compute engine in a GPGPU in accordance with one embodiment of the present invention;
FIG. 2 is a diagram illustrating a computational core unit in a compute engine according to one embodiment of the invention;
FIG. 3 illustrates a schematic diagram of a register file unit, according to one embodiment of the present invention;
FIG. 4 is a diagram illustrating the organization of register data in register file units, according to one embodiment of the invention;
FIG. 5 illustrates a schematic diagram of the scheduling of read operations on a register file unit, according to one embodiment of the invention;
FIG. 6 shows a process diagram for reading an operand according to one embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions, design methods, and advantages of the present invention more apparent, the present invention will be further described in detail by specific embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not as a limitation. Thus, other examples of the exemplary embodiments may have different values.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
The structure and scheduling strategy of the register file unit will be described below by taking a GPGPU processor as an example.
FIG. 1 is a diagram illustrating a compute engine architecture, applicable to a GPGPU processor, according to one embodiment of the present invention. The compute engine 100 includes a thread group construction unit 110, an instruction cache unit 120, a thread group management unit 130, and a compute core unit 140, where the compute core unit 140 includes a register file unit.
The thread group construction unit 110 is configured to construct the task to be processed into a plurality of thread groups, and is communicatively connected to the thread group management unit 130 and the compute core unit 140, for example, initial location information of each thread in the thread group may be written into a register file unit in the compute core unit 140 and status information of the thread group may be written into the thread group management unit 130.
The thread group management unit 130 is configured to schedule a plurality of thread groups, and is communicatively connected to the instruction cache unit 120 and the compute core unit 140, for example, to schedule a thread group to access the instruction cache unit 120 to read instructions and to send instructions corresponding to the thread group to the compute core unit 140 for execution, where a plurality of threads in each thread group execute a same program instruction.
Taking a thread group of 16 threads as an example, the following description divides a single thread group into 4 sub-thread groups of 4 threads each, with each sub-thread group corresponding to one register file unit of the compute core unit 140; thread groups and sub-thread groups of other sizes can be handled in a similar manner.
FIG. 2 shows an embodiment of a computational core unit, in which the computational core unit 140 comprises a plurality of register file units (4 are shown, labeled register file units 211-214), a switching network 220, and a plurality of execution units (4 are shown, labeled execution units 231-234), where each register file unit corresponds to one sub-thread group (i.e., 4 threads).
Each register file unit is used for receiving program instructions (not shown) of the thread group management unit and storing operands, calculation results and the like related to the program instructions executed by the sub-thread groups. The program instruction may comprise a maximum of three operands which are stored in registers of the register file unit for access when executing the program, e.g. a multiply instruction comprising two operands and a multiply-accumulate instruction comprising three operands.
The register file unit processes read and write requests of the sub-thread groups, reads operands from and writes operands to the registers, and schedules the processing of each request to avoid write back conflicts.
The execution unit is used for executing program instructions, for example, executing multiply-accumulate instructions to obtain calculation results.
The switching network 220 is used for data exchange between the register file unit and the execution units, e.g. to send program instructions and related operands to the execution units for processing. The switching network 220 may be implemented by a crossbar-based network, and may be configured as a direct connection network or a tree network.
Based on the embodiment of FIG. 2, the data processing procedure is as follows: according to the program instruction obtained from the instruction cache unit, the thread group management unit reads the source operands from the corresponding register file unit; once the register file unit determines that all source operands have been obtained, the program instruction and its source operands are sent through the switching network to the execution units, and after execution the destination operand is written back through the switching network to the corresponding register file unit. For example, a multiply-accumulate program instruction involves three read requests and one write request at the register file unit.
In the present invention, within one register file unit, the registers are organized according to the number of threads in a sub-thread group, and the operands are scheduled according to a policy, enabling the register file unit to emulate multi-port registers with single-port registers. The organization and data scheduling of the registers in a register file unit are described below according to one embodiment of the invention.
FIG. 3 illustrates an internal block diagram of a register file unit 300, which includes a write operation schedule unit 310, a read operation schedule unit 320, and a plurality of register sets, 4 of which are shown, labeled as register sets 0-3, respectively, according to one embodiment of the present invention.
The write operation scheduling unit 310 is configured to process a write request, and schedule the write operands associated with the sub-thread group according to a certain rule (which will be described in detail below), so as to enable a write operation to be implemented by using a single-port register.
The read operation scheduling unit 320 is configured to process the read request, and schedule and organize the read operands associated with the sub-thread groups to enable the register file unit to implement the read operation using the single-port register.
Register sets 0-3 are used to store operands associated with sub-thread groups, including read operands and write operands, each register set containing a plurality of registers, and for sub-thread groups comprising 4 threads, the function of a multi-port memory can be emulated using a single-port memory by appropriately scheduling the read and write operands.
In one embodiment, the registers in the register file unit 300 are organized as in FIG. 4, divided into a number of register sets equal to the number of threads in a sub-thread group. For example, for a sub-thread group of 4 threads, the registers are divided into 4 storage areas, labeled register sets 0-3, and the threads of the sub-thread group are labeled T0, T1, T2, T3 in order. The organization of this example is as follows: in register set 0, the operands associated with the four threads T0, T1, T2, and T3 correspond to registers R4, R3, R2, and R1 respectively, and thread T0 additionally corresponds to register R0; in register set 1, the operands associated with T0, T1, T2, and T3 correspond to registers R1, R4, R3, and R2 respectively, and thread T1 corresponds to R0. The other register sets are organized similarly; see FIG. 4 for details.
The organization shown in FIG. 4 has the following features. From the register point of view, register R0 of all 4 threads is placed at the same position 0 of the four register sets, register R1 at the same position 1, and in general register Rn at the same position n of the four register sets. From the thread point of view, the registers of each thread are evenly distributed over the 4 register sets. For example, register R0 of thread T0 lies in register set 0, R1 in register set 1, R2 in register set 2, and R3 in register set 3; register R0 of thread T1 lies in register set 1, R1 in register set 2, R2 in register set 3, and R3 in register set 0. In short, register R0 of thread Tn (n < 4) lies in register set n, R1 in register set (n+1) % 4, R2 in register set (n+2) % 4, and R3 in register set (n+3) % 4.
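The placement rule just stated reduces to a one-line formula. The sketch below is illustrative only (the helper name `bank_of` is an assumption; the mapping `(n + k) % 4` is from the text) and checks the two properties of the FIG. 4 layout.

```python
NUM_BANKS = 4  # one register set per thread of the sub-thread group

def bank_of(thread, reg):
    """Register Rk of thread Tn is stored in register set (n + k) % 4,
    at position k within that set (the mapping stated above)."""
    return (thread + reg) % NUM_BANKS

# Register view: for a fixed register index k, the four threads occupy
# four distinct register sets, all at position k.
for k in range(8):
    assert {bank_of(t, k) for t in range(NUM_BANKS)} == set(range(NUM_BANKS))

# Thread view: each thread's R0-R3 are spread one-per-set over the
# four register sets, matching FIG. 4.
for t in range(NUM_BANKS):
    assert {bank_of(t, k) for k in range(NUM_BANKS)} == set(range(NUM_BANKS))
```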
With the register organization of fig. 4, a single-port memory can be used to implement the function of a multi-port memory for read and write operations through scheduling.
In one embodiment, for a register read operation, the source operands read out of the register sets appear as in FIG. 5(a): for thread T0, its register R0 is in the first column (counting from the left), R1 in the second, R2 in the third, R3 in the fourth, and R4 wraps back to the first column; similarly, for thread T1, its register R0 is in the second column, R1 in the third, R2 in the fourth, R3 in the first, and R4 back in the second. After the source operands are read out, the read operation scheduling unit rearranges them into the form of FIG. 5(b): R0, R1, R2, R3, and R4 associated with thread T0 are all moved into the first column, and correspondingly the source operands of T1 all end up in the second column, those of T2 in the third column, and those of T3 in the fourth column.
Note that R0, R1, R2, R3, R4, and the like shown in fig. 5(a) indicate register positions, and R0, R1, R2, R3, R4, and the like shown in fig. 5(b) indicate data read from the corresponding registers in fig. 5 (a).
After the sorting performed by the read operation scheduling unit, the operands of each thread sit in the same column, and threads correspond one-to-one with execution units. For example, thread T0 of the sub-thread group corresponds to execution unit 0, T1 to execution unit 1, T2 to execution unit 2, and T3 to execution unit 3. Because the scheduling unit has moved each thread's operands into its own column, the operands can be sent directly through the switching network to the corresponding execution units.
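The column adjustment can be modeled as a rotation. The sketch below is a hedged illustration, not the patent's hardware: `read_row` stands for the one-read-per-set access of FIG. 5(a), and `to_thread_order` (a hypothetical name) stands for the read operation scheduling unit's rearrangement into FIG. 5(b).

```python
NUM_BANKS = 4

def read_row(regfile, k):
    """One cycle: read position k of every set (one read per single port)."""
    return [regfile[b][k] for b in range(NUM_BANKS)]

def to_thread_order(bank_row, k):
    """Set b holds Rk of thread (b - k) % 4, so thread t's value is in
    set (t + k) % 4; pick it into column t for execution unit t."""
    return [bank_row[(t + k) % NUM_BANKS] for t in range(NUM_BANKS)]

# Lay out a register file as in FIG. 4: regfile[set][pos] = (thread, reg).
regfile = [[None] * 8 for _ in range(NUM_BANKS)]
for t in range(NUM_BANKS):
    for k in range(8):
        regfile[(t + k) % NUM_BANKS][k] = (t, k)

# After scheduling, column t always holds thread t's operand (FIG. 5(b)).
for k in range(8):
    assert to_thread_order(read_row(regfile, k), k) == [(t, k) for t in range(NUM_BANKS)]
```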
In one embodiment, a register write operation uses the reverse of the read procedure: when the write operation scheduling unit receives a write request, it arranges the write operands of the threads into the form shown in FIG. 5(a).
FIG. 6 illustrates the register read/write process using four sub-thread groups as an example, where the four sub-thread groups read their three source operands in sequence, staggered in phase by one cycle. Specifically, in clock cycle 0, the first sub-thread group reads the first source operand SRC0; in cycle 1, the first sub-thread group reads the second source operand SRC1 while the second sub-thread group reads SRC0; in cycle 2, the first sub-thread group reads the third source operand SRC2, the second reads SRC1, and the third reads SRC0; in cycle 3, the second sub-thread group reads SRC2, the third reads SRC1, and the fourth reads SRC0. This continues until all source operands have been read, after which the four sub-thread groups share the same set of execution units in a time-multiplexed manner. From the perspective of each sub-thread group, it reads its first source operand in its first cycle, its second in the next cycle, and its third in the cycle after that. With this scheduling, the register sets of each register file unit carry at most one read or write operation per cycle, so the register file unit can emulate a multi-port memory with single-port memory.
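A small model of this stagger (helper names assumed, groups numbered from 0) also shows why the execution units can be shared without conflict: group g finishes its three reads at cycle g + 2, so at most one group dispatches to the shared execution units per cycle.

```python
NUM_GROUPS, NUM_SRCS = 4, 3

reads = {}     # cycle -> (group, source-operand) register reads
dispatch = {}  # cycle -> group sending its operands to the execution units
for g in range(NUM_GROUPS):
    for s in range(NUM_SRCS):
        reads.setdefault(g + s, []).append((g, s))
    dispatch[g + NUM_SRCS] = g  # all three sources ready after cycle g + 2

# Cycle 2 matches FIG. 6: group 0 reads SRC2, group 1 SRC1, group 2 SRC0.
assert sorted(reads[2]) == [(0, 2), (1, 1), (2, 0)]

# Exactly one group reaches the shared execution units per cycle, in order,
# i.e. the execution units are time-multiplexed without conflict.
assert dispatch == {3: 0, 4: 1, 5: 2, 6: 3}
```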
For vector memory-access instructions, each thread can fetch four 32-bit words at a time, and the four words of each thread need to be written back to the registers simultaneously to maintain performance. With the register organization of FIG. 4 and the sorting performed by the write operation scheduling unit, the 32-bit data of any four consecutive destination addresses can be written back to the register sets simultaneously; for example, R0-R3 or R1-R4 can be written back at the same time, because the registers of any 4 consecutive destination addresses are evenly distributed over the four register sets. This way of organizing the registers avoids unnecessary hardware restrictions and allows register usage to be optimized more freely.
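The "any four consecutive addresses" property follows directly from the skewed mapping. The check below is illustrative only (the `bank_of` helper is an assumption; the `(t + k) % 4` mapping is from FIG. 4).

```python
NUM_BANKS = 4

def bank_of(thread, reg):
    """Skewed mapping of FIG. 4: Rk of thread Tn lives in set (n + k) % 4."""
    return (thread + reg) % NUM_BANKS

# Any four consecutive destination registers of a thread land in four
# distinct register sets, so all four 32-bit words can be written back
# in one cycle (e.g. R0-R3, or R1-R4, and so on).
for t in range(NUM_BANKS):
    for base in range(16):
        assert {bank_of(t, base + i) for i in range(4)} == set(range(NUM_BANKS))
```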
It should be understood that the above embodiments are for illustrative purposes only, the inventive concept is applicable to any number of sub-thread groups, the number of threads included in a sub-thread group may be any number, and the invention is applicable to the accessing of any number of bits of scalar data and vector data. In addition, the read operation scheduling unit and the write operation scheduling unit may be integrated into one scheduling unit.
The method for managing the register file unit or simulating the multi-port memory by using the single-port memory provided by the invention can be applied to any electronic device containing the register file unit, such as a desktop computer, a portable computer, a tablet computer, a smart phone or any other type of computing device (such as a GPGPU-based device). The electronic equipment can be applied to the fields of word processing, voice recognition and processing, multinational language translation, image recognition, biological feature recognition, intelligent control and the like, and can be used as intelligent computing processing equipment, robots, mobile equipment and the like.
It should be noted that, although the steps are described in a specific order, the steps are not necessarily performed in the specific order, and in fact, some of the steps may be performed concurrently or even in a changed order as long as the required functions are achieved.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that holds and stores the instructions for use by the instruction execution device. The computer readable storage medium may include, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, by software, and by a combination of software and hardware are equivalent.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (12)

1. A method of managing a register file unit, the register file unit being constituted by a single-port memory, the single-port memory being a register for providing operands associated with a thread, the method comprising the steps of:
step 1: allocating associated registers for a plurality of threads and organizing the allocated registers into a plurality of register sets, wherein the number of register sets is equal to the number of threads of the plurality of threads;
step 2: evenly distributing the registers associated with each thread within the plurality of register sets, and storing the respective registers of the same operand associated with different threads in the same location of the plurality of register sets;
and step 3: for the read-write operation of the register, the arrangement mode of the operands related to the threads is scheduled, so that the register groups of the register file unit only have one read operation or one write operation in the same clock cycle.
2. The method of claim 1, wherein, in step 3, for a read operation, operands associated with the plurality of threads are read from the plurality of register banks, and the operands associated with each thread are scheduled into a corresponding group for dispatch and execution.
3. The method of claim 1, wherein, in step 3, for a write operation, operands associated with each of the plurality of threads are evenly distributed across the plurality of register banks and placed at the corresponding locations in the plurality of register banks.
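Claims 1 to 3 do not fix a particular placement function. One concrete layout that satisfies them can be sketched as follows; this is a hypothetical example, not taken from the patent: with T threads and T single-port banks, register r of thread t is placed in bank (t + r) mod T at row r.

```python
# A minimal sketch (hypothetical, not specified by the patent) of one
# bank-swizzled layout satisfying claims 1-3: with T threads and T
# single-port banks, register r of thread t lives in bank (t + r) % T, row r.
T = 4          # number of threads == number of register banks (claim 1)
R = 8          # registers per thread (illustrative)

def location(thread, reg):
    """Return (bank, row) for register `reg` of `thread`."""
    return ((thread + reg) % T, reg)

# Each thread's registers are spread evenly across the banks (claim 2) ...
for t in range(T):
    banks_used = [location(t, r)[0] for r in range(R)]
    assert all(banks_used.count(b) == R // T for b in range(T))

# ... and the same register index of different threads sits at the same row
# but in T different banks, so one access can serve all threads without a
# bank conflict (claim 3 / step 3).
for r in range(R):
    rows = {location(t, r)[1] for t in range(T)}
    banks = {location(t, r)[0] for t in range(T)}
    assert rows == {r} and len(banks) == T
```

Because each bank is single-ported, this even spreading is what allows a full set of per-thread operands to be read in one cycle with one access per bank.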
4. A register file unit, comprising:
a plurality of register banks for providing registers that hold operands associated with a plurality of threads, wherein the registers are constituted by single-port memories, the number of register banks is equal to the number of threads in the plurality of threads, the registers associated with each thread are evenly distributed across the plurality of register banks, and the respective registers that hold the same operand associated with different threads are stored at the same location in the plurality of register banks;
a scheduling unit for scheduling, for read and write operations on the registers, the arrangement of the operands associated with the threads, so that each register bank of the register file unit performs only one read operation or only one write operation in the same clock cycle.
5. The register file unit of claim 4, wherein the scheduling unit comprises a read operation scheduling unit and a write operation scheduling unit, wherein:
for a read operation, the read operation scheduling unit reads operands associated with the plurality of threads from the plurality of register banks and schedules the operands associated with each thread into a corresponding group for dispatch and execution;
for a write operation, the write operation scheduling unit evenly distributes the operands associated with each of the plurality of threads across the plurality of register banks and places them at the corresponding locations in the plurality of register banks.
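The gather/scatter performed by the read and write operation scheduling units of claim 5 amounts to a rotation between thread order and bank order. The sketch below assumes a hypothetical (t + r) mod T bank swizzle (one possible layout consistent with claim 4, not specified by the patent); `write_schedule` and `read_schedule` are illustrative names, not from the patent.

```python
# Hypothetical sketch of the permutation done by the read/write scheduling
# units of claim 5, assuming register r of thread t lives in bank (t + r) % T.
T = 4

def write_schedule(values_by_thread, reg):
    """Scatter one operand per thread into bank order for register `reg`."""
    banks = [None] * T
    for t, v in enumerate(values_by_thread):
        banks[(t + reg) % T] = v        # thread t's word goes to bank (t+reg)%T
    return banks

def read_schedule(values_by_bank, reg):
    """Gather the banks' outputs back into thread order for register `reg`."""
    return [values_by_bank[(t + reg) % T] for t in range(T)]

# A write followed by a read of the same register index restores thread order.
vals = ['t0', 't1', 't2', 't3']
for r in range(8):
    assert read_schedule(write_schedule(vals, r), r) == vals
```

In hardware this permutation would be a small crossbar or rotator rather than an indexed copy, but the data movement it performs is the same.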
6. A computing system, comprising a plurality of register file units according to any one of claims 4 to 5, execution units, and a switching network, wherein:
each register file unit is configured to receive requests from a sub-thread group and provide the associated operands;
the execution units are configured to execute the program instructions corresponding to the sub-thread groups;
the switching network is configured to distribute the program instructions and associated operands of each sub-thread group to the execution units, and to distribute the operation results produced by the execution units to the corresponding register file units.
7. The system of claim 6, further comprising a thread group management unit configured to divide a task to be processed into a plurality of sub-thread groups and distribute them to the plurality of register file units.
8. The system of claim 7, wherein, for the plurality of sub-thread groups, a plurality of source operands are read from the corresponding register file units in sequence with a phase shift, such that each register bank of each register file unit performs only one read operation or only one write operation per cycle.
9. The system of claim 6, wherein, for a vector access instruction, the data of four consecutive addresses fetched by a thread are written back to the register file unit simultaneously in the same cycle.
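Claim 9's single-cycle writeback of four words is consistent with a swizzled layout: under a hypothetical (t + r) mod T placement (an assumption, not stated in the claim), four consecutive registers of one thread land in four distinct banks, so the four single-port banks each absorb exactly one write.

```python
# Sketch (assuming a hypothetical (t + r) % T swizzle) of why a 4-word vector
# writeback can complete in one cycle: consecutive registers r..r+3 of one
# thread fall into four distinct banks, i.e. one write per single-port bank.
T = 4
for t in range(T):
    for r in range(0, 16, 4):
        banks = [(t + r + i) % T for i in range(4)]
        assert sorted(banks) == [0, 1, 2, 3]   # each bank sees exactly one write
```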
10. The system of claim 8, wherein the execution units are shared by the plurality of sub-thread groups in a time-multiplexed manner.
11. The system of claim 10, wherein the number of sub-thread groups is set to 4, and when three source operands need to be read, the following steps are performed:
in a first clock cycle, the first sub-thread group reads the first source operand;
in a second clock cycle, the first sub-thread group reads the second source operand, and the second sub-thread group reads the first source operand;
in a third clock cycle, the first sub-thread group reads the third source operand, the second sub-thread group reads the second source operand, and the third sub-thread group reads the first source operand;
in a fourth clock cycle, the second sub-thread group reads the third source operand, the third sub-thread group reads the second source operand, and the fourth sub-thread group reads the first source operand.
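The phase-shifted schedule of claim 11 can be simulated in a few lines: each sub-thread group starts one cycle after the previous one, and group g reads operand op in cycle g + op (0-based). The simulation below reproduces the four cycles the claim enumerates.

```python
# Simulation of the phase-shifted read schedule of claim 11: 4 sub-thread
# groups, 3 source operands, each group starting one cycle after the last.
GROUPS, OPERANDS = 4, 3
schedule = {}
for g in range(GROUPS):
    for op in range(OPERANDS):
        cycle = g + op                       # group g begins at cycle g
        schedule.setdefault(cycle, []).append((g, op))

# Cycles 0-3 match the claim's enumeration (0-based groups and operands):
assert schedule[0] == [(0, 0)]
assert schedule[1] == [(0, 1), (1, 0)]
assert schedule[2] == [(0, 2), (1, 1), (2, 0)]
assert schedule[3] == [(1, 2), (2, 1), (3, 0)]

# No group ever issues more than one operand read in the same cycle, which is
# what lets the groups time-multiplex the execution units (claim 10).
for reads in schedule.values():
    groups = [g for g, _ in reads]
    assert len(groups) == len(set(groups))
```

The tail of the pipeline (cycles 4 and 5, where the last groups finish their second and third operands) is implied by the same pattern even though the claim only spells out the first four cycles.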
12. An electronic device, characterized in that the electronic device comprises a register file unit according to any one of claims 4 to 5.
CN201910052633.XA 2019-01-21 2019-01-21 Method for managing register file unit Active CN111459543B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910052633.XA CN111459543B (en) 2019-01-21 2019-01-21 Method for managing register file unit

Publications (2)

Publication Number Publication Date
CN111459543A CN111459543A (en) 2020-07-28
CN111459543B true CN111459543B (en) 2022-09-13

Family

ID=71679088

Country Status (1)

Country Link
CN (1) CN111459543B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115129369A (en) * 2021-03-26 2022-09-30 上海阵量智能科技有限公司 Command distribution method, command distributor, chip and electronic device
CN114546329B (en) * 2022-03-01 2023-07-18 上海壁仞智能科技有限公司 Method, apparatus and medium for implementing data parity rearrangement
CN115129480B (en) * 2022-08-26 2022-11-08 上海登临科技有限公司 Scalar processing unit and access control method thereof

Citations (2)

Publication number Priority date Publication date Assignee Title
CN103218208A (en) * 2011-12-06 2013-07-24 辉达公司 System and method for performing shaped memory access operations
CN103257931A (en) * 2011-12-22 2013-08-21 辉达公司 Shaped register file reads

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
US6546461B1 (en) * 2000-11-22 2003-04-08 Integrated Device Technology, Inc. Multi-port cache memory devices and FIFO memory devices having multi-port cache memory devices therein
CN1853379A (en) * 2002-12-31 2006-10-25 肯奈克斯特公司 System and method for providing quality of service in asynchronous transfer mode cell transmission
US7339592B2 (en) * 2004-07-13 2008-03-04 Nvidia Corporation Simulating multiported memories using lower port count memories
US8533435B2 (en) * 2009-09-24 2013-09-10 Nvidia Corporation Reordering operands assigned to each one of read request ports concurrently accessing multibank register file to avoid bank conflict
US8458446B2 (en) * 2009-09-30 2013-06-04 Oracle America, Inc. Accessing a multibank register file using a thread identifier
US10303472B2 (en) * 2016-11-22 2019-05-28 Advanced Micro Devices, Inc. Bufferless communication for redundant multithreading using register permutation

Non-Patent Citations (1)

Title
A low-port-count, low-power register file design based on active cycles; Zhao Yulai et al.; Chinese Journal of Computers (《计算机学报》); 2008-02-15 (No. 02); pp. 117-126 *

Similar Documents

Publication Publication Date Title
US9830156B2 (en) Temporal SIMT execution optimization through elimination of redundant operations
US5832290A (en) Apparatus, systems and method for improving memory bandwidth utilization in vector processing systems
US10255228B2 (en) System and method for performing shaped memory access operations
JP5422614B2 (en) Simulate multiport memory using low port count memory
CN109997115B (en) Low power and low latency GPU co-processor for persistent computation
US8639882B2 (en) Methods and apparatus for source operand collector caching
US10007527B2 (en) Uniform load processing for parallel thread sub-sets
CN111459543B (en) Method for managing register file unit
CN107408040A (en) It is configured with executing out the vector processor for operating variable-length vector
CN107766079B (en) Processor and method for executing instructions on processor
CN111656339B (en) Memory device and control method thereof
US9626191B2 (en) Shaped register file reads
US9286114B2 (en) System and method for launching data parallel and task parallel application threads and graphics processing unit incorporating the same
US20150082007A1 (en) Register mapping with multiple instruction sets
US11934827B2 (en) Partition and isolation of a processing-in-memory (PIM) device
JP2008524723A (en) Evaluation unit for flag register of single instruction multiple data execution engine
US10409610B2 (en) Method and apparatus for inter-lane thread migration
US20080082797A1 (en) Configurable Single Instruction Multiple Data Unit
US8413151B1 (en) Selective thread spawning within a multi-threaded processing system
TW201539189A (en) Method, apparatus and computer readable recording medium for preventing bank conflict in memory
US8959497B1 (en) System and method for dynamically spawning thread blocks within multi-threaded processing systems
CN115248701A (en) Zero-copy data transmission device and method between processor register files
US11669489B2 (en) Sparse systolic array design
KR102644951B1 (en) Arithmetic Logic Unit Register Sequencing
US11822541B2 (en) Techniques for storing sub-alignment data when accelerating Smith-Waterman sequence alignments

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant