CN111459543B - Method for managing register file unit - Google Patents


Info

Publication number
CN111459543B
CN111459543B (application CN201910052633.XA)
Authority
CN
China
Prior art keywords
register
thread
sub
threads
register file
Prior art date
Legal status
Active
Application number
CN201910052633.XA
Other languages
Chinese (zh)
Other versions
CN111459543A (en)
Inventor
王刚
王震宇
王平
李晶晶
Current Assignee
Shanghai Denglin Technology Co ltd
Original Assignee
Shanghai Denglin Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Denglin Technology Co ltd filed Critical Shanghai Denglin Technology Co ltd
Priority to CN201910052633.XA
Publication of CN111459543A
Application granted
Publication of CN111459543B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • G06F9/30141Implementation provisions of register files, e.g. ports

Abstract

The invention provides a method for managing a register file unit. The register file unit is built from single-port memory, which serves as registers providing operands associated with threads. The method comprises: allocating associated registers to a plurality of threads and organizing the allocated registers into a plurality of register sets; evenly distributing the registers associated with each thread across the plurality of register sets and storing data associated with different threads at the same location of the plurality of register sets; and, for register read and write operations, scheduling the arrangement of the operands associated with the plurality of threads so that the register sets of the register file unit perform only one read operation or only one write operation in the same clock cycle. The method can emulate the function of a multi-port memory using single-port memory, thereby reducing the design cost of the register file unit and improving memory access performance.

Description

Method for managing register file unit
Technical Field
The present invention relates to the field of processor design, and more particularly, to a method for managing register file units.
Background
A register file unit (register file) is an array of registers in a processor such as a CPU or GPU, and may be implemented with flip-flops or with static random access memory (SRAM).
A General-Purpose Graphics Processing Unit (GPGPU) is a massively parallel processor that has been applied successfully in high-performance computing thanks to its extensive thread parallelism: it can process many threads in every clock cycle. In such parallel processing, each thread independently processes a different data set, and the data and intermediate results must be held temporarily in an on-chip register file unit. Because the number of supported threads is large and each thread needs many registers, a GPGPU implements the register file unit with static random access memory (SRAM) rather than flip-flops to reduce area and power consumption. The SRAM has dedicated read and write ports and can access different registers concurrently over multiple paths.
A GPGPU implements a given function by executing program instructions. A scalar instruction reads at most three source operands and writes back one destination operand, for example a multiply-accumulate instruction, so the SRAM would need three read ports and one write port. A three-read, one-write SRAM, however, requires a special custom design that is both time-consuming and expensive.
Vector memory-access instructions are even more demanding: each thread can fetch four 32-bit words at a time. If the write-back is to complete in a single cycle, the four 32-bit words require the SRAM to provide four write ports, which together with the three read ports makes three read and four write ports in total. The implementation cost and complexity of such a multi-port SRAM are prohibitive, while spreading the write-back over multiple cycles sacrifices performance.
Therefore, there is a need for improvements in the prior art to emulate a multi-port memory using a single-port memory, thereby reducing the design cost of the register file unit and improving the memory access performance.
Disclosure of Invention
It is an object of the present invention to overcome the above-mentioned drawbacks of the prior art and to provide a method for managing register file units.
According to a first aspect of the present invention, there is provided a method of managing a register file unit, the register file unit being constituted by a single-port memory, the single-port memory being a register for providing operands associated with a thread, the method comprising the steps of:
Step 1: allocating associated registers to a plurality of threads and organizing the allocated registers into a plurality of register sets, wherein the number of register sets equals the number of threads of the plurality of threads;
Step 2: evenly distributing the registers associated with each thread within the plurality of register sets and storing data associated with different threads in the same location of the plurality of register sets;
Step 3: for register read and write operations, scheduling the arrangement of the operands associated with the plurality of threads so that the register sets of the register file unit perform only one read operation or only one write operation in the same clock cycle.
In one embodiment, in step 3, for a read operation, operands associated with the plurality of threads are read from the plurality of register sets and the operands associated with each thread are scheduled to be respectively arranged into a corresponding set for dispatch execution.
In one embodiment, in step 3, for a write operation, the operands associated with each of the plurality of threads are evenly distributed within the plurality of register sets and placed in corresponding locations of the plurality of register sets.
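The three steps above can be sketched as a small behavioral model. This is an illustrative sketch only, not the patented hardware: the class and method names (`BankedRegFile`, `read_all`) are assumptions, and the skewed placement `(thread + reg) % B` follows the layout the description later attributes to FIG. 4.

```python
class BankedRegFile:
    """Toy model: B single-port register sets, one per thread (steps 1-2)."""

    def __init__(self, num_threads=4, regs_per_thread=8):
        self.b = num_threads
        self.banks = [[None] * regs_per_thread for _ in range(self.b)]

    def write(self, thread, reg, value):
        # Step 2: skewed placement spreads each thread's registers evenly.
        self.banks[(thread + reg) % self.b][reg] = value

    def read_all(self, reg):
        """Step 3 (read side): fetch register `reg` of every thread in one
        conceptual cycle, touching each single-port set exactly once."""
        return [self.banks[(t + reg) % self.b][reg] for t in range(self.b)]

rf = BankedRegFile()
for t in range(4):
    for k in range(8):
        rf.write(t, k, (t, k))

# One conceptual cycle reads R3 of all four threads, one access per set.
assert rf.read_all(3) == [(0, 3), (1, 3), (2, 3), (3, 3)]
```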
According to a second aspect of the invention, a register file unit is provided. The register file unit includes:
a plurality of register sets: registers for providing operands associated with a plurality of threads, wherein the registers are formed of single-port memory, the number of register sets equals the number of threads of the plurality of threads, the registers associated with each thread are evenly distributed within the plurality of register sets, and data associated with different threads is stored at the same location of the plurality of register sets;
a scheduling unit for scheduling, for register read and write operations, the arrangement of the operands associated with the plurality of threads, so that the register file unit performs only one read operation or only one write operation in the same clock cycle.
In one embodiment, the scheduling unit includes a read operation scheduling unit and a write operation scheduling unit, wherein:
the read operation scheduling unit, for a read operation, reads the operands associated with the plurality of threads from the plurality of register sets and schedules the operands associated with each thread into a corresponding set for dispatch and execution;
the write operation scheduling unit evenly distributes operands associated with each of the plurality of threads within the plurality of register sets and places the operands associated with each thread in corresponding locations of the plurality of register sets for a write operation.
According to a third aspect of the invention, a computing system is provided. The computing system comprises a plurality of register file units provided according to the invention, execution units, and a switching network, wherein:
each register file unit is configured to receive the requests of a sub-thread group and provide the associated operands;
the execution units are configured to execute the program instructions corresponding to the sub-thread groups;
the switching network is configured to distribute the program instructions and associated operands of each sub-thread group to the execution units and to return the results obtained by the execution units to the corresponding register file units.
In one embodiment, the computing system of the present invention further comprises a thread group management unit for dividing the task to be processed into a plurality of sub-thread groups and distributing to the plurality of register file units.
In one embodiment of the computing system of the invention, for the plurality of sub-thread groups, the source operands are read from the respective register file units in sequence with a phase shift, such that the register sets of each register file unit have only one read operation or only one write operation per cycle.
In one embodiment of the computing system of the present invention, for a vector memory-access instruction, the data fetched by a thread from four consecutive addresses is written back to the register file unit simultaneously in the same cycle.
In one embodiment of the computing system of the present invention, the execution unit is shared by the plurality of sub-thread groups in a time-multiplexed manner.
In one embodiment of the computing system of the present invention, where the number of sub-thread groups is set to 4, when three source operands need to be read, the following steps are performed:
in a first clock cycle, a first sub-thread group reads a first source operand;
in a second clock cycle, the first sub-thread group reads the second source operand, and the second sub-thread group reads the first source operand;
in a third clock cycle, the first sub-thread group reads a third source operand, the second sub-thread group reads a second source operand, and the third sub-thread group reads the first source operand;
on a fourth clock cycle, the second sub-thread group reads the third source operand, the third sub-thread group reads the second source operand, and the fourth sub-thread group reads the first source operand.
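The four-cycle walkthrough above can be generated mechanically. The sketch below is an assumption-level model, not the patent's hardware (the helper name `schedule` is hypothetical, and groups/operands are numbered from 0 rather than "first"): group g starts its reads at cycle g, and no register file unit ever serves two reads in the same cycle.

```python
NUM_GROUPS, NUM_SRCS = 4, 3

def schedule():
    """Cycle -> list of (group, source) reads; group g starts at cycle g."""
    sched = {}
    for g in range(NUM_GROUPS):
        for s in range(NUM_SRCS):
            sched.setdefault(g + s, []).append((g, s))
    return sched

sched = schedule()

# No register file unit (one per sub-thread group) serves two reads
# in the same cycle.
for ops in sched.values():
    groups = [g for g, _ in ops]
    assert len(groups) == len(set(groups))

# The first and fourth cycles reproduce the steps listed above.
assert sched[0] == [(0, 0)]                       # first group reads SRC0
assert sorted(sched[3]) == [(1, 2), (2, 1), (3, 0)]
```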
According to a fourth aspect of the present invention, there is provided an electronic device comprising the register file unit provided by the present invention.
Compared with the prior art, the invention has the advantages that: the register file unit is formed by using the single-port memory, and the function of the multi-port memory can be simulated by using the single-port memory by combining a proper scheduling strategy, so that the design complexity of the register file unit is reduced, and the memory access performance is improved.
Drawings
The invention is illustrated and described in the following drawings by way of example only and without limitation of its scope, in which:
FIG. 1 is a diagram illustrating the architecture of a compute engine in a GPGPU in accordance with one embodiment of the present invention;
FIG. 2 is a diagram illustrating a computational core unit in a compute engine according to one embodiment of the invention;
FIG. 3 illustrates a schematic diagram of a register file unit, according to one embodiment of the present invention;
FIG. 4 is a diagram illustrating the organization of register data in register file units, according to one embodiment of the invention;
FIG. 5 illustrates a schematic diagram of the scheduling of read operations on a register file unit, according to one embodiment of the invention;
FIG. 6 shows a process diagram for reading an operand according to one embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions, design methods, and advantages of the present invention more apparent, the present invention will be further described in detail by specific embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not as a limitation. Thus, other examples of the exemplary embodiments may have different values.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
The structure and scheduling strategy of the register file unit will be described below by taking a GPGPU processor as an example.
FIG. 1 is a diagram illustrating a compute engine architecture, applicable to a GPGPU processor, according to one embodiment of the present invention. The compute engine 100 includes a thread group construction unit 110, an instruction cache unit 120, a thread group management unit 130, and a compute core unit 140, where the compute core unit 140 includes a register file unit.
The thread group construction unit 110 is configured to construct the task to be processed into a plurality of thread groups, and is communicatively connected to the thread group management unit 130 and the compute core unit 140, for example, initial location information of each thread in the thread group may be written into a register file unit in the compute core unit 140 and status information of the thread group may be written into the thread group management unit 130.
The thread group management unit 130 is configured to schedule a plurality of thread groups, and is communicatively connected to the instruction cache unit 120 and the compute core unit 140, for example, to schedule a thread group to access the instruction cache unit 120 to read instructions and to send instructions corresponding to the thread group to the compute core unit 140 for execution, where a plurality of threads in each thread group execute a same program instruction.
Taking a thread group of 16 threads as an example, the following description divides a single thread group into 4 sub-thread groups of 4 threads each, with each sub-thread group corresponding to one register file unit of the compute core unit 140; thread groups and sub-thread groups of other sizes can be handled in a similar manner.
FIG. 2 shows an embodiment of a computational core unit, in which the computational core unit 140 comprises a plurality of register file units (4 are shown, labeled register file units 211-214), a switching network 220, and a plurality of execution units (4 are shown, labeled execution units 231-234), where each register file unit corresponds to one sub-thread group (i.e., 4 threads).
Each register file unit is used for receiving program instructions (not shown) of the thread group management unit and storing operands, calculation results and the like related to the program instructions executed by the sub-thread groups. The program instruction may comprise a maximum of three operands which are stored in registers of the register file unit for access when executing the program, e.g. a multiply instruction comprising two operands and a multiply-accumulate instruction comprising three operands.
The register file unit processes read and write requests of the sub-thread groups, reads operands from and writes operands to the registers, and schedules the processing of each request to avoid write back conflicts.
The execution unit is used for executing program instructions, for example, executing multiply-accumulate instructions to obtain calculation results.
The switching network 220 is used for data exchange between the register file unit and the execution units, e.g. to send program instructions and related operands to the execution units for processing. The switching network 220 may be implemented by a crossbar-based network, and may be configured as a direct connection network or a tree network.
Based on the embodiment of FIG. 2, the data processing procedure is as follows: according to the program instruction obtained from the instruction cache unit, the thread group management unit reads the source operands from the corresponding register file unit; once the register file unit determines that all source operands have been obtained, the program instruction and its source operands are sent through the switching network to the execution units, and after execution the destination operand is written back through the switching network to the corresponding register file unit. For example, a multiply-accumulate program instruction involves three read requests and one write request at the register file unit.
In the present invention, within one register file unit, the registers are organized according to the number of threads in a sub-thread group, and the operands are scheduled according to a policy, enabling the register file unit to emulate multi-port registers with single-port registers. The organization and data scheduling of the registers in a register file unit are described below according to one embodiment of the invention.
FIG. 3 illustrates an internal block diagram of a register file unit 300, which includes a write operation schedule unit 310, a read operation schedule unit 320, and a plurality of register sets, 4 of which are shown, labeled as register sets 0-3, respectively, according to one embodiment of the present invention.
The write operation scheduling unit 310 is configured to process a write request, and schedule the write operands associated with the sub-thread group according to a certain rule (which will be described in detail below), so as to enable a write operation to be implemented by using a single-port register.
The read operation scheduling unit 320 is configured to process the read request, and schedule and organize the read operands associated with the sub-thread groups to enable the register file unit to implement the read operation using the single-port register.
Register sets 0-3 are used to store operands associated with sub-thread groups, including read operands and write operands, each register set containing a plurality of registers, and for sub-thread groups comprising 4 threads, the function of a multi-port memory can be emulated using a single-port memory by appropriately scheduling the read and write operands.
In one embodiment, the registers in the register file unit 300 are organized as in FIG. 4, divided into a number of register sets equal to the number of threads in a sub-thread group. For example, for a sub-thread group of 4 threads, the registers are divided into 4 storage areas, labeled register sets 0-3, and the threads of the sub-thread group are labeled T0, T1, T2, T3 in order. The organization of this example is as follows: in register set 0, the operands associated with the four threads T0, T1, T2, and T3 correspond to registers R4, R3, R2, and R1 respectively, and thread T0 additionally corresponds to register R0; in register set 1, the operands associated with T0, T1, T2, and T3 correspond to registers R1, R4, R3, and R2 respectively, and thread T1 corresponds to R0. The other register sets are organized similarly; see FIG. 4 for details.
The organization shown in FIG. 4 has the following features. From the register point of view, register R0 of all 4 threads is placed at the same position 0 of the four register sets, register R1 at the same position 1, and in general register Rn at the same position n of the four register sets. From the thread point of view, the registers of each thread are evenly distributed over the 4 register sets. For example, register R0 of thread T0 lies in register set 0, R1 in register set 1, R2 in register set 2, and R3 in register set 3; register R0 of thread T1 lies in register set 1, R1 in register set 2, R2 in register set 3, and R3 in register set 0. In short, register R0 of thread Tn (n < 4) lies in register set n, R1 in register set (n+1) % 4, R2 in register set (n+2) % 4, and R3 in register set (n+3) % 4.
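The placement rule just stated reduces to a one-line formula. The sketch below is illustrative only (the helper name `bank_of` is an assumption; the mapping `(n + k) % 4` is from the text) and checks the two properties of the FIG. 4 layout.

```python
NUM_BANKS = 4  # one register set per thread of the sub-thread group

def bank_of(thread, reg):
    """Register Rk of thread Tn is stored in register set (n + k) % 4,
    at position k within that set (the mapping stated above)."""
    return (thread + reg) % NUM_BANKS

# Register view: for a fixed register index k, the four threads occupy
# four distinct register sets, all at position k.
for k in range(8):
    assert {bank_of(t, k) for t in range(NUM_BANKS)} == set(range(NUM_BANKS))

# Thread view: each thread's R0-R3 are spread one-per-set over the
# four register sets, matching FIG. 4.
for t in range(NUM_BANKS):
    assert {bank_of(t, k) for k in range(NUM_BANKS)} == set(range(NUM_BANKS))
```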
With the register organization of fig. 4, a single-port memory can be used to implement the function of a multi-port memory for read and write operations through scheduling.
In one embodiment, for a register read operation, the source operands read out of the register sets appear as in FIG. 5(a): for thread T0, its register R0 is in the first column (counting from the left), R1 in the second, R2 in the third, R3 in the fourth, and R4 wraps back to the first column; similarly, for thread T1, its register R0 is in the second column, R1 in the third, R2 in the fourth, R3 in the first, and R4 back in the second. After the source operands are read out, the read operation scheduling unit rearranges them into the form of FIG. 5(b): R0, R1, R2, R3, and R4 associated with thread T0 are all moved into the first column, and correspondingly the source operands of T1 all end up in the second column, those of T2 in the third column, and those of T3 in the fourth column.
Note that R0, R1, R2, R3, R4, and the like shown in fig. 5(a) indicate register positions, and R0, R1, R2, R3, R4, and the like shown in fig. 5(b) indicate data read from the corresponding registers in fig. 5 (a).
After the sorting performed by the read operation scheduling unit, the operands of each thread sit in the same column, and threads correspond one-to-one with execution units. For example, thread T0 of the sub-thread group corresponds to execution unit 0, T1 to execution unit 1, T2 to execution unit 2, and T3 to execution unit 3. Because the scheduling unit has moved each thread's operands into its own column, the operands can be sent directly through the switching network to the corresponding execution units.
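The column adjustment can be modeled as a rotation. The sketch below is a hedged illustration, not the patent's hardware: `read_row` stands for the one-read-per-set access of FIG. 5(a), and `to_thread_order` (a hypothetical name) stands for the read operation scheduling unit's rearrangement into FIG. 5(b).

```python
NUM_BANKS = 4

def read_row(regfile, k):
    """One cycle: read position k of every set (one read per single port)."""
    return [regfile[b][k] for b in range(NUM_BANKS)]

def to_thread_order(bank_row, k):
    """Set b holds Rk of thread (b - k) % 4, so thread t's value is in
    set (t + k) % 4; pick it into column t for execution unit t."""
    return [bank_row[(t + k) % NUM_BANKS] for t in range(NUM_BANKS)]

# Lay out a register file as in FIG. 4: regfile[set][pos] = (thread, reg).
regfile = [[None] * 8 for _ in range(NUM_BANKS)]
for t in range(NUM_BANKS):
    for k in range(8):
        regfile[(t + k) % NUM_BANKS][k] = (t, k)

# After scheduling, column t always holds thread t's operand (FIG. 5(b)).
for k in range(8):
    assert to_thread_order(read_row(regfile, k), k) == [(t, k) for t in range(NUM_BANKS)]
```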
In one embodiment, a register write operation uses the reverse of the read procedure: when the write operation scheduling unit receives a write request, it arranges the write operands of the threads into the form shown in FIG. 5(a).
FIG. 6 illustrates the register read/write process using four sub-thread groups as an example, where the four sub-thread groups read their three source operands in sequence, staggered in phase by one cycle. Specifically, in clock cycle 0, the first sub-thread group reads the first source operand SRC0; in cycle 1, the first sub-thread group reads the second source operand SRC1 while the second sub-thread group reads SRC0; in cycle 2, the first sub-thread group reads the third source operand SRC2, the second reads SRC1, and the third reads SRC0; in cycle 3, the second sub-thread group reads SRC2, the third reads SRC1, and the fourth reads SRC0. This continues until all source operands have been read, after which the four sub-thread groups share the same set of execution units in a time-multiplexed manner. From the perspective of each sub-thread group, it reads its first source operand in its first cycle, its second in the next cycle, and its third in the cycle after that. With this scheduling, the register sets of each register file unit carry at most one read or write operation per cycle, so the register file unit can emulate a multi-port memory with single-port memory.
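A small model of this stagger (helper names assumed, groups numbered from 0) also shows why the execution units can be shared without conflict: group g finishes its three reads at cycle g + 2, so at most one group dispatches to the shared execution units per cycle.

```python
NUM_GROUPS, NUM_SRCS = 4, 3

reads = {}     # cycle -> (group, source-operand) register reads
dispatch = {}  # cycle -> group sending its operands to the execution units
for g in range(NUM_GROUPS):
    for s in range(NUM_SRCS):
        reads.setdefault(g + s, []).append((g, s))
    dispatch[g + NUM_SRCS] = g  # all three sources ready after cycle g + 2

# Cycle 2 matches FIG. 6: group 0 reads SRC2, group 1 SRC1, group 2 SRC0.
assert sorted(reads[2]) == [(0, 2), (1, 1), (2, 0)]

# Exactly one group reaches the shared execution units per cycle, in order,
# i.e. the execution units are time-multiplexed without conflict.
assert dispatch == {3: 0, 4: 1, 5: 2, 6: 3}
```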
For vector memory-access instructions, each thread can fetch four 32-bit words at a time, and the four words of each thread need to be written back to the registers simultaneously to maintain performance. With the register organization of FIG. 4 and the sorting performed by the write operation scheduling unit, the 32-bit data of any four consecutive destination addresses can be written back to the register sets simultaneously; for example, R0-R3 or R1-R4 can be written back at the same time, because the registers of any 4 consecutive destination addresses are evenly distributed over the four register sets. This way of organizing the registers avoids unnecessary hardware restrictions and allows register usage to be optimized more freely.
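The "any four consecutive addresses" property follows directly from the skewed mapping. The check below is illustrative only (the `bank_of` helper is an assumption; the `(t + k) % 4` mapping is from FIG. 4).

```python
NUM_BANKS = 4

def bank_of(thread, reg):
    """Skewed mapping of FIG. 4: Rk of thread Tn lives in set (n + k) % 4."""
    return (thread + reg) % NUM_BANKS

# Any four consecutive destination registers of a thread land in four
# distinct register sets, so all four 32-bit words can be written back
# in one cycle (e.g. R0-R3, or R1-R4, and so on).
for t in range(NUM_BANKS):
    for base in range(16):
        assert {bank_of(t, base + i) for i in range(4)} == set(range(NUM_BANKS))
```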
It should be understood that the above embodiments are for illustrative purposes only, the inventive concept is applicable to any number of sub-thread groups, the number of threads included in a sub-thread group may be any number, and the invention is applicable to the accessing of any number of bits of scalar data and vector data. In addition, the read operation scheduling unit and the write operation scheduling unit may be integrated into one scheduling unit.
The method for managing the register file unit or simulating the multi-port memory by using the single-port memory provided by the invention can be applied to any electronic device containing the register file unit, such as a desktop computer, a portable computer, a tablet computer, a smart phone or any other type of computing device (such as a GPGPU-based device). The electronic equipment can be applied to the fields of word processing, voice recognition and processing, multinational language translation, image recognition, biological feature recognition, intelligent control and the like, and can be used as intelligent computing processing equipment, robots, mobile equipment and the like.
It should be noted that, although the steps are described in a specific order, the steps are not necessarily performed in the specific order, and in fact, some of the steps may be performed concurrently or even in a changed order as long as the required functions are achieved.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that holds and stores the instructions for use by the instruction execution device. The computer readable storage medium may include, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, by software, and by a combination of software and hardware are equivalent.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (12)

1. A method of managing a register file unit, the register file unit being constituted by a single-port memory, the single-port memory being a register for providing operands associated with a thread, the method comprising the steps of:
step 1: allocating associated registers for a plurality of threads and organizing the allocated registers into a plurality of register sets, wherein the number of register sets is equal to the number of threads of the plurality of threads;
step 2: evenly distributing the registers associated with each thread within the plurality of register sets, and storing the respective registers of the same operand associated with different threads in the same location of the plurality of register sets;
and step 3: for the read-write operation of the register, the arrangement mode of the operands related to the threads is scheduled, so that the register groups of the register file unit only have one read operation or one write operation in the same clock cycle.
2. The method of claim 1, wherein, in step 3, for a read operation, operands associated with the plurality of threads are read from the plurality of register banks, and the operands associated with each thread are scheduled into a corresponding group for dispatch and execution.
3. The method of claim 1, wherein, in step 3, for a write operation, operands associated with each of the plurality of threads are evenly distributed across the plurality of register banks and placed at the corresponding locations in the plurality of register banks.
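Claims 1 to 3 do not fix a particular placement function. One concrete layout that satisfies them can be sketched as follows; this is a hypothetical example, not taken from the patent: with T threads and T single-port banks, register r of thread t is placed in bank (t + r) mod T at row r.

```python
# A minimal sketch (hypothetical, not specified by the patent) of one
# bank-swizzled layout satisfying claims 1-3: with T threads and T
# single-port banks, register r of thread t lives in bank (t + r) % T, row r.
T = 4          # number of threads == number of register banks (claim 1)
R = 8          # registers per thread (illustrative)

def location(thread, reg):
    """Return (bank, row) for register `reg` of `thread`."""
    return ((thread + reg) % T, reg)

# Each thread's registers are spread evenly across the banks (claim 2) ...
for t in range(T):
    banks_used = [location(t, r)[0] for r in range(R)]
    assert all(banks_used.count(b) == R // T for b in range(T))

# ... and the same register index of different threads sits at the same row
# but in T different banks, so one access can serve all threads without a
# bank conflict (claim 3 / step 3).
for r in range(R):
    rows = {location(t, r)[1] for t in range(T)}
    banks = {location(t, r)[0] for t in range(T)}
    assert rows == {r} and len(banks) == T
```

Because each bank is single-ported, this even spreading is what allows a full set of per-thread operands to be read in one cycle with one access per bank.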
4. A register file unit, comprising:
a plurality of register banks for providing registers that hold operands associated with a plurality of threads, wherein the registers are constituted by single-port memories, the number of register banks is equal to the number of threads in the plurality of threads, the registers associated with each thread are evenly distributed across the plurality of register banks, and the respective registers that hold the same operand associated with different threads are stored at the same location in the plurality of register banks;
a scheduling unit for scheduling, for read and write operations on the registers, the arrangement of the operands associated with the threads, so that each register bank of the register file unit performs only one read operation or only one write operation in the same clock cycle.
5. The register file unit of claim 4, wherein the scheduling unit comprises a read operation scheduling unit and a write operation scheduling unit, wherein:
for a read operation, the read operation scheduling unit reads operands associated with the plurality of threads from the plurality of register banks and schedules the operands associated with each thread into a corresponding group for dispatch and execution;
for a write operation, the write operation scheduling unit evenly distributes the operands associated with each of the plurality of threads across the plurality of register banks and places them at the corresponding locations in the plurality of register banks.
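The gather/scatter performed by the read and write operation scheduling units of claim 5 amounts to a rotation between thread order and bank order. The sketch below assumes a hypothetical (t + r) mod T bank swizzle (one possible layout consistent with claim 4, not specified by the patent); `write_schedule` and `read_schedule` are illustrative names, not from the patent.

```python
# Hypothetical sketch of the permutation done by the read/write scheduling
# units of claim 5, assuming register r of thread t lives in bank (t + r) % T.
T = 4

def write_schedule(values_by_thread, reg):
    """Scatter one operand per thread into bank order for register `reg`."""
    banks = [None] * T
    for t, v in enumerate(values_by_thread):
        banks[(t + reg) % T] = v        # thread t's word goes to bank (t+reg)%T
    return banks

def read_schedule(values_by_bank, reg):
    """Gather the banks' outputs back into thread order for register `reg`."""
    return [values_by_bank[(t + reg) % T] for t in range(T)]

# A write followed by a read of the same register index restores thread order.
vals = ['t0', 't1', 't2', 't3']
for r in range(8):
    assert read_schedule(write_schedule(vals, r), r) == vals
```

In hardware this permutation would be a small crossbar or rotator rather than an indexed copy, but the data movement it performs is the same.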
6. A computing system, comprising a plurality of register file units according to any one of claims 4 to 5, execution units, and a switching network, wherein:
each register file unit is configured to receive requests from a sub-thread group and provide the associated operands;
the execution units are configured to execute the program instructions corresponding to the sub-thread groups;
the switching network is configured to distribute the program instructions and associated operands of each sub-thread group to the execution units, and to distribute the operation results produced by the execution units to the corresponding register file units.
7. The system of claim 6, further comprising a thread group management unit configured to divide a task to be processed into a plurality of sub-thread groups and distribute them to the plurality of register file units.
8. The system of claim 7, wherein, for the plurality of sub-thread groups, a plurality of source operands are read from the corresponding register file units in sequence with a phase shift, such that each register bank of each register file unit performs only one read operation or only one write operation per cycle.
9. The system of claim 6, wherein, for a vector access instruction, the data of four consecutive addresses fetched by a thread are written back to the register file unit simultaneously in the same cycle.
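Claim 9's single-cycle writeback of four words is consistent with a swizzled layout: under a hypothetical (t + r) mod T placement (an assumption, not stated in the claim), four consecutive registers of one thread land in four distinct banks, so the four single-port banks each absorb exactly one write.

```python
# Sketch (assuming a hypothetical (t + r) % T swizzle) of why a 4-word vector
# writeback can complete in one cycle: consecutive registers r..r+3 of one
# thread fall into four distinct banks, i.e. one write per single-port bank.
T = 4
for t in range(T):
    for r in range(0, 16, 4):
        banks = [(t + r + i) % T for i in range(4)]
        assert sorted(banks) == [0, 1, 2, 3]   # each bank sees exactly one write
```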
10. The system of claim 8, wherein the execution units are shared by the plurality of sub-thread groups in a time-multiplexed manner.
11. The system of claim 10, wherein the number of sub-thread groups is set to 4, and when three source operands need to be read, the following steps are performed:
in a first clock cycle, the first sub-thread group reads the first source operand;
in a second clock cycle, the first sub-thread group reads the second source operand, and the second sub-thread group reads the first source operand;
in a third clock cycle, the first sub-thread group reads the third source operand, the second sub-thread group reads the second source operand, and the third sub-thread group reads the first source operand;
in a fourth clock cycle, the second sub-thread group reads the third source operand, the third sub-thread group reads the second source operand, and the fourth sub-thread group reads the first source operand.
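The phase-shifted schedule of claim 11 can be simulated in a few lines: each sub-thread group starts one cycle after the previous one, and group g reads operand op in cycle g + op (0-based). The simulation below reproduces the four cycles the claim enumerates.

```python
# Simulation of the phase-shifted read schedule of claim 11: 4 sub-thread
# groups, 3 source operands, each group starting one cycle after the last.
GROUPS, OPERANDS = 4, 3
schedule = {}
for g in range(GROUPS):
    for op in range(OPERANDS):
        cycle = g + op                       # group g begins at cycle g
        schedule.setdefault(cycle, []).append((g, op))

# Cycles 0-3 match the claim's enumeration (0-based groups and operands):
assert schedule[0] == [(0, 0)]
assert schedule[1] == [(0, 1), (1, 0)]
assert schedule[2] == [(0, 2), (1, 1), (2, 0)]
assert schedule[3] == [(1, 2), (2, 1), (3, 0)]

# No group ever issues more than one operand read in the same cycle, which is
# what lets the groups time-multiplex the execution units (claim 10).
for reads in schedule.values():
    groups = [g for g, _ in reads]
    assert len(groups) == len(set(groups))
```

The tail of the pipeline (cycles 4 and 5, where the last groups finish their second and third operands) is implied by the same pattern even though the claim only spells out the first four cycles.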
12. An electronic device, characterized in that the electronic device comprises a register file unit according to any one of claims 4 to 5.
CN201910052633.XA 2019-01-21 2019-01-21 Method for managing register file unit Active CN111459543B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910052633.XA CN111459543B (en) 2019-01-21 2019-01-21 Method for managing register file unit

Publications (2)

Publication Number Publication Date
CN111459543A CN111459543A (en) 2020-07-28
CN111459543B true CN111459543B (en) 2022-09-13

Family

ID=71679088

Country Status (1)

Country Link
CN (1) CN111459543B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115129369A (en) * 2021-03-26 2022-09-30 上海阵量智能科技有限公司 Command distribution method, command distributor, chip and electronic device
CN114546329B (en) * 2022-03-01 2023-07-18 上海壁仞智能科技有限公司 Method, apparatus and medium for implementing data parity rearrangement
CN115129480B (en) * 2022-08-26 2022-11-08 上海登临科技有限公司 Scalar processing unit and access control method thereof

Citations (2)

Publication number Priority date Publication date Assignee Title
CN103218208A (en) * 2011-12-06 2013-07-24 辉达公司 System and method for performing shaped memory access operations
CN103257931A (en) * 2011-12-22 2013-08-21 辉达公司 Shaped register file reads

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
US6546461B1 (en) * 2000-11-22 2003-04-08 Integrated Device Technology, Inc. Multi-port cache memory devices and FIFO memory devices having multi-port cache memory devices therein
CN1853379A (en) * 2002-12-31 2006-10-25 肯奈克斯特公司 System and method for providing quality of service in asynchronous transfer mode cell transmission
US7339592B2 (en) * 2004-07-13 2008-03-04 Nvidia Corporation Simulating multiported memories using lower port count memories
US8533435B2 (en) * 2009-09-24 2013-09-10 Nvidia Corporation Reordering operands assigned to each one of read request ports concurrently accessing multibank register file to avoid bank conflict
US8458446B2 (en) * 2009-09-30 2013-06-04 Oracle America, Inc. Accessing a multibank register file using a thread identifier
US10303472B2 (en) * 2016-11-22 2019-05-28 Advanced Micro Devices, Inc. Bufferless communication for redundant multithreading using register permutation

Non-Patent Citations (1)

Title
A low-port-count, low-power register file design based on active cycles; Zhao Yulai et al.; Chinese Journal of Computers (《计算机学报》); 2008-02-15 (No. 02); pp. 117-126 *

Similar Documents

Publication Publication Date Title
US9830156B2 (en) Temporal SIMT execution optimization through elimination of redundant operations
US5832290A (en) Apparatus, systems and method for improving memory bandwidth utilization in vector processing systems
US10255228B2 (en) System and method for performing shaped memory access operations
JP5422614B2 (en) Simulate multiport memory using low port count memory
CN109997115B (en) Low power and low latency GPU co-processor for persistent computation
US8639882B2 (en) Methods and apparatus for source operand collector caching
US10007527B2 (en) Uniform load processing for parallel thread sub-sets
CN111459543B (en) Method for managing register file unit
CN107408040A (en) It is configured with executing out the vector processor for operating variable-length vector
CN107766079B (en) Processor and method for executing instructions on processor
CN111656339B (en) Memory device and control method thereof
US9626191B2 (en) Shaped register file reads
US9286114B2 (en) System and method for launching data parallel and task parallel application threads and graphics processing unit incorporating the same
US20150082007A1 (en) Register mapping with multiple instruction sets
US11934827B2 (en) Partition and isolation of a processing-in-memory (PIM) device
JP2008524723A (en) Evaluation unit for flag register of single instruction multiple data execution engine
US10409610B2 (en) Method and apparatus for inter-lane thread migration
US20080082797A1 (en) Configurable Single Instruction Multiple Data Unit
US8413151B1 (en) Selective thread spawning within a multi-threaded processing system
TW201539189A (en) Method, apparatus and computer readable recording medium for preventing bank conflict in memory
US8959497B1 (en) System and method for dynamically spawning thread blocks within multi-threaded processing systems
CN115248701A (en) Zero-copy data transmission device and method between processor register files
US11669489B2 (en) Sparse systolic array design
KR102644951B1 (en) Arithmetic Logic Unit Register Sequencing
US11822541B2 (en) Techniques for storing sub-alignment data when accelerating Smith-Waterman sequence alignments

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant