CN113377438A - Processor and data reading and writing method thereof - Google Patents



Publication number
CN113377438A
Authority
CN
China
Prior art keywords: clock frequency, register, threads, clock, instruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110927232.1A
Other languages
Chinese (zh)
Other versions
CN113377438B (en)
Inventor
李颖 (Li Ying)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Muxi Integrated Circuit Shanghai Co ltd
Original Assignee
Muxi Integrated Circuit Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Muxi Integrated Circuit Shanghai Co ltd filed Critical Muxi Integrated Circuit Shanghai Co ltd
Priority to CN202110927232.1A priority Critical patent/CN113377438B/en
Publication of CN113377438A publication Critical patent/CN113377438A/en
Application granted granted Critical
Publication of CN113377438B publication Critical patent/CN113377438B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30098 Register arrangements
    • G06F 9/3012 Organisation of register space, e.g. banked or distributed register file
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38 Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F 9/3836 Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F 9/3851 Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution from multiple instruction streams, e.g. multistreaming
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F 9/505 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses a processor and a data reading and writing method thereof, relating to the field of high-performance computing. The processor converts a first clock frequency into a second, lower clock frequency through a clock frequency division circuit; the store queue and the compute unit group operate at the first clock frequency, while the register set operates at the second clock frequency. The register set comprises static random access memory including a group of register banks corresponding to a plurality of threads executing an instruction, wherein each thread corresponds to at least two register banks. The register set can read at least two operands for each of at least some of the plurality of threads within one clock cycle of the second clock frequency. The processor and its data reading and writing method reduce the power consumption of the processing chip and improve the data read/write performance of its SRAM registers.

Description

Processor and data reading and writing method thereof
Technical Field
The application relates to the field of high-performance computing, and in particular to a processor and a data reading and writing method thereof.
Background
High-performance computing chips such as central processing units (CPUs) and graphics processing units (GPUs) have a strong demand for high limit frequencies. For example, the AMD NAVI10 GPU can reach 2.5 GHz, and the AMD Ryzen 9 3950X CPU reaches single-core frequencies of up to 4.7 GHz. In the prior art, when designing a CPU or GPU processor chip, achieving higher performance and higher frequency requires choosing a static random-access memory (SRAM) design containing more transistors, that is, an SRAM design with larger area and higher power consumption, so that SRAM devices in high-performance computing become a bottleneck restricting the maximum frequency and performance of the processor.
Disclosure of Invention
In view of this, the present application provides a processor and a data reading and writing method thereof, which can adopt SRAM devices driven at a lower clock frequency, improving the performance and main frequency of the processing chip while reducing its power consumption.
In a first aspect, the present application provides a processor, comprising:
a register set comprising static random access memory including a group of register banks corresponding to a plurality of threads executing instructions, wherein each thread corresponds to at least two register banks;
a store queue for storing operands read from the register set for the plurality of threads executing the instruction, and output results of the plurality of threads ready to be written to the register set;
a compute unit group comprising a plurality of compute units to execute a plurality of threads of the instruction based on operands in the store queue and to output results to the store queue;
the clock frequency division circuit is used for converting a first clock frequency into a second clock frequency lower than the first clock frequency, the storage queue and the calculation unit group work at the first clock frequency, and the register group works at the second clock frequency;
wherein the register set is capable of reading at least two operands of each of at least some of the plurality of threads in one clock cycle of the second clock frequency.
In a preferred embodiment, the register set is capable of writing the output result of each of at least some of the plurality of threads during one clock cycle of the second clock frequency.
In a preferred embodiment, the register set is capable of writing at least two output results of each of at least some of the plurality of threads within one clock cycle of the second clock frequency.
In a preferred embodiment, the clock frequency division circuit includes an even-number frequency-dividing circuit for converting the first clock frequency to a second clock frequency that is an even division of the first clock frequency.
In a preferred embodiment, the even-number frequency-dividing circuit includes a divide-by-two circuit for converting the first clock frequency to a second clock frequency that is a divide-by-two of the first clock frequency.
In a preferred embodiment, the register set is configured with two register banks per thread, and the register set can read two operands of each of at least some of the plurality of threads in one clock cycle of the second clock frequency, which is a divide-by-two of the first clock frequency.
In a preferred embodiment, the register set is capable of writing at most two output results for each of at least some of the plurality of threads in one clock cycle of the second clock frequency, which is a divide-by-two of the first clock frequency.
In a preferred embodiment, the even-number frequency-dividing circuit includes a divide-by-four circuit for converting the first clock frequency to a second clock frequency that is a divide-by-four of the first clock frequency.
In a preferred embodiment, the register set is configured with four register banks per thread, and the register set can read at most four operands of each of at least some of the plurality of threads in one clock cycle of the second clock frequency, which is a divide-by-four of the first clock frequency.
In a preferred embodiment, the register set is capable of writing at most four output results for each of at least some of the plurality of threads in one clock cycle of the second clock frequency, which is a divide-by-four of the first clock frequency.
In a preferred embodiment, the processor further comprises a level shift circuit for converting a first voltage to a second voltage lower than the first voltage, the store queue and the bank of compute cells operating at the first voltage, and the set of registers operating at the second voltage.
In a preferred embodiment, the store queue comprises an operand queue and an output result queue.
In a preferred embodiment, the width of the operand queue and of the output result queue is the same as the number of register banks in the group corresponding to the plurality of threads executing instructions.
In a second aspect, the present application provides a data reading and writing method applied to a processor including a register set, a store queue, a compute unit group, and a clock frequency division circuit, where the store queue and the compute unit group operate at a first clock frequency and the register set operates at a second clock frequency lower than the first clock frequency. The method includes the following steps:
reading out at least two operands of each of at least some of the threads executing an instruction from the register set to the store queue in one clock cycle of the second clock frequency;
judging whether all operands of each of the at least some threads executing the instruction have been read out to the store queue; if not, continuing to read the remaining operands of each of the at least some threads from the register set to the store queue in the next clock cycle of the second clock frequency;
after all operands of each of the at least some threads executing the instruction have been read out to the store queue, executing, by the compute unit group, a multi-stage pipeline for the at least some threads at clock cycles of the first clock frequency;
after the multi-stage pipeline has been executed for the at least some threads, writing the output result of each of the at least some threads into the register set in the next clock cycle of the second clock frequency.
In a preferred embodiment, writing the output results of each of the at least some threads executing instructions to the register set comprises writing at least two output results of each of the at least some threads to the register set.
In a preferred embodiment, the second clock frequency is an even division of the first clock frequency.
In a preferred embodiment, the second clock frequency is a divide-by-two of the first clock frequency, and reading out at least two operands of each of at least some of the threads executing the instruction from the register set to the store queue in one clock cycle of the second clock frequency comprises: reading two operands of each of the at least some threads from the register set to the store queue in one clock cycle of the second clock frequency.
In a preferred embodiment, writing the output result of each of the at least some threads to the register set in the next clock cycle of the second clock frequency comprises: writing at most two output results for each of the at least some threads to the register set in the next clock cycle of the second clock frequency.
In a preferred embodiment, the second clock frequency is a divide-by-four of the first clock frequency, and reading out at least two operands of each of at least some of the threads executing the instruction from the register set to the store queue in one clock cycle of the second clock frequency comprises: reading at most four operands of each of the at least some threads from the register set to the store queue in one clock cycle of the second clock frequency.
In a preferred embodiment, writing the output result of each of the at least some threads to the register set in the next clock cycle of the second clock frequency comprises: writing at most four output results for each of the at least some threads to the register set in the next clock cycle of the second clock frequency.
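The read/check/execute/write-back steps of the method above can be sketched as a minimal cycle-count model. This is an illustration only; the function name and the two-banks-per-thread default are invented for the sketch and are not part of the disclosure.

```python
# Minimal cycle-count model of the claimed flow (illustrative; names invented).

def run_instruction(num_operands, banks_per_thread=2):
    """Return (read_cycles, write_cycles) in divided-clock (SRAM) cycles."""
    # Read phase: up to `banks_per_thread` operands per thread are read from
    # the register set into the store queue in each SRAM clock cycle; if
    # operands remain, reading continues in the next SRAM cycle.
    read_cycles = -(-num_operands // banks_per_thread)  # ceiling division
    # Execute phase runs on the faster first clock frequency (not counted here).
    # Write-back: the output result is written in the next SRAM cycle.
    write_cycles = 1
    return read_cycles, write_cycles

print(run_instruction(2))  # (1, 1): a two-operand multiply needs one read cycle
print(run_instruction(3))  # (2, 1): a three-operand FMA needs a second read cycle
```

With four banks per thread (the divide-by-four embodiment), even a three-operand instruction completes its reads in a single divided cycle.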
Compared with the prior art, the present application has the following beneficial effects: by reducing the clock frequency of the SRAM registers in the processor and expanding the number of register banks read and written by the parallel threads, each thread executing an instruction can read more operands within one clock cycle of the SRAM register set. Because the frequency of the SRAM registers is reduced, they can be designed with fewer transistors, reducing chip power consumption while achieving high-performance SRAM reads and writes. Moreover, even when the SRAM storage capacity is kept unchanged, a higher data read/write bandwidth per clock cycle can be achieved.
Drawings
The features, objects, and advantages of the present application will be more fully understood from the following detailed description of non-limiting embodiments, taken in conjunction with the accompanying drawings, in which:
fig. 1 is a schematic diagram of an exemplary structure of a GPU chip 100 of the prior art;
FIG. 2 is a schematic diagram of an exemplary architecture of a single-instruction multiple-thread (SIMT) processor having 8 computation units;
FIG. 3 is a diagram of a four-stage pipeline for a Fused Multiply-Add (FMA) instruction in a GPU;
FIG. 4 is a timing diagram of the single instruction multithreaded processor of FIG. 2 executing two sequential two-operand multiply instructions;
fig. 5 is a schematic structural diagram of a processor 500 according to a first embodiment of the present application;
fig. 6 is a schematic structural diagram of a processor 600 according to a second embodiment of the present application;
FIG. 7 is a flow chart illustrating a data reading and writing method according to an embodiment of the present application;
FIG. 8 is a timing diagram of processing two consecutive two-operand multiply instructions according to an embodiment of the present application;
FIG. 9 is a timing diagram illustrating the processing of two consecutive FMA three-operand instructions according to an embodiment of the present application;
FIG. 10 is a graph comparing performance according to embodiments of the present application with that of the prior art.
Detailed Description
The technical solutions of the present application are clearly and completely described below by way of embodiments and with reference to the accompanying drawings, but the present application is not limited to the embodiments described below. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the following embodiments, fall within the scope of protection of the present application. For the sake of clarity, parts not relevant to the description of the exemplary embodiments have been omitted in the drawings.
It will be understood that terms such as "including" or "having," and the like, in this application are intended to specify the presence of stated features, integers, steps, acts, components, or groups thereof, but do not preclude the presence or addition of one or more other features, integers, steps, acts, components, or groups thereof. In this application "plurality" may generally be interpreted as meaning two or more.
Fig. 1 is a diagram of an exemplary structure of a GPU chip 100 in the related art. As shown in fig. 1, the GPU chip 100 generally comprises one or more stream processors 110; each stream processor 110 further comprises a scheduler 111, one or more single-instruction multiple-thread processors 112, one or more level one caches 113, a memory management unit 114, and a shared storage 115. The stream processor 110 reads and writes data from and to one or more level two caches 117 and a PCIE controller 120 through a crossbar network 116. In addition, the GPU chip 100 may further include a video codec 121 and/or other processing cores (not shown in the figure).
In the GPU chip 100, in order to achieve high-performance access to data, SRAM registers are used in large quantities, such as register sets (also called register files) in the single-instruction multi-thread processor 112, a level one cache 113 (instruction cache, constant cache, data cache), a shared storage 115, a memory management unit 114, and a large number of queues (not shown in the figure).
As shown in fig. 2, the SIMT processor 112 generally includes a register set 210, an operand collector 220, and a plurality of computation units 230. The register bank 210 is implemented as an SRAM register, and the operand collector 220 and the calculation unit 230 are implemented as combinational logic.
Fig. 2 exemplarily depicts a SIMT processor architecture with 8 computation units. In this example, the SIMT processor contains 8 physical computation units and 32 physical threads T0-T31 (Thread0-31 for short), and uses an instruction issued over four clock cycles and a four-stage pipeline design. Each of the 32 physical threads T0-T31 has 128, 192, or 256 32-bit SRAM registers. When dual-ported SRAM registers are used, 32 bits of data can be read and 32 bits written simultaneously in each clock cycle.
As shown in FIG. 2, the 32 physical threads T0-T31 correspond to the 32 register banks B0-B31, respectively. If the depth of each register bank is 256, the register set 210 is a storage structure of 32 register banks, each 256 entries deep and 32 bits wide.
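As an illustration only, the 32-bank, 256-deep, 32-bit register set organization described above can be modelled as follows; all names in the sketch are invented and not part of the disclosure.

```python
# Illustrative model of the FIG. 2 register set: 32 threads, one register bank
# per thread, each bank 256 entries deep and 32 bits wide (names invented).

NUM_THREADS = 32
BANK_DEPTH = 256
WORD_BITS = 32

register_set = [[0] * BANK_DEPTH for _ in range(NUM_THREADS)]  # banks B0..B31

def read(thread, index):
    return register_set[thread][index]

def write(thread, index, value):
    # Clamp to the 32-bit word width of the SRAM register.
    register_set[thread][index] = value & ((1 << WORD_BITS) - 1)

write(0, 5, 7)
print(read(0, 5))                                  # 7
print(NUM_THREADS * BANK_DEPTH * WORD_BITS // 8)   # 32768 bytes of storage
```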
The operand collector 220 generally includes an operand queue and an output result queue for instruction execution, used to read instruction operands from the corresponding register banks of the register set 210 and to write the output result data of instructions back to them.
When the operands of an instruction have been read from the corresponding register banks of the register set into the corresponding operand queues, i.e., all operands of the instruction are ready, the instruction is sent to the computation units 230 for execution.
The calculation unit 230 may include an integer execution unit, a floating point execution unit, a transcendental function execution unit, a tensor calculation unit, and other calculation units. Each computation unit typically comprises three operand input ports, i.e. three data can be read in per clock cycle.
Generally, in a system-on-chip design, an IP core typically uses the same operating voltage and clock frequency for its SRAM registers and its combinational logic components. For example, the register set 210 in fig. 2 uses the same operating voltage VDD and clock signal CLK as the operand collector 220 and the computation units 230.
Typically, the SIMT instruction set of a GPU chip can be divided into single-operand instructions, double-operand instructions, triple-operand instructions, and asynchronous instructions. A single-, double-, or triple-operand instruction corresponds to reading one to three data from the SRAM registers; the write-back clock of an asynchronous instruction is not fixed, and an asynchronous instruction can be regarded as an SRAM register write of one or more data.
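The instruction classes above map to SRAM register read/write counts roughly as follows. This tabulation is a hypothetical restatement of the preceding sentence; the dictionary keys are invented, and the open-ended asynchronous write count mirrors the unfixed write-back clock.

```python
# Hypothetical read/write profile per SIMT instruction class: an N-operand
# instruction reads N data from the SRAM registers; an asynchronous
# instruction writes one or more data at a non-fixed clock.

INSTRUCTION_PROFILE = {
    "single_operand": {"register_reads": 1},
    "double_operand": {"register_reads": 2},
    "triple_operand": {"register_reads": 3},
    "asynchronous":   {"register_writes": "one or more, write clock not fixed"},
}

print(INSTRUCTION_PROFILE["triple_operand"]["register_reads"])  # 3
```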
In a processing chip such as a CPU or GPU, instructions are generally executed in a multi-stage pipeline in the computation units. FIG. 3 is a four-stage pipeline diagram of a fused multiply-add (FMA) instruction in a GPU. As shown in FIG. 3, before an FMA instruction is executed, operands A, B, and C are first read into registers, and then the four-stage pipeline represented by S0, S1, S2, and S3 is executed.
The first pipeline stage S0 mainly implements the multiplication of operands A and B and the exponent alignment of the three operands A, B, and C;
the second pipeline stage S1 mainly implements the addition of the S0 multiplication result and C, together with leading-zero prediction;
the third pipeline stage S2 normalizes the result of the second stage;
the fourth pipeline stage S3 implements the carry and normalization of the third-stage result and outputs the result in the next clock cycle.
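Behaviourally, the four stages together compute A x B + C. The following sketch folds each hardware stage into one Python step; exponent alignment, leading-zero prediction, and carry/rounding are abstracted into ordinary float arithmetic, which is a deliberate simplification.

```python
# Behavioural sketch of the four-stage FMA pipeline (simplified assumption:
# per-stage hardware detail is folded into Python float arithmetic).

def fma_pipeline(a, b, c):
    product = a * b        # S0: multiply A and B (plus exponent alignment)
    raw_sum = product + c  # S1: add the S0 product to C (leading-zero prediction)
    normalized = raw_sum   # S2: normalize the S1 result
    return normalized      # S3: carry/normalize, output on the next clock cycle

print(fma_pipeline(2.0, 3.0, 4.0))  # 10.0, i.e. 2*3+4
```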
Typical two-operand instructions include multiply, add, and compare instructions. FIG. 4 is a four-stage pipeline timing diagram of the SIMT processor of FIG. 2 executing two consecutive two-operand multiply instructions. Since the register set 210 is clocked at the same frequency as the combinational logic of the operand collector 220 and the computation units 230, and each thread corresponds to one register bank, each thread can read one datum and write one datum per clock cycle.
As shown in FIG. 4, in the first clock cycle of the system clock CLK, operand A of Thread0-7 executing the first multiply instruction is read;
in the second clock cycle, operand B of Thread0-7 and operand A of Thread8-15 executing the first multiply instruction are read;
in the third clock cycle, operand B of Thread8-15 and operand A of Thread16-23 executing the first multiply instruction are read;
in the fourth clock cycle, operand B of Thread16-23 and operand A of Thread24-31 executing the first multiply instruction are read;
in the fifth clock cycle, operand A of Thread0-7 executing the second multiply instruction and operand B of Thread24-31 executing the first multiply instruction are read;
in the sixth clock cycle, operand B of Thread0-7 and operand A of Thread8-15 executing the second multiply instruction are read, and the output result W of Thread16-23 executing the first multiply instruction is written;
in the seventh clock cycle, operand B of Thread8-15 and operand A of Thread16-23 executing the second multiply instruction are read;
in the eighth clock cycle, operand B of Thread16-23 and operand A of Thread24-31 executing the second multiply instruction are read;
in the ninth clock cycle, the output result W of Thread0-7 executing the first multiply instruction is written, and operand B of Thread24-31 executing the second multiply instruction is read;
in the tenth clock cycle, the output result W of Thread8-15 executing the first multiply instruction is written;
in the eleventh clock cycle, the output result W of Thread16-23 executing the first multiply instruction is written;
in the twelfth clock cycle, the output result W of Thread24-31 executing the first multiply instruction is written;
in the thirteenth clock cycle, the output result W of Thread0-7 executing the second multiply instruction is written;
in the fourteenth clock cycle, the output result W of Thread8-15 executing the second multiply instruction is written;
in the fifteenth clock cycle, the output result W of Thread16-23 executing the second multiply instruction is written;
in the sixteenth clock cycle, the output result W of Thread24-31 executing the second multiply instruction is written.
as shown in FIG. 4, the four-stage pipeline in which the compute unit 230 executes the instructions described above is also executed at the clock frequency of the system clock CLK. Wherein the Thread0-7 executing the first multiply instruction requires waiting for the operands A and B to be ready, and therefore, in order to clock align the four-stage pipeline, the first stage pipeline (S0) of the Thread0-7 executing the first multiply instruction delays execution by four clock cycles; then, the second stage pipeline (S1), the third stage pipeline (S2), and the fourth stage pipeline (S3) of the first multiply instruction are sequentially executed, and after the four stage pipeline is completed, the output result W of the instruction is written into the output result queue of the operand collector 220.
As mentioned above, the register set 210 in the SIMT processor shown in fig. 2 can read or write only one datum per thread per clock cycle. Together with the demands of high-performance computing on chip frequency, this forces the SRAM registers constituting the register set to adopt a larger-area, higher-power design, which in turn affects the power consumption and limit frequency of the processor chip.
It should be noted that the above description takes the SIMT processor in a GPU chip only as an example; it does not mean that the technical solutions of the embodiments of the present application are applicable only to the GPU chip and SIMT processor shown in figs. 1 and 2, and the structures of the GPU and SIMT processor shown above should not be understood as limiting the protection scope of the present application.
Fig. 5 is a schematic structural diagram of a processor 500 according to a first embodiment of the present application. As shown in fig. 5, the processor 500 includes:
the register bank 510, including SRAM registers, includes a set of register banks 511 corresponding to a plurality of threads executing instructions.
Each thread corresponds to at least two register banks. Fig. 5 schematically shows an implementation with 32 threads T0-T31 and two register banks B0 and B1 per thread. However, the number of threads corresponding to the group of register banks in the register set 510 depends on the number of parallel threads the processor supports for executing an instruction, which may be 16, 32, 64, or another value, and is not limited to the 32 threads schematically depicted in fig. 5; likewise, the number of register banks per thread may be greater than two, according to the number of data each thread must read in one clock cycle.
A storage queue 520 connected to the register set 510 for storing operands corresponding to the plurality of threads executing the instruction read from the register set 510 and output results of the plurality of threads executing the instruction ready to be written to the register set 510.
A compute unit group 530, coupled to the store queue 520, includes multiple compute units 531 for executing multiple threads of instructions based on operands in the store queue 520 and outputting results to the store queue 520.
The clock divider circuit 540 is connected to the system clock signal CLK, divides it into a divided clock signal CLKM with a frequency lower than the system clock frequency, and supplies CLKM to the register set 510. The combinational logic of the store queue 520 and the compute unit group 530 operates at the clock frequency of the system clock signal CLK, and the register set 510 operates at the divided clock frequency of the divided clock signal CLKM.
Since the SRAM registers of the register set employ a lower, frequency-divided clock frequency, and each thread corresponds to at least two register banks, the register set 510 can read at least two operands of each of at least some of the plurality of threads within one clock cycle of the divided clock frequency.
According to the embodiments of the present application, the clock frequency of the SRAM registers in the processor is reduced and the number of register banks read and written by the parallel threads is expanded, so that each thread executing an instruction can read more operands in one clock cycle of the register set. Even if the SRAM storage capacity is kept unchanged, increasing the number of register banks achieves a higher data read/write bandwidth per clock cycle. In addition, because the SRAM frequency is reduced, a design with fewer transistors can be used, reducing power consumption while achieving high-performance SRAM register reads and writes.
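The bandwidth-equivalence argument can be checked with simple cycle accounting. The constants below mirror the FIG. 4 example (two two-operand multiply instructions across four groups of eight threads); the comparison is an editor's sketch, not a claim of the application.

```python
# Cycle accounting: 2 two-operand multiply instructions over 4 groups of
# 8 threads (a sketch; the pipeline and write-back phases are omitted).

GROUPS, OPERANDS, INSTRUCTIONS = 4, 2, 2

# Baseline (FIG. 2): one bank per thread, one operand read per group per
# system-clock cycle.
baseline_read_cycles = INSTRUCTIONS * GROUPS * OPERANDS        # 16 cycles

# Proposed (FIG. 5): two banks per thread at a divide-by-two SRAM clock; a
# group's two operands are read in one SRAM cycle = two system cycles.
proposed_read_cycles = INSTRUCTIONS * GROUPS * 1 * 2           # 16 cycles

# Same read bandwidth in system-clock terms, but the proposed SRAM runs at
# half the frequency, permitting a smaller, lower-power register design.
print(baseline_read_cycles, proposed_read_cycles)  # 16 16
```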
In some embodiments, the register set 510 is capable of writing instruction output results for each of at least some of the plurality of threads within one clock cycle of the divided clock frequency.
In some embodiments, the register set 510 is capable of writing at least two output results for each of at least some of the plurality of threads within one clock cycle of the divided clock frequency.
In some embodiments, the clock divider circuit 540 may include an even divider circuit for converting the system clock signal CLK to an even divided clock signal CLKM of the system clock frequency.
In some embodiments, the even-numbered frequency-division circuit may include a divide-by-two circuit for converting the system clock signal CLK into a divide-by-two clock signal CLKM of the system clock frequency.
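The application does not disclose the divider's internal circuit; a conventional divide-by-two implementation is a toggle flip-flop whose output flips on every rising edge of the input clock. The sketch below models that assumed behaviour for illustration only.

```python
# Behavioural sketch of a toggle-flip-flop divide-by-two circuit (an assumed
# conventional implementation; not disclosed by the application).

def divide_by_two(num_rising_edges):
    """Return the divided-clock level after each rising edge of CLK."""
    levels, q = [], 0
    for _ in range(num_rising_edges):
        q ^= 1             # the flip-flop toggles on every rising edge
        levels.append(q)
    return levels

print(divide_by_two(4))  # [1, 0, 1, 0]: one CLKM period per two CLK periods
```

Cascading two such stages yields the divide-by-four clock of the later embodiments.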
In some embodiments, the register set 510 may be configured with two register banks per thread, and the register set 510 can read two operands per thread of at least some of the plurality of threads in one clock cycle of a divide-by-two (i.e., 1/2) of the system clock frequency.
In some embodiments, register set 510 is capable of writing up to two output results for each of at least some of the plurality of threads, depending on the type of instruction executed, during one clock cycle of a divide-by-two frequency of the system clock frequency, e.g., each thread may write the output results of one or both threads during one divide-by-two clock cycle.
In some embodiments, to achieve a higher core frequency and register read/write efficiency, the even divider circuit may further include a divide-by-four circuit for converting the system clock signal CLK into a clock signal CLKM at one quarter (i.e., 1/4) of the system clock frequency.
In some embodiments, the register set 510 may be configured with four register banks per thread, and the register set 510 can read at most four operands for each thread of at least some of the plurality of threads within one clock cycle at one quarter of the system clock frequency.
In some embodiments, depending on the type of instruction executed, the register set 510 can write at most four output results for each of at least some of the plurality of threads within one clock cycle at one quarter of the system clock frequency.
In some embodiments, the store queue 520 may include an operand queue and an output result queue. The width of each of the operand queue and the output result queue equals the total number of register banks in the register set, so that the operands required by each thread's instruction can be read and the instruction's output results written. Taking a register set that supports 32 parallel threads with 2 register banks per thread as an example, the width of the operand queue equals the number of register banks, i.e., 32 × 2 = 64 32-bit SRAM registers; its minimum depth may be 2, and optionally the depth may be 4 to 6. The output result queue has the same width, likewise 64 32-bit SRAM registers, and its depth may be 2.
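The sizing arithmetic in the preceding paragraph can be made explicit. This is a minimal sketch with assumed names; the constants come straight from the 32-thread, 2-banks-per-thread example above.

```python
# Illustrative sizing of the store queue (names are assumptions, not the
# patent's identifiers): with 32 parallel threads and 2 register banks per
# thread, each queue entry must be as wide as the entire bank group.
THREADS = 32
BANKS_PER_THREAD = 2
WORD_BITS = 32

queue_width_registers = THREADS * BANKS_PER_THREAD    # 64 SRAM registers wide
queue_width_bits = queue_width_registers * WORD_BITS  # 2048 bits per entry

operand_queue_bits = queue_width_bits * 4  # depth 4 (the text allows 2 to 6)
result_queue_bits = queue_width_bits * 2   # depth 2

assert queue_width_registers == 64
assert queue_width_bits == 2048
```

A deeper operand queue buffers reads for instructions (such as FMA) that need more than one divided-clock cycle to gather all operands, while depth 2 suffices on the write-back side.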
In some embodiments, the processor may include any one of a central processing unit, a graphics processing unit, a digital processor, a field programmable gate array, an artificial intelligence chip, and a video codec chip.
Fig. 6 is a schematic structural diagram of a processor 600 according to a second embodiment of the present application. As shown in Fig. 6, in addition to the components of the processor 500 of the first embodiment, the processor 600 further includes:
a level shift circuit 650, connected to the operating voltage VDD and the register set 610, for converting the operating voltage VDD into a voltage VDDM lower than VDD. The store queue 620 and the compute unit group 630 operate at the voltage VDD, and the register set 610 operates at the voltage VDDM.
In this embodiment, a lower operating voltage is used for the register set; for example, the VDDM voltage may be about 10% to 20% lower than the VDD operating voltage. This further reduces chip power consumption, effectively extends chip lifetime, and improves chip yield, thereby further improving the processing performance of the chip at low power.
Fig. 7 is a flowchart illustrating a data read/write method according to an embodiment of the present application. As shown in Fig. 7, the method is applicable to a processor including a register set, a store queue, and a compute unit group, where the store queue and the compute unit group operate at a first clock frequency and the register set operates at a second clock frequency lower than the first clock frequency. The method may include the following steps:
S710: read at least two operands of each of at least some threads executing an instruction from the register set into the store queue within one clock cycle of the second clock frequency;
S720: determine whether all operands of the at least some threads executing the instruction have been read into the store queue; if not, continue to read the remaining operands of the at least some threads from the register set into the store queue in the next clock cycle of the second clock frequency;
S730: after all operands of each of the at least some threads executing the instruction have been read into the store queue, the compute unit group executes a multi-stage pipeline for the at least some threads at clock cycles of the first clock frequency;
S740: after the multi-stage pipeline has been executed for the at least some threads, write the output result of each of the at least some threads into the register set in the next clock cycle of the second clock frequency.
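The control flow of steps S710 through S740 can be sketched as a toy cycle counter. The patent describes hardware, not software, so the function below is only a model under assumed names; it shows why a two-operand instruction needs one divided-clock read cycle while a three-operand instruction takes the S720 path and needs two.

```python
# Toy model of steps S710-S740: operands stream from the register set into
# the store queue on the slow divided clock; once all operands are queued,
# the compute unit group runs the multi-stage pipeline on the fast clock.
# Names and structure are illustrative assumptions.

def run_instruction(operands_per_thread, reads_per_divided_cycle, pipeline_stages):
    """Return (divided-clock cycles spent reading, fast-clock pipeline cycles)."""
    read_cycles = 0
    read = 0
    # S710/S720: keep reading until every operand is in the store queue.
    while read < operands_per_thread:
        read += reads_per_divided_cycle
        read_cycles += 1
    # S730: the compute unit group executes the multi-stage pipeline.
    exec_cycles = pipeline_stages
    # S740: write-back then occupies the next divided-clock cycle.
    return read_cycles, exec_cycles

# Two-operand multiply, 2 reads per divided cycle: one read cycle suffices.
assert run_instruction(2, 2, 4) == (1, 4)
# Three-operand FMA, 2 reads per divided cycle: two read cycles (S720 path).
assert run_instruction(3, 2, 4) == (2, 4)
```

This matches the timing diagrams discussed below: with a divide-by-two clock, multiply instructions gather operands in one CLKM cycle, while FMA instructions spread their reads over two.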
In some embodiments, writing the output results of each of the at least part of the threads executing the instruction to the register bank comprises writing at least two output results of each of the at least part of the threads executing the instruction to the register bank.
In some embodiments, the second clock frequency is an even division of the first clock frequency.
In some embodiments, the second clock frequency is one half of the first clock frequency; in step S710, reading at least two operands of each of at least some threads executing the instruction from the register set into the store queue within one clock cycle of the second clock frequency includes: reading two operands of each of the at least some threads from the register set into the store queue within one clock cycle of the second clock frequency.
In some embodiments, writing the output result of each of the at least some threads executing the instruction into the register set in the next clock cycle of the second clock frequency in step S740 includes: writing at most two output results of each of the at least some threads into the register set in the next clock cycle of the second clock frequency.
In some embodiments, the second clock frequency is one quarter of the first clock frequency; in step S710, reading at least two operands of each of at least some threads executing the instruction from the register set into the store queue within one clock cycle of the second clock frequency includes: reading at most four operands of each of the at least some threads from the register set into the store queue within one clock cycle of the second clock frequency.
In some embodiments, writing the output result of each of the at least some threads executing the instruction into the register set in the next clock cycle of the second clock frequency in step S740 includes: writing at most four output results of each of the at least some threads into the register set in the next clock cycle of the second clock frequency.
Fig. 8 is a timing diagram of processing two consecutive two-operand multiply instructions according to an embodiment of the present application. Fig. 8 illustrates the register set operating on a clock CLKM obtained by dividing the system clock frequency CLK by two. As shown in Fig. 8, in the first clock cycle of the divided clock signal CLKM (the second clock frequency), the operands AB of Thread0-7 executing the first multiply instruction are first read out from the register set, and the operands AB of Thread8-15 executing the first multiply instruction are read out at the same time;
in the second clock cycle, the operands AB of Thread16-23 executing the first multiply instruction and the operands AB of Thread24-31 executing the first multiply instruction are read out;
in the third clock cycle, the operands AB of Thread0-7 executing the second multiply instruction and the operands AB of Thread8-15 executing the second multiply instruction are read out;
in the fourth clock cycle, the operands AB of Thread16-23 executing the second multiply instruction and the operands AB of Thread24-31 executing the second multiply instruction are read out;
in the fifth clock cycle, the output result W of the first multiply instruction executed by Thread0-7 is written into the register set;
in the sixth clock cycle, the output result W of the first multiply instruction executed by Thread8-15 and the output result W of the first multiply instruction executed by Thread16-23 are written into the register set;
in the seventh clock cycle, the output result W of the second multiply instruction executed by Thread0-7 and the output result W of the first multiply instruction executed by Thread24-31 are written into the register set;
in the eighth clock cycle, the output result W of the second multiply instruction executed by Thread8-15 and the output result W of the second multiply instruction executed by Thread16-23 are written into the register set;
in the ninth clock cycle, the output result W of the second multiply instruction executed by Thread24-31 is written into the register set.
As shown in Fig. 8, the four-stage pipeline in which the compute unit group executes the above instructions runs according to the system clock signal CLK (the first clock frequency). Executing the first multiply instruction for Thread0-7 requires that both operands A and B have been read from the register set and are ready; therefore, to align with the clocks of the four-stage pipeline, the first pipeline stage (S0) of Thread0-7 executing the first multiply instruction is delayed by four clock cycles of the system clock signal CLK, i.e., two cycles of the divided clock signal CLKM. The second (S1), third (S2), and fourth (S3) pipeline stages of the first multiply instruction are then executed in sequence, and after the four-stage pipeline completes, the output result W of the instruction is written into the output result queue of the store queue in the fifth clock cycle of the divided clock signal CLKM.
Similarly, the other threads of the instruction, Thread8-15, Thread16-23, and Thread24-31, each lag by one clock cycle of the system clock signal CLK in executing the four-stage pipeline, and write the output result W of the instruction into the output result queue of the store queue in the corresponding clock cycle of the divided clock signal CLKM.
In this embodiment, the register set operates at 1/2 of the combinational logic frequency; two operands can be read simultaneously within one clock cycle, and one or two output results can be written within one clock cycle.
Fig. 9 is a timing diagram of processing two consecutive three-operand FMA instructions according to an embodiment of the present application. Fig. 9 likewise illustrates the register set operating on a clock CLKM obtained by dividing the system clock frequency CLK by two. As shown in Fig. 9, in the first clock cycle of the divided clock signal CLKM (the second clock frequency), the operands AB of Thread0-7 executing the first FMA instruction are first read out from the register set, and the operands AB of Thread8-15 executing the first FMA instruction are read out at the same time;
in the second clock cycle, the operand C of Thread0-7 executing the first FMA instruction, the operand C of Thread8-15 executing the first FMA instruction, the operands AB of Thread16-23 executing the first FMA instruction, and the operands AB of Thread24-31 executing the first FMA instruction are read out;
in the third clock cycle, the operands AB of Thread0-7 executing the second FMA instruction, the operands AB of Thread8-15 executing the second FMA instruction, the operand C of Thread16-23 executing the first FMA instruction, and the operand C of Thread24-31 executing the first FMA instruction are read out;
in the fourth clock cycle, the operand C of Thread0-7 executing the second FMA instruction, the operand C of Thread8-15 executing the second FMA instruction, the operands AB of Thread16-23 executing the second FMA instruction, and the operands AB of Thread24-31 executing the second FMA instruction are read out;
in the fifth clock cycle, the output result W of the first FMA instruction executed by Thread0-7 is written into the register set, while the operand C of Thread16-23 executing the second FMA instruction and the operand C of Thread24-31 executing the second FMA instruction are read out;
in the sixth clock cycle, the output result W of the first FMA instruction executed by Thread8-15 and the output result W of the first FMA instruction executed by Thread16-23 are written into the register set;
in the seventh clock cycle, the output result W of the second FMA instruction executed by Thread0-7 and the output result W of the first FMA instruction executed by Thread24-31 are written into the register set;
in the eighth clock cycle, the output result W of the second FMA instruction executed by Thread8-15 and the output result W of the second FMA instruction executed by Thread16-23 are written into the register set;
in the ninth clock cycle, the output result W of the second FMA instruction executed by Thread24-31 is written into the register set.
As shown in Fig. 9, the four-stage pipeline in which the compute unit group executes the FMA instructions runs according to the system clock signal CLK (the first clock frequency), in the same manner as in Fig. 8, and is not described again here.
Fig. 10 is a graph comparing the performance of embodiments of the present application with that of the prior art. The dynamic power consumption of a chip can be expressed as:

P = α × C × VDD² × F

where α is the activity factor, i.e., the probability that a circuit node toggles from 0 to 1, C is the load capacitance, VDD is the operating voltage of the chip, and F is the clock frequency of the chip.
Assume a processing chip designed to run at 3.0 GHz. When the SRAM registers and the combinational logic circuits use the same clock frequency, a 10T SRAM register (i.e., 10 transistors per bit) must be used, and its dynamic power consumption is Power(10T) = 1.0² × α × 3.0 GHz × C0, where C0 is the load capacitance at an operating voltage of 1.0 V.
With the embodiment of the present application, for example with the clock of the SRAM registers divided by two (1/2), an 8T SRAM register (i.e., 8 transistors per bit) can be used and the operating voltage can be reduced to 0.9 V; the dynamic power consumption of the 8T SRAM is Power(8T) = 0.9² × 2 × α × 1.5 GHz × C1, where C1 is the load capacitance at 0.9 V and the factor of 2 reflects the two register banks switched per divided clock cycle. Since C0 is at least 25% higher than C1, the scheme of the embodiment of the present application reduces the power consumption of the SRAM registers by about 35%.
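The ~35% figure follows directly from the dynamic power formula. The sketch below reproduces the arithmetic under the stated assumptions (α and C in arbitrary consistent units, C0 = 1.25 × C1, two banks switched per divided cycle); the variable names are mine, not the patent's.

```python
# Reconstruction of the power comparison using P = alpha * C * VDD^2 * F.
# Units are arbitrary but consistent; only the ratio matters.
ALPHA = 1.0
C1 = 1.0          # per-register load capacitance of the 8T cell at 0.9 V
C0 = 1.25 * C1    # 10T cell load, at least 25% higher than C1

p_10t = ALPHA * C0 * 1.0**2 * 3.0       # 10T SRAM: 1.0 V, 3.0 GHz, one bank
p_8t = ALPHA * (2 * C1) * 0.9**2 * 1.5  # 8T SRAM: 0.9 V, 1.5 GHz, two banks

reduction = 1 - p_8t / p_10t
assert abs(reduction - 0.352) < 0.001   # ~35% lower power, matching the text
```

Note that the saving comes from three multiplicative effects at once: the lower supply voltage (squared in the formula), the halved clock, and the smaller per-bit capacitance of the 8T cell, partially offset by switching two banks per cycle.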
According to the embodiments of the present application, the clock frequency of the SRAM registers in the processor is reduced and the number of register banks available to parallel threads for reading and writing is expanded, so that each thread executing an instruction can read more operands within one clock cycle of the SRAM registers. Because the SRAM register frequency is reduced, a design using fewer transistors can be adopted, achieving the technical effects of reducing chip power consumption and realizing high-performance SRAM register reads and writes. Moreover, even when the storage capacity of the SRAM registers remains unchanged, a better data read/write bandwidth per clock cycle can be achieved.
The embodiments of the present application can be widely applied to processing circuits such as central processing units, graphics processors, digital processors, field programmable gate arrays, artificial intelligence chips, and video codec chips, improving chip read/write performance, reducing power consumption, extending chip lifetime, and improving yield.
The steps, units, or modules referred to in the embodiments of the present application may be implemented by hardware circuits or by a combination of hardware and software logic. The embodiments of the present application are not limited to the examples described above; various changes and modifications in form and detail may be made by those skilled in the art without departing from the spirit and scope of the present application, and all such changes and modifications are considered to fall within the scope of the present application.

Claims (20)

1. A processor, comprising:
a register set comprising a static random access memory that includes a group of register banks corresponding to a plurality of threads executing an instruction, wherein each thread corresponds to at least two of the register banks;
a store queue for storing operands of the plurality of threads executing the instruction that are read from the register set, and output results of the plurality of threads that are ready to be written to the register set;
a compute unit group comprising a plurality of compute units for executing the plurality of threads of the instruction based on the operands in the store queue and outputting results to the store queue; and
a clock frequency division circuit for converting a first clock frequency into a second clock frequency lower than the first clock frequency, wherein the store queue and the compute unit group operate at the first clock frequency and the register set operates at the second clock frequency;
wherein the register set is capable of reading at least two operands of each of at least some of the plurality of threads in one clock cycle of the second clock frequency.
2. The processor of claim 1, wherein the set of registers is capable of writing output results for each of at least some of the plurality of threads during a clock cycle of the second clock frequency.
3. The processor of claim 2, wherein the set of registers is capable of writing at least two output results for each of at least some of the plurality of threads during one clock cycle of the second clock frequency.
4. The processor of claim 1, wherein the clock frequency division circuit comprises an even divider circuit for converting the first clock frequency into the second clock frequency, the second clock frequency being an even division of the first clock frequency.
5. The processor of claim 4, wherein the even divider circuit comprises a divide-by-two circuit for converting the first clock frequency into the second clock frequency, the second clock frequency being one half of the first clock frequency.
6. The processor of claim 5, wherein the register set is configured with two register banks per thread, and the register set is capable of reading two operands for each thread of at least some of the plurality of threads within one clock cycle of the second clock frequency, the second clock frequency being one half of the first clock frequency.
7. The processor of claim 6, wherein the register set is capable of writing at most two output results for each of at least some of the plurality of threads within one clock cycle of the second clock frequency, the second clock frequency being one half of the first clock frequency.
8. The processor of claim 4, wherein the even divider circuit comprises a divide-by-four circuit for converting the first clock frequency into the second clock frequency, the second clock frequency being one quarter of the first clock frequency.
9. The processor of claim 8, wherein the register set is configured with four register banks per thread, and the register set is capable of reading at most four operands for each thread of at least some of the plurality of threads within one clock cycle of the second clock frequency, the second clock frequency being one quarter of the first clock frequency.
10. The processor of claim 9, wherein the register set is capable of writing at most four output results for each of at least some of the plurality of threads within one clock cycle of the second clock frequency, the second clock frequency being one quarter of the first clock frequency.
11. The processor of any one of claims 1-10, further comprising a level shift circuit for converting a first voltage into a second voltage lower than the first voltage, wherein the store queue and the compute unit group operate at the first voltage and the register set operates at the second voltage.
12. The processor of claim 11, wherein the store queue comprises an operand queue and an output result queue.
13. The processor of claim 12, wherein the width of each of the operand queue and the output result queue is the same as the number of register banks in the group of register banks corresponding to the plurality of threads executing the instruction.
14. A data read/write method, applicable to a processor including a register set, a store queue, a compute unit group, and a clock frequency division circuit, wherein the store queue and the compute unit group operate at a first clock frequency and the register set operates at a second clock frequency lower than the first clock frequency, the method comprising the steps of:
reading at least two operands of each of at least some threads executing an instruction from the register set into the store queue within one clock cycle of the second clock frequency;
determining whether all operands of each of the at least some threads executing the instruction have been read into the store queue, and if not, continuing to read the remaining operands of each of the at least some threads from the register set into the store queue in the next clock cycle of the second clock frequency;
after all operands of each of the at least some threads executing the instruction have been read into the store queue, executing, by the compute unit group, a multi-stage pipeline for the at least some threads at clock cycles of the first clock frequency; and
after the multi-stage pipeline has been executed for the at least some threads, writing the output result of each of the at least some threads into the register set in the next clock cycle of the second clock frequency.
15. The data read/write method of claim 14, wherein writing the output result of each of the at least some threads executing the instruction into the register set comprises: writing at least two output results of each of the at least some threads into the register set.
16. The data read/write method of claim 15, wherein the second clock frequency is an even division of the first clock frequency.
17. The data read/write method of claim 16, wherein the second clock frequency is one half of the first clock frequency; and wherein reading at least two operands of each of the at least some threads executing the instruction from the register set into the store queue within one clock cycle of the second clock frequency comprises: reading two operands of each of the at least some threads from the register set into the store queue within one clock cycle of the second clock frequency.
18. The data read/write method of claim 14, wherein writing the output result of each of the at least some threads executing the instruction into the register set in the next clock cycle of the second clock frequency comprises: writing at most two output results of each of the at least some threads into the register set in the next clock cycle of the second clock frequency.
19. The data read/write method of claim 16, wherein the second clock frequency is one quarter of the first clock frequency; and wherein reading at least two operands of each of the at least some threads executing the instruction from the register set into the store queue within one clock cycle of the second clock frequency comprises: reading at most four operands of each of the at least some threads from the register set into the store queue within one clock cycle of the second clock frequency.
20. The data read/write method of claim 14, wherein writing the output result of each of the at least some threads executing the instruction into the register set in the next clock cycle of the second clock frequency comprises: writing at most four output results of each of the at least some threads into the register set in the next clock cycle of the second clock frequency.
CN202110927232.1A 2021-08-13 2021-08-13 Processor and data reading and writing method thereof Active CN113377438B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110927232.1A CN113377438B (en) 2021-08-13 2021-08-13 Processor and data reading and writing method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110927232.1A CN113377438B (en) 2021-08-13 2021-08-13 Processor and data reading and writing method thereof

Publications (2)

Publication Number Publication Date
CN113377438A true CN113377438A (en) 2021-09-10
CN113377438B CN113377438B (en) 2021-11-30

Family

ID=77577045

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110927232.1A Active CN113377438B (en) 2021-08-13 2021-08-13 Processor and data reading and writing method thereof

Country Status (1)

Country Link
CN (1) CN113377438B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115374388A (en) * 2022-10-24 2022-11-22 沐曦集成电路(上海)有限公司 Multidimensional array compression and decompression method and device
WO2023184900A1 (en) * 2022-03-31 2023-10-05 上海商汤智能科技有限公司 Processor, chip, electronic device, and data processing method

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20010023466A1 (en) * 1990-04-18 2001-09-20 Michael Farmwald Memory device having a programmable register
US20010030904A1 (en) * 1990-04-18 2001-10-18 Michael Farmwald Memory device having a programmable register
CN101331448A (en) * 2005-10-20 2008-12-24 高通股份有限公司 Backing store buffer for the register save engine of a stacked register file
CN101692647A (en) * 2009-10-12 2010-04-07 清华大学 Tunnel forwarding system in which IPv4 packets are encapsulated by IPv6 head in router
US20110242912A1 (en) * 2010-04-01 2011-10-06 Kang Byung-Ho Random Access Memory Devices Having Word Line Drivers Therein That Support Variable-Frequency Clock Signals
CN105717443A (en) * 2016-02-17 2016-06-29 北京时代民芯科技有限公司 SRAM type FPGA trigger single-event upset resistance performance assessment system and method
CN109767795A (en) * 2017-11-10 2019-05-17 三星电子株式会社 Memory device and the method for operating memory device for latent control

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20010023466A1 (en) * 1990-04-18 2001-09-20 Michael Farmwald Memory device having a programmable register
US20010030904A1 (en) * 1990-04-18 2001-10-18 Michael Farmwald Memory device having a programmable register
CN101331448A (en) * 2005-10-20 2008-12-24 高通股份有限公司 Backing store buffer for the register save engine of a stacked register file
CN101692647A (en) * 2009-10-12 2010-04-07 清华大学 Tunnel forwarding system in which IPv4 packets are encapsulated by IPv6 head in router
US20110242912A1 (en) * 2010-04-01 2011-10-06 Kang Byung-Ho Random Access Memory Devices Having Word Line Drivers Therein That Support Variable-Frequency Clock Signals
CN105717443A (en) * 2016-02-17 2016-06-29 北京时代民芯科技有限公司 SRAM type FPGA trigger single-event upset resistance performance assessment system and method
CN109767795A (en) * 2017-11-10 2019-05-17 三星电子株式会社 Memory device and the method for operating memory device for latent control

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Tong Liyong, Xiao Shanzhu: "Design and Implementation of a Numerically Controlled Oscillator Based on FPGA and SRAM", Foreign Electronic Components *
Zhang Junchao, Lian Ruiqi, Zhang Zhaoqing: "Register Allocation Techniques for Network Processors with Multiple Register Banks", Chinese Journal of Computers *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023184900A1 (en) * 2022-03-31 2023-10-05 上海商汤智能科技有限公司 Processor, chip, electronic device, and data processing method
CN115374388A (en) * 2022-10-24 2022-11-22 沐曦集成电路(上海)有限公司 Multidimensional array compression and decompression method and device
CN115374388B (en) * 2022-10-24 2023-02-28 沐曦集成电路(上海)有限公司 Multidimensional array compression and decompression method and device

Also Published As

Publication number Publication date
CN113377438B (en) 2021-11-30

Similar Documents

Publication Publication Date Title
JP2966085B2 (en) Microprocessor having last-in first-out stack, microprocessor system, and method of operating last-in first-out stack
Zaruba et al. Manticore: A 4096-core RISC-V chiplet architecture for ultraefficient floating-point computing
CN113377438B (en) Processor and data reading and writing method thereof
US5163139A (en) Instruction preprocessor for conditionally combining short memory instructions into virtual long instructions
US6807614B2 (en) Method and apparatus for using smart memories in computing
JP4934356B2 (en) Video processing engine and video processing system including the same
US20180293203A1 (en) Processing apparatus and methods
US20180121386A1 (en) Super single instruction multiple data (super-simd) for graphics processing unit (gpu) computing
JP2016517570A (en) Vector processing engine with programmable data path configuration and related vector processor, system, and method for providing multi-mode vector processing
Park et al. Libra: Tailoring simd execution using heterogeneous hardware and dynamic configurability
US8977835B2 (en) Reversing processing order in half-pumped SIMD execution units to achieve K cycle issue-to-issue latency
US10678542B2 (en) Non-shifting reservation station
Stepchenkov et al. Energy efficient speed-independent 64-bit fused multiply-add unit
CN109614145B (en) Processor core structure and data access method
EP3746883B1 (en) Processor having multiple execution lanes and coupling of wide memory interface via writeback circuit
US8555097B2 (en) Reconfigurable processor with pointers to configuration information and entry in NOP register at respective cycle to deactivate configuration memory for reduced power consumption
WO2022199131A1 (en) Processor apparatus and instruction execution method therefor
US7178013B1 (en) Repeat function for processing of repetitive instruction streams
Gass et al. Programmable DSPs
CN112506468B (en) RISC-V general processor supporting high throughput multi-precision multiplication operation
CN112559037B (en) Instruction execution method, unit, device and system
CN108845832B (en) Pipeline subdivision device for improving main frequency of processor
JPH04104350A (en) Micro processor
US20090063821A1 (en) Processor apparatus including operation controller provided between decode stage and execute stage
Kwon et al. A 0.18/spl mu/m implementation of a floating-point unit for a processing-in-memory system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant