CN112463217A - System, method, and medium for register file shared read port in a superscalar processor


Info

Publication number
CN112463217A
CN112463217A (application CN202011293529.9A)
Authority
CN
China
Prior art keywords: read, ports, port, input port, input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011293529.9A
Other languages
Chinese (zh)
Other versions
CN112463217B (en)
Inventor
潘杰
耿恒生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Haiguang Information Technology Co Ltd
Original Assignee
Haiguang Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Haiguang Information Technology Co Ltd
Priority to CN202011293529.9A
Publication of CN112463217A
Application granted
Publication of CN112463217B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098 Register arrangements
    • G06F9/30141 Implementation provisions of register files, e.g. ports
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098 Register arrangements
    • G06F9/3012 Organisation of register space, e.g. banked or distributed register file
    • G06F9/30134 Register stacks; shift registers

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)

Abstract

A system, method, and medium for a register file shared read port in a superscalar processor are provided. The system comprises: M input ports of a plurality of execution units, M being a positive integer; and N read ports of a register file, where the read port numbered x is associated with the input ports numbered x through (x + M - N), N and x being positive integers with 1 ≤ x ≤ N. In response to one of the M input ports requiring a read port, the unassigned read port with the smallest number among the read ports associated with that input port is allocated to it. Through this choice of read port mapping and the policy for selecting instructions in the issue queue, the scheme achieves full sharing of the read ports; it can also reduce the implementation area of the register file, lower its power consumption, and help optimize its timing.

Description

System, method, and medium for register file shared read port in a superscalar processor
Technical Field
The present disclosure relates to the field of processor technology, and more particularly, to a system, method, and medium for a register file shared read port in a superscalar processor.
Background
Dependencies among Central Processing Unit (CPU) instructions include data dependencies, structural (resource) dependencies, and control dependencies. Data dependencies mainly comprise three types: RAW (Read After Write), WAR (Write After Read), and WAW (Write After Write). Assume instruction j executes after instruction i. RAW means that instruction j may only read a register after instruction i has written it; if instruction j reads the register before instruction i writes it, it obtains incorrect data. WAR means that instruction j may only write a register after instruction i has read it; if instruction j writes the register before instruction i reads it, the data read by instruction i is incorrect. WAW means that instruction j may only write a register after instruction i has written it; if instruction j writes the register before instruction i, the register does not hold the latest value.
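For illustration only (not part of the disclosure), the three hazard classes can be recognized from the read and write sets of a pair of instructions; the following is a minimal sketch with hypothetical names:
```python
def classify_dependencies(i_instr, j_instr):
    """Classify the data dependencies of instruction j on an earlier instruction i.
    Each instruction is a (writes, reads) pair of register-name sets."""
    i_w, i_r = i_instr
    j_w, j_r = j_instr
    deps = set()
    if i_w & j_r:
        deps.add("RAW")   # j reads what i writes
    if i_r & j_w:
        deps.add("WAR")   # j overwrites what i still needs to read
    if i_w & j_w:
        deps.add("WAW")   # j overwrites what i writes
    return deps

# i: Reg1 = Reg2 + 1,  j: Reg2 = Reg3 + 2
print(classify_dependencies(({"Reg1"}, {"Reg2"}), ({"Reg2"}, {"Reg3"})))  # {'WAR'}
```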
In superscalar processors, instructions are typically multi-issued, and the maximum number of instructions that can be issued in the same instruction issue cycle depends on the width of the issue slots (or instruction issue queue). Superscalar processors use dynamic scheduling to achieve instruction-level parallelism. Besides the width of the issue slots, the data dependencies that exist between instructions become an important factor limiting the maximum instruction-level parallelism during the sequential execution of instructions. Superscalar processor architectures mainly rely on pipelining; once the instruction stages run in parallel, the key to guaranteeing correct program execution and to dynamically scheduling instructions is to detect whether data dependencies exist between the groups of instructions in adjacent pipeline stages.
In superscalar processor design, the false dependencies WAW and WAR prevent program parallelism from improving, so register renaming is introduced to eliminate these false dependencies and further improve program parallelism.
There remains a need to improve resource utilization in superscalar processor designs.
Disclosure of Invention
According to one aspect of the present disclosure, there is provided a system for a register file shared read port in a superscalar processor, comprising: M input ports of a plurality of execution units, M being a positive integer; and N read ports of the register file, where the read port numbered x is associated with the input ports numbered x through (x + M - N), N and x being positive integers with 1 ≤ x ≤ N, and where, in response to one of the M input ports requiring a read port, the unassigned read port with the smallest number among the read ports associated with that input port is allocated to it.
In one embodiment, in a first instruction issue cycle, read ports are assigned to input ports in a first priority order of input port numbers, and in a second instruction issue cycle adjacent to the first instruction issue cycle, read ports are assigned to input ports in a second priority order of input port numbers different from the first priority order.
In one embodiment, the first priority order is an order from input port number 1 being the highest priority to input port number N being the lowest priority, and the second priority order is an order from input port number N being the highest priority to input port number 1 being the lowest priority.
In one embodiment, the priority of a given input port number in the first instruction issue cycle differs from its priority in the second instruction issue cycle.
In one embodiment, the priority of the input port number of an input port requiring a read port in an instruction issue cycle is related to how long that input port has gone without being allocated a read port.
In one embodiment, when the number of read ports required by the M input ports is less than or equal to N, all requesting input ports are allocated read ports within one instruction issue cycle.
In one embodiment, when the number of read ports required by the M input ports is greater than N, an input port that has no assignable read port in one instruction issue cycle is assigned a read port in another instruction issue cycle.
In one embodiment, for the instruction issue queue corresponding to one execution unit, instructions requiring different numbers of operands are selected for issue in different instruction issue cycles, so that the number of read ports the execution unit requires differs across instruction issue cycles.
In one embodiment, if it is determined that an operand required by an input port is not read from a read port of the register file, the number of read ports required by the input port is determined to be 0.
In one embodiment, the values of M and N depend on the simulated or actual read port utilization of the input port.
According to another aspect of the present disclosure, there is provided a method for a register file shared read port in a superscalar processor, comprising: providing a total of M input ports of a plurality of execution units, M being a positive integer; providing N read ports of the register file, where the read port numbered x is associated with the input ports numbered x through (x + M - N), N and x being positive integers with 1 ≤ x ≤ N; and, in response to one of the M input ports requiring a read port, allocating to it the unassigned read port with the smallest number among the read ports associated with that input port.
In one embodiment, in a first instruction issue cycle, read ports are assigned to input ports in a first priority order of input port numbers, and in a second instruction issue cycle adjacent to the first instruction issue cycle, read ports are assigned to input ports in a second priority order of input port numbers different from the first priority order.
In one embodiment, the first priority order is an order from input port number 1 being the highest priority to input port number N being the lowest priority, and the second priority order is an order from input port number N being the highest priority to input port number 1 being the lowest priority.
In one embodiment, the priority of a given input port number in the first instruction issue cycle differs from its priority in the second instruction issue cycle.
In one embodiment, the priority of the input port number of an input port requiring a read port in an instruction issue cycle is related to how long that input port has gone without being allocated a read port.
In one embodiment, when the number of read ports required by the M input ports is less than or equal to N, all requesting input ports are allocated read ports within one instruction issue cycle.
In one embodiment, when the number of read ports required by the M input ports is greater than N, an input port that has no assignable read port in one instruction issue cycle is assigned a read port in another instruction issue cycle.
In one embodiment, for the instruction issue queue corresponding to one execution unit, instructions requiring different numbers of operands are selected for issue in different instruction issue cycles, so that the number of read ports the execution unit requires differs across instruction issue cycles.
In one embodiment, if it is determined that an operand required by an input port is not read from a read port of the register file, the number of read ports required by the input port is determined to be 0.
In one embodiment, the values of M and N depend on the simulated or actual read port utilization of the input port.
According to another aspect of the present disclosure, there is provided a computer storage medium storing a computer-executable program which, when executed by a processor, performs the method for a register file shared read port in a superscalar processor.
Through its choice of read port mapping and its policy for selecting instructions in the issue queue, the scheme achieves full sharing of the read ports; it can also reduce the implementation area of the register file, lower its power consumption, and help optimize its timing.
Drawings
FIG. 1 shows a fully mapped mode in which the read port of each register file is interconnected with the input port of each execution unit.
FIG. 2 illustrates a direct mapped mode where the input ports of each execution unit are interconnected with only a fixed one of the physical register file read ports.
FIG. 3 illustrates a partial mapping scheme in which the input ports of each execution unit are interconnected with a number of physical register file read ports, and each physical register file read port is also interconnected with the input ports of a number of execution units.
FIG. 4 illustrates a system for a register file shared read port in a superscalar processor, in accordance with one embodiment of the present disclosure.
FIG. 5A illustrates a specific example operation of the scheme shown in FIG. 4.
FIG. 5B shows a schematic diagram of how a newly added input port is connected to the read ports when an input port is added.
FIG. 6 illustrates one embodiment of assigning read ports in different priority order of input ports in two instruction issue cycles.
FIG. 7 illustrates an example of instruction issue in which, for the instruction issue queue corresponding to an execution unit, instructions that do not always require the same number of operands are selected for issue in different instruction issue cycles.
FIG. 8 illustrates a method of a register file shared read port in a superscalar processor, in accordance with one embodiment of the present disclosure.
FIG. 9 illustrates a block diagram of an exemplary computer system/server suitable for use in implementing embodiments of the present invention.
Detailed Description
Reference will now be made in detail to the present embodiments of the invention, examples of which are illustrated in the accompanying drawings. While the invention will be described in conjunction with the specific embodiments, it will be understood that they are not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. It should be noted that the method steps described herein may be implemented by any functional block or functional arrangement, and that any functional block or functional arrangement may be implemented as a physical entity or a logical entity, or a combination of both.
In order that those skilled in the art will better understand the present invention, the following detailed description of the invention is provided in conjunction with the accompanying drawings and the detailed description of the invention.
Note that the example described next is only a specific example and is not intended to limit the embodiments of the present invention; the specific shapes, hardware, connections, steps, numerical values, conditions, data, orders, and the like that are shown and described are not limiting either. Those skilled in the art can, upon reading this specification, use the concepts of the present invention to construct more embodiments than those specifically described herein.
Register renaming is one of the important base techniques supporting dynamic scheduling and out-of-order execution in superscalar processors. Among register renaming techniques, the most widely used implementation employs a unified Physical Register File. Its essence is that the logical register file (Architectural Register File) defined by the instruction set is dynamically mapped onto a physical register file inside the processor during execution; the physical registers actually implemented inside the processor take the place of the logical registers in the final operations, and increasing the number of physical registers effectively reduces false dependencies and thus improves the parallelism of the whole program.
A CPU with a classic 5-stage pipeline performs instruction fetch, decode, execute, memory access, and write back. In a typical superscalar processor, register renaming is performed after instruction decode, and the dependencies are re-expressed in terms of physical registers to eliminate false dependencies before the instruction enters the issue queue. For example, assume the architectural registers are Reg1, Reg2, Reg3, … and the physical registers are PR1, PR2, PR3, …. A WAR example is the program Reg1 = Reg2 + 1 followed by Reg2 = Reg3 + 2. Without mapping to physical registers, the second instruction would have to wait for the first instruction to finish, even though the second instruction clearly does not depend on the first; if the two instructions are mapped to different physical registers, they can execute independently. That is, by mapping Reg1 of the first instruction to physical register PR1 and Reg2 to PR2, and then mapping Reg2 of the second instruction to PR4 and Reg3 to PR3, the two instructions become PR1 = PR2 + 1 and PR4 = PR3 + 2; the same applies to WAW. Physical register allocation and dependency establishment are completed at rename time; a physical register is released once no operation depends on it and then enters the next round of allocation, and an instruction can issue once all the operands it depends on are available.
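For illustration only, the following minimal sketch (not the patent's implementation; the class and register names are hypothetical) shows how renaming maps each new architectural destination to a fresh physical register, removing the WAR dependency in the example above:
```python
from collections import deque

class Renamer:
    """Toy register renamer: maps architectural registers to physical registers."""
    def __init__(self, num_physical):
        self.free = deque(f"PR{i}" for i in range(1, num_physical + 1))
        self.map = {}  # architectural register -> current physical register

    def read(self, arch_reg):
        # A source operand reads the current mapping; this toy model lazily
        # allocates a mapping for registers that were never written before.
        if arch_reg not in self.map:
            self.map[arch_reg] = self.free.popleft()
        return self.map[arch_reg]

    def write(self, arch_reg):
        # Every new write gets a fresh physical register, so a later writer
        # no longer conflicts with an earlier reader (WAR) or writer (WAW).
        self.map[arch_reg] = self.free.popleft()
        return self.map[arch_reg]

r = Renamer(num_physical=8)
src1, dst1 = r.read("Reg2"), r.write("Reg1")   # Reg1 = Reg2 + 1
src2, dst2 = r.read("Reg3"), r.write("Reg2")   # Reg2 = Reg3 + 2
print(f"{dst1} = {src1} + 1")  # PR2 = PR1 + 1
print(f"{dst2} = {src2} + 2")  # PR4 = PR3 + 2: the WAR on Reg2 is gone
```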
Since physical registers can be freely allocated to different logical registers, an operation begins by reading its operands from the physical register file. In a superscalar processor architecture, each execution unit corresponds to a number of read ports of the physical register file, which perform the read operations on that file. In practice the physical register file is large, and a large number of read ports, together with correspondingly large selection networks, are needed to read the required operands out of it. As superscalar processor performance increases, more execution units tend to be added, and the number of read ports and their associated selection networks grows accordingly.
In most current designs, a number of read ports are provided for the multiple execution units and the register file is designed around that number, but the ports are not fully utilized. Because some executed instructions need no register operands, or need only a few, average read port utilization typically does not reach 100%. Providing a full complement of read ports for every execution unit is therefore wasteful, and the resulting idle resources increase the processor's area and power consumption; in physical implementation they may also prevent the design from reaching its target frequency, making timing hard to close. Timing closure means that the chip design meets all timing requirements, i.e. the chip can operate at the required frequency. Timing becomes hard to close because adding read ports greatly increases the number of logic cells; when a signal must pass through too many logic cells and too-long metal wires, it cannot travel from source to destination within the clock period corresponding to the target frequency, and if rerouting and similar measures cannot meet the requirement, only a lower frequency can be achieved.
In other implementations, three schemes for sharing the read port are proposed:
1. Full mapping: each physical register file read port is interconnected with an input port of every execution unit, as shown in FIG. 1. FIG. 1 shows the fully mapped mode in which each register file read port is interconnected with the input port of every execution unit. The wiring of this approach is complex, and complex instruction-issue arbitration logic is required to avoid conflicts.
2. Direct mapping: the input port of each execution unit is interconnected with only one fixed physical register file read port, although that read port may connect to the input ports of multiple execution units, as shown in FIG. 2. FIG. 2 illustrates the direct-mapped mode in which the input port of each execution unit is interconnected with only one fixed physical register file read port. Because the number of connections is limited, a shared read port has a high probability of conflict.
3. Partial mapping: the input port of each execution unit is interconnected with several physical register file read ports, and each physical register file read port is also interconnected with the input ports of several execution units, as shown in FIG. 3. FIG. 3 illustrates the partial mapping scheme in which the input ports of each execution unit are interconnected with several physical register file read ports, and each physical register file read port is also interconnected with the input ports of several execution units. This scheme is a simplified version of scheme 1, but it must identify the execution characteristics of the different pipelines (pipes) and the read port requirements of the instructions, perform read port sharing sensibly, and use complicated issue control logic to reduce read port conflicts; even so, it cannot fully reach the efficiency of total sharing.
Therefore, there is a need for a technique for a register file shared read port in a superscalar processor that improves the efficiency of shared read ports, reduces read port conflicts, and simplifies issue control logic.
As described above, if every input port of every execution unit has its own exclusive register file read port, resources are wasted to some extent; if one of the sharing approaches above is adopted instead, either the read ports cannot be fully shared or complex control logic is required to avoid read port conflicts.
This disclosure constructs an improved scheme: through a new partial-mapping correspondence, all read ports can be shared in every instruction issue cycle and read port conflicts are reduced, and with the help of suitable issue control logic the utilization of the shared read ports is further improved.
FIG. 4 illustrates a system 400 for a register file shared read port in a superscalar processor, in accordance with one embodiment of the present disclosure.
As shown in FIG. 4, a system 400 for a register file shared read port in a superscalar processor comprises: M input ports 402 of a plurality of execution units 401, M being a positive integer; and N read ports 404 of a register file 403, where the read port 404 numbered x is associated with the input ports 402 numbered x through (x + M - N), N and x being positive integers with 1 ≤ x ≤ N. In response to one of the M input ports requiring a read port, the unassigned read port with the smallest number among the read ports associated with that input port is allocated to it.
Generally, the number M of input ports is greater than or equal to the number N of read ports (M ≥ N), so the input ports need to share the read ports. If every input port were associated with every read port, full sharing could be achieved but resource utilization would drop; if the association were only partial, complex read port allocation (or selection) logic would normally be required to avoid read port conflicts.
This scheme uses a particular partial interconnection to achieve the effect of full interconnection. The read port 404 numbered x is interconnected only with the input ports 402 numbered x through (x + M - N), rather than every input port being associated with every read port, and simple allocation (selection) logic assigns, in response to one of the M input ports requiring a read port, the unassigned read port with the smallest number among the read ports associated with that input port. In this way, within one instruction issue cycle of an issue queue, the input ports that need read ports can be allocated a sufficient number of them.
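Purely as an illustration of this mapping and allocation rule (a sketch under the assumptions above, not the patent's hardware logic; all names are hypothetical), the behaviour can be written as follows:
```python
def associated_read_ports(x_in, M, N):
    """Read ports associated with input port x_in: read port x serves input
    ports x .. x + M - N, so input port x_in may use read ports
    max(1, x_in - (M - N)) .. min(x_in, N)."""
    return range(max(1, x_in - (M - N)), min(x_in, N) + 1)

def allocate(requests, M, N):
    """Give each requesting input port, in the given priority order, the
    lowest-numbered unassigned read port it is associated with."""
    assigned, used = {}, set()
    for x_in in requests:
        for rp in associated_read_ports(x_in, M, N):
            if rp not in used:
                assigned[x_in] = rp
                used.add(rp)
                break
    return assigned

# With 8 input ports and 4 shared read ports (the FIG. 5A sizes):
print(allocate([1, 2, 3], M=8, N=4))     # {1: 1, 2: 2, 3: 3}
print(allocate([5, 6, 7, 8], M=8, N=4))  # {5: 1, 6: 2, 7: 3, 8: 4}
```
Taking the smallest available number in each window keeps the later, narrower windows (for example input port 8, which can only use read port 4) free; this is the same rationale discussed below with FIG. 5A.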
Through its choice of read port mapping and its policy for selecting instructions in the issue queue, the scheme achieves full sharing of the read ports; it can also reduce the implementation area of the register file, lower its power consumption, and help optimize its timing.
Each execution unit 1, 2, …, y, … is shown in FIG. 4 with 2 input ports, but this is merely an example; in practice, execution units may have different numbers of input ports, and no limitation is intended here.
The operation of the above scheme is further described with reference to the specific example of FIG. 5A. FIG. 5A illustrates a specific example operation of the scheme shown in FIG. 4.
As shown in FIG. 5A, assume there are 4 execution units, each with 2 input ports, for a total of 8 input ports (numbered 1-8), i.e. M = 8. Of course, each execution unit may have another number of input ports; no limitation is intended here. Assume there are 4 shared read ports, i.e. N = 4; the number of shared read ports may also be different, again without limitation. The connection (association) under this scheme is then:
read port 1 is connected to input port 1, input port 2, input port 3, input port 4, and input port 5.
Read port 2 is connected to input port 2, input port 3, input port 4, input port 5, and input port 6.
Read port 3 is connected to input port 3, input port 4, input port 5, input port 6, and input port 7.
Read port 4 is connected to input port 4, input port 5, input port 6, input port 7, and input port 8.
When input port 1 requires a read port, read port 1, the only read port associated with input port 1, is assigned to it. When input port 2 requires a read port, read port 2 is assigned: of the read ports 1 and 2 associated with input port 2, read port 1 is already occupied by input port 1, so read port 2 is the unassigned read port with the smallest number. When input port 3 requires a read port, read port 3 is assigned, because of the read ports 1, 2, 3 associated with input port 3, read ports 1 and 2 are already occupied by input ports 1 and 2.
By analogy, once earlier read ports have been assigned to earlier input ports and are occupied, a later input port can only take the remaining read ports. Allocating, as far as possible, the smallest-numbered read port among those associated with an input port leaves enough ports for the later input ports and improves the efficiency of read port sharing. If the smallest-numbered read port were not allocated, resources could be wasted. For example, when input ports 1-4 need no read port and input port 5 does, assigning read port 2 (rather than read port 1, the smallest of the read ports 1-4 associated with input port 5) would leave read port 1 unallocatable, because none of the later input ports 6, 7, 8 is connected to it; input ports 6-8 could then only share the remaining read ports 3 and 4, an input port might find no available read port, and the efficiency of read port sharing would drop.
Of course, since there are 4 read ports, if in one instruction issue cycle the 4 execution units together need no more than 4 operands (i.e. no more than 4 read ports), the connection structure shown in FIG. 5A guarantees that the execution units are allocated the read ports they need and that the 4 read ports are used efficiently.
In an embodiment, the values of the number M of input ports and the number N of read ports may depend on the simulated or actual read port utilization of the input ports. In different cases, M and N may take different values so that, as far as possible, read ports are allocated within one instruction issue cycle to the input ports that require them, improving read port utilization and reducing the idle rate of the shared read ports.
For example, in a perf model (performance simulation model) each input port is first given its own read port, performance simulation counts the read port utilization, and the lower bound on the number of shared read ports is determined; the number of shared read ports is then adjusted in the perf model according to a preset sharing policy to find the balance point between performance and resources, thereby determining a suitable number N of read ports.
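As an illustration of this sizing step only (the trace format, the coverage threshold, and the function below are assumptions, not the patent's perf model), one could pick N from simulated per-cycle read port demand like this:
```python
from collections import Counter

def choose_read_ports(demand_per_cycle, coverage=0.9):
    """demand_per_cycle: per-cycle totals of read ports requested (from a
    perf-model run). Returns the smallest N that fully serves `coverage`
    of the simulated cycles; the remaining cycles stall into a later cycle."""
    hist = Counter(demand_per_cycle)
    total = sum(hist.values())
    covered = 0
    for n in sorted(hist):
        covered += hist[n]
        if covered / total >= coverage:
            return n
    return max(hist)

# Hypothetical simulated demand for a machine with 8 input ports.
trace = [0, 2, 4, 3, 1, 4, 2, 5, 3, 4, 2, 1, 0, 3, 4, 2]
print(choose_read_ports(trace))  # 4: the lone 5-operand cycle waits a cycle
```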
FIG. 5B shows a schematic diagram of how a newly added input port is connected to the read ports when an input port is added.
Assume there are m input ports I[1], I[2], …, I[m] and n shared read ports R[1], R[2], …, R[n]. To achieve sharing, the shared read port R[x] needs to be connected to input ports I[x], I[x+1], …, I[x+m-n] (a total of m-n+1 input ports). For example, with 8 input ports and 4 shared read ports, R[1] needs to be connected to the 5 input ports I[1], I[2], …, I[5].
The reason is as follows. Suppose the average read port demand of the first a input ports I[1], …, I[a] (a being a positive integer; in FIG. 5B, a = 2, i.e. input ports 1 and 2) is 1, meaning these a input ports only need to share 1 read port. When an input port is added, the (a+1) input ports may demand 2 read ports (the 1 read port needed by the first a input ports plus another read port needed by the newly added input port; in FIG. 5B, input ports 1, 2, and 3 need read port 1 and read port 2), i.e. m = a + 1 and n = 2. The connections of the first read port remain the same as before, and the 2nd read port R[2] (here x = 2) is connected to all input ports except the 1st, namely I[2], …, I[2 + (a+1) - 2], i.e. I[2], …, I[a+1].
Therefore, when one of the 2 read port demands comes from the first a input ports and the other comes from the last input port I[a+1], this connection structure guarantees that both demands can be allocated corresponding read ports.
When both read port demands come from the first a input ports, read port R[2] can also be assigned: for example, if input port I[1] has a read demand it occupies read port R[1], and the other demand comes from I[2], …, I[a], all of which are already interconnected with R[2].
If both read port demands come from input ports I[2], …, I[a+1], the requirement is also satisfied, since these input ports are interconnected with both read ports R[1] and R[2], i.e. read port R[1] and read port R[2] are both allocated.
Extending this to any a + b input ports whose average read port demand is 1 + b, the connection pattern provided by this scheme allows the 1 + b read ports to be fully shared.
Generalizing, when the number of input ports is M and the number of shared read ports is N, each shared read port is connected, in a shifted manner, to the (M - N + 1) consecutive input ports starting at its own number: shared read port X is connected to the input ports numbered X through (X + M - N). Thus, when the number of read ports required by the M input ports is less than or equal to N, all requesting input ports are allocated read ports within one instruction issue cycle.
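The statement above can be spot-checked by brute force for small sizes; the sketch below is illustrative only and repeats the greedy allocation so it is self-contained:
```python
from itertools import combinations

def allocate(requests, M, N):
    """Greedy allocation under the shifted partial mapping sketched earlier:
    input port y may use read ports max(1, y - (M - N)) .. min(y, N)."""
    assigned, used = {}, set()
    for y in requests:
        for rp in range(max(1, y - (M - N)), min(y, N) + 1):
            if rp not in used:
                assigned[y] = rp
                used.add(rp)
                break
    return assigned

def check_full_sharing(M, N):
    # Every set of at most N requesting input ports must be fully served.
    for k in range(1, N + 1):
        for req in combinations(range(1, M + 1), k):
            if len(allocate(list(req), M, N)) != k:
                return False, req
    return True, None

print(check_full_sharing(8, 4))  # (True, None) for the FIG. 5A sizes
```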
However, if in one instruction issue cycle the 4 execution units together need operands requiring more than 4 read ports, it can happen that the earlier execution units occupy all the shared read ports while a later execution unit can only execute operations that need no read port (i.e. operations without operands); otherwise it must stall and wait for a following instruction issue cycle to be allocated a read port for its operand read. That is, when the number of read ports required by the M input ports is greater than N, an input port that has no assignable read port in one instruction issue cycle is assigned a read port in another instruction issue cycle.
This can be further optimized to reduce how long a later-numbered execution unit may go without acquiring a read port. In one embodiment, in a first instruction issue cycle, read ports are assigned to input ports in a first priority order of input port numbers, and in a second instruction issue cycle adjacent to the first, read ports are assigned to input ports in a second priority order of input port numbers that differs from the first.
For example, suppose that in the first instruction issue cycle the first priority order gives input port 1 the highest priority, with the priorities of input ports 2 to 8 decreasing in turn. The probability that the later input ports are allocated a read port is then lower, and a "starvation" phenomenon may occur: if input ports 1-4 need 4 operands for a long time and are allocated all 4 read ports, input ports 5-8 may go a long time without read ports and thus without operands, so the program parts requiring those operands cannot continue to execute. A variable priority for acquiring the shared read ports can therefore be introduced; one way is to adjust the order in which input ports get first choice of the shared read ports. For example, in a second instruction issue cycle adjacent to the first instruction issue cycle, the read ports are assigned to the input ports in a second priority order of input port numbers that is different from the first priority order.
In one embodiment, the first priority order is an order from input port number 1 being the highest priority to input port number N being the lowest priority, and the second priority order is an order from input port number N being the highest priority to input port number 1 being the lowest priority.
Specifically, FIG. 6 illustrates one embodiment of assigning read ports in different priority orders of input ports in two instruction issue cycles. As shown in FIG. 6, in the previous instruction issue cycle (cycle p, p being a positive integer), the execution units' read ports are allocated starting preferentially from input port 1; input port 2 then selects from the remaining ports, and so on, until the last input port or the last shared read port has completed allocation. In the current instruction issue cycle (cycle p + 1), the allocation priority order is adjusted: for example, allocation may start preferentially from input port 8, which is first assigned read port 4; input port 7 then selects from the remaining read ports, and the priority of the input ports runs from the largest input port number to the smallest.
Of course, the first priority order and the second priority order need not be the orders exemplified above, as long as no input port goes permanently without being assigned a read port. In one embodiment, the priority of a given input port number in the first instruction issue cycle differs from its priority in the second instruction issue cycle. The priority of an input port may also be adjusted dynamically, for example according to the probability that the input port is allocated a read port, or according to the number of read ports allocated to it in simulation or actual operation. For example, in one embodiment, the priority of an input port that requires a read port in an instruction issue cycle is related to how long that input port has gone without being allocated a read port.
In this way, by dynamically adjusting the priorities of the input port numbers across multiple instruction issue cycles, every input port can be allocated a read port in time, avoiding the starvation that occurs when an input port goes a long time without one.
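A minimal sketch of one such alternating-priority policy follows; the choice to scan each port's window from the top on descending-priority cycles mirrors the smallest-number rule and is an assumption made for illustration, not taken from the patent:
```python
def allocate_with_priority(requests, M, N, cycle):
    """Allocate read ports under the shifted partial mapping, walking the
    requesting input ports in ascending order on even cycles and in
    descending order on odd cycles so high-numbered ports do not starve."""
    order = sorted(requests, reverse=bool(cycle % 2))
    assigned, used = {}, set()
    for y in order:
        window = range(max(1, y - (M - N)), min(y, N) + 1)
        # On descending-priority cycles the window is scanned from the top,
        # mirroring the smallest-number rule used on ascending cycles.
        for rp in (reversed(window) if cycle % 2 else window):
            if rp not in used:
                assigned[y] = rp
                used.add(rp)
                break
    return assigned

# 8 input ports, 4 shared read ports, demand exceeding the port count:
print(allocate_with_priority([1, 2, 3, 4, 5, 6], 8, 4, cycle=0))  # {1: 1, 2: 2, 3: 3, 4: 4}
print(allocate_with_priority([1, 2, 3, 4, 5, 6], 8, 4, cycle=1))  # {6: 4, 5: 3, 4: 2, 3: 1}
```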
The read port utilization of the execution units can also be balanced, to further ensure that the input ports share the read ports evenly. During instruction issue, several instructions to be issued in different instruction issue cycles are selected and placed in the instruction issue queue corresponding to an execution unit, and each instruction has its own operand requirements. Considering that instructions may require different numbers of operands (for example, some require 1 operand, some require 2, and some require 0), in one embodiment, for the instruction issue queue corresponding to one execution unit, instructions that do not always require the same number of operands are selected for issue in different instruction issue cycles, so that the number of read ports the execution unit requires is not always the same across instruction issue cycles. For example, instructions with different operand requirements are located directly in the issue queue, and, as far as possible, instructions with different operand requirements are selected for the execution unit in different instruction issue cycles.
FIG. 7 illustrates an example of instruction issue in which, for the instruction issue queue corresponding to an execution unit, instructions that do not always require the same number of operands are selected for issue in different instruction issue cycles. As shown in FIG. 7, for example, in one instruction issue cycle the instruction of execution unit 1 requires 0 operands, in the next cycle 1 operand, in the next 2 operands, in the next 1 operand, …, and in a later cycle q operands, and so on, q being a positive integer.
In this way, since the number of read ports required by the input ports of an execution unit is not always the same across instruction issue cycles, the total number of read ports required by the input ports of all the execution units in each instruction issue cycle is also not always the same. This further ensures that the input ports can be allocated the read ports they require, and the goal of balanced use of the shared read ports is better achieved.
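A minimal sketch of such an operand-aware selection from an issue queue follows; the queue representation and the rule of avoiding last cycle's operand count are illustrative assumptions, not the patent's policy:
```python
def pick_for_issue(ready_queue, last_operand_count):
    """From the ready instructions of one execution unit's issue queue,
    prefer an instruction whose operand count differs from the one issued
    in the previous cycle, to spread read port demand over cycles."""
    for idx, instr in enumerate(ready_queue):
        if instr["operands"] != last_operand_count:
            return ready_queue.pop(idx)
    return ready_queue.pop(0) if ready_queue else None

queue = [{"op": "add", "operands": 2},
         {"op": "inc", "operands": 1},
         {"op": "nop", "operands": 0}]
print(pick_for_issue(queue, last_operand_count=2))  # picks the 1-operand 'inc'
```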
In addition, if it can be determined that the data will be forwarded from the bypass network and the operand need not actually be read from the register file, the read port requirement may be taken as 0; that is, if it is determined that the operand required by an input port is not read from a read port of the register file, the number of read ports required by that input port is determined to be 0. This further optimizes read port usage and reduces the idle rate of the shared read ports.
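For illustration, read port demand can be counted only for source operands that cannot be bypassed; the in_flight_results set below is a hypothetical stand-in for a real forwarding network:
```python
def read_ports_needed(instr_sources, in_flight_results):
    """Count how many source operands must come from register file read
    ports, treating operands available on the bypass network as free."""
    return sum(1 for src in instr_sources if src not in in_flight_results)

# PR7 is being produced this cycle and will be forwarded on the bypass.
print(read_ports_needed(["PR7", "PR3"], in_flight_results={"PR7"}))  # 1
```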
In the read port design of a processor's register file, this allows the number of read ports to be reduced effectively without reducing the efficiency of reading the register file, thereby reducing area and power consumption and, to some extent, helping timing closure.
FIG. 8 shows a method 800 for a register file shared read port in a superscalar processor according to one embodiment of the present disclosure, comprising: step 801, providing a total of M input ports of a plurality of execution units, M being a positive integer; step 802, providing N read ports of a register file, where the read port numbered x is associated with the input ports numbered x through (x + M - N), N and x being positive integers with 1 ≤ x ≤ N; and step 803, in response to one of the M input ports requiring a read port, allocating to it the unassigned read port with the smallest number among the read ports associated with that input port.
This scheme uses a particular partial interconnection to achieve the effect of full interconnection. The read port 404 numbered x is interconnected only with the input ports 402 numbered x through (x + M - N), rather than every input port being associated with every read port, and simple allocation (selection) logic assigns, in response to one of the M input ports requiring a read port, the unassigned read port with the smallest number among the read ports associated with that input port. In this way, within one instruction issue cycle of an issue queue, the input ports that need read ports can be allocated a sufficient number of them.
Through its choice of read port mapping and its policy for selecting instructions in the issue queue, the scheme achieves full sharing of the read ports; it can also reduce the implementation area of the register file, lower its power consumption, and help optimize its timing.
To reduce the situation in which a later execution unit cannot acquire a read port for a long time, in one embodiment, in a first instruction issue cycle, read ports are allocated to input ports in a first priority order of input port numbers, and in a second instruction issue cycle adjacent to the first, read ports are allocated to input ports in a second priority order of input port numbers that differs from the first.
In one embodiment, the first priority order is an order from input port number 1 being the highest priority to input port number N being the lowest priority, and the second priority order is an order from input port number N being the highest priority to input port number 1 being the lowest priority.
In one embodiment, the priority of a given input port number in the first instruction issue cycle differs from its priority in the second instruction issue cycle.
Of course, the first priority order and the second priority order need not be the orders exemplified above, as long as no input port goes permanently without being assigned a read port. In one embodiment, the priority of a given input port number in the first instruction issue cycle differs from its priority in the second. The priority of an input port may also be adjusted dynamically, for example according to the probability that the input port is allocated a read port, or according to the number of read ports allocated to it in simulation or actual operation. For example, in one embodiment, the priority of an input port that requires a read port in an instruction issue cycle is related to how long that input port has gone without being allocated a read port.
In this way, by dynamically adjusting the priorities of the input port numbers across multiple instruction issue cycles, every input port can be allocated a read port in time, avoiding the starvation that occurs when an input port goes a long time without one.
When the number of read ports required by the M input ports is less than or equal to N, all requesting input ports are allocated read ports within one instruction issue cycle.
When the number of read ports required by the M input ports is greater than N, an input port that has no allocatable read port in one instruction issue cycle waits and is allocated a read port in another instruction issue cycle.
The read port utilization of the execution units can also be balanced, to further ensure that the input ports share the read ports evenly. In one embodiment, for the instruction issue queue corresponding to one execution unit, instructions requiring different numbers of operands are selected for issue in different instruction issue cycles, so that the number of read ports the execution unit requires differs across instruction issue cycles.
In this way, since the number of read ports required by the input ports of an execution unit is not always the same across instruction issue cycles, the total number of read ports required by the input ports of all the execution units in each instruction issue cycle is also not always the same. This further ensures that the input ports can be allocated the read ports they require, and the goal of balanced use of the shared read ports is better achieved.
In addition, if it can be determined that the data will be forwarded from the bypass network and the operand need not actually be read from the register file, the read port requirement may be taken as 0; that is, if it is determined that the operand required by an input port is not read from a read port of the register file, the number of read ports required by that input port is determined to be 0. This further optimizes read port usage and reduces the idle rate of the shared read ports.
In an embodiment, the values of the number M of input ports and the number N of read ports may depend on the simulated or actual read port utilization of the input ports; in different cases, M and N may take different values so that, as far as possible, read ports are allocated within one instruction issue cycle to the input ports that require them, improving read port utilization and reducing the idle rate of the shared read ports.
In the read port design of a processor's register file, this allows the number of read ports to be reduced effectively without reducing the efficiency of reading the register file, thereby reducing area and power consumption and, to some extent, helping timing closure.
FIG. 9 illustrates a block diagram of an exemplary computer system suitable for use to implement embodiments of the present invention.
The computer system may include a processor (H1); a memory (H2) coupled to the processor (H1) and storing computer executable instructions therein.
The processor (H1) may include, but is not limited to, for example, one or more processors or microprocessors or the like.
The memory (H2) may include, but is not limited to, for example, Random Access Memory (RAM), Read Only Memory (ROM), flash memory, EPROM memory, EEPROM memory, registers, hard disk, floppy disk, solid state disk, removable disk, CD-ROM, DVD-ROM, Blu-ray disk, and the like.
In addition, the computer system may include a data bus (H3), an input/output (I/O) bus (H4), a display (H5), and an input/output device (H6) (e.g., a keyboard, a mouse, a speaker, etc.), among others.
The processor (H1) may communicate with external devices (H5, H6, etc.) via a wired or wireless network (not shown) over an I/O bus (H4).
The memory (H2) may also store at least one computer-executable instruction for performing, when executed by the processor (H1), the functions and/or steps of the methods in the embodiments described in the present technology.
Of course, the above-mentioned embodiments are merely examples and not limitations, and those skilled in the art can combine and combine some steps and apparatuses from the above-mentioned separately described embodiments to achieve the effects of the present invention according to the concepts of the present invention, and such combined and combined embodiments are also included in the present invention, and such combined and combined embodiments are not necessarily described herein.
It is noted that advantages, effects, and the like, which are mentioned in the present disclosure, are only examples and not limitations, and they are not to be considered essential to various embodiments of the present invention. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the invention is not limited to the specific details described above.
The block diagrams of devices, apparatuses, systems referred to in this disclosure are only given as illustrative examples and are not intended to require or imply that the connections, arrangements, configurations, etc. must be made in the manner shown in the block diagrams. These devices, apparatuses, devices, systems may be connected, arranged, configured in any manner, as will be appreciated by those skilled in the art. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The words "or" and "as used herein mean, and are used interchangeably with, the word" and/or, "unless the context clearly dictates otherwise. The word "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to".
The flowchart of steps in the present disclosure and the above description of methods are merely illustrative examples and are not intended to require or imply that the steps of the various embodiments must be performed in the order presented. As will be appreciated by those skilled in the art, the order of the steps in the above embodiments may be performed in any order. Words such as "thereafter," "then," "next," etc. are not intended to limit the order of the steps; these words are only used to guide the reader through the description of these methods. Furthermore, any reference to an element in the singular, for example, using the articles "a," "an," or "the" is not to be construed as limiting the element to the singular.
In addition, the steps and devices in the embodiments are not limited to be implemented in a certain embodiment, and in fact, some steps and devices in the embodiments may be combined according to the concept of the present invention to conceive new embodiments, and these new embodiments are also included in the scope of the present invention.
The individual operations of the methods described above may be performed by any suitable means capable of performing the corresponding functions. The means may include various hardware and/or software components and/or modules including, but not limited to, a hardware circuit, an Application Specific Integrated Circuit (ASIC), or a processor.
The various illustrative logical blocks, modules, and circuits described may be implemented or described with a general purpose processor, a Digital Signal Processor (DSP), an ASIC, a field programmable gate array signal (FPGA) or other Programmable Logic Device (PLD), discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any commercially available processor, controller, microcontroller or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The steps of a method or algorithm described in connection with the disclosure herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may reside in any form of tangible storage medium. Some examples of storage media that may be used include Random Access Memory (RAM), Read Only Memory (ROM), flash memory, EPROM memory, EEPROM memory, registers, hard disk, removable disk, CD-ROM, and the like. A storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. A software module may be a single instruction, or many instructions, and may be distributed over several different code segments, among different programs, and across multiple storage media.
The methods disclosed herein comprise one or more acts for implementing the described methods. The methods and/or acts may be interchanged with one another without departing from the scope of the claims. In other words, unless a specific order of actions is specified, the order and/or use of specific actions may be modified without departing from the scope of the claims.
The above-described functions may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored as one or more instructions on a tangible computer-readable medium. A storage media may be any available tangible media that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other tangible medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. As used herein, disk (disk) and disc (disc) includes Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk and blu-ray disc where disks usually reproduce data magnetically, while discs reproduce data optically with lasers.
Accordingly, a computer program product may perform the operations presented herein. For example, such a computer program product may be a computer-readable tangible medium having instructions stored (and/or encoded) thereon that are executable by one or more processors to perform the operations described herein. The computer program product may include packaged material.
Software or instructions may also be transmitted over a transmission medium. For example, the software may be transmitted from a website, server, or other remote source using a transmission medium such as coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, or microwave.
Further, modules and/or other suitable means for carrying out the methods and techniques described herein may be downloaded and/or otherwise obtained by a user terminal and/or base station as appropriate. For example, such a device may be coupled to a server to facilitate the transfer of means for performing the methods described herein. Alternatively, the various methods described herein can be provided via storage means (e.g., RAM, ROM, a physical storage medium such as a CD or floppy disk) so that the user terminal and/or base station can obtain the various methods upon coupling or providing the storage means to the device. Moreover, any other suitable technique for providing the methods and techniques described herein to a device may be utilized.
Other examples and implementations are within the scope and spirit of the disclosure and the following claims. For example, due to the nature of software, the functions described above may be implemented using software executed by a processor, hardware, firmware, hard-wired logic, or any combination of these. Features implementing functions may also be physically located at various locations, including being distributed such that portions of functions are implemented at different physical locations. Also, as used herein, including in the claims, "or" as used in a list of items prefaced by "at least one of" indicates a disjunctive list, such that a list of "at least one of A, B, or C" means A or B or C, or AB or AC or BC, or ABC (i.e., A and B and C). Furthermore, the word "exemplary" does not mean that the described example is preferred or better than other examples.
Various changes, substitutions and alterations to the techniques described herein may be made without departing from the teachings defined by the appended claims. Moreover, the scope of the claims of the present disclosure is not limited to the particular aspects of the process, machine, manufacture, composition of matter, means, methods and acts described above. Processes, machines, manufacture, compositions of matter, means, methods, or acts, presently existing or later to be developed, that perform substantially the same function or achieve substantially the same result as the corresponding aspects described herein may be utilized. Accordingly, the appended claims are intended to include within their scope such processes, machines, manufacture, compositions of matter, means, methods, or acts.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the invention. Thus, the present invention is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the invention to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (21)

1. A system for a register file shared read port in a superscalar processor, comprising:
M input ports of a plurality of execution units in total, M being a positive integer;
N read ports of the register file, wherein the read port with read port number x is associated with the input ports from input port number x to input port number (x + M - N), N and x being positive integers with 1 ≤ x ≤ N,
and wherein, in response to one input port of the M input ports requiring a read port, the unassigned read port with the smallest read port number among the read ports associated with that input port is allocated to it.
2. The system of claim 1, wherein in a first instruction issue cycle, the input ports are assigned read ports in a first priority order of input port numbers, and in a second instruction issue cycle adjacent to the first instruction issue cycle, the input ports are assigned read ports in a second priority order of input port numbers different from the first priority order.
3. The system of claim 2, wherein in the first priority order input port number 1 has the highest priority and input port number N has the lowest priority, and in the second priority order input port number N has the highest priority and input port number 1 has the lowest priority.
4. The system of claim 1, wherein a priority of one input port number in a first instruction issue cycle is different from a priority of the input port number in a second instruction issue cycle.
5. The system of claim 1, wherein a priority of an input port number of an input port requiring a read port in an instruction issue cycle is related to how long that input port has been left without an assigned read port.
6. The system of claim 1, wherein, in the case where the number of read ports required by the M input ports is less than or equal to N, all input ports are assigned read ports in one instruction issue cycle.
7. The system of claim 1, wherein, in the case where the number of read ports required by the M input ports is greater than N, if an input port has no assignable read port in one instruction issue cycle, that input port waits for another instruction issue cycle in which a read port is assigned to it.
8. The system of claim 1, wherein instructions requiring different operands are selected for issue in different instruction issue cycles for an instruction issue queue corresponding to an execution unit, such that the number of read ports required in different instruction issue cycles by the execution unit is not always the same.
9. The system of claim 1, wherein the number of read ports required by an input port is determined to be 0 if it is determined that an operand required by the input port is not read from a read port of the register file.
10. The system of claim 1, wherein the values of M and N depend on simulated or actual read port utilization of an input port.
11. A method for sharing read ports of a register file in a superscalar processor, comprising:
setting a total of M input ports of a plurality of execution units, M being a positive integer;
setting N read ports of the register file, wherein the read port with read port number x is associated with the input ports from input port number x to input port number (x + M - N), N and x being positive integers with 1 ≤ x ≤ N;
and, in response to one input port of the M input ports requiring a read port, allocating to that input port the unassigned read port with the smallest read port number among the read ports associated with it.
12. The method of claim 11, wherein in a first instruction issue cycle, read ports are assigned to input ports in a first priority order of input port numbers, and in a second instruction issue cycle adjacent to the first instruction issue cycle, read ports are assigned to input ports in a second priority order of input port numbers different from the first priority order.
13. The method of claim 12, wherein in the first priority order input port number 1 has the highest priority and input port number N has the lowest priority, and in the second priority order input port number N has the highest priority and input port number 1 has the lowest priority.
14. The method of claim 11, wherein a priority of one input port number in a first instruction issue cycle is different from a priority of the input port number in a second instruction issue cycle.
15. The method of claim 11, wherein a priority of an input port number of an input port requiring a read port in an instruction issue cycle is related to how long that input port has been left without an assigned read port.
16. The method of claim 11, wherein, in the case where the number of read ports required by the M input ports is less than or equal to N, all input ports are assigned read ports in one instruction issue cycle.
17. The method of claim 11, wherein, in the case where the number of read ports required by the M input ports is greater than N, if an input port has no assignable read port in one instruction issue cycle, that input port waits for another instruction issue cycle in which a read port is assigned to it.
18. The method of claim 11, wherein instructions requiring different operands are selected for issue in different instruction issue cycles for an instruction issue queue corresponding to an execution unit, such that the number of read ports required in different instruction issue cycles by the execution unit is not always the same.
19. The method of claim 11, wherein, if it is determined that an operand required by an input port is not read from a read port of the register file, the number of read ports required by the input port is determined to be 0.
20. The method of claim 11, wherein the values of M and N depend on simulated or actual read port utilization of the input port.
21. A computer storage medium storing a computer-executable program which, when executed by a processor, performs the method of any one of claims 11-20.
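To make the allocation scheme of claims 1-3 (and 11-13) concrete, the following is a minimal behavioral sketch in Python; it is illustrative only and not part of the patent or the claims. The sizes M = 6 and N = 4, the request set, and the helper names associated_read_ports and allocate are assumptions chosen for the example; the alternating ascending/descending priority order models claims 2-3 and 12-13, and an input port that finds no assignable read port simply retries in a later cycle, as in claims 7 and 17.

# Behavioral sketch (not RTL) of shared read port allocation.
# Read port x (1..N) is associated with input ports x .. x + M - N,
# so input port y can be served by read ports max(1, y-(M-N)) .. min(N, y).

def associated_read_ports(y, M, N):
    # Read ports that may serve input port y (1-based numbering).
    return range(max(1, y - (M - N)), min(N, y) + 1)

def allocate(requests, M, N, ascending=True):
    # Assign read ports for one instruction issue cycle.
    # requests: input-port numbers (1..M) needing a read port this cycle.
    # ascending: priority order of input-port numbers; alternating it between
    # adjacent cycles models the first/second priority orders of claims 2-3.
    # Returns {input_port: read_port}; ports left out retry in a later cycle.
    order = sorted(requests) if ascending else sorted(requests, reverse=True)
    free = set(range(1, N + 1))
    assignment = {}
    for y in order:
        # Pick the smallest-numbered unassigned read port associated with y.
        candidates = [x for x in associated_read_ports(y, M, N) if x in free]
        if candidates:
            chosen = min(candidates)
            assignment[y] = chosen
            free.remove(chosen)
    return assignment

if __name__ == "__main__":
    M, N = 6, 4            # illustrative: 6 execution-unit input ports, 4 read ports
    need = {1, 2, 5, 6}    # input ports requesting operands this cycle
    print(allocate(need, M, N, ascending=True))   # {1: 1, 2: 2, 5: 3, 6: 4}
    print(allocate(need, M, N, ascending=False))  # {6: 4, 5: 3, 2: 1}; port 1 waits

In the ascending cycle every requesting port is served; in the descending cycle input port 1 finds its only associated read port already taken and waits for a later cycle, which is how alternating the priority order keeps the waiting from always falling on the same input ports.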
CN202011293529.9A 2020-11-18 2020-11-18 System, method, and medium for register file shared read port in a superscalar processor Active CN112463217B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011293529.9A CN112463217B (en) 2020-11-18 2020-11-18 System, method, and medium for register file shared read port in a superscalar processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011293529.9A CN112463217B (en) 2020-11-18 2020-11-18 System, method, and medium for register file shared read port in a superscalar processor

Publications (2)

Publication Number Publication Date
CN112463217A 2021-03-09
CN112463217B CN112463217B (en) 2022-07-12

Family

ID=74836270

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011293529.9A Active CN112463217B (en) 2020-11-18 2020-11-18 System, method, and medium for register file shared read port in a superscalar processor

Country Status (1)

Country Link
CN (1) CN112463217B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0518420A2 (en) * 1991-06-13 1992-12-16 International Business Machines Corporation Computer system for concurrent processing of multiple out-of-order instructions
JP2000215081A (en) * 1999-01-27 2000-08-04 Nec Corp Trace information gathering mechanism
WO2006039613A1 (en) * 2004-09-30 2006-04-13 Intel Corporation Method and apparatus to provide a source operand for an instruction in a processor
CN101122851A (en) * 2007-09-12 2008-02-13 华为技术有限公司 Data processing method and processor
WO2013094913A1 (en) * 2011-12-23 2013-06-27 한양대학교 산학협력단 Apparatus and method for controlling multi-way nand flashes by using input-output pins
CN105190539A (en) * 2013-03-15 2015-12-23 索夫特机械公司 Method and apparatus to avoid deadlock during instruction scheduling using dynamic port remapping

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
邓昀 (Deng Yun): "Embedded experimental teaching design for real-time operating systems" (面向实时操作系统的嵌入式实验教学设计), 《科技信息》 (Science & Technology Information) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114138464A (en) * 2021-11-10 2022-03-04 浪潮卓数大数据产业发展有限公司 Container-based port allocation method and system
CN114138464B (en) * 2021-11-10 2023-10-31 浪潮卓数大数据产业发展有限公司 Port distribution method and system based on container

Also Published As

Publication number Publication date
CN112463217B (en) 2022-07-12

Similar Documents

Publication Publication Date Title
JP6294586B2 (en) Execution management system combining instruction threads and management method
JP5894496B2 (en) Semiconductor device
WO2018120991A1 (en) Resource scheduling method and device
US11003429B1 (en) Compile-time scheduling
US20210191765A1 (en) Method for static scheduling of artificial neural networks for a processor
JP2010533924A (en) Scheduling by expanding and reducing resource allocation
US9141436B2 (en) Apparatus and method for partition scheduling for a processor with cores
US9471387B2 (en) Scheduling in job execution
WO2014089976A1 (en) Virtual machine allocation method and apparatus
US20070239888A1 (en) Controlling transmission of data
CN111752615A (en) Apparatus, method and system for ensuring quality of service of multithreaded processor cores
US20240086359A1 (en) Dynamic allocation of arithmetic logic units for vectorized operations
CN112463217B (en) System, method, and medium for register file shared read port in a superscalar processor
CN114168271A (en) Task scheduling method, electronic device and storage medium
CN112925616A (en) Task allocation method and device, storage medium and electronic equipment
US20110276979A1 (en) Non-Real Time Thread Scheduling
US7761691B2 (en) Method for allocating registers using simulated annealing controlled instruction scheduling
CN109032665B (en) Method and device for processing instruction output in microprocessor
TW201007556A (en) Method for instruction pipelining on irregular register files
JP4789269B2 (en) Vector processing apparatus and vector processing method
EP3591518B1 (en) Processor and instruction scheduling method
US7093254B2 (en) Scheduling tasks quickly in a sequential order
WO2013058396A1 (en) Task allocation device and task allocation method
JP7383390B2 (en) Information processing unit, information processing device, information processing method and program
US7895413B2 (en) Microprocessor including register renaming unit for renaming target registers in an instruction with physical registers in a register sub-file

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant