CN117348933B - Processor and computer system - Google Patents

Processor and computer system

Info

Publication number
CN117348933B
CN117348933B (application CN202311652922.6A)
Authority
CN
China
Prior art keywords
instruction
buffer
output
unit
instruction information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311652922.6A
Other languages
Chinese (zh)
Other versions
CN117348933A (en)
Inventor
刘宇翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ruisixinke Shenzhen Technology Co ltd
Original Assignee
Ruisixinke Shenzhen Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ruisixinke Shenzhen Technology Co ltd filed Critical Ruisixinke Shenzhen Technology Co ltd
Priority to CN202311652922.6A priority Critical patent/CN117348933B/en
Publication of CN117348933A publication Critical patent/CN117348933A/en
Application granted granted Critical
Publication of CN117348933B publication Critical patent/CN117348933B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30003: Arrangements for executing specific machine instructions
    • G06F 9/30007: Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F 9/30025: Format conversion instructions, e.g. Floating-Point to Integer, decimal conversion
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)

Abstract

The invention belongs to the technical field of processors and provides a processor and a computer system. By adding additional instruction group buffers that hold the instruction information likely to be output in the next cycle, the instruction buffer unit no longer selects the corresponding instructions from the main buffer through a large crossbar matrix when it needs to output; instead, it selects the contents of the corresponding instruction group buffer for output and judgment, while the instruction group buffers are replenished from the main buffer or from the input of the instruction buffer unit. With this structure, about two stages of logic can be saved on the output path. As the main buffer size and the issue width of the superscalar processor system grow, the timing benefit of the proposed processor increases, which helps achieve a higher operating frequency and better performance.

Description

Processor and computer system
Technical Field
The present invention relates to the field of processor technology, and more particularly, to a processor and a computer system.
Background
The processor is the primary computing component in a computer system. It is responsible for executing instructions and performing arithmetic and logic operations. With the continuous development of computer technology, processor performance and functionality have improved greatly. Processors on the market currently fall into two main categories: reduced instruction set computing (RISC) processors and complex instruction set computing (CISC) processors. A RISC processor achieves efficient instruction processing with a simple instruction set, while a CISC processor supports more complex instructions and higher-level operations. With the development of technologies such as artificial intelligence and machine learning, the performance requirements placed on processors keep increasing.
A simplified typical superscalar processor pipeline is shown in FIG. 1. As shown in FIG. 1, it can be divided into four modules: an instruction fetch unit, an instruction buffer unit, a decoding unit, and an execution unit. The instruction fetch unit is responsible for fetching, in each cycle, the instructions to be executed by the processor from memory; the instruction buffer unit is responsible for holding the instructions fetched by the instruction fetch unit and balancing the instruction throughput gap between the instruction fetch unit and the decoding unit; the decoding unit is responsible for decoding the fetched instructions to obtain operands and sending the relevant information to the execution unit, which executes them to produce results. Among these, the instruction buffer unit is a key component of a superscalar processor, whose function is to store and schedule instructions. In a high-performance superscalar processor system, for reasons of system balance, the number of instructions fetched per cycle by the instruction fetch unit is generally greater than the number of instructions the later decoding stage can decode per cycle, so the fetched instructions are first stored in the instruction buffer unit and then sent to the later decoding unit, balancing the different load requirements on the two sides. Instruction buffer unit performance therefore directly affects the performance and efficiency of a superscalar processor.
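To make this division concrete, the following is a minimal, hypothetical cycle-level C++ sketch of the four modules and of the width imbalance the instruction buffer unit absorbs; the class names and the example widths (4-wide fetch, 3-wide decode) are illustrative assumptions, not taken from the patent.

```cpp
// Hypothetical cycle-level sketch of the four-stage division described above
// (fetch -> instruction buffer -> decode -> execute). Names and widths are
// illustrative assumptions; capacity limits and back-pressure are omitted.
#include <cstdint>
#include <deque>
#include <vector>

struct Instr { uint32_t raw = 0; };

struct Pipeline {
    std::vector<Instr> fetch_group;   // produced by the fetch unit each cycle
    std::deque<Instr>  instr_buffer;  // balances fetch width vs. decode width
    std::vector<Instr> decoded;       // operands resolved, ready to execute

    static constexpr int kFetchWidth  = 4;  // fetch is usually wider...
    static constexpr int kDecodeWidth = 3;  // ...than decode

    void step() {
        // Call the stages back to front so data moves one stage per cycle.
        execute();
        decode();
        buffer_refill();
        fetch();
    }
    void fetch()  { /* pull up to kFetchWidth instructions from memory */ }
    void buffer_refill() {
        for (const auto& i : fetch_group) instr_buffer.push_back(i);
        fetch_group.clear();
    }
    void decode() {
        for (int n = 0; n < kDecodeWidth && !instr_buffer.empty(); ++n) {
            decoded.push_back(instr_buffer.front());
            instr_buffer.pop_front();
        }
    }
    void execute() { decoded.clear(); /* consume decoded instructions */ }
};
```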
The main body of an existing instruction buffer unit is a fixed-size first-in first-out (FIFO) queue, i.e., a main buffer, with the structure shown in FIG. 2. When the superscalar processor runs, the instruction fetch unit of the preceding stage stores the fetched pieces of instruction information in the main buffer of the instruction buffer unit in order. Meanwhile, at the output of the instruction buffer unit, a certain number of instructions are issued to the later-stage processing unit each cycle according to the number of instructions currently remaining in the instruction buffer unit.
As shown in FIG. 2, the conventional instruction buffer unit mainly consists of two crossbar matrices, an output selection logic unit, an output buffer, and a main buffer. In operation, when the previous stage delivers an instruction packet, if the remaining space in the main buffer of the instruction buffer unit is insufficient, the front-end request is back-pressured; if the remaining space is sufficient, the main buffer positions for the incoming instruction packet are selected according to the current write pointer of the main buffer, the packet is stored into the FIFO of the main buffer in order, and the FIFO write pointer is updated. When the instruction buffer unit outputs instructions to the next stage, the instructions to be output are selected from the main buffer through a crossbar matrix according to the FIFO read pointer and stored in the output buffer; the output selection logic unit then determines which instructions can actually be output and sends them to the decoding unit of the next stage.
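The following is a hypothetical C++ sketch of the conventional structure just described: a fixed-size FIFO main buffer with read and write pointers, a write side that back-pressures the front end when the packet does not fit, and a read-side crossbar that copies the entries at the read pointer into the output buffer. The depths (32-entry main buffer, 6-entry output buffer) are assumptions chosen only for illustration.

```cpp
// Hypothetical sketch of the conventional instruction buffer unit of FIG. 2.
#include <array>
#include <cstdint>

constexpr int kMainDepth   = 32;  // assumed main-buffer depth
constexpr int kOutputDepth = 6;   // assumed output-buffer depth

struct Entry { uint16_t half = 0; bool valid = false; };  // one buffer slot

struct ConventionalIBuf {
    std::array<Entry, kMainDepth> main_buf{};
    std::array<Entry, kOutputDepth> out_buf{};
    int rd_ptr = 0, wr_ptr = 0, count = 0;

    // Write side: back-pressure if the packet does not fit, otherwise enqueue
    // at the write pointer and update it.
    bool push(const Entry* pkt, int n) {
        if (kMainDepth - count < n) return false;   // back-pressure the front end
        for (int i = 0; i < n; ++i) {
            main_buf[wr_ptr] = pkt[i];
            wr_ptr = (wr_ptr + 1) % kMainDepth;
        }
        count += n;
        return true;
    }

    // Read side: the crossbar selects kOutputDepth entries starting at rd_ptr.
    // Because of the FIFO wrap, each output slot is effectively a
    // kMainDepth-to-1 selection.
    void crossbar_read() {
        for (int i = 0; i < kOutputDepth; ++i)
            out_buf[i] = main_buf[(rd_ptr + i) % kMainDepth];
    }
};
```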
In a superscalar processor system of this architecture, to ensure that the pipeline at the instruction buffer unit runs continuously without interruption, the instruction buffer unit needs to complete the following work in each cycle, as shown in FIG. 3.
1. According to the latest main buffer read pointer of the current cycle, read out of the main buffer the range of instructions that may be output this cycle.
2. According to the content of the first instruction in that range, determine whether it is a compressed or non-compressed instruction, and from this determine the starting position of the next instruction.
3. Repeat the previous step in a loop to determine the boundaries of all instructions and obtain the number of main buffer entries occupied by the output instructions.
4. Based on the current valid depth of the main buffer, determine the number of instructions that can actually be issued to the later pipeline stages.
5. Update the read pointer of the main buffer.
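Continuing the ConventionalIBuf sketch above, steps 1 to 5 of this per-cycle output path can be written out as follows. The 16-bit/32-bit boundary test is an assumption in the style of a RISC-V compressed encoding; the patent itself only distinguishes compressed from non-compressed instructions.

```cpp
// Hypothetical sketch of the per-cycle output path (steps 1-5 above), reusing
// the ConventionalIBuf types. The compressed-instruction test (low two bits
// not 0b11 means a 16-bit instruction) is an illustrative assumption.
struct OutputPathResult { int instr_count; int entries_used; };

OutputPathResult output_path(const ConventionalIBuf& b, int max_issue /* e.g. 3 */) {
    // Step 1: read the candidate window at the current read pointer
    // (this is the crossbar read of FIG. 2).
    Entry window[kOutputDepth];
    for (int i = 0; i < kOutputDepth; ++i)
        window[i] = b.main_buf[(b.rd_ptr + i) % kMainDepth];

    // Steps 2-3: walk the window, classifying each instruction as compressed
    // (1 entry) or non-compressed (2 entries), to find all boundaries.
    int instr_count = 0, entries_used = 0;
    while (instr_count < max_issue && entries_used < kOutputDepth &&
           entries_used < b.count) {
        bool compressed = (window[entries_used].half & 0x3u) != 0x3u;  // assumption
        int len = compressed ? 1 : 2;
        // Step 4: stop if the valid depth or the window cannot supply the
        // whole instruction.
        if (entries_used + len > b.count || entries_used + len > kOutputDepth) break;
        entries_used += len;
        ++instr_count;
    }
    // Step 5 (done by the caller): advance rd_ptr by entries_used.
    return {instr_count, entries_used};
}
```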
It can be seen that, since both the starting point and the end point of this path are the read pointer of the main buffer, the whole path must be completed within one cycle to keep the pipeline at the instruction buffer unit running without interruption. This becomes a key factor limiting the frequency the superscalar processor can achieve. At the same time, the logic depth of the crossbar matrix in the output path and the complexity of the output selection logic grow with the main buffer size of the instruction buffer unit and the issue width of the superscalar processor system; this is even more significant for high-performance, high-throughput systems and can therefore become a bottleneck in some computation-intensive applications, preventing the superscalar processor from fully using its processing power.
Disclosure of Invention
The invention provides a processor and a computer system, which solve the problem in the prior art that the critical timing path associated with the output end of the instruction buffer unit in the processor is too long, affecting processor performance and efficiency.
In a first aspect, the present invention provides a processor, where the processor includes an instruction fetch unit, an instruction buffer unit connected to the output end of the instruction fetch unit, a decoding unit connected to the output end of the instruction buffer unit, and an execution unit connected to the output end of the decoding unit;
the instruction fetch unit is used for fetching, in each cycle, the instruction information to be executed by the processor from memory; the instruction buffer unit is used for storing the instruction information fetched by the instruction fetch unit and balancing the throughput gap of instruction information between the instruction fetch unit and the decoding unit; the decoding unit is used for decoding the fetched instruction information to obtain operands and sending them to the execution unit; the execution unit is used for executing the instruction information and obtaining results;
the instruction buffer unit comprises a main buffer, a first crossbar matrix connected with the input end of the main buffer, a plurality of instruction group buffers respectively connected with the output end of the main buffer and the output end of the first crossbar matrix, a second crossbar matrix connected with the output ends of the plurality of instruction group buffers, an output buffer connected with the output end of the second crossbar matrix and an output selection logic unit connected with the output end of the output buffer; the output end of the output selection logic unit is connected with the decoding unit;
the instruction group buffer is used for storing instruction information to be output in the next cycle; the first crossbar matrix is used for splitting the instruction information output by the instruction fetch unit into pieces matching the storage granularity of the instruction buffer unit according to the instruction information currently stored in the instruction buffer unit, and outputting them to the corresponding positions in the main buffer and the instruction group buffers for storage; the main buffer is used for storing the instruction information output by the instruction fetch unit; the second crossbar matrix is used for selecting the instruction information in the corresponding instruction group buffer from the plurality of instruction group buffers and outputting it to the output buffer; the output buffer is used for temporarily storing the instruction information of the corresponding instruction group buffer; the output selection logic unit is used for making a judgment according to the instruction information output by the output buffer, selecting the instruction information to be output in the current cycle, and outputting it to the decoding unit.
Preferably, after the main buffer receives the instruction information output by the instruction fetch unit, if the remaining space of the main buffer is currently insufficient, a back-pressure is applied to the front-end request of the instruction fetch unit; if the remaining space of the main buffer is sufficient, the instruction information is stored into the first-in first-out queue of the main buffer in order, and the write pointer position of the first-in first-out queue is updated.
Preferably, the number of instruction group buffers corresponds to the maximum number of entries the main buffer can output per cycle.
Preferably, the instruction information received by the instruction group buffer includes instruction information output by the main buffer and instruction information output by the instruction fetch unit in the current cycle.
Preferably, after the output selection logic unit outputs instruction information to the decoding unit, the read pointer position of the main buffer is updated according to the amount of instruction information output.
Preferably, the instruction information of the corresponding instruction group buffer is selected as the instruction information output by the output buffer according to the amount of instruction information output by the main buffer in the previous cycle.
In a second aspect, the present invention also provides a computer system comprising a processor as in any one of the above embodiments.
Compared with the prior art, the invention adds additional instruction group buffers to hold the instruction information likely to be output in the next cycle, so that when the instruction buffer unit needs to output, it no longer selects the corresponding instructions from the main buffer through a large crossbar matrix; instead, it selects the contents of the corresponding instruction group buffer for output and judgment, while the instruction group buffers are replenished from the main buffer or from the input of the instruction buffer unit. With this structure, about two stages of logic can be saved. As the main buffer size and the issue width of the superscalar processor system grow, the timing benefit of the proposed processor increases, which helps achieve a higher operating frequency and better performance.
Drawings
The present invention will be described in detail with reference to the accompanying drawings. The foregoing and other aspects of the invention will become more apparent and more readily appreciated from the following detailed description taken in conjunction with the accompanying drawings. In the accompanying drawings:
FIG. 1 is a schematic diagram of a simplified superscalar processor pipeline provided by the related art;
FIG. 2 is a schematic diagram of an instruction buffer unit according to the related art;
FIG. 3 is a schematic diagram of a critical timing path of an output stage of an instruction buffer unit according to the related art;
FIG. 4 is a schematic diagram of an instruction buffer unit according to an embodiment of the present invention.
Reference numerals: 100, instruction buffer unit; 101, first crossbar matrix; 102, main buffer; 103, instruction group buffer; 104, second crossbar matrix; 105, output buffer; 106, output selection logic unit.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Embodiment 1
Referring to FIG. 4, the present invention provides a processor, which includes an instruction fetch unit, an instruction buffer unit 100 connected to the output end of the instruction fetch unit, a decoding unit connected to the output end of the instruction buffer unit 100, and an execution unit connected to the output end of the decoding unit;
the instruction fetch unit is used for fetching, in each cycle, the instruction information to be executed by the processor from memory; the instruction buffer unit 100 is configured to store the instruction information fetched by the instruction fetch unit and to balance the throughput gap of instruction information between the instruction fetch unit and the decoding unit; the decoding unit is used for decoding the fetched instruction information to obtain operands and sending them to the execution unit; the execution unit is used for executing the instruction information and obtaining results;
the instruction buffer unit 100 includes a main buffer 102, a first crossbar matrix 101 connected to an input of the main buffer 102, a plurality of instruction group buffers 103 connected to an output of the main buffer 102 and an output of the first crossbar matrix 101, respectively, a second crossbar matrix 104 connected to outputs of the plurality of instruction group buffers 103, an output buffer 105 connected to an output of the second crossbar matrix 104, and an output selection logic unit 106 connected to an output of the output buffer 105; wherein, the output end of the output selection logic unit 106 is connected with the decoding unit;
the instruction set buffer 103 is used for storing instruction information to be output in the next period; the first crossbar 101 is configured to split the instruction information output by the instruction fetching unit into corresponding instruction information with a size stored in the instruction buffer unit 100 according to the current instruction information situation stored in the instruction buffer unit 100, and output the instruction information to corresponding positions in the main buffer 102 and the instruction group buffer 103 for storage; the main buffer 102 is configured to store instruction information output by the instruction fetch unit; the second crossbar 104 is configured to select instruction information in the corresponding instruction set buffer 103 from the plurality of instruction set buffers 103, and output the instruction information to the output buffer 105; the output buffer 105 is configured to temporarily store instruction information in the corresponding instruction set buffer 103; the output selection logic unit 106 is configured to determine according to the instruction information output by the output buffer 105, and select instruction information to be output in the current period to output to the decoding unit.
In the embodiment of the present invention, after the main buffer 102 receives the instruction information output by the instruction fetch unit, if the remaining space of the main buffer 102 is currently insufficient, a back-pressure is applied to the front-end request of the instruction fetch unit; if the remaining space of the main buffer 102 is sufficient, the instruction information is stored into the first-in first-out queue of the main buffer 102 in order, and the write pointer position of the first-in first-out queue is updated.
In an embodiment of the present invention, the number of instruction group buffers 103 corresponds to the maximum number of entries the main buffer 102 can output per cycle. Specifically, the instruction buffer unit 100 outputs at most 3 instructions per cycle, so the main buffer 102 outputs at most 6 entries of content per cycle; 7 instruction group buffers 103 are therefore required (the extra one covers the case in which no instruction is output in the current cycle), and each instruction group buffer 103 holds one complete output buffer 105 worth of content, i.e., 6 entries. It should be noted that the number of instruction group buffers 103 is related to the maximum output of the instruction buffer unit 100 per cycle, and other numbers of instruction group buffers 103 are possible.
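As a quick check of these numbers, under the assumption that a non-compressed instruction occupies two main-buffer entries (consistent with the 3-instruction, 6-entry figures above, though not stated explicitly in the text), the count of instruction group buffers follows directly:

```cpp
// Worked check of the buffer count used in this embodiment. The "up to 2
// entries per instruction" factor is an assumption, not an explicit statement
// of the patent.
constexpr int kMaxInstrPerCycle   = 3;  // at most 3 instructions output per cycle
constexpr int kMaxEntriesPerInstr = 2;  // assumed: a non-compressed instruction spans 2 entries
constexpr int kMaxEntriesConsumed = kMaxInstrPerCycle * kMaxEntriesPerInstr;  // = 6

// The number of entries consumed in a cycle can be anything from 0 (nothing
// output) up to kMaxEntriesConsumed, so one instruction group buffer is kept
// per possible value, each pre-holding the 6-entry window that would become
// current for that consumption count.
constexpr int kNumGroupBuffers = kMaxEntriesConsumed + 1;  // = 7
static_assert(kNumGroupBuffers == 7, "matches the 7 instruction group buffers of the example");
```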
In the embodiment of the present invention, the instruction information received by the instruction group buffer 103 includes the instruction information output by the main buffer 102 and the instruction information output by the instruction fetch unit in the current cycle. Specifically, when the corresponding instruction information in the main buffer 102 is valid, the instruction group buffer 103 receives the instruction information output by the main buffer 102; otherwise, it is checked whether there is currently valid input instruction information, and when the instruction fetch unit provides valid instruction information, the corresponding instruction information input by the instruction fetch unit is selected instead.
For example, taking 7 instruction group buffers 103: when the instruction buffer unit 100 outputs instruction information to the decoding unit, if the main buffer 102 output 2 entries of instruction information in the previous cycle, the instruction group buffer 103 corresponding to the count 2 among the 7 instruction group buffers 103 is selected; its instruction information is output to the output buffer 105 through the second crossbar matrix 104, then screened by the output selection logic unit 106, and finally output to the decoding unit.
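Putting the pieces together, a hypothetical per-cycle output step for the InstrBufferUnit100 sketch above might look as follows; it reuses the compressed-instruction assumption from earlier, and assumes that the first crossbar matrix 101 also writes the same incoming instruction information into the main buffer 102 elsewhere in the cycle.

```cpp
// Hypothetical per-cycle output step of instruction buffer unit 100.
struct IssueResult { int instr_count; int entries_used; };

IssueResult issue_cycle(InstrBufferUnit100& u, const Slot* new_input, int new_count) {
    // Second crossbar 104: one 7-to-1 selection; e.g. if 2 entries were
    // consumed from the main buffer last cycle, group buffer number 2 already
    // holds the window that starts 2 entries later.
    u.out_buf = u.group_buf[u.entries_consumed_last_cycle];

    // Output selection logic 106: find instruction boundaries in out_buf
    // (same 16/32-bit compressed-instruction assumption as before) and decide
    // how many instructions really go to the decoding unit.
    int instrs = 0, used = 0;
    while (instrs < kIssueWidth && used < kGroupDepth && u.out_buf[used].valid) {
        int len = ((u.out_buf[used].half & 0x3u) != 0x3u) ? 1 : 2;  // assumption
        if (used + len > kGroupDepth) break;                // spills past the window
        if (len == 2 && !u.out_buf[used + 1].valid) break;  // second half missing
        used += len;
        ++instrs;
    }

    // Update the main buffer read pointer by the entries consumed this cycle.
    u.rd_ptr = (u.rd_ptr + used) % kMainBufDepth;
    u.count -= used;
    u.entries_consumed_last_cycle = used;

    // Refill the group buffers for the next cycle: group k mirrors the window
    // rd_ptr + k .. rd_ptr + k + kGroupDepth - 1. A slot comes from the main
    // buffer when that entry is valid; otherwise it is taken directly from the
    // instruction information arriving this cycle (the first crossbar 101
    // path), as described above.
    for (int k = 0; k < kGroupBufCount; ++k) {
        for (int i = 0; i < kGroupDepth; ++i) {
            int off = k + i;
            if (off < u.count)
                u.group_buf[k][i] = u.main_buf[(u.rd_ptr + off) % kMainBufDepth];
            else if (off - u.count < new_count)
                u.group_buf[k][i] = new_input[off - u.count];  // bypass from input
            else
                u.group_buf[k][i] = Slot{};                    // nothing valid yet
        }
    }
    return IssueResult{instrs, used};
}
```

In this sketch the only selection left on the output side is the 7-to-1 pick of a group buffer, which is what the comparison in the next paragraph quantifies as a saving of about two logic levels.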
In the existing processor design, the instruction output end needs a 32-to-1 selection to obtain the contents of the output buffer, i.e., about 6 levels of logic; in the processor of the present design, only a 7-to-1 selection is needed, i.e., about 4 levels of logic, so about two levels of logic can be saved.
In the embodiment of the present invention, after the output selection logic unit 106 outputs instruction information to the decoding unit, the read pointer position of the main buffer 102 is updated according to the amount of instruction information output.
In the embodiment of the present invention, the instruction information of the corresponding instruction group buffer 103 is selected as the instruction information output by the output buffer 105 according to the amount of instruction information output by the main buffer 102 in the previous cycle.
Compared with the prior art, the invention adds additional instruction group buffers to hold the instruction information likely to be output in the next cycle, so that when the instruction buffer unit needs to output, it no longer selects the corresponding instructions from the main buffer through a large crossbar matrix; instead, it selects the contents of the corresponding instruction group buffer for output and judgment, while the instruction group buffers are replenished from the main buffer or from the input of the instruction buffer unit. With this structure, about two stages of logic can be saved. As the main buffer size and the issue width of the superscalar processor system grow, the timing benefit of the proposed processor increases, which helps achieve a higher operating frequency and better performance.
Embodiment 2
The embodiment of the present invention also provides a computer system, including the processor described in Embodiment 1. Since the computer system in this embodiment includes the processor of the above embodiment, it can achieve the technical effects achieved by the processor of Embodiment 1, which are not described here again.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
While the embodiments of the present invention have been illustrated and described in connection with the drawings as what are presently considered the most practical and preferred embodiments, it is to be understood that the invention is not limited to the disclosed embodiments but, on the contrary, is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (7)

1. A processor, characterized in that the processor comprises an instruction fetch unit, an instruction buffer unit connected to the output end of the instruction fetch unit, a decoding unit connected to the output end of the instruction buffer unit, and an execution unit connected to the output end of the decoding unit;
the instruction fetch unit is used for fetching, in each cycle, the instruction information to be executed by the processor from memory; the instruction buffer unit is used for storing the instruction information fetched by the instruction fetch unit and balancing the throughput gap of instruction information between the instruction fetch unit and the decoding unit; the decoding unit is used for decoding the fetched instruction information to obtain operands and sending them to the execution unit; the execution unit is used for executing the instruction information and obtaining results;
the instruction buffer unit comprises a main buffer, a first crossbar matrix connected with the input end of the main buffer, a plurality of instruction group buffers respectively connected with the output end of the main buffer and the output end of the first crossbar matrix, a second crossbar matrix connected with the output ends of the plurality of instruction group buffers, an output buffer connected with the output end of the second crossbar matrix and an output selection logic unit connected with the output end of the output buffer; the output end of the output selection logic unit is connected with the decoding unit;
the instruction group buffer is used for storing instruction information to be output in the next cycle; the first crossbar matrix is used for splitting the instruction information output by the instruction fetch unit into pieces matching the storage granularity of the instruction buffer unit according to the instruction information currently stored in the instruction buffer unit, and outputting them to the corresponding positions in the main buffer and the instruction group buffers for storage; the main buffer is used for storing the instruction information output by the instruction fetch unit; the second crossbar matrix is used for selecting the instruction information in the corresponding instruction group buffer from the plurality of instruction group buffers and outputting it to the output buffer; the output buffer is used for temporarily storing the instruction information of the corresponding instruction group buffer; the output selection logic unit is used for making a judgment according to the instruction information output by the output buffer, selecting the instruction information to be output in the current cycle, and outputting it to the decoding unit.
2. The processor as set forth in claim 1, wherein, after said main buffer receives instruction information output by said instruction fetch unit, if the remaining space of said main buffer is currently insufficient, a back-pressure is applied to the front-end request of said instruction fetch unit; and if the remaining space of said main buffer is sufficient, the instruction information is stored into the first-in first-out queue of said main buffer in order, and the write pointer position of the first-in first-out queue is updated.
3. The processor of claim 1, wherein the number of instruction group buffers corresponds to the maximum number of entries the main buffer can output per cycle.
4. The processor of claim 1, wherein the instruction information received by the instruction group buffer includes instruction information output by the main buffer and instruction information output by the instruction fetch unit in the current cycle.
5. The processor of claim 1, wherein, after the output selection logic unit outputs instruction information to the decoding unit, the read pointer position of the main buffer is updated according to the amount of instruction information output.
6. The processor of claim 1, wherein instruction information of the corresponding instruction group buffer is selected as the instruction information output by the output buffer according to the amount of instruction information output by the main buffer in the previous cycle.
7. A computer system comprising a processor as claimed in any one of claims 1 to 6.
CN202311652922.6A 2023-12-05 2023-12-05 Processor and computer system Active CN117348933B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311652922.6A CN117348933B (en) 2023-12-05 2023-12-05 Processor and computer system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311652922.6A CN117348933B (en) 2023-12-05 2023-12-05 Processor and computer system

Publications (2)

Publication Number Publication Date
CN117348933A (en) 2024-01-05
CN117348933B (en) 2024-02-06

Family

ID=89361763

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311652922.6A Active CN117348933B (en) 2023-12-05 2023-12-05 Processor and computer system

Country Status (1)

Country Link
CN (1) CN117348933B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0651321A1 (en) * 1993-10-29 1995-05-03 Advanced Micro Devices, Inc. Superscalar microprocessors
CN102156637A (en) * 2011-05-04 2011-08-17 中国人民解放军国防科学技术大学 Vector crossing multithread processing method and vector crossing multithread microprocessor
CN105242904A (en) * 2015-09-21 2016-01-13 中国科学院自动化研究所 Apparatus for processor instruction buffering and circular buffering and method for operating apparatus
WO2016016726A2 (en) * 2014-07-30 2016-02-04 Linear Algebra Technologies Limited Vector processor
CN111512298A (en) * 2018-04-03 2020-08-07 英特尔公司 Apparatus, method and system for conditional queuing in configurable spatial accelerators
CN116483441A (en) * 2023-06-21 2023-07-25 睿思芯科(深圳)技术有限公司 Output time sequence optimizing system, method and related equipment based on shift buffering
CN116501389A (en) * 2023-06-28 2023-07-28 睿思芯科(深圳)技术有限公司 Instruction buffer unit, processor and computer system

Also Published As

Publication number Publication date
CN117348933A (en) 2024-01-05

Similar Documents

Publication Publication Date Title
EP0689128B1 (en) Computer instruction compression
FI90804B (en) A data processor control unit having an interrupt service using instruction prefetch redirection
EP1368732B1 (en) Digital signal processing apparatus
EP2671150B1 (en) Processor with a coprocessor having early access to not-yet issued instructions
EP1046983B1 (en) VLIW processor and program code compression device and method
US6108768A (en) Reissue logic for individually reissuing instructions trapped in a multiissue stack based computing system
US6654871B1 (en) Device and a method for performing stack operations in a processing system
CN116483441B (en) Output time sequence optimizing system, method and related equipment based on shift buffering
US6275903B1 (en) Stack cache miss handling
CN116501389B (en) Instruction buffer unit, processor and computer system
CN117348933B (en) Processor and computer system
US6237086B1 (en) 1 Method to prevent pipeline stalls in superscalar stack based computing systems
JP3779012B2 (en) Pipelined microprocessor without interruption due to branching and its operating method
US6237087B1 (en) Method and apparatus for speeding sequential access of a set-associative cache
US6170050B1 (en) Length decoder for variable length data
US5155818A (en) Unconditional wide branch instruction acceleration
EP0992889A1 (en) Interrupt processing during iterative instruction execution
KR100639146B1 (en) Data processing system having a cartesian controller
US6550003B1 (en) Not reported jump buffer
CN117667222B (en) Two-stage branch prediction system, method and related equipment with optimized time sequence
CN112181497B (en) Method and device for transmitting branch target prediction address in pipeline
CN109614146B (en) Local jump instruction fetch method and device
US7941638B2 (en) Facilitating fast scanning for control transfer instructions in an instruction fetch unit
EP0689129B1 (en) Processing of computer instructions with a reduced number of bits for operand specifiers
WO2007094256A1 (en) Queue processor and data processing method using queue processor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant