WO2021218633A1

WO2021218633A1 - CPU instruction processing method, controller, and central processing unit

Info

Publication number: WO2021218633A1
Application number: PCT/CN2021/087176
Authority: WO
Inventors: 马凌; 姚四海; 何昌华
Original assignee: 支付宝(杭州)信息技术有限公司
Priority date: 2020-04-28
Filing date: 2021-04-14
Publication date: 2021-11-04
Also published as: CN111538535B; CN111538535A

Abstract

The present application discloses a CPU instruction processing method, a controller, and a central processing unit. The method comprises: fetching instructions to form an instruction block and sending the instruction block to a CPU execution unit, wherein the instruction block comprises a single jump instruction and a branch instruction obtained by CPU instruction prediction; causing the CPU execution unit to execute the instructions before the jump instruction and the jump instruction itself; and, before the jump target instruction of the jump instruction is determined, refusing to let the branch instruction enter the execution phase. The solution makes full use of the accuracy of instruction prediction (98%) while avoiding security problems and reducing the performance and power consumption problems caused by prediction failures, thereby improving the efficiency of the CPU.

Description

CPU instruction processing method, controller and central processing unit

Technical field

This application relates to the field of computer technology, and in particular to a CPU instruction processing method, a controller, and a central processing unit (CPU).

Background

In the current big data cloud environment, massive amounts of data must be stored and processed, which places higher demands on computation speed. As is well known, the decisive factor in computation speed is the performance of the central processing unit (CPU). To achieve faster computation, CPUs are constantly being improved in all aspects, from physical technology to logic control.

For example, to improve parallel processing capability, CPU hyper-threading technology has been proposed. It uses special hardware instructions to present a single physical core as two logical cores, so that a single processor can perform thread-level parallel computation and is compatible with multi-threaded parallel computing. In other words, a hyper-threaded CPU can run two or more threads in parallel on one physical core, thereby obtaining more parallel instructions and improving overall performance. In addition, to make more effective use of CPU clock cycles and avoid pipeline stalls or waits, instruction prediction is used for instruction prefetching and instruction pre-execution.

These schemes have improved the execution efficiency of the CPU to a certain extent. However, instruction prediction is not always accurate (its accuracy is about 98%). Although the CPU uses instruction prediction to improve data and instruction parallelism, the 2% of predictions that fail cause a performance loss of about 25% and also introduce security vulnerabilities (Meltdown and Spectre).

Summary of the invention

In view of this, the embodiments of this specification provide a CPU instruction processing method, a controller, and a central processing unit (CPU). While making full use of the accuracy of instruction prediction (98%), the solution avoids security problems, reduces the performance and power consumption problems caused by prediction failures, and improves the efficiency of the CPU.

The embodiments of this specification adopt the following technical solutions:

An embodiment of this specification provides a CPU instruction processing method. The method includes: fetching instructions to form an instruction block to be sent to a CPU execution unit, the instruction block including a single jump instruction and a branch instruction obtained by CPU instruction prediction; and causing the CPU execution unit to execute the instructions before the jump instruction and the jump instruction itself, and, before the jump target instruction of the jump instruction is determined, refusing to let the branch instruction enter the execution stage.

An embodiment of this specification also provides a CPU controller, including: an instruction extraction unit, configured to fetch instructions to form an instruction block to be sent to the CPU execution unit, the instruction block including a single jump instruction and a branch instruction obtained by CPU instruction prediction; and an execution operation unit, configured to cause the CPU execution unit to execute the instructions before the jump instruction and the jump instruction itself, and, before the jump target instruction of the jump instruction is determined, to refuse to let the branch instruction enter the execution stage.

The embodiment of this specification also provides a central processing unit, including the above-mentioned controller.

At least one of the technical solutions adopted in the embodiments of this application can achieve the following beneficial effects. The solution in this specification makes full use of the instruction prediction function of existing CPUs: the predicted instructions pass through the stages of fetching, decoding, renaming, and execution resource allocation, and become ready for execution. Before the jump target instruction of the jump instruction is determined, the branch instruction (that is, the predicted instruction) is refused entry to the execution stage, so only determined instructions (such as the determined jump target instruction) enter the execution stage and are executed. This avoids security problems, reduces the performance and power consumption problems caused by prediction failures, reduces execution conflicts between hyper-threads within the CPU, and improves the overall throughput of the CPU in big data scenarios.

Description of the drawings

In order to describe the technical solutions in the embodiments of this specification or in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some of the embodiments described in this specification. Those of ordinary skill in the art can obtain other drawings from these drawings without creative effort:

FIG. 1 shows a CPU execution process provided by an embodiment of this specification;

FIG. 2 is a flowchart of a CPU instruction processing method provided by an embodiment of this specification;

FIG. 3 is a functional block diagram of a CPU controller provided by an embodiment of this specification.

Detailed description

The Meltdown and Spectre vulnerabilities mentioned in the background are currently mitigated in software, but at a cost in performance. The solution in this specification ensures security while greatly reducing the performance loss caused by failed jump predictions, and improves the overall throughput of the CPU.

In order to enable those skilled in the art to better understand the technical solutions in this specification, the technical solutions in the embodiments of this specification are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of this specification without creative effort shall fall within the protection scope of this application.

Terms used in the embodiments of this specification:

CPU hyper-threading: two or more threads run in parallel on a single physical core, thereby obtaining more parallel instructions and improving overall performance.

CPU instruction prediction: the destination address of a jump instruction is predicted from the historical execution of instructions.

CPU instruction pre-execution stage: before a jump instruction obtains a valid destination address, the CPU obtains executable code through instruction prediction. The execution of this predicted code is regarded as the instruction pre-execution stage.

FIG. 1 shows a CPU execution process provided by an embodiment of this specification. As shown in FIG. 1, the entire execution process is divided into multiple stages. The first is the instruction fetch stage. Current mainstream CPUs can fetch 16 bytes per instruction cycle, which is about 4 instructions each time. Instruction pre-decoding follows. The main task of the pre-decoding stage is to identify the length of each instruction and to mark jump instructions. Generally speaking, mainstream CPUs have a throughput of 5 instructions per cycle at this stage.

After pre-decoding comes the decoding stage. The decoding stage mainly transforms complex instructions into simpler fixed-length instructions and specifies the type of operation. This stage also usually has a throughput of 5 instructions per cycle. The decoded instructions are put into the decoded cache.

The decoded cache serves as an instruction cache pool, in which multiple decoded instructions can be stored for the next stage to read. The throughput of the decoded cache to the next stage can reach 6 instructions per cycle.

As mentioned earlier, a hyper-threaded CPU can have multiple threads executing in parallel. During execution, each thread reads the instructions it will execute next to form its own thread cache queue. If an instruction to be executed exists in the decoded cache, the copy stored in the decoded cache is used; otherwise, the corresponding instruction is obtained from the front end (memory) and added to the queue. FIG. 1 shows the thread cache queues of thread A and thread B as an example, but it can be understood that a hyper-threaded CPU can also support the parallel execution of more threads.

After the thread cache queue is formed, the next stage is renaming and execution resource allocation. This stage usually includes renaming 1, renaming 2, and allocating execution resources. The throughput from the thread cache queue to this stage can reach 5 instructions per cycle. The main task of this stage is to resolve register read and write dependencies, remove unnecessary dependencies so that more instructions can execute in parallel, and allocate the various resources required for execution.

After the resources required for execution are allocated, the instructions are sent to the execution unit of the CPU for execution. Current CPUs have multiple execution units. The most common CPU currently has 8 pipelines that can execute in parallel, that is, 8 micro-operations can be executed per cycle. Although instructions can be executed out of order, the order in which they are finally committed is the same as the program order.
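The dependency removal performed in the renaming stage can be illustrated with a minimal sketch. It assumes a simple register alias table in which every write receives a fresh physical register; the function name rename, the register names, and the three-instruction program are hypothetical and are not taken from this specification.

```python
from itertools import count


def rename(instructions, num_arch_regs=8):
    """Map architectural registers to physical registers.

    Each instruction is (dest, sources). Every write gets a fresh physical
    register, so only true read-after-write dependencies remain.
    """
    # Assumption: architectural register rN initially lives in physical pN.
    mapping = {f"r{i}": f"p{i}" for i in range(num_arch_regs)}
    fresh = count(num_arch_regs)  # index of the next unused physical register
    renamed = []
    for dest, sources in instructions:
        # Sources read the current mapping before the destination is remapped.
        phys_sources = tuple(mapping[s] for s in sources)
        mapping[dest] = f"p{next(fresh)}"
        renamed.append((mapping[dest], phys_sources))
    return renamed


program = [
    ("r2", ("r1",)),       # r2 <- load through r1
    ("r3", ("r1",)),       # r3 <- load through r1
    ("r2", ("r3", "r2")),  # r2 <- r3 + r2, a true dependency on both loads
]
renamed = rename(program)
```

In this sketch the second write to r2 lands in a different physical register from the first, so the two loads are free to proceed in parallel and only the final addition waits for them.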

As mentioned earlier, in order to avoid pipeline stalls or waits caused by instruction misses, almost all CPUs currently use instruction prediction, also known as branch prediction, to predict and prefetch instructions. At the end of each cycle, the prediction unit predicts the instructions to be prefetched according to the historical execution state table it maintains. If the instruction does not jump, the instruction fetch stage fetches the instruction block at the current fetch address plus 16 bytes. If the instruction jumps, the instructions of the predicted branch are fetched according to the prediction result.

After continuous improvement, the accuracy of current instruction prediction schemes can exceed 90%, and some schemes can even reach 98%. However, a prediction can still be wrong, in which case a wrong instruction block is very likely to be sent to the execution unit.

For example, suppose there are instructions L1, L2, L3, L4, and L5, where L2 is a jump instruction specifying that execution jumps to L5 when a certain condition is met and otherwise continues with L3 and L4 in sequence. If instruction prediction predicts the target branch of L2 to be L3, then L3 and the subsequent instructions are read in the instruction fetch stage, and in the subsequent execution stage L1, L2, L3, and L4 may all be sent to the CPU execution unit for execution. If the execution result of L2 actually indicates a jump to L5, then L3 and L4 have been executed in error. In this case, the CPU has to flush the entire pipeline, roll back to the branch point, warm up again, and select the other branch for execution. Although the probability of a prediction error is not high, the above operation is required every time one occurs. It is very time-consuming, so that the CPU reaches at most about 75% efficiency.
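As a rough illustration of prediction from a historical execution state table, the sketch below uses a table of 2-bit saturating counters. This is a common textbook predictor chosen here as an assumption; the specification does not state which predictor is used. The class TwoBitPredictor, the function next_fetch_address, and the table size are hypothetical.

```python
FETCH_BYTES = 16  # fetch width per cycle, as stated in the description


class TwoBitPredictor:
    """History table of 2-bit saturating counters indexed by branch address."""

    def __init__(self, table_size=1024):
        self.table_size = table_size
        # Counter values 0 and 1 predict "not taken"; 2 and 3 predict "taken".
        self.counters = [1] * table_size

    def _index(self, branch_addr):
        return branch_addr % self.table_size

    def predict(self, branch_addr):
        return self.counters[self._index(branch_addr)] >= 2

    def update(self, branch_addr, taken):
        """Record the real outcome once the jump has been executed."""
        i = self._index(branch_addr)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def next_fetch_address(predictor, fetch_addr, branch_addr, branch_target):
    """Choose where the fetch stage reads next."""
    if predictor.predict(branch_addr):
        return branch_target          # predicted jump: fetch the predicted branch
    return fetch_addr + FETCH_BYTES   # no jump: fetch the next 16-byte block


predictor = TwoBitPredictor()
first_guess = next_fetch_address(predictor, 0x100, 0x104, 0x200)
predictor.update(0x104, taken=True)   # the jump was actually taken
second_guess = next_fetch_address(predictor, 0x100, 0x104, 0x200)
```

With the initial counter value assumed here, the first guess falls through to the next 16-byte block, and after one taken outcome the same branch is predicted to jump.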
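The L1 to L5 example can be sketched as a toy model of conventional speculation. The function run_conventional and the window size of 4 are hypothetical simplifications; a real pipeline tracks far more state.

```python
def run_conventional(program, jump, predicted, actual, window=4):
    """Return (wrongly_executed, needs_flush) for a conventional speculative core."""
    jump_pos = program.index(jump)
    pred_pos = program.index(predicted)
    # Instructions up to and including the jump are always executed.
    issued = program[: jump_pos + 1]
    # The remaining slots are filled from the predicted path and executed
    # before the outcome of the jump is known.
    slots = window - len(issued)
    speculative = program[pred_pos : pred_pos + slots]
    if predicted == actual:
        return [], False
    # Misprediction: everything executed from the predicted path was wrong,
    # so the pipeline must be flushed and rolled back.
    return speculative, True


program = ["L1", "L2", "L3", "L4", "L5"]
wrong, flush = run_conventional(program, jump="L2", predicted="L3", actual="L5")
```

Here L3 and L4 are the instructions executed in error, which is exactly the work the rollback has to undo.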

To this end, one existing solution is as follows: the instruction fetch, pre-decode, and decode stages in FIG. 1 are still executed in the original way, the decoded instructions are put into the decoded cache, and each thread reads instructions from the decoded cache to form its thread cache queue. However, before the jump instruction obtains a valid jump target address, that is, before the jump target instruction is determined, the renaming and execution resource allocation stage is not executed for the subsequent code block. This ensures that all subsequently executed operations are correct and avoids the efficiency loss caused by prediction failures. In the foregoing example with instructions L1, L2, L3, L4, and L5, where L2 is a jump instruction, even if the target branch of L2 is incorrectly predicted as L3, the existing solution sends only L1 and L2 as an instruction block to the CPU execution unit, instead of executing L1, L2, L3, and L4 together. Only after the target address of the jump instruction L2 is determined (that is, the jump target instruction is determined) is the jump target instruction sent to the execution unit, where it passes through the renaming and execution resource allocation stage and then enters the execution stage.

As described above, this existing solution sends the determined jump target instruction to the execution unit only after the jump instruction has been resolved. However, after the jump destination address is confirmed and before the jump target instruction is executed, the jump target instruction must still go through at least the renaming and execution resource allocation stages, such as renaming 1, renaming 2, and allocating execution resources. This wastes more than 3 cycles. The embodiments of this specification improve on this basis: they retain and exploit the high accuracy of instruction prediction and the high parallelism of hyper-threading as far as possible, while avoiding security problems, reducing the performance and power consumption problems caused by prediction failures, reducing execution conflicts between hyper-threads within the CPU, and improving the overall throughput of the CPU in big data scenarios.

According to one or more embodiments of this specification, instruction jump prediction is still used. The predicted instruction block goes through instruction fetching, decoding, renaming 1, renaming 2, and execution resource allocation, but each time only the instructions up to and including the jump instruction are executed. The implementation of this concept is described below.
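The trade-off between the three approaches can be made concrete with a simple expected-cost model. The model itself is an assumption made for illustration: it charges a flush penalty on each misprediction for conventional speculation, the renaming and allocation cycles on every jump for the existing stall-based solution, and those same cycles only on a misprediction when predicted instructions are prepared but held back. The function name and the flush penalty of 20 cycles are hypothetical; only the 98% accuracy and the figure of about 3 cycles come from the description.

```python
def expected_extra_cycles(accuracy, flush_penalty, rename_cycles):
    """Expected extra cycles per jump instruction under each scheme."""
    miss = 1.0 - accuracy
    return {
        # Wrong-path instructions were executed and must be rolled back.
        "conventional": miss * flush_penalty,
        # The target is renamed only after the jump resolves, on every jump.
        "stall_before_rename": float(rename_cycles),
        # The predicted path is already renamed; only a miss pays for renaming.
        "ready_but_gated": miss * rename_cycles,
    }


# flush_penalty=20 is an illustrative placeholder, not a measured value.
costs = expected_extra_cycles(accuracy=0.98, flush_penalty=20, rename_cycles=3)
```

Under these assumed numbers the gated scheme has the lowest expected cost, because it pays the renaming delay only in the 2% of cases where the prediction is wrong and never pays for a rollback.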

Fig. 2 is a flowchart of a method for processing a CPU instruction provided by an embodiment of the specification. As shown in Figure 2, the CPU instruction processing method provided in this specification includes:

S110: Fetch instructions to form an instruction block to be sent to the CPU execution unit, where the instruction block includes a single jump instruction and a branch instruction obtained by CPU instruction prediction.

S120: Cause the CPU execution unit to execute the instructions before the jump instruction and the jump instruction itself, and, before the jump target instruction of the jump instruction is determined, refuse to let the branch instruction enter the execution stage.

Specifically, in step S110, instructions are fetched from the current thread cache queue in the original manner to form an instruction block whose maximum length corresponds to the maximum processing capability of the hardware. Generally, the maximum processing capability of the CPU hardware depends on the number of execution units it contains, and a predetermined threshold can be set according to the number of execution units as the maximum length of the instruction block. For example, the most common CPU at present has 8 pipelines that can execute in parallel, so the predetermined threshold can be set to 8 and, correspondingly, the maximum length of the instruction block is 8.

In the existing solution described above, the instruction block sent to the CPU execution unit does not include the branch instruction obtained by CPU instruction prediction. In contrast, the instruction block sent to the CPU execution unit in this solution includes a single jump instruction and a branch instruction obtained by CPU instruction prediction.

After the instruction block formed in step S110 is sent to the CPU execution unit, the CPU renames the instructions and allocates execution resources to them in the existing manner. Both the instructions before the jump instruction and the branch instruction obtained by CPU instruction prediction pass through the renaming and execution resource allocation stage and become ready for execution. The difference from the existing method is that, until the jump target instruction of the jump instruction is determined, this solution executes only the instructions up to and including the jump instruction (that is, the instructions before the jump instruction and the jump instruction itself) and refuses to let the predicted branch instruction enter the execution stage.

In other words, the solution of this specification sends the predicted instruction block to the CPU execution unit, where it passes through the renaming and execution resource allocation stage and becomes ready for execution, but the predicted instructions are not executed until the jump instruction is resolved (that is, until the target instruction of the jump instruction is determined). As a result, a rollback never occurs. At the same time, the instruction and data parallelism of hyper-threading is used to improve the overall CPU throughput.
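Block formation in step S110 can be sketched as follows, assuming the block is taken from the front of a thread cache queue, is capped at the predetermined threshold of 8, and is closed before a second jump instruction so that it contains a single jump. The function form_block, the is_jump predicate, and the mnemonic-based test for jumps are hypothetical.

```python
from collections import deque

MAX_BLOCK_LEN = 8  # predetermined threshold: number of parallel pipelines


def form_block(queue, is_jump, max_len=MAX_BLOCK_LEN):
    """Pop instructions from the thread cache queue into one instruction block."""
    block = []
    seen_jump = False
    while queue and len(block) < max_len:
        if is_jump(queue[0]):
            if seen_jump:
                break  # a second jump starts the next block
            seen_jump = True
        block.append(queue.popleft())
    return block


def is_jump(instruction):
    # Assumption for this sketch: jump mnemonics start with "j".
    return instruction.startswith("j")


queue = deque(["mov", "mov", "add", "mov", "cmp", "ja", "mul", "sub", "jb", "div"])
block = form_block(queue, is_jump)
```

The resulting block holds the instructions before the jump, the jump itself, and the predicted instructions that follow it, all of which then proceed to renaming and execution resource allocation.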

The above process is described below with a specific example. Consider the following instruction sequence (the text in /*...*/ explains each instruction):

1. mov (r1), r2 /* copy the contents of the address pointed to by register r1 to register r2 */

2. mov 0x08(r1), r3 /* copy the contents of the address pointed to by register r1+8 to register r3 */

3. add r3, r2 /* add the content of register r3 to the content of register r2, and store the result in r2 */

4. mov r2, (r4) /* store the content of r2 to the memory address pointed to by register r4 */

5. cmp r2, r5 /* compare r2 with r5 and set the condition flags */

6. ja L_Jmp /* jump to L_Jmp if the comparison result is "above" */

7. div r4, r5 /* divide the content of register r4 by r5, and store the result in register r5 */

...

L_Jmp:

n. mul r6, r7 /* multiply the content of register r6 by r7, and store the result in register r7 */

n+1....

In this instruction sequence, instruction 6 is a jump instruction. The CPU passes instructions 1-6 through the following stages in the existing manner: instruction fetch, decode, renaming 1, renaming 2, and execution resource allocation, after which they are ready to run. If the jump predictor predicts that a jump is required (with instruction n as the destination), then instruction n and the instructions after it also go through the same process (instruction fetch, decode, renaming 1, renaming 2, and execution resource allocation) according to the prediction and become ready for execution. That is, instructions 1-6, instruction n, and the instructions after instruction n are all sent to the CPU execution unit. Unlike the existing method, however, the embodiment of this specification executes only instructions 1-6 (that is, only instructions 1-6 enter the execution stage), and before the target instruction of the jump instruction is determined, instruction n and the instructions after it are refused entry to the execution stage.

After the target instruction of the jump instruction is determined according to the execution result of the CPU execution unit, it is determined whether the target instruction is consistent with the branch instruction. If they are consistent, for example the predicted branch instruction is instruction n and the determined target instruction is also instruction n (the case covered by the 98% prediction accuracy), then, because instruction n is already ready for execution (it has passed through renaming 1, renaming 2, and execution resource allocation), instruction n can enter the execution stage and run immediately. If they are inconsistent, for example the predicted branch instruction is instruction n but the determined target instruction is instruction 7, then instruction n is cleared. At this point instruction n is only in the ready-for-execution stage; it has not been executed and has produced no unnecessary context, so no complex recovery is needed. The real target, instruction 7, and its subsequent instructions can be fetched quickly without waiting and sent to the CPU execution unit for execution.
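The release-or-clear decision can be sketched with the numbered instructions from the example. The class GatedBlock, its method names, and the fetch_target callback are hypothetical; the sketch models only which instructions reach the execution stage, not timing or resource allocation.

```python
from dataclasses import dataclass, field


@dataclass
class GatedBlock:
    before_and_jump: list   # instructions 1-6: allowed into the execution stage
    predicted_branch: list  # predicted path: renamed and ready, but held back
    executed: list = field(default_factory=list)

    def execute_until_jump(self):
        """Execute everything up to and including the jump instruction."""
        self.executed.extend(self.before_and_jump)

    def resolve(self, actual_target, fetch_target):
        """Compare the resolved target with the prediction, then release or clear."""
        if self.predicted_branch and self.predicted_branch[0] == actual_target:
            # Consistent: the prepared instructions enter execution at once.
            self.executed.extend(self.predicted_branch)
            return "released"
        # Inconsistent: nothing from the predicted path ran, so clearing
        # the prepared instructions is enough and no rollback is needed.
        self.predicted_branch.clear()
        self.executed.extend(fetch_target(actual_target))
        return "cleared"


hit = GatedBlock(["1", "2", "3", "4", "5", "6"], ["n", "n+1"])
hit.execute_until_jump()
hit_status = hit.resolve("n", fetch_target=lambda target: [target])

miss = GatedBlock(["1", "2", "3", "4", "5", "6"], ["n", "n+1"])
miss.execute_until_jump()
miss_status = miss.resolve("7", fetch_target=lambda target: [target])
```

In the miss case the executed list never contains instruction n, which is the property the description relies on when it says that no unnecessary context is generated.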

In one embodiment, obtaining the target instruction may include: first determining whether the correct target instruction is contained in the decoded cache and, if it is, obtaining the target instruction from the decoded cache. It can be understood that instruction prefetching based on the instruction prediction scheme continuously prefetches many instructions, which are decoded and then put into the decoded cache. Therefore, in most cases, the correct target instruction can be obtained from the decoded cache. In extremely rare cases, the target instruction is not contained in the decoded cache, and it can then be requested from memory.

According to the above embodiment, before the target instruction of the jump instruction is determined, the CPU execution unit performs the renaming and execution resource allocation stage for the branch instruction, so that the branch instruction becomes ready for execution, but the branch instruction is refused entry to the execution stage. Only when the target instruction is consistent with the branch instruction does the branch instruction enter the execution stage and execute; otherwise, the branch instruction is cleared, the target instruction is obtained, and the target instruction is sent to the CPU execution unit for execution. It can be seen that the CPU always executes only correct instructions, which effectively avoids the security problems introduced by Meltdown and Spectre. Moreover, with hyper-threading, a thread does not harm other threads running on the same execution unit by occupying too many resources, and the threads adapt their scheduling to one another very well. Ultimately, power consumption is reduced and performance is improved while security is ensured.

In big data scenarios, hyper-threading is needed to improve overall throughput. Verification shows that multi-threaded scenarios have a lower instruction prediction success rate and a larger number of instruction rollbacks, resulting in a large amount of unnecessary performance overhead as well as security problems such as Meltdown and Spectre. In the solution in this specification, the predicted instructions are not executed, so the delay caused by a failed prediction is effectively reduced. The solution also offers strong adaptive scheduling for multi-threaded execution in big data scenarios. In addition, as the number of hyper-threads per CPU core increases, shared execution resources become scarcer. In the solution of this specification, only determined instructions are executed, which avoids rollbacks and the resource abuse caused by excessive out-of-order execution, and improves CPU throughput while avoiding security problems.

As described above, the solution in this specification makes full use of the instruction prediction function of existing CPUs: the predicted instructions pass through the stages of fetching, decoding, renaming, and execution resource allocation, and become ready for execution. Before the target instruction of the jump instruction is determined, branch instructions (that is, predicted instructions) are refused entry to the execution stage, so only determined instructions (such as the determined target instruction) enter the execution stage and are executed. This avoids security problems, reduces the performance and power consumption problems caused by prediction failures, reduces execution conflicts between hyper-threads within the CPU, and improves the overall throughput of the CPU in big data scenarios.
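The lookup order described in this embodiment can be sketched as a cache check followed by a memory fallback. The function fetch_target, the dictionary-based cache and memory, and the decode callback are hypothetical; storing the newly decoded instruction back into the decoded cache is an assumption based on the statement that decoded instructions are put into that cache.

```python
def fetch_target(address, decoded_cache, memory, decode):
    """Return (instruction, source) for the resolved jump target address."""
    if address in decoded_cache:
        # Common case: prefetching has already decoded the target.
        return decoded_cache[address], "decoded_cache"
    # Rare case: request the raw instruction from memory and decode it.
    instruction = decode(memory[address])
    decoded_cache[address] = instruction  # assumption: keep it for later reads
    return instruction, "memory"


decoded_cache = {0x40: "div r4, r5"}
memory = {0x40: b"raw-div", 0x80: b"raw-mul"}


def decode(raw_bytes):
    # Stand-in decoder for this sketch; a real decoder produces micro-operations.
    return raw_bytes.decode("ascii").replace("raw-", "decoded-")


cached = fetch_target(0x40, decoded_cache, memory, decode)
missed = fetch_target(0x80, decoded_cache, memory, decode)
```

A second request for the same missing address would then be served from the decoded cache in this sketch.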

As those skilled in the art know, the execution of instructions in the CPU is controlled by the controller. The controller is the command and control center of the entire CPU and is used to coordinate the operations between various components. The controller generally includes several parts such as instruction control logic, timing control logic, bus control logic, and interrupt control logic. The instruction control logic must complete the operations of fetching instructions, analyzing instructions and executing instructions.

The solution of the embodiments described above optimizes and adjusts the original instruction control process. Therefore, the controller circuit, in particular the instruction control logic, can be modified at the hardware level to carry out the control process described in the above embodiments.

FIG. 3 is a functional block diagram of a CPU controller provided by an embodiment of this specification. As shown in FIG. 3, the CPU controller includes:

The instruction extraction unit 301 is configured to fetch instructions to form an instruction block to be sent to the CPU execution unit, where the instruction block includes a single jump instruction and a branch instruction obtained by CPU instruction prediction;

The execution operation unit 305 is configured to cause the CPU execution unit to execute the instruction before the jump instruction, and to refuse to let the branch instruction enter the execution stage before the target instruction of the jump instruction is determined.

In addition, the execution operation unit 305 is further configured to, before the target instruction of the jump instruction is determined, cause the CPU execution unit to perform the renaming and execution resource allocation stage for the branch instruction, so that the branch instruction becomes ready for execution.

In a specific embodiment, as shown in FIG. 3, the CPU controller may further include:

The target instruction determining unit 302 is configured to determine the target instruction of the jump instruction according to the execution result of the CPU execution unit;

The judging unit 303 is configured to judge whether the target instruction is consistent with the branch instruction;

The target instruction acquisition unit 304 is configured to determine whether the target instruction is contained in the decoded cache, in which a plurality of prefetched and decoded instructions are stored; if it is contained, to acquire the target instruction from the decoded cache; and if it is not contained, to acquire the target instruction from memory.

The execution operation unit 305 is further configured to enable the branch instruction to enter the execution stage to execute the branch instruction when the target instruction is consistent with the branch instruction.

The execution operation unit 305 is further configured to clear the branch instruction when the target instruction is inconsistent with the branch instruction, and send the target instruction acquired by the target instruction acquisition unit 304 to the CPU execution unit for execution.

The above units can be implemented with various circuit elements as required; for example, the judging unit 303 can be implemented with a number of comparators.

With the above controller, the control process shown in FIG. 2 can be realized. While retaining the advantages of instruction prediction and prefetching, it avoids security problems, reduces the performance and power consumption problems caused by prediction failures, reduces execution conflicts between hyper-threads within the CPU, and improves the overall throughput of the CPU in big data scenarios.

The embodiment of the present specification also provides a central processing unit including the above-mentioned controller.

The specific embodiments of this specification have been described above, and other embodiments are within the scope of the appended claims. In some cases, the actions or steps described in the claims can be performed in a different order than in the embodiments and still achieve desired results. In addition, the processes depicted in the drawings do not necessarily have to be in the specific order or sequential order shown in order to achieve the desired result. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

The various embodiments in this specification are described in a progressive manner, and the same or similar parts between the various embodiments can be referred to each other, and each embodiment focuses on the difference from other embodiments. In particular, for the device, equipment, and non-volatile computer-readable storage medium embodiments, since they are basically similar to the method embodiments, the description is relatively simple, and for related parts, please refer to the part of the description of the method embodiments.

The apparatus, equipment, non-volatile computer-readable storage medium, and method provided in the embodiments of this specification correspond to each other. Therefore, the apparatus, equipment, and non-volatile computer storage medium also have beneficial technical effects similar to the corresponding method. The beneficial technical effects of the method have been described in detail above, therefore, the beneficial technical effects of the corresponding device, equipment, and non-volatile computer storage medium will not be repeated here.

In the 1990s, an improvement to a technology could be clearly classified as either a hardware improvement (for example, an improvement to a circuit structure such as a diode, transistor, or switch) or a software improvement (an improvement to a method flow). However, with the development of technology, improvements to many of today's method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be realized by a hardware entity module. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic function is determined by the user's programming of the device. Designers program a PLD themselves to "integrate" a digital system onto it, without requiring a chip manufacturer to design and manufacture a dedicated integrated circuit chip. Moreover, instead of manually making integrated circuit chips, this kind of programming is nowadays mostly done with "logic compiler" software, which is similar to the software compilers used in program development; the source code to be compiled must likewise be written in a specific programming language, called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); the most commonly used at present are VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog. It should also be clear to those skilled in the art that a hardware circuit implementing a logical method flow can easily be obtained by doing a little logic programming of the method flow in one of the above hardware description languages and programming it into an integrated circuit.

The controller can be implemented in any suitable manner. For example, the controller can take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an application specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include but are not limited to the following microcontrollers: ARC625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller can also be implemented as part of the control logic of a memory. Those skilled in the art also know that, in addition to implementing the controller purely as computer-readable program code, it is entirely possible to logically program the method steps so that the controller realizes the same functions in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Therefore, such a controller can be regarded as a hardware component, and the devices included in it for realizing various functions can also be regarded as structures within the hardware component. The devices for realizing various functions can even be regarded as both software modules that implement the method and structures within the hardware component.

The systems, devices, modules, or units illustrated in the above embodiments may be specifically implemented by computer chips or entities, or by products with certain functions. A typical implementation device is a computer. Specifically, the computer may be, for example, a personal computer, a laptop computer, a cell phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or any combination of these devices.

For convenience of description, the above device is described with its functions divided into separate units. Of course, when implementing this specification, the functions of the units can be implemented in one or more pieces of software and/or hardware.

Those skilled in the art should understand that the embodiments of this specification can be provided as a method, a system, or a computer program product. Therefore, the embodiments of this specification may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, the embodiments of this specification may adopt the form of computer program products implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes.

This specification is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to the embodiments of this specification. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and combinations of processes and/or blocks in the flowcharts and/or block diagrams, can be realized by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing equipment produce a device for realizing the functions specified in one or more processes in the flowchart and/or one or more blocks in the block diagram.

These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device implements the functions specified in one or more processes in the flowchart and/or one or more blocks in the block diagram.

These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing. The instructions executed on the computer or other programmable equipment thereby provide steps for implementing the functions specified in one or more processes in the flowchart and/or one or more blocks in the block diagram.

In a typical configuration, the computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include non-permanent memory, random access memory (RAM), and/or non-volatile memory in the form of computer-readable media, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and can store information by any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic cartridges, magnetic tape storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible to computing devices. As defined in this specification, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.

It should also be noted that the terms "comprise", "include" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity or equipment including a series of elements includes not only those elements, but also other elements that are not explicitly listed, or elements inherent to such a process, method, commodity, or equipment. Without further restrictions, an element defined by the phrase "including a..." does not exclude the existence of other identical elements in the process, method, commodity, or equipment that includes the element.

This specification may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform specific tasks or implement specific abstract data types. This specification can also be practiced in distributed computing environments where tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules can be located in local and remote computer storage media including storage devices.

The various embodiments in this specification are described in a progressive manner, and the same or similar parts between the various embodiments can be referred to each other, and each embodiment focuses on the difference from other embodiments. In particular, as for the system embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and for related parts, please refer to the part of the description of the method embodiment.

The above descriptions are only examples of this specification, and are not intended to limit this application. For those skilled in the art, this application can have various modifications and changes. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of this application shall be included in the scope of the claims of this application.

Claims

A CPU instruction processing method, comprising:

Instructions are extracted to form an instruction block to be sent to the CPU execution unit, wherein the instruction block includes a single jump instruction and a branch instruction obtained by CPU instruction prediction;

The CPU execution unit is made to execute the instructions before the jump instruction and the jump instruction itself, and, before the jump target instruction of the jump instruction is determined, the branch instruction is refused entry into the execution stage.
The method according to claim 1, wherein, before the branch instruction is refused entry into the execution stage, the method further comprises:

The CPU execution unit is made to execute the renaming and execution resource allocation stage of the branch instruction, so that the branch instruction enters the stage of preparing for execution.
The method according to claim 2, further comprising:

Determine the target instruction of the jump instruction according to the execution result executed by the CPU execution unit;

Judging whether the target instruction is consistent with the branch instruction;

If the target instruction is consistent with the branch instruction, the branch instruction enters the execution stage to execute the branch instruction.
The method according to claim 3, further comprising:

If the target instruction is inconsistent with the branch instruction, the branch instruction is cleared, the target instruction is acquired, and the target instruction is sent to the CPU execution unit for execution.
The method according to claim 4, wherein obtaining the target instruction comprises:

Judging whether the target instruction is contained in a decoded cache, wherein a plurality of prefetched and decoded instructions are stored in the decoded cache;

If it is contained, the target instruction is obtained from the decoded cache;

If it is not contained, the target instruction is obtained from the memory.
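For illustration only, the lookup order described above (decoded cache first, memory second) can be sketched as follows. The function name `fetch_target`, the use of dictionaries keyed by address, and the `decode` callback are assumptions made for this sketch; the source does not specify how the decoded cache is indexed or how an instruction read from memory is decoded.

```python
def fetch_target(addr, decoded_cache, memory, decode):
    """Return (instruction, source) for the jump target at addr.

    decoded_cache: dict of address -> already decoded instruction
                   (hypothetical representation of the decoded cache).
    memory:        dict of address -> raw instruction bytes (hypothetical).
    decode:        callable turning raw bytes into a decoded instruction
                   (hypothetical; assumed necessary on the memory path).
    """
    # Case 1: the target was prefetched and decoded earlier.
    if addr in decoded_cache:
        return decoded_cache[addr], "decoded_cache"

    # Case 2: fall back to memory and decode the raw instruction.
    return decode(memory[addr]), "memory"
```

A hit in the decoded cache skips both the memory access and the decode step, which is why the cache is consulted first after a misprediction.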
A CPU controller, comprising:

An instruction extraction unit, configured to extract instructions to form an instruction block to be sent to the CPU execution unit, wherein the instruction block includes a single jump instruction and a branch instruction obtained by CPU instruction prediction;

An execution operation unit, configured to cause the CPU execution unit to execute the instructions before the jump instruction and the jump instruction itself, and to refuse the branch instruction entry into the execution stage before the jump target instruction of the jump instruction is determined.
The CPU controller according to claim 6, wherein:

The execution operation unit is further configured to make the CPU execution unit execute the renaming and execution resource allocation stage of the branch instruction before the target instruction of the jump instruction is determined, so that the branch instruction enters the stage of preparing for execution.
The controller according to claim 7, further comprising:

The target instruction determining unit is configured to determine the target instruction of the jump instruction according to the execution result of the CPU execution unit;

A judging unit for judging whether the target instruction is consistent with the branch instruction;

The execution operation unit is further configured to cause the branch instruction to enter the execution stage to execute the branch instruction when the target instruction is consistent with the branch instruction.
The controller according to claim 8, wherein the execution operation unit is further configured to clear the branch instruction when the target instruction is inconsistent with the branch instruction, and send the target instruction to the CPU execution unit Carry out execution.
The controller according to claim 9, further comprising:

A target instruction acquisition unit, configured to determine whether the target instruction is contained in a decoded cache, wherein a plurality of prefetched and decoded instructions are stored in the decoded cache; and,

If it is contained, to obtain the target instruction from the decoded cache;

If it is not contained, to obtain the target instruction from the memory.
A central processing unit comprising the controller according to any one of claims 6 to 10.