WO2021218633A1 - CPU instruction processing method, controller, and central processing unit - Google Patents

CPU instruction processing method, controller, and central processing unit

Info

Publication number
WO2021218633A1
Authority
WO
WIPO (PCT)
Prior art keywords
instruction
execution
cpu
jump
target
Prior art date
Application number
PCT/CN2021/087176
Other languages
French (fr)
Chinese (zh)
Inventor
马凌
姚四海
何昌华
Original Assignee
支付宝(杭州)信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 支付宝(杭州)信息技术有限公司
Publication of WO2021218633A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38: Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3836: Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3842: Speculative instruction execution
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46: Multiprogramming arrangements
    • G06F9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005: Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027: Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to the field of computer technology, in particular to a CPU instruction processing method, a controller, and a central processing unit (CPU).
  • CPU hyper-threading technology uses special hardware instructions to make two logical cores appear as physical chips, so that a single processor can use thread-level parallel computing and is compatible with multi-threaded parallel computing.
  • A hyper-threaded CPU can run two or more threads in parallel on a single physical core, thereby obtaining more parallel instructions and improving overall operating performance.
  • An instruction prediction scheme is adopted for instruction prefetching and instruction pre-execution.
  • The embodiments of this specification provide a CPU instruction processing method, a controller, and a central processing unit (CPU).
  • The method makes full use of the accuracy of instruction prediction (98%) while avoiding security problems and reducing the performance and power consumption problems caused by prediction failures.
  • An embodiment of this specification provides a CPU instruction processing method. The method includes: extracting instructions to form an instruction block to be sent to a CPU execution unit, the instruction block including a single jump instruction and a branch instruction obtained through CPU instruction prediction; and causing the CPU execution unit to execute the jump instruction and the instructions before it, while preventing the branch instruction from entering the execution stage until the jump target instruction of the jump instruction is determined.
  • An embodiment of this specification also provides a CPU controller, including: an instruction extraction unit, configured to extract instructions to form an instruction block to be sent to the CPU execution unit, the instruction block including a single jump instruction and a branch instruction obtained through CPU instruction prediction; and an execution operation unit, configured to cause the CPU execution unit to execute the jump instruction and the instructions before it, and to prevent the branch instruction from entering the execution stage until the jump target instruction of the jump instruction is determined.
  • An embodiment of this specification also provides a central processing unit including the above controller.
  • The solution in this specification makes full use of the instruction prediction function of existing CPUs. The predicted instructions pass through the stages of fetching, decoding, renaming, and allocating execution resources, and then wait in a ready-to-execute state. Until the jump target instruction of the jump instruction is determined, the branch instruction (that is, the predicted instruction) is prevented from entering the execution stage, so only confirmed instructions (such as the determined jump target instruction) are executed. This avoids security problems, reduces the performance and power consumption problems caused by prediction failures, reduces conflicts between hyper-threads within the CPU, and improves the overall throughput of the CPU in big data scenarios.
  • FIG. 1 shows a CPU execution process provided by an embodiment of this specification.
  • FIG. 2 is a flowchart of a CPU instruction processing method provided by an embodiment of this specification.
  • FIG. 3 is a functional block diagram of a CPU controller provided by an embodiment of this specification.
  • CPU hyper-threading: two or more threads are run in parallel on a single physical core, thereby obtaining more parallel instructions and improving overall performance.
  • CPU instruction prediction: the destination address of a jump instruction is predicted from the historical execution of the instruction.
  • CPU instruction pre-execution stage: before the jump instruction obtains its effective destination address, the CPU obtains executable code through instruction prediction. The execution of this predicted code is regarded as the instruction pre-execution stage.
  • FIG. 1 shows a CPU execution process provided by an embodiment of this specification. As shown in FIG. 1, the entire execution process is divided into multiple stages. The first is the instruction fetch stage. Current mainstream CPUs can fetch 16 bytes per instruction cycle, which is about 4 instructions each time. Instruction pre-decoding follows. The main task of the pre-decoding stage is to identify the length of each instruction and to mark jump instructions. Mainstream CPUs generally have a throughput of 5 instructions per cycle at this stage.
  • After pre-decoding comes the decoding stage.
  • The decoding stage mainly transforms complex instructions into simplified, fixed-length instructions and specifies the operation type. This stage usually also has a throughput of 5 instructions per cycle.
  • The decoded instructions are put into the decoded cache.
  • The decoded cache serves as an instruction cache pool, in which multiple decoded instructions can be stored for the next stage to read.
  • The throughput from the decoded cache to the next stage can reach 6 instructions per cycle.
  • Each thread reads the instructions it will execute next to form its own thread cache queue.
  • If an instruction to be executed is stored in the decoded cache, the stored instruction is used; otherwise, the corresponding instruction is obtained from the front end (memory) and added to the queue.
  • FIG. 1 shows the thread cache queues of thread A and thread B as examples, but a hyper-threaded CPU can also support the parallel execution of more threads.
  • The next stage, renaming and allocating execution resources, usually includes renaming 1, renaming 2, and allocating execution resources.
  • The throughput from the thread cache queue to this stage can reach 5 instructions per cycle.
  • The main task of this stage is to resolve register read and write dependencies, remove unnecessary dependencies so that more instructions can execute in parallel, and allocate the resources required for execution.
  • The instructions are then sent to the execution units of the CPU for execution.
  • The CPU has multiple execution units.
  • The most common CPUs currently have 8 pipelines that can execute in parallel, that is, 8 micro-operations can be executed per cycle. Although instructions can be executed out of order, the order in which they are finally committed is the same as program order.
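The per-stage figures quoted above can be collected in one place to see which stage limits the flow. The following is only a minimal sketch: the stage labels and function names are illustrative assumptions, and the execute figure counts micro-operations rather than instructions, so the comparison is approximate.

```python
# Per-cycle throughput figures as quoted in the text above.
# The stage labels are illustrative, not terms fixed by the specification.
STAGE_THROUGHPUT = {
    "fetch": 4,             # 16 bytes per cycle, about 4 instructions
    "pre-decode": 5,
    "decode": 5,
    "decoded cache": 6,
    "rename/allocate": 5,
    "execute": 8,           # 8 micro-operations per cycle, not instructions
}


def narrowest_stage(throughput):
    """Return the (stage, rate) pair with the lowest per-cycle rate."""
    return min(throughput.items(), key=lambda item: item[1])


def average_instruction_bytes(fetch_bytes=16, instructions=4):
    """Average instruction size implied by the fetch figures."""
    return fetch_bytes / instructions
```

Under these figures the fetch stage is the narrowest, and the quoted fetch width implies an average instruction length of about 4 bytes.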
  • Branch prediction: the prediction unit predicts the instructions to be prefetched according to the historical execution state table it contains. If the instruction does not jump, the instruction fetch stage fetches the instruction block at the current fetch address plus 16 bytes. If the instruction does jump, the instructions of the predicted branch are fetched according to the prediction result.
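The fetch rule described above can be sketched as follows. This is a toy model under stated assumptions: the specification does not say how the historical execution state table is organized, so `LastTargetPredictor` (which simply remembers the last resolved target of each jump) and `next_fetch_address` are hypothetical names chosen for illustration.

```python
FETCH_BLOCK_BYTES = 16  # fetch width stated in the text


class LastTargetPredictor:
    """Toy history table: remembers the last resolved outcome per jump address.

    Assumption for illustration only; real predictors are more elaborate.
    """

    def __init__(self):
        self.history = {}

    def record(self, jump_address, taken, target=None):
        # Store the resolved outcome; None stands for "not taken".
        self.history[jump_address] = target if taken else None

    def predict(self, jump_address):
        # Unknown jumps are predicted as not taken.
        return self.history.get(jump_address)


def next_fetch_address(current_address, predictor, jump_address=None):
    """Choose the address of the next instruction block to fetch."""
    if jump_address is not None:
        predicted = predictor.predict(jump_address)
        if predicted is not None:
            return predicted  # fetch the predicted branch
    return current_address + FETCH_BLOCK_BYTES  # sequential fetch
```

A block without a jump, or with a jump predicted as not taken, advances the fetch address by 16 bytes; otherwise fetching continues at the predicted target.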
  • The prediction accuracy of current instruction prediction schemes can exceed 90%, and some schemes even reach 98%.
  • Nevertheless, a prediction can be wrong, in which case a wrong instruction block is very likely to be sent to the execution unit.
  • L2 is a jump instruction, which specifies that when a certain condition is met, execution jumps to instruction L5; otherwise instructions L3 and L4 are executed in sequence.
  • If the prediction is that L2 does not jump, L3 and the subsequent instructions are read in the instruction fetch stage, and L1, L2, L3, and L4 may all be sent to the CPU execution unit for execution in the subsequent execution stage. If the execution result of L2 actually indicates a jump to L5, then L3 and L4 have been executed incorrectly. In this case, the CPU has to flush the entire pipeline, roll back to the branch point, restart, and select the other branch for execution.
  • Although the probability of an instruction prediction error is not high, the above operation must be performed whenever one occurs. The operation is very time-consuming, resulting in a maximum CPU efficiency of about 75%.
  • An existing solution is as follows: the instruction fetch, pre-decode, and decode stages in FIG. 1 are still executed in the original way, the decoded instructions are put into the decoded cache, and each thread reads instructions from the decoded cache to form a thread cache queue.
  • However, until the jump instruction is resolved, the renaming and execution resource allocation stage is not executed for the subsequent code block, which ensures that subsequent execution completes correctly and avoids the loss of efficiency caused by prediction failures.
  • Continuing the example, instructions L1, L2, L3, L4, and L5 include the jump instruction L2.
  • L2 is sent to the CPU execution unit for execution, rather than L1, L2, L3, and L4 being executed together. That is, after the target address of the jump instruction L2 is determined (that is, the jump target instruction is determined), the jump target instruction is put into the execution unit and enters the execution stage after passing through the stage of renaming and allocating execution resources.
  • This existing solution can put the determined jump target instruction into the execution unit only after the jump instruction has been resolved.
  • The jump target instruction must then pass through at least the renaming and execution resource allocation stages (renaming 1, renaming 2, and allocating execution resources), which wastes more than 3 cycles.
  • The embodiments of this specification improve on this basis. They retain the advantages of high-accuracy instruction prediction as far as possible and exploit the high parallelism of hyper-threading, while avoiding security issues, reducing the performance and power consumption problems caused by prediction failures, reducing conflicts between hyper-threads within the CPU, and improving the overall throughput of the CPU in big data scenarios.
  • Instruction jump prediction is still used.
  • The predicted instruction block goes through instruction fetching, decoding, renaming 1, renaming 2, and allocation of execution resources, but each time only the instruction code before the jump instruction is executed.
  • The implementation of the above concept is described below.
  • FIG. 2 is a flowchart of a CPU instruction processing method provided by an embodiment of this specification. As shown in FIG. 2, the CPU instruction processing method provided in this specification includes:
  • S110: Extract instructions to form an instruction block to be sent to the CPU execution unit, where the instruction block includes a single jump instruction and a branch instruction obtained through CPU instruction prediction.
  • S120: Cause the CPU execution unit to execute the jump instruction and the instructions before it, and prevent the branch instruction from entering the execution stage until the jump target instruction of the jump instruction is determined.
  • In step S110, instructions are fetched from the current thread cache queue in the original manner to form an instruction block whose maximum length corresponds to the maximum processing capability of the hardware.
  • The maximum processing capacity of the CPU hardware depends on the number of execution units it includes, and a predetermined threshold can be determined from the number of execution units as the maximum length of the instruction block. For example, the most common CPUs currently have 8 pipelines that can execute in parallel, so the predetermined threshold can be set to 8 and the maximum length of the instruction block is correspondingly 8.
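Step S110 can be illustrated with a short sketch. It assumes instructions are plain labels, that an `is_jump` predicate is available, and that the predicted branch instructions are supplied as a list; the function name `form_instruction_block` and its return shape are hypothetical and are not taken from the specification.

```python
MAX_BLOCK_LENGTH = 8  # threshold from the text: 8 parallel pipelines


def form_instruction_block(queue, is_jump, predicted_branch,
                           max_length=MAX_BLOCK_LENGTH):
    """Form one instruction block from a thread cache queue.

    Returns (confirmed, speculative):
      confirmed   - instructions up to and including the single jump;
      speculative - predicted branch instructions filling the rest of the
                    block, which must wait until the jump target is known.
    """
    confirmed = []
    for instruction in queue[:max_length]:
        confirmed.append(instruction)
        if is_jump(instruction):
            break  # a block holds a single jump instruction
    else:
        # No jump in this block, so nothing in it is speculative.
        return confirmed, []
    room = max_length - len(confirmed)
    return confirmed, list(predicted_branch[:room])
```

For a queue whose sixth instruction is a jump, the block holds those six instructions plus the first two predicted branch instructions, which gives the maximum length of 8.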
  • In the existing solution, the instruction block sent to the CPU execution unit does not include the branch instruction obtained through CPU instruction prediction.
  • In contrast, the instruction block sent to the CPU execution unit in this solution includes a single jump instruction and a branch instruction obtained through CPU instruction prediction.
  • After the instruction block formed in step S110 is sent to the CPU execution unit, the CPU renames the instructions and allocates execution resources to them according to the existing method before the execution stage. Both the instructions before the jump instruction and the branch instruction obtained through CPU instruction prediction go through the stage of renaming and allocating execution resources and enter the ready-to-execute state.
  • The difference from the existing method is that this solution executes only the instructions up to and including the jump instruction (that is, the instructions before the jump instruction and the jump instruction itself), and prevents the branch instruction obtained through CPU instruction prediction from entering the execution stage until the jump target instruction of the jump instruction is determined. In other words, the instructions before the jump instruction and the predicted branch instruction are all renamed and allocated execution resources and enter the ready-to-execute state; after that, until the jump target instruction is determined, only the instructions up to the jump instruction enter the execution stage, and the predicted branch instruction is kept out of it.
  • The solution of this specification puts the predicted instruction block into the CPU execution unit. After the stage of renaming and allocating execution resources, the predicted instructions are ready to execute, but they are not executed until the jump instruction is resolved (that is, until the target instruction of the jump instruction is confirmed), so a rollback never occurs. At the same time, the parallelism of hyper-threaded instructions and data is used to improve overall CPU throughput.
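The difference between the two schemes can be put in rough numbers. The sketch below is a back-of-the-envelope model and is not stated in the specification: it assumes each of the three stages (renaming 1, renaming 2, allocating execution resources) costs one cycle, that the existing solution pays this cost after every jump, and that the scheme described here pays it only when the prediction is wrong. The function names and the one-cycle-per-stage figure are assumptions.

```python
RENAME_ALLOC_STAGES = 3  # renaming 1, renaming 2, allocating execution resources


def stall_existing(cycles_per_stage=1):
    """Existing solution: the target passes the stages after every jump."""
    return RENAME_ALLOC_STAGES * cycles_per_stage


def expected_stall_gated(accuracy, cycles_per_stage=1):
    """Gated scheme: the stages are repeated only after a misprediction.

    Any additional fetch latency on a misprediction is ignored here.
    """
    miss_rate = 1.0 - accuracy
    return miss_rate * RENAME_ALLOC_STAGES * cycles_per_stage
```

With the 98% accuracy quoted in the text, this simplified model gives roughly 0.06 stall cycles per jump for the gated scheme, against 3 for the existing solution.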
  • For example, suppose instruction 6 is a jump instruction.
  • The CPU passes instructions 1-6 through the following stages according to the existing method: instruction fetch, decode, rename 1, rename 2, and allocation of execution resources, after which they are ready to run. If the jump predictor predicts a jump (with instruction n as the destination address), instruction n and the instructions after it also go through the same process (instruction fetch, decode, rename 1, rename 2, and allocation of execution resources) according to the jump predictor's judgment and enter the ready-to-execute state.
  • Instructions 1-6, instruction n, and the instructions after instruction n are all sent to the CPU execution unit. Unlike the existing method, however, the embodiment of this specification executes only instructions 1-6 (that is, only instructions 1-6 enter the execution stage); until the target instruction of the jump instruction is determined, instruction n and the instructions after it are kept out of the execution stage.
  • After the target instruction of the jump instruction is determined according to the execution result of the CPU execution unit, it is determined whether the target instruction is consistent with the branch instruction. If they are consistent, for example the predicted branch instruction is instruction n and the determined target instruction is also instruction n (the case for 98% prediction accuracy), instruction n can enter the execution stage and run quickly, because it is already ready to execute (that is, it has already passed through the rename 1, rename 2, and execution resource allocation stages). If the target instruction is inconsistent with the branch instruction, for example the predicted branch instruction is instruction n but the determined target instruction is instruction 7, then instruction n is cleared.
  • Because instruction n was only in the ready-to-execute state and was never executed, no unnecessary context has been generated and no complex recovery is needed. The real destination, instruction 7, and the subsequent instructions can be fetched quickly without waiting and sent to the CPU execution unit for execution.
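The example with instructions 1-6, instruction n, and instruction 7 can be traced with a small simulation. This is a sketch under simplifying assumptions: instructions are string labels, the resolved jump target is passed in directly instead of being computed by executing the jump, and `ExecutionTrace`, `run_block`, and the `fetch` callback are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class ExecutionTrace:
    executed: list = field(default_factory=list)  # entered the execution stage
    cleared: list = field(default_factory=list)   # discarded while still ready


def run_block(confirmed, predicted, actual_target, fetch):
    """Gate the predicted branch on the resolved jump target.

    confirmed     - instructions up to and including the jump
    predicted     - predicted branch instructions, already renamed and
                    allocated, waiting in the ready-to-execute state
    actual_target - target instruction determined by executing the jump
    fetch         - callable returning the instructions starting at a target
    """
    trace = ExecutionTrace()
    # Only the instructions up to and including the jump are executed first.
    trace.executed.extend(confirmed)
    if predicted and predicted[0] == actual_target:
        # Prediction confirmed: the held instructions enter execution directly.
        trace.executed.extend(predicted)
    else:
        # Prediction wrong: the held instructions never ran, so they are
        # simply cleared and the real target is fetched. No rollback occurs.
        trace.cleared.extend(predicted)
        trace.executed.extend(fetch(actual_target))
    return trace
```

In the mispredicted case the trace shows instruction n in `cleared` and never in `executed`, which is why no executed state has to be undone.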
  • Obtaining the target instruction may include: first determining whether the correct target instruction is contained in the decoded cache and, if so, obtaining it from the decoded cache. Instruction prefetching based on the instruction prediction scheme continuously prefetches many instructions, which are decoded and then put into the decoded cache. Therefore, in most cases the correct target instruction can be obtained from the decoded cache. In extremely rare cases the target instruction is not in the decoded cache, and it is then requested from memory.
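The lookup order described above amounts to a cache check with a fallback. The following sketch models the decoded cache and memory as dictionaries keyed by address; this representation and the name `acquire_target` are assumptions made for illustration.

```python
def acquire_target(address, decoded_cache, memory):
    """Return (instruction, source) for the resolved jump target.

    The decoded cache is checked first; memory is used only on a miss.
    """
    if address in decoded_cache:
        return decoded_cache[address], "decoded cache"
    # Rare case: the target was not prefetched, so request it from memory.
    return memory[address], "memory"
```

The second element of the result only records where the instruction came from, which makes the common and the rare path easy to tell apart in a trace.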
  • Before the target instruction of the jump instruction is determined, the CPU execution unit is made to execute the renaming and execution resource allocation stage of the branch instruction, so that the branch instruction becomes ready to execute but is kept out of the execution stage.
  • Only when the target instruction is consistent with the branch instruction does the branch instruction enter the execution stage; otherwise, the branch instruction is cleared, the target instruction is obtained, and the target instruction is sent to the CPU execution unit for execution. The CPU therefore always executes the correct instructions, which effectively avoids the security problems introduced by Meltdown and Spectre. In addition, with hyper-threading, a thread does not harm other threads running on the same execution unit by occupying too many resources, and the threads adapt well to one another in scheduling. Ultimately, power consumption is reduced and performance is improved while security is ensured.
  • The solution in this specification makes full use of the instruction prediction function of existing CPUs. The predicted instructions pass through the stages of fetching, decoding, renaming, and allocating execution resources and then become ready to execute. Until the target instruction of the jump instruction is determined, branch instructions (that is, predicted instructions) are kept out of the execution stage, so only confirmed instructions (such as the determined target instruction) enter the execution stage. This avoids security problems, reduces the performance and power consumption problems caused by prediction failures, reduces conflicts between hyper-threads within the CPU, and improves the overall throughput of the CPU in big data scenarios.
  • The controller is the command and control center of the entire CPU and coordinates the operations of its components.
  • The controller generally includes instruction control logic, timing control logic, bus control logic, and interrupt control logic.
  • The instruction control logic completes the operations of fetching, analyzing, and executing instructions.
  • In the above embodiments, the original instruction control process is optimized and adjusted. Therefore, the controller circuit, especially the instruction control logic, can be modified at the hardware level to carry out the control process described in the above embodiments.
  • FIG. 3 is a functional block diagram of a CPU controller provided by an embodiment of this specification. As shown in FIG. 3, the CPU controller includes:
  • The instruction extraction unit 301, configured to extract instructions to form an instruction block to be sent to the CPU execution unit, where the instruction block includes a single jump instruction and a branch instruction obtained through CPU instruction prediction;
  • The execution operation unit 305, configured to cause the CPU execution unit to execute the instructions before the jump instruction, and to prevent the branch instruction from entering the execution stage before the target instruction of the jump instruction is determined.
  • The execution operation unit 305 is further configured to, before the target instruction of the jump instruction is determined, cause the CPU execution unit to execute the renaming and execution resource allocation stage of the branch instruction, so that the branch instruction becomes ready to execute.
  • The CPU controller may further include:
  • The target instruction determining unit 302, configured to determine the target instruction of the jump instruction according to the execution result of the CPU execution unit;
  • The judging unit 303, configured to judge whether the target instruction is consistent with the branch instruction;
  • The target instruction acquisition unit 304, configured to determine whether the target instruction is contained in the decoded cache, in which a plurality of prefetched and decoded instructions are stored, and to acquire the target instruction from the decoded cache if it is contained there, or from memory if it is not.
  • The execution operation unit 305 is further configured to let the branch instruction enter the execution stage and be executed when the target instruction is consistent with the branch instruction.
  • The execution operation unit 305 is further configured to clear the branch instruction when the target instruction is inconsistent with the branch instruction, and to send the target instruction acquired by the target instruction acquisition unit 304 to the CPU execution unit for execution.
  • The above units can be implemented by various circuit elements as required; for example, a number of comparators can be used to implement the judging unit 303.
  • With the above controller, the control process shown in FIG. 2 can be realized. The advantages of instruction prediction and prefetching are retained while security problems are avoided, the performance and power consumption problems caused by prediction failures are reduced, conflicts between hyper-threads within the CPU are reduced, and the overall throughput of the CPU in big data scenarios is improved.
  • An embodiment of this specification also provides a central processing unit including the above controller.
  • The apparatus, equipment, non-volatile computer-readable storage medium, and method provided in the embodiments of this specification correspond to each other. Therefore, the apparatus, equipment, and non-volatile computer storage medium also have beneficial technical effects similar to those of the corresponding method.
  • The beneficial technical effects of the method have been described in detail above, so the beneficial technical effects of the corresponding apparatus, equipment, and non-volatile computer storage medium are not repeated here.
  • A programmable logic device (PLD), such as a field programmable gate array (FPGA), is an integrated circuit whose logic function is determined by the user's programming of the device.
  • Hardware description languages (HDLs) include, for example, ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language).
  • The controller can be implemented in any suitable manner.
  • For example, the controller can take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, application specific integrated circuits (ASICs), programmable logic controllers, or embedded microcontrollers.
  • Examples of controllers include, but are not limited to, the following microcontrollers: ARC625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller can also be implemented as part of the memory control logic.
  • In addition to implementing the controller purely as computer-readable program code, it is entirely possible to program the method steps so that the controller realizes the same function in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller can therefore be regarded as a hardware component, and the devices included in it for realizing various functions can be regarded as structures within the hardware component. The devices for realizing various functions can even be regarded both as software modules implementing the method and as structures within the hardware component.
  • A typical implementation device is a computer.
  • The computer may be, for example, a personal computer, a laptop computer, a cell phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or any combination of these devices.
  • the embodiments of this specification can be provided as a method, a system, or a computer program product. Therefore, the embodiments of this specification may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, the embodiments of this specification may adopt the form of computer program products implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes.
  • These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device. The instruction device implements the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing. The instructions executed on the computer or other programmable equipment thus provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • the computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • the memory may include non-permanent memory in a computer-readable medium, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM).
  • Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage can be realized by any method or technology.
  • the information can be computer-readable instructions, data structures, program modules, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined in this specification, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
  • program modules include routines, programs, objects, components, data structures, etc. that perform specific tasks or implement specific abstract data types.
  • This specification can also be practiced in distributed computing environments where tasks are performed by remote processing devices connected through a communication network.
  • Program modules can be located in both local and remote computer storage media, including storage devices.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)

Abstract

Disclosed in the present application are a CPU instruction processing method, a controller, and a central processing unit. The method comprises: fetching instructions to form an instruction block, and sending the instruction block to a CPU execution unit, wherein the instruction block comprises a single jump instruction and a branch instruction obtained by means of CPU instruction prediction; making the CPU execution unit execute the instructions before the jump instruction and the jump instruction itself; and, before a jump target instruction of the jump instruction is determined, refusing to let the branch instruction enter the execution stage. The present solution makes full use of the accuracy (98%) of instruction prediction while avoiding security problems and reducing the performance and power consumption problems caused by prediction failures, such that the efficiency of the CPU is improved.

Description

CPU instruction processing method, controller, and central processing unit
Technical field
This application relates to the field of computer technology, and in particular to a CPU instruction processing method, a controller, and a central processing unit (CPU).
Background
In the current big data and cloud environment, massive amounts of data must be stored and processed, which places higher demands on computing speed. As is well known, the decisive factor in computing speed is the performance of the central processing unit (CPU). To achieve faster computation, CPUs are constantly being improved in every respect, from physical manufacturing processes to logic control.
For example, to improve parallel processing capability, CPU hyper-threading technology has been proposed. It uses special hardware instructions to present two logical cores as if they were physical chips, so that a single processor can perform thread-level parallel computing and is therefore compatible with multi-threaded parallel computing. In other words, a hyper-threaded CPU can run two or more threads in parallel on a single physical core, thereby obtaining more instructions that can run in parallel and improving overall performance. In addition, to use CPU clock cycles more effectively and avoid pipeline stalls or waits, instruction prediction is used to perform instruction prefetching and instruction pre-execution.
These schemes have all improved CPU execution efficiency to some extent. However, instruction prediction is not always accurate (98% accuracy). Although the CPU increases data and instruction parallelism by using instruction prediction, the 2% of predictions that fail cause a 25% performance loss and also introduce security risks (Meltdown and Spectre).
Summary of the invention
In view of this, the embodiments of this specification provide a CPU instruction processing method, a controller, and a central processing unit (CPU) that, while making full use of the accuracy (98%) of instruction prediction, avoid security problems, reduce the performance and power consumption problems caused by prediction failures, and improve CPU efficiency.
The embodiments of this specification adopt the following technical solutions:
An embodiment of this specification provides a CPU instruction processing method. The method includes: fetching instructions to form an instruction block to be sent to a CPU execution unit, the instruction block including a single jump instruction and a branch instruction obtained through CPU instruction prediction; causing the CPU execution unit to execute the instructions before the jump instruction and the jump instruction itself; and, before the jump target instruction of the jump instruction is determined, refusing to let the branch instruction enter the execution stage.
An embodiment of this specification further provides a CPU controller, including: an instruction fetch unit, configured to fetch instructions to form an instruction block to be sent to a CPU execution unit, the instruction block including a single jump instruction and a branch instruction obtained through CPU instruction prediction; and an execution operation unit, configured to cause the CPU execution unit to execute the instructions before the jump instruction and the jump instruction itself, and, before the jump target instruction of the jump instruction is determined, to refuse to let the branch instruction enter the execution stage.
An embodiment of this specification further provides a central processing unit that includes the above controller.
At least one of the above technical solutions adopted in the embodiments of this application can achieve the following beneficial effects. The solution of this specification makes full use of the instruction prediction function of existing CPUs: predicted instructions pass through the fetch, decode, rename, and execution resource allocation stages and become ready to execute, but until the jump target instruction of the jump instruction is determined, the branch instruction (that is, the predicted instruction) is refused entry to the execution stage. In other words, only determined instructions (such as the determined jump target instruction) enter the execution stage. This avoids security problems, reduces the performance and power consumption problems caused by prediction failures, reduces execution conflicts between hyper-threads inside the CPU, and improves the overall throughput of the CPU in big data scenarios.
Description of the drawings
To describe the technical solutions in the embodiments of this specification or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some of the embodiments recorded in this specification; a person of ordinary skill in the art can obtain other drawings from them without creative effort:
FIG. 1 shows a CPU execution process provided by an embodiment of this specification;
FIG. 2 is a flowchart of a CPU instruction processing method provided by an embodiment of this specification;
FIG. 3 is a functional block diagram of a CPU controller provided by an embodiment of this specification.
Detailed description
As mentioned in the background, Meltdown and Spectre pose security risks. They are currently mitigated in software, but at a cost in performance. While ensuring security, the solution of this specification greatly reduces the performance loss caused by failed jump predictions and improves the overall throughput of the CPU.
To enable those skilled in the art to better understand the technical solutions in this specification, the technical solutions in the embodiments of this specification are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of this application, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this specification without creative effort shall fall within the protection scope of this application.
Terms used in the embodiments of this specification:
CPU hyper-threading: running two or more threads in parallel on a single physical core, thereby obtaining more instructions that can run in parallel and improving overall performance.
CPU instruction prediction: predicting the destination address of a jump instruction from the execution history of instructions.
CPU instruction pre-execution stage: before a jump instruction obtains a valid destination address, the CPU obtains executable code through instruction prediction; the execution of this predicted code is regarded as the instruction pre-execution stage.
FIG. 1 shows a CPU execution process provided by an embodiment of this specification. As shown in FIG. 1, the entire execution process is divided into multiple stages. The first is the instruction fetch stage. Current mainstream CPUs can fetch 16 bytes per instruction cycle, which is roughly 4 instructions at a time. Instruction pre-decoding follows. The main work of the pre-decoding stage is to identify instruction lengths and to mark jump instructions. Mainstream CPUs typically have a throughput of 5 instructions per cycle at this stage.
After pre-decoding comes the decoding stage. The decoding stage mainly converts complex instructions into simplified, fixed-length instructions and specifies the operation type. This stage typically also has a throughput of 5 instructions per cycle. Decoded instructions are placed in the decoded cache.
The decoded cache serves as an instruction buffer pool that can store multiple decoded instructions for the next stage to read. The throughput from the decoded cache to the next stage can reach 6 instructions per cycle.
As mentioned earlier, a hyper-threaded CPU can have multiple threads executing in parallel. During execution, each thread reads the instructions it is to execute next and forms its own thread cache queue. If an instruction to be executed is present in the decoded cache, the instruction stored in the decoded cache is used; otherwise, the corresponding instruction is obtained from the front end (memory) and added to the queue. FIG. 1 shows, by way of example, the thread cache queues of thread A and thread B, but it can be understood that a hyper-threaded CPU can also support the parallel execution of more threads.
Next, instructions move from the thread cache queue to the following stage: renaming and allocating execution resources. This stage typically consists of rename 1, rename 2, and execution resource allocation. The throughput from the thread cache queue to this stage can reach 5 instructions per cycle. The main work of this stage is to resolve register read/write dependencies and remove unnecessary ones, so that instructions can execute with as much parallelism as possible, and to allocate the various resources needed for execution.
Instructions are sent to the CPU's execution units only after the resources needed for execution have been allocated. Current CPUs have multiple execution units; the most common CPUs today have 8 pipelines that can execute in parallel, that is, 8 micro-operations can be executed per cycle. Although instructions can be executed out of order, they are ultimately committed in program order.
As mentioned earlier, to avoid pipeline stalls or waits caused by missing instructions, almost all current CPUs use instruction prediction, also known as branch prediction, to predict and prefetch instructions. After the end of each cycle, the prediction unit predicts the instructions to be prefetched based on the historical execution state table it maintains. If there is no jump, the instruction fetch stage fetches the instruction block at the current fetch address plus 16 bytes. If there is a jump, the instructions of the predicted branch are fetched according to the prediction result.
After continuous improvement, the prediction accuracy of current instruction prediction schemes can exceed 90%, and some schemes even reach 98%. However, mispredictions remain possible, and when one occurs a wrong instruction block is very likely to be fed into the execution units.
For example, suppose there are instructions L1, L2, L3, L4, and L5, where L2 is a jump instruction specifying that execution jumps to L5 when a certain condition is met and otherwise continues sequentially with L3 and L4. If instruction prediction predicts that the target branch of L2 is L3, then L3 and the subsequent instructions are read in the fetch stage, and in the subsequent execution stage L1, L2, L3, and L4 may all be sent to the CPU execution units for execution. If the execution result of L2 actually indicates a jump to L5, then L3 and L4 have been executed in error. In that case the CPU has to flush the entire pipeline, roll back to the branch point, perform a warm restart, and execute the other branch. Although the probability of a misprediction is not high, each occurrence requires these operations, which are so time-consuming that CPU efficiency can reach only about 75% at most.
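To make plausible how a small misprediction rate can cost so much, the following is a minimal back-of-the-envelope sketch. Only the 2% misprediction rate comes from the text; the issue width, branch density, and flush penalty are assumptions chosen purely for illustration, and the function name `pipeline_efficiency` is hypothetical.

```python
def pipeline_efficiency(instructions, ipc, branch_fraction, miss_rate, flush_penalty):
    """Fraction of cycles spent on useful work under a simple flush-cost model.

    Assumes every misprediction costs a fixed number of cycles (flush_penalty)
    and that nothing else stalls the pipeline.
    """
    ideal_cycles = instructions / ipc                       # cycles with perfect prediction
    mispredictions = instructions * branch_fraction * miss_rate
    wasted_cycles = mispredictions * flush_penalty          # flush, rollback, restart
    return ideal_cycles / (ideal_cycles + wasted_cycles)


# Assumed values (not from the source): 4 instructions per cycle,
# one jump in every 5 instructions, 20 cycles lost per misprediction.
eff = pipeline_efficiency(1000, ipc=4, branch_fraction=0.2,
                          miss_rate=0.02, flush_penalty=20)
print(f"efficiency = {eff:.2f}")
```

Under these assumed parameters the model lands at roughly three quarters of ideal throughput, which is in line with the figure quoted above; other parameter choices would give different results.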
One existing solution to this problem is as follows: the instruction fetch, pre-decode, and decode stages in FIG. 1 are still performed in the original way, and the decoded instructions are placed in the decoded cache, from which each thread can read instructions to form its thread cache queue. However, until the jump instruction obtains a valid jump target address, that is, until the jump target instruction is determined, the rename and execution resource allocation stage is not performed for the subsequent code block. This ensures that all subsequent execution is completed correctly and that no efficiency is lost to failed predictions. In the preceding example, where L2 among L1, L2, L3, L4, and L5 is a jump instruction, even if the target branch of L2 is wrongly predicted to be L3, this existing solution sends only L1 and L2 to the CPU execution units as one instruction block and does not execute L1, L2, L3, and L4 together. Only after the target address of L2 is determined (that is, after the jump target instruction is determined) is the jump target instruction sent to the execution units, where it passes through the rename and execution resource allocation stage and then enters the execution stage.
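The block-cutting behaviour of this existing solution can be sketched as follows. This is an illustrative model only: the function name `prior_scheme_block` and the `is_jump` predicate are hypothetical, and instructions are represented as plain strings.

```python
def prior_scheme_block(thread_queue, is_jump):
    """Split a thread cache queue the way the existing solution does.

    Returns (block, held_back): the block ends at the first jump instruction
    and is sent on to rename/allocate/execute; everything after the jump is
    held back before the rename stage until the jump target is known.
    """
    for i, ins in enumerate(thread_queue):
        if is_jump(ins):
            return thread_queue[:i + 1], thread_queue[i + 1:]
    # No jump in the queue: nothing needs to be held back.
    return list(thread_queue), []


queue = ["L1", "L2", "L3", "L4"]           # L3, L4 were fetched by prediction
block, held = prior_scheme_block(queue, lambda ins: ins == "L2")
print(block, held)
```

In this model the predicted instructions L3 and L4 have not yet been renamed or allocated resources when the jump resolves, which is the source of the extra latency discussed next.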
As described above, this existing solution sends the determined jump target instruction to the execution units only after the jump instruction has been resolved. However, between confirmation of the jump destination address and execution of the jump target instruction, the jump target instruction must still pass through at least the rename and execution resource allocation stage, that is, the three sub-stages rename 1, rename 2, and execution resource allocation, which wastes 3 or more cycles. The embodiments of this specification therefore make a further improvement on this basis: they retain and exploit, as far as possible, the advantages of highly accurate instruction prediction, and they use the high parallelism of hyper-threading to avoid security problems, reduce the performance and power consumption problems caused by prediction failures, reduce execution conflicts between hyper-threads inside the CPU, and improve the overall throughput of the CPU in big data scenarios.
According to one or more embodiments of this specification, instruction jump prediction is still used, and the predicted instruction block goes through instruction fetch, decode, rename 1, rename 2, and execution resource allocation, but each time only the code up to and including the jump instruction is executed. Implementations of this concept are described below.
FIG. 2 is a flowchart of a CPU instruction processing method provided by an embodiment of this specification. As shown in FIG. 2, the CPU instruction processing method provided in this specification includes:
S110: Fetch instructions to form an instruction block to be sent to the CPU execution unit, where the instruction block includes a single jump instruction and a branch instruction obtained through CPU instruction prediction.
S120: Cause the CPU execution unit to execute the instructions before the jump instruction and the jump instruction itself, and, before the jump target instruction of the jump instruction is determined, refuse to let the branch instruction enter the execution stage.
Specifically, in step S110, instructions are fetched from the current thread cache queue in the original way to form an instruction block of the maximum length that corresponds to the maximum processing capability of the hardware. The maximum processing capability of CPU hardware generally depends on the number of execution units it contains, so a predetermined threshold can be set according to the number of execution units and used as the maximum length of the instruction block. For example, the most common CPUs today have 8 pipelines that can execute in parallel, so the predetermined threshold can be set to 8 and, accordingly, the maximum length of the instruction block is 8.
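A minimal sketch of step S110 under stated assumptions: the thread cache queue is modelled as a list of strings that already contains the predicted instructions after the jump, the threshold of 8 follows the example above, and the rule of stopping before a second jump is one possible reading of "a single jump instruction". The names `form_block`, `MAX_BLOCK_LEN`, and `is_jump` are hypothetical.

```python
MAX_BLOCK_LEN = 8  # predetermined threshold, matching 8 parallel pipelines


def form_block(thread_queue, is_jump, max_len=MAX_BLOCK_LEN):
    """Take up to max_len instructions from the front of the queue.

    The block may contain one jump instruction followed by predicted
    instructions; it is cut before a second jump (an assumed rule).
    """
    block = []
    jumps_seen = 0
    for ins in thread_queue:
        if len(block) == max_len:
            break                      # hardware width reached
        if is_jump(ins):
            if jumps_seen == 1:
                break                  # keep a single jump per block
            jumps_seen += 1
        block.append(ins)
    return block


queue = ["i1", "i2", "i3", "i4", "i5", "ja", "n", "n+1", "n+2", "jb"]
block = form_block(queue, lambda ins: ins.startswith("j"))
print(block)
```

Here the block holds the five instructions before the jump, the jump itself, and two predicted instructions, for a total of 8.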
In the existing solution described above, the instruction block sent to the CPU execution unit does not contain the branch instruction obtained through CPU instruction prediction. In contrast, the instruction block sent to the CPU execution unit in the present solution includes a single jump instruction and the branch instruction obtained through CPU instruction prediction.
After the instruction block formed in step S110 is sent to the CPU execution unit, the CPU passes all of its instructions through the rename and execution resource allocation stage in the existing way. That is, both the instructions up to and including the jump instruction and the branch instruction obtained through CPU instruction prediction go through the rename and execution resource allocation stage and become ready to execute. Unlike the existing way, however, the present solution executes only the instructions up to and including the jump instruction (that is, the instructions before the jump instruction and the jump instruction itself); until the jump target instruction of the jump instruction is determined, the branch instruction obtained through CPU instruction prediction is refused entry to the execution stage.
In other words, the solution of this specification places the predicted instruction block in the CPU execution unit, where it passes through the rename and execution resource allocation stage and becomes ready to execute, but it is not executed until the jump instruction is confirmed (that is, until the jump target instruction is determined). Rollback therefore never occurs, and the instruction-level and data-level parallelism of hyper-threading is used to improve overall CPU throughput.
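The gating described above can be sketched as a small state model. This is an illustration rather than a description of the actual controller logic: the `Slot` class, its `stage` values, and the `issue` function are all hypothetical, and the second call assumes the prediction turned out to be correct.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    name: str
    predicted: bool          # True if the instruction follows the jump via prediction
    stage: str = "ready"     # already renamed and allocated execution resources


def issue(slots, jump_resolved):
    """Move eligible slots from 'ready' into 'executing'.

    Instructions up to and including the jump always issue; predicted
    instructions are held in 'ready' until the jump target is determined.
    """
    for slot in slots:
        if slot.stage == "ready" and (not slot.predicted or jump_resolved):
            slot.stage = "executing"
    return [slot.name for slot in slots if slot.stage == "executing"]


slots = [Slot("i5", False), Slot("i6_jump", False), Slot("n", True), Slot("n+1", True)]
before = issue(slots, jump_resolved=False)   # predicted instructions are refused
after = issue(slots, jump_resolved=True)     # prediction confirmed: they issue at once
print(before, after)
```

Because the predicted slots are already in the ready state, releasing them after a correct prediction requires no further rename or allocation work in this model.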
The above process is described below with a concrete example. Suppose there is the following instruction sequence (the text between /* and */ explains each instruction):
1. mov (r1), r2 /* copy the contents of the address pointed to by register r1 into register r2 */
2. mov 0x08(r1), r3 /* copy the contents of the address pointed to by r1+8 into register r3 */
3. add r3, r2 /* add the contents of register r3 to the contents of register r2 and store the result in r2 */
4. mov r2, (r4) /* store the contents of r2 to the memory address pointed to by register r4 */
5. cmp r2, r5 /* compare r2 with r5 and set the condition flags */
6. ja L_Jmp /* jump to L_Jmp if the "above" condition holds; otherwise fall through to instruction 7 */
7. div r4, r5 /* divide the contents of register r4 by r5 and store the result in register r5 */
......
L_Jmp:
n. mul r6, r7 /* multiply the contents of register r6 by r7 and store the result in register r7 */
n+1. ...
In this instruction sequence, instruction 6 is a jump instruction. The CPU passes instructions 1-6 through the following stages in the existing way: instruction fetch, decode, rename 1, rename 2, and execution resource allocation, after which they are ready to run. If the jump predictor judges that a jump is needed (with instruction n as the destination), then instruction n and the instructions after it also pass through the same stages (instruction fetch, decode, rename 1, rename 2, and execution resource allocation) according to the predictor's judgment and become ready to execute. That is, instructions 1-6, instruction n, and the instructions after instruction n are all sent to the CPU execution unit. Unlike the existing way, however, the embodiments of this specification execute only instructions 1-6 (that is, only instructions 1-6 enter the execution stage); until the target instruction of the jump instruction is determined, instruction n and the instructions after it are refused entry to the execution stage.
Once the target instruction of the jump instruction has been determined from the execution result of the CPU execution unit, it is checked whether the target instruction is consistent with the branch instruction. If the target instruction is consistent with the branch instruction, for example the predicted branch instruction is instruction n and the determined target instruction of the jump instruction is also instruction n (the case for 98% of predictions), then instruction n is already ready to execute (it has passed through rename 1, rename 2, and execution resource allocation), so it can enter the execution stage and run immediately. If the target instruction is not consistent with the branch instruction, for example the predicted branch instruction is instruction n but the determined target instruction of the jump instruction is instruction 7, then instruction n is cleared. Because instruction n was only ready to execute and was never actually executed, no unwanted context has been produced, so no complex recovery is needed: the instruction at the real destination address (instruction 7) and the instructions after it can be fetched promptly, without waiting, and sent to the CPU execution unit for execution.
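The decision taken once the jump resolves can be sketched as below. The function `resolve_jump` and its return labels are hypothetical, and instructions are again modelled as strings; the point is only to show that the mismatch path discards instructions that were never executed.

```python
def resolve_jump(predicted_target, actual_target, ready_queue):
    """Decide what enters the execution stage after the jump target is known.

    ready_queue holds predicted instructions that were renamed and allocated
    resources but never executed.
    Returns (instructions_to_execute, outcome).
    """
    if predicted_target == actual_target:
        # Prediction correct: release the waiting instructions immediately.
        return list(ready_queue), "released"
    # Prediction wrong: drop the waiting instructions. Nothing ran, so there
    # is no architectural state to roll back; fetch the real target instead.
    ready_queue.clear()
    return [actual_target], "refetched"


hit = resolve_jump("n", "n", ["n", "n+1"])
waiting = ["n", "n+1"]
miss = resolve_jump("n", "7", waiting)
print(hit, miss, waiting)
```

In the mismatch case the real target still has to pass through the earlier pipeline stages before it executes, so this sketch models the control decision only, not the timing.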
In one embodiment, obtaining the target instruction may include: first determining whether the decoded cache contains the correct target instruction and, if it does, obtaining the target instruction from the decoded cache. It can be understood that instruction prefetching based on the instruction prediction scheme continuously prefetches many instructions, which are decoded and placed in the decoded cache. In the vast majority of cases, therefore, the correct target instruction can be obtained from the decoded cache. In the extremely rare case that the decoded cache does not contain the target instruction, the target instruction can be requested from memory.
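The lookup order in this embodiment can be sketched as follows, with the decoded cache and memory modelled as dictionaries keyed by address. The names `fetch_target`, `decoded_cache`, and `memory` are hypothetical, and the decode step that a real memory fetch would require is omitted.

```python
def fetch_target(address, decoded_cache, memory):
    """Return (instruction, source) for the resolved jump target.

    The decoded cache is checked first; memory is used only on a miss.
    """
    if address in decoded_cache:
        return decoded_cache[address], "decoded_cache"
    # Rare path: the target was never prefetched, so request it from memory.
    return memory[address], "memory"


decoded_cache = {0x40: "div r4, r5"}
memory = {0x40: "div r4, r5", 0x80: "mul r6, r7"}
print(fetch_target(0x40, decoded_cache, memory))
print(fetch_target(0x80, decoded_cache, memory))
```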
按照上述实施例,在确定出跳转指令的目标指令之前,使CPU执行单元执行分支指令的重命名和分配执行资源阶段,以使所述分支指令进入准备执行阶段,并拒绝分支指令进入执行阶段。只有在目标指令与所述分支指令一致的情况下,使分支指令进入执行阶段以执行所述分支指令;否则,清除所述分支指令,并获取所述目标指令,并将所述目标指令送入CPU执行单元进行执行。可以看到CPU永远都会执行正确的指令,因此有效避免熔断和幽灵引入的安全问题,同时在引入超线程之后,不会因为某一个线程占用太多资源而伤害在同一个执行单元运行的其他线程,线程之间具有非常好的自适应调度能力,最终在保证安全的前提下,减少功耗提升性能。According to the above embodiment, before the target instruction of the jump instruction is determined, the CPU execution unit is made to execute the rename and execution resource allocation stage of the branch instruction, so that the branch instruction enters the stage of preparation for execution, and the branch instruction is refused to enter the execution stage. . Only when the target instruction is consistent with the branch instruction, the branch instruction enters the execution stage to execute the branch instruction; otherwise, the branch instruction is cleared, the target instruction is obtained, and the target instruction is sent to The CPU execution unit performs execution. It can be seen that the CPU will always execute the correct instructions, so it effectively avoids the safety problems introduced by fuse and ghost. At the same time, after the introduction of hyperthreading, it will not harm other threads running in the same execution unit because a certain thread occupies too many resources. , There is a very good adaptive scheduling ability between threads, and ultimately under the premise of ensuring safety, reducing power consumption and improving performance.
在大数据场景的条件下,需要通过超线程提升整体的吞吐量,经过验证发现,多线程的场景对于指令预测成功率要求较低,但是产生指令回滚的次数更多,产生大量不必要的性能开销,和熔断/幽灵等安全问题。在本说明书的方案中,由于预测的指令不会执行,当预测失败的时候,又可以有效减少预测带来的延迟,同时考虑到了大数据场景下的多线程执行,使多线程之间有着很强的自适应调度。另外,随着单核CPU超线程数目的增加,共享的执行资源变得更加稀缺,在本说明书的方案中只执行确定的任务,避免由于过多乱序导致回滚和资源滥用的问题,最终在避免安全问题的同时提升CPU吞吐量。Under the conditions of big data scenarios, hyper-threading needs to be used to improve the overall throughput. After verification, it is found that the multi-threaded scenario requires lower instruction prediction success rate, but the number of instruction rollbacks is greater, resulting in a large amount of unnecessary Performance overhead, and safety issues such as circuit breakers/ghosts. In the solution in this specification, since the predicted instruction will not be executed, when the prediction fails, the delay caused by the prediction can be effectively reduced. At the same time, considering the multi-threaded execution in the big data scenario, there is a lot of Strong adaptive scheduling. In addition, as the number of single-core CPU hyperthreads increases, shared execution resources become more scarce. In the solution of this specification, only certain tasks are executed to avoid the problems of rollback and resource abuse due to excessive disorder. Improve CPU throughput while avoiding security issues.
As described above, the solution of this specification makes full use of the instruction prediction function of existing CPUs. Predicted instructions pass through the fetch, decode, rename, and execution-resource-allocation stages and enter the ready-to-execute stage; before the target instruction of the jump instruction is determined, the branch instruction (that is, the predicted instruction) is refused entry to the execution stage. In other words, only determined instructions (such as the determined target instruction) enter the execution stage. This avoids security problems, reduces the performance and power costs of failed predictions, reduces execution conflicts between hyper-threads inside the CPU, and improves the overall throughput of the CPU in big data scenarios.
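The stage progression summarized here can be modeled as a tiny state machine. The sketch below is an assumption-laden illustration rather than the patented circuit: the `Stage` enumeration, the `advance` function, and the one-stage-per-step timing are all hypothetical.

```python
from enum import IntEnum


class Stage(IntEnum):
    FETCH = 0
    DECODE = 1
    RENAME = 2
    ALLOCATE = 3   # allocation of execution resources
    READY = 4      # ready-to-execute
    EXECUTE = 5


def advance(stage: Stage, predicted: bool, target_known: bool) -> Stage:
    """Move an instruction one stage forward.

    A predicted (branch) instruction is held at READY until the jump's
    target instruction has been determined; other instructions flow through.
    """
    if stage is Stage.EXECUTE:
        return stage
    if stage is Stage.READY and predicted and not target_known:
        return stage  # gate closed: refused entry to the execution stage
    return Stage(stage + 1)
```

Calling `advance` repeatedly on a predicted instruction with `target_known=False` leaves it parked at `Stage.READY`, so the front-end work (fetch, decode, rename, allocation) is already done by the time the jump resolves.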
As those skilled in the art know, the execution of instructions in a CPU is controlled by a controller. The controller is the command and control center of the entire CPU and coordinates the operations of its components. A controller generally includes instruction control logic, timing control logic, bus control logic, interrupt control logic, and other parts. The instruction control logic carries out the operations of fetching, analyzing, and executing instructions.
The solutions of the embodiments described above optimize and adjust the original instruction control process. Accordingly, the controller circuit, and in particular its instruction control logic, can be modified at the hardware level so that it carries out the control process described in the above embodiments.
Fig. 3 is a functional block diagram of a CPU controller provided by an embodiment of this specification. As shown in Fig. 3, the CPU controller includes:
an instruction fetch unit 301, configured to fetch instructions to form an instruction block to be sent to the CPU execution unit, the instruction block including a single jump instruction and a branch instruction obtained through CPU instruction prediction;
an execution operation unit 305, configured to cause the CPU execution unit to execute the jump instruction and the instructions preceding it, and to refuse the branch instruction entry to the execution stage before the target instruction of the jump instruction is determined.
In addition, the execution operation unit 305 is further configured to, before the target instruction of the jump instruction is determined, cause the CPU execution unit to carry out the rename and execution-resource-allocation stages for the branch instruction, so that the branch instruction enters the ready-to-execute stage.
In a specific embodiment, as shown in Fig. 3, the CPU controller may further include:
a target instruction determination unit 302, configured to determine the target instruction of the jump instruction according to the execution result of the CPU execution unit;
a judgment unit 303, configured to judge whether the target instruction is consistent with the branch instruction;
a target instruction acquisition unit 304, configured to determine whether the decoded cache contains the target instruction, the decoded cache storing a plurality of prefetched and decoded instructions; and, if so, to obtain the target instruction from the decoded cache; if not, to obtain the target instruction from memory.
The execution operation unit 305 is further configured to, when the target instruction is consistent with the branch instruction, admit the branch instruction to the execution stage so that it is executed.
The execution operation unit 305 is further configured to, when the target instruction is inconsistent with the branch instruction, clear the branch instruction and send the target instruction obtained by the target instruction acquisition unit 304 to the CPU execution unit for execution.
Each of the above units can be implemented with various circuit elements as needed; for example, the judgment unit 303 can be implemented with a number of comparators.
With the above controller, the control process shown in Fig. 2 can be realized. While retaining the advantages of instruction prediction and prefetching, this avoids security problems, reduces the performance and power costs of failed predictions, reduces execution conflicts between hyper-threads inside the CPU, and improves the overall throughput of the CPU in big data scenarios.
An embodiment of this specification further provides a central processing unit that includes the above controller.
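The lookup performed by the target instruction acquisition unit 304 can be sketched in software as a cache check with a memory fallback. This is a minimal illustration under assumptions: the function name `fetch_target`, the dictionary-based cache and memory, and the explicit `decode` callback for the memory path are hypothetical and are not taken from the specification, which only states that the target instruction is obtained from the decoded cache or from memory.

```python
def fetch_target(addr, decoded_cache, memory, decode):
    """Return (instruction, source) for the jump target at `addr`.

    decoded_cache: dict mapping address -> already decoded instruction
    memory:        dict mapping address -> raw instruction
    decode:        callable turning a raw instruction into a decoded one
                   (assumed step; only the memory path needs it)
    """
    if addr in decoded_cache:
        # Hit: the prefetched, decoded instruction is reused directly.
        return decoded_cache[addr], "decoded_cache"
    # Miss: fall back to memory and decode on the way in.
    return decode(memory[addr]), "memory"
```

A hit skips the fetch and decode work entirely, which is why the prefetched and decoded instructions are worth keeping even though predicted instructions are never executed speculatively.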
Specific embodiments of this specification have been described above. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims can be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing are also possible or may be advantageous.
The embodiments in this specification are described in a progressive manner. For parts that are the same or similar between embodiments, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the apparatus, device, and non-volatile computer-readable storage medium embodiments are described relatively briefly because they are substantially similar to the method embodiments; for related details, refer to the corresponding description of the method embodiments.
The apparatus, device, and non-volatile computer-readable storage medium provided in the embodiments of this specification correspond to the method, and therefore have beneficial technical effects similar to those of the corresponding method. Because the beneficial technical effects of the method have been described in detail above, those of the corresponding apparatus, device, and non-volatile computer storage medium are not repeated here.
In the 1990s, an improvement to a technology could be clearly classified as either a hardware improvement (for example, an improvement to a circuit structure such as a diode, transistor, or switch) or a software improvement (an improvement to a method flow). With the development of technology, however, many of today's method-flow improvements can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. It therefore cannot be said that an improvement to a method flow cannot be implemented with a hardware entity module. For example, a programmable logic device (PLD), such as a field programmable gate array (FPGA), is an integrated circuit whose logic function is determined by the user programming the device. Designers program a digital system onto a single PLD themselves, without asking a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, instead of making integrated circuit chips by hand, this programming is nowadays mostly done with "logic compiler" software, which is similar to the software compilers used in program development; the source code to be compiled must likewise be written in a specific programming language, called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); the most commonly used at present are VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog. Those skilled in the art should also understand that a hardware circuit implementing a logical method flow can be readily obtained simply by describing the method flow in one of the above hardware description languages and programming it into an integrated circuit.
The controller can be implemented in any suitable manner. For example, the controller can take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an application specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller can also be implemented as part of the control logic of a memory. Those skilled in the art also know that, besides implementing the controller purely as computer-readable program code, the method steps can be logically programmed so that the controller achieves the same functions in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller can therefore be regarded as a hardware component, and the means included in it for realizing various functions can be regarded as structures within the hardware component. The means for realizing various functions can even be regarded both as software modules implementing the method and as structures within the hardware component.
The systems, apparatuses, modules, or units described in the above embodiments can be implemented by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer. Specifically, the computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above apparatus is described in terms of separate units divided by function. Of course, when this specification is implemented, the functions of the units can be realized in one or more pieces of software and/or hardware.
Those skilled in the art should understand that the embodiments of this specification can be provided as a method, a system, or a computer program product. Accordingly, the embodiments of this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the embodiments of this specification may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
This specification is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to the embodiments of this specification. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions can also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions can also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
Memory may include volatile memory in computer-readable media, random access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of additional identical elements in the process, method, article, or device that includes the element.
This specification may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform specific tasks or implement specific abstract data types. This specification can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules can be located in both local and remote computer storage media, including storage devices.
The embodiments in this specification are described in a progressive manner. For parts that are the same or similar between embodiments, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the system embodiment is described relatively briefly because it is substantially similar to the method embodiment; for related details, refer to the corresponding description of the method embodiment.
The above descriptions are merely embodiments of this specification and are not intended to limit this application. Those skilled in the art may make various modifications and changes to this application. Any modification, equivalent replacement, or improvement made within the spirit and principle of this application shall fall within the scope of the claims of this application.

Claims (11)

  1. A CPU instruction processing method, comprising:
    fetching instructions to form an instruction block to be sent to a CPU execution unit, wherein the instruction block includes a single jump instruction and a branch instruction obtained through CPU instruction prediction;
    causing the CPU execution unit to execute the instructions preceding the jump instruction and the jump instruction, and refusing the branch instruction entry to the execution stage before the jump target instruction of the jump instruction is determined.
  2. The method according to claim 1, further comprising, before refusing the branch instruction entry to the execution stage:
    causing the CPU execution unit to carry out the rename and execution-resource-allocation stages for the branch instruction, so that the branch instruction enters the ready-to-execute stage.
  3. The method according to claim 2, further comprising:
    determining the target instruction of the jump instruction according to the execution result of the CPU execution unit;
    judging whether the target instruction is consistent with the branch instruction;
    if the target instruction is consistent with the branch instruction, admitting the branch instruction to the execution stage so that the branch instruction is executed.
  4. The method according to claim 3, further comprising:
    if the target instruction is inconsistent with the branch instruction, clearing the branch instruction, obtaining the target instruction, and sending the target instruction to the CPU execution unit for execution.
  5. The method according to claim 4, wherein obtaining the target instruction comprises:
    determining whether a decoded cache contains the target instruction, wherein the decoded cache stores a plurality of prefetched and decoded instructions;
    if so, obtaining the target instruction from the decoded cache;
    if not, obtaining the target instruction from memory.
  6. A CPU controller, comprising:
    an instruction fetch unit, configured to fetch instructions to form an instruction block to be sent to a CPU execution unit, wherein the instruction block includes a single jump instruction and a branch instruction obtained through CPU instruction prediction;
    an execution operation unit, configured to cause the CPU execution unit to execute the instructions preceding the jump instruction and the jump instruction, and to refuse the branch instruction entry to the execution stage before the jump target instruction of the jump instruction is determined.
  7. The CPU controller according to claim 6, wherein
    the execution operation unit is further configured to, before the target instruction of the jump instruction is determined, cause the CPU execution unit to carry out the rename and execution-resource-allocation stages for the branch instruction, so that the branch instruction enters the ready-to-execute stage.
  8. The controller according to claim 7, further comprising:
    a target instruction determination unit, configured to determine the target instruction of the jump instruction according to the execution result of the CPU execution unit;
    a judgment unit, configured to judge whether the target instruction is consistent with the branch instruction;
    wherein the execution operation unit is further configured to, when the target instruction is consistent with the branch instruction, admit the branch instruction to the execution stage so that the branch instruction is executed.
  9. The controller according to claim 8, wherein the execution operation unit is further configured to, when the target instruction is inconsistent with the branch instruction, clear the branch instruction and send the target instruction to the CPU execution unit for execution.
  10. The controller according to claim 9, further comprising:
    a target instruction acquisition unit, configured to determine whether a decoded cache contains the target instruction, wherein the decoded cache stores a plurality of prefetched and decoded instructions; and,
    if so, to obtain the target instruction from the decoded cache;
    if not, to obtain the target instruction from memory.
  11. A central processing unit, comprising the controller according to any one of claims 6 to 10.
PCT/CN2021/087176 2020-04-28 2021-04-14 Cpu instruction processing method, controller, and central processing unit WO2021218633A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010349676.7A CN111538535B (en) 2020-04-28 2020-04-28 CPU instruction processing method, controller and central processing unit
CN202010349676.7 2020-04-28

Publications (1)

Publication Number Publication Date
WO2021218633A1 true WO2021218633A1 (en) 2021-11-04

Family

ID=71977272

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/087176 WO2021218633A1 (en) 2020-04-28 2021-04-14 Cpu instruction processing method, controller, and central processing unit

Country Status (2)

Country Link
CN (1) CN111538535B (en)
WO (1) WO2021218633A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117055961A (en) * 2023-08-15 2023-11-14 海光信息技术股份有限公司 Scheduling method and scheduling device for multithreading and processor

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111538535B (en) * 2020-04-28 2021-09-21 支付宝(杭州)信息技术有限公司 CPU instruction processing method, controller and central processing unit
CN113868899B (en) * 2021-12-03 2022-03-04 苏州浪潮智能科技有限公司 Branch instruction processing method, system, equipment and computer storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104793921A (en) * 2015-04-29 2015-07-22 深圳芯邦科技股份有限公司 Instruction branch prediction method and system
CN108089883A (en) * 2013-01-21 2018-05-29 想象力科技有限公司 Thread is allocated resources to based on speculating to measure
CN109101276A (en) * 2018-08-14 2018-12-28 阿里巴巴集团控股有限公司 The method executed instruction in CPU
US20200065112A1 (en) * 2018-08-22 2020-02-27 Qualcomm Incorporated Asymmetric speculative/nonspeculative conditional branching
CN111538535A (en) * 2020-04-28 2020-08-14 支付宝(杭州)信息技术有限公司 CPU instruction processing method, controller and central processing unit

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5471593A (en) * 1989-12-11 1995-11-28 Branigin; Michael H. Computer processor with an efficient means of executing many instructions simultaneously
JP2004192021A (en) * 2002-12-06 2004-07-08 Renesas Technology Corp Microprocessor
US7281120B2 (en) * 2004-03-26 2007-10-09 International Business Machines Corporation Apparatus and method for decreasing the latency between an instruction cache and a pipeline processor
US8635437B2 (en) * 2009-02-12 2014-01-21 Via Technologies, Inc. Pipelined microprocessor with fast conditional branch instructions based on static exception state
CN106990942A (en) * 2011-06-29 2017-07-28 上海芯豪微电子有限公司 branch processing method and system
CN102360282A (en) * 2011-09-26 2012-02-22 杭州中天微系统有限公司 Production-line processor device for rapidly disposing prediction error of branch instruction
US9268569B2 (en) * 2012-02-24 2016-02-23 Apple Inc. Branch misprediction behavior suppression on zero predicate branch mispredict
CN103838550B (en) * 2012-11-26 2018-01-02 上海芯豪微电子有限公司 A kind of branch process system and method
CN103984525B (en) * 2013-02-08 2017-10-20 上海芯豪微电子有限公司 Instruction process system and method
CN103984523B (en) * 2013-02-08 2017-06-09 上海芯豪微电子有限公司 Multi-emitting instruction process system and method
CN104423929B (en) * 2013-08-21 2017-07-14 华为技术有限公司 A kind of branch prediction method and relevant apparatus
CN107783785A (en) * 2016-08-24 2018-03-09 上海芯豪微电子有限公司 A kind of branch processing method and system without branch prediction loss
US10691461B2 (en) * 2017-12-22 2020-06-23 Arm Limited Data processing
CN109634666B (en) * 2018-12-11 2022-11-15 华夏芯(北京)通用处理器技术有限公司 Method for fusing BTBs (Branch target bus) under prefetching mechanism


Also Published As

Publication number Publication date
CN111538535B (en) 2021-09-21
CN111538535A (en) 2020-08-14

Similar Documents

Publication Publication Date Title
WO2021218633A1 (en) CPU instruction processing method, controller, and central processing unit
CN106406849B (en) Method and system for providing backward compatibility, non-transitory computer readable medium
KR101594090B1 (en) Processors, methods, and systems to relax synchronization of accesses to shared memory
US8424015B2 (en) Transactional memory preemption mechanism
US9811340B2 (en) Method and apparatus for reconstructing real program order of instructions in multi-strand out-of-order processor
US9772867B2 (en) Control area for managing multiple threads in a computer
US9223574B2 (en) Start virtual execution instruction for dispatching multiple threads in a computer
TWI719501B (en) Central processing unit (CPU), central processing unit (CPU) controller and method of executing instructions in central processing unit (CPU)
CN106170768B (en) Dispatching multiple threads in a computer
KR20150112774A (en) Method and apparatus for implementing a dynamic out-of-order processor pipeline
EP2764433A1 (en) Maintaining operand liveness information in a computer system
KR20180021812A (en) Block-based architecture that executes contiguous blocks in parallel
US9213569B2 (en) Exiting multiple threads in a computer
EP3186704B1 (en) Multiple clustered very long instruction word processing core
US11816061B2 (en) Dynamic allocation of arithmetic logic units for vectorized operations
CN114168202B (en) Instruction scheduling method, instruction scheduling device, processor and storage medium
US9389897B1 (en) Exiting multiple threads of a simulation environment in a computer
KR20240025019A (en) Provides atomicity for complex operations using near-memory computing
US20140201505A1 (en) Prediction-based thread selection in a multithreading processor
US20220147393A1 (en) User timer directly programmed by application
JP2024523339A (en) Providing atomicity for composite operations using near-memory computing

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21797864; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 21797864; Country of ref document: EP; Kind code of ref document: A1)