CN117055961A - Scheduling method and scheduling device for multithreading and processor

Info

Publication number
CN117055961A
Authority
CN
China
Prior art keywords
threads
thread
target
queue
branch prediction
Prior art date
Legal status
Pending
Application number
CN202311034294.5A
Other languages
Chinese (zh)
Inventor
金伟松
胡世文
Current Assignee
Haiguang Information Technology Co Ltd
Original Assignee
Haiguang Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Haiguang Information Technology Co Ltd filed Critical Haiguang Information Technology Co Ltd
Priority to CN202311034294.5A
Publication of CN117055961A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38 - Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F 9/3836 - Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F 9/3842 - Speculative instruction execution
    • G06F 9/3844 - Speculative instruction execution using dynamic branch prediction, e.g. using branch history tables
    • G06F 9/3885 - Concurrent instruction execution using a plurality of independent parallel functional units
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

A scheduling method for multithreading, a scheduling device, and a processor are provided. The scheduling method for multithreading comprises the following steps: acquiring the branch prediction history accuracy of each of a plurality of threads waiting to be scheduled in a queue; determining a target thread among the plurality of threads in the queue based on the branch prediction history accuracy of the plurality of threads; and selecting the target thread in the queue for subsequent processing. The scheduling method improves the system performance of, for example, a multithreaded processor or a multi-core processor.

Description

Scheduling method and scheduling device for multithreading and processor
Technical Field
Embodiments of the present disclosure relate to a scheduling method for multithreading, a scheduling device, and a processor.
Background
Simultaneous multithreading (SMT) is an important technology for improving the overall performance of a CPU. It executes the instructions of multiple threads at the same time through mechanisms of the processor (CPU) core such as multi-issue and out-of-order execution, so that one physical CPU core is presented to software and the operating system as multiple virtual CPU cores. When a modern multi-issue high-performance CPU core executes a single thread, the many execution units and hardware resources in the core cannot be fully utilized most of the time; when that single thread stalls for some reason (for example, an L2 cache miss), the hardware execution units can only idle, which wastes hardware resources and lowers the performance-to-power ratio. In the SMT mode of the CPU, by contrast, when one thread stalls, other threads can still run, which improves the utilization of hardware resources and thereby improves the multithreaded throughput, overall performance, and performance-to-power ratio of the CPU core.
Disclosure of Invention
At least one embodiment of the present disclosure provides a scheduling method for multithreading, including: acquiring the branch prediction history accuracy of each of a plurality of threads waiting to be scheduled in a queue; determining a target thread for the plurality of threads in the queue based on the branch prediction history accuracy of the plurality of threads; and selecting the target thread in the queue for subsequent processing.
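The three steps can be pictured as follows; this is a minimal sketch under the assumption that per-thread prediction counters are available to the scheduler, and all identifiers (ThreadStats, pickTargetThread) are illustrative rather than taken from the patent:

```cpp
// Minimal sketch of steps 1-3: acquire per-thread accuracy, determine the
// target thread (here: highest accuracy), and return its id for selection.
#include <cstdint>
#include <vector>

struct ThreadStats {
    int      id;              // thread identifier in the queue
    uint64_t predictions;     // total branch predictions observed
    uint64_t mispredictions;  // of which were wrong
    double accuracy() const { // branch prediction history accuracy
        return predictions ? 1.0 - double(mispredictions) / predictions : 0.0;
    }
};

// Assumes a non-empty set of waiting threads.
int pickTargetThread(const std::vector<ThreadStats>& waiting) {
    int bestId = waiting.front().id;
    double bestAcc = waiting.front().accuracy();
    for (const ThreadStats& t : waiting) {
        if (t.accuracy() > bestAcc) { bestAcc = t.accuracy(); bestId = t.id; }
    }
    return bestId;
}
```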
For example, the scheduling method for multithreading according to at least one embodiment of the present disclosure further includes, before determining the target thread for the plurality of threads in the queue based on their branch prediction history accuracy: judging whether the plurality of threads satisfy an executable condition, where scheduling of the plurality of threads is suspended when none of them satisfies the executable condition; or, when one and only one thread satisfies the executable condition, that thread is determined to be the target thread; or, when more than one thread satisfies the executable condition, the method continues to determine the target thread for the plurality of threads in the queue according to their branch prediction history accuracy.
For example, in a scheduling method for multithreading in accordance with at least one embodiment of the present disclosure, determining the target thread for the plurality of threads in the queue based on their branch prediction history accuracy includes: determining the thread with the highest branch prediction history accuracy among the plurality of threads as the target thread.
For example, in a scheduling method for multithreading in accordance with at least one embodiment of the present disclosure, determining the target thread for the plurality of threads in the queue based on their branch prediction history accuracy includes: determining the target thread for the plurality of threads in the queue by combining the branch prediction history accuracy of the plurality of threads with a base scheduling algorithm for the queue.
For example, in a scheduling method for multithreading in accordance with at least one embodiment of the present disclosure, combining the branch prediction history accuracy of the plurality of threads and the base scheduling algorithm for the queue to determine the target thread includes: determining a first candidate thread among the plurality of threads using their branch prediction history accuracy; determining a second candidate thread among the plurality of threads using the base scheduling algorithm; and selecting the first candidate thread as the target thread when the first candidate thread and the second candidate thread are the same, or preferentially selecting the first candidate thread as the target thread for a set number of consecutive operations when they are different.
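The combination rule just described admits a compact sketch; skipCount here stands in for the "target coefficient" defined in the following paragraphs, and the function name and signature are assumptions for exposition:

```cpp
// Combine the two candidates; skipCount says how many more rounds the
// accuracy-based candidate keeps priority when the two choices differ.
int combineCandidates(int firstCandidate,  // chosen by history accuracy
                      int secondCandidate, // chosen by the base algorithm
                      int& skipCount) {
    if (firstCandidate == secondCandidate)
        return firstCandidate;             // both methods agree
    if (skipCount > 0) {                   // accuracy-based choice still
        --skipCount;                       // has priority for this round
        return firstCandidate;
    }
    return secondCandidate;                // fall back to the base algorithm
}
```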
For example, in a scheduling method for multiple threads according to at least one embodiment of the present disclosure, where the first candidate thread and the second candidate thread are different, preferentially selecting the first candidate thread as the target thread for the set number of consecutive operations includes: obtaining a target coefficient, where the target coefficient is the number of times the first candidate thread currently remains to be preferentially selected as the target thread; and preferentially selecting the first candidate thread as the target thread for the set number of consecutive operations based on the target coefficient.
For example, in the scheduling method for multithreading according to at least one embodiment of the present disclosure, preferentially selecting the first candidate thread as the target thread for the set number of consecutive operations based on the target coefficient further includes: judging whether the plurality of threads satisfy a first condition, where the first condition is that the difference value between the branch prediction history accuracy rates of the threads being processed among the plurality of threads is smaller than a preset threshold and the target coefficient is zero; and selecting the second candidate thread as the target thread if the plurality of threads satisfy the first condition.
For example, in the scheduling method for multithreading according to at least one embodiment of the present disclosure, preferentially selecting the first candidate thread as the target thread for the set number of consecutive operations based on the target coefficient further includes: resetting the target coefficient based on a difference value of the branch prediction history accuracy rates among the threads being processed, in the case where the plurality of threads do not satisfy the first condition and the target coefficient is zero, where the difference value is the difference between the branch prediction history accuracy rates most recently acquired for the plurality of threads.
For example, in a scheduling method for multithreading in accordance with at least one embodiment of the present disclosure, the difference value is positively correlated with the coefficient value of the target coefficient.
For example, in the scheduling method for multithreading according to at least one embodiment of the present disclosure, the preferentially selecting the first candidate thread as the target thread by the set number of consecutive operations based on the target coefficient further includes: in the case where the plurality of threads do not satisfy the first condition and the target coefficient is not zero, the target coefficient is kept unchanged.
For example, in the scheduling method for multithreading according to at least one embodiment of the present disclosure, preferentially selecting the first candidate thread as the target thread for the set number of consecutive operations based on the target coefficient further includes: determining the first candidate thread as the target thread and decrementing the target coefficient once in the case where the first candidate thread differs from the second candidate thread and the target coefficient is greater than zero.
For example, in a scheduling method for multithreading in accordance with at least one embodiment of the present disclosure, the target coefficient is kept unchanged in the case where the first candidate thread is the same as the second candidate thread.
For example, in a scheduling method for multithreading according to at least one embodiment of the present disclosure, obtaining the branch prediction history accuracy of each of a plurality of threads waiting to be scheduled in a queue includes: determining the branch prediction history accuracy of the plurality of threads based on the total number of branch predictions and the number of correct or incorrect branch predictions of each of the plurality of threads.
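For instance, under the assumption that the monitoring hardware exports the total prediction count together with either the correct count or the error count, the accuracy computation reduces to the following (helper names are illustrative):

```cpp
// Accuracy from counters; guards against division by zero before any
// prediction has been recorded.
#include <cstdint>

double accuracyFromCorrect(uint64_t total, uint64_t correct) {
    return total ? static_cast<double>(correct) / total : 0.0;
}

double accuracyFromErrors(uint64_t total, uint64_t wrong) {
    return total ? 1.0 - static_cast<double>(wrong) / total : 0.0;
}
```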
For example, in a scheduling method for multithreading in accordance with at least one embodiment of the present disclosure, the base scheduling algorithm includes: the Round-Robin algorithm, the Icount algorithm, the Bcount algorithm, or the MissCount algorithm.
For example, in a scheduling method for multithreading in accordance with at least one embodiment of the present disclosure, the queues include a branch prediction queue, a branch target queue, a decode queue, or an instruction dispatch queue.
For example, in a scheduling method for multithreading in accordance with at least one embodiment of the present disclosure, the plurality of threads are respectively provided by a plurality of hardware threads of a multithreaded processor, or the plurality of threads are respectively provided by a plurality of processor cores of a multi-core processor.
At least one embodiment of the present disclosure provides a scheduling apparatus for multithreading, comprising: an acquisition module configured to acquire branch prediction history accuracy of each of a plurality of threads waiting to be scheduled in a queue; a determining module configured to determine a target thread for the plurality of threads in the queue based on branch prediction history accuracy of the plurality of threads; a selection module configured to select the target thread in the queue for subsequent processing.
At least one embodiment of the present disclosure provides a processor comprising a queue and a multithreaded scheduling device. The multithreaded scheduling device includes: an acquisition module configured to acquire the branch prediction history accuracy of each of a plurality of threads waiting to be scheduled in the queue; a determination module configured to determine a target thread for the plurality of threads in the queue based on their branch prediction history accuracy; and a selection module configured to select the target thread in the queue for subsequent processing.
For example, a processor according to at least one embodiment of the present disclosure further includes a branch prediction monitoring module configured to acquire the total number of predictions and the number of correct or incorrect predictions corresponding to each of the plurality of threads, and to obtain the history accuracy rates from these counts.
For example, a processor in accordance with at least one embodiment of the present disclosure is a multithreaded processor or a multi-core processor, with the multiple threads respectively provided by multiple hardware threads of the multithreaded processor, or respectively provided by multiple processor cores of the multi-core processor.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings of the embodiments will be briefly described below, and it is apparent that the drawings in the following description relate only to some embodiments of the present disclosure, not to limit the present disclosure.
FIG. 1A shows a schematic diagram of a pipeline of a processor core;
FIG. 1B is a diagram illustrating a plurality of jump target addresses of a jump instruction;
FIG. 2A is a flow chart of a jump procedure for a branch instruction;
FIG. 2B is a flow chart of another jump procedure for a branch instruction;
FIG. 3 is a flow chart of branch instruction jump prediction;
FIG. 4A is a flow chart of thread scheduling for multiple threads;
FIG. 4B is a flow chart of a particular thread scheduling in the arbitration logic shown in FIG. 4A;
FIG. 5A is a flow chart of a scheduling method for multithreading provided in accordance with at least one embodiment of the present disclosure;
FIG. 5B is a schematic diagram of a multithreaded processor provided in accordance with at least one embodiment of the present disclosure;
FIG. 6 is a schematic diagram of branch prediction accuracy based thread scheduling provided in at least one embodiment of the present disclosure;
FIG. 7 is a schematic diagram of a scheduling apparatus for multithreading of a processor provided by at least one embodiment of the present disclosure;
fig. 8 is a schematic block diagram of an electronic device provided in accordance with at least one embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present disclosure. It will be apparent that the described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments, which can be made by one of ordinary skill in the art without the need for inventive faculty, are within the scope of the present disclosure, based on the described embodiments of the present disclosure.
Unless defined otherwise, technical or scientific terms used in this disclosure should be given the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The terms "first," "second," and the like, as used in this disclosure, do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that elements or items preceding the word are included in the element or item listed after the word and equivalents thereof, but does not exclude other elements or items. The terms "connected" or "connected," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", etc. are used merely to indicate relative positional relationships, which may also be changed when the absolute position of the object to be described is changed.
The present disclosure is illustrated by the following several specific examples. Detailed descriptions of known functions and known parts (elements) may be omitted for the sake of clarity and conciseness in the following description of the embodiments of the present disclosure. When any part (element) of an embodiment of the present disclosure appears in more than one drawing, the part (element) is denoted by the same or similar reference numeral in each drawing.
The CPU (Central Processing Unit) is the core component of a computer, responsible for executing the computer's instructions and controlling its operation. A modern multi-issue high-performance CPU includes at least one core (processor core), with multiple execution units included in each core to execute instructions. The basic operation of the CPU is to execute a stored instruction sequence (i.e., a program); the execution flow in the CPU is generally: instruction fetch, instruction decode, instruction execute, memory access, and result write-back.
FIG. 1A shows a schematic diagram of a pipeline of a processor core, with the dashed lines with arrows representing redirected instruction flow.
As shown in fig. 1A, a processor core (e.g., a CPU core) of a single-core processor or a multi-core processor improves instruction-level parallelism by pipelining. The processor core includes a plurality of pipeline stages: for example, after the pipeline feeds in program counters of various sources, the next program counter (PC) is selected by a multiplexer (Mux), and the instruction corresponding to the program counter undergoes branch prediction, instruction fetch, instruction decode, instruction dispatch and rename, instruction execution, instruction retirement, and so on. Wait queues, typically first-in-first-out (FIFO) queues, are provided between the pipeline stages as needed. For example, after the branch prediction unit, a branch prediction (BP) FIFO queue is provided to store branch prediction results; after the instruction fetch unit, an instruction cache (Instruction Cache, IC) FIFO is provided to buffer fetched instructions; after the instruction decode unit, a decode (DE) FIFO is provided to buffer decoded instructions; and after the instruction dispatch and rename unit, a retire (RT) FIFO is provided to buffer instructions waiting to be confirmed as finished after execution. The pipeline of the processor core also includes an instruction queue after instruction dispatch and renaming, to buffer instructions waiting for the execution units to execute them.
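As a toy illustration only, and not the patent's microarchitecture, the inter-stage wait queues named above (BP, IC, DE, RT) can be modeled as FIFOs tagged with the node they serve:

```cpp
// Each stage queue holds entries from the hardware threads that share it.
#include <cstdint>
#include <queue>

enum class StageQueue { BP, IC, DE, RT };

struct QueueEntry {
    int      threadId;  // which hardware thread produced the entry
    uint64_t pc;        // program counter of the associated instruction
};

struct StageFifo {
    StageQueue node;
    std::queue<QueueEntry> fifo;              // first-in-first-out
    void push(const QueueEntry& e) { fifo.push(e); }
    bool hasWork() const { return !fifo.empty(); }
};
```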
Specifically, during program execution, the CPU reads the instruction to be executed from memory or cache according to the instruction address stored in the program counter (PC) and executes it; it then either modifies the program counter by the instruction length to read the next instruction in sequence, or, when executing a jump instruction, writes into the program counter the target instruction address to which the jump instruction indicates a jump.
Jump instructions can be categorized into unconditional jumps and conditional jumps. An unconditional jump instruction does not depend on the data results of other instructions; its jump target can be obtained by decoding the instruction, and the jump target is fixed. A conditional jump instruction depends on the operation results of other instructions and jumps to different locations depending on those results, and may have one or more jump target addresses.
FIG. 1B is a diagram illustrating a plurality of jump target addresses of a jump instruction. As shown in fig. 1B, a jump instruction may jump to instruction 1, instruction 2, instruction 3, or the like, which is arranged later in the instruction sequence.
A conditional jump instruction is also called a branch instruction and contains a judgment condition; whether the condition is satisfied causes the program to execute different branches. In other words, the address of the next instruction to be executed differs according to the result of the judgment. An unconditional jump instruction contains no judgment condition; after it is executed, a specific instruction is always executed unconditionally, or the instructions continue to execute in sequence.
For example, machine instructions include jump instructions, which change the flow of the program; a jump causes the program to take one of multiple execution paths, i.e., branches. A branch instruction is generally handled in the CPU in one of the following two ways:
the first processing method is a method of directly identifying a branch instruction, wherein when an instruction fetching unit (Instruction Fetch Unit, IFU) encounters a jump instruction, the instruction fetching unit is stopped sending an instruction to a back end, after the jump instruction is decoded to obtain a decoding result or after an execution result obtained by executing an instruction on which the jump instruction depends is waited, an accurate jump target address of the jump instruction is obtained through the decoding result or the execution result, and a next instruction is sent based on the jump target address.
FIG. 2A is a flow chart of a branch instruction jump procedure, corresponding to an example of obtaining a jump target address from a decoding result in the above processing manner.
As shown in fig. 2A, when an instruction acquired by the fetch unit is identified as a branch instruction whose jump target can be obtained by decoding, the branch instruction is sent to the decode unit and waits to be decoded; during this period the IFU stops sending subsequent instructions to the back end. After the exact jump target address of the branch instruction is obtained by decoding, the IFU resumes fetching at the jump target address and sends the instructions to the back end.
FIG. 2B is a flowchart of another branch instruction jump procedure, corresponding to an example of the above processing mode for obtaining the jump target address by the execution result.
As shown in fig. 2B, when the instruction fetched by the fetch unit is identified as a branch instruction whose jump target depends on the result of another instruction, the pipeline waits for that instruction to be executed, and the IFU stops sending subsequent instructions to the back end during the wait; after the execution of that instruction is complete and the exact jump target address of the branch instruction is obtained from its execution result, the IFU resumes fetching at the jump target address and sends the instructions to the back end.
The second way is to obtain the jump target of the instruction by branch prediction: the branch prediction unit predicts the jump target of the jump instruction, and the jump of the branch instruction is carried out according to the prediction result. A branch predictor (BP) may be used to predict whether an instruction jumps, the direction of the jump, the target address of the jump, and so on. For example, if the prediction is correct, the pipeline can continue without interruption; if the prediction is incorrect, the instructions or micro-instructions that entered the pipeline after the branch instruction are flushed, and fetching restarts from the address to which the branch instruction actually jumped. Overall, branch prediction can improve the pipeline efficiency of the CPU.
FIG. 3 is a flow chart of branch instruction jump prediction. As shown in fig. 3, the whole branch instruction jump prediction process is divided into a front end and a back end. In the back end, a series of steps including instruction decoding, instruction buffering, instruction dispatch, instruction execution, and data write-back is performed in sequence, and information such as the historical jump results of instructions is collected and sent to the branch prediction unit (Branch Prediction Unit, BPU) of the front end. The BPU predicts in advance, from the historical branch instruction jump information, the location to which the current branch is most likely to jump, and sends the prediction result to a branch target queue (Fetch Target Queue, FTQ) for temporary storage; the prediction result is then sent to the instruction fetch unit (Instruction Fetch Unit, IFU), which starts fetching instructions from memory or cache according to the predicted fetch location in the branch target queue and passes them onward, preventing the CPU pipeline from stalling.
The role of the branch prediction unit (BPU) is to predict the target addresses of branch instructions, with a certain probability of misprediction. The branch target queue (FTQ) is a buffer queue between the BPU and the IFU, used to temporarily store the fetch targets predicted by the branch prediction unit and to send fetch requests to the fetch unit according to those targets. The instruction fetch unit (IFU) reads instruction data from memory or cache and passes it onward, and the instructions read are decoded.
As described above, branch prediction (BP) is a technique that predicts the jump target position of a branch jump instruction in advance: based on the history information of the instruction, the most likely jump position of the current branch instruction is predicted, using techniques for branch instruction prediction including, but not limited to, the BTB (Branch Target Buffer) and TAGE (TAgged GEometric history length branch predictor), and the like.
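The predictors named above (BTB, TAGE) are elaborate structures; as a simpler stand-in that still predicts from jump history, a classic table of 2-bit saturating counters indexed by the program counter can be sketched as follows, with the table size and index hash chosen arbitrarily for illustration:

```cpp
// Each 2-bit counter remembers the recent taken/not-taken history of the
// branches that map to its slot; 0/1 predict not taken, 2/3 predict taken.
#include <array>
#include <cstdint>

class TwoBitPredictor {
public:
    bool predictTaken(uint64_t pc) const {
        return table[index(pc)] >= 2;       // 2, 3 = predict taken
    }
    void update(uint64_t pc, bool taken) {  // train on the real outcome
        uint8_t& c = table[index(pc)];
        if (taken) { if (c < 3) ++c; } else { if (c > 0) --c; }
    }
private:
    static std::size_t index(uint64_t pc) { return (pc >> 2) & 1023; }
    std::array<uint8_t, 1024> table{};      // starts "strongly not taken"
};
```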
The accuracy of the various branch prediction techniques in the current processor field is high (for example, it can generally exceed 90%), which plays an important role in improving CPU performance. For a branch prediction error, however, a back-end module (such as the decode module or the execute module) sends feedback information to the branch prediction unit and the fetch unit to update the branch information of the misprediction, instruction fetch on the wrong path is stopped, all instructions and execution results after the mispredicted instruction are flushed from the execution pipeline, and execution then resumes from the branch prediction position. This can cause the CPU to execute a large number of useless instructions and brings a great performance loss to the processor. Thus CPU performance is improved when BP accuracy is high but degraded when BP accuracy is low, so the performance of the system lacks stability.
Further, the accuracy of branch prediction techniques is limited mainly by the following three aspects:
1. the algorithm for branch prediction itself has a certain probability of error;
2. branch prediction needs to be based on a large amount of branch instruction jump history; when a CPU executes a new code segment, the code segment is being executed for the first time, so its branch instruction history is empty and a large number of mispredictions can occur, affecting the overall performance of the system; only when there is enough history information for the code segment can the prediction accuracy be greatly improved;
3. the regularity of the program's own behavior may not be obvious, making accurate prediction difficult (the sketch following this list illustrates points 2 and 3).
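A small self-contained example makes these limits concrete: the loop's branches follow a regular pattern that history-based prediction learns quickly, while a branch on random data has no pattern to exploit. The program below is illustrative only:

```cpp
// Compare a regular branch with a data-dependent one.
#include <cstdlib>
#include <iostream>

int main() {
    int regularHits = 0, randomHits = 0;
    for (int i = 0; i < 1000; ++i) {   // the loop's back branch is taken
                                       // 999 times in a row: easy to learn
        if (i < 900)                   // regular pattern: predictable once
            ++regularHits;             // enough history has accumulated
        if (std::rand() % 2 == 0)      // ~50/50 data-dependent branch:
            ++randomHits;              // close to the worst case for a BP
    }
    std::cout << regularHits << ' ' << randomHits << '\n';
    return 0;
}
```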
A pipelined architecture (also simply "pipeline") for instruction execution of a simultaneous multithreading (SMT) processor supports the simultaneous execution of two or more (e.g., 4 or 8, etc.) threads (hardware threads). For example, SMT processors are typically single-core processors. In the pipeline of a processor supporting simultaneous multithreading, one or more computing resources are shared by multiple threads; for example, each thread has a separate logical register file. Among the queues of the pipeline's various control functions, some are shared by multiple threads, such as the instruction dispatch queue, and some are statically partitioned among multiple threads, such as the instruction reorder queue. Simultaneous multithreading improves the utilization of pipeline resources by exploiting the parallelism among threads.
Fig. 4A is a schematic flow diagram of thread scheduling in a Simultaneous Multithreading (SMT) processor. As shown in fig. 4A, in the multithreading (for example, 2 threads) mode, two threads T0 and T1 to be scheduled are selected by the arbitration logic, and a target thread Tx (x=0 or 1) for scheduling execution is determined.
For example, a Branch Prediction Unit (BPU) predicts a fetch target location for each thread to be scheduled by a prediction algorithm, so as to ensure that the fetch unit (IFU) can continuously fetch instructions from a memory or a cache for each thread to complete subsequent operations.
For example, current thread switching schemes for a processor (CPU) or processor core in simultaneous multithreading mode typically select one of a plurality of threads to be scheduled for subsequent operations (e.g., decoding, execution, etc.) at a node position (see fig. 1A) such as the branch prediction unit (BPU), branch target queue (FTQ), instruction cache queue, decode queue, or instruction dispatch queue, according to one of several algorithms such as the Round-Robin algorithm (i.e., a polling scheduling algorithm), the Icount algorithm, the Bcount algorithm, or the Miss Count algorithm. The Round-Robin algorithm is a polling scheduling algorithm that serves the threads to be processed in turn. In the Bcount algorithm, the number of decoded but not yet executed jump instructions is the thread priority scheduling condition: a thread with fewer pending jump instructions has higher priority. In the Miss Count algorithm, the number of query misses occurring in one or more cache levels is the thread priority scheduling condition: a thread with fewer cache misses has higher priority in the memory access part. In the Icount algorithm, the number of instructions to be executed is the thread priority scheduling condition: a thread with fewer instructions to be executed has higher priority. The specific implementations of these known algorithms are not described further here.
FIG. 4B is a flow chart of a particular thread scheduling in the arbitration logic shown in FIG. 4A. The algorithms used at different nodes may differ. Taking the node at which the branch target queue selects the target thread, and the Round-Robin algorithm, as an example, the scheduling rule at this node is:
if neither thread T0 nor T1 satisfies the execution condition, the round (cycle) is treated as a program process pause (stall);
if only 1 thread of the two threads T0 and T1 meets the execution condition, selecting the thread;
if both threads T0 and T1 meet the execution condition, then the first round of thread scheduling (cycle 0) selects thread T0, the second round of scheduling (cycle 1) selects thread T1, the third round of thread scheduling (cycle 2) selects thread T0, and so on.
Or, taking the Icount algorithm as an example, i.e., judging thread scheduling priority according to the number of instructions to be executed in the instruction cache, the scheduling rule at this node is:
if neither thread T0 nor T1 satisfies the execution condition, the round is regarded as a program process pause;
if only 1 thread of the two threads T0 and T1 meets the execution condition, selecting the thread;
if both threads T0 and T1 satisfy the execution condition, the thread with the smaller number of instructions to execute is selected for priority scheduling (both rule sets are sketched below).
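Both rule sets can be written as arbitration functions for threads T0 and T1; the readiness flags, pending-instruction counts, and cycle counter are assumed to be supplied by the surrounding pipeline logic:

```cpp
// std::nullopt signals a stall round (neither thread ready).
#include <optional>

// Round-Robin at this node: alternate between T0 and T1 when both ready.
std::optional<int> roundRobinPick(bool ready0, bool ready1, unsigned cycle) {
    if (!ready0 && !ready1) return std::nullopt;   // neither thread ready
    if (ready0 != ready1)   return ready0 ? 0 : 1; // exactly one ready
    return (cycle % 2 == 0) ? 0 : 1;               // alternate T0, T1, ...
}

// Icount at this node: fewer instructions waiting to execute wins.
std::optional<int> icountPick(bool ready0, bool ready1,
                              int pending0, int pending1) {
    if (!ready0 && !ready1) return std::nullopt;   // neither thread ready
    if (ready0 != ready1)   return ready0 ? 0 : 1; // exactly one ready
    return (pending0 <= pending1) ? 0 : 1;
}
```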
In the above flow and other similar flows, the occurrence of branch prediction (BP) errors is not considered and thread scheduling is not specifically optimized for them. However, in multithreaded scenarios, when the branch misprediction rate (error rate) of a thread is high, the instruction fetch unit (IFU) sends a large number of instructions on the wrong branch path to the back end; these instructions contend with other threads, waste a large amount of system power, and reduce system performance.
Similarly, in a multi-core processor including a plurality of processor cores, a processor core cluster (CPU Complex, CCX) is composed of a plurality of processor cores (Cores); for example, the processor cores in each CCX share a level-3 cache (L3 Cache). In addition, multiple CCXs are located on the same core chip (CCD), multiple CCDs share a Data Fabric interface, and all CCDs share the entire DRAM. Therefore, in a multi-core processor there is also contention between cores for shared resources. And in a multi-core processor, each core may be regarded as a hardware thread.
In view of the foregoing, at least one embodiment of the present disclosure provides a scheduling method and a scheduling apparatus for multithreading. During thread scheduling, the BP information (such as BP error information) of each thread is collected, and according to this information the probability that a thread with a high misprediction rate is scheduled from a target queue (such as the FTQ) is reduced; for example, the probability may be lowered gradually from high to low, or directly reduced to some minimum. This reduces the occupation of CPU resources by the instructions that follow a mispredicted instruction and their contention for the resources of other threads, thereby increasing the resource utilization efficiency of the other threads and improving the overall performance of the CPU.
At least one embodiment of the present disclosure provides a scheduling method for simultaneous multithreading (SMT) processors (herein simply "multithreaded processors") supporting, for example, 2 or more threads (e.g., 4 threads, 8 threads, etc.), or for multi-core processors whose cores each provide, for example, at least one thread. FIG. 5A shows a flow chart of the scheduling method. Referring to fig. 5A, the scheduling method for multithreading includes the following steps 100 to 300:
step 100, obtaining the branch prediction history accuracy of each of a plurality of threads waiting to be scheduled in a queue;
step 200, determining a target thread for a plurality of threads in a queue based on branch prediction history accuracy of the plurality of threads;
step 300, selecting a target thread in the queue for subsequent processing.
At least one embodiment of the present disclosure also provides a processor, which may be a multithreaded processor or a multi-core processor, configured to perform the scheduling method of the above embodiments. As noted above, the method is applicable not only to a single-core processor but also to a multi-core processor: in a scenario where priorities must be decided among different threads of a single-core processor, the scheduling priority of each thread is determined according to its branch prediction history accuracy; in a scenario where priorities must be decided among threads of different processor cores of a multi-core processor, the scheduling priority of each core is determined according to the branch prediction history accuracy of its threads. The following embodiments are described with respect to internal thread scheduling of a (e.g., single-core) multithreaded processor.
Fig. 5B shows a schematic diagram of a multithreaded processor. Referring to fig. 5B, the multithreaded processor 10 includes one or more queues 101 and a multithreaded scheduler 102. The multithreaded scheduler 102 is configured to perform the scheduling method described above and includes an acquisition module 1021, a determination module 1022, and a selection module 1023. The acquisition module 1021 is configured to acquire the branch prediction history accuracy of each of a plurality of threads waiting to be scheduled in a queue; the determination module 1022 is configured to determine a target thread for the plurality of threads in the queue based on their branch prediction history accuracy; and the selection module 1023 is configured to select the target thread in the queue for subsequent processing.
The queues mentioned here are, for example, queues at any node position such as the branch prediction unit (BPU), branch target queue (FTQ), instruction cache queue, decode queue, or instruction dispatch queue. Embodiments of the present disclosure are not limited to these; the method may also be applied to queues at other nodes of a multithreaded processor that are shared by multiple threads and require multithreaded scheduling. Likewise, the processor may be dual-threaded, four-threaded, etc.; embodiments of the present disclosure are not limited in this regard.
When a program runs on a multithreaded processor, instructions of multiple threads being scheduled wait simultaneously in the same queue, and one thread (i.e., the target thread) needs to be selected from among them for subsequent processing. The subsequent operations differ for different nodes: for example, if the queue is a decode queue, the instructions in the queue are sent to the decode unit for decoding; if the queue is an instruction dispatch queue, the instructions in the queue are sent to the execution units for execution.
In the scheduling method of at least one embodiment of the present disclosure, before determining the target thread for the plurality of threads in the queue based on their branch prediction history accuracy, the method may further include judging whether the plurality of threads satisfy an executable condition. Different cases are handled differently in this judgment. For example, when none of the plurality of threads satisfies the executable condition, scheduling of the plurality of threads is suspended (stalled); when one and only one of the plurality of threads satisfies the executable condition, the thread satisfying the executable condition is determined as the target thread; and when more than one of the plurality of threads satisfies the executable condition, the method continues to determine the target thread for those threads in the queue based on their branch prediction history accuracy.
By excluding the threads that do not currently satisfy the execution condition from the target-thread determination before it takes place, it is possible to avoid suspending the flow or having to reschedule because the finally selected thread does not meet the executable condition, thereby avoiding the waste of scheduling resources and the meaningless occupation of system computing power.
In a scheduling method of at least one embodiment of the present disclosure, determining the target thread for the plurality of threads in the queue based on their branch prediction history accuracy includes: determining the thread with the highest branch prediction history accuracy among the plurality of threads as the target thread.
Directly determining the target thread from the branch prediction history accuracy of the multiple candidate threads simplifies the scheduling flow, saves system computing power, and improves system performance.
Alternatively, in a scheduling method of at least another embodiment of the present disclosure, determining the target thread for the plurality of threads in the queue based on their branch prediction history accuracy includes: determining the target thread for the plurality of threads in the queue by combining their branch prediction history accuracy with a base scheduling algorithm for the queue.
In addition to determining the target thread directly by comparing the branch prediction history accuracy of the multiple threads (hereinafter also "history accuracy" or "branch prediction accuracy"), in other embodiments the target thread may be selected by combining the history accuracy with the result predicted by the base scheduling algorithm, so that the two scheduling methods supplement and coordinate with each other, improving the flexibility of the system.
For example, determining the target thread for the plurality of threads in the queue by combining their branch prediction history accuracy with the base scheduling algorithm for the queue includes:
determining a first candidate thread among the plurality of threads using their branch prediction history accuracy;
determining a second candidate thread among the plurality of threads using the base scheduling algorithm;
and selecting the first candidate thread as the target thread when the first candidate thread and the second candidate thread are the same, or preferentially selecting the first candidate thread as the target thread for the set number of consecutive operations when they are different.
In the above embodiment, setting the number of consecutive skips guarantees the high priority of the thread with high prediction accuracy over X consecutive scheduling rounds (X being the target coefficient, or consecutive skip count); in addition, the scheduling is corrected by the base scheduling policy with a certain probability, which avoids the situation where the thread with high prediction accuracy is always selected while the other threads are never selected and are in effect suspended.
For example, when the first candidate thread and the second candidate thread are different, preferentially selecting the first candidate thread as the target thread for the set number of consecutive operations includes: obtaining a target coefficient, where the target coefficient is the number of times the first candidate thread currently remains to be preferentially selected as the target thread; and preferentially selecting the first candidate thread as the target thread for the set number of consecutive operations based on the target coefficient.
Here, the target coefficient is an integer equal to or greater than zero.
For example, preferentially selecting the first candidate thread as the target thread for the set number of consecutive operations based on the target coefficient further includes: judging whether the plurality of threads satisfy a first condition, where the first condition is that the difference value between the branch prediction history accuracy rates of the threads being processed among the plurality of threads is smaller than a preset threshold and the target coefficient is zero; and selecting the second candidate thread as the target thread in the case where the plurality of threads satisfy the first condition.
In the above embodiment, it is first judged whether the difference value of the branch prediction history accuracy rates among the plurality of threads lies within a certain preset range, that is, whether the "jitter" of the threads' history accuracy data stays within that range, and whether the skip coefficient equals zero. A zero coefficient means either that no "skip count" has been set before this step, or that the previously set "skip count" has already been used up as expected. When both conditions are satisfied at the same time, the thread determined by the base scheduling algorithm can be taken as the target thread. This avoids the problems described above, further balances the scheduling result based on branch prediction history accuracy, and improves the flexibility of the system. Moreover, because the judgment process is relatively simple, using the target coefficient also saves system computing power and improves scheduling efficiency.
For example, preferentially selecting the first candidate thread as the target thread for the set number of consecutive operations based on the target coefficient further includes: in the case where the plurality of threads do not satisfy the first condition and the target coefficient is zero, resetting the target coefficient based on a difference value of the branch prediction history accuracy rates among the threads being processed, where the difference value is the difference between the branch prediction history accuracy rates most recently acquired for the plurality of threads. The next round of the scheduling strategy based on branch prediction history accuracy is started by resetting the target coefficient.
For example, the statistical history sample set is defined within the last 50,000 CPU clock cycles, and the difference between the branch prediction history accuracy most recently acquired for thread T0 and that most recently acquired for thread T1 is taken as the difference value.
For example, the difference value is positively correlated with the coefficient value of the target coefficient.
In a scheduling method of at least one embodiment of the present disclosure, the base scheduling algorithm includes the Round-Robin algorithm, the Icount algorithm, the Bcount algorithm, or the MissCount algorithm.
In a scheduling method of at least one embodiment of the present disclosure, obtaining a branch prediction history accuracy of each of a plurality of threads waiting to be scheduled in a queue includes: and determining the branch prediction history accuracy of the plurality of threads based on the total number of branch predictions and the correct number or the incorrect number of branch predictions of each of the plurality of threads.
Fig. 6 is a schematic diagram of BP-accuracy-based thread scheduling according to at least one embodiment of the present disclosure. In the embodiment shown in fig. 6, two-thread scheduling is illustrated as an example; as described above, it should be understood that other embodiments of the present disclosure may support scheduling more threads, such as 2, 4, 8, or 32, and the present disclosure is not limited in this respect. In this embodiment, the target thread is determined for the plurality of threads in the queue by combining their branch prediction history accuracy with the base scheduling algorithm for the queue.
First, at a certain node in the multithreaded processor, for the plurality of threads waiting to be scheduled in the current queue (here, threads T0 and T1), it is judged whether none of the threads satisfies the executable condition (step 501); if so, scheduling of all the threads is suspended (stall), for example, the scheduling process ends (step 502); if not, it is judged whether one and only one of the threads satisfies the executable condition (step 506).
If one and only one thread (thread Tx, x=0 or 1) satisfies the executable condition, that thread Tx is selected as the target thread for subsequent processing (step 503). Otherwise, the prediction accuracy (i.e., branch prediction history accuracy) of the most recent multiple (e.g., 300) branch instructions of each thread, statistically collected since the start of the scheduling process, is received from the branch prediction monitoring module (step 507). It is then judged whether the plurality of threads satisfy the first condition, namely that the difference between the accuracy rates (e.g., the absolute value of their difference or the ratio between them) is smaller than the preset threshold and the number of times a target thread currently remains to be preferentially selected based on branch prediction history accuracy (hereinafter the "target coefficient") equals zero (step 508). If the result of the judgment is true (i.e., yes), that is, the first condition is satisfied, scheduling is performed by the Round-Robin algorithm (an example of the base scheduling algorithm) (step 504), the scheduling result is acquired, and the target thread (thread Tx, x=0 or 1) is selected for subsequent processing (step 505). Conversely, if the result of the judgment is false (i.e., no), that is, the first condition is not satisfied, it is further judged whether the target coefficient X is zero (step 509). If the target coefficient X is 0 (X=0), the target coefficient is re-determined based on the difference values (e.g., the difference, the ratio, or another coefficient that can represent the difference) among the plurality of threads (step 510); for example, the larger the difference value, the larger the re-determined target coefficient, i.e., the difference value and the target coefficient are positively correlated. If the target coefficient X is not 0, it is further judged whether the thread selected by branch prediction history accuracy should be adjusted to the other thread according to the base scheduling algorithm (steps 511 to 521).
For example, it is judged whether the branch prediction history accuracy of thread T1 is greater than that of T0 (step 511). If so, it is then judged whether the thread selected by the Round-Robin algorithm is thread T0 (step 512), which is equivalent to judging whether thread T1, which has the greater branch prediction history accuracy, is the same thread as the one selected by the Round-Robin algorithm; if they are the same (i.e., the Round-Robin algorithm did not select T0), thread T1 is selected as the target thread (step 515). If thread T1 is not the same thread as the one selected by the Round-Robin algorithm, it is further judged whether the current target coefficient is greater than zero (step 513); if so, the target coefficient is decremented once (step 514) and thread T1 is selected as the target thread (step 515); otherwise, if the current target coefficient is not greater than zero, the other thread T0 is selected as the target thread (step 516). The target thread is then used for subsequent processing.
If it is judged that the branch prediction history accuracy of thread T1 is not greater than that of T0, it is then judged whether the thread selected by the Round-Robin algorithm is thread T1 (step 517), which is equivalent to judging whether thread T0, which has the greater branch prediction history accuracy, is the same thread as the one selected by the Round-Robin algorithm; if they are the same (i.e., the Round-Robin algorithm did not select T1), thread T0 is selected as the target thread (step 521). If thread T0 is not the same thread as the one selected by the Round-Robin algorithm, it is further judged whether the current target coefficient is greater than zero (step 518); if so, the target coefficient is decremented once (step 519), i.e., its value is reduced by one, and thread T0 is selected as the target thread (step 521); otherwise, if the current target coefficient is not greater than zero, the other thread T1 is selected as the target thread (step 520). The target thread is then used for subsequent processing.
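The FIG. 6 walkthrough condenses to the following sketch for two threads; the accuracy threshold and the reset value for the target coefficient are parameters here (a concrete mapping for the reset value, matching the example in the next paragraph, is sketched after it), and the step numbers in the comments refer to FIG. 6:

```cpp
// Condensed two-thread arbitration following the FIG. 6 flow.
#include <cmath>
#include <optional>

struct ArbiterState { int x = 0; unsigned cycle = 0; };  // X = "skip count"

std::optional<int> scheduleOnce(bool ready0, bool ready1,
                                double acc0, double acc1,
                                double threshold, int skipsOnReset,
                                ArbiterState& s) {
    ++s.cycle;
    if (!ready0 && !ready1) return std::nullopt;      // steps 501-502: stall
    if (ready0 != ready1)   return ready0 ? 0 : 1;    // steps 506, 503
    int rr = (s.cycle % 2 == 0) ? 0 : 1;              // base algorithm pick
    double diff = std::abs(acc0 - acc1);              // step 507 statistics
    if (diff < threshold && s.x == 0) return rr;      // steps 508, 504-505
    if (s.x == 0) s.x = skipsOnReset;                 // steps 509-510
    int byAccuracy = (acc1 > acc0) ? 1 : 0;           // step 511
    if (byAccuracy == rr) return byAccuracy;          // steps 512/517: agree
    if (s.x > 0) { --s.x; return byAccuracy; }        // steps 513-515/518-521
    return rr;                                        // steps 516/520
}
```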
In step 510 above, an exemplary method of determining the "skip count" (i.e., the target coefficient X) from the difference value of the branch history accuracy (hereinafter simply "history accuracy") among the plurality of threads is: for the absolute value Diff of the maximum accuracy difference among the threads, X=1 when 1 <= Diff < 1.2; X=2 when 1.2 <= Diff < 1.3; X=3 when 1.3 <= Diff < 1.4; X=4 when 1.4 <= Diff < 1.5; and X=5 when Diff >= 1.5. For example, the difference value (Diff) may be determined, as in the above example, by the maximum of the absolute values of the differences between the threads' history accuracy rates, or by the maximum of the ratios between the threads' history accuracy rates, or the difference value may be the difference between the branch prediction history accuracy rates most recently acquired for the threads; the embodiments of the present disclosure are not limited in this regard.
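Read as a function, the quoted mapping is as follows; this is an illustrative rendering with interval bounds exactly as listed above (a Diff below 1 would already have fallen under the first-condition branch):

```cpp
// Step 510: map the accuracy difference to the skip count X.
int skipsFromDiff(double diff) {
    if (diff >= 1.5) return 5;
    if (diff >= 1.4) return 4;
    if (diff >= 1.3) return 3;
    if (diff >= 1.2) return 2;
    return 1;   // covers 1 <= diff < 1.2
}
```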
As described above, after one "skip" is performed for the thread with the higher branch prediction accuracy (referred to as the first candidate thread), the target coefficient is decremented by one, and in the next round of scheduling the updated target coefficient participates in the determination. In this way, the "number of skips" mechanism of the present disclosure allows the thread with the higher current branch prediction history accuracy to be prioritized over the thread given by the base scheduling algorithm, guaranteeing that the more accurately predicted thread is scheduled with high priority for a target number of consecutive rounds, regardless of the result of the base scheduling policy.
Further, using the "number of skips" (i.e., setting the target coefficient) avoids the situation in which the base scheduling policy, scheduling according to its internal rules, happens to yield the same thread as the high-priority thread, so that thread scheduling based on prediction accuracy would take no effect, or only partial effect, and the accuracy-based scheduling policy would not exert the expected influence on the base policy.
For example, the preset threshold for the difference value of the history accuracy of the plurality of threads may be designed as a fixed value or as a configurable value.
For example, in a scheduling method for multithreading according to at least one embodiment of the present disclosure, preferentially selecting the first candidate thread as the target thread for a set number of consecutive operations based on the target coefficient further comprises: keeping the target coefficient unchanged in the case where the plurality of threads do not satisfy the first condition and the target coefficient is not zero.
In at least one embodiment, when the difference value between the history accuracy rates of the threads is smaller than the preset threshold, the target coefficient is kept unchanged as long as it is not zero, and is carried into the subsequent scheduling flow.
In the above embodiment, when the thread selected by branch prediction history accuracy (i.e., the first candidate thread) and the thread determined by the base scheduling algorithm (i.e., the second candidate thread) are not the same thread and the target coefficient is greater than zero, i.e., the number of skips is not zero, the thread selected by branch prediction history accuracy may be given the higher execution priority by consuming one skip, so that this thread is selected as the target thread for subsequent processing.
In the above embodiment, when the thread selected by branch prediction history accuracy (i.e., the first candidate thread) and the thread determined by the base scheduling algorithm (i.e., the second candidate thread) are the same thread, that thread is selected as the target thread, and the target coefficient is kept unchanged for the next pass of the flow.
For example, in a scheduling method for multithreading according to at least one embodiment of the present disclosure, obtaining the branch prediction history accuracy of each of a plurality of threads waiting to be scheduled in a queue includes: determining the branch prediction history accuracy of the plurality of threads based on the total number of branch predictions and the number of correct or incorrect branch predictions of each of the plurality of threads.
For example, the branch prediction monitoring module starts operating when the processor (CPU) starts, collecting and recording, for each of the plurality of threads, the number of correct (or incorrect) predictions and the total number of predictions, from which the branch prediction accuracy is obtained, e.g., branch prediction accuracy = number of correct predictions / total number of predictions. The way of obtaining the branch prediction history accuracy is not limited to the above method; for example, the historical branch prediction results within a certain period of time or another range may serve as the basis of determination, instead of using the execution results within a certain number of instructions as the history accuracy, and this embodiment is not limited in this respect.
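As a sketch of the bookkeeping just described (the type and member names are assumptions for illustration), a pair of per-thread counters suffices:

```cpp
// Per-thread counters kept by the branch prediction monitoring module from
// processor start-up; accuracy = correct predictions / total predictions.
struct BranchStats {
    unsigned long long total   = 0;  // total number of branch predictions
    unsigned long long correct = 0;  // number of correct predictions

    void record(bool wasCorrect) {   // called once per resolved branch
        ++total;
        if (wasCorrect) ++correct;
    }

    double accuracy() const {        // branch prediction history accuracy
        return total ? static_cast<double>(correct) / total : 0.0;
    }
};
```

Consistent with the alternatives noted above, a sliding window over, e.g., the last 300 predictions, or a time-based window, could replace the running totals.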
Embodiments of the present disclosure may also combine further optional base scheduling algorithms with the branch prediction history accuracy condition to perform scheduling.
For example, in a scheduling method for multithreading in accordance with at least one embodiment of the present disclosure, the queue involved may be a branch prediction queue, a branch target queue, a decode queue, or an instruction dispatch queue.
At least one embodiment of the present disclosure provides a scheduling apparatus for multithreading, referring to fig. 7, including an acquisition module 100, a determination module 200, and a selection module 300. The acquisition module 100 is configured to acquire a branch prediction history accuracy of each of a plurality of threads waiting to be scheduled in a queue; the determination module 200 is configured to determine a target thread for the plurality of threads in the queue based on the branch prediction history accuracy of the plurality of threads; the selection module 300 is configured to select a target thread in the queue for subsequent processing.
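In outline, the three modules of fig. 7 could compose as below — a structural sketch under assumed interfaces, with the determination step reduced to a highest-accuracy pick for brevity (the combination with a base scheduling algorithm described above could be substituted):

```cpp
#include <algorithm>
#include <vector>

// Structural sketch of the apparatus of fig. 7; all interfaces are illustrative.
class SchedulingApparatus {
public:
    // Acquisition module 100: history accuracies of the threads in the queue.
    std::vector<double> acquire() const { return accuracies_; }

    // Determination module 200: here simply the thread with the highest accuracy.
    int determine(const std::vector<double>& acc) const {
        return static_cast<int>(std::max_element(acc.begin(), acc.end()) - acc.begin());
    }

    // Selection module 300: hand the target thread to subsequent processing.
    void select(int target) { lastTarget_ = target; }

private:
    std::vector<double> accuracies_{0.95, 0.90};  // placeholder sample data
    int lastTarget_ = -1;
};
```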
For example, the acquisition module 100 is further configured to determine the branch prediction history accuracy of the plurality of threads based on the total number of branch predictions and the number of correct or incorrect branch predictions of each of the plurality of threads.
For example, the scheduling apparatus may further include a judging module configured to judge whether the plurality of threads satisfy the executable condition before the target thread is determined for the plurality of threads in the queue based on their branch prediction history accuracy, wherein scheduling of the plurality of threads is suspended if none of the plurality of threads satisfies the executable condition; the satisfying thread is determined as the target thread and scheduling of the plurality of threads is suspended if one and only one of the plurality of threads satisfies the executable condition; or determining the target thread based on the branch prediction history accuracy of the plurality of threads continues for the more than one thread if more than one of the plurality of threads satisfies the executable condition.
For example, for determining the target thread for the plurality of threads in the queue based on their branch prediction history accuracy, the determination module is further configured to determine the thread with the highest branch prediction history accuracy among the plurality of threads as the target thread.
Alternatively, for example, for determining the target thread for the plurality of threads in the queue based on their branch prediction history accuracy, the determination module is further configured to determine the target thread for the plurality of threads in the queue in combination with the branch prediction history accuracy of the plurality of threads and the base scheduling algorithm for the queue.
For example, for determining the target thread for the plurality of threads in the queue in combination with their branch prediction history accuracy and the base scheduling algorithm for the queue, the determination module is further configured to: determine a first candidate thread among the plurality of threads using the branch prediction history accuracy of the plurality of threads; determine a second candidate thread among the plurality of threads using the base scheduling algorithm; and select the first candidate thread as the target thread when the first candidate thread and the second candidate thread are the same, or preferentially select the first candidate thread as the target thread for a set number of consecutive operations when they are different.
For example, for the case where the first candidate thread and the second candidate thread are different and the first candidate thread is to be preferentially selected as the target thread for a set number of consecutive operations, the determination module is further configured to: obtain a target coefficient, the target coefficient being the number of times the first candidate thread is currently to be preferentially selected as the target thread; and preferentially select the first candidate thread as the target thread for the set number of consecutive operations based on the target coefficient.
For example, for preferentially selecting the first candidate thread as the target thread for a set number of consecutive operations based on the target coefficient, the determination module is further configured to: reset the target coefficient based on the difference value of the branch prediction history accuracy between the processed threads in the case where the plurality of threads do not satisfy the first condition and the target coefficient is zero, wherein the difference value is the difference between the branch prediction history accuracy rates most recently acquired for the respective threads.
For example, for preferentially selecting the first candidate thread as the target thread for a set number of consecutive operations based on the target coefficient, the determination module is further configured to: keep the target coefficient unchanged in the case where the plurality of threads do not satisfy the first condition and the target coefficient is not zero.
For example, for preferentially selecting the first candidate thread as the target thread for a set number of consecutive operations based on the target coefficient, the determination module is further configured to: determine the first candidate thread as the target thread and decrement the target coefficient once in the case where the first candidate thread is different from the second candidate thread and the target coefficient is greater than zero.
For example, in the case where the first candidate thread is the same as the second candidate thread, the target coefficient is kept unchanged.
For example, the base scheduling algorithm includes the Round-Robin algorithm, the Icount algorithm, the Bcount algorithm, the MissCount algorithm, or the like.
For example, for obtaining the branch prediction history accuracy of each of the plurality of threads waiting to be scheduled in the queue, the acquisition module is further configured to: determine the branch prediction history accuracy of the plurality of threads based on the total number of branch predictions and the number of correct or incorrect branch predictions of each of the plurality of threads.
For example, in embodiments of the present disclosure, the processor is a multithreaded processor or a multi-core processor, and the plurality of threads are respectively spawned by a plurality of hardware threads of the multithreaded processor or by a plurality of processor cores of the multi-core processor.
For example, in an embodiment of the present disclosure, the queue may be a branch prediction queue, a branch target queue, a decode queue, or an instruction dispatch queue; the present disclosure is not limited in this regard.
The acquisition module, the determination module, the selection module, and the judging module may be implemented by firmware, hardware, software, or any combination thereof, for example, and embodiments of the present disclosure are not limited thereto.
At least one embodiment of the present disclosure further provides a scheduling device for multithreading, where the device includes a processing unit and a memory, the memory storing executable instructions that, when executed by the processing unit, implement the scheduling method described above.
At least one embodiment of the present disclosure also provides a non-transitory readable storage medium having stored thereon computer instructions, wherein the computer instructions, when executed by a processor, implement a scheduling method as in any of the embodiments above.
For example, a processor in accordance with at least one embodiment of the present disclosure further includes a branch prediction monitoring module configured to obtain a total number of predictions and a correct number of predictions or a wrong number of predictions, respectively, for a plurality of threads, and to obtain a historical accuracy using the total number of predictions and the correct number of predictions or the wrong number of predictions for the plurality of threads. For example, the branch prediction monitoring module may be implemented in hardware, software, firmware, or any combination thereof.
For example, the processor of at least one embodiment of the present disclosure may further include, as needed, a branch prediction unit, an instruction fetch unit, a decode unit, various types of execution units (such as an arithmetic operation unit, a multiplication unit, an address generation unit, a logic operation unit, etc.), a register rename unit, and various caches, which are not described in detail herein.
Embodiments of the present disclosure do not limit the type of microarchitecture employed by the processor; for example, a CISC or RISC microarchitecture may be employed, such as an X86, ARM, or RISC-V microarchitecture.
At least some embodiments of the present disclosure also provide an electronic device comprising a processor of any one of the embodiments described above.
Fig. 8 is a schematic block diagram of an electronic device provided in accordance with at least one embodiment of the present disclosure. The electronic devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device 1000 shown in fig. 8 is merely an example and should not be construed as limiting the functionality and scope of use of the disclosed embodiments.
For example, as shown in fig. 8, in some examples, an electronic device 1000 includes a processing device (e.g., a central processor, graphics processor, etc.) 1001, which may include the processor of any of the above embodiments and which may perform various suitable actions and processes according to a program stored in a read-only memory (ROM) 1002 or a program loaded from a storage device 1008 into a random access memory (RAM) 1003. The RAM 1003 also stores various programs and data required for the operation of the computer system. The processing device 1001, the ROM 1002, and the RAM 1003 are connected to each other via a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.
For example, the following components may be connected to the I/O interface 1005: an input device 1006 such as a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, or gyroscope; an output device 1007 including a liquid crystal display (LCD), speaker, vibrator, etc.; a storage device 1008 including, for example, a magnetic tape or hard disk; and a communication device 1009, which may include a network interface card such as a LAN card or modem. The communication device 1009 may allow the electronic device 1000 to communicate wirelessly or by wire with other apparatuses to exchange data, performing communication processing via a network such as the Internet. A drive 1010 is also connected to the I/O interface 1005 as needed. A removable medium 1011, such as a magnetic disk, optical disk, magneto-optical disk, or semiconductor memory, is mounted on the drive 1010 as needed, so that a computer program read therefrom is installed into the storage device 1008 as needed. While fig. 8 illustrates an electronic device 1000 including various devices, it should be understood that not all illustrated devices are required to be implemented or included; more or fewer devices may be implemented or included instead.
For example, the electronic device 1000 may further include a peripheral interface (not shown), and the like. The peripheral interface may be any of various types of interfaces, such as a USB interface, a Lightning interface, etc. The communication device 1009 may communicate with a network, such as the Internet, an intranet, and/or a wireless network such as a cellular telephone network, a wireless local area network (LAN), and/or a metropolitan area network (MAN), and with other devices via wireless communication. The wireless communication may use any of a variety of communication standards, protocols, and technologies, including but not limited to Global System for Mobile communications (GSM), Enhanced Data GSM Environment (EDGE), wideband code division multiple access (W-CDMA), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi (e.g., based on the IEEE 802.11a, IEEE 802.11b, IEEE 802.11g, and/or IEEE 802.11n standards), voice over Internet protocol (VoIP), Wi-MAX, protocols for email, instant messaging, and/or Short Message Service (SMS), or any other suitable communication protocol.
For example, the electronic device 1000 may be any device such as a mobile phone, a tablet computer, a notebook computer, an electronic book, a game console, a television, a digital photo frame, a navigator, or any combination of a data processing device and hardware, which is not limited in the embodiments of the present disclosure.
While the disclosure has been described in detail with respect to the general description and the specific embodiments thereof, it will be apparent to those skilled in the art that certain modifications and improvements may be made thereto based on the embodiments of the disclosure. Accordingly, such modifications or improvements may be made without departing from the spirit of the disclosure and are intended to be within the scope of the disclosure as claimed.
For the purposes of this disclosure, the following points are also noted:
(1) The drawings of the embodiments of the present disclosure relate only to the structures related to the embodiments of the present disclosure, and other structures may refer to the general design.
(2) In the drawings for describing embodiments of the present disclosure, the thickness of layers or regions is exaggerated or reduced for clarity, i.e., the drawings are not drawn to actual scale.
(3) The embodiments of the present disclosure and features in the embodiments may be combined with each other to arrive at a new embodiment without conflict.
The foregoing is merely specific embodiments of the disclosure, but the scope of the disclosure is not limited thereto, and the scope of the disclosure should be determined by the claims.

Claims (20)

1. A scheduling method for multithreading, comprising:
acquiring the branch prediction history accuracy of each of a plurality of threads waiting to be scheduled in a queue;
determining a target thread for the plurality of threads in the queue based on branch prediction history accuracy of the plurality of threads;
the target thread is selected in the queue for subsequent processing.
2. The method of claim 1, further comprising, prior to the determining a target thread for the plurality of threads in the queue based on branch prediction history accuracy of the plurality of threads:
determining whether the plurality of threads satisfy an executable condition, wherein,
in the event that none of the plurality of threads satisfies the executable condition, suspending scheduling of the plurality of threads, or,
in the case where only one thread of the plurality of threads satisfies the executable condition, determining the thread as the target thread and suspending scheduling the plurality of threads, or,
In the event that more than one of the plurality of threads satisfies the executable condition, continuing, for the more than one thread, to determine the target thread for the plurality of threads in the queue based on the branch prediction history accuracy of the plurality of threads.
3. The method of claim 2, wherein the determining a target thread for the plurality of threads in the queue based on the branch prediction history accuracy of the plurality of threads comprises:
and determining the thread with the highest branch prediction history accuracy rate in the plurality of threads as the target thread.
4. The method of claim 2, wherein the determining a target thread for the plurality of threads in the queue based on the branch prediction history accuracy of the plurality of threads comprises:
the target thread is determined for the plurality of threads in the queue in combination with a branch prediction history accuracy of the plurality of threads and a base scheduling algorithm for the queue.
5. The method of claim 4, wherein the determining the target thread for the plurality of threads in the queue in combination with the branch prediction history accuracy of the plurality of threads and the base scheduling algorithm for the queue comprises:
Determining a first alternative thread of the plurality of threads using the branch prediction history accuracy of the plurality of threads;
determining a second alternative thread of the plurality of threads using the base scheduling algorithm;
and selecting the first alternative thread as the target thread when the first alternative thread and the second alternative thread are the same, or preferentially selecting the first alternative thread as the target thread for a set number of consecutive operations when the first alternative thread and the second alternative thread are different.
6. The method of claim 5, wherein the preferentially selecting the first candidate thread as the target thread for the set number of consecutive operations if the first candidate thread and the second candidate thread are different, comprises:
obtaining a target coefficient, wherein the target coefficient is the number of times the first alternative thread is currently to be preferentially selected as the target thread;
the first candidate thread is preferentially selected as the target thread with a set number of continuous operations based on the target coefficient.
7. The method of claim 6, wherein the preferentially selecting the first candidate thread as the target thread for a set number of consecutive operations based on the target coefficient further comprises:
Judging whether the plurality of threads meet a first condition, wherein the first condition is that a difference value between branch prediction history accuracy rates of all the processed threads in the plurality of threads is smaller than a preset threshold value and the target coefficient is zero;
the second alternative thread is selected as the target thread if the plurality of threads satisfy the first condition.
8. The method of claim 7, wherein the preferentially selecting the first candidate thread as the target thread for a set number of consecutive operations based on the target coefficient further comprises:
and resetting the target coefficient based on a difference value of the branch prediction history accuracy rates among the processed threads in the plurality of threads when the plurality of threads do not satisfy the first condition and the target coefficient is zero, wherein the difference value is the difference between the branch prediction history accuracy rates most recently acquired by the plurality of threads respectively.
9. The method of claim 8, wherein,
the difference value is positively correlated with the coefficient value of the target coefficient.
10. The method of claim 7, wherein the preferentially selecting the first candidate thread as the target thread for a set number of consecutive operations based on the target coefficient further comprises:
In the case where the plurality of threads do not satisfy the first condition and the target coefficient is not zero, the target coefficient is kept unchanged.
11. The method of claim 8 or 10, wherein the preferentially selecting the first candidate thread as the target thread for a set number of consecutive operations based on the target coefficient, further comprises:
and determining the first alternative thread as the target thread and decrementing the target coefficient once under the condition that the first alternative thread is different from the second alternative thread and the target coefficient is larger than zero.
12. The method of claim 6, wherein the target coefficient remains unchanged if the first alternative thread is the same as the second alternative thread.
13. The method of claim 4, wherein the base scheduling algorithm comprises: the Round-Robin algorithm, the Icount algorithm, the Bcount algorithm, or the MissCount algorithm.
14. The method of claim 1, wherein the obtaining branch prediction history accuracy for each of the plurality of threads waiting to be scheduled in the queue comprises:
and determining the branch prediction history accuracy of the plurality of threads based on the total number of branch predictions and the correct number or the incorrect number of branch predictions of each of the plurality of threads.
15. The method of any of claims 1-10, 12-14, wherein the queue comprises: a branch prediction queue, a branch target queue, a decode queue, or an instruction dispatch queue.
16. The method of any of claims 1-10, 12-14, wherein the plurality of threads are respectively spawned by a plurality of hardware threads of a multi-threaded processor or the plurality of threads are respectively spawned by a plurality of processor cores of a multi-core processor.
17. A scheduling apparatus for multithreading, comprising:
an acquisition module configured to acquire branch prediction history accuracy of each of a plurality of threads waiting to be scheduled in a queue;
a determining module configured to determine a target thread for the plurality of threads in the queue based on branch prediction history accuracy of the plurality of threads; and
a selection module configured to select the target thread in the queue for subsequent processing.
18. A processor, comprising:
a queue provided with a plurality of data channels, and
a multithreaded scheduling apparatus comprising:
an acquisition module configured to acquire branch prediction history accuracy of each of a plurality of threads waiting to be scheduled in the queue;
a determining module configured to determine a target thread for the plurality of threads in the queue based on branch prediction history accuracy of the plurality of threads; and
a selection module configured to select the target thread in the queue for subsequent processing.
19. The processor of claim 18, further comprising:
a branch prediction monitoring module configured to acquire the total number of branch predictions and the number of correct or incorrect branch predictions corresponding to each of the plurality of threads, and to obtain the history accuracy using the total number of branch predictions and the number of correct or incorrect branch predictions of the plurality of threads.
20. The processor of claim 18 or 19, wherein the processor is a multi-threaded processor or a multi-core processor,
the plurality of threads are respectively generated by a plurality of hardware threads of the multi-threaded processor, or the plurality of threads are respectively generated by a plurality of processor cores of the multi-core processor.
CN202311034294.5A 2023-08-15 2023-08-15 Scheduling method and scheduling device for multithreading and processor Pending CN117055961A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311034294.5A CN117055961A (en) 2023-08-15 2023-08-15 Scheduling method and scheduling device for multithreading and processor


Publications (1)

Publication Number Publication Date
CN117055961A true CN117055961A (en) 2023-11-14

Family

ID=88654885

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311034294.5A Pending CN117055961A (en) 2023-08-15 2023-08-15 Scheduling method and scheduling device for multithreading and processor

Country Status (1)

Country Link
CN (1) CN117055961A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105094750A (en) * 2014-04-25 2015-11-25 华为技术有限公司 Method and apparatus for predicting return address of multi-thread processor
WO2021218633A1 (en) * 2020-04-28 2021-11-04 支付宝(杭州)信息技术有限公司 Cpu instruction processing method, controller, and central processing unit
CN114518900A (en) * 2020-11-20 2022-05-20 上海华为技术有限公司 Instruction processing method applied to multi-core processor and multi-core processor
CN115686639A (en) * 2022-10-21 2023-02-03 中国科学院计算技术研究所 Branch prediction method applied to processor and branch predictor


Similar Documents

Publication Publication Date Title
US9645819B2 (en) Method and apparatus for reducing area and complexity of instruction wakeup logic in a multi-strand out-of-order processor
US7213137B2 (en) Allocation of processor bandwidth between main program and interrupt service instruction based on interrupt priority and retiring micro-ops to cache
US20140208074A1 (en) Instruction scheduling for a multi-strand out-of-order processor
US9858115B2 (en) Task scheduling method for dispatching tasks based on computing power of different processor cores in heterogeneous multi-core processor system and related non-transitory computer readable medium
US20150242254A1 (en) Method and apparatus for processing message between processors
WO2011155097A1 (en) Instruction issue and control device and method
WO2013171362A1 (en) Method in a processor, an apparatus and a computer program product
US20100082867A1 (en) Multi-thread processor and its interrupt processing method
EP2573673B1 (en) Multithreaded processor and instruction fetch control method of multithreaded processor
GB2492457A (en) Predicting out of order instruction level parallelism of threads in a multi-threaded processor
JPWO2008155834A1 (en) Processing equipment
US20150205614A1 (en) Method in a processor, an apparatus and a computer program product
US10271326B2 (en) Scheduling function calls
KR20150067316A (en) Memory based semaphores
CN114168202B (en) Instruction scheduling method, instruction scheduling device, processor and storage medium
EP4386554A1 (en) Instruction distribution method and device for multithreaded processor, and storage medium
CN116048627B (en) Instruction buffering method, apparatus, processor, electronic device and readable storage medium
CN116795503A (en) Task scheduling method, task scheduling device, graphic processor and electronic equipment
CN117055961A (en) Scheduling method and scheduling device for multithreading and processor
CN118132233A (en) Thread scheduling method and device, processor and computer readable storage medium
US9201688B2 (en) Configuration of asynchronous message processing in dataflow networks
US20140201505A1 (en) Prediction-based thread selection in a multithreading processor
US20040128476A1 (en) Scheme to simplify instruction buffer logic supporting multiple strands
CN118152132A (en) Instruction processing method and device, processor and computer readable storage medium
CN118245186A (en) Cache management method, cache management device, processor and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination