CN112612585A - Thread scheduling method, configuration method, microprocessor, device and storage medium

Publication number: CN112612585A (application granted as CN112612585B)
Application number: CN202011492666.5A
Applicant / Assignee: Haiguang Information Technology Co Ltd
Inventors: 胡世文, 薛大庆
Original language: Chinese (zh)
Legal status: Granted; active

Classifications

    • G06F9/4881: Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F9/3867: Concurrent instruction execution, e.g. pipeline or look ahead, using instruction pipelines


Abstract

The present application provides a thread scheduling method, a configuration method, a microprocessor, a device and a storage medium, applied to a microprocessor supporting pipelining and concurrent multithreading. The thread scheduling method comprises the following steps: when a thread scheduling node arrives, obtaining a target thread based on a thread scheduling algorithm pre-configured for that scheduling node, wherein the thread scheduling algorithm is associated with the type of the thread scheduling node; and scheduling the target thread. The method and device optimize the running efficiency of mixed-mode SMT and make up for the current lack of an optimized thread scheduling strategy for mixed-mode SMT.

Description

Thread scheduling method, configuration method, microprocessor, device and storage medium
Technical Field
The embodiments of the present disclosure relate to the field of integrated circuit technologies, and in particular, to a thread scheduling method, a configuration method, a microprocessor, a device, and a storage medium.
Background
Simultaneous multithreading (SMT, also rendered as concurrent multithreading) is an important technology for improving the overall performance of a Central Processing Unit (CPU). It uses the multi-issue, out-of-order execution and other mechanisms of a high-performance physical CPU core to execute instructions from several threads at the same time. One physical CPU core is thereby presented to software and the operating system as multiple virtual CPU cores.
Compared with single-threaded execution, SMT improves the resource utilization of a high-performance CPU. When a modern multi-issue high-performance CPU core executes a single thread, the hardware execution units and hardware resources inside the core (such as memory resources and registers) cannot be fully utilized most of the time. For example, when a thread stalls for some reason, such as a miss in the level-two cache (L2 Cache), the hardware execution units can only idle, which wastes hardware resources and reduces the performance-to-power ratio. In SMT mode, however, when one running thread stalls, other threads can still run, which improves the utilization of hardware resources and thereby improves the multithreading throughput, overall performance and performance-to-power ratio of the CPU core.
Modern CPU cores operate as a pipeline (Pipeline), and a pipeline generally has multiple pipeline stages, such as a branch prediction stage (Branch Prediction), an instruction fetch stage (Fetch Instruction), an instruction decode stage (Instruction Decode), an instruction dispatch stage (Instruction Dispatch), an instruction execution stage (Instruction Execute), and an instruction retirement stage (Instruction Retirement). In these pipeline stages it may be necessary to select one of the threads and pass its instructions to the next pipeline stage; this selection is referred to as "thread scheduling". The choice made by thread scheduling has an important influence on overall SMT performance, power consumption and fairness among threads.
There are different ways to allocate the hardware resources inside an SMT core. Common approaches include: all statically partitioned (All Static Partitioned), i.e. all hardware resources are divided equally according to the number of active threads supported by SMT; fully dynamically shared, i.e. all hardware resources are dynamically shared by all threads; and mixed mode, i.e. part of the hardware resources are dynamically shared by all threads while the other part is statically partitioned.
The most common thread scheduling algorithm for SMT is the round-robin (Round Robin) algorithm, i.e. at a given thread scheduling node a different thread is selected every clock cycle, as sketched below. This approach can be used under any of the resource configuration modes. However, because it does not consider the execution state of the threads or their resource usage, it is often not optimal and cannot comprehensively improve system performance.
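For reference only, a minimal sketch in C of the round-robin selection just described might look as follows; the function name and the static state are illustrative assumptions, not part of any concrete implementation:

/* Baseline round-robin selection: a different thread every clock cycle. */
int round_robin_select(int num_threads)
{
    static int last = -1;              /* thread chosen on the previous clock */
    last = (last + 1) % num_threads;   /* simply rotate to the next thread    */
    return last;
}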
For mixed-mode SMT, no targeted, optimized thread scheduling algorithm has been proposed so far. The round-robin scheduling algorithm is usable but not optimal, and the larger the number of active threads, the worse its performance and fairness become. Moreover, because of the statically partitioned resources, the pipeline stages inside a mixed-mode SMT core are split into several relatively independent blocks, so scheduling algorithms designed for the fully dynamically shared mode are not suitable for mixed-mode SMT.
Therefore, finding an optimized thread scheduling algorithm for mixed-mode SMT that fills this gap has become an urgent technical problem in the industry.
Disclosure of Invention
In view of the above, embodiments of the present application provide a thread scheduling method, a configuration method, a microprocessor, an apparatus, and a storage medium to solve the above problems in the prior art.
An embodiment of the present application provides a thread scheduling method, applied to a microprocessor that supports pipelining and concurrent multithreading, for performing thread scheduling at each thread scheduling node in a pipeline, wherein the pipeline comprises a plurality of pipeline stages and a thread scheduling node is a time point before a pipeline stage; the microprocessor is configured with cache groups used for data transfer between the pipeline stages, each cache group comprising a cache exclusive to each thread. The thread scheduling method comprises the following steps: when a thread scheduling node arrives, obtaining a target thread based on a thread scheduling algorithm pre-configured for that scheduling node, wherein the thread scheduling algorithm is associated with the type of the thread scheduling node; and scheduling the target thread. The types of thread scheduling node comprise a first type and a second type. A first-type thread scheduling node is defined as a thread scheduling node whose following pipeline stage has no output target containing a hardware resource shared by multiple threads; a second-type thread scheduling node is defined as a thread scheduling node whose following pipeline stage has an output target containing a hardware resource shared by multiple threads. The thread scheduling algorithm of a first-type thread scheduling node located before the second-type thread scheduling node comprises: obtaining, as the target thread, the thread corresponding to the most empty cache in the cache group used for output by the pipeline stage following that first-type thread scheduling node. The thread scheduling algorithm of a first-type thread scheduling node located after the second-type thread scheduling node comprises: obtaining, as the target thread, the thread corresponding to the most full cache in the cache group used for output by the pipeline stage preceding that first-type thread scheduling node.
Optionally, the pipeline stages include an instruction dispatch stage, and a second-type thread scheduling node exists before the instruction dispatch stage.
Optionally, the hardware resources shared by multiple threads include: an instruction queue shared by a plurality of threads for storing or fetching instructions.
Optionally, the instruction queue is configured to store instructions of each thread after the instruction dispatch stage for execution in the instruction execution stage.
Optionally, the thread corresponding to the most empty cache in the cache group used for output by the pipeline stage following the first-type thread scheduling node is obtained as follows:

traversing each thread and judging whether a first preset condition is met; the first preset condition includes: the cache corresponding to the thread in the cache group used for output by the pipeline stage preceding the first-type thread scheduling node is not completely empty, and the cache corresponding to the thread in the cache group used for output by the pipeline stage following the first-type thread scheduling node is the most empty among the caches corresponding to the threads traversed so far;

and taking, as the target thread, the thread that meets the first preset condition after all threads have been traversed.

Optionally, the thread corresponding to the most full cache in the cache group used for output by the pipeline stage preceding the first-type thread scheduling node is obtained as follows:

traversing each thread and judging whether a second preset condition is met; the second preset condition includes: the cache corresponding to the thread in the cache group used for output by the pipeline stage preceding the first-type thread scheduling node is the most full among the caches corresponding to the threads traversed so far;

and taking, as the target thread, the thread that meets the second preset condition after all threads have been traversed.

Optionally, the second preset condition further includes: the cache corresponding to the thread in the cache group used for output by the pipeline stage following the first-type thread scheduling node is not full.
An embodiment of the present application provides a thread scheduling configuration method, applied to the design of a microprocessor supporting pipelining, for configuring the thread scheduling algorithm used by each thread scheduling node in a pipeline, wherein the pipeline comprises a plurality of pipeline stages and a thread scheduling node is a time point before a pipeline stage; the microprocessor is configured with cache groups used for data transfer between the pipeline stages, each cache group comprising a cache exclusive to each thread. The thread scheduling configuration method comprises the following steps:

determining the type of each thread scheduling node in the pipeline; the types of thread scheduling node comprise a first type and a second type, wherein a first-type thread scheduling node is defined as a thread scheduling node whose following pipeline stage has no output target containing a hardware resource shared by multiple threads, and a second-type thread scheduling node is defined as a thread scheduling node whose following pipeline stage has an output target containing a hardware resource shared by multiple threads;

configuring a thread scheduling algorithm for each thread scheduling node according to its type; the thread scheduling algorithm of a first-type thread scheduling node located before the second-type thread scheduling node comprises: obtaining, as the target thread to be scheduled, the thread corresponding to the most empty cache in the cache group used for output by the pipeline stage following that first-type thread scheduling node; the thread scheduling algorithm of a first-type thread scheduling node located after the second-type thread scheduling node comprises: obtaining, as the target thread to be scheduled, the thread corresponding to the most full cache in the cache group used for output by the pipeline stage preceding that first-type thread scheduling node.
Optionally, the thread corresponding to the most empty cache in the cache group used for output by the pipeline stage following the first-type thread scheduling node is obtained as follows:

traversing each thread and judging whether a first preset condition is met; the first preset condition includes: the cache corresponding to the thread in the cache group used for output by the pipeline stage preceding the first-type thread scheduling node is not completely empty, and the cache corresponding to the thread in the cache group used for output by the pipeline stage following the first-type thread scheduling node is the most empty among the caches corresponding to the threads traversed so far;

and taking, as the target thread, the thread that meets the first preset condition after all threads have been traversed.

Optionally, the thread corresponding to the most full cache in the cache group used for output by the pipeline stage preceding the first-type thread scheduling node is obtained as follows:

traversing each thread and judging whether a second preset condition is met; the second preset condition includes: the cache corresponding to the thread in the cache group used for output by the pipeline stage preceding the first-type thread scheduling node is the most full among the caches corresponding to the threads traversed so far;

and taking, as the target thread, the thread that meets the second preset condition after all threads have been traversed.

Optionally, the second preset condition further includes: the cache corresponding to the thread in the cache group used for output by the pipeline stage following the first-type thread scheduling node is not full.
An embodiment of the present application provides a microprocessor that supports pipelining, the pipeline comprising a plurality of pipeline stages. The microprocessor is coupled with or comprises a memory; the memory comprises cache groups used for data transfer between the pipeline stages, each cache group comprising a cache exclusive to each thread. The microprocessor runs executable program code to perform the thread scheduling method so as to schedule threads at each thread scheduling node.
An embodiment of the present application provides a processing chip, including: at least one said microprocessor.
An embodiment of the present application provides a computer apparatus, including: a memory and a processor; the memory stores executable program codes, and the processor executes the thread scheduling configuration method when running the executable program codes.
An embodiment of the present application provides a computer-readable storage medium storing executable program code which, when executed, performs the thread scheduling method or the thread scheduling configuration method described above.
Compared with the prior art, the technical scheme of the embodiment of the application has the following beneficial effects:
The present application relates to thread scheduling for a microprocessor supporting pipelining and concurrent multithreading. By determining the type of each thread scheduling node and providing a corresponding thread scheduling strategy for the first-type thread scheduling nodes in mixed-mode SMT, the running efficiency of mixed-mode SMT is optimized, making up for the current lack of an optimized thread scheduling strategy for mixed-mode SMT.
Drawings
Fig. 1 shows a schematic structure of a pipeline in the embodiment of the present application.
Fig. 2 is a flowchart illustrating a thread scheduling method according to an embodiment of the present application.
Fig. 3 is a flowchart illustrating a target thread determining method of a first-type thread scheduling node before a second-type thread scheduling node in an embodiment of the present application.
Fig. 4 is a flowchart illustrating a manner of obtaining a target thread of a first-type thread scheduling node before a second-type thread scheduling node in a more specific embodiment of the present application.
Fig. 5 is a flowchart illustrating a target thread determining method for a first-type thread scheduling node after a second-type thread scheduling node in an embodiment of the present application.
Fig. 6 is a flowchart illustrating a manner of obtaining a target thread of a first-type thread scheduling node after a second-type thread scheduling node in a more specific embodiment of the present application.
Fig. 7 is a flowchart illustrating a thread scheduling configuration method according to an embodiment of the present application.
Fig. 8 is a schematic structural diagram of a computer device in an embodiment of the present application.
Detailed Description
Modern processors (CPU, SoC, etc.) employ pipelined operation. Specifically, the pipeline technology is a technology that decomposes an instruction into multiple steps and overlaps operations of different instructions, thereby realizing parallel processing of several instructions to accelerate the program running process.
The multiple steps into which the instruction is broken are the various pipeline stages in the pipeline.
Referring to fig. 1, a schematic structural diagram of a pipeline in an embodiment of the present application is shown.
In the illustrated example of the pipeline structure, the pipeline stages are: a branch prediction stage 101 (Branch Prediction), an instruction fetch stage 102 (Fetch Instruction), an instruction decode stage 103 (Instruction Decode), an instruction dispatch stage 104 (Instruction Dispatch), an instruction execution stage 110 (Instruction Execute), and an instruction retirement stage 111 (Instruction Retirement). It should be noted that the pipeline structure in fig. 1 is only an example; in other scenarios stages may be added or removed as required, and their order may also be changed.
The function of each pipeline stage is described below.
Regarding the branch prediction stage 101: a pipelined processor may branch in the pipeline depending on whether a condition evaluates true or false. If the next instruction B can only be executed after the condition has been resolved from the processing result of a previous instruction A, then B must wait until A finishes executing; the longer the pipeline, the longer B waits, which leaves the processor idle and lowers efficiency. With branch prediction, the most likely branch of instruction B (i.e. whether the condition will be true or false) is predicted while instruction A is still being processed in the pipeline, so that B is processed along the predicted branch without waiting for A to finish executing.

More specifically, branch prediction may select the branch that is currently most likely to be taken based on historical branch outcomes and process it accordingly.
With respect to the instruction fetch stage 102, the processor fetches an instruction from memory or cache into an instruction register. With respect to the instruction decode stage 103, an instruction consists of an opcode and an address code; decoding analyzes the nature and function of the operation to be performed, i.e. translates the opcode into control signals, while the address code indicates the address of the data being operated on (the operands). With respect to the instruction dispatch stage 104, decoded instructions are assigned an in-order (In-Order) or out-of-order (Out-of-Order) execution order, and instructions to be executed may be placed in the instruction queue 109 (Queue) for execution in the instruction execution stage 110. The instruction retirement stage 111 refers to clearing an instruction's data from hardware resources such as caches and registers, for use by subsequent instructions, after the instruction has completed execution.
A cache for data transfer is configured between the pipeline stages; that is, when an instruction is processed along the pipeline, the output data of the previous pipeline stage is stored in the cache, and the next pipeline stage takes the data out of the cache as its input for further processing. The data here generally refers to the instruction input at each pipeline stage or the result data obtained by processing that instruction.
For a processor employing concurrent multithreading, which enables multiple threads to pipeline their respective data in parallel, cache groups 105, 106, 107 and 108 may be configured between the pipeline stages as shown in FIG. 1, each comprising a cache exclusive to each thread; for example, cache group 105 includes caches 1051, 1052, 1053 and 1054. In a possible example, each cache may be of first-in first-out (FIFO) type, so that data output earlier by the previous pipeline stage is passed earlier to the next pipeline stage. Further, in some possible examples, the cache groups 105 to 108 and the caches within them may be different regions of the same cache (Cache), or may be different buffers such as different registers, different register sets or different caches.
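As an illustration only, the per-thread caches of a cache group might be modeled as below; this is a sketch under assumed names and sizes (NUM_THREADS, FIFO_DEPTH and the field names are assumptions), not something taken from the patent itself:

#include <stdbool.h>
#include <stdint.h>

#define NUM_THREADS 4   /* e.g. SMT4 as in Fig. 1            */
#define FIFO_DEPTH  8   /* assumed maximum entry capacity F  */

/* One statically partitioned FIFO per thread. */
typedef struct {
    uint64_t entry[FIFO_DEPTH];  /* instructions or intermediate results */
    int      used;               /* number of occupied entries, S(.)     */
} thread_fifo;

/* One cache group per inter-stage boundary, e.g. caches 1051 to 1054 in group 105. */
typedef struct {
    thread_fifo per_thread[NUM_THREADS];
} cache_group;

static inline bool fifo_empty(const thread_fifo *f) { return f->used == 0; }
static inline bool fifo_full (const thread_fifo *f) { return f->used == FIFO_DEPTH; }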
Current SMT modes include SMT2, SMT4, SMT8, etc., where 2, 4 and 8 refer to the ratio of threads to processor cores. For example, 2 cores with 4 threads gives a core-to-thread ratio of 1:2, i.e. SMT2, and 16 cores with 64 threads gives a ratio of 1:4, i.e. SMT4. FIG. 1 illustratively shows the 4 threads of a CPU in SMT4 mode, although this may vary in other examples.
In fig. 1, several dashed arrows represent redirection of the instruction stream, as indicated by the dashed arrows leading out of the various pipeline stages. For example: dashed arrow (1), the branch prediction stage fails to find a new branch instruction, which is then found during the instruction decode stage; dashed arrow (2), a branch mispredicted by the branch prediction stage is discovered at the instruction execution stage; or dashed arrow (3), a load (Load) instruction uses erroneous data.

In such cases the pipeline must be flushed, either to re-execute the correct instruction branch (dashed arrows (1) and (2)) or to reload the instructions following the instruction that used the wrong data (dashed arrow (3)).

Furthermore, dashed arrow (4) represents a predicted branch jump. If there is no branch jump as in dashed arrow (4) and no other redirection of the instruction stream as in dashed arrows (1) to (3), data is fetched sequentially by default.
The branch prediction stage 101 is preceded by a set of selectors 110 corresponding to the number of threads, for example 4 selectors for 4 threads. Each selector selects, among its inputs, an instruction for output, including the output data carried by the dashed arrows described above. The figure only shows, as an example, the output data of a dashed arrow reaching the selector of one thread; in fact such output data flows may also reach the selectors of the other 3 threads, but for simplicity they are not all drawn in fig. 1.
The time point at which thread scheduling is performed before each pipeline stage starts, i.e. when a thread entering the next stage is selected from the 4 threads so that the instruction in the cache corresponding to that thread can be fetched for processing by the next stage, may be referred to as a "thread scheduling node", shown as thread scheduling nodes A to E in the figure. In a possible example, each of the thread scheduling nodes A to E may be the arrival time of the clock signal of the corresponding pipeline stage; for example, in a microprocessor operating on a periodic clock signal, each pipeline stage is allocated one or several clock cycles in which to complete one operation.
In the example of FIG. 1, each thread has exclusive, statically partitioned hardware resources, i.e. the caches in cache groups 105 to 108 described above, while the instruction queue 109 may be shared by multiple threads, i.e. a dynamically shared hardware resource, thereby implementing mixed-mode SMT. In some examples, the instruction queue 109 may be of a type that selects the oldest ready instruction for execution first (Oldest Ready First Out).
For mixed-mode SMT such as that of FIG. 1, the present application provides a corresponding optimized thread scheduling method to improve its running efficiency. It should be noted that, according to the previous definitions of the first-type and second-type thread scheduling nodes, the only second-type thread scheduling node in FIG. 1 is node D before the instruction dispatch stage, while the remaining thread scheduling nodes A, B, C and E before the other pipeline stages are first-type thread scheduling nodes. However, this follows only from the arrangement of the pipeline stages in the example of FIG. 1; in other examples the output targets of several pipeline stages may include hardware resources shared by multiple threads, so there may be multiple second-type thread scheduling nodes. Furthermore, the hardware resource shared by multiple threads is not limited to the instruction queue of FIG. 1 and may be another type of storage resource.
Fig. 2 is a schematic flow chart showing a thread scheduling method in the embodiment of the present application.
The thread scheduling method is applied to a processor supporting a pipeline technology and a concurrent multithreading technology, such as a microprocessor; in a possible implementation example, the microprocessor described below may be implemented as a processor core in a multi-core processor (CPU or SoC).
The thread scheduling method comprises the following steps:
step S201: when a thread scheduling node arrives, obtain a target thread based on the thread scheduling algorithm pre-configured for that scheduling node.
Wherein the thread scheduling algorithm is associated with the type of the thread scheduling node. The types of thread scheduling node comprise a first type and a second type: a first-type thread scheduling node is defined as a thread scheduling node whose following pipeline stage has no output target containing a hardware resource shared by multiple threads, and a second-type thread scheduling node is defined as a thread scheduling node whose following pipeline stage has an output target containing a hardware resource shared by multiple threads.
Step S202: and scheduling the target thread.
FIG. 1 illustrates the first and second types of thread scheduling node. Only the output target after the instruction dispatch stage 104 includes, in addition to cache group 108, the instruction queue 109, and the instruction queue 109 is a hardware resource shared by multiple threads; therefore the thread scheduling node D before the instruction dispatch stage 104 is a second-type thread scheduling node. The pipeline stages after the other thread scheduling nodes A, B, C and E output either to a cache group or to nothing: the output target after the branch prediction stage (node A) is cache group 105, after the instruction fetch stage (node B) it is cache group 106, after the instruction decode stage (node C) it is cache group 107, and the instruction retirement stage 111 after node E has no output target. None of these contains a hardware resource shared by multiple threads, so A, B, C and E are all first-type thread scheduling nodes.
In a scenario where a second-type thread scheduling node exists, before that node the data of each thread waiting to enter the following pipeline stages is stored in hardware resources exclusive to each thread (i.e. statically partitioned), whereas the output target of the pipeline stage after the second-type thread scheduling node includes a hardware resource shared by multiple threads (i.e. dynamically shared). The selection of the target thread may therefore differ between first-type thread scheduling nodes located before and after the second-type thread scheduling node; the principle is described in detail below.
It should be noted that, in scenarios other than fig. 1, there may be no first-type thread scheduling node before the second-type thread scheduling node, no first-type thread scheduling node after it, or even no second-type thread scheduling node at all, so the two thread scheduling manners used at first-type thread scheduling nodes may be combined differently according to the situation.
For ease of description, the second-type thread scheduling node is referred to below as a type-2 scheduling node, a first-type thread scheduling node located before the second-type thread scheduling node as a type-1.1 scheduling node, and a first-type thread scheduling node located after the second-type thread scheduling node as a type-1.2 scheduling node.
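For illustration, the three kinds of scheduling node could be represented as a simple enumeration; the names below are assumptions introduced here for clarity, not terms from the patent:

typedef enum {
    NODE_TYPE_1_1,  /* first-type node located before the second-type node           */
    NODE_TYPE_1_2,  /* first-type node located after the second-type node            */
    NODE_TYPE_2     /* following stage outputs to a hardware resource shared by all  */
} sched_node_type_t;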
In some examples, thread scheduling at a type-2 scheduling node involves allocating the dynamically shared hardware resource among the threads, so the influence of each thread on the overall performance of the microprocessor may be considered as a whole. In a possible example, an existing scheduling algorithm for concurrent multithreading over dynamically shared hardware resources may be used, or a more optimized thread scheduling algorithm may be used; this is not limited here, as long as the target thread to be scheduled at the type-2 scheduling node can be determined.
However, even if the type-2 scheduling node selects a target thread X, if the cache corresponding to X in the cache group before the type-2 scheduling node is completely empty, no data will enter the pipeline stage after the type-2 scheduling node even though X is scheduled. For example, in fig. 1, suppose the second-type thread scheduling node D expects to schedule thread 1, and cache 1 in the cache group before node D is the hardware resource exclusive to thread 1; if node D actually schedules thread 1 while cache 1 is completely empty (the number of entries occupied by data is 0), no data can enter the instruction dispatch stage, and the microprocessor idles (i.e. waits for data in cache 1 to process), which affects performance. Similarly, if the cache behind the thread X that the type-2 scheduling node expects to schedule is full, the result of processing X's instruction in the pipeline stage after the type-2 scheduling node cannot be accepted, which also affects the performance of the microprocessor. For example, in fig. 1, suppose the second-type thread scheduling node D expects to schedule thread 1, and cache 2 in the cache group behind node D is the hardware resource exclusive to thread 1; if node D actually schedules thread 1, the data of thread 1 enters the instruction dispatch stage for processing and is then output to cache 2, but cache 2 is full at that moment and cannot receive it, so the microprocessor may also idle (i.e. wait for space in cache 2), affecting performance.
Therefore, at each type-1.1 scheduling node located before the type-2 scheduling node, the thread scheduling algorithm adopted should ensure that the scheduled target thread has data to feed into the pipeline stage following the type-1.1 scheduling node and that there is relatively more space available in the cache corresponding to that target thread for the stage's output, so as to reduce idling and improve efficiency.
Therefore, in some embodiments, the thread scheduling algorithm of a first-type thread scheduling node located before the second-type thread scheduling node (i.e. a type-1.1 scheduling node) comprises: obtaining, as the target thread, the thread corresponding to the most empty cache in the cache group used for output by the pipeline stage following that first-type thread scheduling node.
For example, in fig. 1, a thread is selected at the first-type thread scheduling node A to be scheduled, so that its corresponding data is input into the branch prediction stage 101. If cache 1051 in cache group 105, used for output after the branch prediction stage 101, is the most empty, and the thread ID corresponding to cache 1051 is 3, then the target thread to be scheduled at the first-type thread scheduling node A is thread 3, and it is scheduled accordingly.
By analogy, as the pipeline progressively approaches the second-type thread scheduling node, the target threads of the type-1.1 scheduling nodes are scheduled one by one. At the type-1.1 scheduling node before the type-2 scheduling node, the thread corresponding to the most empty cache in the cache group that is the output target of the pipeline stage after the type-2 scheduling node is selected as the target thread of that type-1.1 scheduling node, so that the instruction of the target thread, after being processed by the pipeline stage following the type-2 scheduling node, can be output to the most empty cache. Fuller or full caches are thus avoided, the probability of the microprocessor idling is reduced, and performance degradation is prevented.
In addition, the thread scheduling algorithm of a first-type thread scheduling node located after the second-type thread scheduling node (i.e. a type-1.2 scheduling node) comprises: obtaining, as the target thread, the thread corresponding to the most full cache in the cache group used for output by the pipeline stage preceding that first-type thread scheduling node; that is, the thread corresponding to the most full cache in the preceding cache group is scheduled preferentially, so that fuller or full caches are avoided, the probability of the microprocessor idling is reduced, and performance degradation is prevented.
For example, in fig. 1, a thread is selected at the first-type thread scheduling node E to be scheduled, so that its corresponding data enters the instruction retirement stage 111. If the third cache in cache group 108 before the first-type thread scheduling node E is the most full and the thread ID corresponding to that cache is 3, then the target thread to be scheduled at node E is thread 3, and it is scheduled accordingly.
Thus, taking fig. 1 as an example, optimized thread scheduling is implemented one by one along the pipeline at each first-type thread scheduling node A, B, C and E, so that the thread scheduling of the whole pipeline is optimized.
As shown in fig. 3, a flowchart of determining the target thread at a type-1.1 thread scheduling node in an embodiment of the present application, i.e. how to obtain the thread corresponding to the most empty cache in the cache group after the type-1.1 thread scheduling node.
The process comprises the following steps:
step S301: traverse each thread and judge whether a first preset condition is met; the first preset condition includes: the cache corresponding to the thread in the cache group used for output by the pipeline stage preceding the first-type thread scheduling node is not completely empty, and the cache corresponding to the thread in the cache group used for output by the pipeline stage following the first-type thread scheduling node is the most empty among the caches corresponding to the threads traversed so far.

In implementation there may be a cache group before the type-1.1 thread scheduling node, as for thread scheduling nodes B and C in fig. 1. Hence, even if the thread whose cache in the cache group after the type-1.1 thread scheduling node is the most empty has been found, its corresponding cache in the cache group before the node must also be checked to be not completely empty (otherwise there would be no instruction to process, causing idling). However, this is only an example based on fig. 1, where a cache group exists before the type-1.1 thread scheduling node; in other examples, if there is no cache group before the type-1.1 thread scheduling node, the completely-empty condition need not be considered.

Step S302: take, as the target thread, the thread that meets the first preset condition after all threads have been traversed.
Fig. 4 is a schematic flow chart illustrating a manner of acquiring a target thread of a type 1.1 thread scheduling node in a more specific embodiment of the present application.
As shown in the figure, the process specifically includes:
step S401: initialize variables i, P and M.

Here i denotes the i-th thread and is initially 0; P is a variable holding the ID of the target thread, initially -1; M is a number of used cache entries (the fewer used entries, the emptier the cache), and the initial value of M may be set to the maximum entry capacity F of the cache.
Step S402: the ID of the current thread is calculated.
In a specific example, tid denotes the ID of the current thread and is computed as tid = (next_tid + i) % num_threads, where num_threads is the maximum number of active threads supported by SMT, % is the modulo operation, and next_tid records the ID of the thread from which the traversal starts in the next clock cycle, initially 0. Thus tid is initially 0; for example, with 4 threads (thread 0 to thread 3), when next_tid is 1 the traversal starts from the second thread (tid = 1).
Step S403: and judging whether the cache corresponding to the current thread in a cache group before the type 1.1 thread scheduling node is not completely empty or not.
Let S denote the function giving the number of used entries in the cache corresponding to a thread, and B the cache group before the type-1.1 thread scheduling node; then B(tid) is the cache corresponding to the current thread tid in cache group B, and S(B(tid)) is the number of used entries of B(tid). Step S403 can thus be expressed as judging whether S(B(tid)) > 0 (i.e. whether it is not equal to 0).

If S(B(tid)) > 0 is false, i.e. B(tid) is completely empty, the available space in the following cache need not be checked, and the process proceeds to step S406; otherwise step S404 is performed.
Step S404: judge whether the number of used entries in the cache corresponding to the current thread in the cache group after the type-1.1 thread scheduling node is less than M.

For example, if A denotes the cache group after the type-1.1 thread scheduling node, the number of used entries in the cache corresponding to the current thread tid in that cache group may be expressed as S(A(tid)), and step S404 judges whether S(A(tid)) < M holds. If yes, go to step S405; if not, go to step S406.

Step S405: assign tid to P and S(A(tid)) to M;
step S406: i is increased by 1;
step S407: judge whether there are threads not yet traversed, i.e. whether i < num_threads; if yes, not all threads have been traversed, so return to step S402 and continue the loop; if not, go to step S408;
step S408: judge whether P is still the initial value, i.e. whether P equals -1;

if P is still the initial value, no target thread meeting the requirements has been found;

if P is not the initial value, a target thread has been found, and the process proceeds to step S409, where the ID of the target thread is obtained as P.
Optionally, so that the thread from which the traversal starts at this type-1.1 scheduling node can change from cycle to cycle (i.e. not always starting from, say, thread 1), ensuring fairness among threads, a step S410 may further be set between steps S408 and S409: take the ID of the thread following P as the next_tid of the next clock cycle, so as to achieve fair polling among the threads.

In a possible example, the specific computation may be next_tid = (P + 1) % num_threads, i.e. the value of (P + 1) % num_threads is assigned to next_tid for later use. For example, with 4 threads (thread 0 to thread 3), when P is 3 the traversal in the following clock cycle starts again from thread 0.
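The fig. 4 loop can be summarized by the following C sketch. Here s_b[t] stands for S(B(t)), the used-entry count of thread t's cache in the group before the node, and s_a[t] for S(A(t)), the count in the group after it; NUM_THREADS, F and all function and variable names are illustrative assumptions, a sketch rather than a definitive implementation:

#define NUM_THREADS 4   /* assumed number of active threads          */
#define F 8             /* assumed maximum entry capacity of a cache */

static int next_tid = 0;   /* thread from which the next traversal starts */

/* Type-1.1 node: pick the thread whose output cache is the most empty,
 * skipping threads whose input cache is completely empty (Fig. 4).     */
int select_type_1_1(const int s_b[NUM_THREADS], const int s_a[NUM_THREADS])
{
    int P = -1;   /* ID of the target thread; -1 means none found            */
    int M = F;    /* emptiest S(A(tid)) seen so far, initialised to F (S401) */

    for (int i = 0; i < NUM_THREADS; i++) {
        int tid = (next_tid + i) % NUM_THREADS;     /* step S402 */

        if (s_b[tid] == 0)          /* step S403: B(tid) completely empty  */
            continue;               /* nothing to feed the next stage      */

        if (s_a[tid] < M) {         /* step S404: emptier than best so far */
            P = tid;                /* step S405 */
            M = s_a[tid];
        }
    }

    if (P != -1)                              /* steps S408 to S410          */
        next_tid = (P + 1) % NUM_THREADS;     /* fair polling for next cycle */
    return P;                                 /* -1: no schedulable thread   */
}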
As shown in fig. 5, a flowchart of determining the target thread at a type-1.2 thread scheduling node in an embodiment of the present application, i.e. how to obtain the thread corresponding to the most full cache in the cache group before the type-1.2 thread scheduling node.
The process comprises the following steps:
step S501: traverse each thread and judge whether a second preset condition is met; the second preset condition includes: the cache corresponding to the thread in the cache group used for output by the pipeline stage preceding the first-type thread scheduling node is the most full among the caches corresponding to the threads traversed so far.

In a specific implementation there may be no cache group after the type-1.2 thread scheduling node, as for thread scheduling node E in fig. 1, so only the empty/full state of each thread's cache in the cache group before E needs to be considered, and the cache group after E need not be considered.

However, this is only an example; in other cases, to account for a possible cache group after the type-1.2 thread scheduling node, the second preset condition may further include: the cache corresponding to the thread in the cache group after the first-type thread scheduling node (i.e. the type-1.2 thread scheduling node) is not full.

Step S502: take, as the target thread, the thread that meets the second preset condition after all threads have been traversed.
Fig. 6 is a schematic flow chart illustrating a manner of acquiring a target thread of a type 1.2 thread scheduling node in a more specific embodiment of the present application.
As shown in the figure, the process specifically includes:
step S601: initializing and setting variables i, P and M.
As before, i denotes the i-th thread and is initially 0; P is a variable holding the ID of the target thread, initially -1; M is a number of used cache entries (the fewer used entries, the emptier the cache), and the initial value of M may here be set to 0.
Step S602: the ID of the current thread is calculated.
In a specific example, tid denotes the ID of the current thread and is computed as tid = (next_tid + i) % num_threads, where num_threads is the maximum number of active threads supported by SMT, % is the modulo operation, and next_tid records the ID of the thread from which the traversal starts in the next clock cycle, initially 0. Thus tid is initially 0; for example, with 4 threads (thread 0 to thread 3), when next_tid is 1 the traversal starts from the second thread (tid = 1).
Step S603: judge whether the number of used entries in the cache corresponding to the current thread in the cache group before the type-1.2 thread scheduling node is greater than M.

For example, let S denote the function giving the number of used entries in the cache corresponding to a thread, and B the cache group before the type-1.2 thread scheduling node; then B(tid) is the cache corresponding to the current thread tid in cache group B, and S(B(tid)) is the number of used entries of B(tid). Step S603 can thus be expressed as judging whether S(B(tid)) > M.

Step S604: assign tid to P and S(B(tid)) to M;
step S605: i is increased by 1;
step S606: judge whether there are threads not yet traversed, i.e. whether i < num_threads; if yes, not all threads have been traversed, so return to step S602 and continue the loop; if not, go to step S607;
step S607: judge whether P is still the initial value, i.e. -1;

if P is still the initial value, no target thread meeting the requirements has been found;

if P is not the initial value, a target thread has been found, and the process proceeds to step S608, where the ID of the target thread is obtained as P.
Optionally, so that the thread from which the traversal starts at this type-1.2 scheduling node can change each time (i.e. not always starting from, say, thread 1), ensuring fairness among threads, a step S609 may further be set between steps S607 and S608: take the ID of the thread following P as the next_tid of the next clock cycle.

In a possible example, the specific computation may be next_tid = (P + 1) % num_threads, i.e. the value of (P + 1) % num_threads is assigned to next_tid for later use. For example, with 4 threads (thread 0 to thread 3), when P is 3 the traversal in the following clock cycle starts again from thread 0.
In a possible example, if it is considered that there may be a cache group, for example cache group A, after the type-1.2 thread scheduling node, a further judgment may be added before step S603: determine whether S(A(tid)) is full, i.e. whether S(A(tid)) equals F, where F is the maximum entry capacity of the cache; only if it is not full is the judgment of step S603 performed for the current thread tid.
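Analogously, the fig. 6 loop, together with the optional not-full check on a cache group A after the node, can be sketched as follows (reusing NUM_THREADS and F from the previous sketch; passing s_a as NULL models the case where no cache group follows the node; all names are assumptions):

static int next_tid_1_2 = 0;   /* traversal start for this node in the next cycle */

/* Type-1.2 node: pick the thread whose input cache is the most full,
 * optionally skipping threads whose output cache is already full.     */
int select_type_1_2(const int s_b[NUM_THREADS], const int *s_a)
{
    int P = -1;   /* ID of the target thread; -1 means none found */
    int M = 0;    /* fullest S(B(tid)) seen so far (step S601)    */

    for (int i = 0; i < NUM_THREADS; i++) {
        int tid = (next_tid_1_2 + i) % NUM_THREADS;   /* step S602 */

        if (s_a != NULL && s_a[tid] == F)   /* optional check: A(tid) full */
            continue;

        if (s_b[tid] > M) {                 /* step S603 */
            P = tid;                        /* step S604 */
            M = s_b[tid];
        }
    }

    if (P != -1)                                /* step S609 */
        next_tid_1_2 = (P + 1) % NUM_THREADS;
    return P;
}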
In combination with the above, an embodiment of the present application provides a microprocessor implementing a pipeline comprising a plurality of pipeline stages. The microprocessor is coupled with or comprises a memory; the memory comprises cache groups used for data transfer between the pipeline stages, each cache group comprising a cache exclusive to each thread. The microprocessor runs executable program code to perform the thread scheduling method so as to schedule threads at each thread scheduling node.
Optionally, the microprocessor is implemented as a processor core. For example, in a multi-core processor that supports pipeline and SMT techniques, the microprocessor may be implemented as one of the processor cores.
In some examples, the memory may be a register of the microprocessor, a cache (e.g. a level-one cache exclusive to a processor core), or another storage medium (e.g. a level-two cache, a level-three cache, an embedded memory of a SoC, runtime memory, etc.).
In some examples, the present application may also provide a processing chip including at least one of the microprocessors. The processing chip can be a CPU chip, an SoC chip and the like.
Fig. 7 is a schematic flow chart showing a thread scheduling configuration method according to an embodiment of the present application.
Specifically, the thread scheduling configuration method is used to configure a thread scheduling algorithm for each thread scheduling node in the pipeline during processor design, so that when the processor, working along the pipeline, reaches a thread scheduling node, it can invoke the thread scheduling algorithm pre-configured for that node to complete the thread scheduling work.
The thread scheduling configuration method comprises the following steps:
step S701: the type of each thread scheduling node in the pipeline is determined.
The types of thread scheduling node comprise a first type and a second type: a first-type thread scheduling node is defined as a thread scheduling node whose following pipeline stage has no output target containing a hardware resource shared by multiple threads, and a second-type thread scheduling node is defined as a thread scheduling node whose following pipeline stage has an output target containing a hardware resource shared by multiple threads.
In some examples, the type of all thread scheduling nodes in the pipeline may be determined based on a known in advance order of progress of the various pipeline stages of the pipeline.
Step S702: and configuring a thread scheduling algorithm for each thread scheduling node according to the type.
The thread scheduling algorithm of a first-type thread scheduling node located before the second-type thread scheduling node comprises: obtaining, as the target thread to be scheduled, the thread corresponding to the most empty cache in the cache group used for output by the pipeline stage following that first-type thread scheduling node. The thread scheduling algorithm of a first-type thread scheduling node located after the second-type thread scheduling node comprises: obtaining, as the target thread to be scheduled, the thread corresponding to the most full cache in the cache group used for output by the pipeline stage preceding that first-type thread scheduling node.
Optionally, the thread corresponding to the most empty cache in the cache group used for output by the pipeline stage following the first-type thread scheduling node is obtained as follows:

traversing each thread and judging whether a first preset condition is met; the first preset condition includes: the cache corresponding to the thread in the cache group used for output by the pipeline stage preceding the first-type thread scheduling node is not completely empty, and the cache corresponding to the thread in the cache group used for output by the pipeline stage following the first-type thread scheduling node is the most empty among the caches corresponding to the threads traversed so far;

and taking, as the target thread, the thread that meets the first preset condition after all threads have been traversed.

Optionally, the thread corresponding to the most full cache in the cache group used for output by the pipeline stage preceding the first-type thread scheduling node is obtained as follows:

traversing each thread and judging whether a second preset condition is met; the second preset condition includes: the cache corresponding to the thread in the cache group used for output by the pipeline stage preceding the first-type thread scheduling node is the most full among the caches corresponding to the threads traversed so far;

and taking, as the target thread, the thread that meets the second preset condition after all threads have been traversed.

Optionally, the second preset condition further includes: the cache corresponding to the thread in the cache group used for output by the pipeline stage following the first-type thread scheduling node is not full.
Specifically, for the above specific implementation of obtaining the target thread, reference may be made to the previous embodiments shown in fig. 3 to 6, where the principle has been explicitly explained and is not repeated here.
Through the configuration method in the example of fig. 7, a corresponding thread scheduling algorithm can be configured for each first-type thread scheduling node, so that when the corresponding processor is actually working or simulating, thread scheduling can be performed according to the thread scheduling method as shown in the embodiment of fig. 2.
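As a rough illustration of the configuration step, each scheduling node could simply be bound to the algorithm matching its type, keeping an existing shared-resource algorithm for type-2 nodes; the sketch below reuses the node-type enumeration and selection functions from the earlier sketches, and all names are assumptions rather than part of the patent:

typedef int (*sched_algo)(const int *s_b, const int *s_a);

struct sched_node {
    sched_node_type_t type;
    sched_algo        algo;
};

/* Bind one scheduling algorithm to every node according to its type. */
void configure_pipeline(struct sched_node *nodes, int n, sched_algo type2_algo)
{
    for (int i = 0; i < n; i++) {
        switch (nodes[i].type) {
        case NODE_TYPE_1_1: nodes[i].algo = select_type_1_1; break;
        case NODE_TYPE_1_2: nodes[i].algo = select_type_1_2; break;
        case NODE_TYPE_2:   nodes[i].algo = type2_algo;      break;
        }
    }
}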
Fig. 8 is a schematic structural diagram of a computer device in the embodiment of the present application.
The computer device 800 comprises a memory 801 and a processor 802, wherein the memory 801 stores a computer program operable on the processor 802, and the processor 802 executes the computer program to perform the thread scheduling configuration method as described above, for example, in the embodiment of fig. 7.
In practical applications, a designer or a manufacturer of the microprocessor in the foregoing embodiments may execute the thread scheduling configuration algorithm in the above example in the process of designing or manufacturing the microprocessor.
In some examples, the processor 802 may be a combination that implements a computing function, such as a combination comprising one or more microprocessors, a digital signal processor (DSP), an ASIC, or the like; the memory 801 may include high-speed RAM and may also include non-volatile memory, such as at least one disk memory.
In some examples, the computer apparatus 800 may be implemented in, for example, a server bank, a desktop computer, a laptop computer, a smart phone, a tablet computer, a smart band, a smart watch, or other smart devices, or a processing system formed by communicatively coupling such smart devices.
A computer-readable storage medium may also be provided in an embodiment of the present application, on which a computer program is stored, wherein the computer program is executed to perform the method steps in the embodiments described above, for example, in fig. 2 to 7.
That is, the method flows in the embodiments of the present application (the embodiments of fig. 2 to fig. 7) may be implemented as software or computer code stored in a recording medium such as a CD-ROM, a RAM, a floppy disk, a hard disk, or a magneto-optical disk, or as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium and downloaded over a network to be stored in a local recording medium, so that the method described herein may be carried out by software stored on a recording medium and run on a general-purpose computer, a dedicated processor, or programmable or dedicated hardware such as an ASIC or an FPGA. It will be appreciated that the computer, processor, microprocessor controller, or programmable hardware includes memory components (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code which, when accessed and executed by the computer, processor, or hardware, implements the method steps in the embodiments described above. In addition, when a general-purpose computer accesses code for implementing the methods shown herein, the execution of that code transforms the general-purpose computer into a special-purpose computer for performing the method steps shown herein.
Compared with the prior art, the technical scheme of the embodiment of the application has the following beneficial effects:
the application relates to thread scheduling for a microprocessor supporting a pipeline technology and concurrent multithreading, and in particular optimizes the running efficiency of mixed-mode SMT (simultaneous multithreading) by determining the type of each thread scheduling node and providing a corresponding thread scheduling strategy for the first-type thread scheduling nodes in mixed-mode SMT, thereby making up for the lack of an optimized thread scheduling strategy in current mixed-mode SMT.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer programs. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the present application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer program may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another.
In the description herein, references to the terms "one embodiment", "some embodiments", "an example", "a specific example", or "some examples", etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, the various embodiments or examples described in this specification, and the features of different embodiments or examples, can be combined by those skilled in the art provided there is no contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process, and the scope of the preferred embodiments of the present application includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved.
The logic and/or steps represented in the flowcharts or otherwise described herein, for example an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch instructions from the instruction execution system, apparatus, or device and execute them.
Although the embodiments of the present application are disclosed above, the present application is not limited thereto. Various changes and modifications may be made by any person skilled in the art without departing from the spirit and scope of the embodiments of the present application as defined by the appended claims.

Claims (15)

1. A thread scheduling method, applied to a microprocessor supporting a pipeline technology and concurrent multithreading, for performing thread scheduling at each thread scheduling node in a pipeline, wherein the pipeline comprises a plurality of pipeline stages and a thread scheduling node is a time point before a pipeline stage; the microprocessor is configured with cache groups used for data transfer between the pipeline stages, and each cache group comprises a cache exclusive to each thread; the thread scheduling method comprises:
when a thread scheduling node arrives, obtaining a target thread based on a thread scheduling algorithm pre-configured for the thread scheduling node, wherein the thread scheduling algorithm is associated with a type of the thread scheduling node;
and scheduling the target thread;
wherein the types of the thread scheduling nodes comprise a first type and a second type; a first-type thread scheduling node is defined as a thread scheduling node for which the output target of the next pipeline stage does not contain a hardware resource shared by multiple threads; a second-type thread scheduling node is defined as a thread scheduling node for which the output target of the next pipeline stage contains a hardware resource shared by multiple threads;
the thread scheduling algorithm of a first-type thread scheduling node located before the second-type thread scheduling node comprises: obtaining, as the target thread, the thread corresponding to the emptiest cache in the cache group used for output in the pipeline stage following the first-type thread scheduling node; and the thread scheduling algorithm of a first-type thread scheduling node located after the second-type thread scheduling node comprises: obtaining, as the target thread, the thread corresponding to the fullest cache in the cache group used for output in the pipeline stage preceding the first-type thread scheduling node.
2. The thread scheduling method of claim 1, wherein the plurality of pipeline stages comprise an instruction dispatch stage, and the second-type thread scheduling node exists before entry into the instruction dispatch stage.
3. The thread scheduling method of claim 1 or 2, wherein the hardware resource shared by multiple threads comprises: an instruction queue which is shared by a plurality of threads for storing or fetching instructions.
4. The method of claim 3, wherein the instruction queue is configured to store instructions of threads after the instruction dispatch stage for execution during the instruction execution stage.
5. The thread scheduling method according to claim 1, wherein the manner of obtaining the thread corresponding to the emptiest cache in the cache group used for output in the pipeline stage following the first-type thread scheduling node comprises:
traversing each thread and judging whether a first preset condition is met, the first preset condition including: the thread's cache in the cache group used for output in the pipeline stage preceding the first-type thread scheduling node is not empty, and the thread's cache in the cache group used for output in the pipeline stage following the first-type thread scheduling node is the emptiest among the caches of the threads traversed so far;
and taking, as the target thread, the thread that meets the first preset condition after all threads have been traversed.
6. The thread scheduling method according to claim 1, wherein the manner of obtaining the thread corresponding to the fullest cache in the cache group used for output in the pipeline stage preceding the first-type thread scheduling node comprises:
traversing each thread and judging whether a second preset condition is met, the second preset condition including: the thread's cache in the cache group used for output in the pipeline stage preceding the first-type thread scheduling node is the fullest among the caches of the threads traversed so far;
and taking, as the target thread, the thread that meets the second preset condition after all threads have been traversed.
7. The thread scheduling method according to claim 6, wherein the second preset condition further comprises: the thread's cache in the cache group used for output in the pipeline stage following the first-type thread scheduling node is not full.
8. A thread scheduling configuration method, applied to the design of a microprocessor supporting a pipeline technology, for configuring the thread scheduling algorithm used by each thread scheduling node in a pipeline, wherein the pipeline comprises a plurality of pipeline stages and a thread scheduling node is a time point before a pipeline stage; the microprocessor is configured with cache groups used for data transfer between the pipeline stages, and each cache group comprises a cache exclusive to each thread; the thread scheduling configuration method comprises:
determining the type of each thread scheduling node in the pipeline, wherein the types of the thread scheduling nodes comprise a first type and a second type; a first-type thread scheduling node is defined as a thread scheduling node for which the output target of the next pipeline stage does not contain a hardware resource shared by multiple threads; a second-type thread scheduling node is defined as a thread scheduling node for which the output target of the next pipeline stage contains a hardware resource shared by multiple threads;
and configuring a thread scheduling algorithm for each thread scheduling node according to its type; wherein the thread scheduling algorithm of a first-type thread scheduling node located before the second-type thread scheduling node comprises: obtaining, as the target thread to be scheduled, the thread corresponding to the emptiest cache in the cache group used for output in the pipeline stage following the first-type thread scheduling node; and the thread scheduling algorithm of a first-type thread scheduling node located after the second-type thread scheduling node comprises: obtaining, as the target thread to be scheduled, the thread corresponding to the fullest cache in the cache group used for output in the pipeline stage preceding the first-type thread scheduling node.
9. The thread scheduling configuration method according to claim 8, wherein the manner of obtaining the thread corresponding to the emptiest cache in the cache group used for output in the pipeline stage following the first-type thread scheduling node comprises:
traversing each thread and judging whether a first preset condition is met, the first preset condition including: the thread's cache in the cache group used for output in the pipeline stage preceding the first-type thread scheduling node is not empty, and the thread's cache in the cache group used for output in the pipeline stage following the first-type thread scheduling node is the emptiest among the caches of the threads traversed so far;
and taking, as the target thread, the thread that meets the first preset condition after all threads have been traversed.
10. The thread scheduling configuration method according to claim 8, wherein the manner of obtaining the thread corresponding to the fullest cache in the cache group used for output in the pipeline stage preceding the first-type thread scheduling node comprises:
traversing each thread and judging whether a second preset condition is met, the second preset condition including: the thread's cache in the cache group used for output in the pipeline stage preceding the first-type thread scheduling node is the fullest among the caches of the threads traversed so far;
and taking, as the target thread, the thread that meets the second preset condition after all threads have been traversed.
11. The thread scheduling configuration method according to claim 10, wherein the second preset condition further comprises: the thread's cache in the cache group used for output in the pipeline stage following the first-type thread scheduling node is not full.
12. A microprocessor supporting a pipeline technology, the pipeline comprising a plurality of pipeline stages; the microprocessor is coupled with or comprises a memory, the memory comprises cache groups used for data transfer between the pipeline stages, and each cache group comprises a cache exclusive to each thread; the microprocessor runs executable program code to perform the thread scheduling method of any one of claims 1 to 7 so as to schedule a thread at each thread scheduling node.
13. A processing chip, comprising: at least one microprocessor according to claim 12.
14. A computer device, comprising: a memory and a processor; the memory stores executable program code which, when executed by the processor, performs the thread scheduling configuration method of any one of claims 8 to 11.
15. A computer-readable storage medium, characterized in that it stores executable program code which, when executed, performs the thread scheduling method of any one of claims 1 to 7 or the thread scheduling configuration method of any one of claims 8 to 11.
CN202011492666.5A 2020-12-16 2020-12-16 Thread scheduling method, configuration method, microprocessor, device and storage medium Active CN112612585B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011492666.5A CN112612585B (en) 2020-12-16 2020-12-16 Thread scheduling method, configuration method, microprocessor, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011492666.5A CN112612585B (en) 2020-12-16 2020-12-16 Thread scheduling method, configuration method, microprocessor, device and storage medium

Publications (2)

Publication Number Publication Date
CN112612585A true CN112612585A (en) 2021-04-06
CN112612585B CN112612585B (en) 2022-07-29

Family

ID=75240329

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011492666.5A Active CN112612585B (en) 2020-12-16 2020-12-16 Thread scheduling method, configuration method, microprocessor, device and storage medium

Country Status (1)

Country Link
CN (1) CN112612585B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1540498A * 2003-04-21 2004-10-27 Method and circuit for changing pipeline length in a simultaneous multithreading processor
US20130046954A1 (en) * 2011-08-17 2013-02-21 Michael Ruehle Multi-threaded dfa architecture
US20130332711A1 (en) * 2012-06-07 2013-12-12 Convey Computer Systems and methods for efficient scheduling of concurrent applications in multithreaded processors
WO2014108747A1 (en) * 2013-01-10 2014-07-17 Freescale Semiconductor, Inc. Integrated circuit processor and method of operating a integrated circuit processor
CN103970580A (en) * 2014-05-05 2014-08-06 华中科技大学 Data flow compilation optimization method oriented to multi-core cluster
US20150378731A1 (en) * 2014-06-30 2015-12-31 Patrick P. Lai Apparatus and method for efficiently implementing a processor pipeline
CN108984283A (en) * 2018-06-25 2018-12-11 复旦大学 A kind of adaptive dynamic pipeline parallel method

Also Published As

Publication number Publication date
CN112612585B (en) 2022-07-29

Similar Documents

Publication Publication Date Title
US9824003B2 (en) Dynamically resizable circular buffers
US9645819B2 (en) Method and apparatus for reducing area and complexity of instruction wakeup logic in a multi-strand out-of-order processor
US8418180B2 (en) Thread priority method for ensuring processing fairness in simultaneous multi-threading microprocessors
EP1023659B1 (en) Efficient processing of clustered branch instructions
JP5177141B2 (en) Arithmetic processing device and arithmetic processing method
CN113504985B (en) Task processing method and network equipment
JP2004171234A (en) Task allocation method in multiprocessor system, task allocation program and multiprocessor system
US9146745B2 (en) Method and apparatus for partitioned pipelined execution of multiple execution threads
WO2011155097A1 (en) Instruction issue and control device and method
CN111966406B (en) Method and device for scheduling out-of-order execution queue in out-of-order processor
US7590990B2 (en) Computer system
US11579885B2 (en) Method for replenishing a thread queue with a target instruction of a jump instruction
JP5861354B2 (en) Arithmetic processing device and control method of arithmetic processing device
CN110908716B (en) Method for implementing vector aggregation loading instruction
US9886278B2 (en) Computing architecture and method for processing data
GB2545307A (en) Soft-partitioning of a register file cache
CN113900712B (en) Instruction processing method, instruction processing apparatus, and storage medium
US11314516B2 (en) Issuing instructions based on resource conflict constraints in microprocessor
KR101770234B1 (en) Method and system for assigning a computational block of a software program to cores of a multi-processor system
CN112612585B (en) Thread scheduling method, configuration method, microprocessor, device and storage medium
EP2652597B1 (en) Method and apparatus for scheduling the issue of instructions in a microprocessor using multiple phases of execution
US20100100709A1 (en) Instruction control apparatus and instruction control method
CN112463218B (en) Instruction emission control method and circuit, data processing method and circuit
CN107832255B (en) Optimization method for dynamically requesting reconfigurable core during running
US20190310857A1 (en) Method of Concurrent Instruction Execution and Parallel Work Balancing in Heterogeneous Computer Systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40049843

Country of ref document: HK

GR01 Patent grant