CN118245188A - Thread control method and device, processor and computer readable storage medium - Google Patents


Info

Publication number
CN118245188A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410382007.8A
Other languages
Chinese (zh)
Inventor
陈静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Haiguang Information Technology Co Ltd
Original Assignee
Haiguang Information Technology Co Ltd

Landscapes

  • Executing Machine-Instructions (AREA)

Abstract

At least one embodiment of the present disclosure provides a thread control method and apparatus, a processor, and a computer-readable storage medium. The thread control method includes: for a plurality of threads executed by a processor, acquiring the occupancy state of physical registers in the processor by the plurality of threads; and, in response to the occupancy state meeting an overflow threshold condition, selecting a target thread of the plurality of threads, storing data in at least a portion of the physical registers occupied by the target thread into an expansion memory, and releasing the physical registers whose data has been stored into the expansion memory for use by threads of the plurality of threads other than the target thread. The thread control method enables multiple threads to share physical registers with a reasonable number of physical registers, meeting the requirements of main frequency, power consumption, and multi-thread performance.

Description

Thread control method and device, processor and computer readable storage medium
Technical Field
Embodiments of the present disclosure relate to a thread control method and apparatus, a processor, and a computer-readable storage medium.
Background
Modern multi-issue high-performance CPUs (Central Processing Units) include at least one processor core, each of which includes a plurality of execution units to execute instructions. For example, a pipelined instruction execution process includes five stages: instruction fetch (IF, Instruction Fetch), instruction decode (ID, Instruction Decode), execute (EX, Execute), memory access (MEM, Memory), and write back (WB, Write Back), in which the result of instruction execution is written to the register.
To increase parallelism of instruction execution in a processor, the processor may employ simultaneous multithreading (Simultaneous Multithreading, SMT) techniques, such that a pipeline structure for instruction execution (also referred to as a "pipeline") in the processor may support simultaneous execution of two or more (hardware) threads. For example, an SMT processor can be SMT2 (supporting at most two concurrent threads), SMT4 (supporting at most four concurrent threads), or SMT8 (supporting at most eight concurrent threads), etc. When a single-threaded processor runs, a plurality of execution units in the processor may be idle due to a pause caused by cache miss and the like, and for a processor adopting the SMT technology, even if a certain thread pauses, other threads can continuously use the hardware resources, so that the utilization rate of the hardware resources is improved, idle running can be avoided, and the throughput and the performance power consumption ratio of the processor are improved.
In multithreading, the physical registers are generally shared by multiple threads. The number of physical registers consumed (or occupied) by each thread is generally the number of logical registers of the thread, and free-running physical registers also exist while the processor is running. Thus the number of physical registers actually required under multithreading is very large. However, because of the requirements of high main frequency and low power consumption, it is generally difficult to design a physical register file that holds a large number of entries. This poses a challenge for sharing physical registers under multithreading. How to share registers among threads with a reasonable number of physical registers therefore becomes a problem to be solved.
Disclosure of Invention
At least one embodiment of the present disclosure provides a thread control method, including: for a plurality of threads executed by a processor, acquiring the occupancy state of physical registers in the processor by the plurality of threads; and, in response to the occupancy state meeting an overflow threshold condition, selecting a target thread of the plurality of threads, storing data in at least a portion of the physical registers occupied by the target thread into an expansion memory, and releasing the physical registers whose data has been stored into the expansion memory for use by threads of the plurality of threads other than the target thread.
For example, in accordance with a thread control method of at least one embodiment of the present disclosure, the overflow threshold condition includes a quantity threshold, wherein the quantity threshold is associated with a maximum number of physical registers in the processor that at least some of the plurality of threads may occupy.
For example, in accordance with a thread control method of at least one embodiment of the present disclosure, a quantity threshold indicates a maximum number of physical registers in a processor that at least a portion of threads may occupy, and acquiring occupancy states of physical registers in the processor by a plurality of threads includes: the number of physical registers in the processor occupied by at least some threads is counted.
For example, according to a thread control method of at least one embodiment of the present disclosure, counting the number of physical registers in the processor occupied by at least some of the threads includes: counting the number of valid mappings for each of the at least some threads, wherein a valid mapping indicates that one logical register corresponds to one valid physical register; and summing the counts to obtain the number of physical registers in the processor occupied by the at least some threads.
For example, in accordance with a thread control method of at least one embodiment of the present disclosure, the number threshold indicates a maximum number of physical registers in the processor that some or all of the threads may occupy.
For example, a thread control method according to at least one embodiment of the present disclosure further includes: determining that the occupancy state meets the overflow threshold condition in response to the acquired number of physical registers in the processor occupied by at least some of the threads being greater than or equal to the maximum number.
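The counting and comparison described above can be sketched in software as follows. This is only an illustrative model, not the hardware of the disclosure: the representation of a mapping table as a Python list (with `None` standing for an invalid mapping) and the names `count_valid_mappings`, `occupied_registers`, and `meets_overflow_threshold` are assumptions introduced here.

```python
from typing import Dict, List, Optional

# Hypothetical model: a thread's mapping table is indexed by logical register
# number and holds a physical register number (PRN), or None when the logical
# register has no valid mapping.
MappingTable = List[Optional[int]]


def count_valid_mappings(table: MappingTable) -> int:
    """Count the logical registers that map to a valid physical register."""
    return sum(1 for prn in table if prn is not None)


def occupied_registers(tables: Dict[int, MappingTable]) -> int:
    """Sum the per-thread valid-mapping counts."""
    return sum(count_valid_mappings(table) for table in tables.values())


def meets_overflow_threshold(tables: Dict[int, MappingTable], max_occupied: int) -> bool:
    """True when the occupied count is greater than or equal to the maximum number."""
    return occupied_registers(tables) >= max_occupied


# Two threads, keyed by thread number.
tables = {
    0: [3, 7, None, 1],     # three valid mappings
    1: [None, 4, 5, None],  # two valid mappings
}
total = occupied_registers(tables)
```

With these two tables the total is five occupied physical registers, so a maximum number of 5 meets the condition while a maximum number of 6 does not.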
For example, a thread control method in accordance with at least one embodiment of the present disclosure, the overflow threshold condition comprising a first time threshold, wherein the first time threshold indicates a maximum time that logical registers of at least some of the plurality of threads are not used, and acquiring occupancy states of physical registers in the processor by the plurality of threads, comprises: the unoccupied time of the logical registers of at least some threads is acquired.
For example, a thread control method according to at least one embodiment of the present disclosure further includes: determining that the occupancy state meets the overflow threshold condition in response to the acquired unoccupied time of the logical registers of at least some of the threads being greater than or equal to the first time threshold.
For example, according to a thread control method of at least one embodiment of the present disclosure, the logical registers of the at least some threads include a plurality of logical register sets, wherein acquiring the unoccupied time of the logical registers of the at least some threads includes: acquiring the unoccupied time of the logical registers within a target logical register set of the plurality of logical register sets; and wherein, in response to the occupancy state meeting the overflow threshold condition, selecting a target thread of the plurality of threads and storing data in at least a portion of the physical registers occupied by the target thread into the expansion memory includes: in response to the unoccupied time of the logical registers within the target logical register set being greater than or equal to the first time threshold, selecting the corresponding thread as the target thread, and storing data in at least some of the physical registers corresponding to the logical registers within the target logical register set of the target thread into the expansion memory.
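A minimal sketch of the time-based condition, assuming the unoccupied time is tracked as a count of operation cycles per (thread number, logical register set) pair. The dictionary layout and the function name `sets_to_spill` are hypothetical and serve only to illustrate the comparison against the first time threshold.

```python
from typing import Dict, List, Tuple

# Hypothetical key: (thread number, logical register set index).
SetKey = Tuple[int, int]


def sets_to_spill(idle_cycles: Dict[SetKey, int], first_time_threshold: int) -> List[SetKey]:
    """Return the register sets whose unoccupied time has reached the threshold.

    Each returned key names a target thread and the target logical register
    set whose physical registers are candidates for storing to expansion memory.
    """
    return sorted(
        key for key, idle in idle_cycles.items() if idle >= first_time_threshold
    )


idle_cycles = {
    (0, 0): 12,   # thread 0, set 0 has been unused for 12 cycles
    (0, 1): 250,
    (1, 0): 100,
    (1, 1): 3,
}
candidates = sets_to_spill(idle_cycles, first_time_threshold=100)
```

Here sets (0, 1) and (1, 0) qualify because their unoccupied time is greater than or equal to the assumed threshold of 100 cycles.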
For example, a thread control method according to at least one embodiment of the present disclosure further includes: in response to the target thread requiring re-execution of the instruction, data is filled from the expansion memory into currently available physical registers in the processor.
For example, a thread control method according to at least one embodiment of the present disclosure further includes: acquiring the time for which an instruction has been stalled, wherein the instruction is stalled in response to the occupancy state meeting the overflow threshold condition; and determining that the target thread needs to re-execute the instruction in response to the time for which the instruction has been stalled being greater than or equal to a second time threshold, wherein the second time threshold indicates the minimum time for which the instruction corresponding to the target thread is stalled.
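The stall-time check can be modeled as a simple per-thread counter. The sketch below assumes time is measured in operation cycles and that the counter resets when the instruction is no longer stalled; the class name `StallTimer` and its `tick` method are illustrative only.

```python
class StallTimer:
    """Counts consecutive stalled operation cycles for one thread (hypothetical model)."""

    def __init__(self, second_time_threshold: int) -> None:
        if second_time_threshold < 1:
            raise ValueError("threshold must be at least one cycle")
        self.threshold = second_time_threshold
        self.cycles = 0

    def tick(self, stalled: bool) -> bool:
        """Advance one cycle; return True when the thread should re-execute.

        Re-execution is signalled once the stalled time is greater than or
        equal to the second time threshold.
        """
        self.cycles = self.cycles + 1 if stalled else 0
        return self.cycles >= self.threshold


timer = StallTimer(second_time_threshold=3)
signals = [timer.tick(True) for _ in range(3)]
after_resume = timer.tick(False)
```

With a threshold of three cycles, the first two stalled cycles give no signal and the third does; a cycle without a stall resets the counter.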
For example, according to a thread control method of at least one embodiment of the present disclosure, a logical register of a target thread includes a plurality of logical register sets, wherein in response to the target thread requiring re-execution of an instruction, filling available physical registers from an expansion memory includes: in response to an instruction requiring use of a first logical register, only data corresponding to each logical register in a logical register set in which the first logical register is located is filled from the expansion memory into a currently available physical register.
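The set-granular fill can be sketched as follows, assuming four logical registers per set, an expansion memory modeled as a dictionary keyed by (thread number, logical register number), and a free list of physical register numbers. All names (`GROUP_SIZE`, `fill_group`) and the data layout are assumptions for illustration.

```python
from typing import Dict, List, Optional, Tuple

GROUP_SIZE = 4  # assumed number of logical registers per logical register set


def fill_group(
    thread: int,
    logical_reg: int,
    expansion: Dict[Tuple[int, int], int],
    mapping: List[Optional[int]],
    prf: Dict[int, int],
    free_list: List[int],
) -> List[int]:
    """Fill only the set that contains `logical_reg` back into free physical registers.

    Returns the logical register numbers that were filled.
    """
    start = (logical_reg // GROUP_SIZE) * GROUP_SIZE
    spilled = [lr for lr in range(start, start + GROUP_SIZE) if (thread, lr) in expansion]
    if len(free_list) < len(spilled):
        raise RuntimeError("not enough free physical registers to fill the set")
    for lr in spilled:
        prn = free_list.pop(0)                   # take a currently available PR
        prf[prn] = expansion.pop((thread, lr))   # move the data back
        mapping[lr] = prn                        # re-establish the valid mapping
    return spilled


expansion = {(0, 4): 40, (0, 5): 50, (0, 9): 90}
mapping: List[Optional[int]] = [None] * 16
prf: Dict[int, int] = {}
free_list = [10, 11, 12]
filled = fill_group(0, 5, expansion, mapping, prf, free_list)
```

An instruction that needs logical register 5 triggers a fill of registers 4 and 5 (the spilled members of set 1), while register 9, which belongs to another set, stays in the expansion memory.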
For example, in accordance with a thread control method of at least one embodiment of the present disclosure, an extended memory includes a plurality of memory addresses, each memory address dedicated to storing data corresponding to a logical register of a respective thread of the plurality of threads.
For example, according to a thread control method of at least one embodiment of the present disclosure, a memory address is accessed through a thread number and a logical register number.
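One way such dedicated addressing could work is a linear layout in which each thread owns a contiguous block of slots. This layout, the function name `slot_address`, and the parameter `num_logical` are assumptions; the disclosure only states that an address is accessed through a thread number and a logical register number.

```python
def slot_address(thread: int, logical_reg: int, num_logical: int) -> int:
    """Map (thread number, logical register number) to a dedicated slot index.

    Assumes every thread has `num_logical` logical registers and that slots
    are laid out thread by thread.
    """
    if thread < 0:
        raise ValueError("thread number must be non-negative")
    if not 0 <= logical_reg < num_logical:
        raise ValueError("logical register number out of range")
    return thread * num_logical + logical_reg


addr = slot_address(thread=2, logical_reg=3, num_logical=16)
```

Because each (thread, logical register) pair has its own slot, no allocation or tag lookup is needed when data is stored to or filled from the expansion memory under this layout.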
At least one embodiment of the present disclosure provides a thread control apparatus comprising: an acquisition module configured to acquire, for a plurality of threads executed by a processor, the occupancy state of physical registers in the processor by the plurality of threads; and a processing module configured to, in response to the occupancy state meeting an overflow threshold condition, select a target thread of the plurality of threads, store data in at least a portion of the physical registers occupied by the target thread into an expansion memory, and release the physical registers whose data has been stored into the expansion memory for use by threads of the plurality of threads other than the target thread.
For example, a thread control device in accordance with at least one embodiment of the present disclosure, the overflow threshold condition includes a quantity threshold, wherein the quantity threshold is associated with a maximum number of physical registers in a processor that at least some of the plurality of threads may occupy.
For example, a thread control device in accordance with at least one embodiment of the present disclosure, the overflow threshold condition comprises a first time threshold, wherein the first time threshold indicates a maximum time that a logical register of at least some of the plurality of threads is not used, and the acquisition module comprises: and the time acquisition module is configured to acquire the unoccupied time of the logic registers of at least part of the threads.
For example, in a thread control device in accordance with at least one embodiment of the present disclosure, the logical registers of the at least some threads include a plurality of logical register sets, and the time acquisition module includes: a target logical register set time acquisition module configured to acquire the unoccupied time of the logical registers within a target logical register set of the plurality of logical register sets; and the processing module includes: a target thread selection unit configured to select the corresponding thread as the target thread in response to the unoccupied time of the logical registers within the target logical register set being greater than or equal to the first time threshold; and an overflow storage control unit configured to store data in at least some of the physical registers corresponding to the logical registers within the target logical register set of the target thread into the expansion memory.
For example, in a thread control device in accordance with at least one embodiment of the present disclosure, the processing module further includes: a fill storage control unit configured to fill data from the expansion memory into currently available physical registers in the processor in response to the target thread requiring re-execution of the instruction.
For example, in a thread control device in accordance with at least one embodiment of the present disclosure, the logical registers of the target thread include a plurality of logical register sets, and the fill storage control unit includes: a logical register set fill storage control unit configured to, in response to the instruction requiring use of a first logical register, fill from the expansion memory into currently available physical registers only the data corresponding to each logical register in the logical register set in which the first logical register is located.
For example, in accordance with a thread control device of at least one embodiment of the present disclosure, an expansion memory is disposed around a physical register, and data in at least a portion of the physical register occupied by a target thread can be transferred through a direct data channel between the expansion memory and the physical register.
At least one embodiment of the present disclosure provides a thread control apparatus comprising at least one processing unit and a memory; wherein the memory stores computer readable instructions and is communicatively coupled to the at least one processing unit; the at least one processing unit is configured to execute the computer readable instructions stored in the memory to implement the thread control method as described above.
At least one embodiment of the present disclosure provides a processor including a thread control device as described above.
At least one embodiment of the present disclosure provides a computer-readable storage medium having computer-readable instructions stored therein that, when executed by a processor, implement a thread control method as described above.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings of the embodiments of the present disclosure will be briefly described below. It is apparent that the figures in the following description relate only to some embodiments of the present disclosure and are not limiting of the present disclosure.
FIG. 1 illustrates a schematic diagram of a pipeline of an exemplary processor core;
FIG. 2 illustrates a schematic diagram of an exemplary processor pipeline;
FIG. 3 illustrates a schematic diagram of an exemplary pipeline of a processor involving out-of-order execution, register renaming;
FIG. 4 illustrates a flow diagram of a thread control method in accordance with at least one embodiment of the present disclosure;
FIG. 5 illustrates a schematic diagram of a thread control device in accordance with at least one embodiment of the present disclosure;
FIG. 6 illustrates a schematic diagram of another thread control device in accordance with at least one embodiment of the present disclosure;
FIGS. 7A-7E illustrate logical register-to-physical register mapping tables for four threads in accordance with at least one embodiment of the present disclosure;
FIG. 8 illustrates a schematic diagram of an anomaly detection module in accordance with at least one embodiment of the present disclosure;
FIG. 9 illustrates a schematic diagram of a logical register grouping in accordance with at least one embodiment of the present disclosure;
FIG. 10 illustrates a schematic diagram of the storage of a thread in expansion memory in accordance with at least one embodiment of the present disclosure;
FIG. 11 illustrates a schematic diagram of another thread control device in accordance with at least one embodiment of the present disclosure;
FIG. 12 illustrates a schematic diagram of a computer-readable storage medium in accordance with at least one embodiment of the present disclosure;
FIG. 13 illustrates a schematic diagram of an electronic device in accordance with at least one embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to the specific embodiments of the present disclosure, examples of which are illustrated in the accompanying drawings. While the disclosure will be described in conjunction with the specific embodiments, it will be understood that it is not intended to limit the disclosure to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the disclosure as defined by the appended claims. It should be noted that the method operations described herein may be implemented by any functional block or arrangement of functions, and that any functional block or arrangement of functions may be implemented as a physical entity or a logical entity, or a combination of both.
In order that those skilled in the art will better understand the present disclosure, the present disclosure will be described in further detail below with reference to the accompanying drawings and detailed description.
Note that the examples to be presented below are only specific examples and are not intended to limit the embodiments of the present disclosure to the particular shapes, hardware, connection relationships, operations, values, conditions, data, sequences, etc., shown and described. Those skilled in the art can, upon reading the present specification, utilize the concepts of the present disclosure to construct additional embodiments not described in the present specification.
The terms used in the present disclosure are those general terms that are currently widely used in the art in view of the functions of the present disclosure, but may vary according to the intention, precedent, or new technology in the art of the person of ordinary skill in the art. Furthermore, specific terms may be selected by the applicant, and in this case, their detailed meanings will be described in the detailed description of the present disclosure. Accordingly, the terms used in the specification should not be construed as simple names, but rather based on the meanings of the terms and the general description of the present disclosure.
A flowchart is used in this disclosure to describe the operations performed by a system according to embodiments of the present application. It should be understood that the preceding or following operations are not necessarily performed in order precisely. Rather, the various steps may be processed in reverse order or simultaneously, as desired. Also, other operations may be added to or removed from these processes.
The abbreviations and related terms involved in the present application are first defined and explained.
Spill out refers to the process of storing data in a physical register into the expansion memory.
Fill in refers to the process of filling data in the expansion memory into physical registers.
It is to be understood that the terminology defined above is for the purpose of describing particular embodiments only, and is not intended to be limiting.
To improve the performance of the processor core, the processor core may use a pipeline manner, that is, an instruction is divided into a plurality of pipeline stages from the whole process of being extracted, decoded, executed and result written back, and one instruction can only be in one pipeline stage in one operation cycle; the processor core may run multiple instructions in different pipeline stages.
FIG. 1 illustrates a schematic diagram of an exemplary pipeline of processor cores, with the dashed lines with arrows representing redirected instruction flow.
As shown in FIG. 1, a processor core (e.g., a CPU core) of a single-core or multi-core processor improves instruction-level parallelism (Instruction Level Parallelism) through pipelining. The pipeline within the processor core includes a plurality of pipeline stages. For example, after program counters from various sources are fed into the pipeline, a next program counter (PC) is selected by a multiplexer (Mux), and the instruction corresponding to that program counter goes through branch prediction (Branch Prediction), instruction fetch (Instruction Fetch), instruction decode (Decode), instruction dispatch and renaming (Dispatch and Rename), instruction execution (Execute), instruction retirement (Retire), and so on. Wait queues, typically first-in-first-out (FIFO) queues, are provided as needed between the pipeline stages. For example, after the branch prediction unit, a branch prediction (BP) FIFO queue is provided to store branch prediction results; after the instruction fetch unit, an instruction cache (Instruction Cache, IC) FIFO is provided to buffer fetched instructions; after the instruction decode unit, a decode (DE) FIFO is provided to buffer decoded instructions; after the instruction dispatch and rename unit, a retire (RT) FIFO is provided to buffer instructions that have been executed and are waiting for confirmation of retirement. The pipeline of the processor core also includes an instruction queue that buffers instructions which, after dispatch and renaming, are waiting for the instruction execution units to execute them.
FIG. 2 illustrates a schematic diagram of an exemplary processor pipeline.
Referring to FIG. 2, each stage in the pipeline may take in 4 instructions in parallel, e.g., instructions 1-4 are processed in parallel, instructions 5-8 are processed in parallel, and instructions 9-12 are processed in parallel. Compared with the conventional scalar pipeline of FIG. 1, the number of instructions executed on average per clock cycle is greater than 1, i.e., its instruction-level parallelism is greater than 1.
For example, superscalar processors may further support out-of-order execution. Out-of-order execution refers to a technique employed by a CPU that allows multiple instructions to be sent to the corresponding circuit units for processing out of the order specified by the program. Out-of-order execution involves a number of algorithms, which are basically designed around reservation stations. The core idea of a reservation station is to send each decoded instruction to the reservation station for its instruction type, where it is held; once all operands of the instruction are ready, the instruction can be issued out of order.
FIG. 3 illustrates a schematic diagram of an exemplary pipeline of a processor involving out-of-order execution, register renaming.
As shown in FIG. 3, after each instruction (e.g., a branch instruction) is fetched from the instruction cache according to the program counter (PC) ("fetch"), the fetched instruction is decoded ("decode"); a register renaming operation ("rename") is performed on the decoded instruction to remove false dependencies between instructions, and the renamed instruction is sent on for out-of-order execution; the instruction then enters an instruction dispatch module, which decides when to dispatch the instruction to which execution unit ("out-of-order dispatch"). For example, in different operation cycles, instructions are dispatched to different execution units (e.g., an arithmetic logic unit (ALU), a multiplication unit (MUL), a division unit (DIV), a load and store unit (LSU), etc.) corresponding to different ports (e.g., port 0, port 1, port 2, port 3, etc.). While register renaming is performed, the instruction also enters an instruction commit unit, which records the original fetch order of the instructions being processed in the pipeline. After instruction execution finishes, the instruction commit unit commits the instructions in the original fetch order, and it may also update the branch prediction unit with the actual execution information of branch instructions.
As described above, possible WAW (write after write) and WAR (write after read) pipeline conflicts may be resolved by register renaming. Without increasing the number of general-purpose registers, this technique defines additional physical registers (PR) within the processor, while the registers defined in the instruction set are referred to as architectural registers (AR). The physical registers actually exist in the processor and are referred to, for example, as a physical register file (Physical Register File, PRF). The processor dynamically maps the architectural registers AR to the physical registers PR to resolve WAW and WAR dependencies; it completes register renaming by maintaining an architectural register mapping table and a physical register free list. When the processor renames the architectural registers used in the current instruction, both the source registers and the destination register in the instruction (all of which are logical registers) need to be processed. For a source register, the processor looks up the architectural register mapping table to find the corresponding PR number (PRN). For the destination register, the processor reads a PR number from the physical register free list, establishes a mapping between that PR number and the destination register, and writes the mapping into the architectural register mapping table. If the free list is empty, the pipeline of the processor has to wait until an instruction retires and releases a PR. The architectural register mapping table is implemented, for example, by a hardware structure such as a storage device (e.g., a cache or registers).
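The renaming procedure just described (source lookup in the mapping table, destination allocation from the free list, and stalling when the free list is empty) can be illustrated with a small software model. The class name `Renamer`, its methods, and the choice to identity-map the architectural registers at reset are assumptions made for this sketch.

```python
from collections import deque
from typing import List, Optional, Tuple


class Renamer:
    """Toy register renamer with a mapping table and a free list (illustrative only)."""

    def __init__(self, num_arch: int, num_phys: int) -> None:
        # Assumed reset state: AR i maps to PR i, and the remaining PRs are free.
        self.map_table: List[int] = list(range(num_arch))
        self.free_list = deque(range(num_arch, num_phys))

    def rename(self, dst: int, srcs: List[int]) -> Optional[Tuple[int, List[int], int]]:
        """Rename one instruction.

        Returns (new destination PRN, source PRNs, previous destination PRN),
        or None when the free list is empty and the pipeline has to wait.
        """
        src_prns = [self.map_table[s] for s in srcs]  # sources read the current mapping
        if not self.free_list:
            return None
        new_prn = self.free_list.popleft()
        old_prn = self.map_table[dst]
        self.map_table[dst] = new_prn                 # record the new mapping
        return new_prn, src_prns, old_prn

    def retire(self, old_prn: int) -> None:
        """Release the PR that held the previous value of a retired destination."""
        self.free_list.append(old_prn)


r = Renamer(num_arch=4, num_phys=6)
first = r.rename(dst=1, srcs=[1, 2])   # first write to AR1
second = r.rename(dst=1, srcs=[1])     # second write to AR1 gets a different PR
stalled = r.rename(dst=0, srcs=[])     # free list is empty now
r.retire(1)                            # retirement releases PR1
resumed = r.rename(dst=0, srcs=[1])
```

The two writes to AR1 land in different physical registers (PR4 and PR5), which is how the WAW conflict disappears; the third rename has to wait until a retirement returns a register to the free list.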
In addition, to improve parallelism of instruction execution in a processor, the processor may also employ a Simultaneous Multithreading (SMT) technique, and a pipeline structure of the processor for instruction execution (also referred to as a "pipeline") may support two or more (hardware) threads to execute simultaneously, for example, SMT2 (supporting at most two concurrent threads), SMT4 (supporting at most four concurrent threads), or SMT8 (supporting at most eight concurrent threads). In a pipeline of a processor supporting simultaneous multithreading, computing resources are shared by multiple threads, e.g., each thread may have a separate logical register, but multiple physical registers within the processor are shared by multiple threads; among the queues for the various control functions of the pipeline, some may be shared by multiple threads, such as instruction dispatch queues, and others may be statically partitioned among multiple threads, such as instruction reorder queues. Meanwhile, the multithreading technology can improve the utilization rate of pipeline resources by utilizing the parallelism among threads.
Further, processors employing Simultaneous Multithreading (SMT) techniques may operate in an SMT mode as well as in a single threaded mode. For example, in the SMT mode, when one thread encounters a waiting state, the other threads can continue to execute, so that the utilization rate of hardware resources can be effectively improved, and the multithreading processing capacity, the overall performance and the performance power consumption ratio of the CPU core are further enhanced.
In this disclosure, an "operation cycle" may be, for example, a clock cycle or a machine cycle, or another time period in which one beat of operation is completed in the pipeline of a processor. The execution of an instruction in each thread includes several stages, each of which completes a basic operation (e.g., instruction fetch, memory read, memory write, etc.); the time required to complete a basic operation is referred to as a machine cycle, also referred to as a CPU cycle.
The inventor of the present disclosure notes that existing multithreading technology adopts one of two approaches: either a larger physical register file is designed to meet the maximum demand under multithreading, at the cost of frequency and area; or a fixed physical register file is allocated to each thread, which gives up sharing and sacrifices performance. Designing a larger physical register file is easy to implement when few threads are supported, but with four threads it sacrifices main frequency and power consumption and ultimately also hurts processor performance. Allocating a fixed physical register file (i.e., a fixed number of registers) to each thread avoids the low-frequency problem, but the four threads then cannot share physical registers, which seriously affects processor performance.
At least one embodiment of the present disclosure provides a thread control method and apparatus, a processor, and a computer readable storage medium that enable multiple threads to share physical registers, even with a reasonable number of physical registers, meeting the requirements of main frequency, power consumption, and multi-threaded performance.
FIG. 4 illustrates a flow diagram of a thread control method 400 in accordance with at least one embodiment of the present disclosure. The thread control method 400 described with reference to FIG. 4, and additional aspects thereof, may be implemented in a thread control device, an electronic device, a hardware architecture, a software architecture, or a combination of hardware and software as described below.
Referring to fig. 4, the thread control method 400 includes steps S410 to S420.
In step S410, for a plurality of threads executed by a processor, occupancy states of physical registers in the processor by the plurality of threads are acquired.
In step S420, in response to the occupancy state meeting the overflow threshold condition, a target thread of the plurality of threads is selected, data in at least a portion of the physical registers occupied by the target thread is stored in the expansion memory, and the physical registers whose data has been stored in the expansion memory are released for use by other threads of the plurality of threads other than the target thread.
For example, the processor herein may be an SMT processor (e.g., SMT4 processor, SMT8 processor, etc.) so that multiple threads may be supported. The "target thread" refers to the thread being described, and may be any one of the plurality of threads.
It will be appreciated that execution of an instruction requires physical registers, and that the "occupancy state" herein may be used to indicate or characterize the physical registers in the processor that a thread occupied in a previous operation cycle and has not yet released.
As described above, a thread control method in accordance with at least one embodiment of the present disclosure stores data in physical registers of a partial thread to an expansion memory based on occupancy states of the physical registers in a processor satisfying an overflow threshold condition, freeing the physical registers for use by other threads. Thus, a thread control method according to at least one embodiment of the present disclosure may enable multiple threads to share a reasonable number of physical registers, meeting the requirements of main frequency, power consumption, and multi-threaded performance.
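As a non-limiting illustration, steps S410 and S420 can be modeled in software as follows. This is only a minimal sketch under stated assumptions, not the hardware implementation of the disclosure: the names `spill_out`, `control_step`, `condition_met`, and `pick_target`, and the dictionary-based register file and expansion memory, are hypothetical and introduced here for explanation only.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical model: thread -> {logical register: physical register number or None}
MappingTables = Dict[int, Dict[str, Optional[int]]]


def spill_out(maps: MappingTables, regfile: Dict[int, int],
              ext_mem: Dict[Tuple[int, str], int],
              free_list: List[int], target: int) -> None:
    """Store the target thread's register data and release its physical registers."""
    for lreg, preg in maps[target].items():
        if preg is None:
            continue                              # no valid mapping, nothing to store
        ext_mem[(target, lreg)] = regfile[preg]   # store data into the expansion memory
        maps[target][lreg] = None                 # mapping becomes Invalid
        free_list.append(preg)                    # register is usable by other threads


def control_step(maps: MappingTables, regfile: Dict[int, int],
                 ext_mem: Dict[Tuple[int, str], int], free_list: List[int],
                 condition_met: Callable[[MappingTables], bool],
                 pick_target: Callable[[MappingTables], int]) -> Optional[int]:
    """S410: evaluate the occupancy state; S420: spill a target thread if needed."""
    if not condition_met(maps):
        return None
    target = pick_target(maps)
    spill_out(maps, regfile, ext_mem, free_list, target)
    return target
```

In this sketch the overflow threshold condition and the target selection are passed in as callables, so that any of the threshold conditions and arbitration schemes described in this disclosure could be substituted.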
Some exemplary additional aspects of thread control methods in accordance with at least one embodiment of the present disclosure are described below.
For example, in accordance with a thread control method of at least one embodiment of the present disclosure, the overflow threshold condition includes a quantity threshold, wherein the quantity threshold is associated with a maximum number of physical registers in the processor that at least some of the plurality of threads may occupy.
As such, a thread control method in accordance with at least one embodiment of the present disclosure may determine whether the occupancy state meets an overflow threshold condition by setting a quantity threshold.
For example, in accordance with a thread control method of at least one embodiment of the present disclosure, a quantity threshold indicates a maximum number of physical registers in a processor that at least a portion of threads may occupy, and acquiring occupancy states of physical registers in the processor by a plurality of threads includes: the number of physical registers in the processor occupied by at least some threads is counted.
As such, the thread control method according to at least one embodiment of the present disclosure may determine whether the occupancy state satisfies the overflow threshold condition by setting the number threshold corresponding to the maximum number described above.
For example, according to a thread control method of at least one embodiment of the present disclosure, counting the number of physical registers in a processor occupied by at least a portion of a thread includes: counting the number of valid mappings for each thread in at least a portion of the threads, wherein a valid mapping indicates that one logical register corresponds to one valid physical register; and summing the counts as a number of physical registers in the processor occupied by at least a portion of the threads.
As such, a thread control method in accordance with at least one embodiment of the present disclosure may enable acquisition of an occupancy state by counting the number of valid mappings of threads.
For example, in accordance with a thread control method of at least one embodiment of the present disclosure, a quantity threshold indicates a maximum ratio of physical registers in a processor that at least a portion of threads may occupy, and acquiring occupancy states of physical registers in the processor by a plurality of threads includes: the ratio of physical registers in the processor occupied by at least some of the threads is obtained.
As such, a thread control method according to at least one embodiment of the present disclosure may determine whether the occupancy state satisfies the overflow threshold condition by setting a ratio threshold corresponding to the above.
In some examples, obtaining the ratio of physical registers in the processor occupied by at least some threads may include the steps of: counting the number of physical registers in the processor occupied by the at least some threads; and dividing that number by the total number of physical registers in the processor available to the plurality of threads.
In some examples, the number threshold may be set in several ways, including as the maximum number or the maximum ratio described above, which ensures flexibility in setting the number threshold.
For example, in accordance with a thread control method of at least one embodiment of the present disclosure, the number threshold indicates a maximum number of physical registers in the processor that some or all of the threads may occupy.
As such, a thread control method in accordance with at least one embodiment of the present disclosure may enable acquisition of an occupancy state by setting a number threshold for all threads (mode one) or a number threshold for some threads (mode two).
In some examples, in an SMT4 processor, the above-described mode one and mode two may be provided, where the partial threads of mode two may be 2 threads (e.g., thread 0 and thread 2, or thread 1 and thread 3, to name a few), or other numbers of partial threads.
For example, in accordance with a thread control method of at least one embodiment of the present disclosure, in response to the number of physical registers in the processor occupied by at least some of the threads being acquired being greater than or equal to a maximum number, it is determined that the occupancy state satisfies an overflow threshold condition.
Thus, the thread control method according to at least one embodiment of the present disclosure may determine that the occupancy state satisfies the overflow threshold condition by the obtained number and the maximum number.
For example, in accordance with a thread control method of at least one embodiment of the present disclosure, in response to a ratio of physical registers in a processor occupied by at least a portion of the threads being acquired being greater than or equal to a maximum ratio, it is determined that an occupancy state satisfies an overflow threshold condition.
As such, the thread control method according to at least one embodiment of the present disclosure may determine that the occupancy state satisfies the overflow threshold condition by the above-described acquired ratio and the above-described maximum ratio.
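The count-based and ratio-based determinations described above can be sketched as follows. This is a minimal illustration only; the function names and the representation of a mapping table as a dictionary whose value is `None` for an Invalid mapping are assumptions made here, not part of the disclosure.

```python
from typing import Dict, Iterable, Optional

# Hypothetical mapping tables: thread -> {logical register: physical register or None}
MappingTables = Dict[int, Dict[str, Optional[int]]]


def occupied_count(maps: MappingTables, threads: Iterable[int]) -> int:
    """Count valid mappings per thread, then sum the counts over the chosen threads."""
    return sum(
        sum(1 for preg in maps[t].values() if preg is not None)
        for t in threads
    )


def count_condition_met(maps: MappingTables, threads: Iterable[int],
                        max_count: int) -> bool:
    """Overflow threshold condition expressed as a maximum number."""
    return occupied_count(maps, threads) >= max_count


def ratio_condition_met(maps: MappingTables, threads: Iterable[int],
                        total_pregs: int, max_ratio: float) -> bool:
    """Overflow threshold condition expressed as a maximum ratio."""
    return occupied_count(maps, threads) / total_pregs >= max_ratio
```

The `threads` argument selects either all threads or only some of them, which corresponds to the "at least some threads" wording above.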
For example, in a thread control method in accordance with at least one embodiment of the present disclosure, the overflow threshold condition comprises a first time threshold, wherein the first time threshold indicates a maximum time that logical registers of at least some of the plurality of threads are not used, and acquiring occupancy states of physical registers in the processor by the plurality of threads comprises: acquiring the unoccupied time of the logical registers of the at least some threads.
As such, a thread control method in accordance with at least one embodiment of the present disclosure may determine whether the occupancy state meets an overflow threshold condition by setting a time threshold. In this way, a thread control method in accordance with at least one embodiment of the present disclosure may free physical registers of logical registers that are unused for a long time for use by other threads.
For example, according to a thread control method of at least one embodiment of the present disclosure, acquiring the unoccupied time of a logical register of at least a portion of a thread includes: the time that the instructions of at least some threads are unused after being dispatched is counted.
As such, a thread control method in accordance with at least one embodiment of the present disclosure may obtain the unoccupied time of a logical register of at least a portion of a thread. Of course, the embodiment is not limited thereto, and the above-described unoccupied time may be acquired by other means or from external indication information, for example.
For example, a thread control method according to at least one embodiment of the present disclosure further includes: in response to the acquired unoccupied time of the logical registers of at least a portion of the threads being greater than or equal to the first time threshold, determining that the occupancy state satisfies the overflow threshold condition.
As such, a thread control method in accordance with at least one embodiment of the present disclosure may determine that the occupancy state meets the overflow threshold condition by comparing the acquired unoccupied time with the first time threshold.
For example, according to a thread control method of at least one embodiment of the present disclosure, the logical registers of at least a portion of the threads include a plurality of logical register sets, wherein acquiring the unoccupied time of the logical registers of at least a portion of the threads includes: acquiring the unoccupied time of the logical registers within a target logical register set of the plurality of logical register sets, and wherein selecting a target thread of the plurality of threads and storing data in at least a portion of the physical registers occupied by the target thread into the expansion memory, in response to the occupancy state satisfying the overflow threshold condition, includes: in response to the unoccupied time of the logical registers within the target logical register set being greater than or equal to the first time threshold, selecting the corresponding thread as the target thread, and storing data in at least part of the physical registers corresponding to the logical registers in the target logical register set of the target thread into the expansion memory.
As such, a thread control method according to at least one embodiment of the present disclosure stores data of a corresponding physical register into an expansion memory with respect to a logical register group included in a logical register of a thread. In this way, the thread control method according to at least one embodiment of the present disclosure may store data of physical registers into the expansion memory with finer granularity than storing data of corresponding physical registers into the expansion memory for all logical registers of a thread as objects, thereby avoiding unnecessary occupation of physical registers and further improving performance of a processor.
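The time-based condition at logical register set granularity can be sketched as a per-set idle counter. The following is a hypothetical software model only: the class `IdleTracker`, its methods, and the choice of one operation cycle as the time unit are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Set, Tuple

Key = Tuple[int, str]  # (thread number, logical register set name)


class IdleTracker:
    """Tracks, per (thread, set), how many cycles the set has gone unused."""

    def __init__(self, threads: Iterable[int], groups: Iterable[str]) -> None:
        groups = list(groups)
        self.idle: Dict[Key, int] = {(t, g): 0 for t in threads for g in groups}

    def tick(self, used: Set[Key]) -> None:
        """Advance one cycle; `used` holds the sets touched by dispatched instructions."""
        for key in self.idle:
            self.idle[key] = 0 if key in used else self.idle[key] + 1

    def spill_candidates(self, first_time_threshold: int) -> List[Key]:
        """Sets whose unoccupied time is greater than or equal to the threshold."""
        return sorted(k for k, v in self.idle.items() if v >= first_time_threshold)
```

A thread owning one of the returned sets could then be selected as the target thread, and only the physical registers backing that set would be stored into the expansion memory.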
For example, a thread control method according to at least one embodiment of the present disclosure further includes: in response to the target thread requiring re-execution of the instruction, data is filled from the expansion memory into currently available physical registers in the processor.
Thus, the thread control method according to at least one embodiment of the present disclosure may resume execution of the target thread when the target thread needs to re-execute instructions.
For example, a thread control method according to at least one embodiment of the present disclosure further includes: obtaining the time for which an instruction is stalled, wherein the instruction is stalled in response to the occupancy state meeting the overflow threshold condition; and determining that the target thread needs to re-execute the instruction in response to the time for which the instruction is stalled being greater than or equal to a second time threshold, wherein the second time threshold indicates a minimum time for the instruction corresponding to the target thread to be stalled.
As such, a thread control method in accordance with at least one embodiment of the present disclosure may determine that a target thread needs to re-execute instructions by setting a time threshold. However, embodiments are not limited in this regard, as it may be determined that the target thread needs to re-execute instructions by other means or external indication information, for example.
For example, according to a thread control method of at least one embodiment of the present disclosure, a logical register of a target thread includes a plurality of logical register sets, wherein in response to the target thread requiring re-execution of an instruction, filling available physical registers from an expansion memory includes: in response to an instruction requiring use of a first logical register, only data corresponding to each logical register in a logical register set in which the first logical register is located is filled from the expansion memory into a currently available physical register.
As such, a thread control method according to at least one embodiment of the present disclosure targets a logical register group included in a logical register of a thread to fill data corresponding to each logical register from an expansion memory into a currently available physical register. In this way, the thread control method according to at least one embodiment of the present disclosure can fill data required for an instruction to be executed into available physical registers with finer granularity than filling data corresponding to each logical register from an expansion memory into currently available physical registers for a thread, avoiding unnecessary occupation of physical registers, and thus can further improve the performance of a processor.
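The set-granular fill can be illustrated by the following minimal sketch. It assumes a hypothetical free list of currently available physical registers and a dictionary-based expansion memory keyed by thread number and logical register name; `fill_in_group` is an illustrative name only, and the sketch assumes that enough free physical registers exist.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def fill_in_group(thread: int, group_regs: Iterable[str],
                  mapping: Dict[str, Optional[int]],
                  regfile: Dict[int, int],
                  ext_mem: Dict[Tuple[int, str], int],
                  free_list: List[int]) -> None:
    """Restore only the logical registers of one logical register set."""
    for lreg in group_regs:
        if mapping.get(lreg) is not None or (thread, lreg) not in ext_mem:
            continue                      # already mapped, or nothing was stored
        preg = free_list.pop(0)           # take a currently available physical register
        regfile[preg] = ext_mem.pop((thread, lreg))  # move data back (modeling choice)
        mapping[lreg] = preg              # mapping becomes valid again
```

Registers outside the requested set keep their Invalid mapping and consume no physical register until an instruction actually needs them.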
For example, in accordance with a thread control method of at least one embodiment of the present disclosure, an extended memory includes a plurality of memory addresses, each memory address dedicated to storing data corresponding to a logical register of a respective thread of the plurality of threads.
In this way, according to the thread control method of at least one embodiment of the present disclosure, a storage location dedicated to each logical register of each thread may be set in the expansion memory, so as to implement dedicated storage of the corresponding data, thereby avoiding a miss in data access.
In some examples, when data of a physical register of a logical register in a target thread is stored into the expansion memory, the data may be stored at the memory address in the expansion memory that is dedicated to that logical register.
For example, according to a thread control method of at least one embodiment of the present disclosure, a memory address is accessed through a thread number and a logical register number.
As such, a thread control method according to at least one embodiment of the present disclosure may enable simple data access.
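One simple way to derive a dedicated storage address from a thread number and a logical register number is a linear index, sketched below. The constant `NUM_LOGICAL_REGS = 64` (for example, 32 low halves plus 32 high halves per thread) and the function name are assumptions for illustration; the disclosure does not specify the address layout.

```python
NUM_LOGICAL_REGS = 64  # assumed number of logical register entries per thread


def ext_mem_address(thread_id: int, logical_reg: int) -> int:
    """Map (thread number, logical register number) to a dedicated slot index."""
    if thread_id < 0:
        raise ValueError("thread number must be non-negative")
    if not 0 <= logical_reg < NUM_LOGICAL_REGS:
        raise ValueError("logical register number out of range")
    return thread_id * NUM_LOGICAL_REGS + logical_reg
```

Because every (thread, logical register) pair has its own slot, a lookup never has to search or compare tags, which is consistent with the absence of misses noted above.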
Corresponding to the thread control method 400 according to at least one embodiment of the present disclosure, at least one embodiment of the present disclosure also provides a thread control device.
Fig. 5 illustrates a schematic diagram of a thread control device 500 in accordance with at least one embodiment of the present disclosure.
Referring to fig. 5, a thread control device 500 in accordance with at least one embodiment of the present disclosure includes an acquisition module 510 and a processing module 520.
The acquisition module 510 is configured to acquire, for a plurality of threads executed by a processor, occupancy states of physical registers in the processor by the plurality of threads.
The processing module 520 is configured to, in response to the occupancy state meeting the overflow threshold condition, select a target thread of the plurality of threads, store data in at least a portion of the physical registers occupied by the target thread into the expansion memory, and release the physical registers whose data has been stored in the expansion memory for use by other threads of the plurality of threads than the target thread.
As described above, a thread control device in accordance with at least one embodiment of the present disclosure stores data in physical registers of a partial thread to an expansion memory based on the occupancy state of the physical registers in a processor satisfying an overflow threshold condition, freeing the physical registers for use by other threads. As such, a thread control device in accordance with at least one embodiment of the present disclosure may enable multiple threads to share a reasonable number of physical registers, meeting the requirements of main frequency, power consumption, and multi-threaded performance.
The additional aspects of the thread control apparatus 500 described above in accordance with at least one embodiment of the present disclosure may correspond to additional aspects of the thread control method 400 in accordance with at least one embodiment of the present disclosure, for example, the acquisition module 510 and the processing module 520 in the thread control apparatus 500 may be modified or other modules may be added to implement additional aspects of the thread control method 400 in accordance with at least one embodiment of the present disclosure. Accordingly, technical effects of additional aspects of the thread control method 400 according to at least one embodiment of the present disclosure may also be mapped to additional aspects of the thread control apparatus 500 according to at least one embodiment of the present disclosure, which are not described in detail herein.
For example, in a thread control device in accordance with at least one embodiment of the present disclosure, the overflow threshold condition includes a quantity threshold, wherein the quantity threshold is associated with a maximum number of physical registers in a processor that at least some of the plurality of threads may occupy.
For example, in a thread control device in accordance with at least one embodiment of the present disclosure, the overflow threshold condition comprises a first time threshold, wherein the first time threshold indicates a maximum time that a logical register of at least some of the plurality of threads is not used, and the acquisition module comprises: a time acquisition module configured to acquire the unoccupied time of the logical registers of the at least some threads.
For example, in a thread control device in accordance with at least one embodiment of the present disclosure, the logical registers of at least a portion of the threads comprise a plurality of logical register sets, and the time acquisition module comprises: a target logical register set time acquisition module configured to acquire the unoccupied time of the logical registers within a target logical register set of the plurality of logical register sets; and the processing module comprises: a target thread selection unit configured to select the corresponding thread as the target thread in response to the unoccupied time of the logical registers within the target logical register set being greater than or equal to the first time threshold; and an overflow storage control unit configured to store data in at least part of the physical registers corresponding to the logical registers in the target logical register set of the target thread into the expansion memory.
For example, in a thread control device in accordance with at least one embodiment of the present disclosure, the processing module further comprises: a fill storage control unit configured to fill data from the expansion memory into currently available physical registers in the processor in response to the target thread requiring re-execution of the instruction.
For example, in a thread control device in accordance with at least one embodiment of the present disclosure, the logical registers of the target thread include a plurality of logical register sets, and the fill storage control unit comprises: a logical register set fill storage control unit configured to, in response to the instruction requiring use of a first logical register, fill only the data corresponding to each logical register in the logical register set in which the first logical register is located from the expansion memory into currently available physical registers.
For example, in accordance with a thread control device of at least one embodiment of the present disclosure, an expansion memory is disposed around a physical register, and data in at least a portion of the physical register occupied by a target thread can be transferred through a direct data channel between the expansion memory and the physical register.
One or more exemplary aspects described above in connection with fig. 4 and 5 are described below in connection with example application scenarios. It will be appreciated that the example application scenarios described below are merely examples and are not intended to be limiting, and that one or more aspects described above in connection with fig. 4 and 5 are intended to be implemented in a particular application scenario, and that aspects described below in connection with the example application scenarios may be combined with one or more aspects described above in connection with fig. 4 and 5.
Fig. 6 illustrates a schematic diagram of another thread control device 600 in accordance with at least one embodiment of the present disclosure. It is to be appreciated that the thread control apparatus 600 is merely an example thread control apparatus illustrated in an example application scenario for ease of understanding, and that various modules therein may be combined, deleted, or other modules added.
Referring to fig. 6, to enable sharing of physical registers among multiple threads (hereinafter illustrated with four threads, although embodiments of the present disclosure are not limited thereto), the thread control device 600 stores data of the physical registers of a thread into expansion memory 606 through exception detection module 602 (e.g., in which the above acquisition module 510 may be embodied) and exception handling module 604 (e.g., which may correspond to the above processing module 520), while freeing the physical registers of the thread for use by other threads. After a period of time, the data is loaded from expansion memory 606 back into physical register file 608 when the thread continues to execute instructions. It is understood that the physical register file may comprise a plurality of physical registers. In this example scenario, expansion memory 606 is a temporary cache of physical register file 608. In some examples, expansion memory 606 may be implemented by registers, by static random access memory (SRAM), or by reusing a cache in the processor architecture, or the like.
The exception detection module 602 detects that an exception needs to be triggered after, for example, the number of physical registers currently consumed by the four threads in total reaches a certain threshold (e.g., may correspond to the number threshold above), and notifies the exception handling module 604, for example, by generating an exception signal. Exception handling module 604 may load microcode programs from microcode program memory 610, for example, to handle exceptions. In some examples, microcode program memory 610 may be read-only memory (ROM), whose contents are generally fixed and immutable, so that the microcode programs are protected. Of course, embodiments are not limited in this regard, and microcode program memory 610 may be another memory that can store microcode programs.
The exception handling module 604 selects one or more threads (e.g., corresponding to the target thread above) by, for example, an arbitration algorithm, stores the data of the physical registers of the selected thread or threads into the expansion memory (i.e., spill out), and at the same time frees the physical registers used by the thread. The released physical registers are available for use by other threads.
During the processing of an exception, the selected thread stops dispatching instructions, thus leaving sufficient computing resources for the threads that continue to execute.
When the thread that stopped dispatching instructions has been stopped for a certain time, dispatch of the thread continues. The register data used by the thread is still stored in the expansion memory and needs to be moved from the expansion memory to the physical register file (i.e., Fill in) before the thread's instructions continue to execute.
The logical register-physical register mapping table for the four threads is described below.
Fig. 7A-7E illustrate logical register-to-physical register mapping tables for four threads (thread 0 through thread 3) in accordance with at least one embodiment of the present disclosure.
In this example application scenario, the logical registers are embodied as vector logical registers; however, the logical registers need not be embodied as vector logical registers. Referring to fig. 7A to 7E, a vector logical register is denoted by v, the initial letter of vector; v0 denotes a vector register with a 512-bit data width, whose low 256 bits are denoted by vL0 (vector low) and whose high 256 bits are denoted by vH0 (vector high). Each thread has 32 vector logical registers v0-v31, which are divided into low 256-bit registers (vL0-vL31) and high 256-bit registers (vH0-vH31).
As can be seen from fig. 7A to 7E, each thread has a logical register-physical register mapping table, and each logical register mapping consumes one physical register. For example, thread 0's vL0 register consumes physical register number 8. When some physical registers of a thread are released through the spill out mechanism, the corresponding physical register number in the mapping table becomes Invalid, which indicates that the physical register can be used by other threads, i.e., is available.
For example, when the physical registers of vL0-vL3 of thread 0 are released, the logical register-physical register mapping table of thread 0 changes from FIG. 7A to FIG. 7E, and physical registers 8, 9, 16, and 20 are released for use by other threads.
An example of the anomaly detection module is described below.
FIG. 8 illustrates a schematic diagram of an anomaly detection module in accordance with at least one embodiment of the present disclosure.
In this example application scenario, the role of the exception detection module is, for example, to collect information, detect exceptions, and notify the instruction dispatch module to stop the dispatch of threads. After a thread stops dispatching for a period of time, the thread is notified to continue dispatching.
In connection with fig. 6 and 8, register mapping information may be retrieved, for example, from register map 612, and instruction dispatch information 614 may be retrieved, for example, from an instruction dispatch module. A register valid mapping means that a logical register corresponds to a valid physical register, in which case the register valid map counter 802 counts 1. Register valid map counter 802 may count the number of valid mappings per thread and the number of physical registers consumed by all threads.
The threshold setting module 804 may set the threshold for the number of physical registers in different modes. For example, in mode one, a spill out exception may be triggered when the total number of physical registers consumed by the four threads exceeds X, where X (e.g., corresponding to the maximum number above) may be set by the threshold setting module. For another example, in mode two, the exception may be triggered after the total number of physical registers consumed by thread 0 and thread 2 reaches Y and/or after the total number of physical registers consumed by thread 1 and thread 3 reaches Z, where Y and Z are set by the threshold setting module. Setting the threshold per mode in this way provides a more flexible threshold setting that can adapt to different use environments of the processor. For example, if more programs are detected on threads 0 and 2 and fewer on threads 1 and 3, threads 0 and 2 may be favored by setting threshold Y higher and/or threshold Z lower, so that threads 0 and 2 can use more registers, thereby improving their performance.
The threshold setting module 804 may also set a time threshold (e.g., corresponding to the second time threshold described above) for which the thread stops dispatching. In this way, processor performance may be improved. For example, a thread with sparse instructions may be controlled to stop dispatch for a longer period of time.
The comparator 806 may be responsible for comparing the number of physical registers actually consumed with the number threshold to determine a spill out exception. In other aspects, the comparator may be responsible for comparing the ratio of physical registers actually consumed with a ratio threshold set by the threshold setting module, or for comparing the time for which logical registers are not used (unoccupied) with a time threshold (e.g., corresponding to the first time threshold above) to determine a spill out exception, or for comparing the time for which an instruction is stalled with a time threshold (e.g., corresponding to the second time threshold above) to determine a Fill in exception. Based on the comparison, the comparator 806 may determine that a threshold condition is met and report the type of exception, namely a spill out exception (e.g., corresponding to the occupancy state above meeting an overflow threshold condition) or a Fill in exception (e.g., corresponding to the target thread above requiring re-execution of instructions).
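The two threshold modes can be sketched as follows. This is a hypothetical software model for an SMT4 example: the function `spill_exception` and its parameters are illustrative names, and the comparison operators simply mirror the wording above ("exceeds X" for mode one, "reaches Y" or "reaches Z" for mode two).

```python
from typing import Dict, Optional


def spill_exception(per_thread: Dict[int, int], mode: int,
                    x: Optional[int] = None,
                    y: Optional[int] = None,
                    z: Optional[int] = None) -> bool:
    """Decide whether a spill out exception should be raised.

    per_thread maps a thread number (0..3) to its count of valid mappings.
    """
    if mode == 1:
        if x is None:
            raise ValueError("mode one requires threshold X")
        return sum(per_thread.values()) > x      # all four threads exceed X
    if mode == 2:
        pair_02 = per_thread[0] + per_thread[2]  # threads 0 and 2 share threshold Y
        pair_13 = per_thread[1] + per_thread[3]  # threads 1 and 3 share threshold Z
        return (y is not None and pair_02 >= y) or (z is not None and pair_13 >= z)
    raise ValueError("unknown mode")
```

Raising Y relative to Z in this model lets threads 0 and 2 hold more registers before a spill out is triggered, which corresponds to the biasing described above.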
The arbiter 808 may determine the number of the thread to be spilled out and the number of the thread that is to stop dispatching according to an arbitration algorithm. In some examples, the arbitration algorithm may include a least recently used (LRU) algorithm, a least frequently used (LFU) algorithm, or a polling (round-robin, RR) algorithm, among others.
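As one possible illustration of such arbitration, a polling (round-robin) selection can be sketched as below. The class `RoundRobinArbiter` and the rule of skipping threads that hold no physical registers are assumptions made here; the disclosure does not fix the details of the arbitration algorithm.

```python
from typing import Dict, Optional


class RoundRobinArbiter:
    """Picks the thread to spill out in polling order."""

    def __init__(self, num_threads: int) -> None:
        self.num_threads = num_threads
        self.next_thread = 0  # thread to be considered first on the next pick

    def pick(self, occupied: Dict[int, int]) -> Optional[int]:
        """Return the next thread that still holds registers, or None if none does."""
        for offset in range(self.num_threads):
            t = (self.next_thread + offset) % self.num_threads
            if occupied.get(t, 0) > 0:
                self.next_thread = (t + 1) % self.num_threads
                return t
        return None
```

An LRU or LFU policy would replace the polling order with a per-thread recency or frequency record, respectively.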
After sending the spill out signal and the stop dispatch signal, the arbiter 808 counts the time for which dispatch is stopped using the second time counter 810. When the second time counter reaches a certain time threshold, the arbiter notifies the spilled-out thread to continue executing instructions and triggers the Fill in exception handler.
An example of grouped Fill in of logical registers is described below.
After the logical registers of a thread are spilled out, the logical registers need to be filled in if the thread re-executes instructions. Since each logical register that is filled in consumes one physical register again, the logical registers are grouped for Fill in, and only one group of registers (e.g., corresponding to the logical register set above) is filled in at a time. That is, a logical register is filled in when it is used and is not filled in when it is not used. In this way, compared with filling in all registers of a thread, grouped Fill in provides finer-grained physical register consumption: only the registers that must be used are filled in, which avoids unnecessary physical register occupation and can thus further improve processor performance.
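The behavior of the second time counter can be sketched as a per-thread stall timer. The class `StallTimer`, its methods, and the use of one operation cycle as the time unit are hypothetical and serve only to illustrate the comparison with the second time threshold.

```python
from typing import Dict, List


class StallTimer:
    """Counts how long each stopped thread has been stalled."""

    def __init__(self, second_time_threshold: int) -> None:
        self.threshold = second_time_threshold
        self.stalled: Dict[int, int] = {}  # thread number -> cycles stalled so far

    def stop_dispatch(self, thread: int) -> None:
        """Start counting for a thread that has just been told to stop dispatching."""
        self.stalled[thread] = 0

    def tick(self) -> List[int]:
        """Advance one cycle; return the threads whose Fill in should now be triggered."""
        ready = []
        for t in list(self.stalled):
            self.stalled[t] += 1
            if self.stalled[t] >= self.threshold:
                ready.append(t)
                del self.stalled[t]  # the thread resumes dispatch after Fill in
        return ready
```

A larger threshold keeps a thread stopped longer, which matches the suggestion above that a thread with sparse instructions may be stopped for a longer period.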
FIG. 9 illustrates a schematic diagram of a logical register grouping in accordance with at least one embodiment of the present disclosure.
Referring to fig. 9, an exemplary grouping is: the low 256-bits of the vector registers v0-v15 are set A, the high 256-bits of v0-v15 are set B, the low 256-bits of v15-v31 are set C, the high 256-bits of v16-v31 are set D (the low 256-bits of v0-v15 are vL0-vL15, the low 256-bits of v16-v31 are vL16-vL 31).
When the logical registers of thread 0 have been spilled out into the expansion memory, the following exemplary instruction is executed:
VADD vL0, vL1, vL2 (vector register addition: vL0 = vL1 + vL2)
In this case, the vL0, vL1, and vL2 vector registers are used. All three belong to group A, so the group A logical registers need to be filled in, while groups B, C, and D do not.
For another example, the following exemplary instruction is executed:
VADD vL0, vL1, vL16
In this case, the vL0, vL1, and vL16 vector registers are used. They belong to groups A and C, so the logical registers of groups A and C need to be filled in, while groups B and D do not.
It should be noted that the 4 logical register groups are only exemplary; the logical registers may be divided into 8 groups, 16 groups, or another number of groups as desired.
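The four-group example can be expressed as a small lookup that maps the operands of an instruction to the groups that need Fill in. This is only a sketch: the operand spelling `vL<n>` / `vH<n>` for the low and high 256-bit halves follows the naming used in this description, while the function names `group_of` and `groups_to_fill` are hypothetical.

```python
import re
from typing import Iterable, List


def group_of(operand: str) -> str:
    """Map an operand such as 'vL3' or 'vH20' to its Fill in group (A-D)."""
    m = re.fullmatch(r"[vV]([LlHh])(\d+)", operand)
    if m is None:
        raise ValueError(f"unrecognized operand: {operand}")
    half, index = m.group(1).upper(), int(m.group(2))
    if not 0 <= index <= 31:
        raise ValueError(f"register index out of range: {operand}")
    if index < 16:
        return "A" if half == "L" else "B"  # low / high halves of v0-v15
    return "C" if half == "L" else "D"      # low / high halves of v16-v31


def groups_to_fill(operands: Iterable[str]) -> List[str]:
    """Return the sorted set of groups touched by an instruction's operands."""
    return sorted({group_of(op) for op in operands})
```

With this mapping, the operands vL0, vL1, vL2 touch only group A, and the operands vL0, vL1, vL16 touch groups A and C, matching the two examples above.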
An example of an exception handling module is described below.
Referring to FIG. 6, upon receipt of an exception type from the exception detection module, the exception handling module 604 may present the exception type to the microcode program store 610, read the corresponding microcode program from it, and execute the corresponding microinstructions. For example, exception types are classified into Spill out and Fill in.
During Spill out, the exception handling module may read the physical register data corresponding to a logical register of a certain thread into the exception handling module and then store the data into the expansion memory. In general, the physical registers are located close to the controller and the arithmetic units within the core, while the expansion memory is farther away, and there is no direct data path between the physical registers and the expansion memory. In other aspects, the microinstructions may include an instruction for releasing physical registers. Once a physical register is released, the corresponding entry in the mapping table may be set to Invalid.
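The Spill out steps (read the register data, store it to the expansion memory, release the physical register, invalidate the mapping) can be sketched in software as follows. The data structures are assumptions chosen for illustration: the mapping table is a dictionary per thread, `None` stands for Invalid, and the expansion memory is keyed by thread number and logical register name.

```python
from typing import Dict, List, Optional, Tuple

INVALID = None  # stands for an Invalid mapping-table entry (assumption)


def spill_out(thread_id: int,
              mapping_table: Dict[int, Dict[str, Optional[int]]],
              physical_regs: Dict[int, int],
              expansion_memory: Dict[Tuple[int, str], int],
              free_list: List[int]) -> None:
    """Spill every validly mapped logical register of one thread."""
    for logical, physical in list(mapping_table[thread_id].items()):
        if physical is INVALID:
            continue  # nothing mapped, nothing to spill
        value = physical_regs[physical]                 # read into the handler first
        expansion_memory[(thread_id, logical)] = value  # then store to expansion memory
        free_list.append(physical)                      # release for other threads
        mapping_table[thread_id][logical] = INVALID     # mark the mapping Invalid
```

Fill in would perform the reverse steps: take a free physical register, copy the value back from the expansion memory, and restore the mapping.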
The Fill in process can be seen as the reverse of the Spill out process. For example, Fill in may be implemented using the Fill in grouping described above, while Spill out may spill out all registers of a certain class of a certain thread. For another example, Spill out may be grouped similarly to Fill in, for example with the number of logical registers in each Spill out group greater than the number in each Fill in group.
An example of an expansion memory is described below.
The expansion memory is used for temporarily storing the data spilled out from threads. Each fixed address of the memory holds fixed data.
FIG. 10 illustrates a schematic diagram of the storage of a thread in expansion memory in accordance with at least one embodiment of the present disclosure.
Referring to fig. 10, for example, the value of the vL0 logical register is always stored at address 0, the value of the vL1 logical register at address 1, and the value of the vL2 logical register at address 2. In this way, data access is simpler than with the random placement employed, for example, in a cache memory.
In other aspects, the memory may be read based on the thread number and the logical register number, which avoids access misses. It is understood that other threads may have similar structures in the expansion memory.
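Because every logical register has a dedicated location, an address can be computed directly from the thread number and the logical register number. The sketch below assumes a contiguous region per thread and a hypothetical region size of 80 slots; neither the layout nor the constant is stated in the disclosure.

```python
SLOTS_PER_THREAD = 80  # hypothetical size of one thread's region


def slot_address(thread_id: int, logical_index: int) -> int:
    """Return the fixed expansion-memory address for a thread's logical register."""
    if thread_id < 0:
        raise ValueError("thread_id must be non-negative")
    if not 0 <= logical_index < SLOTS_PER_THREAD:
        raise ValueError("logical_index out of range")
    # Every (thread, register) pair has exactly one slot, so a lookup cannot miss.
    return thread_id * SLOTS_PER_THREAD + logical_index
```

Under this layout, thread 0 keeps vL0 at address 0 and vL1 at address 1, and every other thread uses the same offsets within its own region.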
An example of mode setting is described below.
Mode one and mode two are mentioned above, and different thresholds may be configured for different modes. An exemplary description is given below.
Assume that, with four threads, each thread needs to consume 80 physical registers (32 for vL0-vL31, 32 for vH0-vH31, 8 for the 80-bit-wide multimedia registers media0-media7, and 8 temp registers), so that the four threads need 320 registers in total. If each thread additionally has 8 free physical registers for execution, the physical register file requires 352 entries in total, whereas the actual physical register file has only 256 entries or 192 entries.
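The register budget above can be checked with a few lines of arithmetic. The variable names are illustrative only; the numbers are the ones given in this example.

```python
# Per-thread demand: vL0-vL31, vH0-vH31, media0-media7, and 8 temp registers.
per_thread = 32 + 32 + 8 + 8
threads = 4

mapped_total = per_thread * threads        # registers needed to map all four threads
spare_total = 8 * threads                  # 8 free registers per thread for execution
required_entries = mapped_total + spare_total

# Shortfall against the two physical register file sizes mentioned above.
shortfall_256 = required_entries - 256
shortfall_192 = required_entries - 192

print(per_thread, mapped_total, required_entries, shortfall_256, shortfall_192)
```

The shortfall of 96 entries (for a 256-entry file) or 160 entries (for a 192-entry file) is what the Spill out mechanism has to absorb.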
Taking a 256-entry physical register file as an example, mode one may be employed: when the total number of validly mapped physical registers of the four threads reaches, for example, 248, Spill out of one thread is triggered and dispatch of that thread is stopped for a period of time; the thread is then released and allowed to re-execute instructions. The thread to Spill out may be determined by the arbiter above.
If mode two is employed: when the total number of validly mapped physical registers of a certain two threads reaches, for example, 144, Spill out of the two threads is triggered and dispatch of a thread is stopped for a period of time; the thread is then released and allowed to re-execute instructions.
Taking a 192-entry physical register file as an example, two threads may be spilled out at a time and the dispatch of two threads may be stopped for a period of time.
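Both modes reduce to summing the valid mappings of a chosen set of threads and comparing the sum with the mode's threshold. The sketch below uses the example thresholds 248 and 144 from the text; the function names and the per-thread counts in the usage note are made up for illustration.

```python
from typing import Dict, Iterable


def occupancy(valid_mappings: Dict[int, int], threads: Iterable[int]) -> int:
    """Sum the valid logical-to-physical mappings of the given threads."""
    return sum(valid_mappings[t] for t in threads)


def overflow(valid_mappings: Dict[int, int], threads: Iterable[int],
             threshold: int) -> bool:
    """True when the selected threads have reached the mode's threshold."""
    return occupancy(valid_mappings, threads) >= threshold
```

For hypothetical counts `{0: 70, 1: 74, 2: 50, 3: 54}`, mode one checks all four threads against 248, while mode two checks a pair such as threads 0 and 1 against 144.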
The mode setting may increase the utilization of the physical registers, thereby improving multithreading performance. For example, the mode setting allows physical registers to be used by the threads that need them rather than being left idle in the mapping table. For example, the execution of micro instructions (uops) of different threads may be detected by hardware logic, and the information may then be sent to the exception detection module, which determines which mode to configure. That is, the mode may be set by dynamic configuration.
Spill out of logical registers that have not been used for a long time is described below.
The exception detection module also has the function of detecting, according to the allocation information, registers that have not been used for a long time (for example, by means of the first time counter), triggering the corresponding exception, and spilling out those registers.
For example, a program with a 256-bit register width runs for a long time without using the multimedia registers (this is just an example and does not imply anything special about programs that use multimedia registers), but the multimedia registers still have mappings in the register mapping table; the multimedia registers then need to be spilled out to release the physical register resources they occupy. For another example, after a segment of a program with a 512-bit register width has executed, only a program with a 256-bit register width is executed for a long time, so the high 256 bits of v0-v15 and both the high and low 256 bits of v16-v31 are not used (the 256-bit program only uses group A of the Fill in grouping). At this time, the registers of groups B, C, and D can be spilled out to release the occupied physical register resources.
In this way, the utilization of the physical registers is improved, thereby improving multithreading performance. For example, some programs use only a portion of the registers (e.g., a program that supports only the AVX2 instruction set does not use the high and low halves of v16-v31 or the high half of v0-v15), so the registers need to be grouped and the low half of v0-v15 need not also be spilled out; after all, frequent Spill out and Fill in would also affect performance.
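The idle-register check can be sketched as comparing, for each register group, the time since its last use with the first time threshold. The per-group "last used cycle" record and the function name `idle_groups` are assumptions; the disclosure only states that a first time counter is used.

```python
from typing import Dict, List


def idle_groups(last_used_cycle: Dict[str, int], now: int,
                first_time_threshold: int) -> List[str]:
    """Return the groups whose registers have been unused for at least the threshold."""
    return sorted(group for group, last in last_used_cycle.items()
                  if now - last >= first_time_threshold)
```

In the 256-bit program example, group A would have a recent last-use time and stay mapped, while groups B, C, and D would exceed the threshold and become Spill out candidates.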
In the example application scenario described above, the thread control device 600 solves the problem of physical register sharing under multithreading. More specifically, a physical register file designed with too many entries for multithreading can seriously reduce the main frequency of the processor and increase its power consumption. Through the Spill out and Fill in mechanisms of the expansion memory, the thread control device 600 reduces the number of physical registers required while allowing the physical register resources to be shared. In addition, the thread control device 600 spills out registers that have not been used for a long time, releasing physical register resources and improving processor performance.
In other aspects, in the example application scenario described above, the physical register file may be extended with memory-like components. For example, other storage elements placed closer to the physical registers are used as a backup to the physical register file. With a direct data channel between the physical register file and the expansion unit, instructions are used to move data between the physical registers and the expansion unit.
FIG. 11 illustrates a schematic diagram of another thread control device 1100 in accordance with at least one embodiment of the present disclosure.
As shown in fig. 11, the thread control device 1100 includes at least one processing unit 1120 and a memory 1110. Memory 1110 stores computer readable instructions and is communicatively coupled to processing unit 1120. Processing unit 1120 executes computer-readable instructions stored by memory 1110 to implement a thread control method in accordance with at least one embodiment of the present disclosure, as well as additional aspects thereof.
For example, the memory 1110 and the processing unit 1120 may communicate with each other directly or indirectly. For example, in some examples, as shown in FIG. 11, the thread control device 1100 may also include a system bus 1130, where the memory 1110 and the processing unit 1120 may communicate with each other via the system bus 1130, e.g., where the processing unit 1120 may access the memory 1110 via the system bus 1130. For example, in other examples, components such as memory 1110 and processing unit 1120 may communicate via a Network On Chip (NOC) connection.
For example, the processing unit 1120 may control other components in the thread control device 1100 to perform desired functions. The processing unit 1120 may be a Central Processing Unit (CPU), a Tensor Processor (TPU), a Network Processor (NP), or a Graphics Processor (GPU) with data processing capability and/or program execution capability, or may be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
For example, memory 1110 may include any combination of one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory can include, for example, Random Access Memory (RAM) and/or cache memory (cache) and the like. The non-volatile memory may include, for example, read-only memory (ROM), hard disk, erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), USB memory, flash memory, and the like.
For example, one or more computer-readable instructions may be stored on memory 1110 and processing unit 1120 may execute the computer-readable instructions to implement various functions. Various applications and various data, such as instruction processing code and various data used and/or generated by the applications, may also be stored in the computer readable storage medium.
For example, some of the computer instructions stored by memory 1110, when executed by processing unit 1120, may perform one or more steps in accordance with the thread control method described above.
For example, as shown in FIG. 11, the thread control device 1100 may also include an input interface 1140 that allows an external device to communicate with the thread control device 1100. For example, input interface 1140 may be used to receive instructions from an external computer device, from a user, or the like. Thread control device 1100 may also include an output interface 1150 that interconnects thread control device 1100 and one or more external devices. For example, thread control device 1100 may pass through output interface 1150, etc.
It should be noted that the thread control device 1100 according to at least one embodiment of the present disclosure is exemplary, not limiting, and the thread control device 1100 may further include other conventional components or structures according to practical application requirements, for example, to implement the necessary functions of the thread control device, and those skilled in the art may set other conventional components or structures according to specific application scenarios, which the embodiments of the present disclosure are not limited to.
At least one embodiment of the present disclosure also provides a processor, such as an SMT processor, that includes a thread control device according to at least one embodiment of the present disclosure. The maximum number of threads supportable by an SMT processor may be, for example, 2, 4, 8, etc. The processor may be a single-core or multi-core processor; for example, a processor core may employ a microarchitecture such as X86, ARM, or RISC-V and may include one or more levels of cache. Embodiments of the present disclosure are not limited in this respect.
At least one embodiment of the present disclosure also provides a computer-readable storage medium. Fig. 12 shows a schematic diagram of a computer-readable storage medium 1200 in accordance with at least one embodiment of the present disclosure.
For example, as shown in fig. 12, the computer-readable storage medium 1200 stores computer-readable instructions 1210 that, when executed by a computer (including a processor), may implement a thread control method in accordance with at least one embodiment of the present disclosure, as well as additional aspects thereof.
For example, one or more computer-readable instructions may be stored on the computer-readable storage medium 1200. Some of the computer readable instructions stored on the computer readable storage medium 1200 may be, for example, instructions for implementing one or more steps in the thread control method described above.
For example, a computer-readable storage medium may include a memory component of a tablet computer, a hard disk of a personal computer, Random Access Memory (RAM), Read Only Memory (ROM), Erasable Programmable Read Only Memory (EPROM), compact disc read only memory (CD-ROM), flash memory, or any combination of the foregoing, as well as other suitable storage media. For example, the computer readable storage medium 1200 may include the memory 1110 in the thread control device 1100 described above.
At least some embodiments of the present disclosure also provide an electronic device. Fig. 13 illustrates a schematic diagram of an electronic device 1300 in accordance with at least one embodiment of the present disclosure.
An electronic device according to at least one embodiment of the present disclosure may be implemented as, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), an in-vehicle terminal (e.g., an in-vehicle navigation terminal), etc., and a fixed terminal such as a digital TV, a desktop computer, etc.
The electronic device 1300 shown in fig. 13 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
For example, as shown in fig. 13, in some examples, an electronic device 1300 includes a processor 1301, which may include a processor (e.g., an SMT processor) of any of the above embodiments, which may perform various suitable actions and processes according to a program stored in a Read Only Memory (ROM) 1302 or a program loaded from a storage 1308 into a Random Access Memory (RAM) 1303. In the RAM 1303, various programs and data required for the operation of the computer system are also stored. The processor 1301, the ROM 1302, and the RAM 1303 are connected to each other via a bus 1304. An input/output (I/O) interface 1305 is also connected to the bus 1304.
For example, the following components may be connected to the I/O interface 1305: input devices 1306 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, and the like; output devices 1307 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, or the like; storage 1308 including, for example, magnetic tape, hard disk, etc.; and a communication device 1309 including, for example, a network interface card such as a LAN card, a modem, or the like. The communication device 1309 may allow the electronic device 1300 to perform wireless or wired communication with other devices to exchange data, performing communication processing via a network such as the internet. The drive 1310 is also connected to the I/O interface 1305 as needed. Removable media 1311, such as magnetic disks, optical disks, magneto-optical disks, semiconductor memory, etc., are mounted on the drive 1310 as needed so that a computer program read therefrom is installed into the storage 1308 as needed. While fig. 13 illustrates an electronic device 1300 that includes various devices, it is to be understood that not all illustrated devices are required to be implemented or included. More or fewer devices may be implemented or included instead.
For example, the electronic device 1300 may further include a peripheral interface (not shown), and the like. The peripheral interface may be various types of interfaces, such as a USB interface, a Lightning interface, etc. The communication device 1309 may communicate by way of wireless communication with networks and other devices, such as the internet, intranets, and/or wireless networks such as cellular telephone networks, wireless Local Area Networks (LANs), and/or Metropolitan Area Networks (MANs). The wireless communication may use any of a variety of communication standards, protocols, and technologies including, but not limited to, Global System for Mobile communications (GSM), Enhanced Data GSM Environment (EDGE), Wideband Code Division Multiple Access (W-CDMA), Code Division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Bluetooth, Wi-Fi (e.g., based on the IEEE 802.11a, IEEE 802.11b, IEEE 802.11g, and/or IEEE 802.11n standards), Voice over Internet Protocol (VoIP), Wi-MAX, protocols for email, instant messaging, and/or Short Message Service (SMS), or any other suitable communication protocol.
For the present disclosure, in addition to the above exemplary descriptions, the following points should be noted:
(1) The drawings of the embodiments of the present disclosure relate only to the structures related to the embodiments of the present disclosure, and other structures may refer to the general design.
(2) The embodiments of the present disclosure and features in the embodiments may be combined with each other to arrive at a new embodiment without conflict.
The foregoing is merely exemplary embodiments of the present disclosure and is not intended to limit the scope of the disclosure, which is defined by the appended claims.

Claims (22)

1. A thread control method, comprising:
for a plurality of threads executed by a processor, acquiring the occupation state of the plurality of threads on physical registers in the processor; and
in response to the occupancy state meeting an overflow threshold condition, selecting a target thread of the plurality of threads, storing data in at least a portion of the physical registers occupied by the target thread into an expansion memory, and releasing the physical registers whose data has been stored into the expansion memory for use by other threads of the plurality of threads other than the target thread.
2. The thread control method of claim 1, wherein the overflow threshold condition comprises a quantity threshold, wherein the quantity threshold is associated with a maximum number of physical registers in the processor that at least some of the plurality of threads may occupy.
3. The thread control method of claim 2, wherein the number threshold indicates a maximum number of physical registers in the processor that the at least some threads may occupy, and
acquiring the occupation state of the plurality of threads on physical registers in the processor comprises: counting the number of physical registers in the processor occupied by the at least some threads.
4. The thread control method of claim 3, wherein counting the number of physical registers in the processor occupied by the at least some threads comprises:
counting the number of valid mappings for each thread in the at least some threads, wherein a valid mapping indicates that one logical register corresponds to one valid physical register; and
summing the counts as the number of physical registers in the processor occupied by the at least some threads.
5. The thread control method of claim 3, wherein the number threshold indicates a maximum number of physical registers in the processor that some or all of the threads may occupy.
6. The thread control method of claim 3, further comprising:
determining that the occupancy state meets the overflow threshold condition in response to the acquired number of physical registers in the processor occupied by the at least some threads being greater than or equal to the maximum number.
7. The thread control method of claim 1, wherein the overflow threshold condition comprises a first time threshold, wherein the first time threshold indicates a maximum time that logical registers of at least some of the plurality of threads are unused, and
acquiring the occupation state of the plurality of threads on physical registers in the processor comprises: acquiring the unoccupied time of the logical registers of the at least some threads.
8. The thread control method of claim 7, further comprising:
in response to the acquired unoccupied time of the logical registers of the at least some threads being greater than or equal to the first time threshold, determining that the occupancy state meets the overflow threshold condition.
9. The thread control method of claim 7 wherein the logical registers of the at least some threads comprise a plurality of sets of logical registers,
Wherein obtaining the unoccupied time of the logical registers of the at least partial threads comprises:
Acquiring unoccupied time of a logical register within a target logical register group of the plurality of logical register groups, and
Wherein selecting a target thread of the plurality of threads in response to the occupancy state meeting an overflow threshold condition, storing data in at least a portion of physical registers occupied by the target thread into an expansion memory, comprises:
And in response to the unoccupied time of the logic registers in the target logic register set in the plurality of logic register sets being greater than or equal to the first time threshold, selecting a corresponding thread as a target thread, and storing data in at least part of physical registers corresponding to the logic registers in the target logic register set of the target thread into an expansion memory.
10. The thread control method of any one of claims 1-9, further comprising:
The data is filled from the expansion memory into currently available physical registers in the processor in response to the target thread requiring re-execution of an instruction.
11. The thread control method of claim 10, further comprising:
Obtaining a time at which the instruction was stalled, wherein the instruction was stalled in response to the occupancy state meeting an overflow threshold condition; and
determining that the target thread needs to re-execute the instruction in response to the time for which the instruction is stalled being greater than or equal to a second time threshold, wherein the second time threshold indicates a minimum time for the instruction corresponding to the target thread to be stalled.
12. The thread control method of claim 10 wherein the logical registers of the target thread comprise a plurality of sets of logical registers,
Wherein, in response to the target thread requiring re-execution of an instruction, the populating the available physical registers from the expansion memory comprises:
in response to the instruction requiring use of a first logical register, filling only the data corresponding to each logical register in the logical register set where the first logical register is located from the expansion memory into the currently available physical registers.
13. The thread control method of claim 10, wherein the expansion memory comprises a plurality of memory addresses, each memory address dedicated to storing data corresponding to a logical register of a respective thread of the plurality of threads.
14. A thread control apparatus comprising:
an acquisition module configured to acquire, for a plurality of threads executed by a processor, an occupancy state of the plurality of threads on a physical register in the processor; and
a processing module configured to, in response to the occupancy state satisfying an overflow threshold condition, select a target thread of the plurality of threads, store data in at least a portion of the physical registers occupied by the target thread into an expansion memory, and release the physical registers whose data has been stored into the expansion memory for use by other threads of the plurality of threads other than the target thread.
15. The thread control device of claim 14, wherein the overflow threshold condition comprises a first time threshold, wherein the first time threshold indicates a maximum time that a logical register of at least some of the plurality of threads is not used, and the acquisition module comprises:
And the time acquisition module is configured to acquire the unoccupied time of the logic registers of the at least partial threads.
16. The thread control device of claim 15 wherein the logical registers of the at least some threads comprise a plurality of sets of logical registers,
The time acquisition module comprises:
A target logical register set time acquisition module configured to acquire a non-occupation time of a logical register within a target logical register set of the plurality of logical register sets; and
The processing module comprises:
A target thread selection unit that selects a corresponding thread as a target thread in response to a non-occupation time of a logical register within a target logical register group of the plurality of logical register groups being greater than or equal to the first time threshold; and
And the overflow storage control unit is configured to store data in at least part of physical registers corresponding to the logic registers in the target logic register group of the target thread into an expansion memory.
17. The thread control apparatus of claim 14, wherein the processing module further comprises:
And a stuffing storage control unit configured to stuff the data from the expansion memory into a currently available physical register in the processor in response to the target thread requiring re-execution of an instruction.
18. The thread control apparatus of claim 17, wherein the logical register of the target thread comprises a plurality of logical register sets, and the fill storage control unit comprises:
And a logic register set filling storage control unit configured to fill only data corresponding to each logic register in a logic register set where a first logic register is located from the expansion memory into the currently available physical register in response to the instruction requiring the use of the first logic register.
19. The thread control apparatus of claim 14 wherein the expansion memory is disposed around the physical registers and data in at least a portion of the physical registers occupied by the target thread is capable of being transferred through a direct data path between the expansion memory and the physical registers.
20. A thread control apparatus comprising at least one processing unit and a memory; wherein,
The memory stores computer readable instructions and is communicatively coupled to the at least one processing unit;
The at least one processing unit is configured to execute the computer readable instructions stored by the memory to implement the thread control method according to any one of claims 1-13.
21. A processor comprising a thread control device according to any one of claims 14-20.
22. A computer readable storage medium having computer readable instructions stored therein, which when executed by a processor, implement the thread control method according to any one of claims 1-13.
CN202410382007.8A 2024-03-29 2024-03-29 Thread control method and device, processor and computer readable storage medium Pending CN118245188A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410382007.8A CN118245188A (en) 2024-03-29 2024-03-29 Thread control method and device, processor and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410382007.8A CN118245188A (en) 2024-03-29 2024-03-29 Thread control method and device, processor and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN118245188A true CN118245188A (en) 2024-06-25

Family

ID=91552709

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410382007.8A Pending CN118245188A (en) 2024-03-29 2024-03-29 Thread control method and device, processor and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN118245188A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination