WO2024041625A1 - Instruction distribution method and apparatus for a multi-threaded processor, and storage medium - Google Patents

Instruction distribution method and apparatus for a multi-threaded processor, and storage medium

Info

Publication number
WO2024041625A1
Authority
WO
WIPO (PCT)
Prior art keywords
thread
instruction distribution
execution waiting
thread instruction
request
Prior art date
Application number
PCT/CN2023/114840
Other languages
English (en)
French (fr)
Inventor
肖皓
Original Assignee
海光信息技术股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 海光信息技术股份有限公司
Publication of WO2024041625A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5038 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3836 Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/50 Indexing scheme relating to G06F9/50
    • G06F2209/5018 Thread allocation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • Embodiments of the present disclosure relate to an instruction distribution method, an instruction distribution device, a data processing device, a processor, an electronic device, and a non-transitory readable storage medium for a multi-threaded processor.
  • Simultaneous multithreading (SMT) is a hardware multithreading technique that can execute instructions from multiple threads within one central processing unit (CPU) clock cycle.
  • Simultaneous multithreading is a way of converting thread-level parallelism (multiple CPUs) into instruction-level parallelism (a single CPU).
  • Simultaneous multithreading is the ability of a single physical processor to dispatch instructions from multiple hardware thread contexts at the same time. It is used to obtain performance benefits in commercial environments and for workloads with high cycle/instruction counts.
  • The processor uses a superscalar structure, which is suited to fetching and executing instructions in parallel. Simultaneous multithreading allows two applications to be scheduled to run on the same processor at the same time, thereby exploiting the processor's superscalar nature.
  • At least one embodiment of the present disclosure provides an instruction distribution method for a multi-threaded processor.
  • the instruction distribution method includes: receiving multiple thread instruction distribution requests respectively issued by multiple decoding instruction queues of the multi-threaded processor; determining whether the multiple thread instruction distribution requests are blocked by or in conflict with multiple execution waiting queues of the multi-threaded processor; and, based on the determination, selecting one thread instruction distribution request from the multiple thread instruction distribution requests and responding to it.
  • each of the multiple thread instruction distribution requests includes multiple instructions that need to be sent to execution waiting queues of the corresponding types.
  • determining whether the multiple thread instruction distribution requests are blocked by or in conflict with the multiple execution waiting queues includes: determining, based on the number of tokens currently available in each of the multiple execution waiting queues, whether the multiple thread instruction distribution requests are blocked by or in conflict with the multiple execution waiting queues.
  • determining whether the multiple thread instruction distribution requests are blocked for the multiple execution waiting queues includes: in response to the number of tokens in a first execution waiting queue required by a first thread instruction distribution request being greater than the number of tokens currently available in the first execution waiting queue, determining that the first thread instruction distribution request is blocked for the first execution waiting queue.
  • the multiple execution waiting queues include at least one shared execution waiting queue shared by the multiple threads. Determining whether the multiple thread instruction distribution requests conflict with the multiple execution waiting queues includes: in response to a second thread instruction distribution request being blocked for a second execution waiting queue in the at least one shared execution waiting queue, while the number of tokens in the second execution waiting queue required by a third thread instruction distribution request is not greater than the number of tokens currently available in the second execution waiting queue, determining that the second thread instruction distribution request and the third thread instruction distribution request conflict for the second execution waiting queue.
  • selecting one thread instruction distribution request from the multiple thread instruction distribution requests and responding to it includes: based on the determination, adding at least one of the multiple thread instruction distribution requests to a candidate request set; and, based on the priorities of the multiple threads, selecting one thread instruction distribution request from the candidate request set and responding to it.
  • adding at least one of the multiple thread instruction distribution requests to the candidate request set includes: in response to there being, among the multiple thread instruction distribution requests, a fourth thread instruction distribution request that is neither blocked by nor in conflict with the multiple execution waiting queues, adding the fourth thread instruction distribution request to the candidate request set.
  • adding at least one of the multiple thread instruction distribution requests to the candidate request set includes: in response to no such fourth thread instruction distribution request existing among the multiple thread instruction distribution requests, while a fifth thread instruction distribution request that conflicts with the multiple execution waiting queues does exist, adding the fifth thread instruction distribution request to the candidate request set.
  • selecting a thread instruction distribution request from the candidate request set based on the priorities of the multiple threads includes: determining the current priority of each of the multiple threads using a least recently used (LRU) algorithm, and selecting the thread instruction distribution request with the highest priority from the candidate request set.
  • using the least recently used (LRU) algorithm to determine the current priorities of the multiple threads includes: initializing the priorities of the multiple threads; and, in response to a first thread among the multiple threads being selected in the previous clock cycle, setting the priority of the first thread to the lowest in the current clock cycle and incrementing the priorities of the threads other than the first thread.
  • selecting a thread instruction distribution request from the candidate request set based on the priorities of the multiple threads includes: determining the current priorities of the multiple threads using a polling (round-robin) algorithm, and selecting the thread instruction distribution request with the highest priority from the candidate request set.
  • the multiple threads include at least 3 threads.
  • the instruction distribution device includes a receiving unit, a judgment unit, and a selection unit.
  • the receiving unit is communicatively connected to the multi-thread processor and is configured to receive multiple thread instruction distribution requests respectively issued by multiple decoding instruction queues of the multi-thread processor.
  • Each of the plurality of thread instruction distribution requests includes a plurality of instructions that need to be respectively sent to the execution waiting queue of the corresponding type.
  • the determination unit is communicatively connected to the multi-thread processor and is configured to determine whether multiple thread instruction distribution requests of the multi-thread processor are blocked or conflict with the multiple execution waiting queues.
  • the selection unit is configured to select one thread instruction distribution request from the plurality of thread instruction distribution requests and respond based on the determination.
  • the judgment unit includes a combination judgment subunit.
  • the combination determination subunit is configured to determine, based on the number of tokens currently available in each of the multiple execution waiting queues, whether the multiple thread instruction distribution requests are blocked or conflicting with the multiple execution waiting queues.
  • the combination judgment subunit includes a blocking judgment unit.
  • the blocking judgment unit is configured to determine, in response to the number of tokens in a first execution waiting queue required by a first thread instruction distribution request among the multiple thread instruction distribution requests being greater than the number of tokens currently available in the first execution waiting queue, that the first thread instruction distribution request is blocked for the first execution waiting queue.
  • the multiple execution waiting queues include at least one shared execution waiting queue shared by the multiple threads.
  • the combination judgment subunit includes a conflict judgment unit.
  • the conflict judgment unit is configured to determine, in response to a second thread instruction distribution request among the multiple thread instruction distribution requests being blocked for a second execution waiting queue in the at least one shared execution waiting queue, while the number of tokens in the second execution waiting queue required by a third thread instruction distribution request is not greater than the number of tokens currently available in the second execution waiting queue, that the second thread instruction distribution request and the third thread instruction distribution request conflict for the second execution waiting queue.
  • the selection unit includes a candidate selection unit and a priority selection unit.
  • the candidate selection unit is configured to add at least one of the plurality of thread instruction distribution requests to a candidate request set based on the determination.
  • the priority selection unit is configured to select one thread instruction distribution request from the candidate request set and respond to it, based on the priorities of the multiple threads.
  • At least one embodiment of the present disclosure also provides a data processing device, including the instruction distribution device provided in any of the above embodiments, multiple decoding instruction queues, and multiple execution waiting queues.
  • At least one embodiment of the present disclosure also provides a processor, including the data processing device provided in any of the above embodiments.
  • At least one embodiment of the present disclosure also provides an electronic device including a processor and a memory including one or more computer program modules.
  • the one or more computer program modules are stored in the memory and configured to be executed by the processor, and the one or more computer program modules include instructions for executing the instruction distribution method provided by any of the above embodiments.
  • At least one embodiment of the present disclosure also provides a non-transitory readable storage medium having computer instructions stored thereon.
  • when the computer instructions are executed by a processor, the instruction distribution method provided by any of the above embodiments is performed.
  • Figure 1 is a flow chart of an instruction distribution method for a multi-threaded processor provided by at least one embodiment of the present disclosure
  • Figure 2 is an example structural block diagram of a multi-threaded processor provided by at least one embodiment of the present disclosure
  • Figure 3 is a schematic diagram of a multi-thread instruction distribution arbitration provided by at least one embodiment of the present disclosure
  • Figure 4 is a schematic block diagram of a token blocking operation provided by at least one embodiment of the present disclosure
  • Figure 5 is a schematic diagram of adjusting the priorities of multiple threads according to the least recently used (LRU) algorithm provided by at least one embodiment of the present disclosure
  • Figure 6 is a schematic diagram of adjusting the priorities of multiple threads according to a polling algorithm provided by at least one embodiment of the present disclosure
  • Figure 7 is a schematic block diagram of an instruction distribution method provided by at least one embodiment of the present disclosure.
  • Figure 8 is a schematic block diagram of an instruction distribution device provided by at least one embodiment of the present disclosure.
  • Figure 9 is a schematic block diagram of a data processing device provided by at least one embodiment of the present disclosure.
  • Figure 10 is a schematic block diagram of a processor provided by at least one embodiment of the present disclosure.
  • Figure 11 is a schematic block diagram of an electronic device provided by at least one embodiment of the present disclosure.
  • Figure 12 is a schematic block diagram of yet another electronic device provided by at least one embodiment of the present disclosure.
  • Figure 13 is a schematic block diagram of a non-transitory readable storage medium provided by at least one embodiment of the present disclosure.
  • Simultaneous multithreading (SMT) processors need to arbitrate among multiple threads during the instruction distribution stage and select the instructions of one thread to distribute to the back-end execution units.
  • multi-thread arbitration needs to consider the following two aspects.
  • the first aspect is efficiency.
  • Multi-thread arbitration should exploit the multi-thread structure to maximize the parallel capability of the multi-threaded processor and thereby improve overall performance. For example, some threads require more resources than the back-end execution units currently have available; arbitration should avoid selecting such threads, since doing so would block distribution and reduce overall performance.
  • Another aspect is fairness. Each thread in multi-threading should have an equal chance of being selected, and there should not be a situation where some threads are always selected and some threads cannot be selected.
  • in current processors, SMT typically supports two threads, and scheduling between the two threads is relatively simple; for example, time-slice-based scheduling is used: one thread has priority during one period of time, and the other thread has priority during another.
  • the above time slice-based scheduling technology has the following two problems.
  • the first problem is low efficiency. When thread A and thread B are not fully loaded, they do not issue instructions in every clock cycle. For example, during time slice 1 only thread A's instructions may be issued; if, for many clock cycles in that period, thread A has no instructions to send while thread B does, then no instructions are issued during those cycles, and overall processor performance decreases.
  • the second problem is that a "livelock" may occur, in which a certain thread can never issue its instructions. Suppose thread A and thread B both need some common resource, and thread A needs more of it. In every cycle of time slice 1, thread A cannot issue because the resources it requires are insufficient. When priority switches to thread B, whatever resources are released are immediately occupied by thread B, since it requires fewer of them. As a result, the resources required by thread A are never sufficient, producing a "livelock".
  • At least one embodiment of the present disclosure provides an instruction distribution method for a multi-thread processor.
  • the multi-thread processor includes multiple decoding instruction queues and multiple execution waiting queues.
  • the multiple decoding instruction queues are respectively used for multiple threads, and the multiple execution waiting queues are respectively used for multiple execution units of corresponding types.
  • the multiple execution waiting queues include at least one shared execution waiting queue shared by the multiple threads and multiple independent execution waiting queues respectively used for the multiple threads.
  • the instruction distribution method includes: receiving multiple thread instruction distribution requests respectively issued by the multiple decoding instruction queues, each request including multiple instructions that need to be sent to execution waiting queues of the corresponding types; determining whether the multiple thread instruction distribution requests are blocked by or in conflict with the multiple execution waiting queues; and, based on this determination, selecting one thread instruction distribution request from the multiple requests and responding to it.
  • At least one embodiment of the present disclosure also provides a data processing device, a processor, an electronic device and a non-transitory readable storage medium corresponding to the above instruction distribution method.
  • In this way, the selection among thread instruction distribution requests avoids the "livelock" phenomenon.
  • one thread instruction distribution request is selected, and responded to, from the multiple thread instruction distribution requests respectively issued by the multiple decoding instruction queues, thereby avoiding the situation in which a decoding instruction queue that has no instructions to issue is selected.
  • FIG. 1 is an example flowchart of an instruction distribution method for a multi-threaded processor provided by at least one embodiment of the present disclosure.
  • FIG. 2 is an example structural block diagram of a multi-threaded processor provided by at least one embodiment of the present disclosure.
  • Multi-threaded processors include multiple decoding instruction queues and multiple execution wait queues.
  • multiple decoding instruction queues are respectively used for multiple threads
  • multiple execution waiting queues are respectively used for multiple execution units of corresponding types.
  • the plurality of execution wait queues includes at least one shared execution wait queue shared by multiple threads and a plurality of independent execution wait queues respectively for the plurality of threads.
  • the instruction distribution method 10 includes the following steps S101 to S103.
  • Step S101 Receive multiple thread instruction distribution requests respectively issued by multiple decoding instruction queues.
  • Each of the plurality of thread instruction distribution requests includes a plurality of instructions that need to be respectively sent to the execution waiting queue of the corresponding type.
  • Step S102 Determine whether multiple thread instruction distribution requests are blocked or conflict with multiple execution waiting queues.
  • Step S103 Based on the judgment, select one thread instruction distribution request from multiple thread instruction distribution requests and respond.
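  • To make the overall flow of steps S101-S103 concrete, here is a minimal self-contained sketch; it is an illustration under stated assumptions, not the patent's implementation, and all names are hypothetical.

```python
def dispatch_cycle(requests, blocked, conflicting, priority):
    """One arbitration cycle in the shape of steps S101-S103.

    requests:    set of thread ids that issued a distribution request (S101)
    blocked/conflicting: sets computed by the token checks of step S102
    priority:    thread ids ordered from highest to lowest priority
    """
    # Step S103a: build the candidate set (prefer clean requests,
    # fall back to conflicting ones so they are not starved).
    clean = requests - blocked - conflicting
    candidates = clean if clean else (requests & conflicting)
    # Step S103b: respond to the highest-priority candidate, if any.
    for t in priority:
        if t in candidates:
            return t
    return None  # no thread dispatches this cycle
```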
  • the instruction distribution method 10 shown in FIG. 1 can be applied to various multi-threaded processors, for example, multi-threaded processors including 2 threads, 4 threads, 6 threads, or 8 threads; embodiments of the present disclosure do not limit this, and the number of threads can be set according to actual needs.
  • in some embodiments, a 4-thread processor includes an instruction distribution module, which receives instructions (such as micro-instructions) decoded by the front end or previously cached instructions (such as micro-instructions), where these instructions correspond respectively to the multiple threads, and selects the instructions of one thread to distribute to the various execution waiting queues at the back end.
  • the multiple execution waiting queues may include an integer calculation instruction queue, an address generation queue, a read memory queue, a write memory queue, an instruction retirement queue, a floating point calculation queue, etc.
  • embodiments of the present disclosure do not impose specific restrictions on this, and the queues can be set according to actual needs. Different instructions enter different execution waiting queues according to their instruction types. For example, in some embodiments, integer calculation instructions enter the integer calculation instruction queue, memory read instructions enter the address generation queue and the read memory queue, and floating point instructions enter the floating point calculation queue. Embodiments of the present disclosure impose no specific restrictions in this regard.
  • since the space of each execution waiting queue is limited, in order to avoid overflow (for example, new instructions being written into an execution waiting queue that is already full), it is usually necessary to track the remaining space of each execution waiting queue.
  • calculating the amount of remaining space or available resources is also referred to herein as calculating the number of tokens corresponding to each execution waiting queue.
  • the integer calculation instruction queue, address generation queue and read memory queue correspond to token 0, token 1 and token 2 respectively.
  • the write memory queue, instruction retirement queue, and floating point instruction queue correspond to token 3, token 4, and token 5, respectively.
  • These execution waiting queues are respectively used for multiple execution units of corresponding types, such as execution units 0-3, floating point execution units, etc. shown in Figure 2.
  • the first-level cache, integer register group, floating-point register group, etc. work together with the aforementioned execution unit.
  • the plurality of execution waiting queues includes at least one shared execution waiting queue shared by multiple threads and a plurality of independent execution waiting queues respectively used for the plurality of threads.
  • the integer calculation instruction queue, the address generation queue, and the read memory queue are shared by multiple threads, so they can be called shared execution waiting queues.
  • the write memory queue, instruction retirement queue and floating point instruction queue are set separately for each thread in multi-threading, so the write memory queue, instruction retirement queue and floating point instruction queue can be called independent execution waiting queues.
  • tokens are divided into two types: shared tokens and independent tokens. For example, in the example shown in Figure 2, tokens 0, 1, and 2 are shared tokens, and tokens 3, 4, and 5 are independent tokens.
  • the 4-thread processor shown in Figure 2 is only an example.
  • the multi-threaded processor in the embodiments of the present disclosure may include more or fewer threads, and may also include more or fewer components, which is not limited by the embodiments of the present disclosure.
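  • As a concrete illustration of the token bookkeeping described above, the following minimal sketch models the Figure 2 example: shared token pools for the queues shared by all threads, and per-thread pools for the independent queues. All identifiers and pool sizes are illustrative assumptions, not values from the patent.

```python
# Shared execution waiting queues: one token pool serves all threads.
shared_tokens = {
    "int_calc": 8,  # token 0: integer calculation instruction queue
    "addr_gen": 4,  # token 1: address generation queue
    "load": 4,      # token 2: read memory queue
}

# Independent execution waiting queues: one token pool per thread.
NUM_THREADS = 4
independent_tokens = [
    {
        "store": 4,   # token 3: write memory queue
        "retire": 8,  # token 4: instruction retirement queue
        "fp": 4,      # token 5: floating point instruction queue
    }
    for _ in range(NUM_THREADS)
]

def available_tokens(thread_id: int) -> dict:
    """Tokens visible to one thread: the shared pools plus its own pools."""
    return {**shared_tokens, **independent_tokens[thread_id]}
```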
  • Figure 3 is a schematic diagram of a multi-thread instruction distribution arbitration provided by at least one embodiment of the present disclosure.
  • step S101 multiple thread instruction distribution requests respectively issued by multiple decoding instruction queues are received.
  • Each of the multiple thread instruction distribution requests includes multiple instructions that need to be sent to execution waiting queues of the corresponding types.
  • decoded or cached instructions are stored in multiple decoded instruction queues, and each thread in the multi-thread corresponds to an independent decoded instruction queue.
  • the four threads respectively correspond to the decoding instruction queue T0, the decoding instruction queue T1, the decoding instruction queue T2, and the decoding instruction queue T3.
  • the decoded instruction queue can output multiple instructions in one clock cycle, and different instructions must enter the execution waiting queue of the corresponding type.
  • if the decoded instruction queue T1 has a set of instructions to send in the current cycle, the decoded instruction queue T1 can send a thread instruction distribution request to the instruction distribution module.
  • if the decoded instruction queue T2 has no instructions to send in the current cycle, the decoded instruction queue T2 does not send a thread instruction distribution request to the instruction distribution module. In this way, the instruction distribution module will not select a thread that has nothing to distribute, thereby avoiding wasted resources and improving the overall performance of the processor.
  • determining whether the multiple thread instruction distribution requests are blocked by or in conflict with the multiple execution waiting queues may include: making this determination based on the number of tokens currently available in each of the multiple execution waiting queues. In this way, by comparing the number of tokens required by a thread instruction distribution request with the number of tokens remaining in each execution waiting queue, it can be predicted whether the requested group of instructions would cause blocking, so that threads that would cause blocking can be avoided and efficiency improved.
  • the number of tokens currently available in each execution waiting queue is used to represent the available/remaining space of each execution waiting queue in the current clock cycle.
  • the token blocking determination may include: in response to the number of tokens in a first execution waiting queue required by a first thread instruction distribution request among the multiple thread instruction distribution requests being greater than the number of tokens currently available in the first execution waiting queue, determining that the first thread instruction distribution request is blocked for the first execution waiting queue.
  • the first thread instruction distribution request is used to represent any thread instruction distribution request among multiple thread instruction distribution requests
  • the first execution waiting queue is used to represent any one of the multiple execution waiting queues.
  • neither the first thread instruction distribution request nor the first execution waiting queue is limited to a specific thread instruction distribution request or execution waiting queue, nor to a specific order; they can be set according to actual needs.
  • a set of instructions output by the decoded instruction queue within one clock cycle is treated as a whole. If a thread (decoded instruction queue) is selected but the tokens it requires are insufficient, the thread's group of instructions is blocked as a whole, and not even part of the group can be distributed.
  • for each thread, it is therefore necessary to calculate the total number of each kind of token required, based on the types of the group of instructions output by the decoded instruction queue in one clock cycle, and then compare these totals with the number of remaining/available tokens in each corresponding execution waiting queue. If the number of remaining/available tokens in any execution waiting queue is insufficient, the thread would be blocked and unable to distribute even if it were selected by arbitration.
  • Figure 4 is a schematic block diagram of a token blocking operation provided by at least one embodiment of the present disclosure.
  • the decoding instruction queues T0-T3 are respectively used for the threads T0-T3; the group of instructions to be sent by the decoding instruction queue T0 (for simplicity, also referred to as thread T0) includes 1 read memory instruction, 2 addition instructions, and 1 floating point instruction.
  • the required number of tokens in each execution waiting queue is: 2 of token 0 (corresponding to the integer calculation instruction queue), 1 of token 1 (corresponding to the address generation queue), 1 of token 2 (corresponding to the read memory queue), 4 of token 4 (corresponding to the instruction retirement queue), and 1 of token 5 (corresponding to the floating point instruction queue).
  • if the number of tokens currently available in any execution waiting queue is less than the number required by the decoding instruction queue T0, the group of instructions from the decoding instruction queue T0 cannot be placed into the execution waiting queues, and the corresponding thread T0 is determined to be blocked.
  • for example, the decoding instruction queue T0 requires 4 of token 4, but the number of tokens currently available in the instruction retirement queue is 2; that is, only 2 of token 4 remain.
  • since the number of token 4 required by the decoding instruction queue T0 is greater than the number of token 4 currently available in the instruction retirement queue, the decoding instruction queue T0 is blocked for the instruction retirement queue; that is, the thread instruction distribution request sent by the decoding instruction queue T0 is blocked by the instruction retirement queue.
  • similarly, the decoding instruction queue T0 requires 1 of token 1, and the number of tokens currently available in the address generation queue is 2; that is, 2 of token 1 remain.
  • since the number of token 1 required by the decoding instruction queue T0 is less than the number of token 1 currently available in the address generation queue, the decoding instruction queue T0 is not blocked for the address generation queue; that is, the thread instruction distribution request sent by the decoding instruction queue T0 is not blocked by the address generation queue. The blocking check is sketched in code below.
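  • The following is a minimal sketch of that blocking determination, using token dictionaries consistent with the earlier sketch and the token requirements of the Figure 4 example; it is illustrative, not the patent's implementation.

```python
# Token requirements of thread T0's instruction group in the Figure 4
# example: 1 load, 2 adds, 1 floating point instruction.
required_t0 = {"int_calc": 2, "addr_gen": 1, "load": 1, "retire": 4, "fp": 1}

def is_blocked(required: dict, available: dict) -> bool:
    """A request is blocked if ANY queue lacks enough tokens for the whole
    instruction group, since the group is distributed as a whole."""
    return any(need > available.get(queue, 0) for queue, need in required.items())

# With only 2 retirement tokens left (T0 needs 4), the request is blocked:
assert is_blocked(required_t0, {**required_t0, "retire": 2})
# With enough tokens everywhere, it is not:
assert not is_blocked(required_t0, required_t0)
```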
  • the shared-token conflict determination includes: in response to a second thread instruction distribution request among the multiple thread instruction distribution requests being blocked for a second execution waiting queue in the at least one shared execution waiting queue, while the number of tokens in the second execution waiting queue required by a third thread instruction distribution request is not greater than the number of tokens currently available in the second execution waiting queue, determining that the second thread instruction distribution request conflicts for the second execution waiting queue.
  • the second thread instruction distribution request is used to represent any thread instruction distribution request among multiple thread instruction distribution requests
  • the third thread instruction distribution request is used to represent any thread instruction distribution request, among the multiple thread instruction distribution requests, that is different from the second thread instruction distribution request.
  • the second execution waiting queue is used to represent any shared execution waiting queue in the at least one shared execution waiting queue.
  • the second thread instruction distribution request and the third thread instruction distribution request are not limited to specific thread instruction distribution requests, nor are they limited to a specific order, and can be set according to actual needs.
  • the second execution waiting queue is not limited to a specific execution waiting queue, nor is it limited to a specific order, and can be set according to actual needs.
  • the integer calculation instruction queue, the address generation queue, and the read memory queue are shared execution waiting queues shared by multiple threads; that is, token 0, token 1, and token 2 are shared tokens.
  • the thread instruction distribution request issued by the decoding instruction queue T0 includes a set of instructions as shown in Figure 4, and the decoding instruction queue T0 currently requires 1 shared token 2.
  • suppose another decoding instruction queue T1 requires 4 of shared token 2, while only 1 token currently remains available in the read memory queue. Then the decoding instruction queue T1 is blocked for the read memory queue, while the decoding instruction queue T0 is not. In this case it is determined that the decoding instruction queue T1 conflicts for the read memory queue, and a conflict is likewise determined for the other decoding instruction queues (i.e., the decoding instruction queue T0) that request the same shared execution waiting queue (i.e., the read memory queue).
  • why treat this as a conflict? Suppose the decoding instruction queue T0 issues one read memory instruction per clock cycle and the read memory queue releases 1 token per clock cycle. If selection were based only on blocking, the decoding instruction queue T0 (thread T0) would be selected every time, and the 4 of token 2 needed by the decoding instruction queue T1 (thread T1) would never be satisfied. This situation is called "livelock". The conflict check is sketched below.
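  • Here is a minimal sketch of that shared-token conflict check, under the same illustrative token model as above; the function name and the marking of all requesters of the contested shared queue as conflicting follow the description, and everything else is an assumption.

```python
SHARED_QUEUES = {"int_calc", "addr_gen", "load"}  # tokens 0-2 in Figure 2

def find_conflicts(requests: dict, available: dict) -> set:
    """requests: thread_id -> {queue: tokens needed}; available: {queue: tokens}.
    If some request is blocked on a shared queue while another request's
    demand on that queue would fit, every requester of that shared queue
    is marked as conflicting."""
    conflicting = set()
    for q in SHARED_QUEUES:
        demanders = {t: r[q] for t, r in requests.items() if q in r}
        blocked = {t for t, need in demanders.items() if need > available.get(q, 0)}
        fitting = {t for t, need in demanders.items() if need <= available.get(q, 0)}
        if blocked and fitting:
            conflicting |= set(demanders)
    return conflicting

# The livelock example: T0 needs 1 of token 2, T1 needs 4, only 1 available.
assert find_conflicts({0: {"load": 1}, 1: {"load": 4}}, {"load": 1}) == {0, 1}
```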
  • selecting one thread instruction distribution request from the multiple thread instruction distribution requests and responding to it includes: based on the determination, adding at least one of the multiple thread instruction distribution requests to a candidate request set; and, based on the priorities of the multiple threads, selecting one thread instruction distribution request from the candidate request set and responding to it.
  • in response to there being, among the multiple thread instruction distribution requests, a fourth thread instruction distribution request that is neither blocked by nor in conflict with the multiple execution waiting queues, the fourth thread instruction distribution request is added to the candidate request set.
  • thread instruction distribution requests issued by non-blocked, non-conflicting threads can be put directly into the candidate request set, while threads that are blocked or in conflict are not selected. In this way, wasted processor work is avoided and processing efficiency is improved.
  • in response to no such fourth thread instruction distribution request existing, while a fifth thread instruction distribution request that conflicts with the multiple execution waiting queues does exist, the fifth thread instruction distribution request is added to the candidate request set.
  • thread instruction distribution requests issued by the conflicting threads may be put into the candidate request set.
  • in this way, a thread blocked on a shared token will not fall into livelock because other threads continually occupy that shared token. That is, when no non-blocked, non-conflicting thread exists, conflicting threads also get the opportunity to be selected, thereby avoiding livelock.
  • for example, suppose thread T0 and thread T1 conflict for a certain shared token,
  • thread T2 is blocked for an independent token,
  • and thread T3 has neither conflict nor blocking.
  • in this case, only the thread instruction distribution request issued by thread T3 is added to the candidate request set.
  • as another example, suppose thread T0 and thread T1 conflict for a certain shared token,
  • while thread T2 and thread T3 are blocked.
  • the thread instruction distribution requests issued by the conflicting thread T0 and thread T1 can be added to the candidate request set for selection.
  • the fourth thread instruction distribution request is used to represent any thread instruction distribution request that does not have blocking and conflict
  • the fifth thread instruction distribution request is used to represent any thread instruction distribution request that has a conflict.
  • the fourth thread instruction distribution request and the fifth thread instruction distribution request are not limited to specific thread instruction distribution requests, nor to a specific order; they can be set according to actual needs. The candidate-set rule is sketched below.
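  • A minimal sketch of the candidate-set construction just described, matching the two examples above (the function name and the set representation are illustrative assumptions).

```python
def build_candidates(requests: set, blocked: set, conflicting: set) -> set:
    """Prefer requests that are neither blocked nor conflicting; if none
    exist, fall back to the conflicting requests so that they still get a
    chance to be selected (this is what avoids livelock)."""
    clean = requests - blocked - conflicting
    return clean if clean else (requests & conflicting)

# First example: T0/T1 conflict, T2 blocked, T3 clean -> only T3.
assert build_candidates({0, 1, 2, 3}, {2}, {0, 1}) == {3}
# Second example: T0/T1 conflict, T2/T3 blocked -> the conflicting T0, T1.
assert build_candidates({0, 1, 2, 3}, {2, 3}, {0, 1}) == {0, 1}
```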
  • a thread can be selected from the candidate request set according to the priority of each thread.
  • the current priorities of multiple threads are determined according to the least recently used (LRU) algorithm, and the thread instruction distribution request with the highest priority is selected from the candidate request set.
  • the priority of each thread is adjusted through the LRU algorithm. If thread T1 has not issued instructions recently and other threads have records of issuing instructions, then the priority of thread T1 is adjusted to the highest.
  • using the LRU algorithm to determine the current priorities of the multiple threads may include: initializing the priorities of the multiple threads; and, in response to a first thread among the multiple threads being selected in the previous clock cycle, setting the priority of the first thread to the lowest in the current clock cycle and incrementing the priorities of the other threads.
  • the first thread is used to represent any thread among the multiple threads, and is not limited to a specific thread or a specific order; it can be set according to actual needs.
  • FIG. 5 is a schematic diagram of adjusting priorities of multiple threads using a least recently used (LRU) algorithm provided by at least one embodiment of the present disclosure.
  • the priority of each thread is first initialized.
  • the priority ordering (from high to low) of multiple thread initialization is T0, T1, T2, T3.
  • if thread T1 is selected in the first clock cycle, the priority of thread T1 is adjusted to the lowest and the priorities of the other threads (threads T0, T2, and T3) are incremented; that is, in the second clock cycle the priority ordering (from high to low) of the multiple threads is T0, T2, T3, T1.
  • if thread T0 is then selected in the second clock cycle, the priority of thread T0 is adjusted to the lowest and the priorities of the other threads (threads T1, T2, and T3) are incremented; that is, in the third clock cycle the priority ordering (from high to low) of the multiple threads is T2, T3, T1, T0.
  • the LRU algorithm may be implemented using a queue of multiple entries, each entry storing a thread number. For example, the thread at the head of the queue has the highest priority and the thread at the tail has the lowest. Each time arbitration selects a thread, its thread number is removed from the queue and reinserted at the tail, indicating that the thread has just been selected and its priority is now the lowest. A sketch of this queue follows.
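  • A minimal sketch of that LRU priority queue (class and method names are illustrative assumptions), reproducing the Figure 5 example.

```python
from collections import deque

class LruPriority:
    """Head of the queue = highest priority, tail = lowest. The selected
    thread moves to the tail, so recently served threads sink in priority."""

    def __init__(self, num_threads: int):
        self.order = deque(range(num_threads))  # initial order: T0 > T1 > T2 > T3

    def pick(self, candidates: set) -> int:
        """Return the highest-priority candidate and demote it to lowest."""
        chosen = next(t for t in self.order if t in candidates)
        self.order.remove(chosen)
        self.order.append(chosen)
        return chosen

lru = LruPriority(4)
assert lru.pick({1, 2, 3}) == 1          # cycle 1: T1 selected
assert list(lru.order) == [0, 2, 3, 1]   # cycle 2 ordering: T0, T2, T3, T1
assert lru.pick({0, 1, 2, 3}) == 0       # cycle 2: T0 selected
assert list(lru.order) == [2, 3, 1, 0]   # cycle 3 ordering: T2, T3, T1, T0
```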
  • the current priorities of multiple threads may be determined according to a round-robin algorithm.
  • FIG. 6 is a schematic diagram of adjusting the priorities of multiple threads using a polling (round-robin) algorithm according to at least one embodiment of the present disclosure.
  • the priority of each thread is first initialized.
  • the priority ordering (from high to low) of multiple thread initialization is T0, T1, T2, T3.
  • if thread T1 is selected in the first clock cycle, the priority of thread T1 is adjusted to the lowest, the priority of the thread next to thread T1 (i.e., thread T2) is set to the highest, and the other threads T0 and T3 follow in round-robin order; that is, in the second clock cycle the priority ordering (from high to low) of the multiple threads is T2, T3, T0, T1.
  • if thread T0 is then selected in the second clock cycle, the priority of thread T0 is adjusted to the lowest, the priority of the thread next to thread T0 (i.e., thread T1) is set to the highest, and the other threads T2 and T3 follow in round-robin order; that is, in the third clock cycle the priority ordering (from high to low) of the multiple threads is T1, T2, T3, T0.
  • this algorithm is simple and ensures the fairness of multi-thread distribution; a sketch follows.
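  • A minimal sketch of the round-robin priority scheme (names are illustrative assumptions), reproducing the Figure 6 example.

```python
class RoundRobinPriority:
    """After thread k is selected, thread (k+1) mod N becomes highest
    priority and the rest follow in circular order, so the selected
    thread ends up with the lowest priority."""

    def __init__(self, num_threads: int):
        self.n = num_threads
        self.head = 0  # thread currently holding the highest priority

    def order(self) -> list:
        return [(self.head + i) % self.n for i in range(self.n)]

    def pick(self, candidates: set) -> int:
        chosen = next(t for t in self.order() if t in candidates)
        self.head = (chosen + 1) % self.n
        return chosen

rr = RoundRobinPriority(4)
assert rr.pick({1, 2, 3}) == 1      # cycle 1: T1 selected
assert rr.order() == [2, 3, 0, 1]   # cycle 2 ordering: T2, T3, T0, T1
assert rr.pick({0, 1}) == 0         # cycle 2: T0 selected (T2, T3 idle)
assert rr.order() == [1, 2, 3, 0]   # cycle 3 ordering: T1, T2, T3, T0
```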
  • FIG. 7 is a schematic block diagram of an instruction distribution method provided by at least one embodiment of the present disclosure.
  • multi-thread selection is performed based on the request signal and priority signal of each thread (decoding instruction queue); ultimately, among the one or more threads whose request signals are valid, the thread with the highest priority is selected.
  • each thread checks whether it is non-blocked and non-conflicting, yielding a corresponding judgment signal; the judgment signals of the 4 threads are ORed together (denoted ≥1 in Figure 7) to obtain a confirmation signal indicating whether a non-blocked, non-conflicting thread exists among the 4 threads.
  • the confirmation signal that a non-blocked, non-conflicting thread exists among the four threads is inverted and ANDed with each thread's token conflict signal (denoted & in Figure 7), the result is ORed with that thread's judgment signal, and this is ANDed with the thread's request signal to obtain the final request signal. Then, combining the priority signals of the multiple threads (for example, according to the aforementioned LRU or round-robin algorithm), one thread is selected and responded to. This combination is sketched below.
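  • A minimal sketch of that boolean combination (per-thread lists indexed by thread id; all names are illustrative assumptions, not the patent's signal names).

```python
def final_requests(request, judge, conflict):
    """request[i]  - thread i issued a distribution request this cycle
    judge[i]    - thread i is non-blocked AND non-conflicting
    conflict[i] - thread i has a shared-token conflict
    A thread's final request is valid if it made a request and either is
    clean, or no clean thread exists and it is merely conflicting."""
    any_clean = any(judge)  # the ORed (>=1) confirmation signal
    return [
        request[i] and (judge[i] or (not any_clean and conflict[i]))
        for i in range(len(request))
    ]

# T0/T1 conflicting, T2 blocked, T3 clean -> only T3's request survives.
assert final_requests([True] * 4,
                      [False, False, False, True],
                      [True, True, False, False]) == [False, False, False, True]
```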
  • the instruction distribution method 10 for a multi-thread processor provided by at least one embodiment of the present disclosure can make multi-thread instruction distribution more efficient and fair, and can also avoid the "livelock" phenomenon, thereby improving the performance of the multi-thread processor. overall performance.
  • in some embodiments, the multiple threads include at least 3 threads.
  • a 4-thread processor is used as an example to illustrate the instruction distribution method 10 provided by the embodiment of the present disclosure
  • the instruction distribution method 10 provided by the embodiment of the present disclosure is not only applicable to 4-thread processors. It can also be applied to 2-thread processors, 3-thread processors, 5-thread processors, etc., and the embodiments of the present disclosure do not limit this.
  • the execution order of the steps of the instruction distribution method 10 is not limited; although the steps are described above in a specific order, this does not limit the embodiments of the present disclosure. The steps of the instruction distribution method 10 can be executed serially or in parallel, as determined by actual requirements. The instruction distribution method 10 may also include more or fewer steps, which the embodiments of the present disclosure do not limit.
  • FIG. 8 is a schematic block diagram of an instruction distribution device provided by at least one embodiment of the present disclosure.
  • At least one embodiment of the present disclosure provides an instruction distribution device 80.
  • the instruction distribution device 80 is respectively communicatively connected with a plurality of decoding instruction queues 802 and a plurality of execution waiting queues 803 of the multi-thread processor.
  • multiple decoding instruction queues 802 are respectively used for multiple threads
  • multiple execution waiting queues 803 are respectively used for multiple execution units of corresponding types.
  • multiple execution wait queues 803 include at least one shared execution wait queue shared by multiple threads and multiple independent execution wait queues respectively for multiple threads.
  • the instruction distribution device 80 includes a receiving unit 811, a judging unit 812, and a selecting unit 813.
  • the receiving unit 811 is communicatively connected to the multi-thread processor and is configured to receive multiple thread instruction distribution requests respectively issued by multiple decoding instruction queues of the multi-thread processor.
  • Each of the multiple thread instruction distribution requests includes multiple instructions that need to be sent to execution waiting queues of the corresponding types.
  • the receiving unit 811 can implement step S101; for its specific implementation, refer to the relevant description of step S101, which will not be repeated here.
  • the determination unit 812 is communicatively connected to the multi-thread processor and is configured to determine whether multiple thread instruction distribution requests are blocked or conflict with multiple execution waiting queues of the multi-thread processor.
  • the judgment unit 812 can implement step S102; for its specific implementation, refer to the relevant description of step S102, which will not be repeated here.
  • the selection unit 813 is configured to select one thread instruction distribution request from multiple thread instruction distribution requests based on the above determination and respond.
  • the selection unit 813 can implement step S103; for its specific implementation, refer to the relevant description of step S103, which will not be repeated here.
  • the above-mentioned receiving unit 811, judgment unit 812 and selection unit 813 can be implemented by software, hardware, firmware or any combination thereof.
  • for example, the above-mentioned receiving unit 811, judgment unit 812, and selection unit 813 can be respectively implemented as a receiving circuit 811, a judgment circuit 812, and a selection circuit 813; the embodiments of the present disclosure do not limit their specific implementations.
  • the judgment unit 812 may include a combination judgment subunit.
  • the combined determination subunit may be configured to determine whether multiple thread instruction distribution requests are blocked or conflicting with multiple execution waiting queues based on the number of tokens currently available in each of the multiple execution waiting queues.
  • the operations that can be implemented by the combination judgment subunit can be referred to the relevant description of the aforementioned instruction distribution method 10, which will not be described again here.
  • the combination judgment subunit may include a blocking judgment unit and a conflict judgment unit.
  • the blocking judgment unit is configured to determine, in response to the number of tokens in a first execution waiting queue required by a first thread instruction distribution request among the multiple thread instruction distribution requests being greater than the number of tokens currently available in the first execution waiting queue, that the first thread instruction distribution request is blocked for the first execution waiting queue.
  • the operations that can be implemented by the blocking judgment unit can be referred to the relevant description of the aforementioned instruction distribution method 10, which will not be described again here.
  • the conflict judgment unit is configured to determine, in response to a second thread instruction distribution request among the multiple thread instruction distribution requests being blocked for a second execution waiting queue in the at least one shared execution waiting queue, while the number of tokens in the second execution waiting queue required by a third thread instruction distribution request is not greater than the number of tokens currently available in the second execution waiting queue, that the second thread instruction distribution request and the third thread instruction distribution request conflict for the second execution waiting queue.
  • for the operations that can be implemented by the conflict judgment unit, refer to the relevant description of the aforementioned instruction distribution method 10, which will not be repeated here.
  • the selection unit 813 may include a candidate selection unit and a priority selection unit.
  • the candidate selection unit is configured to add at least one of the plurality of thread instruction distribution requests to the candidate request set based on the above determination.
  • the priority selection unit is configured to select one thread instruction distribution request from the candidate request set and respond to it, based on the priorities of the multiple threads.
  • for the operations that can be implemented by the candidate selection unit and the priority selection unit, refer to the relevant description of the aforementioned instruction distribution method 10, which will not be repeated here.
  • the candidate selection unit may include a direct selection unit and a conflict selection unit.
  • the direct selection unit is configured to, in response to there being, among the multiple thread instruction distribution requests, a fourth thread instruction distribution request that is neither blocked by nor in conflict with the multiple execution waiting queues, add the fourth thread instruction distribution request to the candidate request set.
  • the conflict selection unit is configured to, in response to no fourth thread instruction distribution request existing among the multiple thread instruction distribution requests while a fifth thread instruction distribution request that conflicts with the multiple execution waiting queues does exist, add the fifth thread instruction distribution request to the candidate request set.
  • the priority selection unit may include a setting unit and a distribution unit.
  • the setting unit is configured to determine the current priorities of the multiple threads according to the least recently used (LRU) algorithm.
  • the dispatch unit is configured to select a thread instruction dispatch request with the highest priority from the set of candidate requests.
  • the operations that can be implemented by the setting unit and the distribution unit please refer to the relevant description of the aforementioned instruction distribution method 10, which will not be described again here.
  • the setting unit may include an initialization unit and an adjustment unit.
  • the initialization unit is configured to initialize priorities of multiple threads.
  • the adjustment unit is configured to, in response to the first thread among the multiple threads being selected in the previous clock cycle, set the priority of the first thread to the lowest in the current clock cycle and increment the priorities of the threads other than the first thread.
  • the operations that can be implemented by the initialization unit and the adjustment unit can be referred to the relevant description of the aforementioned instruction distribution method 10, which will not be described again here.
  • the priority selection unit may include a setting subunit.
  • the setting subunit is configured to determine the current priorities of the multiple threads based on a polling (round-robin) algorithm. For the operations that the setting subunit can implement, refer to the relevant description of the aforementioned instruction distribution method 10, which will not be repeated here.
  • in some embodiments, the multiple threads include at least 3 threads.
  • the above-mentioned combination judgment subunit, blocking judgment unit, conflict judgment unit, candidate selection unit, priority selection unit, direct selection unit, conflict selection unit, setting unit, distribution unit, initialization unit, adjustment unit, and setting subunit can be implemented by software, hardware, firmware, or any combination thereof. For example, they can be respectively implemented as a combination judgment subcircuit, a blocking judgment circuit, a conflict judgment circuit, a candidate selection circuit, a priority selection circuit, a direct selection circuit, a conflict selection circuit, a setting circuit, a distribution circuit, an initialization circuit, an adjustment circuit, and a setting subcircuit; the embodiments of the present disclosure do not limit their specific implementations.
  • the instruction distribution device 80 provided by at least one embodiment of the present disclosure can implement the foregoing instruction distribution method 10 of a multi-thread processor, and can also achieve similar technical effects to the foregoing instruction distribution method 10 .
  • the efficiency and fairness of multi-thread instruction distribution can be effectively improved, and the livelock phenomenon can be avoided.
  • the instruction distribution device 80 may include more or fewer circuits or units, and the connection relationships among the various circuits or units are not limited and may be determined according to actual needs.
  • the specific construction method of each circuit is not limited. It can be composed of analog devices according to the circuit principle, or it can be composed of digital chips, or it can be constructed in other suitable ways.
  • FIG. 9 is a schematic block diagram of a data processing device 70 provided by at least one embodiment of the present disclosure.
  • the data processing device 70 includes an instruction distribution device 701, a plurality of decoding instruction queues 702, and a plurality of execution waiting queues 703.
  • for example, in some examples, the data processing device 70 includes the instruction distribution device 801, the plurality of decoding instruction queues 802, and the plurality of execution waiting queues 803 shown in FIG. 8.
  • the data processing device 70 may include more or fewer circuits or units, and the connection relationships between the circuits or units are not limited and may be determined according to actual requirements.
  • the specific construction manner of each circuit is not limited: each circuit may be composed of analog devices according to circuit principles, of digital chips, or in other suitable manners.
  • the data processing device 70 provided by at least one embodiment of the present disclosure can implement the aforementioned instruction distribution method 10 of a multi-threaded processor, and can also achieve technical effects similar to those of the aforementioned instruction distribution method 10. For example, through the data processing device 70 provided by at least one embodiment of the present disclosure, the efficiency and fairness of multi-thread instruction distribution can be effectively improved, and the livelock phenomenon can be avoided, thereby improving the overall performance of the multi-threaded processor.
  • Figure 10 is a schematic block diagram of a processor provided by at least one embodiment of the present disclosure.
  • processor 90 includes the data processing device 70 described in any of the above embodiments.
  • processor 90 may be a multi-threaded processor, such as a 4-thread processor.
  • the processor 90 provided by at least one embodiment of the present disclosure can implement the foregoing instruction distribution method 10 of a multi-thread processor, and can also achieve similar technical effects to the foregoing instruction distribution method 10 .
  • the efficiency and fairness of multi-thread instruction distribution can be effectively improved, and the livelock phenomenon can be avoided, thereby improving the overall performance of the multi-thread processor.
  • Figure 11 is a schematic block diagram of an electronic device provided by at least one embodiment of the present disclosure.
  • At least one embodiment of the present disclosure also provides an electronic device 20.
  • the electronic device 20 includes a processor 210 and a memory 220 .
  • Memory 220 includes one or more computer program modules 221.
  • One or more computer program modules 221 are stored in the memory 220 and configured to be executed by the processor 210.
  • the one or more computer program modules 221 include instructions for executing the instruction distribution method 10 provided by at least one embodiment of the present disclosure.
  • the instructions, when executed by the processor 210, may perform one or more steps in the instruction distribution method 10 provided by at least one embodiment of the present disclosure.
  • Memory 220 and processor 210 may be interconnected by a bus system and/or other forms of connection mechanisms (not shown).
  • the processor 210 may be a central processing unit (CPU), a digital signal processor (DSP), or other forms of processing units with data processing capabilities and/or program execution capabilities, such as a field programmable gate array (FPGA), and the like.
  • the central processing unit (CPU) may be of X86 or ARM architecture.
  • the processor 210 may be a general-purpose processor or a special-purpose processor that may control other components in the electronic device 20 to perform desired functions.
  • memory 220 may include any combination of one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory.
  • Volatile memory may include, for example, random access memory (RAM) and/or cache memory (cache), etc.
  • Non-volatile memory may include, for example, read-only memory (ROM), a hard disk, erasable programmable read-only memory (EPROM), portable compact disk read-only memory (CD-ROM), USB memory, flash memory, etc.
  • One or more computer program modules 221 may be stored on the computer-readable storage medium, and the processor 210 may run the one or more computer program modules 221 to implement various functions of the electronic device 20 .
  • Figure 12 is a schematic block diagram of yet another electronic device provided by at least one embodiment of the present disclosure.
  • the electronic device 300 shown in FIG. 12 is only an example and should not bring any limitations to the functions and usage scope of the embodiments of the present disclosure.
  • the electronic device 300 includes a processing device (such as a central processing unit, a graphics processor, etc.) 301, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 302 or a program loaded from a storage device 308 into a random access memory (RAM) 303.
  • the RAM 303 also stores various programs and data required for the operation of the computer system.
  • the processing device 301, the ROM 302, and the RAM 303 are connected to each other via a bus 304.
  • An input/output (I/O) interface 305 is also connected to the bus 304.
  • the following components may be connected to the I/O interface 305: an input device 306 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; an output device 307 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; a storage device 308 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 309 including a network interface card such as a LAN card, a modem, etc.
  • the communication device 309 may allow the electronic device 300 to communicate wirelessly or wiredly with other devices to exchange data, performing communication processing via a network such as the Internet.
  • A drive 310 is also connected to the I/O interface 305 as needed. A removable medium 311, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is installed on the drive 310 as needed, so that a computer program read therefrom is installed into the storage device 308 as needed.
  • Although FIG. 12 illustrates an electronic device 300 including various devices, it should be understood that implementation or inclusion of all illustrated devices is not required; more or fewer devices may alternatively be implemented or included.
  • the electronic device 300 may further include a peripheral interface (not shown in the figure) and the like.
  • the peripheral interface can be various types of interfaces, such as a USB interface, a lightning interface, etc.
  • the communication device 309 may communicate via wireless communication with a network and other devices, the network being, for example, the Internet, an intranet, and/or a wireless network such as a cellular telephone network, a wireless local area network (LAN), and/or a metropolitan area network (MAN).
  • Wireless communication can use any of a variety of communication standards, protocols, and technologies, including but not limited to Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), Wideband Code Division Multiple Access (W-CDMA), Code Division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Bluetooth, Wi-Fi (e.g., based on the IEEE 802.11a, IEEE 802.11b, IEEE 802.11g, and/or IEEE 802.11n standards), Voice over Internet Protocol (VoIP), Wi-MAX, protocols for e-mail, instant messaging, and/or Short Message Service (SMS), or any other suitable communication protocol.
  • the electronic device 300 can be any device such as a mobile phone, a tablet computer, a notebook computer, an e-book reader, a game console, a television, a digital photo frame, or a navigator, or can be any combination of a data processing device and hardware.
  • embodiments of the present disclosure include a computer program product including a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • the computer program may be downloaded and installed from a network via the communication device 309, or installed from the storage device 308, or installed from the ROM 302.
  • When the computer program is executed by the processing device 301, the instruction distribution method 10 disclosed in the embodiments of the present disclosure is performed.
  • the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
  • the computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.
  • the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above.
  • A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium; such a signal medium can send, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer-readable medium may be transmitted using any suitable medium, including but not limited to: wire, optical cable, RF (radio frequency), etc., or any suitable combination of the above.
  • the computer-readable medium may be included in the electronic device 300; it may also exist independently without being assembled into the electronic device 300.
  • Figure 13 is a schematic block diagram of a non-transitory readable storage medium provided by at least one embodiment of the present disclosure.
  • Embodiments of the present disclosure also provide a non-transitory readable storage medium.
  • As shown in FIG. 13, computer instructions 111 are stored on the non-transitory readable storage medium 100, and when executed by a processor, the computer instructions 111 perform one or more steps of the instruction distribution method 10 as described above.
  • the non-transitory readable storage medium 100 may be any combination of one or more computer-readable storage media.
  • one computer-readable storage medium contains computer-readable program code for receiving multiple thread instruction distribution requests respectively issued by multiple decoding instruction queues, and another computer-readable storage medium contains computer-readable program code for determining whether the multiple thread instruction distribution requests are blocked or in conflict with respect to multiple execution waiting queues.
  • Yet another computer-readable storage medium contains computer-readable program code for, based on the above determination, selecting one thread instruction distribution request from the multiple thread instruction distribution requests and responding to it.
  • each of the above program codes can also be stored in the same computer-readable medium, and embodiments of the present disclosure do not limit this.
  • when the program code is read by a computer, the computer can execute the program code stored in the computer storage medium and perform, for example, the instruction distribution method 10 provided by any embodiment of the present disclosure.
  • the storage medium may include a memory card of a smartphone, a storage component of a tablet computer, a hard disk of a personal computer, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), portable compact disk read-only memory (CD-ROM), flash memory, or any combination of the above storage media, and may also be another suitable storage medium.
  • the readable storage medium may also be the memory 220 in FIG. 11. For related descriptions, reference may be made to the foregoing content, which will not be described again here.
  • the term “plurality” refers to two or more than two, unless expressly limited otherwise.


Abstract

An instruction distribution method and device for a multi-threaded processor, and a storage medium. The instruction distribution method includes: receiving multiple thread instruction distribution requests respectively issued by multiple decoding instruction queues of the multi-threaded processor, each of the multiple thread instruction distribution requests including multiple instructions that need to be respectively sent to execution waiting queues of corresponding types; determining whether the multiple thread instruction distribution requests are blocked or in conflict with respect to multiple execution waiting queues of the multi-threaded processor; and, based on the determination, selecting one thread instruction distribution request from the multiple thread instruction distribution requests and responding to it. The method makes multi-thread instruction distribution more efficient and fairer, and effectively avoids the "livelock" phenomenon.

Description

Instruction distribution method and apparatus for multi-threaded processor, and storage medium
This application claims priority to Chinese Patent Application No. 202211033483.6, filed on August 26, 2022; the entire disclosure of the above Chinese patent application is incorporated herein by reference as a part of this application.
Technical Field
Embodiments of the present disclosure relate to an instruction distribution method for a multi-threaded processor, an instruction distribution device, a data processing device, a processor, an electronic device, and a non-transitory readable storage medium.
Background
Simultaneous multi-threading (SMT) is a hardware multi-threading technique that allows instructions from multiple threads to be executed within one clock cycle of a central processing unit (CPU). In essence, simultaneous multi-threading is a method of converting thread-level parallelism (multiple CPUs) into instruction-level parallelism (the same CPU). Simultaneous multi-threading is the ability of a single physical processor to dispatch instructions from multiple hardware thread contexts at the same time. Simultaneous multi-threading is used to create performance advantages in commercial environments and for workloads with high cycle/instruction counts. The processor adopts a superscalar structure, which is suitable for fetching and executing instructions in parallel. Simultaneous multi-threading allows two applications to be scheduled and run on the same processor at the same time, thereby exploiting the superscalar nature of the processor.
Summary
At least one embodiment of the present disclosure provides an instruction distribution method for a multi-threaded processor. The instruction distribution method includes: receiving multiple thread instruction distribution requests respectively issued by multiple decoding instruction queues of the multi-threaded processor; determining whether the multiple thread instruction distribution requests are blocked or in conflict with respect to multiple execution waiting queues of the multi-threaded processor; and, based on the determination, selecting one thread instruction distribution request from the multiple thread instruction distribution requests and responding to it. Each of the multiple thread instruction distribution requests includes multiple instructions that need to be respectively sent to execution waiting queues of corresponding types.
For example, in the method provided by at least one embodiment of the present disclosure, determining whether the multiple thread instruction distribution requests are blocked or in conflict with respect to the multiple execution waiting queues includes: determining, based on the number of tokens currently available in each of the multiple execution waiting queues, whether the multiple thread instruction distribution requests are blocked or in conflict with respect to the multiple execution waiting queues.
For example, in the method provided by at least one embodiment of the present disclosure, determining whether the multiple thread instruction distribution requests are blocked with respect to the multiple execution waiting queues includes: in response to the number of tokens of a first execution waiting queue among the multiple execution waiting queues required by a first thread instruction distribution request among the multiple thread instruction distribution requests being greater than the number of tokens currently available in the first execution waiting queue, determining that the first thread instruction distribution request is blocked with respect to the first execution waiting queue.
For example, in the method provided by at least one embodiment of the present disclosure, the multiple execution waiting queues include at least one shared execution waiting queue shared by the multiple threads. Determining whether the multiple thread instruction distribution requests are in conflict with respect to the multiple execution waiting queues includes: in response to a second thread instruction distribution request among the multiple thread instruction distribution requests being blocked with respect to a second execution waiting queue in the at least one shared execution waiting queue, and the number of tokens of the second execution waiting queue required by a third thread instruction distribution request among the multiple thread instruction distribution requests being not greater than the number of tokens currently available in the second execution waiting queue, determining that the second thread instruction distribution request and the third thread instruction distribution request are in conflict with respect to the second execution waiting queue.
For example, in the method provided by at least one embodiment of the present disclosure, based on the determination, selecting one thread instruction distribution request from the multiple thread instruction distribution requests and responding to it includes: based on the determination, adding at least one of the multiple thread instruction distribution requests to a candidate request set; and, based on the priorities of the multiple threads, selecting one thread instruction distribution request from the candidate request set and responding to it.
For example, in the method provided by at least one embodiment of the present disclosure, based on the determination, adding at least one of the multiple thread instruction distribution requests to the candidate request set includes: in response to there being, among the multiple thread instruction distribution requests, a fourth thread instruction distribution request that is neither blocked nor in conflict with respect to the multiple execution waiting queues, adding the fourth thread instruction distribution request to the candidate request set.
For example, in the method provided by at least one embodiment of the present disclosure, based on the determination, adding at least one of the multiple thread instruction distribution requests to the candidate request set includes: in response to the fourth thread instruction distribution request not existing among the multiple thread instruction distribution requests, and there being, among the multiple thread instruction distribution requests, a fifth thread instruction distribution request that is in conflict with respect to the multiple execution waiting queues, adding the fifth thread instruction distribution request to the candidate request set.
For example, in the method provided by at least one embodiment of the present disclosure, based on the priorities of the multiple threads, selecting one thread instruction distribution request from the candidate request set includes: determining the current priorities of the multiple threads by using a least recently used algorithm, and selecting the thread instruction distribution request with the highest priority from the candidate request set.
For example, in the method provided by at least one embodiment of the present disclosure, determining the current priorities of the multiple threads by using the least recently used (LRU) algorithm includes: initializing the priorities of the multiple threads; and, in response to a first thread among the multiple threads having been selected in the previous clock cycle, setting the priority of the first thread in the current clock cycle to the lowest and incrementing the priorities of the threads other than the first thread among the multiple threads.
For example, in the method provided by at least one embodiment of the present disclosure, based on the priorities of the multiple threads, selecting one thread instruction distribution request from the candidate request set includes: determining the current priorities of the multiple threads by using a round-robin algorithm, and selecting the thread instruction distribution request with the highest priority from the candidate request set.
For example, in the method provided by at least one embodiment of the present disclosure, the multiple threads include at least 3 threads.
At least one embodiment of the present disclosure further provides an instruction distribution device. The instruction distribution device includes a receiving unit, a judgment unit, and a selection unit. The receiving unit is communicatively connected to a multi-threaded processor and configured to receive multiple thread instruction distribution requests respectively issued by multiple decoding instruction queues of the multi-threaded processor. Each of the multiple thread instruction distribution requests includes multiple instructions that need to be respectively sent to execution waiting queues of corresponding types. The judgment unit is communicatively connected to the multi-threaded processor and configured to determine whether the multiple thread instruction distribution requests are blocked or in conflict with respect to the multiple execution waiting queues of the multi-threaded processor. The selection unit is configured to, based on the determination, select one thread instruction distribution request from the multiple thread instruction distribution requests and respond to it.
For example, in the instruction distribution device provided by at least one embodiment of the present disclosure, the judgment unit includes a combination judgment subunit. The combination judgment subunit is configured to determine, based on the number of tokens currently available in each of the multiple execution waiting queues, whether the multiple thread instruction distribution requests are blocked or in conflict with respect to the multiple execution waiting queues.
For example, in the instruction distribution device provided by at least one embodiment of the present disclosure, the combination judgment subunit includes a blocking judgment unit. The blocking judgment unit is configured to, in response to the number of tokens of a first execution waiting queue among the multiple execution waiting queues required by a first thread instruction distribution request among the multiple thread instruction distribution requests being greater than the number of tokens currently available in the first execution waiting queue, determine that the first thread instruction distribution request is blocked with respect to the first execution waiting queue.
For example, in the instruction distribution device provided by at least one embodiment of the present disclosure, the multiple execution waiting queues include at least one shared execution waiting queue shared by multiple threads, and the combination judgment subunit includes a conflict judgment unit. The conflict judgment unit is configured to, in response to a second thread instruction distribution request among the multiple thread instruction distribution requests being blocked with respect to a second execution waiting queue in the at least one shared execution waiting queue, and the number of tokens of the second execution waiting queue required by a third thread instruction distribution request among the multiple thread instruction distribution requests being not greater than the number of tokens currently available in the second execution waiting queue, determine that the second thread instruction distribution request and the third thread instruction distribution request are in conflict with respect to the second execution waiting queue.
For example, in the instruction distribution device provided by at least one embodiment of the present disclosure, the selection unit includes a candidate selection unit and a priority selection unit. The candidate selection unit is configured to, based on the determination, add at least one of the multiple thread instruction distribution requests to a candidate request set. The priority selection unit is configured to, based on the priorities of the multiple threads, select one thread instruction distribution request from the candidate request set and respond to it.
At least one embodiment of the present disclosure further provides a data processing device, which includes the instruction distribution device provided by any one of the above embodiments, multiple decoding instruction queues, and multiple execution waiting queues.
At least one embodiment of the present disclosure further provides a processor, which includes the data processing device provided by any one of the above embodiments.
At least one embodiment of the present disclosure further provides an electronic device, which includes a processor and a memory; the memory includes one or more computer program modules. The one or more computer program modules are stored in the memory and configured to be executed by the processor, and the one or more computer program modules include instructions for performing the instruction distribution method provided by any one of the above embodiments.
At least one embodiment of the present disclosure further provides a non-transitory readable storage medium having computer instructions stored thereon. When executed by a processor, the computer instructions perform the instruction distribution method provided by any one of the above embodiments.
Brief Description of the Drawings
In order to explain the technical solutions of the embodiments of the present disclosure more clearly, the accompanying drawings of the embodiments are briefly introduced below. Obviously, the drawings described below relate only to some embodiments of the present disclosure and are not a limitation of the present disclosure.
FIG. 1 is a flowchart of an instruction distribution method for a multi-threaded processor provided by at least one embodiment of the present disclosure;
FIG. 2 is a schematic structural block diagram of an example multi-threaded processor provided by at least one embodiment of the present disclosure;
FIG. 3 is a schematic diagram of multi-thread instruction distribution arbitration provided by at least one embodiment of the present disclosure;
FIG. 4 is a schematic block diagram of a token blocking determination operation provided by at least one embodiment of the present disclosure;
FIG. 5 is a schematic diagram of adjusting the priorities of multiple threads according to a least recently used (LRU) algorithm provided by at least one embodiment of the present disclosure;
FIG. 6 is a schematic diagram of adjusting the priorities of multiple threads according to a round-robin algorithm provided by at least one embodiment of the present disclosure;
FIG. 7 is a schematic block diagram of an instruction distribution method provided by at least one embodiment of the present disclosure;
FIG. 8 is a schematic block diagram of an instruction distribution device provided by at least one embodiment of the present disclosure;
FIG. 9 is a schematic block diagram of a data processing device provided by at least one embodiment of the present disclosure;
FIG. 10 is a schematic block diagram of a processor provided by at least one embodiment of the present disclosure;
FIG. 11 is a schematic block diagram of an electronic device provided by at least one embodiment of the present disclosure;
FIG. 12 is a schematic block diagram of yet another electronic device provided by at least one embodiment of the present disclosure; and
FIG. 13 is a schematic block diagram of a non-transitory readable storage medium provided by at least one embodiment of the present disclosure.
Detailed Description
In order to make the objectives, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions of the embodiments of the present disclosure are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present disclosure. Based on the described embodiments of the present disclosure, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present disclosure.
Flowcharts are used in the present disclosure to illustrate the operations performed by the system according to the embodiments of the present application. It should be understood that the preceding or following operations are not necessarily performed precisely in order. Instead, the various steps may be processed in reverse order or simultaneously as needed. Meanwhile, other operations may be added to these processes, or one or more operations may be removed from these processes.
Unless otherwise defined, technical or scientific terms used in the present disclosure shall have the ordinary meanings understood by a person with ordinary skill in the art to which the present disclosure belongs. The words "first", "second", and the like used in the present disclosure do not denote any order, quantity, or importance, but are merely used to distinguish different components. Likewise, words such as "a", "an", or "the" do not denote a limitation of quantity, but rather denote the presence of at least one. Words such as "comprise" or "include" mean that the element or item preceding the word covers the elements or items listed after the word and their equivalents, without excluding other elements or items. Words such as "connected" or "coupled" are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "Up", "down", "left", "right", and the like are merely used to indicate relative positional relationships; when the absolute position of the described object changes, the relative positional relationship may change accordingly.
A simultaneous multi-threading (SMT) processor needs to arbitrate among multiple threads in the instruction distribution stage and select the instructions of one of the threads to be distributed to the back-end execution units. Generally, multi-thread arbitration needs to consider the following two aspects. The first aspect is efficiency. Multi-thread arbitration needs to improve the parallel capability of the multi-threaded processor as much as possible and bring out the advantages of the multi-threaded structure, thereby improving overall performance. For example, some threads require many resources, but the back-end execution units do not have enough resources. The arbitration should avoid selecting such threads as much as possible; otherwise, distribution will be blocked and overall performance will be reduced. The other aspect is fairness. Each of the multiple threads should have an equal chance of being selected; it should not happen that certain threads are always selected while other threads can never be selected.
The inventor has noticed that the SMT of current processors supports 2 threads, and the scheduling between these 2 threads is relatively simple, for example, time-slice-based scheduling is adopted. That is, one thread has priority within a certain period of time, and the other thread has priority within another period of time. Specifically, thread A and thread B correspond to time slice 1 and time slice 2, respectively. Initially thread A has priority, and time slice 1 is adjusted according to whether thread B took extra time last time: new time slice 1 = old time slice 1 + the extra time occupied by thread B last time. Within the new time slice 1, only thread A can be distributed, and the distribution of thread B is blocked until time slice 1 ends. Priority is then switched to thread B until thread B is distributed successfully. However, since the resources required by thread B may have been occupied by the preceding thread A and cannot be released for the time being, the time for distributing thread B may exceed time slice 2. This excess time is recorded and used to update time slice 1.
The above time-slice-based scheduling technique has the following two problems. The first problem is low efficiency. When thread A and thread B are not fully loaded, thread A or thread B does not have instructions to issue in every clock cycle. For example, in time slice 1, only instructions of thread A can be issued, but there may be many clock cycles during which thread A has no instructions to issue while thread B does. In these clock cycles, no instructions will be issued, resulting in a decrease in the overall performance of the processor. The second problem is that a "livelock" phenomenon may exist, for example, a certain thread can never be issued. Suppose thread A and thread B both need some common resources, and thread A needs more resources. Each time within time slice 1, thread A cannot be issued because the resources it needs are insufficient. When priority is switched to thread B, because thread B needs fewer resources, as soon as a little of the resource is released it is occupied by thread B again. As a result, the resources required by thread A are never sufficient, causing the "livelock" phenomenon.
At least in order to overcome the above technical problems, at least one embodiment of the present disclosure provides an instruction distribution method for a multi-threaded processor. The multi-threaded processor includes multiple decoding instruction queues and multiple execution waiting queues; the multiple decoding instruction queues are respectively used for multiple threads, and the multiple execution waiting queues are respectively used for multiple execution units of corresponding types; the multiple execution waiting queues include at least one shared execution waiting queue shared by the multiple threads and multiple independent execution waiting queues respectively used for the multiple threads. The instruction distribution method includes: receiving multiple thread instruction distribution requests respectively issued by the multiple decoding instruction queues, each of the multiple thread instruction distribution requests including multiple instructions that need to be respectively sent to execution waiting queues of corresponding types; determining whether the multiple thread instruction distribution requests are blocked or in conflict with respect to the multiple execution waiting queues; and, based on the determination, selecting one thread instruction distribution request from the multiple thread instruction distribution requests and responding to it.
Correspondingly, at least one embodiment of the present disclosure further provides a data processing device, a processor, an electronic device, and a non-transitory readable storage medium corresponding to the above instruction distribution method.
In the instruction distribution method provided by at least one embodiment of the present disclosure, whether the multiple thread instruction distribution requests respectively issued by the multiple decoding instruction queues are blocked or in conflict with respect to the multiple execution waiting queues can be determined, and one thread instruction distribution request can be selected based on the determination, so that the "livelock" phenomenon is avoided. In some embodiments, one thread instruction distribution request is selected from the multiple thread instruction distribution requests respectively issued by the multiple decoding instruction queues and responded to, so that the situation in which a decoding instruction queue that has no instructions to issue is nevertheless selected is avoided.
The instruction distribution method for a multi-threaded processor provided according to the present disclosure is described below in a non-limiting manner through multiple embodiments and examples thereof. As described below, different features in these specific examples or embodiments may be combined with each other without conflicting with each other, so as to obtain new examples or embodiments, which also fall within the protection scope of the present disclosure.
FIG. 1 is an example flowchart of an instruction distribution method for a multi-threaded processor provided by at least one embodiment of the present disclosure, and FIG. 2 is a schematic structural block diagram of an example multi-threaded processor provided by at least one embodiment of the present disclosure.
For example, as shown in FIG. 1, at least one embodiment of the present disclosure provides an instruction distribution method 10 for a multi-threaded processor. The multi-threaded processor includes multiple decoding instruction queues and multiple execution waiting queues. For example, in some examples, the multiple decoding instruction queues are respectively used for multiple threads, and the multiple execution waiting queues are respectively used for multiple execution units of corresponding types. For example, in some examples, the multiple execution waiting queues include at least one shared execution waiting queue shared by the multiple threads and multiple independent execution waiting queues respectively used for the multiple threads. As shown in FIG. 1, the instruction distribution method 10 includes the following steps S101 to S103.
Step S101: receiving multiple thread instruction distribution requests respectively issued by the multiple decoding instruction queues, each of the multiple thread instruction distribution requests including multiple instructions that need to be respectively sent to execution waiting queues of corresponding types.
Step S102: determining whether the multiple thread instruction distribution requests are blocked or in conflict with respect to the multiple execution waiting queues.
Step S103: based on the determination, selecting one thread instruction distribution request from the multiple thread instruction distribution requests and responding to it.
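For illustration only, the overall flow of steps S101 to S103 can be modeled in software. The following minimal Python sketch shows the three steps in simplified form; the function and variable names (arbitrate_cycle, available, etc.) are hypothetical and are not part of the claimed embodiments, and the conflict handling and priority update are elaborated in the sketches further below.

    # Hypothetical, simplified model of one arbitration cycle (steps S101-S103).
    # A request is represented by the token counts its instruction group needs,
    # e.g. {"token4": 4} means the group needs 4 tokens of the retirement queue.
    def arbitrate_cycle(requests, available, priority_order):
        # Step S101: only threads that actually issued a request take part.
        active = [t for t, need in requests.items() if need]
        # Step S102: a request is blocked if any queue lacks enough tokens.
        blocked = {t: any(n > available.get(q, 0) for q, n in requests[t].items())
                   for t in active}
        # Step S103: prefer non-blocked requests, then pick by priority.
        candidates = [t for t in active if not blocked[t]] or active
        for t in priority_order:
            if t in candidates:
                return t
        return None

    # Example: thread 1's request fits in the available tokens, thread 0's does not.
    reqs = {0: {"token4": 4}, 1: {"token4": 1}}
    assert arbitrate_cycle(reqs, {"token4": 2}, [0, 1, 2, 3]) == 1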
It should be noted that, in the embodiments of the present disclosure, the instruction distribution method 10 shown in FIG. 1 can be applied to a wide variety of multi-threaded processors, for example, multi-threaded processors including 2 threads, and multi-threaded processors including 4, 6, or 8 threads, etc.; the embodiments of the present disclosure are not limited in this respect, and this can be set according to actual requirements.
For example, in at least one embodiment of the present disclosure, the instruction distribution method 10 shown in FIG. 1 can be applied to the multi-threaded processor shown in FIG. 2. For example, as shown in FIG. 2, a 4-thread processor includes an instruction distribution module, which is used to receive instructions (e.g., micro-instructions) obtained by front-end decoding or previously buffered instructions (e.g., micro-instructions), these instructions/micro-instructions respectively corresponding to multiple threads, and to select the instructions (e.g., micro-instructions) of one thread to be distributed to the execution waiting queues in the back end. For example, in some embodiments, the multiple execution waiting queues may include the integer computation instruction queue, the address generation queue, the memory read queue, the memory write queue, the instruction retirement queue, the floating-point computation queue, etc. shown in FIG. 2; the embodiments of the present disclosure are not specifically limited in this respect, and this can be set according to actual requirements. Different instructions need to enter different execution waiting queues because of their different instruction types. For example, in some embodiments, integer computation instructions need to enter the integer computation instruction queue, memory read instructions need to enter the address generation queue and the memory read queue, and floating-point instructions need to enter the floating-point computation queue; the embodiments of the present disclosure are not specifically limited in this respect.
Since the space of each execution waiting queue is limited, in order to avoid overflow of the instructions in each execution waiting queue (for example, writing new instructions into an execution waiting queue that is already full), it is usually necessary to calculate the remaining space or the number of available resources of each execution waiting queue, which is also referred to herein as calculating the number of tokens corresponding to each execution waiting queue.
As shown in FIG. 2, the integer computation instruction queue, the address generation queue, and the memory read queue correspond to token 0, token 1, and token 2, respectively, and the memory write queue, the instruction retirement queue, and the floating-point computation queue correspond to token 3, token 4, and token 5, respectively. These execution waiting queues are respectively used for multiple execution units of corresponding types, such as execution units 0-3 and the floating-point execution unit shown in FIG. 2. The level-1 cache, the integer register file, the floating-point register file, and the like work in cooperation with the aforementioned execution units.
The multiple execution waiting queues include at least one shared execution waiting queue shared by the multiple threads and multiple independent execution waiting queues respectively used for the multiple threads. For example, in the embodiment shown in FIG. 2, the integer computation instruction queue, the address generation queue, and the memory read queue are shared by multiple threads, and therefore they may be referred to as shared execution waiting queues. The memory write queue, the instruction retirement queue, and the floating-point instruction queue are set individually for each of the multiple threads, and therefore they may be referred to as independent execution waiting queues. For example, in some embodiments, tokens are divided into two kinds: shared tokens and independent tokens. For example, in the example shown in FIG. 2, tokens 0, 1, and 2 are shared tokens, and tokens 3, 4, and 5 are independent tokens.
It should be noted that the 4-thread processor shown in FIG. 2 is merely an example. The multi-threaded processor in the embodiments of the present disclosure may include more or fewer threads, and may also include more or fewer components; the embodiments of the present disclosure are not limited in this respect.
FIG. 3 is a schematic diagram of multi-thread instruction distribution arbitration provided by at least one embodiment of the present disclosure.
For example, in at least one embodiment of the present disclosure, for step S101, multiple thread instruction distribution requests respectively issued by the multiple decoding instruction queues are received, and each of the multiple thread instruction distribution requests includes multiple instructions that need to be respectively sent to execution waiting queues of corresponding types. For example, in some embodiments, decoded or buffered instructions are stored in multiple decoding instruction queues, and each of the multiple threads corresponds to an independent decoding instruction queue. For example, in the embodiment shown in FIG. 3, four threads correspond to decoding instruction queue T0, decoding instruction queue T1, decoding instruction queue T2, and decoding instruction queue T3, respectively. A decoding instruction queue can output multiple instructions within one clock cycle, and different instructions need to enter execution waiting queues of corresponding types.
For example, in one example, if decoding instruction queue T1 has a group of instructions to be sent in the current cycle, decoding instruction queue T1 can send a thread instruction distribution request to the instruction distribution module. For example, in another example, if decoding instruction queue T2 has no instructions to be sent in the current cycle, decoding instruction queue T2 may send no thread instruction distribution request to the instruction distribution module. In this way, the instruction distribution module will not select a thread that has no instructions to be distributed, thereby avoiding waste of resources and improving the overall performance of the entire processor.
For example, in at least one embodiment of the present disclosure, for step S102, determining whether the multiple thread instruction distribution requests are blocked or in conflict with respect to the multiple execution waiting queues may include: determining, based on the number of tokens currently available in each of the multiple execution waiting queues, whether the multiple thread instruction distribution requests are blocked or in conflict with respect to the multiple execution waiting queues. In this way, by comparing the number of tokens required by a thread instruction distribution request with the number of tokens remaining in the execution waiting queues, whether the group of instructions of the request will cause blocking can be predicted, so that selecting a thread that will cause blocking is avoided as much as possible, in order to improve efficiency.
It should be noted that, in the embodiments of the present disclosure, the number of tokens currently available in each execution waiting queue represents the available/remaining space of each execution waiting queue in the current clock cycle.
For example, in the embodiment shown in FIG. 3, after the multiple thread instruction distribution requests respectively issued by the multiple decoding instruction queues T0/T1/T2/T3 are received, the token blocking determination and the shared token conflict determination need to be performed when performing multi-thread instruction distribution arbitration.
For example, in at least one embodiment of the present disclosure, the token blocking determination may include: in response to the number of tokens of a first execution waiting queue among the multiple execution waiting queues required by a first thread instruction distribution request among the multiple thread instruction distribution requests being greater than the number of tokens currently available in the first execution waiting queue, determining that the first thread instruction distribution request is blocked with respect to the first execution waiting queue.
It should be noted that, in the embodiments of the present disclosure, the first thread instruction distribution request is used to represent any one of the multiple thread instruction distribution requests, and the first execution waiting queue is used to represent any one of the multiple execution waiting queues. Neither the first thread instruction distribution request nor the first execution waiting queue is limited to a specific thread instruction distribution request or execution waiting queue, nor limited to a specific order, and they can be set according to actual requirements.
For example, in at least one embodiment of the present disclosure, for the token blocking determination, the group of instructions output by a decoding instruction queue within one clock cycle is regarded as a whole. If a certain thread (decoding instruction queue) is selected but the tokens it requires are insufficient, the group of instructions of that thread is blocked as a whole, and a part of the instructions in the group cannot be distributed separately.
For example, in at least one embodiment of the present disclosure, it is necessary to calculate, according to the types of the group of instructions output by a decoding instruction queue within one clock cycle, the total numbers of the various kinds of tokens required, and then compare them with the numbers of remaining/available tokens in the corresponding execution waiting queues. If the number of remaining/available tokens in a certain execution waiting queue is insufficient, then even if the thread is selected by arbitration, it can only be blocked and cannot be distributed.
FIG. 4 is a schematic block diagram of a token blocking determination operation provided by at least one embodiment of the present disclosure.
For example, in at least one embodiment of the present disclosure, as shown in FIG. 4, decoding instruction queues T0-T3 are respectively used for threads T0-T3. A group of instructions to be sent by decoding instruction queue T0 (for simplicity, also referred to as thread T0) includes 1 read instruction, 2 addition instructions, and 1 floating-point instruction, and the numbers of tokens of the execution waiting queues required are: 2 of token 0 (corresponding to the integer computation instruction queue), 1 of token 1 (corresponding to the address generation queue), 1 of token 2 (corresponding to the memory read queue), 4 of token 4 (corresponding to the instruction retirement queue), and 1 of token 5 (corresponding to the floating-point instruction queue). If the number of tokens currently available in any execution waiting queue is less than the number of tokens required by decoding instruction queue T0, it means that the group of instructions of decoding instruction queue T0 cannot all be put into the execution waiting queues, and it can be determined that the corresponding thread T0 is blocked.
For example, in one example, decoding instruction queue T0 requires 4 of token 4, but the number of tokens currently available in the instruction retirement queue is 2, that is, 2 of token 4 currently remain. In this case, the number of token 4 required by decoding instruction queue T0 is greater than the number of token 4 currently available in the instruction retirement queue, so decoding instruction queue T0 is blocked with respect to the instruction retirement queue; that is, the thread instruction distribution request sent by decoding instruction queue T0 is blocked with respect to the instruction retirement queue. For another example, decoding instruction queue T0 requires 1 of token 1, while the number of tokens currently available in the address generation queue is 2, that is, 2 of token 1 currently remain. In this case, the number of token 1 required by decoding instruction queue T0 is less than the number of token 1 currently available in the address generation queue, so decoding instruction queue T0 is not blocked with respect to the address generation queue; that is, the thread instruction distribution request sent by decoding instruction queue T0 is not blocked with respect to the address generation queue.
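For illustration only, the token blocking determination can be expressed as a per-queue comparison. The following is a minimal Python sketch under a hypothetical data model (dictionaries mapping token names to counts); the available counts other than those stated in the FIG. 4 example are assumed values, and the sketch is not the hardware implementation:

    def is_blocked(needed, available):
        # A request is blocked if, for any execution waiting queue, the tokens
        # its instruction group needs exceed the tokens currently available.
        return any(count > available.get(queue, 0)
                   for queue, count in needed.items())

    # The FIG. 4 example: thread T0 needs 4 of token 4 (instruction retirement
    # queue) but only 2 remain, so its request is blocked.
    t0_needed = {"token0": 2, "token1": 1, "token2": 1, "token4": 4, "token5": 1}
    available = {"token0": 3, "token1": 2, "token2": 1, "token4": 2, "token5": 2}
    assert is_blocked(t0_needed, available)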
For example, in at least one embodiment of the present disclosure, the shared token conflict determination includes: in response to a second thread instruction distribution request among the multiple thread instruction distribution requests being blocked with respect to a second execution waiting queue in the at least one shared execution waiting queue, and the number of tokens of the second execution waiting queue required by a third thread instruction distribution request among the multiple thread instruction distribution requests being not greater than the number of tokens currently available in the second execution waiting queue, determining that the second thread instruction distribution request is in conflict with respect to the second execution waiting queue.
It should be noted that, in the embodiments of the present disclosure, the second thread instruction distribution request is used to represent any one of the multiple thread instruction distribution requests, and the third thread instruction distribution request is used to represent any one of the multiple thread instruction distribution requests that is different from the second thread instruction distribution request. The second execution waiting queue is used to represent any one of the at least one shared execution waiting queue. The second thread instruction distribution request and the third thread instruction distribution request are not limited to specific thread instruction distribution requests, nor limited to a specific order, and can be set according to actual requirements. The second execution waiting queue is not limited to a specific execution waiting queue, nor limited to a specific order, and can be set according to actual requirements.
For example, in at least one embodiment of the present disclosure, referring to FIG. 2 and FIG. 4, the integer computation instruction queue, the address generation queue, and the memory read queue are shared execution waiting queues shared by multiple threads; that is, token 0, token 1, and token 2 are shared tokens. For example, in one example, the group of instructions included in the thread instruction distribution request issued by decoding instruction queue T0 is as shown in FIG. 4, and decoding instruction queue T0 currently requires 1 of shared token 2. In the case where another decoding instruction queue T1 requires 4 of shared token 2 and 1 token currently remains/is available in the memory read instruction queue, decoding instruction queue T1 is blocked with respect to the memory read instruction queue while decoding instruction queue T0 is not. Therefore, it is determined that decoding instruction queue T1 is in conflict with respect to the memory read instruction queue, and the other decoding instruction queue requesting this shared execution waiting queue (i.e., the memory read instruction queue), namely decoding instruction queue T0, is also determined to be in conflict.
Suppose that each subsequent group of instructions output by decoding instruction queue T0 in each clock cycle contains only 1 memory read instruction, and the memory read instruction queue releases 1 token in each clock cycle. If decoding instruction queue T0 (thread T0) is selected every time based only on the blocking situation, the 4 of token 2 required by decoding instruction queue T1 (thread T1) will never be satisfied. This situation is called a "livelock".
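For illustration only, the shared token conflict determination can be sketched as follows (hypothetical Python model continuing the data structures above; shared_queues would be, for example, the set of shared tokens {"token0", "token1", "token2"} of FIG. 2):

    def find_conflicts(requests, blocked, available, shared_queues):
        # Returns the set of thread ids that are in conflict over at least one
        # shared execution waiting queue.
        conflicted = set()
        for q in shared_queues:
            # Threads blocked specifically because queue q lacks tokens.
            losers = [t for t, need in requests.items()
                      if blocked[t] and need.get(q, 0) > available.get(q, 0)]
            # Threads whose demand on q could currently be satisfied.
            winners = [t for t, need in requests.items()
                       if 0 < need.get(q, 0) <= available.get(q, 0)]
            if losers and winners:
                # Both sides of the contention are marked as conflicting, as in
                # the example where T0 and T1 contend for shared token 2.
                conflicted.update(losers)
                conflicted.update(winners)
        return conflicted

    # T1 needs 4 of shared token 2 but only 1 is available (blocked), while
    # T0 needs 1 (satisfiable): both are determined to be in conflict.
    reqs = {0: {"token2": 1}, 1: {"token2": 4}}
    assert find_conflicts(reqs, {0: False, 1: True}, {"token2": 1}, {"token2"}) == {0, 1}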
For example, in at least one embodiment of the present disclosure, for step S103, based on the above determination, selecting one thread instruction distribution request from the multiple thread instruction distribution requests and responding to it includes: based on the determination, adding at least one of the multiple thread instruction distribution requests to a candidate request set; and, based on the priorities of the multiple threads, selecting one thread instruction distribution request from the candidate request set and responding to it.
For example, in at least one embodiment of the present disclosure, in response to there being, among the multiple thread instruction distribution requests, a fourth thread instruction distribution request that is neither blocked nor in conflict with respect to the multiple execution waiting queues, the fourth thread instruction distribution request is added to the candidate request set.
For example, in some embodiments, in the case where there is a non-blocked and non-conflicting thread among the multiple threads, the thread instruction distribution request issued by the non-blocked and non-conflicting thread can be directly put into the candidate request set, and blocked or conflicting threads will not be selected. In this way, waste of processor performance is avoided and processing efficiency is improved.
For example, in at least one embodiment of the present disclosure, in response to the fourth thread instruction distribution request not existing among the multiple thread instruction distribution requests, and there being, among the multiple thread instruction distribution requests, a fifth thread instruction distribution request that is in conflict with respect to the multiple execution waiting queues, the fifth thread instruction distribution request is added to the candidate request set.
For example, in some embodiments, in the case where there is no non-blocked and non-conflicting thread among the multiple threads, if there are conflicting threads, the thread instruction distribution requests issued by the conflicting threads can be put into the candidate request set. In this way, a thread blocked on a shared token will not run into livelock because other threads keep occupying that shared token. That is, in the case where there is no non-blocked and non-conflicting thread, conflicting threads also have the chance of being selected, so that livelock is avoided.
For example, in one example, thread T0 and thread T1 are in conflict over a certain shared token, thread T2 is blocked on a certain independent token, and thread T3 is neither in conflict nor blocked. In this case, only the thread instruction distribution request issued by thread T3 may be added to the candidate request set. For another example, thread T0 and thread T1 are in conflict over a certain shared token, and thread T2 and thread T3 are blocked. In this case, where there is no non-blocked and non-conflicting thread, the thread instruction distribution requests issued by the conflicting threads T0 and T1 may be added to the candidate request set for selection.
It should be noted that, in the embodiments of the present disclosure, the fourth thread instruction distribution request is used to represent any thread instruction distribution request that is neither blocked nor in conflict, and the fifth thread instruction distribution request is used to represent any thread instruction distribution request that is in conflict. Neither the fourth thread instruction distribution request nor the fifth thread instruction distribution request is limited to a specific thread instruction distribution request, nor limited to a specific order, and they can be set according to actual requirements.
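For illustration only, the two rules for building the candidate request set can be combined into a short routine (hypothetical Python sketch, continuing the data model used above):

    def build_candidates(requests, blocked, conflicted):
        # Rule 1: prefer requests that are neither blocked nor in conflict.
        free = [t for t in requests if not blocked[t] and t not in conflicted]
        if free:
            return free
        # Rule 2: otherwise the conflicting requests become candidates, so a
        # thread starved on a shared token still gets a chance (no livelock).
        return [t for t in requests if t in conflicted]

    # The second example above: T0 and T1 conflict over a shared token
    # (T1 blocked on it), T2 and T3 are blocked, so T0 and T1 are candidates.
    blocked = {0: False, 1: True, 2: True, 3: True}
    assert build_candidates([0, 1, 2, 3], blocked, {0, 1}) == [0, 1]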
For example, in at least one embodiment of the present disclosure, after the candidate request set is obtained based on the above determination, one thread can be selected from the candidate request set according to the priority of each thread.
For example, in at least one embodiment of the present disclosure, the current priorities of the multiple threads are determined according to the least recently used (LRU) algorithm, and the thread instruction distribution request with the highest priority is selected from the candidate request set. The priority of each thread is adjusted through the LRU algorithm: if thread T1 has not issued instructions recently while all the other threads have records of issuing instructions, the priority of thread T1 is adjusted to the highest.
For example, in at least one embodiment of the present disclosure, determining the current priorities of the multiple threads by using the LRU algorithm may include: initializing the priorities of the multiple threads; and, in response to a first thread among the multiple threads having been selected in the previous clock cycle, setting the priority of the first thread in the current clock cycle to the lowest and incrementing the priorities of the threads other than the first thread among the multiple threads.
It should be noted that, in the embodiments of the present disclosure, the first thread is used to represent any one of the multiple threads, and is not limited to a specific thread, nor limited to a specific order, and can be set according to actual requirements.
FIG. 5 is a schematic diagram of adjusting the priorities of multiple threads by using the least recently used (LRU) algorithm provided by at least one embodiment of the present disclosure.
For example, in at least one embodiment of the present disclosure, as shown in FIG. 5, for multiple threads T0, T1, T2, and T3, the priority of each thread is first initialized. For example, the initialized priority order (from high to low) of the multiple threads is T0, T1, T2, T3. In response to thread T1 being selected in the first clock cycle, in the next clock cycle (the second clock cycle) the priority of thread T1 is adjusted to the lowest, and the priorities of the other threads (threads T0, T2, and T3) are incremented; that is, in the second clock cycle, the priority order (from high to low) of the multiple threads is T0, T2, T3, T1. In response to thread T0 being selected in the second clock cycle, in the next clock cycle (the third clock cycle) the priority of thread T0 is adjusted to the lowest, and the priorities of the other threads (threads T1, T2, and T3) are incremented; that is, in the third clock cycle, the priority order (from high to low) of the multiple threads is T2, T3, T1, T0.
For example, in at least one embodiment of the present disclosure, the LRU algorithm can be implemented by using a queue including multiple entries, with each entry in the queue storing a thread number. For example, the thread at the head of the queue has the highest priority, and the thread at the tail of the queue has the lowest priority. Each time, the thread number selected by arbitration is deleted from the queue and re-inserted at the tail of the queue, indicating that this thread has been selected and its priority is subsequently adjusted to the lowest.
By adjusting the priority of each thread using the least recently used (LRU) algorithm and setting the least recently used thread to the highest priority, the fairness of multi-thread distribution can be guaranteed.
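For illustration only, the queue-based LRU scheme described above can be sketched in a few lines of Python (the class below is hypothetical; a hardware implementation would typically use a small register queue of thread numbers):

    from collections import deque

    class LRUPriority:
        def __init__(self, num_threads):
            # Head of the queue = highest priority, tail = lowest priority.
            self.order = deque(range(num_threads))  # initially T0, T1, T2, T3

        def select(self, candidates):
            # Pick the highest-priority candidate, then move it to the tail so
            # that the thread just selected becomes the lowest priority.
            for t in list(self.order):
                if t in candidates:
                    self.order.remove(t)
                    self.order.append(t)
                    return t
            return None

    # Matches FIG. 5: after T1 is selected, the order becomes T0, T2, T3, T1;
    # after T0 is then selected, the order becomes T2, T3, T1, T0.
    lru = LRUPriority(4)
    assert lru.select({1}) == 1 and list(lru.order) == [0, 2, 3, 1]
    assert lru.select({0, 2, 3}) == 0 and list(lru.order) == [2, 3, 1, 0]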
For example, in at least one embodiment of the present disclosure, the current priorities of the multiple threads can be determined according to a round-robin algorithm.
FIG. 6 is a schematic diagram of adjusting the priorities of multiple threads by using a round-robin algorithm provided by at least one embodiment of the present disclosure.
For example, in at least one embodiment of the present disclosure, as shown in FIG. 6, for multiple threads T0, T1, T2, and T3, the priority of each thread is first initialized. For example, the initialized priority order (from high to low) of the multiple threads is T0, T1, T2, T3. In response to thread T1 being selected in the first clock cycle, in the next clock cycle (the second clock cycle) the priority of thread T1 is adjusted to the lowest, the priority of the thread following thread T1 (i.e., thread T2) is set to the highest, and the other threads T0 and T3 are ordered in a circular manner. That is, in the second clock cycle, the priority order (from high to low) of the multiple threads is T2, T3, T0, T1. In response to thread T0 being selected in the second clock cycle, in the next clock cycle (the third clock cycle) the priority of thread T0 is adjusted to the lowest, the priority of the thread following thread T0 (i.e., thread T1) is set to the highest, and the other threads T2 and T3 are ordered in a circular manner. That is, in the third clock cycle, the priority order (from high to low) of the multiple threads is T1, T2, T3, T0.
By adjusting the priority of each thread using the round-robin algorithm, the algorithm is simple and the fairness of multi-thread distribution is also guaranteed.
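For illustration only, a round-robin variant only needs to remember the thread selected last; the priority order then starts from the next thread number in circular fashion (hypothetical Python sketch):

    class RoundRobinPriority:
        def __init__(self, num_threads):
            self.n = num_threads
            self.last = self.n - 1  # so that T0 initially has the highest priority

        def priority_order(self):
            # Highest priority is the thread following the one selected last;
            # e.g. after T1 is selected the order is T2, T3, T0, T1 (FIG. 6).
            start = (self.last + 1) % self.n
            return [(start + i) % self.n for i in range(self.n)]

        def select(self, candidates):
            for t in self.priority_order():
                if t in candidates:
                    self.last = t
                    return t
            return None

    rr = RoundRobinPriority(4)
    assert rr.priority_order() == [0, 1, 2, 3]   # initial order
    assert rr.select({1, 3}) == 1                # T1 selected in cycle 1
    assert rr.priority_order() == [2, 3, 0, 1]   # order in cycle 2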
It should be noted that, in the embodiments of the present disclosure, in addition to the LRU algorithm and the round-robin algorithm, other priority setting algorithms can be used; the embodiments of the present disclosure are not limited in this respect, and this can be set according to actual requirements.
FIG. 7 is a schematic block diagram of an instruction distribution method provided by at least one embodiment of the present disclosure.
For example, in at least one embodiment of the present disclosure, as shown in FIG. 7, multi-thread selection is performed according to the request signal and the priority signal issued by each thread (decoding instruction queue), and finally the thread whose request signal is valid and whose priority is the highest among the one or more threads with valid request signals is selected. For example, for a 4-thread processor, as described above, each thread is checked as to whether it is non-blocked and non-conflicting to obtain a corresponding determination signal, and the respective determination signals of the 4 threads are ORed (denoted as >=1 in FIG. 7) to obtain a signal indicating whether there is a non-blocked and non-conflicting thread among the 4 threads. That signal is inverted and ANDed (denoted as & in FIG. 7) with the token conflict signal of each thread, the result is then ORed with the aforementioned determination signal of each thread, and then ANDed with the has-instruction signal to obtain the final request signal. Next, in combination with the priority signals of the multiple threads (for example, according to the aforementioned LRU algorithm or round-robin algorithm, etc.), one thread is selected for response.
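For illustration only, the combinational logic of FIG. 7 as described above can be paraphrased in boolean form (hypothetical Python sketch; the signal names are invented for readability and do not appear in the figure):

    def request_signals(has_insn, blocked, conflict, threads):
        # Per-thread determination signal: non-blocked and non-conflicting.
        ok = {t: (not blocked[t]) and (not conflict[t]) for t in threads}
        # OR over all threads (the ">=1" gate): does any thread qualify directly?
        any_ok = any(ok.values())
        # Final request: (own determination signal) OR (inverted any_ok ANDed
        # with own conflict signal), ANDed with the has-instruction signal.
        return {t: (ok[t] or ((not any_ok) and conflict[t])) and has_insn[t]
                for t in threads}

    # T0 and T1 conflict over a shared token, T2 and T3 are blocked: only the
    # conflicting threads raise valid requests, and priority then picks one.
    req = request_signals(has_insn={t: True for t in range(4)},
                          blocked={0: False, 1: True, 2: True, 3: True},
                          conflict={0: True, 1: True, 2: False, 3: False},
                          threads=range(4))
    assert req == {0: True, 1: True, 2: False, 3: False}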
Therefore, the instruction distribution method 10 for a multi-threaded processor provided by at least one embodiment of the present disclosure can make multi-thread instruction distribution more efficient and fairer, and can also avoid the "livelock" phenomenon, thereby improving the overall performance of the multi-threaded processor.
For example, in at least one embodiment of the present disclosure, the multiple threads include at least 3 threads. It should be noted that, although a 4-thread processor is taken as an example herein to describe the instruction distribution method 10 provided by the embodiments of the present disclosure, the instruction distribution method 10 is not only applicable to 4-thread processors, but is also applicable to 2-thread processors, 3-thread processors, 5-thread processors, etc.; the embodiments of the present disclosure are not limited in this respect.
It should also be noted that, in the various embodiments of the present disclosure, the execution order of the steps of the instruction distribution method 10 is not limited. Although the execution process of the steps is described above in a specific order, this does not constitute a limitation on the embodiments of the present disclosure. The steps in the instruction distribution method 10 may be executed serially or in parallel, which can be determined according to actual requirements. For example, the instruction distribution method 10 may also include more or fewer steps; the embodiments of the present disclosure are not limited in this respect.
FIG. 8 is a schematic block diagram of an instruction distribution device provided by at least one embodiment of the present disclosure.
For example, as shown in FIG. 8, at least one embodiment of the present disclosure provides an instruction distribution device 80. The instruction distribution device 80 is communicatively connected to multiple decoding instruction queues 802 and multiple execution waiting queues 803 of a multi-threaded processor, respectively. For example, the multiple decoding instruction queues 802 are respectively used for multiple threads, and the multiple execution waiting queues 803 are respectively used for multiple execution units of corresponding types. For example, in some examples, the multiple execution waiting queues 803 include at least one shared execution waiting queue shared by the multiple threads and multiple independent execution waiting queues respectively used for the multiple threads. The instruction distribution device 80 includes a receiving unit 811, a judgment unit 812, and a selection unit 813.
The receiving unit 811 is communicatively connected to the multi-threaded processor and configured to receive multiple thread instruction distribution requests respectively issued by the multiple decoding instruction queues of the multi-threaded processor, each of the multiple thread instruction distribution requests including multiple instructions that need to be respectively sent to execution waiting queues of corresponding types. For example, the receiving unit 811 can implement step S101; for its specific implementation, reference may be made to the relevant description of step S101, which will not be repeated here.
The judgment unit 812 is communicatively connected to the multi-threaded processor and configured to determine whether the multiple thread instruction distribution requests are blocked or in conflict with respect to the multiple execution waiting queues of the multi-threaded processor. For example, the judgment unit 812 can implement step S102; for its specific implementation, reference may be made to the relevant description of step S102, which will not be repeated here.
The selection unit 813 is configured to, based on the above determination, select one thread instruction distribution request from the multiple thread instruction distribution requests and respond to it. For example, the selection unit 813 can implement step S103; for its specific implementation, reference may be made to the relevant description of step S103, which will not be repeated here.
It should be noted that the above receiving unit 811, judgment unit 812, and selection unit 813 can be implemented by software, hardware, firmware, or any combination thereof; for example, they can be respectively implemented as a receiving circuit 811, a judgment circuit 812, and a selection circuit 813. The embodiments of the present disclosure do not limit their specific implementations.
For example, in at least one embodiment of the present disclosure, the judgment unit 812 may include a combination judgment subunit. For example, the combination judgment subunit may be configured to determine, based on the number of tokens currently available in each of the multiple execution waiting queues, whether the multiple thread instruction distribution requests are blocked or in conflict with respect to the multiple execution waiting queues. For the operations that can be implemented by the combination judgment subunit, reference may be made to the relevant description of the aforementioned instruction distribution method 10, which will not be repeated here.
For example, in at least one embodiment of the present disclosure, the combination judgment subunit may include a blocking judgment unit and a conflict judgment unit. For example, in some embodiments, the blocking judgment unit is configured to, in response to the number of tokens of a first execution waiting queue among the multiple execution waiting queues required by a first thread instruction distribution request among the multiple thread instruction distribution requests being greater than the number of tokens currently available in the first execution waiting queue, determine that the first thread instruction distribution request is blocked with respect to the first execution waiting queue. For the operations that can be implemented by the blocking judgment unit, reference may be made to the relevant description of the aforementioned instruction distribution method 10, which will not be repeated here.
For example, in some embodiments, the conflict judgment unit is configured to, in response to a second thread instruction distribution request among the multiple thread instruction distribution requests being blocked with respect to a second execution waiting queue in the at least one shared execution waiting queue, and the number of tokens of the second execution waiting queue required by a third thread instruction distribution request among the multiple thread instruction distribution requests being not greater than the number of tokens currently available in the second execution waiting queue, determine that the second thread instruction distribution request and the third thread instruction distribution request are in conflict with respect to the second execution waiting queue. For the operations that can be implemented by the conflict judgment unit, reference may be made to the relevant description of the aforementioned instruction distribution method 10, which will not be repeated here.
For example, in at least one embodiment of the present disclosure, the selection unit 813 may include a candidate selection unit and a priority selection unit. For example, in some embodiments, the candidate selection unit is configured to, based on the above determination, add at least one of the multiple thread instruction distribution requests to a candidate request set, and the priority selection unit is configured to, based on the priorities of the multiple threads, select one thread instruction distribution request from the candidate request set and respond to it. For the operations that can be implemented by the candidate selection unit and the priority selection unit, reference may be made to the relevant description of the aforementioned instruction distribution method 10, which will not be repeated here.
For example, in at least one embodiment of the present disclosure, the candidate selection unit may include a direct selection unit and a conflict selection unit. For example, in some embodiments, the direct selection unit is configured to, in response to there being, among the multiple thread instruction distribution requests, a fourth thread instruction distribution request that is neither blocked nor in conflict with respect to the multiple execution waiting queues, add the fourth thread instruction distribution request to the candidate request set. For example, in some embodiments, the conflict selection unit is configured to, in response to the fourth thread instruction distribution request not existing among the multiple thread instruction distribution requests, and there being, among the multiple thread instruction distribution requests, a fifth thread instruction distribution request that is in conflict with respect to the multiple execution waiting queues, add the fifth thread instruction distribution request to the candidate request set. For the operations that can be implemented by the direct selection unit and the conflict selection unit, reference may be made to the relevant description of the aforementioned instruction distribution method 10, which will not be repeated here.
For example, in at least one embodiment of the present disclosure, the priority selection unit may include a setting unit and a distribution unit. For example, in some embodiments, the setting unit is configured to determine the current priorities of the multiple threads according to the least recently used (LRU) algorithm, and the distribution unit is configured to select the thread instruction distribution request with the highest priority from the candidate request set. For the operations that can be implemented by the setting unit and the distribution unit, reference may be made to the relevant description of the aforementioned instruction distribution method 10, which will not be repeated here.
For example, in at least one embodiment of the present disclosure, the setting unit may include an initialization unit and an adjustment unit. For example, in some embodiments, the initialization unit is configured to initialize the priorities of the multiple threads, and the adjustment unit is configured to, in response to a first thread among the multiple threads having been selected in the previous clock cycle, set the priority of the first thread in the current clock cycle to the lowest and increment the priorities of the threads other than the first thread among the multiple threads. For the operations that can be implemented by the initialization unit and the adjustment unit, reference may be made to the relevant description of the aforementioned instruction distribution method 10, which will not be repeated here.
For example, in at least one embodiment of the present disclosure, the priority selection unit may include a setting subunit. The setting subunit is configured to determine the current priorities of the multiple threads according to a round-robin algorithm. For the operations that can be implemented by the setting subunit, reference may be made to the relevant description of the aforementioned instruction distribution method 10, which will not be repeated here.
For example, in at least one embodiment of the present disclosure, the multiple threads include at least 3 threads.
It should be noted that the above combination judgment subunit, blocking judgment unit, conflict judgment unit, candidate selection unit, priority selection unit, direct selection unit, conflict selection unit, setting unit, distribution unit, initialization unit, adjustment unit, and setting subunit can be implemented by software, hardware, firmware, or any combination thereof; for example, they can be respectively implemented as a combination judgment subcircuit, a blocking judgment circuit, a conflict judgment circuit, a candidate selection circuit, a priority selection circuit, a direct selection circuit, a conflict selection circuit, a setting circuit, a distribution circuit, an initialization circuit, an adjustment circuit, and a setting subcircuit. The embodiments of the present disclosure do not limit their specific implementations.
It should be understood that the instruction distribution device 80 provided by at least one embodiment of the present disclosure can implement the aforementioned instruction distribution method 10 for a multi-threaded processor, and can also achieve technical effects similar to those of the aforementioned instruction distribution method 10. For example, through the instruction distribution device 80 provided by at least one embodiment of the present disclosure, the efficiency and fairness of multi-thread instruction distribution can be effectively improved, and the livelock phenomenon can be avoided.
It should be noted that, in the embodiments of the present disclosure, the instruction distribution device 80 may include more or fewer circuits or units, and the connection relationships between the circuits or units are not limited and can be determined according to actual requirements. The specific construction manner of each circuit is not limited: each circuit may be composed of analog devices according to circuit principles, of digital chips, or in other suitable manners.
FIG. 9 is a schematic block diagram of a data processing device 70 provided by at least one embodiment of the present disclosure.
For example, at least one embodiment of the present disclosure further provides a data processing device 70. As shown in FIG. 9, the data processing device 70 includes an instruction distribution device 701, multiple decoding instruction queues 702, and multiple execution waiting queues 703. For example, in some examples, the data processing device 70 includes the instruction distribution device 801, the multiple decoding instruction queues 802, and the multiple execution waiting queues 803 shown in FIG. 8.
It should be noted that, in the embodiments of the present disclosure, the data processing device 70 may include more or fewer circuits or units, and the connection relationships between the circuits or units are not limited and can be determined according to actual requirements. The specific construction manner of each circuit is not limited: each circuit may be composed of analog devices according to circuit principles, of digital chips, or in other suitable manners.
It should be understood that the data processing device 70 provided by at least one embodiment of the present disclosure can implement the aforementioned instruction distribution method 10 for a multi-threaded processor, and can also achieve technical effects similar to those of the aforementioned instruction distribution method 10. For example, through the data processing device 70 provided by at least one embodiment of the present disclosure, the efficiency and fairness of multi-thread instruction distribution can be effectively improved, and the livelock phenomenon can be avoided, thereby improving the overall performance of the multi-threaded processor.
FIG. 10 is a schematic block diagram of a processor provided by at least one embodiment of the present disclosure.
For example, at least one embodiment of the present disclosure further provides a processor 90. As shown in FIG. 10, the processor 90 includes the data processing device 70 described in any one of the above embodiments. For example, in some embodiments, the processor 90 may be a multi-threaded processor, for example, a 4-thread processor. It should be understood that the processor 90 provided by at least one embodiment of the present disclosure can implement the aforementioned instruction distribution method 10 for a multi-threaded processor, and can also achieve technical effects similar to those of the aforementioned instruction distribution method 10. For example, through the processor 90 provided by at least one embodiment of the present disclosure, the efficiency and fairness of multi-thread instruction distribution can be effectively improved, and the livelock phenomenon can be avoided, thereby improving the overall performance of the multi-threaded processor.
FIG. 11 is a schematic block diagram of an electronic device provided by at least one embodiment of the present disclosure.
At least one embodiment of the present disclosure further provides an electronic device 20. As shown in FIG. 11, the electronic device 20 includes a processor 210 and a memory 220. The memory 220 includes one or more computer program modules 221. The one or more computer program modules 221 are stored in the memory 220 and configured to be executed by the processor 210, and the one or more computer program modules 221 include instructions for performing the instruction distribution method 10 provided by at least one embodiment of the present disclosure; when executed by the processor 210, the instructions can perform one or more steps of the instruction distribution method 10 provided by at least one embodiment of the present disclosure. The memory 220 and the processor 210 may be interconnected by a bus system and/or a connection mechanism of another form (not shown).
For example, the processor 210 may be a central processing unit (CPU), a digital signal processor (DSP), or a processing unit of another form having data processing capability and/or program execution capability, such as a field programmable gate array (FPGA). For example, the central processing unit (CPU) may be of the X86 or ARM architecture. The processor 210 may be a general-purpose processor or a special-purpose processor, and may control other components in the electronic device 20 to perform desired functions.
For example, the memory 220 may include any combination of one or more computer program products, and the computer program products may include computer-readable storage media of various forms, for example, volatile memory and/or non-volatile memory. Volatile memory may include, for example, random access memory (RAM) and/or cache memory. Non-volatile memory may include, for example, read-only memory (ROM), a hard disk, erasable programmable read-only memory (EPROM), portable compact disk read-only memory (CD-ROM), USB memory, flash memory, etc. One or more computer program modules 221 may be stored on the computer-readable storage media, and the processor 210 may run the one or more computer program modules 221 to implement various functions of the electronic device 20. Various applications and various data, as well as various data used and/or generated by the applications, may also be stored on the computer-readable storage media. For the specific functions and technical effects of the electronic device 20, reference may be made to the description of the instruction distribution method 10 above, which will not be repeated here.
FIG. 12 is a schematic block diagram of yet another electronic device provided by at least one embodiment of the present disclosure.
The electronic device 300 shown in FIG. 12 is merely an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure. For example, as shown in FIG. 12, in some examples, the electronic device 300 includes a processing device (such as a central processing unit, a graphics processor, etc.) 301, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 302 or a program loaded from a storage device 308 into a random access memory (RAM) 303. The RAM 303 also stores various programs and data required for the operation of the computer system. The processing device 301, the ROM 302, and the RAM 303 are connected to each other via a bus 304. An input/output (I/O) interface 305 is also connected to the bus 304.
For example, the following components may be connected to the I/O interface 305: an input device 306 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; an output device 307 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; a storage device 308 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 309 including a network interface card such as a LAN card, a modem, etc. The communication device 309 may allow the electronic device 300 to communicate wirelessly or by wire with other devices to exchange data, performing communication processing via a network such as the Internet. A drive 310 is also connected to the I/O interface 305 as needed. A removable medium 311, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc., is installed on the drive 310 as needed, so that a computer program read therefrom is installed into the storage device 308 as needed. Although FIG. 12 shows the electronic device 300 including various devices, it should be understood that it is not required to implement or include all of the devices shown; more or fewer devices may alternatively be implemented or included.
For example, the electronic device 300 may further include a peripheral interface (not shown in the figure) and the like. The peripheral interface may be an interface of various types, such as a USB interface, a lightning interface, etc. The communication device 309 may communicate with a network and other devices through wireless communication, the network being, for example, the Internet, an intranet, and/or a wireless network such as a cellular telephone network, a wireless local area network (LAN), and/or a metropolitan area network (MAN). Wireless communication may use any of a variety of communication standards, protocols, and technologies, including but not limited to Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), Wideband Code Division Multiple Access (W-CDMA), Code Division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Bluetooth, Wi-Fi (for example, based on the IEEE 802.11a, IEEE 802.11b, IEEE 802.11g, and/or IEEE 802.11n standards), Voice over Internet Protocol (VoIP), Wi-MAX, protocols for e-mail, instant messaging, and/or Short Message Service (SMS), or any other suitable communication protocol.
For example, the electronic device 300 may be any device such as a mobile phone, a tablet computer, a notebook computer, an e-book reader, a game console, a television, a digital photo frame, or a navigator, or may be any combination of a data processing device and hardware; the embodiments of the present disclosure are not limited in this respect.
For example, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, the embodiments of the present disclosure include a computer program product, which includes a computer program carried on a non-transitory computer-readable medium, and the computer program contains program code for performing the method shown in the flowchart. In such embodiments, the computer program may be downloaded and installed from a network via the communication device 309, or installed from the storage device 308, or installed from the ROM 302. When the computer program is executed by the processing device 301, the instruction distribution method 10 disclosed in the embodiments of the present disclosure is performed.
It should be noted that the computer-readable medium described above in the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the embodiments of the present disclosure, the computer-readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in combination with an instruction execution system, apparatus, or device. In the embodiments of the present disclosure, the computer-readable signal medium may include a data signal propagated in baseband or as a part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. The computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium; the computer-readable signal medium can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. The program code contained in the computer-readable medium may be transmitted by any appropriate medium, including but not limited to an electric wire, an optical cable, RF (radio frequency), etc., or any suitable combination of the above.
The above computer-readable medium may be included in the above electronic device 300, or may exist independently without being assembled into the electronic device 300.
FIG. 13 is a schematic block diagram of a non-transitory readable storage medium provided by at least one embodiment of the present disclosure.
The embodiments of the present disclosure further provide a non-transitory readable storage medium. As shown in FIG. 13, computer instructions 111 are stored on the non-transitory readable storage medium 100, and when executed by a processor, the computer instructions 111 perform one or more steps of the instruction distribution method 10 described above.
For example, the non-transitory readable storage medium 100 may be any combination of one or more computer-readable storage media. For example, one computer-readable storage medium contains computer-readable program code for receiving multiple thread instruction distribution requests respectively issued by multiple decoding instruction queues, and another computer-readable storage medium contains computer-readable program code for determining whether the multiple thread instruction distribution requests are blocked or in conflict with respect to multiple execution waiting queues. Yet another computer-readable storage medium contains computer-readable program code for, based on the above determination, selecting one thread instruction distribution request from the multiple thread instruction distribution requests and responding to it. Of course, the above program codes may also be stored in the same computer-readable medium; the embodiments of the present disclosure are not limited in this respect.
For example, when the program code is read by a computer, the computer can execute the program code stored in the computer storage medium and perform, for example, the instruction distribution method 10 provided by any embodiment of the present disclosure.
For example, the storage medium may include a memory card of a smartphone, a storage component of a tablet computer, a hard disk of a personal computer, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), portable compact disk read-only memory (CD-ROM), flash memory, or any combination of the above storage media, or may be another suitable storage medium. For example, the readable storage medium may also be the memory 220 in FIG. 11; for the related description, reference may be made to the foregoing content, which will not be repeated here.
In the present disclosure, the term "plurality" refers to two or more, unless expressly limited otherwise.
Other embodiments of the present disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the disclosure disclosed herein. The present disclosure is intended to cover any variations, uses, or adaptive changes of the present disclosure, which follow the general principles of the present disclosure and include common knowledge or customary technical means in the technical field that are not disclosed in the present disclosure. The specification and embodiments are to be regarded as exemplary only, and the true scope and spirit of the present disclosure are indicated by the following claims.
It should be understood that the present disclosure is not limited to the precise structure that has been described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.

Claims (20)

  1. An instruction distribution method for a multi-threaded processor, comprising:
    receiving multiple thread instruction distribution requests respectively issued by multiple decoding instruction queues of the multi-threaded processor, wherein each of the multiple thread instruction distribution requests comprises multiple instructions that need to be respectively sent to execution waiting queues of corresponding types;
    determining whether the multiple thread instruction distribution requests are blocked or in conflict with respect to multiple execution waiting queues of the multi-threaded processor; and
    based on the determination, selecting one thread instruction distribution request from the multiple thread instruction distribution requests and responding to it.
  2. The method according to claim 1, wherein determining whether the multiple thread instruction distribution requests are blocked or in conflict with respect to the multiple execution waiting queues comprises:
    determining, based on the number of tokens currently available in each of the multiple execution waiting queues, whether the multiple thread instruction distribution requests are blocked or in conflict with respect to the multiple execution waiting queues.
  3. The method according to claim 2, wherein determining, based on the number of tokens currently available in each of the multiple execution waiting queues, whether the multiple thread instruction distribution requests are blocked with respect to the multiple execution waiting queues comprises:
    in response to the number of tokens of a first execution waiting queue among the multiple execution waiting queues required by a first thread instruction distribution request among the multiple thread instruction distribution requests being greater than the number of tokens currently available in the first execution waiting queue, determining that the first thread instruction distribution request is blocked with respect to the first execution waiting queue.
  4. The method according to claim 2, wherein the multiple execution waiting queues comprise at least one shared execution waiting queue shared by the multiple threads,
    wherein determining, based on the number of tokens currently available in each of the multiple execution waiting queues, whether the multiple thread instruction distribution requests are in conflict with respect to the multiple execution waiting queues comprises:
    in response to a second thread instruction distribution request among the multiple thread instruction distribution requests being blocked with respect to a second execution waiting queue in the at least one shared execution waiting queue, and the number of tokens of the second execution waiting queue required by a third thread instruction distribution request among the multiple thread instruction distribution requests being not greater than the number of tokens currently available in the second execution waiting queue, determining that the second thread instruction distribution request and the third thread instruction distribution request are in conflict with respect to the second execution waiting queue.
  5. The method according to any one of claims 1-4, wherein, based on the determination, selecting one thread instruction distribution request from the multiple thread instruction distribution requests and responding to it comprises:
    based on the determination, adding at least one of the multiple thread instruction distribution requests to a candidate request set; and
    based on the priorities of the multiple threads, selecting one thread instruction distribution request from the candidate request set and responding to it.
  6. The method according to claim 5, wherein, based on the determination, adding at least one of the multiple thread instruction distribution requests to the candidate request set comprises:
    in response to there being, among the multiple thread instruction distribution requests, a fourth thread instruction distribution request that is neither blocked nor in conflict with respect to the multiple execution waiting queues, adding the fourth thread instruction distribution request to the candidate request set.
  7. The method according to claim 6, wherein, based on the determination, adding at least one of the multiple thread instruction distribution requests to the candidate request set comprises:
    in response to the fourth thread instruction distribution request not existing among the multiple thread instruction distribution requests, and there being, among the multiple thread instruction distribution requests, a fifth thread instruction distribution request that is in conflict with respect to the multiple execution waiting queues, adding the fifth thread instruction distribution request to the candidate request set.
  8. The method according to any one of claims 5-7, wherein, based on the priorities of the multiple threads, selecting one thread instruction distribution request from the candidate request set comprises:
    determining the current priorities of the multiple threads by using a least recently used algorithm, and
    selecting the thread instruction distribution request with the highest priority from the candidate request set.
  9. The method according to claim 8, wherein determining the current priorities of the multiple threads by using the least recently used algorithm comprises:
    initializing the priorities of the multiple threads; and
    in response to a first thread among the multiple threads having been selected in the previous clock cycle, setting the priority of the first thread in the current clock cycle to the lowest and incrementing the priorities of the threads other than the first thread among the multiple threads.
  10. The method according to any one of claims 5-7, wherein, based on the priorities of the multiple threads, selecting one thread instruction distribution request from the candidate request set comprises:
    determining the current priorities of the multiple threads by using a round-robin algorithm.
  11. The method according to any one of claims 1-10, wherein the multiple threads comprise at least 3 threads.
  12. An instruction distribution device, comprising:
    a receiving unit, communicatively connected to a multi-threaded processor and configured to receive multiple thread instruction distribution requests respectively issued by multiple decoding instruction queues of the multi-threaded processor, wherein each of the multiple thread instruction distribution requests comprises multiple instructions that need to be respectively sent to execution waiting queues of corresponding types;
    a judgment unit, communicatively connected to the multi-threaded processor and configured to determine whether the multiple thread instruction distribution requests are blocked or in conflict with respect to multiple execution waiting queues of the multi-threaded processor; and
    a selection unit, configured to, based on the determination, select one thread instruction distribution request from the multiple thread instruction distribution requests and respond to it.
  13. The instruction distribution device according to claim 12, wherein the judgment unit comprises a combination judgment subunit, and
    the combination judgment subunit is configured to determine, based on the number of tokens currently available in each of the multiple execution waiting queues, whether the multiple thread instruction distribution requests are blocked or in conflict with respect to the multiple execution waiting queues.
  14. The instruction distribution device according to claim 13, wherein the combination judgment subunit comprises a blocking judgment unit, and
    the blocking judgment unit is configured to, in response to the number of tokens of a first execution waiting queue among the multiple execution waiting queues required by a first thread instruction distribution request among the multiple thread instruction distribution requests being greater than the number of tokens currently available in the first execution waiting queue, determine that the first thread instruction distribution request is blocked with respect to the first execution waiting queue.
  15. The instruction distribution device according to claim 13, wherein the multiple execution waiting queues comprise at least one shared execution waiting queue shared by the multiple threads, the combination judgment subunit comprises a conflict judgment unit, and
    the conflict judgment unit is configured to, in response to a second thread instruction distribution request among the multiple thread instruction distribution requests being blocked with respect to a second execution waiting queue in the at least one shared execution waiting queue, and the number of tokens of the second execution waiting queue required by a third thread instruction distribution request among the multiple thread instruction distribution requests being not greater than the number of tokens currently available in the second execution waiting queue, determine that the second thread instruction distribution request and the third thread instruction distribution request are in conflict with respect to the second execution waiting queue.
  16. The instruction distribution device according to any one of claims 12-15, wherein the selection unit comprises a candidate selection unit and a priority selection unit,
    the candidate selection unit is configured to, based on the determination, add at least one of the multiple thread instruction distribution requests to a candidate request set; and
    the priority selection unit is configured to, based on the priorities of the multiple threads, select one thread instruction distribution request from the candidate request set and respond to it.
  17. A data processing device, comprising the instruction distribution device according to any one of claims 12-16, multiple decoding instruction queues, and multiple execution waiting queues.
  18. A processor, comprising the data processing device according to claim 17.
  19. An electronic device, comprising:
    a processor; and
    a memory, comprising one or more computer program modules,
    wherein the one or more computer program modules are stored in the memory and configured to be executed by the processor, and the one or more computer program modules comprise instructions for performing the instruction distribution method according to any one of claims 1-11.
  20. A non-transitory readable storage medium having computer instructions stored thereon,
    wherein, when executed by a processor, the computer instructions perform the instruction distribution method according to any one of claims 1-11.
PCT/CN2023/114840 2022-08-26 2023-08-25 Instruction distribution method and apparatus for multi-threaded processor, and storage medium WO2024041625A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211033483.6A CN115408153B (zh) 2022-08-26 2022-08-26 Instruction distribution method and apparatus for multi-threaded processor, and storage medium
CN202211033483.6 2022-08-26

Publications (1)

Publication Number Publication Date
WO2024041625A1 (zh)

Family

ID=84161714

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/114840 2022-08-26 2023-08-25 Instruction distribution method and apparatus for multi-threaded processor, and storage medium

Country Status (2)

Country Link
CN (1) CN115408153B (zh)
WO (1) WO2024041625A1 (zh)

Families Citing this family (1)

Publication number Priority date Publication date Assignee Title
CN115408153B (zh) 2022-08-26 2023-06-30 海光信息技术股份有限公司 Instruction distribution method and apparatus for multi-threaded processor, and storage medium

Citations (4)

Publication number Priority date Publication date Assignee Title
US20190108032A1 (en) * 2017-10-06 2019-04-11 International Business Machines Corporation Load-store unit with partitioned reorder queues with single cam port
CN110647357A (zh) * 2018-06-27 2020-01-03 展讯通信(上海)有限公司 Simultaneous multi-threading processor
CN112789593A (zh) * 2018-12-24 2021-05-11 华为技术有限公司 Multi-threading-based instruction processing method and apparatus
CN115408153A (zh) * 2022-08-26 2022-11-29 海光信息技术股份有限公司 Instruction distribution method and apparatus for multi-threaded processor, and storage medium

Family Cites Families (9)

Publication number Priority date Publication date Assignee Title
KR100498482B1 (ko) * 2003-01-27 2005-07-01 삼성전자주식회사 Simultaneous multi-threading processor for fetching threads by using the number of execution cycles as a weight for the instruction count, and method thereof
US20040226011A1 (en) * 2003-05-08 2004-11-11 International Business Machines Corporation Multi-threaded microprocessor with queue flushing
US7748001B2 (en) * 2004-09-23 2010-06-29 Intel Corporation Multi-thread processing system for detecting and handling live-lock conditions by arbitrating livelock priority of logical processors based on a predetermined amount of time
CN109766201A (zh) * 2019-01-04 2019-05-17 中国联合网络通信集团有限公司 Task distribution method, server, client, and system
CN111552574A (zh) * 2019-09-25 2020-08-18 华为技术有限公司 Multi-thread synchronization method and electronic device
WO2021217300A1 (zh) * 2020-04-26 2021-11-04 深圳市大疆创新科技有限公司 Management apparatus and method for parallel execution units, and electronic device
CN112395093A (zh) * 2020-12-04 2021-02-23 龙芯中科(合肥)技术有限公司 Multi-thread data processing method and apparatus, electronic device, and readable storage medium
CN112672440A (zh) * 2020-12-18 2021-04-16 中兴通讯股份有限公司 Instruction execution method and system, network device, and storage medium
CN114143265A (zh) * 2021-11-26 2022-03-04 杭州安恒信息技术股份有限公司 Network traffic rate limiting method, apparatus, device, and storage medium


Also Published As

Publication number Publication date
CN115408153B (zh) 2023-06-30
CN115408153A (zh) 2022-11-29


Legal Events

ENP - Entry into the national phase: Ref document number: 2023856712; Country of ref document: EP; Effective date: 20240312
121 - Ep: the epo has been informed by wipo that ep was designated in this application: Ref document number: 23856712; Country of ref document: EP; Kind code of ref document: A1