WO2018171319A1 - Processor and instruction scheduling method - Google Patents

Processor and instruction scheduling method

Info

Publication number
WO2018171319A1
Authority
WO
WIPO (PCT)
Prior art keywords
instruction
decoding
jth
processing unit
thread
Prior art date
Application number
PCT/CN2018/073200
Other languages
English (en)
French (fr)
Inventor
京昭倫
高也
稲守真理
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Priority to EP18770207.1A priority Critical patent/EP3591518B1/en
Publication of WO2018171319A1 publication Critical patent/WO2018171319A1/zh
Priority to US16/577,092 priority patent/US11256543B2/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3836Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3851Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution from multiple instruction streams, e.g. multistreaming
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3867Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30076Arrangements for executing specific machine instructions to perform miscellaneous control operations, e.g. NOP
    • G06F9/3009Thread control instructions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • G06F9/30101Special purpose registers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30145Instruction analysis, e.g. decoding, instruction word fields
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30181Instruction operation extension or modification
    • G06F9/30189Instruction operation extension or modification according to execution mode, e.g. mode flag
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3818Decoding for concurrent execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3867Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines
    • G06F9/3869Implementation aspects, e.g. pipeline latches; pipeline synchronisation and clocking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • G06F9/3893Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled in tandem, e.g. multiplier-accumulator
    • G06F9/3895Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled in tandem, e.g. multiplier-accumulator for complex operations, e.g. multidimensional or interleaved address generators, macros
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/32Address formation of the next instruction, e.g. by incrementing the instruction counter
    • G06F9/322Address formation of the next instruction, e.g. by incrementing the instruction counter for non-sequential address
    • G06F9/325Address formation of the next instruction, e.g. by incrementing the instruction counter for non-sequential address for loops, e.g. loop detection or loop counter

Definitions

  • the present application relates to the field of computer technologies, and in particular, to a processor and an instruction scheduling method.
  • IMT Interleaved Multi-Threading
  • IMT technology is an instruction scheduling mechanism that utilizes Thread Level Parallelism (TLP).
  • TLP Thread Level Parallelism
  • FIG. 1 shows a schematic diagram of an instruction scheduling process of a processor supporting 4-way IMT. That is, the maximum number of threads X supported by the processor is 4.
  • the processor includes an instruction decoder and a data path; the instruction decoder is configured to decode the instruction to obtain a decoding result and send the decoding result to the data path, and the data path is configured to execute the instruction according to the decoding result.
  • PC0, PC1, PC2, and PC3 respectively represent Program Counters (PCs) of four independent threads.
  • the instruction decoder schedules the instructions of the threads in the following order: in the first time period, the decoding unit acquires and decodes the first instruction of the thread corresponding to PC0; in the second time period, the decoding unit acquires and decodes the first instruction of the thread corresponding to PC1; in the third time period, the decoding unit acquires and decodes the first instruction of the thread corresponding to PC2; in the fourth time period, the decoding unit acquires and decodes the first instruction of the thread corresponding to PC3; then, in the fifth time period, the decoding unit returns to PC0, acquires the second instruction of the thread corresponding to PC0 and decodes it, and the cycle repeats.
  • S-IMT Static IMT
  • when the number of threads Z actually executed by the processor is smaller than the maximum number of threads X it supports, the instruction decoder still cyclically schedules instructions among the X threads in the above fixed order. Referring to FIG. 2, still taking a processor supporting 4-way IMT as an example, when the number of threads Z actually executed by the processor is 2, the instruction scheduling order of the instruction decoder remains as described above, cycling through the four threads corresponding to PC0, PC1, PC2, and PC3. Since PC2 and PC3 have no corresponding threads executing, in each round of the cycle there are always two time periods in which the instruction decoder performs no decoding operation and sends no decoding result to the data path, which in turn causes the data path to have idle periods in which no instruction is executed.
  • the embodiments of the present application provide a processor and an instruction scheduling method, to solve the problem that an existing processor using S-IMT technology cannot fully utilize the data path when the number of threads it actually executes is smaller than the maximum number of threads it supports, so that the utilization efficiency of the data path decreases and its performance is not fully exploited.
  • a processor that supports X-way IMT, the processor including a decoding unit and a processing unit, X being an integer greater than 1; the decoding unit is configured to, in each cycle, acquire one instruction from each of Z predefined threads, decode the acquired Z instructions to obtain Z decoding results, and send the Z decoding results to the processing unit, where each cycle includes X transmission periods and one decoding result is sent in each transmission period;
  • the processing unit is configured to execute the instruction according to the decoding result.
  • when the number of threads actually executed by the processor is smaller than the maximum number of threads it supports, the decoding unit repeatedly sends decoding results to the processing unit in multiple transmission periods, so that the decoding unit sends one decoding result to the processing unit in every transmission period; the processing unit therefore has no idle period in which no instruction is executed and is fully utilized, which ensures high utilization efficiency of the processing unit and allows its performance to be fully exploited.
  • the technical solution provided by the embodiments of the present application can always ensure that the instruction processing interval between two adjacent instructions of each thread is the same, and that the instruction processing intervals corresponding to different threads are also the same, thereby avoiding the increase in the complexity of the processor logic circuit that non-uniform instruction processing intervals would cause, keeping the processor logic circuit simple and facilitating an increase in the processor's main frequency.
  • the processor also includes a Z register; the Z register is used to store the value of Z.
  • a processing unit includes a data path and X groups of data registers, each group of data registers including at least one data register; for the decoding result sent by the decoding unit in the i-th transmission period of each cycle, the data path reads an operand from the data register corresponding to the address code in the i-th group of data registers according to the address code in the decoding result, 1≤i≤X, i being an integer, and performs an operation on the operand according to the operation code in the decoding result.
  • the number of processing units is one or more.
  • the processor provided by the embodiment of the present application further improves processor performance by configuring a plurality of processing units to execute instructions in parallel.
  • the main frequency of the decoding unit is lower than the main frequency of the processing unit.
  • the processor provided by the embodiment of the present application reduces the power consumption of the processor by configuring the primary frequency of the decoding unit to be lower than the primary frequency of the processing unit.
  • an embodiment of the present application provides an instruction scheduling method, applied to a decoding unit of a processor, where the processor supports X-way interleaved multi-threaded IMT and X is an integer greater than 1; the method includes: the decoding unit acquires, in each cycle, one instruction from each of Z predefined threads, decodes the acquired Z instructions to obtain Z decoding results, and sends the Z decoding results to the processing unit.
  • the decoding unit includes an instruction decoder and X program counters.
  • in the i-th transmission period of the n-th cycle, an instruction is acquired from the j-th thread of the Z predefined threads according to the value of the j-th program counter.
  • when the number of threads actually executed by the processor is smaller than the maximum number of threads it supports, the decoding unit repeatedly sends decoding results to the processing unit in multiple transmission periods, so that the decoding unit sends one decoding result to the processing unit in every transmission period; the processing unit therefore has no idle period in which no instruction is executed and is fully utilized, which ensures high utilization efficiency of the processing unit and allows its performance to be fully exploited.
  • FIG. 1 is a schematic diagram of an instruction scheduling process of a processor using S-IMT technology according to the prior art
  • FIG. 2 is a schematic diagram showing an instruction scheduling process of another processor using the S-IMT technology according to the prior art
  • FIG. 3A is a schematic structural diagram of a processor provided by an embodiment of the present application.
  • FIG. 3B is a schematic diagram of an instruction scheduling process of a processor involved in the embodiment of FIG. 3A;
  • FIG. 4 is a schematic structural diagram of a processor according to another embodiment of the present application.
  • Figure 7 is a flow chart showing an instruction scheduling process when 1≤Z<X;
  • FIG. 8 is a schematic diagram showing an instruction scheduling process when 1≤Z<X;
  • FIG. 9 is a flow chart showing another instruction scheduling process when 1≤Z<X;
  • FIG. 10 is a schematic diagram showing another instruction scheduling process when 1≤Z<X;
  • Figure 12 shows a schematic diagram of scheduling and executing instructions in a hybrid mode
  • FIG. 13A is a schematic structural diagram of a processor according to another embodiment of the present disclosure.
  • FIG. 13B is a schematic diagram of a processor in the embodiment of FIG. 13A adopting a TLP mode scheduling instruction
  • FIG. 13C is a schematic diagram of a processor in the embodiment of FIG. 13A adopting a DLP mode scheduling instruction
  • FIG. 13D is a schematic diagram of a processor in the embodiment of FIG. 13A adopting a mixed mode scheduling instruction
  • 14A is a schematic diagram showing an instruction scheduling process in another hybrid mode
  • 14B is a schematic diagram showing another instruction scheduling process in the TLP mode
  • FIG. 15A is a schematic diagram showing an instruction scheduling and execution process in another hybrid mode
  • 15B is a schematic diagram showing an instruction scheduling and execution process in another DLP mode
  • Figure 15C shows a schematic diagram of another instruction scheduling and execution process in TLP mode.
  • FIG. 3A is a schematic structural diagram of a processor provided by an embodiment of the present application.
  • Processor 30 supports X-way IMT, where X is an integer greater than one.
  • the processor 30 may include a decoding unit 31 and a processing element (PE) 32.
  • PE processing element
  • the maximum number of threads supported by processor 30 is X, and the number of threads actually executed by processor 30 is Z. Since Z≤X, the number of threads actually executed by processor 30 is equal to or less than the maximum number of threads it supports.
  • the number of threads Z actually executed by the processor 30 can be predefined, for example, the value of Z is predefined according to the specific conditions of the program to be executed.
  • Each cycle includes X transmission cycles.
  • in each transmission period, the decoding unit 31 transmits one decoding result to the processing unit 32.
  • a decoding result among the Z decoding results may be repeatedly transmitted by the decoding unit 31 in multiple transmission periods, so that when 1≤Z<X, at least one of the Z decoding results is repeatedly transmitted by the decoding unit 31 in multiple transmission periods.
  • in the first cycle, the decoding unit 31 acquires the first instruction of the thread corresponding to PC0 (denoted as instruction 1) and the first instruction of the thread corresponding to PC1 (denoted as instruction 2), and decodes instruction 1 and instruction 2 to obtain decoding result 1 and decoding result 2.
  • one cycle includes four transmission periods, and the decoding unit 31 transmits one decoding result to the processing unit 32 in each transmission period, so that at least one of decoding result 1 and decoding result 2 is repeatedly transmitted by the decoding unit 31 in the above four transmission periods.
  • transmission in the first to fourth transmission periods: decoding result 1, decoding result 1, decoding result 2, decoding result 2 (as shown in FIG. 3B);
  • transmission in the first to fourth transmission periods: decoding result 1, decoding result 1, decoding result 1, decoding result 2;
  • transmission in the first to fourth transmission periods: decoding result 1, decoding result 1, decoding result 2, decoding result 1.
  • in each subsequent cycle, the decoding unit 31 acquires the next unexecuted instruction from each of the threads corresponding to PC0 and PC1, decodes them to obtain decoding results, and sends the decoding results in four transmission periods.
  • the processing unit 32 is configured to execute an instruction according to the decoding result.
  • the processing unit 32 is configured to execute the instruction by using a pipeline technology according to the decoding result, so that multiple instruction parallel processing can be implemented, and the efficiency of executing the instruction is improved.
  • when the number of threads actually executed by the processor is less than the maximum number of threads supported by the processor, the decoding unit repeatedly sends decoding results to the processing unit in multiple transmission periods, so that the decoding unit sends one decoding result to the processing unit in each transmission period; the processing unit therefore has no idle period in which no instruction is executed and is fully utilized, ensuring high utilization efficiency of the processing unit so that its performance can be fully exploited.
  • the IMT technology used in the embodiments of the present application described herein may be referred to as SS-IMT (Static SIMD IMT) technology.
  • FIG. 4 is a schematic structural diagram of a processor provided by another embodiment of the present application.
  • Processor 30 supports X-way IMT, where X is an integer greater than one.
  • the processor 30 may include a decoding unit 31 and a processing unit 32.
  • the decoding unit 31 is mainly used to decode an instruction.
  • the decoding unit 31 may include an instruction decoder 311 and X program counters 312.
  • the instruction decoder 311 is configured to decode the instruction to obtain a decoding result.
  • the instruction includes an operation code and an address code.
  • the opcode is used to indicate the operational characteristics and functions of the instruction
  • the address code is used to indicate the address of the operand participating in the operation.
  • Each program counter 312 corresponds to a thread, and the program counter 312 is used to store an instruction address, which is the storage address of the next instruction to be executed.
  • the processing unit 32 is mainly configured to execute an instruction according to the decoding result; wherein the decoding result includes an operation code and an address code of the instruction.
  • Processing unit 32 may include a data path 321 and X groups of data registers 322.
  • the data path 321 is configured to acquire an operand according to the address code of the instruction, and perform a corresponding operation on the operand according to the operation code of the instruction.
  • Data register 322 is used to store operands. Each set of data registers 322 includes at least one data register.
  • the processor 30 further includes a Z register 33.
  • the Z register 33 is used to store the value of Z, that is, the number of threads Z that the processor 30 actually executes.
  • the value of Z is predefined in the Z register 33, for example, the value of Z is predefined in the Z register 33 depending on the specifics of the program to be executed.
  • the value range of Z is an integer greater than or equal to 1 and less than or equal to X.
  • Step 51: in the i-th transmission period of the n-th cycle, acquire an instruction from the j-th thread of the Z predefined threads according to the value of the j-th program counter; wherein the initial values of n, i, and j are 1, and Z is equal to X;
  • Step 52: update the value of the j-th program counter, and decode the acquired instruction to obtain the decoding result corresponding to the instruction in the j-th thread;
  • Step 53: send the decoding result corresponding to the instruction in the j-th thread to the processing unit;
  • the instruction scheduling mode of the instruction decoder 311 can be referred to as a TLP mode.
  • the instruction decoder 311 schedules the instructions in accordance with the instruction scheduling described above with respect to the embodiment of Fig. 3A. Next, two possible instruction scheduling sequences of the instruction decoder 311 will be described.
  • the instruction decoder 311 is configured to perform the following steps:
  • Step 72: update the value of the j-th program counter, and decode the acquired instruction to obtain the decoding result corresponding to the instruction in the j-th thread;
  • Step 73: send the decoding result corresponding to the instruction in the j-th thread to the processing unit;
  • if no (that is, i≤X), the following step 75 is performed;
  • if not (that is, k≤X/Z), execution resumes from step 73 above;
  • the instruction scheduling sequence of the instruction decoder 311 is as shown in FIG. 8.
  • the instruction decoder 311 is configured to perform the following steps:
  • Step 91: in the i-th transmission period of the n-th cycle, acquire an instruction from the j-th thread of the Z predefined threads according to the value of the j-th program counter; wherein the initial values of n, i, and j are 1, and X is an integer multiple of Z;
  • Step 92: update the value of the j-th program counter, and decode the acquired instruction to obtain the decoding result corresponding to the instruction in the j-th thread;
  • Step 93: send the decoding result corresponding to the instruction in the j-th thread to the processing unit;
  • since the instruction decoder 311 sends the same decoding result to the processing unit 32 in consecutive transmission periods, no additional buffer needs to be configured to store the decoding results, which helps reduce the hardware cost of the processor 30.
  • the instruction decoder 311 is configured to: acquire an instruction from a predefined one thread in each cycle, and decode the obtained instruction to obtain a decoding result.
  • the decoded result is repeatedly transmitted to the processing unit 32 in the X transmission periods included in the cycle.
  • the instruction decoder 311 transmits a decoding result to the processing unit 32 every transmission cycle.
  • the instruction scheduling sequence of the instruction decoder 311 is as shown in FIG. 11.
  • the instruction scheduling mode of the instruction decoder 311 can be referred to as a DLP mode.
  • the instruction scheduling mode of the instruction decoder 311 can be referred to as a mixed mode, a DTLP mode, or a TDLP mode.
  • the instruction decoder 311 acquires the value of Z from the Z register 33, determines the instruction scheduling mode according to the magnitude relationship of Z and X, and then schedules the instruction according to the determined instruction scheduling mode.
  • the instruction scheduling mode includes the first two or all three of the TLP mode, the DLP mode, and the mixed mode. That is, in one case, the processor 30 supports two instruction scheduling modes, the TLP mode and the DLP mode; in another case, the processor 30 supports three instruction scheduling modes: the TLP mode, the DLP mode, and the mixed mode.
  • processor 30 also includes a mode register 34.
  • the mode register 34 is used to indicate to the decoding unit 31 an instruction scheduling mode including the first two or all three of the TLP mode, the DLP mode, and the mixed mode.
  • the mode register 34 obtains the value of Z from the Z register 33, determines the instruction scheduling mode according to the magnitude relationship of Z and X, and sets its own value according to the determined instruction scheduling mode.
  • the instruction decoder 311 acquires the value of the mode register 34, and determines the instruction scheduling mode based on the value of the mode register 34, and then schedules the instruction according to the determined instruction scheduling mode.
  • the processor 30 supports a total of two instruction scheduling modes, a TLP mode and a DLP mode.
  • the TLP mode is represented by a first value
  • the DLP mode is represented by a second value, the first value being different from the second value.
  • the first value is 0 and the second value is 1, or the first value is 1 and the second value is 0.
  • the mode register 34 occupies 1 bit (bit).
  • the processor 30 supports a total of three instruction scheduling modes, a TLP mode, a DLP mode, and a hybrid mode.
  • the TLP mode is represented by a first value
  • the DLP mode is represented by a second value
  • the mixed mode is represented by a third value
  • the first value, the second value, and the third value are all different.
  • the first value is 0, the second value is 1 and the third value is 2.
  • the mode register 34 occupies 2 bits.
  • the data path 321 of the processing unit 32 executes the instructions as follows.
  • the data path 321 is configured to: for the decoding result sent by the decoding unit 31 in the i-th transmission period of each cycle, read an operand from the data register corresponding to the address code in the i-th group of data registers according to the address code in the decoding result, 1≤i≤X, i being an integer, and perform an operation on the read operand according to the operation code in the decoding result.
  • the processor 30 schedules the instructions in the hybrid mode by using the first possible implementation manner described above, and the corresponding scheduling and execution process is as shown in FIG. It is assumed that PC0 and PC1 respectively represent the program counters of the above two threads. In each cycle, the instruction scheduling and execution process is as follows:
  • in the first transmission period, the instruction decoder 311 acquires an instruction from the first thread according to the value of PC0, decodes the instruction to obtain a decoding result (denoted as decoding result 1), and sends decoding result 1 to the data path 321; the data path 321 reads an operand from the data register corresponding to the address code in the first group of data registers according to the address code in decoding result 1, and performs an operation on the read operand according to the operation code in decoding result 1;
  • in the second, third, and fourth transmission periods, the instruction decoder 311 sends decoding result 1 to the data path 321 again; in the i-th of these transmission periods, the data path 321 reads an operand from the data register corresponding to the address code in the i-th group of data registers according to the address code in decoding result 1, and performs an operation on the read operand according to the operation code in decoding result 1;
  • in the fifth transmission period, the instruction decoder 311 acquires an instruction from the second thread according to the value of PC1, decodes the instruction to obtain a decoding result (denoted as decoding result 2), and sends decoding result 2 to the data path 321; the data path 321 reads an operand from the data register corresponding to the address code in the fifth group of data registers according to the address code in decoding result 2, and performs an operation on the read operand according to the operation code in decoding result 2;
  • in the sixth, seventh, and eighth transmission periods, the instruction decoder 311 sends decoding result 2 to the data path 321 again; in the i-th of these transmission periods, the data path 321 reads an operand from the data register corresponding to the address code in the i-th group of data registers according to the address code in decoding result 2, and performs an operation on the read operand according to the operation code in decoding result 2.
  • assume the operation code in decoding result 1 indicates an addition operation and its address code indicates the third data register, and the operation code in decoding result 2 indicates a subtraction operation and its address code indicates the second data register; then:
  • after receiving decoding result 1 sent by the instruction decoder 311 in the k-th transmission period (k = 1, 2, 3, 4), the data path 321 reads the operands (assumed to be xk and yk) from the third data register in the k-th group of data registers and performs an addition operation on them, that is, calculates the sum of xk and yk to obtain zk;
  • after receiving decoding result 2 sent by the instruction decoder 311 in the k-th transmission period (k = 5, 6, 7, 8), the data path 321 reads the operands (assumed to be xk and yk) from the second data register in the k-th group of data registers and performs a subtraction operation on them, that is, calculates the difference between xk and yk to obtain zk.
  • when the number of threads actually executed by the processor is smaller than the maximum number of threads supported by the processor, the decoding unit may repeatedly send decoding results to the processing unit in multiple transmission periods, so that the decoding unit sends one decoding result to the processing unit in each transmission period; the processing unit therefore has no idle period in which no instruction is executed and is fully utilized, ensuring high utilization efficiency of the processing unit so that its performance can be fully exploited.
  • in addition, the technical solution provided by the embodiments of the present application can always ensure that the instruction processing interval between two adjacent instructions of each thread is the same, and that the instruction processing intervals corresponding to different threads are also the same, thereby avoiding the increase in the complexity of the processor logic circuit that non-uniform instruction processing intervals would cause, keeping the processor logic circuit simple and facilitating an increase in the processor's main frequency.
  • the number of processing units 32 is one or more.
  • the decoding unit 31 transmits the decoding result of the instruction to the plurality of processing units 32, and the plurality of processing units 32 execute the instructions in parallel to further improve the processor performance.
  • processor 30 includes a decoding unit 31 and four processing units 32.
  • the decoding unit 31 includes an instruction decoder 311 and X program counters 312.
  • Each processing unit 32 includes a data path 321 and an X group data register 322.
  • Each set of data registers 322 includes at least one data register. The number of data registers included in the different sets of data registers 322 may be the same or different.
  • the instruction scheduling process of the processor 30 in the TLP mode is as shown in FIG. 13B
  • the instruction scheduling process in the DLP mode is as shown in FIG. 13C
  • the instruction scheduling process in the mixed mode is as shown in FIG. 13D.
  • the processor provided by the embodiment of the present application further improves processor performance by configuring a plurality of processing units to execute instructions in parallel.
  • the primary frequency of the decoding unit 31 is allowed to be lower than the primary frequency of the processing unit 32, that is, the primary frequency of the instruction decoder 311 is lower than the primary frequency of the data path 321 .
  • the frequency of components reflects the efficiency of the component. The higher the main frequency, the higher the working efficiency and the higher the power consumption. Conversely, the lower the main frequency, the lower the working efficiency and the lower the power consumption.
  • the instruction scheduling process of the decoding unit 31 is as shown in FIG. 14A.
  • the instruction scheduling process shown in FIG. 14A can be regarded as a mixed mode.
  • the primary frequency of the decoding unit 31 is the Z/X of the primary frequency of the processing unit 32, that is, the primary frequency of the instruction decoder 311 is the Z/X of the primary frequency of the data path 321 .
  • the main frequency of the decoding unit 31 (or the instruction decoder 311) takes a minimum value, and its power consumption is minimized.
  • the main frequency of the decoding unit 31 is 1/4 of the main frequency of the processing unit 32, that is, the main frequency of the instruction decoder 311 is 1/4 of the main frequency of the data path 321; in this case, the instruction scheduling process of the decoding unit 31 is as shown in FIG. 14B.
  • the instruction scheduling process shown in FIG. 14B can be regarded as a TLP mode.
  • the processor provided by the embodiment of the present application reduces the power consumption of the processor by configuring the primary frequency of the decoding unit to be lower than the primary frequency of the processing unit.
  • q x Z processing units 32 can be configured, q being a positive integer.
  • q processing units 32 are configured for each thread to execute the instructions in the thread.
  • the primary frequency of decoding unit 31 is 1/2 of the primary frequency of processing unit 32, that is, the primary frequency of instruction decoder 311 is 1/2 of the primary frequency of data path 321 .
  • each thread is configured with four processing units 32 for executing instructions in the thread as an example, and the corresponding instruction scheduling and execution process is as shown in FIG. 15B.
  • the embodiment of the present application further provides an instruction scheduling method, which is applicable to the decoding unit of the processor provided by the foregoing embodiment.
  • the method can include the following steps:
  • the decoding unit acquires one instruction from each of the predefined Z threads in each cycle, decodes the obtained Z instructions to obtain Z decoding results, and obtains Z decoding results. Send to the processing unit.
  • the decoding unit includes an instruction decoder and X program counters.
  • the decoding unit can schedule the instructions in accordance with the three instruction scheduling modes described above.
  • when the number of threads actually executed by the processor is less than the maximum number of threads supported by the processor, the decoding unit repeatedly sends decoding results to the processing unit in multiple transmission periods, so that the decoding unit sends one decoding result to the processing unit in each transmission period; the processing unit therefore has no idle period in which no instruction is executed and is fully utilized, ensuring high utilization efficiency of the processing unit so that its performance can be fully exploited.
  • the processor provided in the embodiment of the present application can be applied to any electronic device having a computing processing requirement.
  • the electronic device can be a personal computer (PC), a mobile phone, a tablet computer, an e-book reader, a multimedia playback device, a wearable device, a server or a network communication device, and the like.
  • a plurality as referred to herein means two or more.
  • "and/or” describing the association relationship of the associated objects, indicating that there may be three relationships, for example, A and/or B, which may indicate that there are three cases where A exists separately, A and B exist at the same time, and B exists separately.
  • the character "/" generally indicates that the contextual object is an "or" relationship.
  • a person skilled in the art may understand that all or part of the steps of implementing the above embodiments may be completed by hardware, or may be instructed by a program to execute related hardware, and the program may be stored in a computer readable storage medium.
  • the storage medium mentioned may be a read only memory, a magnetic disk or an optical disk or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Executing Machine-Instructions (AREA)
  • Advance Control (AREA)

Abstract

A processor and an instruction scheduling method, belonging to the field of computer technologies. The processor supports X-way interleaved multi-threading, X being an integer greater than 1. The processor includes a decoding unit (31) and a processing unit (32). The decoding unit (31) is configured to, in each cycle, acquire one instruction from each of Z predefined threads, decode the acquired Z instructions to obtain Z decoding results, and send the Z decoding results to the processing unit (32); each cycle includes X transmission periods, one decoding result is sent to the processing unit (32) in each transmission period, and a decoding result among the Z decoding results may be repeatedly sent by the decoding unit (31) in multiple transmission periods, where 1≤Z<X or Z=X, and Z is an integer. The processing unit (32) is configured to execute instructions according to the decoding results. This technical solution keeps the processing unit (32) fully utilized, thereby ensuring high utilization efficiency of the processing unit (32) and allowing its performance to be fully exploited.

Description

Processor and instruction scheduling method
This application claims priority to Chinese Patent Application No. 201710169572.6, filed with the Chinese Patent Office on March 21, 2017 and entitled "Processor and Instruction Scheduling Method", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of computer technologies, and in particular, to a processor and an instruction scheduling method.
Background
A processor usually uses pipeline technology to speed up processing. If an instruction about to be executed in the pipeline depends on the execution result of a preceding instruction that has not yet completed, that instruction cannot start executing immediately; the resulting conflict is called a data hazard, which in turn causes an instruction processing delay in the pipeline. In the prior art, interleaved multi-threading (Interleaved Multi-Threading, IMT) technology is used to resolve the instruction processing delay caused by data hazards in the pipeline.
IMT technology is an instruction scheduling mechanism that exploits thread level parallelism (Thread Level Parallelism, TLP). Referring to FIG. 1, which shows a schematic diagram of an instruction scheduling process of a processor supporting 4-way IMT, the maximum number of threads X supported by the processor is 4. The processor includes an instruction decoder and a data path; the instruction decoder is configured to decode an instruction to obtain a decoding result and send the decoding result to the data path, and the data path is configured to execute the instruction according to the decoding result. As shown in FIG. 1, PC0, PC1, PC2, and PC3 respectively represent the program counters (Program Counter, PC) of four independent threads. The instruction decoder schedules the instructions of the threads in the following order: in the first time period, the decoding unit acquires and decodes the first instruction of the thread corresponding to PC0; in the second time period, it acquires and decodes the first instruction of the thread corresponding to PC1; in the third time period, it acquires and decodes the first instruction of the thread corresponding to PC2; in the fourth time period, it acquires and decodes the first instruction of the thread corresponding to PC3; then, in the fifth time period, the decoding unit returns to PC0, acquires the second instruction of the thread corresponding to PC0 and decodes it, and the cycle repeats.
In this way, for two consecutive instructions of the same thread, there is a buffer period of several time periods between them. For example, for the processor supporting 4-way IMT shown in FIG. 1, there is a buffer period of three time periods between two consecutive instructions of the same thread. When the latter instruction starts executing, the former instruction has already completed its data write-back, so no data hazard occurs. Moreover, during the buffer period between two consecutive instructions of the same thread, the pipeline is continuously used by the other threads, so the pipeline also maintains a high utilization efficiency.
The existing IMT technology described above may be called static IMT (Static IMT, S-IMT) technology. S-IMT technology also has the following technical problem:
When the number of threads Z actually executed by the processor is smaller than the maximum number of threads X it supports, the instruction decoder still cyclically schedules instructions among the X threads in the above fixed order. Referring to FIG. 2, still taking a processor supporting 4-way IMT as an example, when the number of threads Z actually executed by the processor is 2, the instruction scheduling order of the instruction decoder remains as described above, cycling through the four threads corresponding to PC0, PC1, PC2, and PC3. Since PC2 and PC3 have no corresponding threads executing, in each round of the cycle there are always two time periods in which the instruction decoder performs no decoding operation and sends no decoding result to the data path, which in turn causes the data path to have idle periods in which no instruction is executed.
Therefore, for an existing processor using S-IMT technology, when the number of threads it actually executes is smaller than the maximum number of threads it supports, the data path cannot be fully utilized, so the utilization efficiency of the data path decreases and its performance is not fully exploited.
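To make the idle-slot problem concrete, the following is a minimal Python sketch (illustrative only, not part of the patent) that reproduces the fixed round-robin order described above; the thread counts correspond to the X=4 examples of FIG. 1 and FIG. 2, and the labels are placeholders.

```python
# Simulate the fixed S-IMT round-robin of an X-way decoder when only Z threads run.
def s_imt_schedule(x_ways, z_threads, n_rounds):
    slots = []
    for t in range(n_rounds * x_ways):
        pc = t % x_ways                      # decoder always walks PC0 .. PC(X-1)
        slots.append(f"PC{pc}" if pc < z_threads else "idle")
    return slots

print(s_imt_schedule(4, 4, 2))   # all slots used: PC0..PC3, PC0..PC3
print(s_imt_schedule(4, 2, 2))   # 2 of every 4 slots are idle when Z = 2
```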
Summary
Embodiments of this application provide a processor and an instruction scheduling method, to solve the problem that an existing processor using S-IMT technology cannot fully utilize the data path when the number of threads it actually executes is smaller than the maximum number of threads it supports, so that the utilization efficiency of the data path decreases and its performance is not fully exploited.
According to one aspect, a processor is provided. The processor supports X-way IMT and includes a decoding unit and a processing unit, X being an integer greater than 1. The decoding unit is configured to, in each cycle, acquire one instruction from each of Z predefined threads, decode the acquired Z instructions to obtain Z decoding results, and send the Z decoding results to the processing unit; each cycle includes X transmission periods, one decoding result is sent to the processing unit in each transmission period, and a decoding result among the Z decoding results may be repeatedly sent by the decoding unit in multiple transmission periods, where 1≤Z<X or Z=X, and Z is an integer. The processing unit is configured to execute instructions according to the decoding results.
In the technical solution provided in the embodiments of this application, when the number of threads actually executed by the processor is smaller than the maximum number of threads it supports, the decoding unit repeatedly sends decoding results to the processing unit in multiple transmission periods, so that the decoding unit sends one decoding result to the processing unit in every transmission period. The processing unit therefore has no idle period in which no instruction is executed and is fully utilized, which ensures high utilization efficiency of the processing unit and allows its performance to be fully exploited.
In a possible design, the decoding unit includes an instruction decoder and X program counters. The instruction decoder is configured to: let k=1, and in the i-th transmission period of the n-th cycle, acquire an instruction from the j-th thread of the Z predefined threads according to the value of the j-th program counter, where the initial values of n, i, and j are 1 and X is an integer multiple of Z; update the value of the j-th program counter, and decode the acquired instruction to obtain the decoding result corresponding to the instruction in the j-th thread; send the decoding result corresponding to the instruction in the j-th thread to the processing unit; let i=i+1 and determine whether i is greater than X; if i is not greater than X, let k=k+1 and determine whether k is greater than X/Z; if k is not greater than X/Z, resume execution from the step of sending the decoding result corresponding to the instruction in the j-th thread to the processing unit; if k is greater than X/Z, let j=j+1 and k=1, and resume execution from the step of acquiring, in the i-th transmission period of the n-th cycle, an instruction from the j-th thread of the Z predefined threads according to the value of the j-th program counter; if i is greater than X, let n=n+1, i=1, j=1, and k=1, and resume execution from the step of acquiring, in the i-th transmission period of the n-th cycle, an instruction from the j-th thread of the Z predefined threads according to the value of the j-th program counter.
In a possible design, the decoding unit includes an instruction decoder and X program counters. The instruction decoder is configured to: in the i-th transmission period of the n-th cycle, acquire an instruction from the j-th thread of the Z predefined threads according to the value of the j-th program counter, where the initial values of n, i, and j are 1 and X is an integer multiple of Z; update the value of the j-th program counter, and decode the acquired instruction to obtain the decoding result corresponding to the instruction in the j-th thread; send the decoding result corresponding to the instruction in the j-th thread to the processing unit; let i=i+1 and compare i with X and Z; if i is not greater than Z, let j=j+1 and resume execution from the step of acquiring, in the i-th transmission period of the n-th cycle, an instruction from the j-th thread of the Z predefined threads according to the value of the j-th program counter; if i is greater than Z and not greater than X, let j be the remainder of dividing i by Z, and resume execution from the step of sending the decoding result corresponding to the instruction in the j-th thread to the processing unit; if i is greater than X, let n=n+1, i=1, and j=1, and resume execution from the step of acquiring, in the i-th transmission period of the n-th cycle, an instruction from the j-th thread of the Z predefined threads according to the value of the j-th program counter.
The technical solution provided in the embodiments of this application can always ensure that the instruction processing interval between two adjacent instructions of each thread is the same, and that the instruction processing intervals corresponding to different threads are also the same. This avoids the increase in the complexity of the processor logic circuit that non-uniform instruction processing intervals would cause, keeps the processor logic circuit simple, and facilitates an increase in the processor's main frequency.
In a possible design, the processor further includes a Z register, and the Z register is configured to store the value of Z.
In a possible design, the processor further includes a mode register. The mode register is configured to indicate an instruction scheduling mode to the decoding unit, where the instruction scheduling mode includes the first two or all three of a thread level parallelism (TLP) mode, a data level parallelism (DLP) mode, and a mixed mode; the TLP mode is the instruction scheduling mode of the decoding unit when Z=X, the DLP mode is the instruction scheduling mode of the decoding unit when Z=1, and the mixed mode is the instruction scheduling mode of the decoding unit when 1<Z<X.
In a possible design, one processing unit includes one data path and X groups of data registers, each group of data registers including at least one data register. For the decoding result sent by the decoding unit in the i-th transmission period of each cycle, the data path reads an operand from the data register corresponding to the address code in the i-th group of data registers according to the address code in the decoding result, where 1≤i≤X and i is an integer, and performs an operation on the operand according to the operation code in the decoding result.
In a possible design, the number of processing units is one or more.
The processor provided in the embodiments of this application further improves processor performance by configuring multiple processing units to execute instructions in parallel.
In a possible design, the main frequency of the decoding unit is lower than the main frequency of the processing unit.
The processor provided in the embodiments of this application reduces processor power consumption by configuring the main frequency of the decoding unit to be lower than the main frequency of the processing unit.
According to another aspect, an embodiment of this application provides an instruction scheduling method, applied to a decoding unit of a processor, where the processor supports X-way interleaved multi-threading IMT and X is an integer greater than 1. The method includes: the decoding unit acquires, in each cycle, one instruction from each of Z predefined threads, decodes the acquired Z instructions to obtain Z decoding results, and sends the Z decoding results to the processing unit; each cycle includes X transmission periods, one decoding result is sent to the processing unit in each transmission period, and a decoding result among the Z decoding results may be repeatedly sent by the decoding unit in multiple transmission periods, where 1≤Z<X or Z=X, and Z is an integer.
In a possible design, the decoding unit includes an instruction decoder and X program counters. Let k=1; in the i-th transmission period of the n-th cycle, the instruction decoder acquires an instruction from the j-th thread of the Z predefined threads according to the value of the j-th program counter, where the initial values of n, i, and j are 1 and X is an integer multiple of Z; the instruction decoder updates the value of the j-th program counter, and decodes the acquired instruction to obtain the decoding result corresponding to the instruction in the j-th thread; the instruction decoder sends the decoding result corresponding to the instruction in the j-th thread to the processing unit; let i=i+1, and the instruction decoder determines whether i is greater than X; if i is not greater than X, let k=k+1 and the instruction decoder determines whether k is greater than X/Z; if k is not greater than X/Z, the instruction decoder resumes execution from the step of sending the decoding result corresponding to the instruction in the j-th thread to the processing unit; if k is greater than X/Z, let j=j+1 and k=1, and the instruction decoder resumes execution from the step of acquiring, in the i-th transmission period of the n-th cycle, an instruction from the j-th thread of the Z predefined threads according to the value of the j-th program counter; if i is greater than X, let n=n+1, i=1, j=1, and k=1, and the instruction decoder resumes execution from the step of acquiring, in the i-th transmission period of the n-th cycle, an instruction from the j-th thread of the Z predefined threads according to the value of the j-th program counter.
In a possible design, the decoding unit includes an instruction decoder and X program counters. In the i-th transmission period of the n-th cycle, the instruction decoder acquires an instruction from the j-th thread of the Z predefined threads according to the value of the j-th program counter, where the initial values of n, i, and j are 1 and X is an integer multiple of Z; the instruction decoder updates the value of the j-th program counter, and decodes the acquired instruction to obtain the decoding result corresponding to the instruction in the j-th thread; the instruction decoder sends the decoding result corresponding to the instruction in the j-th thread to the processing unit; let i=i+1, and the instruction decoder compares i with X and Z; if i is not greater than Z, let j=j+1, and the instruction decoder resumes execution from the step of acquiring, in the i-th transmission period of the n-th cycle, an instruction from the j-th thread of the Z predefined threads according to the value of the j-th program counter; if i is greater than Z and not greater than X, let j be the remainder of dividing i by Z, and the instruction decoder resumes execution from the step of sending the decoding result corresponding to the instruction in the j-th thread to the processing unit; if i is greater than X, let n=n+1, i=1, and j=1, and the instruction decoder resumes execution from the step of acquiring, in the i-th transmission period of the n-th cycle, an instruction from the j-th thread of the Z predefined threads according to the value of the j-th program counter.
In the technical solution provided in the embodiments of this application, when the number of threads actually executed by the processor is smaller than the maximum number of threads it supports, the decoding unit repeatedly sends decoding results to the processing unit in multiple transmission periods, so that the decoding unit sends one decoding result to the processing unit in every transmission period. The processing unit therefore has no idle period in which no instruction is executed and is fully utilized, which ensures high utilization efficiency of the processing unit and allows its performance to be fully exploited.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of an instruction scheduling process of a processor using S-IMT technology in the prior art;
FIG. 2 is a schematic diagram of an instruction scheduling process of another processor using S-IMT technology in the prior art;
FIG. 3A is a schematic structural diagram of a processor according to an embodiment of this application;
FIG. 3B is a schematic diagram of an instruction scheduling process of a processor involved in the embodiment of FIG. 3A;
FIG. 4 is a schematic structural diagram of a processor according to another embodiment of this application;
FIG. 5 is a flowchart of an instruction scheduling process when Z=X;
FIG. 6 is a schematic diagram of an instruction scheduling process when Z=X;
FIG. 7 is a flowchart of an instruction scheduling process when 1≤Z<X;
FIG. 8 is a schematic diagram of an instruction scheduling process when 1≤Z<X;
FIG. 9 is a flowchart of another instruction scheduling process when 1≤Z<X;
FIG. 10 is a schematic diagram of another instruction scheduling process when 1≤Z<X;
FIG. 11 is a schematic diagram of an instruction scheduling process when Z=1;
FIG. 12 is a schematic diagram of scheduling and executing instructions in the mixed mode;
FIG. 13A is a schematic structural diagram of a processor according to another embodiment of this application;
FIG. 13B is a schematic diagram of the processor of the embodiment of FIG. 13A scheduling instructions in the TLP mode;
FIG. 13C is a schematic diagram of the processor of the embodiment of FIG. 13A scheduling instructions in the DLP mode;
FIG. 13D is a schematic diagram of the processor of the embodiment of FIG. 13A scheduling instructions in the mixed mode;
FIG. 14A is a schematic diagram of another instruction scheduling process in the mixed mode;
FIG. 14B is a schematic diagram of another instruction scheduling process in the TLP mode;
FIG. 15A is a schematic diagram of another instruction scheduling and execution process in the mixed mode;
FIG. 15B is a schematic diagram of another instruction scheduling and execution process in the DLP mode;
FIG. 15C is a schematic diagram of another instruction scheduling and execution process in the TLP mode.
Description of Embodiments
To make the objectives, technical solutions, and advantages of this application clearer, the following further describes the implementations of this application in detail with reference to the accompanying drawings.
Referring to FIG. 3A, which shows a schematic structural diagram of a processor according to an embodiment of this application. The processor 30 supports X-way IMT, X being an integer greater than 1. The processor 30 may include a decoding unit 31 and a processing element (Processing Element, PE) 32.
The decoding unit 31 is configured to, in each cycle, acquire one instruction from each of Z predefined threads, decode the acquired Z instructions to obtain Z decoding results, and send the Z decoding results to the processing unit 32, where 1≤Z<X or Z=X, and Z is an integer.
The maximum number of threads supported by the processor 30 is X, and the number of threads actually executed by the processor 30 is Z. Since Z≤X, the number of threads actually executed by the processor 30 is equal to or smaller than the maximum number of threads it supports. The number of threads Z actually executed by the processor 30 may be predefined, for example, the value of Z is predefined according to the specific conditions of the program to be executed.
Each cycle includes X transmission periods. In each transmission period, the decoding unit 31 sends one decoding result to the processing unit 32. A decoding result among the Z decoding results may be repeatedly sent by the decoding unit 31 in multiple transmission periods; thus, when 1≤Z<X, at least one of the Z decoding results is repeatedly sent by the decoding unit 31 in multiple transmission periods.
In an example, assume X=4 and Z=2, that is, the maximum number of threads supported by the processor 30 is 4 and the number of threads actually executed by the processor 30 is 2, and assume that the program counters corresponding to the two actually executed threads are PC0 and PC1. Taking the first cycle as an example, in the first cycle the decoding unit 31 acquires the first instruction of the thread corresponding to PC0 (denoted as instruction 1) and the first instruction of the thread corresponding to PC1 (denoted as instruction 2), and decodes instruction 1 and instruction 2 to obtain decoding result 1 and decoding result 2. One cycle includes four transmission periods, and the decoding unit 31 sends one decoding result to the processing unit 32 in each transmission period, so at least one of decoding result 1 and decoding result 2 is repeatedly sent by the decoding unit 31 in the four transmission periods. The following possible implementations are specifically included:
1. Sending in the first to fourth transmission periods: decoding result 1, decoding result 1, decoding result 2, decoding result 2 (as shown in FIG. 3B);
2. Sending in the first to fourth transmission periods: decoding result 1, decoding result 1, decoding result 1, decoding result 2;
3. Sending in the first to fourth transmission periods: decoding result 1, decoding result 2, decoding result 2, decoding result 2;
4. Sending in the first to fourth transmission periods: decoding result 1, decoding result 2, decoding result 1, decoding result 2;
5. Sending in the first to fourth transmission periods: decoding result 1, decoding result 2, decoding result 2, decoding result 1;
6. Sending in the first to fourth transmission periods: decoding result 1, decoding result 1, decoding result 2, decoding result 1.
Similarly, in each subsequent cycle, the decoding unit 31 acquires the next unexecuted instruction from each of the threads corresponding to PC0 and PC1, decodes them to obtain decoding results, and sends the two decoding results in four transmission periods, at least one of which is repeatedly sent by the decoding unit 31 in multiple transmission periods.
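As an illustration of the first sending order listed above, the following Python sketch (not from the patent; the result labels are placeholders) repeats each decoding result X/Z times so that every transmission period of the cycle carries a result.

```python
# Build one cycle's send sequence by repeating each of the Z decoding results
# X/Z times (assumes X is a multiple of Z, as in the X = 4, Z = 2 example).
def ss_imt_cycle(decoded_results, x_ways):
    repeat = x_ways // len(decoded_results)
    return [r for r in decoded_results for _ in range(repeat)]

print(ss_imt_cycle(["decoding result 1", "decoding result 2"], 4))
# ['decoding result 1', 'decoding result 1', 'decoding result 2', 'decoding result 2']
```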
The processing unit 32 is configured to execute instructions according to the decoding results.
Optionally, the processing unit 32 is configured to execute instructions using pipeline technology according to the decoding results, so that multiple instructions can be processed in parallel and the efficiency of instruction execution is improved.
In summary, in the technical solution provided in the embodiments of this application, when the number of threads actually executed by the processor is smaller than the maximum number of threads it supports, the decoding unit repeatedly sends decoding results to the processing unit in multiple transmission periods, so that the decoding unit sends one decoding result to the processing unit in every transmission period. The processing unit therefore has no idle period in which no instruction is executed and is fully utilized, which ensures high utilization efficiency of the processing unit and allows its performance to be fully exploited.
The IMT technology used in the embodiments of this application described herein may be referred to as SS-IMT (Static SIMD IMT) technology.
Referring to FIG. 4, which shows a schematic structural diagram of a processor according to another embodiment of this application. The processor 30 supports X-way IMT, X being an integer greater than 1. The processor 30 may include a decoding unit 31 and a processing unit 32.
The decoding unit 31 is mainly configured to decode instructions. The decoding unit 31 may include an instruction decoder 311 and X program counters 312. The instruction decoder 311 is configured to decode an instruction to obtain a decoding result. An instruction includes an operation code and an address code. The operation code indicates the operational characteristics and function of the instruction, and the address code indicates the address of the operand participating in the operation. Each program counter 312 corresponds to one thread, and the program counter 312 is used to store an instruction address, which is the storage address of the next instruction to be executed.
The processing unit 32 is mainly configured to execute instructions according to the decoding results, where a decoding result includes the operation code and the address code of an instruction. The processing unit 32 may include a data path 321 and X groups of data registers 322. The data path 321 is configured to acquire operands according to the address code of an instruction, and to perform the corresponding operation on the operands according to the operation code of the instruction. The data registers 322 are used to store operands. Each group of data registers 322 includes at least one data register.
As shown in FIG. 4, in this embodiment of this application, the processor 30 further includes a Z register 33. The Z register 33 is used to store the value of Z, that is, the number of threads Z actually executed by the processor 30. The value of Z is predefined in the Z register 33, for example, according to the specific conditions of the program to be executed. The value of Z is an integer greater than or equal to 1 and less than or equal to X.
When Z=X, the number of threads actually executed by the processor 30 is equal to the maximum number of threads it supports. Referring to FIG. 5, which shows the instruction scheduling order of the instruction decoder 311 when Z=X, the instruction decoder 311 is configured to perform the following steps:
Step 51: in the i-th transmission period of the n-th cycle, acquire an instruction from the j-th thread of the Z predefined threads according to the value of the j-th program counter, where the initial values of n, i, and j are 1 and Z is equal to X;
Step 52: update the value of the j-th program counter, and decode the acquired instruction to obtain the decoding result corresponding to the instruction in the j-th thread;
Step 53: send the decoding result corresponding to the instruction in the j-th thread to the processing unit;
Step 54: let i=i+1 and determine whether i is greater than X;
if not (that is, i≤X), let j=j+1 and resume execution from step 51 above;
if so (that is, i>X), let n=n+1, i=1, and j=1, and resume execution from step 51 above.
Taking X=8 as an example, when Z=X=8, the instruction scheduling order of the instruction decoder 311 is as shown in FIG. 6.
When Z=X, the instruction scheduling mode of the instruction decoder 311 may be referred to as the TLP mode.
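The following Python sketch is an illustrative rendering of the step 51-54 order when Z=X; the tuple layout and thread labels are assumptions made for readability, not patent terminology.

```python
# TLP mode (Z = X): in cycle n, transmission period i forwards the decoding
# result of one instruction of thread j = i.
def tlp_schedule(x_ways, n_cycles):
    sends = []
    for n in range(1, n_cycles + 1):
        for i in range(1, x_ways + 1):
            j = i                            # one thread per transmission period
            sends.append((n, i, f"thread {j}"))
    return sends

for n, i, thread in tlp_schedule(8, 1):
    print(f"cycle {n}, period {i}: decode and send one instruction of {thread}")
```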
When 1≤Z<X, the number of threads actually executed by the processor 30 is smaller than the maximum number of threads it supports. The instruction decoder 311 schedules instructions according to the instruction scheduling order described in the embodiment of FIG. 3A above. Two possible instruction scheduling orders of the instruction decoder 311 are described below.
In a first possible implementation, as shown in FIG. 7, the instruction decoder 311 is configured to perform the following steps:
Step 71: let k=1, and in the i-th transmission period of the n-th cycle, acquire an instruction from the j-th thread of the Z predefined threads according to the value of the j-th program counter, where the initial values of n, i, and j are 1 and X is an integer multiple of Z;
Step 72: update the value of the j-th program counter, and decode the acquired instruction to obtain the decoding result corresponding to the instruction in the j-th thread;
Step 73: send the decoding result corresponding to the instruction in the j-th thread to the processing unit;
Step 74: let i=i+1 and determine whether i is greater than X;
if not (that is, i≤X), perform the following step 75;
if so (that is, i>X), let n=n+1, i=1, j=1, and k=1, and resume execution from step 71 above;
Step 75: let k=k+1 and determine whether k is greater than X/Z;
if not (that is, k≤X/Z), resume execution from step 73 above;
if so (that is, k>X/Z), let j=j+1 and k=1, and resume execution from step 71 above.
Taking X=8 and Z=4 as an example, in the first possible implementation described above, the instruction scheduling order of the instruction decoder 311 is as shown in FIG. 8.
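The following Python sketch illustrates the step 71-75 order of the first implementation, in which each decoded result is re-sent in X/Z consecutive transmission periods; the result labels are placeholders rather than patent terminology.

```python
# First mixed-mode implementation: decode one instruction per thread per cycle
# and send its result in X/Z consecutive transmission periods.
def mixed_schedule_consecutive(x_ways, z_threads, n_cycles):
    sends = []
    for n in range(1, n_cycles + 1):
        for j in range(1, z_threads + 1):
            result = f"cycle {n}, thread {j}"            # decoded once
            sends.extend([result] * (x_ways // z_threads))
    return sends

print(mixed_schedule_consecutive(8, 4, 1))
# thread 1, thread 1, thread 2, thread 2, thread 3, thread 3, thread 4, thread 4
```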
In a second possible implementation, as shown in FIG. 9, the instruction decoder 311 is configured to perform the following steps:
Step 91: in the i-th transmission period of the n-th cycle, acquire an instruction from the j-th thread of the Z predefined threads according to the value of the j-th program counter, where the initial values of n, i, and j are 1 and X is an integer multiple of Z;
Step 92: update the value of the j-th program counter, and decode the acquired instruction to obtain the decoding result corresponding to the instruction in the j-th thread;
Step 93: send the decoding result corresponding to the instruction in the j-th thread to the processing unit;
Step 94: let i=i+1 and compare i with X and Z;
if i≤Z, let j=j+1 and resume execution from step 91 above;
if Z<i≤X, let j be the remainder of dividing i by Z, and resume execution from step 93 above;
if i>X, let n=n+1, i=1, and j=1, and resume execution from step 91 above.
Taking X=8 and Z=4 as an example, in the second possible implementation described above, the instruction scheduling order of the instruction decoder 311 is as shown in FIG. 10.
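The following Python sketch illustrates the step 91-94 order of the second implementation; mapping a remainder of 0 to thread Z is an interpretation of step 94 made so the index stays in 1..Z, and the labels are placeholders.

```python
# Second mixed-mode implementation: the first Z transmission periods decode one
# instruction per thread; the remaining periods of the cycle re-send the stored
# result of thread j derived from the period index i.
def mixed_schedule_interleaved(x_ways, z_threads, n_cycles):
    sends = []
    for n in range(1, n_cycles + 1):
        results = {}
        for i in range(1, x_ways + 1):
            if i <= z_threads:
                j = i
                results[j] = f"cycle {n}, thread {j}"    # decoded in this period
            else:
                j = ((i - 1) % z_threads) + 1            # re-send a stored result
            sends.append(results[j])
    return sends

print(mixed_schedule_interleaved(8, 4, 1))
# thread 1..thread 4, then thread 1..thread 4 again within the same cycle
```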
Comparing the two implementations, in the first possible implementation the instruction decoder 311 sends the same decoding result to the processing unit 32 in consecutive transmission periods, so no additional buffer needs to be configured to store the decoding results, which helps reduce the hardware cost of the processor 30.
In addition, when Z=1, the instruction decoder 311 is configured to: in each cycle, acquire one instruction from the single predefined thread, decode the acquired instruction to obtain one decoding result, and repeatedly send the decoding result to the processing unit 32 in the X transmission periods included in the cycle. The instruction decoder 311 sends one decoding result to the processing unit 32 in each transmission period.
Taking X=8 and Z=1 as an example, the instruction scheduling order of the instruction decoder 311 is as shown in FIG. 11.
When Z=1, the instruction scheduling mode of the instruction decoder 311 may be referred to as the DLP mode.
When 1<Z<X, the instruction scheduling mode of the instruction decoder 311 may be referred to as the mixed mode, the DTLP mode, or the TDLP mode.
In an actual implementation, the instruction decoder 311 obtains the value of Z from the Z register 33, determines the instruction scheduling mode according to the magnitude relationship between Z and X, and then schedules instructions according to the determined instruction scheduling mode. The instruction scheduling mode includes the first two or all three of the TLP mode, the DLP mode, and the mixed mode. That is, in one case the processor 30 supports two instruction scheduling modes, the TLP mode and the DLP mode; in another case the processor 30 supports three instruction scheduling modes: the TLP mode, the DLP mode, and the mixed mode.
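A minimal sketch of this mode choice, assuming only the relationships stated above (Z=X gives the TLP mode, Z=1 the DLP mode, and 1<Z<X the mixed mode); the function and variable names are illustrative.

```python
# Determine the instruction scheduling mode from the value of Z and the number
# of IMT ways X, as the instruction decoder does after reading the Z register.
def select_mode(z_threads, x_ways):
    if z_threads == x_ways:
        return "TLP"        # one thread per transmission period
    if z_threads == 1:
        return "DLP"        # one result repeated over all X periods
    return "mixed"          # 1 < Z < X

print(select_mode(8, 8), select_mode(1, 8), select_mode(2, 8))   # TLP DLP mixed
```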
Optionally, as shown in FIG. 4, the processor 30 further includes a mode register 34. The mode register 34 is used to indicate the instruction scheduling mode to the decoding unit 31, where the instruction scheduling mode includes the first two or all three of the TLP mode, the DLP mode, and the mixed mode.
In an actual implementation, the mode register 34 obtains the value of Z from the Z register 33, determines the instruction scheduling mode according to the magnitude relationship between Z and X, and sets its own value according to the determined instruction scheduling mode. The instruction decoder 311 obtains the value of the mode register 34, determines the instruction scheduling mode according to the value of the mode register 34, and then schedules instructions according to the determined instruction scheduling mode.
In one case, the processor 30 supports two instruction scheduling modes, the TLP mode and the DLP mode. The TLP mode is represented by a first value and the DLP mode by a second value, the first value being different from the second value. For example, the first value is 0 and the second value is 1, or the first value is 1 and the second value is 0. When the processor 30 supports these two instruction scheduling modes, the mode register 34 occupies 1 bit.
In another case, the processor 30 supports three instruction scheduling modes: the TLP mode, the DLP mode, and the mixed mode. The TLP mode is represented by a first value, the DLP mode by a second value, and the mixed mode by a third value, the first value, the second value, and the third value all being different. For example, the first value is 0, the second value is 1, and the third value is 2. When the processor 30 supports these three instruction scheduling modes, the mode register 34 occupies 2 bits.
In addition, regardless of which instruction scheduling mode the decoding unit 31 uses, the data path 321 of the processing unit 32 executes instructions as follows. The data path 321 is configured to: for the decoding result sent by the decoding unit 31 in the i-th transmission period of each cycle, read an operand from the data register corresponding to the address code in the i-th group of data registers according to the address code in the decoding result, where 1≤i≤X and i is an integer, and perform an operation on the read operand according to the operation code in the decoding result.
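The following Python sketch illustrates this register-group selection rule; the register contents, the two opcodes, and the data-structure layout are assumptions made for the example only.

```python
# For the result received in the i-th transmission period, index the i-th
# register group with the address code, then apply the operation code.
def execute(register_groups, period_i, decoding_result):
    opcode, addr = decoding_result
    a, b = register_groups[period_i - 1][addr]   # i-th group, register 'addr'
    return a + b if opcode == "add" else a - b

groups = [{3: (1, 2)}, {3: (3, 4)}]              # two groups, register #3 only
print(execute(groups, 1, ("add", 3)))            # 3, operands taken from group 1
print(execute(groups, 2, ("add", 3)))            # 7, operands taken from group 2
```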
Taking X=8 and Z=2 as an example, the processor 30 schedules instructions in the mixed mode using the first possible implementation described above, and the corresponding scheduling and execution process is as shown in FIG. 12. Assume that PC0 and PC1 respectively represent the program counters of the two threads. In each cycle, the instruction scheduling and execution process is as follows:
In the first transmission period, the instruction decoder 311 acquires an instruction from the first thread according to the value of PC0, decodes the instruction to obtain a decoding result (denoted as decoding result 1), and sends decoding result 1 to the data path 321; the data path 321 reads an operand from the data register corresponding to the address code in the first group of data registers according to the address code in decoding result 1, and performs an operation on the read operand according to the operation code in decoding result 1;
In the second transmission period, the instruction decoder 311 sends decoding result 1 to the data path 321 again; the data path 321 reads an operand from the data register corresponding to the address code in the second group of data registers according to the address code in decoding result 1, and performs an operation on the read operand according to the operation code in decoding result 1;
In the third transmission period, the instruction decoder 311 sends decoding result 1 to the data path 321 again; the data path 321 reads an operand from the data register corresponding to the address code in the third group of data registers according to the address code in decoding result 1, and performs an operation on the read operand according to the operation code in decoding result 1;
In the fourth transmission period, the instruction decoder 311 sends decoding result 1 to the data path 321 again; the data path 321 reads an operand from the data register corresponding to the address code in the fourth group of data registers according to the address code in decoding result 1, and performs an operation on the read operand according to the operation code in decoding result 1;
In the fifth transmission period, the instruction decoder 311 acquires an instruction from the second thread according to the value of PC1, decodes the instruction to obtain a decoding result (denoted as decoding result 2), and sends decoding result 2 to the data path 321; the data path 321 reads an operand from the data register corresponding to the address code in the fifth group of data registers according to the address code in decoding result 2, and performs an operation on the read operand according to the operation code in decoding result 2;
In the sixth transmission period, the instruction decoder 311 sends decoding result 2 to the data path 321 again; the data path 321 reads an operand from the data register corresponding to the address code in the sixth group of data registers according to the address code in decoding result 2, and performs an operation on the read operand according to the operation code in decoding result 2;
In the seventh transmission period, the instruction decoder 311 sends decoding result 2 to the data path 321 again; the data path 321 reads an operand from the data register corresponding to the address code in the seventh group of data registers according to the address code in decoding result 2, and performs an operation on the read operand according to the operation code in decoding result 2;
In the eighth transmission period, the instruction decoder 311 sends decoding result 2 to the data path 321 again; the data path 321 reads an operand from the data register corresponding to the address code in the eighth group of data registers according to the address code in decoding result 2, and performs an operation on the read operand according to the operation code in decoding result 2.
For example, assume that the operation code in decoding result 1 indicates an addition operation and its address code indicates the third data register, and that the operation code in decoding result 2 indicates a subtraction operation and its address code indicates the second data register. Then:
After receiving decoding result 1 sent by the instruction decoder 311 in the first transmission period, the data path 321 reads the operands (assumed to be x1 and y1) from the third data register in the first group of data registers, and performs an addition operation on them, that is, calculates the sum of x1 and y1 to obtain z1;
After receiving decoding result 1 sent by the instruction decoder 311 in the second transmission period, the data path 321 reads the operands (assumed to be x2 and y2) from the third data register in the second group of data registers, and performs an addition operation on them, that is, calculates the sum of x2 and y2 to obtain z2;
After receiving decoding result 1 sent by the instruction decoder 311 in the third transmission period, the data path 321 reads the operands (assumed to be x3 and y3) from the third data register in the third group of data registers, and performs an addition operation on them, that is, calculates the sum of x3 and y3 to obtain z3;
After receiving decoding result 1 sent by the instruction decoder 311 in the fourth transmission period, the data path 321 reads the operands (assumed to be x4 and y4) from the third data register in the fourth group of data registers, and performs an addition operation on them, that is, calculates the sum of x4 and y4 to obtain z4;
After receiving decoding result 2 sent by the instruction decoder 311 in the fifth transmission period, the data path 321 reads the operands (assumed to be x5 and y5) from the second data register in the fifth group of data registers, and performs a subtraction operation on them, that is, calculates the difference between x5 and y5 to obtain z5;
After receiving decoding result 2 sent by the instruction decoder 311 in the sixth transmission period, the data path 321 reads the operands (assumed to be x6 and y6) from the second data register in the sixth group of data registers, and performs a subtraction operation on them, that is, calculates the difference between x6 and y6 to obtain z6;
After receiving decoding result 2 sent by the instruction decoder 311 in the seventh transmission period, the data path 321 reads the operands (assumed to be x7 and y7) from the second data register in the seventh group of data registers, and performs a subtraction operation on them, that is, calculates the difference between x7 and y7 to obtain z7;
After receiving decoding result 2 sent by the instruction decoder 311 in the eighth transmission period, the data path 321 reads the operands (assumed to be x8 and y8) from the second data register in the eighth group of data registers, and performs a subtraction operation on them, that is, calculates the difference between x8 and y8 to obtain z8.
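The worked example above can be reproduced with the following Python sketch; the operand values are placeholders, since the patent leaves x1...x8 and y1...y8 unspecified.

```python
# Register groups 1-4 hold the operands of decoding result 1 in register #3;
# groups 5-8 hold the operands of decoding result 2 in register #2.
groups = [{3: (i, 10 * i)} for i in range(1, 5)] + \
         [{2: (100 * i, i)} for i in range(5, 9)]

z = []
for period in range(1, 9):
    if period <= 4:
        x, y = groups[period - 1][3]
        z.append(x + y)              # z1..z4: sums (decoding result 1, "add")
    else:
        x, y = groups[period - 1][2]
        z.append(x - y)              # z5..z8: differences (decoding result 2, "sub")
print(z)
```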
In summary, with the technical solutions provided in the embodiments of this application, when the number of threads actually executed by the processor is smaller than the maximum number of threads it supports, the decoding unit can send a decoding result repeatedly to the processing unit in multiple sending cycles, so the decoding unit sends one decoding result to the processing unit in every sending cycle. The processing unit therefore has no idle periods in which it executes no instructions and is fully utilized, which ensures high utilization efficiency of the processing unit and allows its performance to be fully exploited.
In addition, the technical solutions provided in the embodiments of this application always ensure that the instruction processing interval between two adjacent instructions in a thread is the same, and that the instruction processing intervals of different threads are also the same. This avoids the increased complexity of the processor's logic circuits that non-uniform instruction processing intervals would cause, keeps the processor's logic circuits simple, and facilitates raising the processor's clock frequency.
In the embodiments of this application, the number of processing units 32 is one or more.
When there are multiple processing units 32, the decoding unit 31 sends the decoding result of an instruction to the multiple processing units 32, and the multiple processing units 32 execute the instruction in parallel, further improving processor performance.
In one example, as shown in FIG. 13A, the processor 30 includes a decoding unit 31 and four processing units 32. The decoding unit 31 includes an instruction decoder 311 and X program counters 312. Each processing unit 32 includes a data path 321 and X groups of data registers 322. Each group of data registers 322 includes at least one data register. Different groups of data registers 322 may include the same number of data registers or different numbers of data registers.
Taking X = 8 as an example, the instruction scheduling process of the processor 30 in the TLP mode is shown in FIG. 13B, the instruction scheduling process in the DLP mode is shown in FIG. 13C, and the instruction scheduling process in the mixed mode is shown in FIG. 13D.
The processor provided in the embodiments of this application further improves processor performance by configuring multiple processing units to execute instructions in parallel.
Optionally, in the embodiments of this application, the clock frequency of the decoding unit 31 is allowed to be lower than that of the processing unit 32, that is, the clock frequency of the instruction decoder 311 is lower than that of the data path 321. The clock frequency of a component reflects its operating efficiency: the higher the clock frequency, the higher the operating efficiency and the higher the corresponding power consumption; conversely, the lower the clock frequency, the lower the operating efficiency and the lower the corresponding power consumption.
Assume that the clock frequency of the decoding unit 31 is 1/w of that of the processing unit 32, where w is a positive integer power of 2 and w < X. The maximum number of threads actually supported by the processor 30 is then reduced from X to X/w. In that case, when Z = X/w the instruction scheduling mode of the decoding unit 31 is the TLP mode; when Z = 1 it is the DLP mode; and when 1 < Z < X/w it is the mixed mode.
In one example, taking X = 8 and Z = 2, and assuming that the clock frequency of the decoding unit 31 is 1/2 of that of the processing unit 32, the instruction scheduling process of the decoding unit 31 is shown in FIG. 14A. The instruction scheduling process shown in FIG. 14A can be regarded as the mixed mode.
Optionally, the clock frequency of the decoding unit 31 is Z/X of that of the processing unit 32, that is, the clock frequency of the instruction decoder 311 is Z/X of that of the data path 321. In this case, the clock frequency of the decoding unit 31 (or of the instruction decoder 311) takes its minimum value, and its power consumption is reduced to a minimum.
In one example, taking X = 8 and Z = 2, and assuming that the clock frequency of the decoding unit 31 is 1/4 of that of the processing unit 32, that is, the clock frequency of the instruction decoder 311 is 1/4 of that of the data path 321, the instruction scheduling process of the decoding unit 31 is shown in FIG. 14B. The instruction scheduling process shown in FIG. 14B can be regarded as the TLP mode.
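The relationships in these examples can be checked with a minimal sketch (illustrative only):

```python
# Minimal sketch of the frequency/thread-count relationship: a decoder running at
# 1/w of the data-path clock supports at most X/w threads, and the lowest usable
# decoder clock for Z threads is Z/X of the data-path clock.

from fractions import Fraction

def max_threads(X, w):
    return X // w

def minimum_decoder_ratio(Z, X):
    return Fraction(Z, X)

print(max_threads(X=8, w=2))            # 4 threads, as in the example above
print(minimum_decoder_ratio(Z=2, X=8))  # 1/4, matching the FIG. 14B example
```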
The processor provided in the embodiments of this application reduces processor power consumption by configuring the clock frequency of the decoding unit to be lower than that of the processing unit.
Optionally, in the embodiments of this application, in addition to allowing the clock frequency of the decoding unit 31 to be lower than that of the processing unit 32, that is, the clock frequency of the instruction decoder 311 to be lower than that of the data path 321, q × Z processing units 32 may be configured, where q is a positive integer. For the Z threads actually executed by the processor 30, q processing units 32 are configured for each thread to execute the instructions of that thread. Although this configuration requires an additional buffer to store decoding results, the instructions of the same thread are executed by the same processing unit 32, so intra-thread data transfer can be implemented by passing data between pipeline stages within the processing unit 32, which simplifies the implementation overhead.
In one example, assume that the clock frequency of the decoding unit 31 is 1/2 of that of the processing unit 32, that is, the clock frequency of the instruction decoder 311 is 1/2 of that of the data path 321. Taking X = 8 as an example, with the instruction decoder 311 running at 1/2 of the clock frequency of the data path 321, the maximum number of threads actually supported by the processor 30 is 8 × 1/2 = 4. In the mixed mode, taking Z = 2 with two processing units 32 configured for each thread to execute its instructions as an example, the corresponding instruction scheduling and execution process is shown in FIG. 15A. In the DLP mode, Z = 1; taking four processing units 32 configured for the thread to execute its instructions as an example, the corresponding instruction scheduling and execution process is shown in FIG. 15B. In the TLP mode, Z = 4; taking one processing unit 32 configured for each thread to execute its instructions as an example, the corresponding instruction scheduling and execution process is shown in FIG. 15C.
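A minimal sketch of this q × Z assignment (the unit numbering is an assumption for illustration): each of the Z threads gets its own group of q processing units, so a thread's instructions always land on the same units.

```python
# Minimal sketch: assign q processing units to each of the Z executed threads.

def assign_units(Z, q):
    return {thread: [thread * q + u for u in range(q)] for thread in range(Z)}

print(assign_units(Z=2, q=2))   # {0: [0, 1], 1: [2, 3]}           -> FIG. 15A layout
print(assign_units(Z=1, q=4))   # {0: [0, 1, 2, 3]}                -> FIG. 15B layout
print(assign_units(Z=4, q=1))   # {0: [0], 1: [1], 2: [2], 3: [3]} -> FIG. 15C layout
```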
An embodiment of this application further provides an instruction scheduling method, which can be applied to the decoding unit of the processor provided in the foregoing embodiments. The method may include the following steps:
In each loop cycle, the decoding unit fetches one instruction from each of Z predefined threads, decodes the Z fetched instructions to obtain Z decoding results, and sends the Z decoding results to the processing unit.
Each loop cycle includes X sending cycles, one decoding result is sent to the processing unit in each sending cycle, a decoding result among the Z decoding results may be sent repeatedly by the decoding unit in multiple sending cycles, 1 ≤ Z < X or Z = X, and Z is an integer.
Optionally, the decoding unit includes an instruction decoder and X program counters. The decoding unit may schedule instructions according to the three instruction scheduling modes described above.
For details not disclosed in the method embodiments of this application, refer to the processor embodiments of this application.
In summary, with the technical solutions provided in the embodiments of this application, when the number of threads actually executed by the processor is smaller than the maximum number of threads it supports, the decoding unit sends a decoding result repeatedly to the processing unit in multiple sending cycles, so the decoding unit sends one decoding result to the processing unit in every sending cycle. The processing unit therefore has no idle periods in which it executes no instructions and is fully utilized, which ensures high utilization efficiency of the processing unit and allows its performance to be fully exploited.
It should be further noted that the processor provided in the embodiments of this application can be applied to any electronic device with computing and processing needs. For example, the electronic device may be a personal computer (PC), a mobile phone, a tablet computer, an e-book reader, a multimedia playback device, a wearable device, a server, a network communication device, or the like.
It should be understood that "multiple" herein means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent three cases: A alone, both A and B, and B alone. The character "/" generally indicates an "or" relationship between the associated objects.
The sequence numbers of the foregoing embodiments of this application are for description only and do not imply any preference among the embodiments.
A person of ordinary skill in the art can understand that all or some of the steps of the foregoing embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing descriptions are merely exemplary embodiments of this application and are not intended to limit this application. Any modification, equivalent replacement, or improvement made within the spirit and principles of this application shall fall within the protection scope of this application.

Claims (11)

  1. A processor, wherein the processor supports X-way interleaved multi-threading (IMT), the processor comprises a decoding unit and a processing unit, and X is an integer greater than 1;
    the decoding unit is configured to: in each loop cycle, fetch one instruction from each of Z predefined threads, decode the Z fetched instructions to obtain Z decoding results, and send the Z decoding results to the processing unit; wherein each loop cycle comprises X sending cycles, one decoding result is sent to the processing unit in each sending cycle, a decoding result among the Z decoding results may be sent repeatedly by the decoding unit in multiple sending cycles, 1 ≤ Z < X or Z = X, and Z is an integer;
    the processing unit is configured to execute instructions according to the decoding results.
  2. The processor according to claim 1, wherein the decoding unit comprises an instruction decoder and X program counters;
    the instruction decoder is configured to:
    set k = 1, and in the i-th sending cycle of the n-th loop cycle, fetch an instruction from the j-th thread of the Z predefined threads according to the value of the j-th program counter, wherein the initial values of n, i, and j are 1, and X is an integer multiple of Z;
    update the value of the j-th program counter, and decode the fetched instruction to obtain the decoding result corresponding to the instruction in the j-th thread;
    send the decoding result corresponding to the instruction in the j-th thread to the processing unit;
    set i = i + 1, and determine whether i is greater than X;
    if i is not greater than X, set k = k + 1 and determine whether k is greater than X/Z; if k is not greater than X/Z, resume execution from the step of sending the decoding result corresponding to the instruction in the j-th thread to the processing unit; if k is greater than X/Z, set j = j + 1 and k = 1, and resume execution from the step of fetching, in the i-th sending cycle of the n-th loop cycle, an instruction from the j-th thread of the Z predefined threads according to the value of the j-th program counter;
    if i is greater than X, set n = n + 1, i = 1, j = 1, and k = 1, and resume execution from the step of fetching, in the i-th sending cycle of the n-th loop cycle, an instruction from the j-th thread of the Z predefined threads according to the value of the j-th program counter.
  3. The processor according to claim 1, wherein the decoding unit comprises an instruction decoder and X program counters;
    the instruction decoder is configured to:
    in the i-th sending cycle of the n-th loop cycle, fetch an instruction from the j-th thread of the Z predefined threads according to the value of the j-th program counter, wherein the initial values of n, i, and j are 1, and X is an integer multiple of Z;
    update the value of the j-th program counter, and decode the fetched instruction to obtain the decoding result corresponding to the instruction in the j-th thread;
    send the decoding result corresponding to the instruction in the j-th thread to the processing unit;
    set i = i + 1, and compare i with X and Z;
    if i is not greater than Z, set j = j + 1, and resume execution from the step of fetching, in the i-th sending cycle of the n-th loop cycle, an instruction from the j-th thread of the Z predefined threads according to the value of the j-th program counter;
    if i is greater than Z and not greater than X, set j to the remainder of i divided by Z, and resume execution from the step of sending the decoding result corresponding to the instruction in the j-th thread to the processing unit;
    if i is greater than X, set n = n + 1, i = 1, and j = 1, and resume execution from the step of fetching, in the i-th sending cycle of the n-th loop cycle, an instruction from the j-th thread of the Z predefined threads according to the value of the j-th program counter.
  4. The processor according to claim 1, wherein the processor further comprises a Z register;
    the Z register is configured to store the value of Z.
  5. The processor according to claim 1, wherein the processor further comprises a mode register;
    the mode register is configured to indicate an instruction scheduling mode to the decoding unit, wherein the instruction scheduling mode comprises the first two or all three of a thread-level parallelism (TLP) mode, a data-level parallelism (DLP) mode, and a mixed mode;
    wherein the TLP mode is the instruction scheduling mode of the decoding unit when Z = X, the DLP mode is the instruction scheduling mode of the decoding unit when Z = 1, and the mixed mode is the instruction scheduling mode of the decoding unit when 1 < Z < X.
  6. The processor according to claim 1, wherein the processing unit comprises a data path and X groups of data registers, and each group of data registers comprises at least one data register;
    the data path is configured to:
    for the decoding result sent by the decoding unit in the i-th sending cycle of each loop cycle, read operands, according to the address code in the decoding result, from the data register corresponding to the address code in the i-th group of data registers, wherein 1 ≤ i ≤ X and i is an integer; and
    perform, according to the operation code in the decoding result, an operation on the operands.
  7. The processor according to claim 1, wherein the number of processing units is one or more.
  8. The processor according to claim 1, wherein a clock frequency of the decoding unit is lower than a clock frequency of the processing unit.
  9. An instruction scheduling method, applied to a decoding unit of a processor, wherein the processor supports X-way interleaved multi-threading (IMT) and X is an integer greater than 1; the method comprises:
    fetching, by the decoding unit in each loop cycle, one instruction from each of Z predefined threads, decoding the Z fetched instructions to obtain Z decoding results, and sending the Z decoding results to a processing unit;
    wherein each loop cycle comprises X sending cycles, one decoding result is sent to the processing unit in each sending cycle, a decoding result among the Z decoding results may be sent repeatedly by the decoding unit in multiple sending cycles, 1 ≤ Z < X or Z = X, and Z is an integer.
  10. The method according to claim 9, wherein the decoding unit comprises an instruction decoder and X program counters;
    setting k = 1, and fetching, by the instruction decoder in the i-th sending cycle of the n-th loop cycle, an instruction from the j-th thread of the Z predefined threads according to the value of the j-th program counter, wherein the initial values of n, i, and j are 1, and X is an integer multiple of Z;
    updating, by the instruction decoder, the value of the j-th program counter, and decoding the fetched instruction to obtain the decoding result corresponding to the instruction in the j-th thread;
    sending, by the instruction decoder, the decoding result corresponding to the instruction in the j-th thread to the processing unit;
    setting i = i + 1, and determining, by the instruction decoder, whether i is greater than X;
    if i is not greater than X, setting k = k + 1 and determining, by the instruction decoder, whether k is greater than X/Z; if k is not greater than X/Z, resuming execution, by the instruction decoder, from the step of sending the decoding result corresponding to the instruction in the j-th thread to the processing unit; if k is greater than X/Z, setting j = j + 1 and k = 1, and resuming execution, by the instruction decoder, from the step of fetching, in the i-th sending cycle of the n-th loop cycle, an instruction from the j-th thread of the Z predefined threads according to the value of the j-th program counter;
    if i is greater than X, setting n = n + 1, i = 1, j = 1, and k = 1, and resuming execution, by the instruction decoder, from the step of fetching, in the i-th sending cycle of the n-th loop cycle, an instruction from the j-th thread of the Z predefined threads according to the value of the j-th program counter.
  11. The method according to claim 9, wherein the decoding unit comprises an instruction decoder and X program counters;
    fetching, by the instruction decoder in the i-th sending cycle of the n-th loop cycle, an instruction from the j-th thread of the Z predefined threads according to the value of the j-th program counter, wherein the initial values of n, i, and j are 1, and X is an integer multiple of Z;
    updating, by the instruction decoder, the value of the j-th program counter, and decoding the fetched instruction to obtain the decoding result corresponding to the instruction in the j-th thread;
    sending, by the instruction decoder, the decoding result corresponding to the instruction in the j-th thread to the processing unit;
    setting i = i + 1, and comparing, by the instruction decoder, i with X and Z;
    if i is not greater than Z, setting j = j + 1, and resuming execution, by the instruction decoder, from the step of fetching, in the i-th sending cycle of the n-th loop cycle, an instruction from the j-th thread of the Z predefined threads according to the value of the j-th program counter;
    if i is greater than Z and not greater than X, setting j to the remainder of i divided by Z, and resuming execution, by the instruction decoder, from the step of sending the decoding result corresponding to the instruction in the j-th thread to the processing unit;
    if i is greater than X, setting n = n + 1, i = 1, and j = 1, and resuming execution, by the instruction decoder, from the step of fetching, in the i-th sending cycle of the n-th loop cycle, an instruction from the j-th thread of the Z predefined threads according to the value of the j-th program counter.
PCT/CN2018/073200 2017-03-21 2018-01-18 Processor and instruction scheduling method WO2018171319A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP18770207.1A EP3591518B1 (en) 2017-03-21 2018-01-18 Processor and instruction scheduling method
US16/577,092 US11256543B2 (en) 2017-03-21 2019-09-20 Processor and instruction scheduling method

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710169572.6A CN108628639B (zh) 2017-03-21 2017-03-21 Processor and instruction scheduling method
CN201710169572.6 2017-03-21

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/577,092 Continuation US11256543B2 (en) 2017-03-21 2019-09-20 Processor and instruction scheduling method

Publications (1)

Publication Number Publication Date
WO2018171319A1 true WO2018171319A1 (zh) 2018-09-27

Family

ID=63586305

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/073200 WO2018171319A1 (zh) 2017-03-21 2018-01-18 Processor and instruction scheduling method

Country Status (4)

Country Link
US (1) US11256543B2 (zh)
EP (1) EP3591518B1 (zh)
CN (1) CN108628639B (zh)
WO (1) WO2018171319A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113031561A (zh) * 2021-03-05 2021-06-25 深圳市元征科技股份有限公司 Vehicle data acquisition method, sending method, electronic device, and storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110297662B (zh) * 2019-07-04 2021-11-30 中昊芯英(杭州)科技有限公司 Method for out-of-order instruction execution, processor, and electronic device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060218373A1 (en) * 2005-03-24 2006-09-28 Erich Plondke Processor and method of indirect register read and write operations
US20060230257A1 (en) * 2005-04-11 2006-10-12 Muhammad Ahmed System and method of using a predicate value to access a register file
CN1985242A (zh) * 2003-04-23 2007-06-20 国际商业机器公司 Accounting method and logic for determining per-thread processor resource utilization in a simultaneous multi-threading (SMT) processor
CN101171570A (zh) * 2005-03-14 2008-04-30 高通股份有限公司 Multithreaded processor and thread switching method
US20100281234A1 (en) * 2009-04-30 2010-11-04 Novafora, Inc. Interleaved multi-threaded vector processor
CN102089742A (zh) * 2008-02-26 2011-06-08 高通股份有限公司 System and method for data forwarding within an execution unit

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6311204B1 (en) * 1996-10-11 2001-10-30 C-Cube Semiconductor Ii Inc. Processing system with register-based process sharing
US8074051B2 (en) 2004-04-07 2011-12-06 Aspen Acquisition Corporation Multithreaded processor with multiple concurrent pipelines per thread
US7428653B2 (en) * 2004-07-23 2008-09-23 Mindspeed Technologies, Inc. Method and system for execution and latching of data in alternate threads
US7523295B2 (en) * 2005-03-21 2009-04-21 Qualcomm Incorporated Processor and method of grouping and executing dependent instructions in a packet
US7917907B2 (en) * 2005-03-23 2011-03-29 Qualcomm Incorporated Method and system for variable thread allocation and switching in a multithreaded processor
US8713286B2 (en) * 2005-04-26 2014-04-29 Qualcomm Incorporated Register files for a digital signal processor operating in an interleaved multi-threaded environment
US8429384B2 (en) * 2006-07-11 2013-04-23 Harman International Industries, Incorporated Interleaved hardware multithreading processor architecture
US8032737B2 (en) * 2006-08-14 2011-10-04 Marvell World Trade Ltd. Methods and apparatus for handling switching among threads within a multithread processor
US7904704B2 (en) * 2006-08-14 2011-03-08 Marvell World Trade Ltd. Instruction dispatching method and apparatus
US8195921B2 (en) * 2008-07-09 2012-06-05 Oracle America, Inc. Method and apparatus for decoding multithreaded instructions of a microprocessor
US8397238B2 (en) * 2009-12-08 2013-03-12 Qualcomm Incorporated Thread allocation and clock cycle adjustment in an interleaved multi-threaded processor
US9558000B2 (en) * 2014-02-06 2017-01-31 Optimum Semiconductor Technologies, Inc. Multithreading using an ordered list of hardware contexts

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1985242A (zh) * 2003-04-23 2007-06-20 国际商业机器公司 Accounting method and logic for determining per-thread processor resource utilization in a simultaneous multi-threading (SMT) processor
CN101171570A (zh) * 2005-03-14 2008-04-30 高通股份有限公司 Multithreaded processor and thread switching method
US20060218373A1 (en) * 2005-03-24 2006-09-28 Erich Plondke Processor and method of indirect register read and write operations
US20060230257A1 (en) * 2005-04-11 2006-10-12 Muhammad Ahmed System and method of using a predicate value to access a register file
CN102089742A (zh) * 2008-02-26 2011-06-08 高通股份有限公司 System and method for data forwarding within an execution unit
US20100281234A1 (en) * 2009-04-30 2010-11-04 Novafora, Inc. Interleaved multi-threaded vector processor

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3591518A4


Also Published As

Publication number Publication date
CN108628639A (zh) 2018-10-09
EP3591518A1 (en) 2020-01-08
EP3591518B1 (en) 2022-07-27
CN108628639B (zh) 2021-02-12
EP3591518A4 (en) 2020-12-23
US20200012524A1 (en) 2020-01-09
US11256543B2 (en) 2022-02-22

Similar Documents

Publication Publication Date Title
US10235175B2 (en) Processors, methods, and systems to relax synchronization of accesses to shared memory
US9846581B2 (en) Method and apparatus for asynchronous processor pipeline and bypass passing
KR101486025B1 (ko) Thread scheduling in a processor
KR100733943B1 (ko) Processor system, DMA control circuit, DMA control method, DMA controller control method, image processing method, and image processing circuit
CA3021414C (en) Instruction set
GB2503438A (en) Method and system for pipelining out of order instructions by combining short latency instructions to match long latency instructions
US8819345B2 (en) Method, apparatus, and computer program product for inter-core communication in multi-core processors
US10402223B1 (en) Scheduling hardware resources for offloading functions in a heterogeneous computing system
May The xmos xs1 architecture
TW201732564A (zh) Method and apparatus for user-level thread synchronization using the monitor and mwait architectures
US20200183878A1 (en) Controlling timing in computer processing
US9286125B2 (en) Processing engine implementing job arbitration with ordering status
TW201342225A (zh) Method for determining instruction order using triggers
US20230093393A1 (en) Processor, processing method, and related device
WO2018171319A1 (zh) Processor and instruction scheduling method
JP2022138116A (ja) Selection of a communication protocol for a management bus
US20160011874A1 (en) Silent memory instructions and miss-rate tracking to optimize switching policy on threads in a processing device
US11467844B2 (en) Storing multiple instructions in a single reordering buffer entry
CN107636611B (zh) System, device, and method for temporarily loading instructions
WO2022040877A1 (zh) Graph instruction processing method and apparatus
US20140201505A1 (en) Prediction-based thread selection in a multithreading processor
US9716646B2 (en) Using thresholds to gate timing packet generation in a tracing system
US20090276777A1 (en) Multiple Programs for Efficient State Transitions on Multi-Threaded Processors
May XMOS XS1 Architecture
JP2024077425A (ja) Processor

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18770207

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2018770207

Country of ref document: EP

Effective date: 20190930