WO2022088074A1 - Instruction processing method and processor based on a multi-instruction engine - Google Patents

Instruction processing method and processor based on a multi-instruction engine

Info

Publication number
WO2022088074A1
WO2022088074A1 · PCT/CN2020/125404
Authority
WO
WIPO (PCT)
Prior art keywords
instruction
engine
candidate
group
processing request
Prior art date
Application number
PCT/CN2020/125404
Other languages
English (en)
French (fr)
Inventor
王锦
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to PCT/CN2020/125404 priority Critical patent/WO2022088074A1/zh
Priority to CN202080106768.0A priority patent/CN116635840A/zh
Priority to EP20959236.9A priority patent/EP4220425A4/en
Publication of WO2022088074A1 publication Critical patent/WO2022088074A1/zh
Priority to US18/309,177 priority patent/US20230267002A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/3004Arrangements for executing specific machine instructions to perform operations on memory
    • G06F9/30047Prefetch instructions; cache control instructions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3836Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • As processors have developed toward parallel execution with multiple instruction engines (IEs), the number of instructions executed in parallel by the processor in a single cycle continues to increase, which places higher requirements on instruction fetch bandwidth and instruction fetch delay. Therefore, a single cache memory (cache) can no longer meet the instruction fetch requirements of processors with multiple IEs, and the instruction fetch bandwidth must be increased through a multi-cache-memory solution.
  • multiple IEs share multiple cache memories.
  • the queue depth determines the IE that executes the instruction processing request; the cache memory that caches the instruction corresponding to the instruction pointer (program counter, PC) in the instruction processing request is then determined, the instruction is obtained from that cache memory, and the instruction is sent to the corresponding IE for processing through the crossbar switch matrix.
  • Embodiments of the present application provide an instruction processing method and a processor based on a multi-instruction engine, which are used to improve the execution efficiency of instructions and the utilization rate of resources in the processor, thereby reducing cost and power consumption.
  • an embodiment of the present application provides an instruction processing method based on a multiple instruction engine.
  • the instruction processing method is applied to a processor.
  • the processor includes: an instruction block scheduler, an instruction cache group, and an instruction engine group.
  • the instruction cache group includes a plurality of instruction caches (for example, cache 0-cache 15), and the instruction engine group Including multiple instruction engines (for example, IE 0-IE 15); multiple instruction caches in the instruction cache group correspond to multiple instruction engines in the instruction engine group one-to-one (for example, cache 0 corresponds to IE 0; cache 1 corresponds to IE 1; and so on).
  • the instruction processing method includes: an instruction block scheduler receives an instruction processing request; the instruction processing request is used to request a processor to process a first instruction set.
  • the instruction block scheduler determines the first instruction engine according to the instruction processing request; the first instruction engine is an instruction engine that processes the first instruction set in the instruction engine group.
  • the instruction block scheduler sends the instruction processing request to the first instruction buffer corresponding to the first instruction engine.
  • the first instruction engine obtains the first instruction set from the first instruction buffer.
  • the instruction block scheduler includes an instruction block table, which records the instruction engines to which each instruction block in a program can be mapped (allocated). According to the instruction block identification number (PBID) in the instruction processing request, the instruction engine to which each instruction block can be mapped (allocated) is determined; for example, the instruction engine that processes each instruction block can be determined according to a certain rule.
  • the instruction block scheduler also includes multiple instruction processing request queues, which are in one-to-one correspondence with the multiple instruction engines and in one-to-one correspondence with the multiple instruction buffers, thereby establishing the one-to-one correspondence between the instruction engines and the instruction buffers.
  • the instruction engine and the corresponding instruction buffer may be connected through a hardware interface to transmit instructions.
  • the instruction block scheduler determines the first instruction engine for processing the first instruction set according to the instruction processing request for processing the first instruction set; the corresponding relationship is used to determine the first instruction buffer for caching the first instruction set; the instruction block scheduler sends the instruction pointer PC in the instruction processing request to the first instruction buffer; and the first instruction engine obtains the first instruction set from the first instruction buffer, so as to execute the instructions in the first instruction set.
  • each instruction engine can exclusively use the service of one instruction buffer, so that the processor has a stable and deterministic instruction fetch bandwidth, thereby improving the instruction execution efficiency of the processor; in addition, when the program is executed by the processor, the instructions in the program are divided into blocks, and according to the execution order of the program, different instruction blocks are assigned to different IEs for execution in an orderly manner, which can improve the resource utilization of the processor and reduce the replication of instructions between the instruction buffers, thereby reducing cost and power consumption.
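The one-to-one scheduler/queue/buffer/engine arrangement described above can be sketched as follows. This is a minimal toy model, not the patented implementation; all class, method, and field names (including the request dictionary keys) are illustrative assumptions:

```python
from collections import deque

class InstructionBlockScheduler:
    """Toy model of the dispatch flow: each instruction engine (IE)
    owns exactly one request queue and one instruction buffer."""

    def __init__(self, num_engines=16):
        # queue i feeds instruction buffer i, which feeds IE i
        self.request_queues = [deque() for _ in range(num_engines)]

    def dispatch(self, request, engine_id):
        # enqueue the request (carrying its PC and block id PBID)
        # for the chosen IE's dedicated buffer
        self.request_queues[engine_id].append(request)

    def queue_depth(self, engine_id):
        return len(self.request_queues[engine_id])

sched = InstructionBlockScheduler()
sched.dispatch({"pc": 0x100, "pbid": 3}, engine_id=0)
```

Because each queue feeds exactly one buffer and one IE, the queue depth doubles as the congestion signal used by the candidate-selection conditions described later.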
  • the instruction block scheduler determines the first instruction engine according to the instruction processing request, which may include: the instruction block scheduler obtains the candidate instruction engines of the first instruction set according to the instruction processing request; a candidate instruction engine is an instruction engine that can be used to process the first instruction set. The instruction block scheduler then selects one instruction engine from the candidate instruction engines as the first instruction engine.
  • the candidate instruction engines may be predetermined, that is, they may be obtained from the instruction block table in the instruction block scheduler.
  • the mapping relationship between each instruction block and the instruction engines can generally be allocated and configured according to the characteristics of the program to be executed; for example, it can be allocated so that the number of instructions processed by each instruction engine is balanced.
  • the instruction engine group may include a first candidate instruction engine group.
  • the instruction block scheduler obtains the candidate instruction engines of the first instruction set according to the instruction processing request, which may include: if the first instruction set is an instruction set on a non-performance path, the instruction block scheduler uses the instruction engines in the first candidate instruction engine group as the candidate instruction engines of the first instruction set.
  • the instruction block on the non-performance path is mainly used to handle exceptions and process protocol packets. If the instruction block to be executed is on the non-performance path, regardless of whether the instruction engines are congested, the instruction engine is selected directly from the instruction engine group (e.g., the static IE group) pre-assigned in the instruction block table, without expanding to other instruction engines, thereby further reducing the instruction execution cost and power consumption of the processor.
  • the instruction engine group may include a first candidate instruction engine group and a second instruction engine group.
  • the instruction block scheduler obtains the candidate instruction engines of the first instruction set according to the instruction processing request, which may include: if the first instruction set is an instruction set on the performance path, the instruction block scheduler uses an instruction engine in the first candidate instruction engine group or in the second instruction engine group as the candidate instruction engine of the first instruction set.
  • the first candidate instruction engine group is a pre-configured static IE group in the instruction block table; the second instruction engine group is the set of instruction engines in the processor other than the IEs in the static IE group.
  • the instruction block on the performance path belongs to the part of the program that is mainly executed. If the instruction block to be executed is on the performance path, an instruction engine is preferentially selected from the pre-allocated first candidate instruction engine group in the instruction block table. If all the instruction engines in the pre-assigned first candidate instruction engine group are in a congested state, other instruction engines of the processor, such as those in the second instruction engine group, can be selected, so as to ensure that there are enough resources to process the instruction blocks on the performance path and further improve the instruction execution efficiency of the processor.
  • the instruction block scheduler uses the instruction engine in the first candidate instruction engine group or the instruction engine in the second instruction engine group as the candidate instruction engine of the first instruction set, which may include: if the first condition is satisfied, the instruction block scheduler uses the instruction engines in the first candidate instruction engine group as the candidate instruction engines of the first instruction set.
  • the first condition may be: in the first candidate instruction engine group, there is at least one instruction engine whose corresponding instruction processing request queue has a queue depth lower than the first preset threshold.
  • if the second condition is satisfied, the instruction block scheduler uses the instruction engines in the second instruction engine group as the candidate instruction engines of the first instruction set.
  • the second condition may be: in the first candidate instruction engine group, the queue depths of the instruction processing request queues corresponding to all instruction engines exceed the first preset threshold.
  • whether the instruction engines in the first candidate instruction engine group are congested is determined by the preconfigured first preset threshold.
  • the first preset threshold can be changed according to the actual situation of the processor, so that the instruction execution efficiency of the processor can be improved while keeping the power consumption of the processor as low as possible.
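The first and second conditions above amount to a threshold test on queue depths. The sketch below illustrates that logic under stated assumptions; the function name, the list-based groups, and the dict-based queue-depth model are all hypothetical:

```python
def candidate_engines(static_group, second_group, depths, threshold):
    """Pick the candidate IE group for a performance-path block.

    First condition: at least one IE in the static (first candidate)
    group has a request-queue depth below the threshold -> use the
    static group. Second condition: every static-group queue exceeds
    the threshold -> fall back to the second instruction engine group.
    """
    if any(depths[ie] < threshold for ie in static_group):
        return static_group
    return second_group

depths = {0: 9, 1: 8, 2: 1, 3: 0}
# static group fully congested at threshold 4 -> fall back
fallback = candidate_engines([0, 1], [2, 3], depths, threshold=4)
# IE 2's shallow queue keeps the static group eligible
static = candidate_engines([0, 2], [1, 3], depths, threshold=4)
```

How a depth exactly equal to the threshold is classified is not specified in the source; this sketch treats it as congested.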
  • the second instruction engine group may include a second candidate instruction engine group and a third candidate instruction engine group.
  • the instruction block scheduler uses the instruction engine in the second instruction engine group as the candidate instruction engine of the first instruction set, which may include: using the instruction engines in the second candidate instruction engine group as the candidate instruction engines of the first instruction set. If the third condition is satisfied, the instruction block scheduler adds at least one instruction engine in the third candidate instruction engine group to the second candidate instruction engine group.
  • the third condition may be: the second candidate instruction engine group is empty, or the queue depths of the instruction processing request queues corresponding to all instruction engines in the second candidate instruction engine group exceed the second preset threshold.
  • the second instruction engine group can be divided into a second candidate instruction engine group and a third candidate instruction engine group, wherein the second candidate instruction engine group is a dynamic IE group and is in an enabled state; the third candidate instruction engine group is a disabled instruction engine group.
  • when the enabled instruction engines in the processor have few execution tasks, some of the instruction engines in the processor may be disabled, thereby reducing the power consumption of the processor.
  • the instruction engines in the third candidate instruction engine group may be enabled to improve the instruction execution efficiency of the processor.
  • the instruction block scheduler selects, in the third candidate instruction engine group, at least one instruction engine whose corresponding instruction processing request queue has a queue depth lower than the third preset threshold, and adds it to the second candidate instruction engine group.
  • when extending instruction engines from the third candidate instruction engine group, the instruction engines whose corresponding instruction processing request queues have queue depths lower than the third preset threshold are selected; that is, non-congested instruction engines are expanded, so as to improve the efficiency of instruction execution.
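The expansion step (third condition plus third threshold) described above can be sketched as follows. This is an illustrative policy only: the source requires adding "at least one" engine, and this sketch happens to add every non-congested disabled engine; all names are hypothetical:

```python
def expand_dynamic_group(dynamic_group, disabled_group, depths,
                         second_threshold, third_threshold):
    """If the dynamic (second candidate) group is empty, or every IE
    in it is congested (queue depth over the second threshold), enable
    the non-congested IEs (depth below the third threshold) from the
    disabled (third candidate) group and move them into the dynamic
    group."""
    congested = all(depths[ie] > second_threshold for ie in dynamic_group)
    if not dynamic_group or congested:
        movable = [ie for ie in disabled_group
                   if depths[ie] < third_threshold]
        for ie in movable:
            disabled_group.remove(ie)
            dynamic_group.append(ie)
    return dynamic_group, disabled_group

# empty dynamic group: only the idle IE 4 qualifies for enabling
dyn, dis = expand_dynamic_group([], [4, 5], {4: 0, 5: 7},
                                second_threshold=6, third_threshold=2)
```

Enabling only shallow-queue engines keeps the expansion from moving congestion around instead of relieving it.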
  • the instruction processing method in the first aspect may further include: the instruction block scheduler records the instruction engine selection difference. If the instruction engine selection difference exceeds the fourth preset threshold, the instruction block scheduler deletes all instruction engines in the second candidate instruction engine group.
  • the instruction engine selection difference is used to indicate the difference between the number of times an instruction engine is selected from the first candidate instruction engine group and the number of times an instruction engine is selected from the second candidate instruction engine group.
  • when the number of times the instruction block scheduler selects an instruction engine from the second candidate instruction engine group, that is, the dynamic IE group, is very small, the instruction engines in the dynamic IE group can be deleted (disabled), thereby reducing the power consumption of the processor.
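The selection-difference mechanism can be sketched with a signed counter, as below. The exact counter semantics (signed difference, reset on shrink) are assumptions for illustration, not taken from the source:

```python
class SelectionTracker:
    """Tracks how often IEs are chosen from the static (first
    candidate) group versus the dynamic (second candidate) group.
    When the static group is chosen far more often (difference over
    the fourth preset threshold), the dynamic group is emptied so its
    extra IEs can be disabled again."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.diff = 0  # static-group picks minus dynamic-group picks

    def record(self, from_static):
        self.diff += 1 if from_static else -1

    def maybe_shrink(self, dynamic_group):
        if self.diff > self.threshold:
            dynamic_group.clear()  # delete all dynamic IEs
            self.diff = 0
        return dynamic_group

tracker = SelectionTracker(threshold=3)
for _ in range(5):          # five consecutive static-group picks
    tracker.record(from_static=True)
shrunk = tracker.maybe_shrink([7, 8])
```

Once emptied, the dynamic group is repopulated only when the third condition fires again, so rarely-used engines stay powered down.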
  • the instruction block scheduler selects an instruction engine from the candidate instruction engines as the first instruction engine, which may include: the instruction block scheduler obtains the queue depths of the instruction processing request queues corresponding to the candidate instruction engines, and selects the candidate instruction engine corresponding to the instruction processing request queue with the smallest queue depth as the first instruction engine.
  • the instruction engine that finally executes the first instruction set is determined from the candidate instruction engines by selecting the candidate instruction engine corresponding to the instruction processing request queue with the smallest queue depth, so that the instruction load of each instruction engine is kept low, which is beneficial to improving the utilization rate of processor resources and the execution efficiency of the processor.
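The final selection is a shallowest-queue pick, which reduces to a one-line minimum over queue depths (function and parameter names are illustrative):

```python
def select_first_engine(candidates, depths):
    """Among the candidate IEs, pick the one whose instruction
    processing request queue is shallowest (least loaded).
    `depths` maps IE id -> current queue depth."""
    return min(candidates, key=lambda ie: depths[ie])

# IE 1 has the shallowest queue among the candidates
chosen = select_first_engine([0, 1, 2], {0: 5, 1: 2, 2: 7})
```

Note that Python's `min` breaks ties by taking the first candidate; the source does not specify a tie-breaking rule.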
  • the instruction processing method in the first aspect may further include: when the first instruction cache detects the end marker of the first instruction set, the first instruction cache sends scheduling information to the instruction block scheduler, and the scheduling information is used to indicate that the first instruction engine can process the next instruction processing request.
  • the polling scheduler in the instruction block scheduler can take out the next instruction processing request from the instruction processing request queue corresponding to the first instruction engine for processing by the first instruction engine, so that the instructions can be executed in sequence, thereby improving the efficiency of the processor.
  • In a second aspect, embodiments of the present application provide a processor.
  • the processor includes: an instruction block scheduler, an instruction cache group and an instruction engine group.
  • the instruction cache group includes multiple instruction buffers, and the instruction engine group includes multiple instruction engines; the multiple instruction caches in the instruction cache group correspond one-to-one to the multiple instruction engines in the instruction engine group.
  • the instruction block scheduler is used to receive an instruction processing request; the instruction processing request is used to request the processor to process the first instruction set.
  • the instruction block scheduler is used to determine the first instruction engine according to the instruction processing request; the first instruction engine is the instruction engine in the instruction engine group that processes the first instruction set, and the first instruction engine corresponds to the first instruction buffer in the instruction cache group.
  • the instruction block scheduler is configured to send the instruction processing request to the first instruction buffer corresponding to the first instruction engine.
  • the first instruction engine is used to obtain the first instruction set from the first instruction buffer.
  • the processor further includes multiple instruction processing request queues, which are in one-to-one correspondence with the multiple instruction engines and in one-to-one correspondence with the multiple instruction buffers; the instruction block scheduler is configured to determine the first instruction buffer corresponding to the first instruction engine according to the instruction processing request queue corresponding to the first instruction engine. In this way, through the instruction processing request queues, the one-to-one correspondence between the multiple instruction engines and the multiple instruction buffers can be realized.
  • the instruction block scheduler may be specifically configured to: obtain the candidate instruction engines of the first instruction set according to the instruction processing request, a candidate instruction engine being an instruction engine that can be used to process the first instruction set; and select an instruction engine from the candidate instruction engines as the first instruction engine.
  • the instruction engine group may include a first candidate instruction engine group.
  • the instruction block scheduler is specifically configured to, if the first instruction set is an instruction set on a non-performance path, use the instruction engines in the first candidate instruction engine group as the candidate instruction engines of the first instruction set.
  • the instruction engine group may include a first candidate instruction engine group and a second instruction engine group.
  • the instruction block scheduler is specifically configured to, if the first instruction set is an instruction set on the performance path, use the instruction engine in the first candidate instruction engine group or the instruction engine in the second instruction engine group as the candidate instruction engine of the first instruction set.
  • the instruction block scheduler can be specifically configured to: if the first condition is satisfied, the instruction engine in the first candidate instruction engine group is used as the candidate instruction engine of the first instruction set.
  • the first condition may be: in the first candidate instruction engine group, there is at least one instruction engine whose corresponding instruction processing request queue has a queue depth lower than the first preset threshold.
  • if the second condition is satisfied, the instruction engines in the second instruction engine group are used as the candidate instruction engines of the first instruction set.
  • the second condition may be: in the first candidate instruction engine group, the queue depths of the instruction processing request queues corresponding to all instruction engines exceed the first preset threshold.
  • the second instruction engine group may include a second candidate instruction engine group and a third candidate instruction engine group.
  • the instruction block scheduler may be specifically configured to: use the instruction engine in the second candidate instruction engine group in the second instruction engine group as the candidate instruction engine of the first instruction set. If the third condition is satisfied, at least one instruction engine in the third candidate instruction engine group is added to the second candidate instruction engine group.
  • the third condition may be: the second candidate instruction engine group is empty, or the queue depths of the instruction processing request queues corresponding to all instruction engines in the second candidate instruction engine group exceed the second preset threshold.
  • the instruction block scheduler may be specifically configured to select, in the third candidate instruction engine group, at least one instruction engine whose corresponding instruction processing request queue has a queue depth lower than the third preset threshold, and add it to the second candidate instruction engine group.
  • the instruction block scheduler may also be used to: record the instruction engine selection difference. If the instruction engine selection difference exceeds the fourth preset threshold, all instruction engines in the second candidate instruction engine group are deleted. The instruction engine selection difference is used to indicate the difference between the number of times an instruction engine is selected from the first candidate instruction engine group and the number of times an instruction engine is selected from the second candidate instruction engine group.
  • the instruction block scheduler may be specifically configured to: obtain the queue depth of the instruction processing request queue corresponding to the alternative instruction engine.
  • the candidate instruction engine corresponding to the instruction processing request queue with the smallest queue depth is selected as the first instruction engine.
  • the first instruction buffer may also be used to send scheduling information to the instruction block scheduler when the end marker of the first instruction set is detected, and the scheduling information is used to indicate that the first instruction engine can process the next instruction processing request.
  • the cache length of a single cache unit in each instruction cache in the instruction cache group is consistent with the number of instructions that can be processed in a single execution cycle of the corresponding instruction engine.
  • the cache length of a single cache unit in each instruction buffer refers to the length of the cache line in each cache memory; the number of instructions that can be processed in a single execution cycle of the instruction engine refers to the size of the arithmetic logic unit (ALU) array that the instruction engine can process in a single execution cycle.
  • for example, if the size of the ALU array that the instruction engine IE can process in a single execution cycle is 4 instructions, the length of the cache line in the cache memory corresponding to that instruction engine is also designed to cache 4 instructions. In this way, an instruction queue (Inst Q) is no longer needed between the instruction engine IE and the cache memory to buffer instructions, thereby reducing cost and power consumption.
  • In a third aspect, embodiments of the present application provide an electronic device.
  • the electronic device includes a processor, and a memory coupled to the processor, where the processor is the processor provided by any one of the possible implementations of the second aspect above.
  • any processor or electronic device provided above is used to execute the instruction processing method based on the multi-instruction engine provided in the first aspect. Therefore, for the beneficial effects it can achieve, reference may be made to the beneficial effects of the instruction processing method based on the multi-instruction engine provided in the first aspect, which will not be repeated here.
  • FIG. 1 is a schematic structural diagram of a processor of a multi-cache multi-instruction engine
  • FIG. 2 is a schematic structural diagram of a processor of a multi-cache memory multi-instruction engine provided by an embodiment of the present application;
  • FIG. 3 is a schematic diagram of an instruction block allocation scheme provided by an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of an instruction processing method based on a multi-instruction engine provided by an embodiment of the present application
  • FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
  • FIG. 1 is a schematic structural diagram of a processor of a multi-cache multi-instruction engine.
  • the processor is composed of an instruction buffer (IBUF), a plurality of slice-based caches, a crossbar, and a plurality of instruction engines (IEs).
  • instructions are stored as follows: taking slice-based cache 0 to slice-based cache 15 as an example, and assuming the length of each cache line in the cache is 8 instructions: if the instruction corresponding to an instruction pointer (program counter, PC) is stored in slice-based cache 0, the eighth instruction after it (that is, the instruction corresponding to PC+8) is stored in slice-based cache 1, and so on; if an instruction corresponding to a PC is stored in slice-based cache 15, the instruction corresponding to PC+8 is stored in slice-based cache 0.
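The interleaving rule above maps a PC to a slice by its cache-line index modulo the number of slices. A minimal sketch, assuming the 8-instruction lines and 16 slices of the example (the function name and parameters are illustrative):

```python
def slice_index(pc, line_len=8, num_slices=16):
    """Map an instruction pointer (in instruction units) to its
    slice-based cache.

    With 8-instruction cache lines striped across 16 slices, the
    instruction at PC+8 lands in the next slice, wrapping from
    slice 15 back to slice 0."""
    return (pc // line_len) % num_slices

assert slice_index(0) == 0        # first line -> slice 0
assert slice_index(8) == 1        # PC+8 -> next slice
assert slice_index(15 * 8) == 15  # last slice
assert slice_index(16 * 8) == 0   # wraps around to slice 0
```

This striping is what forces sequential instruction fetch to hop across caches through the crossbar, which the proposed one-buffer-per-IE design avoids.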
  • PC instruction pointer
  • the instruction buffer IBUF caches the instruction fetch requests in an instruction first-in-first-out (IFIFO) queue.
• a dispatcher (DISP) in the instruction buffer IBUF reads an instruction fetch request from the IFIFO queue and distributes it to the execution thread corresponding to an instruction engine IE; each instruction engine IE corresponds to one execution thread.
• the dispatcher DISP sends the instruction pointer PC in the instruction fetch request to the corresponding cache memory, so that the instruction is read according to the PC. After receiving the PC, the cache memory passes it through a scheduler (SCH) into the cache pipeline; in the pipeline, the tag lookup controller (tag lookup) first searches the tag table, and the result is then arbitrated by the arbitration module (arbiter, ARB).
  • the tag table is used to record the correspondence between the instruction pointer PC and the cached instruction data in the cache memory.
• the arbitration module ARB is used to judge the hit result obtained from the tag table and confirm whether the instruction corresponding to the PC hits in the cache memory. After arbitration, if the instruction corresponding to the PC hits in the cache, the instruction is obtained from the cache data module in the cache and sent to the bound IE through the crossbar matrix.
• if the instruction misses, a refill request (also known as a backfill request) is issued to the instruction memory (IMEM); the refill request is used to request the instructions from the IMEM and fill them into the corresponding cache memory.
  • the tag table is updated after the cache fetches the instruction from the IMEM.
• the PC+8 initiates an instruction fetch request to the next cache memory, and the scheduler of the latter cache memory schedules the instruction fetch request according to the state table so that it enters the pipeline of the next cache memory (the process after entering the pipeline of the cache memory is as described above and is not repeated here).
• the scheduler of the cache memory initiates the instruction backfill request according to the state table, fills the corresponding instructions into the cache memory's data module, records the relevant information of the instructions in the tag table, and sends the instructions to the IE via the crossbar matrix.
• if the EI flag exists in the instruction fetched from the cache, the current PC has completed its fetch. If the execution thread corresponding to the current PC still has a PC waiting for instruction fetching, the execution thread continues fetching with the new PC; if there is no PC waiting for instruction fetch in the execution thread, the current instruction fetch request has completed all of its fetch operations and the next instruction fetch request can be processed.
• "At least one" means one or more, and "multiple" means two or more.
• "And/or" describes the association relationship of the associated objects and indicates that three relationships may exist; for example, A and/or B may indicate: A exists alone, A and B exist at the same time, or B exists alone, where A and B may be singular or plural.
• "At least one item(s) of the following" or similar expressions refer to any combination of these items, including any combination of a single item or plural items.
• For example, at least one of a, b or c may represent: a, b, c, a-b, a-c, b-c or a-b-c, where a, b and c may be single or multiple.
• The character "/" generally indicates an "or" relationship between the associated objects.
• Words such as "first" and "second" do not limit the quantity or execution order.
• Performance path: refers to the most important execution path in a program.
• Non-performance path: refers to an execution path in a program for handling exceptions or protocol packets.
  • FIG. 2 is a schematic structural diagram of a processor of a multi-cache memory multi-instruction engine provided by an embodiment of the present application, the processor includes: an instruction block dispatcher (program block dispatcher, PBD), an instruction cache group (instruction cache group, ICG) and instruction engine group (instruction engine group, IEG).
  • the instruction cache group includes multiple instruction caches, and the instruction cache may use a cache memory (cache), such as cache0-cache15;
• the instruction engine group includes multiple instruction engines (IE), such as IE0-IE15;
• the multiple instruction caches in the instruction cache group correspond one-to-one to the multiple instruction engines in the instruction engine group; that is, each IE has an exclusive instruction cache and the IE and the instruction cache are bound one to one, so that each IE has a stable and deterministic instruction fetch bandwidth.
  • the processor may further include multiple instruction processing request queues, and multiple instruction processing request queues are in one-to-one correspondence with multiple instruction engines; and multiple instruction processing request queues are in one-to-one correspondence with multiple instruction buffers. That is to say, each IE may correspond to an instruction processing request queue (IE-based queue), and each instruction request queue corresponds to an instruction buffer, thereby realizing a one-to-one correspondence between the instruction processing request queue and the instruction buffer.
  • the instruction engine and the corresponding instruction buffer may be connected through a hardware interface to transmit instructions. It should be understood that all instruction processing request queues can be managed by the queue manager QM.
  • the cache length of a single cache unit in each instruction cache in the instruction cache group may be consistent with the number of instructions that can be processed in a single execution cycle of the corresponding instruction engine.
• the cache length of a single cache unit in each instruction buffer refers to the length of the cache line in each cache memory; the number of instructions that can be processed in a single execution cycle of the instruction engine refers to the size of the arithmetic logic unit (ALU) array that the instruction engine can process in a single execution cycle.
• for example, if the size of the ALU array that the instruction engine IE can process in a single execution cycle is 4 instructions, the length of the cache line in the cache memory corresponding to the instruction engine is also designed to cache 4 instructions. In this way, an instruction queue (Inst Q) is no longer needed between the instruction engine IE and the cache memory to cache instructions, thereby reducing cost and power consumption.
  • an instruction queue can be set between the instruction cache and the IE to cache instructions, so that the instructions are sequentially executed by the IE.
• the processor shown in FIG. 2 does not need a crossbar matrix, and the bound instruction buffer can send instructions to its instruction engine directly.
• the instruction cache length of a single cache unit is set to be the same as the number of instructions that can be processed in a single instruction cycle of the instruction engine, which can eliminate the instruction queue Inst Q, thereby reducing the complexity of the processing flow and reducing the cost and power consumption of the processor.
  • the processor in Figure 2 involves an instruction block scheduler.
• the program can first be divided into several program blocks based on the execution order of the program, each program block serving as an instruction block.
  • the instruction block is assigned an identifier (ID), which is called the program block identifier (PBID).
  • the program can be divided into multiple different phases (phases). After the program is divided into several instruction blocks, each instruction block has only one PBID.
• the PBIDs assigned to instruction blocks in different stages may be different or the same; the PBIDs assigned to different instruction blocks in the same stage are generally different.
  • the program can be divided into instruction blocks on the performance path and instruction blocks on the non-performance path.
• the instruction blocks in different stages on the same performance path can be distributed to all IEs as evenly as possible; for example, each instruction block in a different stage on the same performance path is allocated to a different IE.
  • each instruction block at different stages on the non-performance path can also be evenly distributed to all IEs.
• in this way, the processing of the instruction blocks on the same performance path can be prevented from being concentrated on a subset of the IEs, which would increase the processing burden of those IEs and of their corresponding instruction buffers (such as the cache memories) and reduce the cache memory hit ratio; this improves the execution efficiency and resource utilization of the processor.
• since different instruction blocks in the same stage are usually located in different branches and are generally not executed at the same time, different instruction blocks in the same stage can be allocated to the same IE.
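• The allocation principle above can be sketched as follows (an illustrative sketch, not the patent's exact algorithm; stage and block names are made up): blocks in different stages of the same performance path are spread across IEs round-robin, while blocks of the same stage (different branches) share one IE.

```python
# Spread stages of one performance path across IEs round-robin; blocks in
# the same stage (mutually exclusive branches) share that stage's IE.

NUM_IES = 16

def assign_ies(stages: list[list[str]]) -> dict[str, int]:
    """stages[i] lists the instruction blocks of stage i; each stage gets
    the next IE in round-robin order, shared by all blocks of the stage."""
    mapping = {}
    for stage_idx, blocks in enumerate(stages):
        ie = stage_idx % NUM_IES
        for block in blocks:
            mapping[block] = ie
    return mapping

# Three stages; stage 1 has two branch blocks that share IE1.
print(assign_ies([["A"], ["B0", "B1"], ["C"]]))
# -> {'A': 0, 'B0': 1, 'B1': 1, 'C': 2}
```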
  • FIG. 3 is a schematic diagram of an instruction block allocation solution provided by an embodiment of the present application.
• the instruction block scheduler includes an instruction block table (program block table, PBT), a lookup table controller (LTC), a queue manager (queue management, QM) and a round robin (RR) scheduler.
  • the instruction block table PBT is preconfigured before the program is executed, for example, it can be generated in the process of compiling the program.
  • the following fields can be included in the PBT:
• PERF: the performance path field, which can be used to indicate whether the corresponding instruction block is on the performance path, occupying 1 bit; 1 indicates that the corresponding instruction block is on the performance path, and 0 indicates that it is on a non-performance path.
• SF_BM: a bitmap-based static flag, which can be used to specify whether the mapping relationship between the corresponding instruction block and each IE is static or dynamic (taking 16 IEs, i.e. IE0-IE15, as an example), occupying 16 bits. Each bit corresponds to an IE, that is, SF_BM[0] corresponds to IE0, SF_BM[1] corresponds to IE1, and so on.
• a static mapping relationship will not be changed, while the mapping relationship between an instruction block and an IE in a dynamic mapping relationship can be changed according to the congestion state of the IE.
• IE_BM: an IE bitmap, which can be used to specify the mapping relationship between the instruction block and the IEs, occupying 16 bits. Each bit corresponds to an IE, that is, IE_BM[0] corresponds to IE0, IE_BM[1] corresponds to IE1, and so on.
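• Decoding the 16-bit bitmap fields above can be sketched as follows (the concrete bitmap values are made up for illustration; the patent only specifies that bit i corresponds to IEi):

```python
# Decode a 16-bit per-IE bitmap: bit i corresponds to IEi
# (e.g. IE_BM[0] -> IE0, SF_BM[1] -> IE1, ...).

def ies_in_bitmap(bitmap: int, num_ies: int = 16) -> list[int]:
    """Return the IE indices whose bit is set in the bitmap."""
    return [i for i in range(num_ies) if (bitmap >> i) & 1]

IE_BM = 0b0000_0000_0000_1010   # hypothetical: block mapped to IE1 and IE3
SF_BM = 0b0000_0000_0000_0010   # hypothetical: only the IE1 mapping is static

mapped = ies_in_bitmap(IE_BM)
static = [ie for ie in ies_in_bitmap(SF_BM) if ie in mapped]
dynamic = [ie for ie in mapped if ie not in static]
print(mapped, static, dynamic)   # -> [1, 3] [1] [3]
```

• With this reading, the static IE group for a block is the intersection of IE_BM and SF_BM, and the dynamically mapped IEs are the remaining mapped IEs.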
• DIFF_CNT: records the difference between the number of times the instruction block selects an IE from the statically mapped IEs and the number of times it selects an IE from the dynamically mapped IEs; it can be used to indicate whether the IEs that have a dynamic mapping relationship with the instruction block need to be deleted. When an IE is selected from the statically mapped IEs, DIFF_CNT is increased by 1; when an IE is selected from the dynamically mapped IEs, DIFF_CNT is decreased by 1.
• when determining the IE for executing the instructions, an IE is preferentially selected from the set of IEs that have a static mapping relationship with the instruction block to be executed; when that set does not meet the requirements, an IE can be selected from the set of IEs that have a dynamic mapping relationship with the instruction block to be executed.
  • IEs can be divided into three types, namely static IE groups, dynamic IE groups and disabled IE groups.
• the instruction block scheduler searches the instruction block table through the lookup table controller to determine the mapping relationship between the instruction block and the IEs in the instruction engine group, then determines the IE that will finally execute the instruction block according to the queue depth of the instruction processing request queue corresponding to each IE, and then adds the instruction processing request of the instruction block to the instruction processing request queue corresponding to that IE to wait for the IE to execute it.
• the instruction processing request queues are managed by the queue manager QM; the instruction processing requests in the instruction processing request queues are scheduled by the round robin (RR) scheduler.
  • the processor shown in FIG. 2 may further include an input scheduler (input scheduler, IS) and an output scheduler (output scheduler, OS).
• the input scheduler IS is used to receive data from the previous module, which may include an instruction processing request, and to schedule and allocate the instruction processing request to the instruction block scheduler;
• the output scheduler OS is used to receive the processing result of an instruction block and to judge whether the execution of the entire program is completed; it schedules the instruction processing requests and instruction processing results according to the execution sequence of the instruction blocks of the entire program.
  • the input scheduler and the output scheduler can also be designed as one scheduler, such as the input and output scheduler, which simultaneously implements all the functions of the input scheduler and the output scheduler.
  • FIG. 4 is a schematic flowchart of an instruction processing method based on a multi-instruction engine provided by an embodiment of the present application, which can be applied to the processor shown in FIG. 2 , and the method includes the following steps.
  • the instruction block scheduler receives an instruction processing request; the instruction processing request is used to request the processor to process the first instruction set.
  • the instruction processing request may include the PBID corresponding to the first instruction set and the instruction pointer (program counter, PC) corresponding to the first instruction set, and the first instruction set is the set of all instructions in an instruction block; the first instruction set corresponds to The PC can be used to index the instructions in the first instruction set.
• the instruction block scheduler determines a first instruction engine according to the instruction processing request; the first instruction engine is the instruction engine in the instruction engine group that processes the first instruction set, and it corresponds to the first instruction buffer in the instruction cache group.
  • the instruction block scheduler obtains an alternative instruction engine of the first instruction set according to the instruction processing request; the alternative instruction engine is an instruction engine that can be used to process the first instruction set.
  • the instruction block scheduler selects an instruction engine from the alternative instruction engines as the first instruction engine.
  • the alternative instruction engine may be predetermined, such as determined by the above-mentioned PBT table, that is, the alternative instruction engine may be all IEs in the static IE group to which the instruction block to which the first instruction set belongs is mapped;
• the alternative instruction engine may also be an IE dynamically added from the dynamic IE group mapped to the instruction block to which the first instruction set belongs, according to the congestion state of the instruction processing request queues corresponding to the IEs in the static IE group.
  • the instruction engine group can be divided into a first candidate instruction engine group and a second instruction engine group.
  • the first candidate instruction engine group may be a set of all IEs that have a static mapping relationship with the instruction block to which the first instruction set belongs, that is, the first candidate instruction engine group may be a static IE group.
  • the second instruction engine group may be a set of all IEs that have a dynamic mapping relationship with the instruction block to which the first instruction set belongs, that is, the second instruction engine group may be a combination of a dynamic IE group and a disabled IE group.
  • the instruction block scheduler uses the instruction engine of the first candidate instruction engine group as the candidate instruction engine of the first instruction set.
  • the instruction block scheduler takes the instruction engine in the first candidate instruction engine group or the instruction engine in the second instruction engine group as the candidate instruction engine of the first instruction set.
  • the IE for executing the instruction set may be selected directly according to the pre-configured mapping relationship between the instruction block and the IE, that is, the IE (IE in the static IE group) for which the instruction block has a static mapping relationship is selected.
  • the execution of the instruction block (instruction set) on the performance path generally requires more traffic and resources, so the IE for executing the instruction block may be preferentially selected from the IEs that have a static mapping relationship with the instruction block.
  • the IE that executes the instruction block may be selected from the IEs that have a dynamic mapping relationship with the instruction block. In this way, in a processor with multiple IEs, under the condition of ensuring the lowest power consumption of the processor, each instruction block can be executed more evenly in each IE as much as possible, thereby improving the execution efficiency of the processor on the program, and Improve the utilization of processor resources.
  • the candidate instruction engine can be determined as follows:
  • the instruction block scheduler takes the instruction engine in the first candidate instruction engine group as the candidate instruction engine of the first instruction set.
• the first condition is: in the first candidate instruction engine group, there is at least one instruction engine whose corresponding instruction processing request queue has a queue depth lower than the first preset threshold.
  • the instruction block scheduler uses the instruction engine in the second instruction engine group as a candidate instruction engine of the first instruction set.
• the second condition is: in the first candidate instruction engine group, the queue depths of the instruction processing request queues corresponding to all instruction engines exceed the first preset threshold.
• if the queue depth of the instruction processing request queue corresponding to an IE reaches the first preset threshold, such as 16, the IE is in a congested state and is not suitable for being assigned a new instruction set to execute. If at least one IE in the first candidate instruction engine group is not in a congested state, the instruction engines in the first candidate instruction engine group can still be used as the candidate instruction engines of the first instruction set; if no instruction engine in the first candidate instruction engine group is free of congestion, the instruction engines in the second instruction engine group can be used as the candidate instruction engines of the first instruction set.
  • the first preset threshold is preset and is a specified value for determining whether the IE is congested, and the specified value can be changed according to the actual processing situation of the processor.
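• The first and second conditions above can be sketched as follows (queue depths and the threshold value are illustrative, not prescribed by the patent):

```python
# Pick the candidate group: the static group if at least one of its IEs has
# a request queue shallower than the threshold (first condition), otherwise
# fall back to the dynamic group (second condition).

FIRST_THRESHOLD = 16  # example value for the first preset threshold

def pick_candidates(static_group, dynamic_group, queue_depth,
                    threshold=FIRST_THRESHOLD):
    """Return the group from which the first instruction engine is chosen."""
    if any(queue_depth[ie] < threshold for ie in static_group):
        return static_group
    return dynamic_group

depths = {0: 16, 1: 7, 2: 16, 3: 2}             # hypothetical queue depths
print(pick_candidates([0, 1], [2, 3], depths))  # IE1 not congested -> [0, 1]
print(pick_candidates([0, 2], [3], depths))     # all static congested -> [3]
```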
• when the instruction block scheduler uses an instruction engine in the second instruction engine group as the candidate instruction engine of the first instruction set, it uses an instruction engine in the second candidate instruction engine group within the second instruction engine group as the candidate instruction engine of the first instruction set.
  • the instruction block scheduler adds at least one instruction engine in the third candidate instruction engine group to the second candidate instruction engine group.
• the third condition may be: the second candidate instruction engine group is empty, or the queue depths of the instruction processing request queues corresponding to all instruction engines in the second candidate instruction engine group exceed the second preset threshold.
  • the instruction block scheduler uses the instruction engine in the second instruction engine group as the candidate instruction engine of the first instruction set, it preferentially selects IEs from the second candidate instruction engine group, that is, from the dynamic IE group.
• when there is no IE that meets the requirements in the second candidate instruction engine group, an IE that meets the requirements is selected from the third candidate instruction engine group and added to the second candidate instruction engine group; that is, an IE that meets the requirements is selected even from the disabled IE group, and the state of that IE is changed to the enabled state.
  • the congestion state of the IE can also be used as a judgment condition.
  • a second preset threshold can be preset for judging whether the IE is congested.
• only a non-congested IE should be added during the extension. That is, the instruction block scheduler selects, in the third candidate instruction engine group, at least one instruction engine whose corresponding instruction processing request queue has a queue depth lower than the third preset threshold, and adds it to the second candidate instruction engine group.
  • the third preset threshold is also preset and is a specified value for judging whether the IE is congested, and this value can also be changed according to the actual processing situation of the processor.
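• The expansion step above (third condition plus the third preset threshold) can be sketched as follows; the threshold values and queue depths are illustrative assumptions:

```python
# When the second candidate (dynamic) group is empty or fully congested,
# promote a non-congested IE from the third candidate group into it.

SECOND_THRESHOLD = 16   # example second preset threshold
THIRD_THRESHOLD = 16    # example third preset threshold

def expand_dynamic_group(second_group, third_group, queue_depth):
    """Add at least one non-congested IE from the third candidate group to
    the second candidate group when the third condition holds."""
    need_expand = (not second_group or
                   all(queue_depth[ie] >= SECOND_THRESHOLD
                       for ie in second_group))
    if need_expand:
        for ie in third_group:
            if queue_depth[ie] < THIRD_THRESHOLD:   # non-congested IE
                second_group.append(ie)
                break
    return second_group

depths = {4: 16, 5: 3, 6: 1}
print(expand_dynamic_group([4], [5, 6], depths))  # IE4 congested -> add IE5
```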
• the first preset threshold, the second preset threshold and the third preset threshold are the specified values used to judge whether the instruction engines in the first candidate instruction engine group, the second candidate instruction engine group and the third candidate instruction engine group, respectively, are congested; the three thresholds may be the same or different.
  • the instruction block scheduler records the instruction engine selection difference.
• the instruction engine selection difference indicates the difference between the number of times an instruction engine is selected from the first candidate instruction engine group and the number of times an instruction engine is selected from the second candidate instruction engine group; that is, it represents the difference between the number of times IEs are selected from the static IE group and the number of times IEs are selected from the dynamic IE group. Therefore, the instruction engine selection difference can be recorded through the DIFF_CNT field in the instruction block table.
• when the instruction engine selection difference exceeds the fourth preset threshold, the instruction block scheduler deletes all instruction engines in the second candidate instruction engine group. The fourth preset threshold is a value determined according to the actual situation, for example, 500.
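• The DIFF_CNT bookkeeping described above can be sketched as follows: +1 when an IE is chosen from the static group, -1 when chosen from the dynamic group, and the dynamic group is cleared once the difference exceeds the fourth preset threshold. The value 500 follows the example in the text; resetting the counter after cleanup is an assumption of this sketch:

```python
# Per-instruction-block DIFF_CNT tracking and dynamic-group cleanup.

FOURTH_THRESHOLD = 500  # example fourth preset threshold from the text

class BlockEntry:
    def __init__(self):
        self.diff_cnt = 0
        self.dynamic_group = []   # second candidate instruction engine group

    def record_selection(self, from_static: bool):
        self.diff_cnt += 1 if from_static else -1
        if self.diff_cnt > FOURTH_THRESHOLD:
            # Static IEs have sufficed for a long stretch, so the
            # dynamically added IEs are no longer needed.
            self.dynamic_group.clear()
            self.diff_cnt = 0     # assumed: counter restarts after cleanup

entry = BlockEntry()
entry.dynamic_group = [7]         # hypothetical dynamically added IE
for _ in range(501):              # 501 consecutive static selections
    entry.record_selection(from_static=True)
print(entry.dynamic_group)        # -> []
```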
  • the instruction block scheduler sends the instruction processing request to the first instruction buffer corresponding to the first instruction engine.
  • the instruction block scheduler may include multiple instruction processing request queues, the multiple processing request queues are in one-to-one correspondence with multiple instruction engines, and the multiple instruction processing request queues are in one-to-one correspondence with multiple instruction buffers.
• the instruction block scheduler can determine the first instruction buffer corresponding to the first instruction engine according to the instruction processing request queue corresponding to the first instruction engine, thereby realizing the one-to-one correspondence between the first instruction engine and the first instruction buffer; that is, the first instruction buffer is used to cache the instructions processed by the first instruction engine.
• after the instruction block scheduler determines that the first instruction set is to be processed by the first instruction engine, the instruction block scheduler sends the instruction pointer PC in the instruction processing request to the first instruction buffer; the first instruction buffer obtains the instructions of the first instruction set according to the instruction pointer PC and sends them to the first instruction engine for processing.
• the first instruction buffer may fetch from the instruction memory IMEM to acquire the instructions in the first instruction set.
  • the first instruction engine acquires the first instruction set from the first instruction buffer.
• after the first instruction buffer acquires the first instruction set according to the instruction pointer PC in the instruction processing request, the first instruction buffer can actively send it to the first instruction engine for processing, or the first instruction engine can retrieve the first instruction set from the first instruction buffer in order to process it.
• the first instruction buffer and the first instruction engine may be connected through a hardware interface, so that the first instruction buffer can send the first instruction set to the first instruction engine, or the first instruction engine can obtain the first instruction set from the first instruction buffer.
• the method may further include: when the first instruction buffer detects an end indicator (EI) of the first instruction set, the first instruction buffer sends scheduling information to the instruction block scheduler, where the scheduling information indicates that the first instruction engine can process the next instruction processing request.
  • the polling scheduler RR in the instruction block scheduler can take out the next instruction processing request from the instruction processing request queue corresponding to the first instruction engine for processing by the first instruction engine.
• when the first instruction engine detects the end indicator EI of the first instruction set, the first instruction engine ends the processing of the first instruction set and initiates a scheduling request to the output scheduler OS.
• the output scheduler OS responds to the scheduling request and determines whether the entire program has been executed. If the execution is completed, the output scheduler OS outputs the processing result to the subsequent module; otherwise, the output scheduler OS sends the instruction processing request corresponding to the next instruction block to be executed to the input scheduler IS for continued processing, and the cycle repeats in turn.
  • the processor may be divided into functional modules according to the foregoing method examples.
  • each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module.
  • the above-mentioned integrated modules can be implemented in the form of hardware, and can also be implemented in the form of software function modules. It should be noted that, the division of modules in the embodiments of the present application is schematic, and is only a logical function division, and there may be other division manners in actual implementation.
  • the processor includes an instruction block scheduler, an instruction cache group, and an instruction engine group.
• the instruction cache group includes multiple instruction buffers, and the instruction engine group includes multiple instruction engines; the multiple instruction buffers in the instruction cache group correspond one-to-one with the multiple instruction engines in the instruction engine group.
  • the instruction block scheduler is used to receive an instruction processing request; the instruction processing request is used to request the processor to process the first instruction set.
  • the instruction block scheduler is used to determine the first instruction engine according to the instruction processing request; the first instruction engine is the instruction engine processing the first instruction set in the instruction engine group, the first instruction engine and the first instruction buffer in the instruction cache group correspond.
  • the instruction block scheduler is configured to send the instruction processing request to the first instruction buffer corresponding to the first instruction engine.
  • the first instruction engine is used to obtain the first instruction set from the first instruction buffer.
• the processor further includes multiple instruction processing request queues; the multiple instruction processing request queues are in one-to-one correspondence with the multiple instruction engines and in one-to-one correspondence with the multiple instruction buffers. The instruction block scheduler is configured to determine the first instruction buffer corresponding to the first instruction engine according to the instruction processing request queue corresponding to the first instruction engine. In this way, through the one-to-one correspondences with the instruction processing request queues, the one-to-one correspondence between the multiple instruction engines and the multiple instruction buffers can be realized.
• the instruction block scheduler can be specifically configured to: obtain a candidate instruction engine of the first instruction set according to the instruction processing request, where the candidate instruction engine is an instruction engine that can be used to process the first instruction set; and select an instruction engine from the candidate instruction engines as the first instruction engine.
  • the instruction engine group may include a first candidate instruction engine group.
• the instruction block scheduler is specifically used to: if the first instruction set is an instruction set on a non-performance path, use the instruction engine in the first candidate instruction engine group as the candidate instruction engine of the first instruction set.
  • the instruction engine group may include a first candidate instruction engine group and a second instruction engine group.
  • the instruction block scheduler is specifically used to, if the first instruction set is an instruction set on the performance path, take the instruction engine in the first candidate instruction engine group or the instruction engine in the second instruction engine group as the candidate instruction of the first instruction set engine.
  • the instruction block scheduler can be specifically configured to: if the first condition is satisfied, take the instruction engines in the first candidate instruction engine group as the candidate instruction engines of the first instruction set.
  • the first condition may be: in the first candidate instruction engine group, there is at least one instruction engine whose corresponding instruction processing request queue has a queue depth below the first preset threshold.
  • or, if the second condition is satisfied, the instruction engines in the second instruction engine group are taken as the candidate instruction engines of the first instruction set.
  • the second condition may be: in the first candidate instruction engine group, the queue depths of the instruction processing request queues corresponding to all instruction engines exceed the first preset threshold.
  • the second instruction engine group may include a second candidate instruction engine group and a third candidate instruction engine group.
  • the instruction block scheduler may be specifically configured to: take the instruction engines in the second candidate instruction engine group of the second instruction engine group as the candidate instruction engines of the first instruction set; and, if the third condition is satisfied, add at least one instruction engine of the third candidate instruction engine group to the second candidate instruction engine group.
  • the third condition may be: the second candidate instruction engine group is empty, or the queue depths of the instruction processing request queues corresponding to all instruction engines in the second candidate instruction engine group exceed the second preset threshold.
  • the instruction block scheduler can be specifically used to select, in the third candidate instruction engine group, at least one instruction engine whose corresponding instruction processing request queue has a queue depth below the third preset threshold, and add it to the second candidate instruction engine group.
  • the instruction block scheduler may also be used to record an instruction engine selection difference; if the instruction engine selection difference exceeds the fourth preset threshold, all instruction engines in the second candidate instruction engine group are deleted. The instruction engine selection difference indicates the difference between the number of times an instruction engine is selected from the first candidate instruction engine group and the number of times an instruction engine is selected from the second candidate instruction engine group.
  • the instruction block scheduler may be specifically configured to: obtain the queue depth of the instruction processing request queue corresponding to the candidate instruction engine.
  • the candidate instruction engine corresponding to the instruction processing request queue with the smallest queue depth is selected as the first instruction engine.
  • the instruction buffer may also be used to send scheduling information to the instruction block scheduler when the end marker of the first instruction set is detected, where the scheduling information is used to indicate that the first instruction engine can process the next instruction processing request.
  • each instruction engine in the processor can have exclusive use of one instruction buffer, so that the processor has a stable and deterministic instruction fetch bandwidth, thereby improving the instruction execution efficiency of the processor; in addition,
  • the instructions in a program are divided into blocks, and the different instruction blocks are assigned in program execution order to different IEs for execution, which can improve the utilization of processor resources and reduce the copying of instructions between instruction buffers, thereby reducing cost and power consumption.
  • since the instruction buffers and the instruction engines are bound one-to-one in the processor, the processor does not need a crossbar matrix, and the instruction cache length of a single cache unit of a bound instruction buffer can be set equal to the number of instructions the instruction engine can process in a single instruction cycle, which eliminates the instruction queue Inst Q, thereby reducing the complexity of the processing flow and reducing the cost and power consumption of the processor.
  • an embodiment of the present application further provides an electronic device.
  • the electronic device includes: a memory 501 and a processor 502 .
  • the memory 501 is used to store the program code and data of the device.
  • the processor 502 is used to control and manage the actions of the device shown in FIG. 5.
  • the structure of the processor 502 can be the structure shown in FIG. 2 above; for example, it is specifically used to support the instruction processing in executing S401-S403 in the above method embodiments, and/or other processes of the techniques described herein.
  • the electronic device shown in FIG. 5 may further include a communication interface 503, and the communication interface 503 is used to support the device to communicate.
  • the processor 502 may be a central processing unit, a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a processing chip, a field-programmable gate array or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It can implement or execute the various logic blocks, modules, and circuits described in connection with the disclosure of the embodiments of the present application.
  • the processor 502 may also be a combination that implements computing functions, such as a combination comprising one or more microprocessors, a combination of a digital signal processor and a microprocessor, and the like.
  • the communication interface 503 may be a transceiver, a transceiver circuit, a transceiver interface, or the like.
  • the memory 501 may be a volatile memory or a non-volatile memory or the like.
  • the communication interface 503, the processor 502, and the memory 501 are connected to each other through a bus 504;
  • the bus 504 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like.
  • the bus 504 can be divided into an address bus, a data bus, a control bus, and the like. For ease of presentation, only one thick line is used in FIG. 5, but it does not mean that there is only one bus or one type of bus.
  • memory 501 may be included in processor 502 .

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
  • Advance Control (AREA)

Abstract

A multi-instruction-engine-based instruction processing method and processor, relating to the field of computer technology, for improving instruction execution efficiency and the utilization of processor resources, thereby reducing cost and power consumption. The method is applied to a processor comprising an instruction block scheduler, an instruction cache group, and an instruction engine group; the multiple instruction buffers in the instruction cache group correspond one-to-one to the multiple instruction engines in the instruction engine group. In the method, the instruction block scheduler determines, according to an instruction processing request for processing a first instruction set, a first instruction engine to process the first instruction set; determines, according to the one-to-one correspondence between instruction engines and instruction buffers, a first instruction buffer for caching the first instruction set; and sends the instruction pointer in the instruction processing request to the first instruction buffer. The first instruction engine obtains the first instruction set from the first instruction buffer so as to execute the instructions in the first instruction set.

Description

Multi-instruction-engine-based instruction processing method and processor. Technical Field
This application relates to the field of computer technology, and in particular to a multi-instruction-engine-based instruction processing method and processor.
Background
With the application of fifth-generation (5G) mobile communication technology, processors have evolved to execute instructions in parallel on multiple instruction engines (IEs); that is, the number of instructions a processor executes in parallel per cycle keeps growing, and the processor places higher requirements on instruction fetch bandwidth and fetch latency. A single cache can therefore no longer meet the fetch demands of a multi-IE processor, and a multi-cache scheme is needed to raise the fetch bandwidth.
In current schemes that raise fetch bandwidth through multiple caches, multiple IEs share the caches. When a new instruction processing request is received, the IE that will execute the request is determined by the queue depth of the instruction queue (Inst Q) corresponding to each IE; the cache holding the instruction indicated by the program counter (PC) in the request is then determined from that PC, the instruction is fetched from that cache, and it is sent through a crossbar to the corresponding IE.
Consequently, in current multi-cache schemes, which cache an instruction resides in is determined by the instruction's PC, and consecutive fetches must cycle through multiple caches, so fetch requests may be distributed unevenly across the caches, degrading the processor's processing performance and execution efficiency.
Summary
Embodiments of this application provide a multi-instruction-engine-based instruction processing method and processor, for improving instruction execution efficiency and the utilization of processor resources, thereby reducing cost and power consumption.
To achieve the foregoing objectives, the embodiments of this application adopt the following technical solutions:
According to a first aspect, an embodiment of this application provides a multi-instruction-engine-based instruction processing method. The method is applied to a processor including an instruction block scheduler, an instruction cache group, and an instruction engine group. The instruction cache group includes multiple instruction buffers (for example, cache 0 to cache 15), and the instruction engine group includes multiple instruction engines (for example, IE 0 to IE 15); the instruction buffers correspond one-to-one to the instruction engines (for example, cache 0 corresponds to IE 0, cache 1 to IE 1, and so on). The method includes: the instruction block scheduler receives an instruction processing request, which requests the processor to process a first instruction set; the instruction block scheduler determines, according to the request, a first instruction engine, i.e. the instruction engine in the instruction engine group that processes the first instruction set; the instruction block scheduler sends the instruction processing request to the first instruction buffer corresponding to the first instruction engine; and the first instruction engine obtains the first instruction set from the first instruction buffer.
It should be noted that the instruction block scheduler includes a program block table that records, for each instruction block of a program, the instruction engines to which the block can be mapped (assigned); the instruction engine to process each block can be determined, according to certain rules, from the program block identifier PBID in the instruction processing request. The instruction block scheduler also includes multiple instruction processing request queues, which correspond one-to-one to the instruction engines and one-to-one to the instruction buffers, thereby realizing the one-to-one correspondence between instruction engines and instruction buffers. In addition, an instruction engine can be connected to its corresponding instruction buffer through a hardware interface so that instructions can be transferred.
Based on the method of the first aspect, the instruction block scheduler determines, according to the instruction processing request for the first instruction set, the first instruction engine that processes the first instruction set; determines, from the one-to-one correspondence between instruction engines and instruction buffers, the first instruction buffer used to cache the first instruction set; and sends the instruction pointer PC in the request to the first instruction buffer; the first instruction engine obtains the first instruction set from the first instruction buffer to execute its instructions. In this solution, each instruction engine has exclusive use of one instruction buffer, so the processor has a stable and deterministic fetch bandwidth, improving instruction execution efficiency. Furthermore, when a program runs on this processor, its instructions are divided into blocks that are assigned in program execution order to different IEs, which improves resource utilization and reduces copying of instructions between instruction buffers, thereby reducing cost and power consumption.
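The dispatch flow described above can be modeled in a few lines of Python. This is only an illustrative sketch, not the patent's implementation: `ProcessorModel`, the queue data structure, and the candidate list are all assumed names, and the least-loaded selection rule is the one described later in the text.

```python
# Hypothetical model: request queue i feeds instruction engine i,
# which fetches from its one-to-one bound instruction buffer i.
from collections import deque

class ProcessorModel:
    def __init__(self, num_engines=4):
        # one instruction processing request queue per engine
        self.queues = [deque() for _ in range(num_engines)]

    def dispatch(self, request, candidates):
        """Append the request to the candidate engine with the shallowest queue."""
        engine = min(candidates, key=lambda ie: len(self.queues[ie]))
        self.queues[engine].append(request)
        return engine

p = ProcessorModel()
p.queues[0].extend(["r1", "r2"])          # engine 0 is already loaded
chosen = p.dispatch("r3", candidates=[0, 1])
# engine 1 has the empty queue, so it receives the new request
```

Because each engine owns its queue and buffer, no crossbar appears anywhere in the model; the binding alone routes the request.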
In a possible implementation of the first aspect, determining the first instruction engine according to the instruction processing request may include: the instruction block scheduler obtains, according to the request, candidate instruction engines for the first instruction set, a candidate instruction engine being an instruction engine that can be used to process the first instruction set; the scheduler then selects one of the candidates as the first instruction engine. Understandably, the candidates may be predetermined, i.e. obtained from the program block table in the scheduler. In the table, the mapping between each instruction block and the instruction engines is generally allocated and configured according to the characteristics of the program to be executed, for example so that the number of instructions each engine processes is as even as possible. Taking a performance path with three blocks PBID=0, PBID=1, and PBID=2 as an example, suppose the blocks with PBID=0 and PBID=1 each contain 24 instructions while the block with PBID=2 contains 48; the blocks with PBID=0 and PBID=1 can then both be assigned to IE0 and the block with PBID=2 to IE1, so that the instruction counts across IEs are as even as possible. This possible implementation can further improve the processor's instruction execution efficiency and resource utilization.
Optionally, the instruction engine group may include a first candidate instruction engine group. Obtaining the candidates may include: if the first instruction set is an instruction set on a non-performance path, the instruction block scheduler takes the engines in the first candidate group as the candidates for the first instruction set. In this optional solution, because non-performance-path instruction blocks mainly handle exceptions and protocol packets, if the block to be executed is on a non-performance path, an engine is chosen directly from the group pre-assigned in the program block table (e.g. the static IE group), regardless of whether the engines are congested, without extending to other engines, which further reduces the processor's instruction execution cost and power consumption.
Alternatively and optionally, the instruction engine group may include a first candidate instruction engine group and a second instruction engine group. Obtaining the candidates may include: if the first instruction set is an instruction set on a performance path, the instruction block scheduler takes the engines of the first candidate group or of the second group as the candidates. Here the first candidate group is the static IE group preconfigured in the program block table, and the second group is the set of all other instruction engines in the processor. Because performance-path blocks are the main part of the whole program, engines are chosen preferentially from the pre-assigned first candidate group; if all of its engines are congested, the selection can be extended to the processor's other engines, i.e. the second group, to ensure sufficient resources for performance-path blocks and further improve execution efficiency.
Further, taking the engines of the first candidate group or of the second group as candidates may include: if a first condition is satisfied, the instruction block scheduler takes the engines of the first candidate group as the candidates, the first condition being that at least one engine in the first candidate group has an instruction processing request queue whose depth is below a first preset threshold. Alternatively, if a second condition is satisfied, the scheduler takes the engines of the second group as the candidates, the second condition being that the request queue depths of all engines in the first candidate group exceed the first preset threshold. In this scheme, the first preset threshold, which can be adjusted to the processor's actual situation, is preconfigured to judge whether the engines of the first candidate group are congested; this improves execution efficiency while keeping power consumption to a minimum.
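The first/second condition check can be sketched as follows. The function and threshold value are illustrative assumptions; the text only speaks of a "first preset threshold" without fixing a number.

```python
# Hypothetical congestion check for choosing between the two engine groups.
FIRST_THRESHOLD = 16  # illustrative value, not specified by the source

def candidate_engines(static_group, dynamic_pool, queue_depth):
    """static_group / dynamic_pool: lists of engine ids;
    queue_depth: engine id -> depth of its instruction processing request queue."""
    # First condition: at least one static-group engine is not congested.
    if any(queue_depth[ie] < FIRST_THRESHOLD for ie in static_group):
        return static_group
    # Second condition: every static-group engine's queue exceeds the threshold.
    return dynamic_pool

depths = {0: 16, 1: 3, 2: 20}
assert candidate_engines([0, 1], [2], depths) == [0, 1]  # engine 1 not congested
depths[1] = 17
assert candidate_engines([0, 1], [2], depths) == [2]     # fall through to second group
```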
Still further, the second instruction engine group may include a second candidate instruction engine group and a third candidate instruction engine group. Taking the engines of the second group as candidates may include: the scheduler takes the engines of the second candidate group as the candidates; if a third condition is satisfied, the scheduler adds at least one engine of the third candidate group to the second candidate group. The third condition may be: the second candidate group is empty, or the request queue depths of all engines in the second candidate group exceed a second preset threshold. In this further scheme, the second group is divided into the second candidate group, i.e. the dynamic IE group, which is enabled, and the third candidate group, the non-enabled group. When the enabled engines in the processor have few execution tasks, some engines can be de-enabled to reduce power consumption; when all engines in the second candidate group, i.e. the dynamic IE group, are congested, engines of the third candidate group can be enabled to improve execution efficiency.
Optionally, the scheduler selects from the third candidate group at least one engine whose corresponding request queue depth is below a third preset threshold and adds it to the second candidate group. In this optional solution, when extending engines from the third candidate group, a non-congested engine is chosen, which improves instruction execution efficiency.
Optionally, the method of the first aspect may further include: the scheduler records an instruction engine selection difference; if the difference exceeds a fourth preset threshold, the scheduler deletes all engines from the second candidate group. The selection difference indicates the difference between the number of times an engine is selected from the first candidate group and the number of times one is selected from the second candidate group. In this optional solution, to reduce power consumption, if the block to which the first instruction set belongs very rarely selects engines from the second candidate group, i.e. the dynamic IE group, the engines of the dynamic IE group can be deleted.
Further, selecting the first instruction engine from the candidates may include: the scheduler obtains the request queue depths of the candidates and selects the candidate whose request queue is shallowest as the first instruction engine. Choosing the candidate with the smallest queue depth keeps the number of instructions each engine must execute lower, which helps improve resource utilization and the processor's execution efficiency.
In a possible implementation, the method of the first aspect may further include: when the first instruction buffer detects the end marker of the first instruction set, it sends scheduling information to the instruction block scheduler indicating that the first instruction engine can process the next instruction processing request. The round-robin scheduler within the instruction block scheduler can then take the next request from the first engine's request queue for the first engine to process, so that instructions execute in order and processor efficiency improves.
According to a second aspect, an embodiment of this application provides a processor. The processor includes an instruction block scheduler, an instruction cache group, and an instruction engine group; the cache group includes multiple instruction buffers and the engine group includes multiple instruction engines, in one-to-one correspondence. The instruction block scheduler is configured to receive an instruction processing request requesting the processor to process a first instruction set; to determine, according to the request, a first instruction engine, i.e. the engine in the engine group that processes the first instruction set and corresponds to a first instruction buffer in the cache group; and to send the request to the first instruction buffer. The first instruction engine is configured to obtain the first instruction set from the first instruction buffer.
Optionally, the processor further includes multiple instruction processing request queues in one-to-one correspondence with the engines and with the buffers; the scheduler is configured to determine the first buffer according to the request queue corresponding to the first engine. In this way, the one-to-one correspondence between engines and buffers is realized through the request queues.
In a possible implementation of the second aspect, the scheduler may specifically be configured to: obtain, according to the request, candidate instruction engines for the first instruction set, i.e. engines that can be used to process it; and select one of them as the first instruction engine.
Optionally, the instruction engine group may include a first candidate instruction engine group, and the scheduler is specifically configured to take the engines of the first candidate group as the candidates if the first instruction set is on a non-performance path.
Optionally, the instruction engine group may include a first candidate instruction engine group and a second instruction engine group, and the scheduler is specifically configured to take the engines of the first candidate group or of the second group as the candidates if the first instruction set is on a performance path.
Further, the scheduler may specifically be configured to: take the engines of the first candidate group as the candidates if a first condition is satisfied, the first condition being that at least one engine in the first candidate group has a request queue whose depth is below a first preset threshold; or take the engines of the second group as the candidates if a second condition is satisfied, the second condition being that the request queue depths of all engines in the first candidate group exceed the first preset threshold.
Still further, the second group may include a second candidate group and a third candidate group; the scheduler may specifically be configured to take the engines of the second candidate group as the candidates and, if a third condition is satisfied, add at least one engine of the third candidate group to the second candidate group, the third condition being that the second candidate group is empty or the request queue depths of all its engines exceed a second preset threshold.
Optionally, the scheduler may specifically be configured to select from the third candidate group at least one engine whose request queue depth is below a third preset threshold and add it to the second candidate group.
Optionally, the scheduler may also be configured to record an instruction engine selection difference and, if it exceeds a fourth preset threshold, delete all engines from the second candidate group; the selection difference indicates the difference between the number of times an engine is selected from the first candidate group and the number of times one is selected from the second candidate group.
In a possible implementation, the scheduler may specifically be configured to obtain the request queue depths of the candidates and select the candidate whose request queue is shallowest as the first instruction engine.
In a possible implementation, the first instruction buffer may also be configured to, upon detecting the end marker of the first instruction set, send scheduling information to the scheduler indicating that the first instruction engine can process the next instruction processing request.
Optionally, the cache length of a single cache unit in each instruction buffer of the cache group equals the number of instructions the corresponding instruction engine can process in a single execution cycle. Taking caches as the instruction buffers, the cache length of a single cache unit means the length of a cache line, and the number of instructions processable per execution cycle means the size of the arithmetic logic unit (ALU) array the engine can process in a single cycle. For example, if an IE can process an ALU array of 4 instructions per execution cycle, the cache line of its corresponding cache is also designed to hold 4 instructions. In this way, no instruction queue (Inst Q) is needed between the IE and the cache to buffer instructions, reducing cost and power consumption.
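The sizing rule above reduces to a one-line check: an intermediate Inst Q is only needed when the cache line and the per-cycle ALU array size disagree. The function name and parameters are illustrative, not from the source.

```python
# Illustrative check: if each cache line holds exactly as many instructions as
# the engine's ALU array consumes per cycle, one line feeds one cycle and no
# intermediate instruction queue (Inst Q) is needed between cache and IE.
def needs_inst_queue(cache_line_insts, alu_array_size):
    return cache_line_insts != alu_array_size

assert needs_inst_queue(4, 4) is False  # matched sizes: line per cycle, no Inst Q
assert needs_inst_queue(8, 4) is True   # mismatched sizes: buffering required
```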
According to a third aspect, an embodiment of this application provides an electronic device. The electronic device includes a processor and a memory coupled to the processor, where the processor is the processor provided in any possible implementation of the second aspect above.
Understandably, any processor or electronic device provided above is configured to perform the multi-instruction-engine-based instruction processing method of the first aspect; the beneficial effects it can achieve are therefore those described for the method of the first aspect and are not repeated here.
Brief Description of Drawings
FIG. 1 is a schematic structural diagram of a multi-cache multi-instruction-engine processor;
FIG. 2 is a schematic structural diagram of a multi-cache multi-instruction-engine processor according to an embodiment of this application;
FIG. 3 is a schematic diagram of an instruction block allocation scheme according to an embodiment of this application;
FIG. 4 is a schematic flowchart of a multi-instruction-engine-based instruction processing method according to an embodiment of this application;
FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of this application.
Detailed Description of Embodiments
FIG. 1 is a schematic structural diagram of a multi-cache multi-instruction-engine processor. In the scheme of FIG. 1, the processor consists of an instruction buffer (IBUF), multiple slice-based caches, a crossbar, multiple instruction engines (IEs), and instruction queues (Inst Q) in one-to-one correspondence with the IEs. Instructions are stored across the slice-based caches as follows: taking slice-based cache 0 to slice-based cache 15 as an example, suppose each cache line holds 8 instructions; if the instruction for a program counter (PC) is stored in slice-based cache 0, then the 8th instruction after it (the instruction for PC+8) is stored in slice-based cache 1, and so on; if a PC's instruction is stored in slice-based cache 15, the instruction for PC+8 is stored in slice-based cache 0.
The specific processing flow is as follows:
After receiving an instruction fetch request (hereinafter: fetch request) containing a program counter PC, the IBUF buffers it in an instruction first-in-first-out (IFIFO) queue. The dispatcher (DISP) in the IBUF reads fetch requests from the IFIFO queue and assigns each to the execution thread corresponding to some IE; each IE corresponds to one execution thread.
Assigning a fetch request to an IE's execution thread must satisfy the following requirements:
(1) the thread has completed its previous fetch request;
(2) if multiple threads are available for assignment, assignment follows the queue depth of each IE's Inst Q, preferring the thread of the IE whose Inst Q is shallowest.
After a thread is assigned, the dispatcher DISP sends the PC in the fetch request to the corresponding cache to read the instruction. After the cache receives the PC, a scheduler (SCH) schedules it into that cache's pipeline, where a tag lookup controller first searches the tag table and the result is then arbitrated by an arbiter (ARB). The tag table records the correspondence between PCs and the instruction data cached in the cache; looking it up determines whether the PC corresponds to a cache unit (e.g. a cache line) — if so the access is a hit, otherwise a miss. The ARB judges the hit result obtained from the tag table to confirm whether the PC's instruction hits in the cache. After arbitration, if the instruction hits, it is fetched from the cache data module and sent through the crossbar to the bound IE. If it misses, a refill request (also called a backfill request) is issued to the instruction memory (IMEM) to fetch the instruction and relearn it into the corresponding cache; after the cache obtains the instruction from IMEM, the tag table is updated.
If there is no end indicator (EI) after the PC's instruction (obtainable by looking up the tag table), a fetch request for PC+8 is issued to the next cache, whose scheduler schedules it according to a state table into that cache's pipeline (the flow after entering the pipeline is as described above and is not repeated here).
If a refill request reads the instruction from instruction memory and returns it to the cache, the cache's scheduler issues an instruction backfill request according to the state table, fills the instruction into the cache data module, records the instruction's information in the tag table, and sends the instruction through the crossbar to the IE.
If an instruction fetched from a cache carries the EI mark, the current PC has completed fetching. If the execution thread of the current PC still has PCs waiting to fetch, fetching continues in that thread with a new PC; if the thread has no PC waiting, the current fetch request has completed all fetch operations and the next fetch request can be processed.
It should be noted that in the scheme of FIG. 1, which cache an instruction resides in is related to the instruction's PC, and consecutive fetches must cycle through multiple caches, so fetch requests may be distributed unevenly across the caches, degrading processing performance and execution efficiency.
In addition, because multiple IEs share multiple caches, data between the caches and the IEs must pass through the crossbar, which incurs high hardware cost and power consumption.
In this application, "at least one" means one or more, and "multiple" means two or more. "And/or" describes an association between objects and indicates that three relationships may exist: A and/or B may mean A alone, both A and B, or B alone, where A and B may be singular or plural. "At least one of the following" or similar expressions refer to any combination of the listed items, including any combination of single or plural items; for example, at least one of a, b, or c may mean a, b, c, a-b, a-c, b-c, or a-b-c, where each of a, b, and c may be single or multiple. The character "/" generally indicates an "or" relationship between the associated objects. In addition, in the embodiments of this application, terms such as "first" and "second" do not limit quantity or execution order.
It should be noted that in this application, words such as "exemplary" or "for example" are used to present examples, illustrations, or explanations. Any embodiment or design described as "exemplary" or "for example" should not be construed as preferred or more advantageous than other embodiments or designs; rather, such words are intended to present related concepts in a concrete manner.
In addition, before the embodiments of this application are introduced, the technical terms involved are first explained.
Performance path: the main execution path of a program.
Non-performance path: an execution path of a program that handles exceptions or protocol packets.
FIG. 2 is a schematic structural diagram of a multi-cache multi-instruction-engine processor according to an embodiment of this application. The processor includes a program block dispatcher (PBD), an instruction cache group (ICG), and an instruction engine group (IEG). The cache group includes multiple instruction buffers, which may be caches, e.g. cache0 to cache15; the engine group includes multiple instruction engines (IEs), e.g. IE0 to IE15. The instruction buffers correspond one-to-one to the instruction engines, i.e. each IE has exclusive use of one instruction buffer and IEs and buffers are bound pairwise, so every IE has a stable and deterministic fetch bandwidth.
It should be noted that the processor may also include multiple instruction processing request queues, in one-to-one correspondence with the engines and in one-to-one correspondence with the buffers. That is, each IE may correspond to one instruction processing request queue (IE-based queue) and each request queue corresponds to one instruction buffer, realizing the one-to-one correspondence between request queues and buffers. In addition, an engine may be connected to its corresponding buffer through a hardware interface for instruction transfer. It should be understood that all request queues may be managed by the queue manager QM.
To reduce cost and power consumption, the cache length of a single cache unit in each instruction buffer can be made equal to the number of instructions the corresponding engine can process per execution cycle. Taking caches as the instruction buffers, the cache length of a single cache unit means the cache line length, and the number of instructions processable per cycle means the size of the arithmetic logic unit (ALU) array the engine can process within one cycle. For example, if an IE can process an ALU array of 4 instructions per cycle, the cache line of its corresponding cache is also designed to hold 4 instructions. In this way, no instruction queue (Inst Q) is needed between the IE and the cache to buffer instructions, reducing cost and power consumption.
Of course, if the cache length of a single cache unit in an instruction buffer does not match the number of instructions the corresponding engine can process per cycle, an instruction queue can be placed between the buffer and the IE to buffer instructions so that they are executed by the IE in order.
Compared with the processor shown in FIG. 1, because the instruction buffers and instruction engines are bound one-to-one, the processor of FIG. 2 needs no crossbar, and the instruction cache length of a single cache unit of a bound buffer can be set equal to the number of instructions the engine can process per instruction cycle, eliminating the Inst Q, reducing the complexity of the processing flow, and lowering the processor's cost and power consumption.
The processor in FIG. 2 involves the instruction block scheduler. When a complete program is executed on this processor, the program can first be divided into several program blocks, and each block is assigned, based on program execution order, an identifier (ID) called the program block identifier (PBID). Programs are generally divided at IO operations or at switch jump instructions.
It should be noted that a complete program can be divided, in execution order, into multiple phases. After the program is split into blocks, each block has exactly one PBID; different blocks in different phases may be assigned different PBIDs, though they may also be the same, while different blocks in the same phase are generally assigned different PBIDs.
In addition, by execution path, the blocks can be divided into performance-path blocks and non-performance-path blocks. During allocation, the blocks of different phases on the same performance path can be distributed as evenly as possible across all IEs, for example by assigning each such block to a different IE; blocks of different phases on non-performance paths can likewise be distributed evenly. This avoids concentrating the processing of a performance path's blocks on a subset of IEs, which would increase those IEs' load and that of their bound instruction buffers (lowering cache hit rates), and thus improves the processor's execution efficiency and resource utilization. Also, because different blocks of the same phase usually lie on different branches and are generally not executed simultaneously, blocks of the same phase can be assigned to the same IE.
The allocation of instruction blocks is explained below with FIG. 3, a schematic diagram of an instruction block allocation scheme according to an embodiment of this application. In the example of FIG. 3, the program is divided into 9 instruction blocks: one performance path includes the blocks PBID=0, PBID=2, PBID=4, PBID=5, and PBID=8; one non-performance path includes the blocks PBID=1, PBID=3, PBID=4. The program is divided into 5 phases: phase 1 contains the block PBID=0; phase 2 contains PBID=1, PBID=2, PBID=3; phase 3 contains PBID=4; phase 4 contains PBID=5, PBID=6; phase 5 contains PBID=7, PBID=8, PBID=9. Under the allocation rules above, the performance-path blocks can be assigned, for example, as follows: PBID=0 to IE0, PBID=2 to IE1, PBID=4 to IE2, PBID=5 to IE3, PBID=8 to IE4. For the non-performance-path blocks: PBID=1 and PBID=2 both belong to phase 2, so the block PBID=1 may also be assigned IE1; PBID=5 and PBID=6 both belong to phase 4, so PBID=6 may also be assigned IE3; PBID=8 and PBID=9 both belong to phase 5, so PBID=9 may also be assigned IE4.
It should also be noted that distributing the blocks of different phases on the same performance path evenly across the IEs can also mean that the number of instructions assigned to each IE is roughly equal. For example, suppose a program's performance path has three blocks PBID=0, PBID=1, and PBID=2, where the blocks with PBID=0 and PBID=1 each contain 24 instructions while the block with PBID=2 contains 48; the blocks with PBID=0 and PBID=1 can then both be assigned to IE0 and the block with PBID=2 to IE1, so that the numbers of instructions to be executed on IE0 and IE1 are equal.
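One way to reproduce this balanced assignment is a greedy longest-first heuristic. This is an illustrative sketch only; the source does not prescribe any particular balancing algorithm, and the function name and engine labels are assumptions.

```python
# Hypothetical balancing sketch: place larger blocks first, always onto the
# engine that currently holds the fewest instructions.
def assign_blocks(block_insts, num_engines):
    """block_insts: PBID -> instruction count. Returns (PBID -> IE, per-IE load)."""
    load = [0] * num_engines
    mapping = {}
    # longest-processing-time first: sorting big blocks first avoids the
    # imbalance a naive PBID-order greedy pass would produce here
    for pbid, n in sorted(block_insts.items(), key=lambda kv: -kv[1]):
        ie = load.index(min(load))
        mapping[pbid] = ie
        load[ie] += n
    return mapping, load

mapping, load = assign_blocks({0: 24, 1: 24, 2: 48}, 2)
# both engines end up with 48 instructions: the two 24-instruction blocks share
# one engine while the 48-instruction block gets the other, as in the example
```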
Referring to FIG. 2, the instruction block scheduler includes a program block table (PBT), a lookup table controller (LTC), a queue manager (QM), and a round robin (RR) scheduler.
The program block table PBT is preconfigured before program execution, for example generated during program compilation. The PBT may include the following fields:
PERF: the performance-path field, occupying 1 bit, used to indicate whether the corresponding block is on a performance path; 1 may denote a performance-path block and 0 a block on a non-performance path.
SF_BM: a bitmap-based static flag, occupying 16 bits (taking 16 IEs, IE0-IE15, as an example), used to specify whether the mapping between the corresponding block and each IE is static or dynamic. Each bit corresponds to one IE: SF_BM[0] to IE0, SF_BM[1] to IE1, and so on. SF_BM[x]=1 may mean IEx has a static mapping with the block, and SF_BM[x]=0 a dynamic mapping. A static block-IE mapping is never changed; a dynamic block-IE mapping can be changed according to the IE's congestion state.
IE_BM: the IE bitmap, occupying 16 bits, used to specify the block-IE mapping. Each bit corresponds to one IE: IE_BM[0] to IE0, IE_BM[1] to IE1, and so on. IE_BM[x]=1 may mean the block is mapped to IEx, i.e. it may be assigned to IEx for execution and IEx is enabled; IE_BM[x]=0 means the block is not mapped to IEx, i.e. it may not be assigned to IEx and IEx is not enabled, where x is an integer from 0 to 15.
DIFF_CNT: records the difference between the number of times the block selects an IE from the statically mapped IEs and the number of times it selects one from the dynamically mapped IEs; it can be used to indicate whether the IEs dynamically mapped to the block should be deleted. DIFF_CNT is incremented by 1 when an IE is selected from the static mapping and decremented by 1 when an IE is selected from the dynamic mapping.
It should be noted that when determining the IE to execute instructions, an IE is selected preferentially from the set of IEs statically mapped to the block to be executed; when that set contains no IE meeting the requirements, an IE can be selected from the set of IEs dynamically mapped to the block. The selected IE should always be in the enabled state. IEs can therefore be divided into three types: the static IE group, the dynamic IE group, and the non-enabled IE group.
For a given instruction block, the static IE group is the set of all IEs statically mapped to the block, all of which are enabled, i.e. IE_BM[m]=1 and SF_BM[m]=1.
The dynamic IE group is the set of all IEs dynamically mapped to the block and enabled, i.e. IE_BM[n]=1 and SF_BM[n]=0.
The non-enabled IE group is the set of all IEs dynamically mapped to the block and not enabled, i.e. IE_BM[t]=0 and SF_BM[t]=0.
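The three groups can be decoded mechanically from the two PBT bitmaps. The following is a sketch under the bit conventions stated above; the function name and the sample bitmap values are illustrative assumptions.

```python
# Hypothetical decoding of a PBT entry's bitmaps into the three IE groups.
def classify(ie_bm, sf_bm, num_ies=16):
    static_grp, dynamic_grp, disabled_grp = [], [], []
    for x in range(num_ies):
        mapped = (ie_bm >> x) & 1       # IE_BM[x]
        static_flag = (sf_bm >> x) & 1  # SF_BM[x]
        if mapped and static_flag:
            static_grp.append(x)        # IE_BM[x]=1, SF_BM[x]=1: static group
        elif mapped and not static_flag:
            dynamic_grp.append(x)       # IE_BM[x]=1, SF_BM[x]=0: dynamic group
        elif not mapped and not static_flag:
            disabled_grp.append(x)      # IE_BM[x]=0, SF_BM[x]=0: non-enabled group
        # the combination IE_BM[x]=0, SF_BM[x]=1 is not one of the three groups
    return static_grp, dynamic_grp, disabled_grp

s, d, u = classify(ie_bm=0b0111, sf_bm=0b0011, num_ies=4)
# IE0, IE1 static; IE2 dynamic (mapped, not static); IE3 non-enabled
```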
By looking up the program block table through the lookup table controller, the instruction block scheduler can determine the mapping between an instruction block and the IEs in the engine group, then determine from the queue depth of each IE's request queue the IE that will finally execute the block, and add the block's instruction processing request to that IE's request queue to await execution. The request queues are managed by the queue manager QM, and the requests in them are scheduled by the round robin (RR) scheduler.
In addition, the processor of FIG. 2 may also include an input scheduler (IS) and an output scheduler (OS). The input scheduler IS receives data from the preceding module, which may include instruction processing requests, and dispatches the requests to the instruction block scheduler; the output scheduler OS receives the processing results of instruction blocks, judges whether the whole program has finished executing, and schedules instruction processing requests and results according to the execution order of the program's blocks.
Of course, the input and output schedulers can also be designed as one scheduler, e.g. an input-output scheduler that implements all the functions of both.
FIG. 4 is a schematic flowchart of a multi-instruction-engine-based instruction processing method according to an embodiment of this application, applicable to the processor shown in FIG. 2. The method includes the following steps.
S401: the instruction block scheduler receives an instruction processing request; the request is used to request the processor to process a first instruction set.
The request may include the PBID corresponding to the first instruction set and the corresponding instruction pointer (program counter, PC); the first instruction set is the set of all instructions in an instruction block, and its PC can be used to index the instructions in the set.
S402: the instruction block scheduler determines, according to the request, a first instruction engine, i.e. the engine in the engine group that processes the first instruction set; the first engine corresponds to a first instruction buffer among the instruction buffers.
Further, the scheduler obtains, according to the request, candidate instruction engines for the first instruction set, i.e. engines that can be used to process it, and selects one of the candidates as the first instruction engine.
It should be noted that the candidates may be predetermined, e.g. through the PBT described above: the candidates may be all IEs in the static IE group to which the block of the first instruction set is mapped; the candidates may also be IEs dynamically added from the dynamic IE group to which the block is mapped, according to the congestion state of the request queues of the static group's IEs.
Therefore, for different blocks, the engine group can be divided into a first candidate instruction engine group and a second instruction engine group. The first candidate group may be the set of all IEs statically mapped to the block of the first instruction set, i.e. the static IE group; the second group may be the set of all IEs dynamically mapped to the block, i.e. the union of the dynamic IE group and the non-enabled IE group.
The candidates for the first instruction set can be determined as follows:
If the first instruction set is on a non-performance path, the scheduler takes the engines of the first candidate group as the candidates for the first instruction set.
If the first instruction set is on a performance path, the scheduler takes the engines of the first candidate group or of the second group as the candidates for the first instruction set.
Because non-performance-path blocks (instruction sets) mainly contain instructions for handling exceptions or protocol packets and generally do not occupy much traffic, to ensure even resource distribution, the IE for such a set can be chosen directly from the preconfigured block-IE mapping, i.e. the statically mapped IEs (the IEs in the static IE group). Performance-path blocks (instruction sets), however, generally need more traffic and resources during execution, so the IE is chosen preferentially from the statically mapped IEs; if they are all congested, the IE can be chosen from the dynamically mapped IEs. In this way, in a multi-IE processor, blocks can be executed as evenly as possible across the IEs while keeping power consumption to a minimum, improving the processor's execution efficiency for the program and the utilization of its resources.
Specifically, if the first instruction set is an instruction set on a performance path, the candidates can be determined as follows:
If a first condition is satisfied, the scheduler takes the engines of the first candidate group as the candidates; the first condition is: in the first candidate group, at least one engine's corresponding request queue has a queue depth below a first preset threshold.
Or,
if a second condition is satisfied, the scheduler takes the engines of the second group as the candidates; the second condition is: in the first candidate group, the request queue depths of all engines exceed the first preset threshold.
That is, if an IE's request queue depth reaches the first preset threshold, e.g. 16, the IE is considered congested and unsuitable for being assigned a new instruction set. If at least one IE in the first candidate group is not congested, the engines of the first candidate group can still serve as the candidates; if no engine in the first candidate group remains uncongested, the engines of the second group can serve as the candidates. It should be understood that the first preset threshold is a preset value for judging whether an IE is congested and can be changed according to the processor's actual processing situation.
Further, to reduce power consumption, some IEs can be left non-enabled, and such an IE is enabled only when it is needed to execute instructions. The second group can therefore be divided into a second candidate instruction engine group and a third candidate instruction engine group. The second candidate group may be the set of all IEs dynamically mapped to the block of the first instruction set and enabled, i.e. the dynamic IE group above; the third candidate group may be the set of all IEs dynamically mapped to the block and not enabled, i.e. the non-enabled IE group.
When taking the engines of the second group as candidates, the scheduler takes the engines of the second candidate group within it as the candidates for the first instruction set.
If a third condition is satisfied, the scheduler adds at least one engine of the third candidate group to the second candidate group. The third condition may be: the second candidate group is empty, or in the second candidate group, the request queue depths of all engines exceed a second preset threshold.
That is, when taking engines of the second group as candidates, IEs are selected preferentially from the second candidate group, i.e. the dynamic IE group; when it contains no IE meeting the requirements, a qualifying IE is taken from the third candidate group and extended into the second candidate group, i.e. a qualifying IE is selected from the non-enabled group and its state is changed to enabled.
It should be noted that whether the second candidate group contains a qualifying IE can likewise be judged by the IE congestion state, with a second preset threshold preset for the purpose. When the request queue depths of all engines in the second candidate group exceed the second threshold, no qualifying IE can be selected there, and at least one IE must be extended from the third candidate group.
To improve instruction execution efficiency, when extending IEs from the third candidate group, non-congested IEs should likewise be chosen: the scheduler selects from the third candidate group at least one IE whose corresponding request queue depth is below a third preset threshold and adds it to the second candidate group. It should be understood that the third preset threshold is also a preset value for judging IE congestion and can likewise be changed according to the processor's actual processing situation.
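The group-extension step can be sketched as follows. The threshold value, the mutation of the two lists, and the function name are illustrative assumptions; enabling an IE would in practice mean setting its IE_BM bit.

```python
# Hypothetical sketch: move one non-congested engine from the non-enabled
# group (third candidate group) into the dynamic group (second candidate group).
THIRD_THRESHOLD = 16  # illustrative value, not specified by the source

def extend_dynamic(dynamic_grp, disabled_grp, queue_depth):
    """Enable the first non-congested disabled engine, if any; return its id."""
    for ie in disabled_grp:
        if queue_depth[ie] < THIRD_THRESHOLD:
            disabled_grp.remove(ie)
            dynamic_grp.append(ie)  # enabling would set IE_BM[ie] = 1
            return ie
    return None  # every disabled engine is congested; nothing to extend

dynamic, disabled = [], [5, 6]
enabled = extend_dynamic(dynamic, disabled, queue_depth={5: 20, 6: 0})
# engine 5 is congested (depth 20), so engine 6 is the one enabled
```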
Understandably, the first, second, and third preset thresholds are the values used to judge whether the engines of the first, second, and third candidate groups, respectively, are congested; the three thresholds may be the same or different.
In addition, to reduce power consumption, for the block to which the first instruction set belongs, if the block selects IEs from the dynamic IE group very rarely, the engines of the dynamic IE group can be deleted. The scheduler therefore records an instruction engine selection difference, which indicates the difference between the number of times an engine is selected from the first candidate group and the number of times one is selected from the second candidate group, i.e. between selections from the static IE group and from the dynamic IE group; the selection difference can be recorded in the DIFF_CNT field of the program block table.
If the selection difference exceeds a fourth preset threshold, the scheduler deletes all engines of the second candidate group. That is, the fourth preset threshold is a value determined according to the actual situation, e.g. 500; when the selection difference exceeds it, the probability that the block selects an engine from the second candidate group is very small, so the scheduler can configure IE_BM[n]=0 in the program block table to disable all engines of the second candidate group, thereby reducing the processor's power consumption.
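The DIFF_CNT bookkeeping can be modeled in a few lines. The class, the example dynamic group, and the threshold value of 500 (quoted above only as an example) are illustrative assumptions.

```python
# Hypothetical DIFF_CNT sketch: +1 for a static-group pick, -1 for a
# dynamic-group pick; past the threshold, the dynamic group is dropped.
FOURTH_THRESHOLD = 500  # example value from the text, treated as an assumption

class BlockEntry:
    def __init__(self):
        self.diff_cnt = 0
        self.dynamic_grp = [2, 3]  # illustrative dynamic IE group

    def record(self, picked_static):
        self.diff_cnt += 1 if picked_static else -1
        if self.diff_cnt > FOURTH_THRESHOLD:
            # the block almost never uses the dynamic group: disable it
            # (clearing models setting IE_BM[n] = 0 for those engines)
            self.dynamic_grp.clear()

entry = BlockEntry()
for _ in range(501):                 # 501 static picks, zero dynamic picks
    entry.record(picked_static=True)
# diff_cnt reaches 501 > 500, so the dynamic group has been emptied
```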
The above explains how the candidate instruction engines of the first instruction set are obtained. After obtaining them, the scheduler can select one engine from the candidates as the first instruction engine to process the first instruction set. The specific selection is: the scheduler obtains the request queue depths of the candidates and selects the candidate whose request queue depth is smallest as the first instruction engine.
S403: the instruction block scheduler sends the instruction processing request to the first instruction buffer corresponding to the first instruction engine.
The scheduler may include multiple instruction processing request queues, in one-to-one correspondence with the engines and with the buffers. The scheduler can determine the first buffer according to the request queue corresponding to the first engine, realizing the one-to-one correspondence between the first engine and the first buffer, i.e. the first buffer is used to cache the instructions the first engine processes.
When the scheduler determines that the first engine will process the first instruction set, it sends the instruction pointer PC in the request to the first buffer; the first buffer fetches the instructions of the first instruction set according to the PC and sends them to the first engine for processing.
Of course, the first buffer may fail to hit the instructions of the first instruction set; in that case it can learn from the instruction memory IMEM to obtain them.
S404: the first instruction engine obtains the first instruction set from the first instruction buffer.
After the first buffer obtains the first instruction set according to the PC in the request, it can actively send the set to the first engine, so the first engine obtains it from the buffer for processing. It should be understood that the buffer and engine can be connected through a hardware interface so that the buffer can send the first instruction set to the engine, or the engine can fetch it from the buffer.
In addition, the multi-instruction-engine-based instruction processing method provided by the embodiments of this application may further include: when the first buffer detects the end indicator (EI) of the first instruction set, it sends scheduling information to the instruction block scheduler indicating that the first engine can process the next instruction processing request. The round robin scheduler RR in the instruction block scheduler can then take the next request from the first engine's request queue for the first engine to process.
When the first engine detects the EI of the first instruction set, it ends processing of the set and initiates a scheduling request to the output scheduler OS.
The output scheduler OS responds to the scheduling request and judges whether the whole program has finished executing; if so, it outputs the processing result to the subsequent module; otherwise, it sends the instruction processing request corresponding to the next block to be executed to the input scheduler IS for continued processing, and so on in a loop.
The foregoing introduces the multi-instruction-engine-based instruction processing method provided by the embodiments mainly from the processor's perspective. It can be understood that, to implement the above functions, the processor contains corresponding hardware structures and/or software modules for each function. Those skilled in the art should readily appreciate that, in combination with the network elements and algorithm steps of the examples described in the embodiments disclosed herein, this application can be implemented in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and design constraints of the technical solution; skilled artisans may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of this application.
The embodiments of this application may divide the processor into functional modules according to the foregoing method examples; for example, each functional module may correspond to one function, or two or more functions may be integrated into one processing module. The integrated module can be implemented in hardware or as a software functional module. It should be noted that the division of modules in the embodiments is illustrative and merely a logical functional division; other divisions are possible in actual implementation.
An embodiment of this application further provides a processor, whose structure may be as shown in FIG. 2 above. The processor includes an instruction block scheduler, an instruction cache group, and an instruction engine group; the cache group includes multiple instruction buffers and the engine group includes multiple instruction engines, in one-to-one correspondence. The scheduler is configured to receive an instruction processing request requesting the processor to process a first instruction set; to determine, according to the request, a first instruction engine, i.e. the engine in the engine group that processes the first instruction set and corresponds to a first instruction buffer in the cache group; and to send the request to the first buffer. The first engine is configured to obtain the first instruction set from the first buffer.
Optionally, the processor further includes multiple instruction processing request queues in one-to-one correspondence with the engines and with the buffers; the scheduler is configured to determine the first buffer according to the request queue corresponding to the first engine. In this way, the one-to-one correspondence between engines and buffers is realized through the request queues.
Specifically, the scheduler may be configured to: obtain, according to the request, candidate instruction engines for the first instruction set, i.e. engines that can be used to process it; and select one of them as the first instruction engine.
Optionally, the engine group may include a first candidate instruction engine group, and the scheduler is configured to take the engines of the first candidate group as the candidates if the first instruction set is on a non-performance path.
Optionally, the engine group may include a first candidate instruction engine group and a second instruction engine group, and the scheduler is configured to take the engines of the first candidate group or of the second group as the candidates if the first instruction set is on a performance path.
Further, the scheduler may be configured to: take the engines of the first candidate group as the candidates if a first condition is satisfied, the first condition being that at least one engine in the first candidate group has a request queue whose depth is below a first preset threshold; or take the engines of the second group as the candidates if a second condition is satisfied, the second condition being that the request queue depths of all engines in the first candidate group exceed the first preset threshold.
Further, the second group may include a second candidate group and a third candidate group; the scheduler may be configured to take the engines of the second candidate group as the candidates and, if a third condition is satisfied, add at least one engine of the third candidate group to the second candidate group, the third condition being that the second candidate group is empty or the request queue depths of all its engines exceed a second preset threshold.
Optionally, the scheduler may be configured to select from the third candidate group at least one engine whose request queue depth is below a preset threshold and add it to the second candidate group.
Optionally, the scheduler may also be configured to record an instruction engine selection difference and, if it exceeds a fourth preset threshold, delete all engines from the second candidate group; the selection difference indicates the difference between the number of times an engine is selected from the first candidate group and the number of times one is selected from the second candidate group.
Specifically, the scheduler may be configured to obtain the request queue depths of the candidates and select the candidate whose request queue depth is smallest as the first instruction engine.
In addition, the instruction buffer may also be configured to, upon detecting the end marker of the first instruction set, send scheduling information to the scheduler indicating that the first engine can process the next instruction processing request.
In the embodiments of this application, each instruction engine in the processor can have exclusive use of one instruction buffer, so the processor has a stable and deterministic fetch bandwidth, improving instruction execution efficiency. Furthermore, when a program executes on this processor, its instructions are divided into blocks that are assigned in program execution order to different IEs, which improves the utilization of processor resources and reduces copying of instructions between buffers, thereby reducing cost and power consumption.
Moreover, because the instruction buffers and engines are bound one-to-one in this processor, the processor needs no crossbar, and the instruction cache length of a single cache unit of a bound buffer can be set equal to the number of instructions the engine can process per instruction cycle, eliminating the Inst Q, reducing the complexity of the processing flow, and lowering the processor's cost and power consumption.
As shown in FIG. 5, an embodiment of this application further provides an electronic device. Referring to FIG. 5, the device includes a memory 501 and a processor 502. The memory 501 is used to store the device's program code and data; the processor 502 is used to control and manage the actions of the device shown in FIG. 5. The structure of the processor 502 may be the structure shown in FIG. 2 above; for example, it is specifically used to support the instruction processing in executing S401-S403 of the foregoing method embodiments and/or other processes of the techniques described herein. Optionally, the device of FIG. 5 may further include a communication interface 503 for supporting communication by the device.
The processor 502 may be a central processing unit, a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a processing chip, a field-programmable gate array or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It can implement or execute the various logic blocks, modules, and circuits described in connection with the disclosure of the embodiments of this application. The processor 502 may also be a combination implementing computing functions, e.g. one or more microprocessors, or a digital signal processor combined with a microprocessor. The communication interface 503 may be a transceiver, a transceiver circuit, a transceiver interface, or the like. The memory 501 may be a volatile or non-volatile memory, or the like.
For example, the communication interface 503, the processor 502, and the memory 501 are interconnected through a bus 504; the bus 504 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus 504 can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in FIG. 5, but this does not mean there is only one bus or one type of bus. Optionally, the memory 501 may be included in the processor 502.
Finally, it should be noted that the foregoing is merely specific implementations of this application, and the protection scope of this application is not limited thereto; any variation or replacement within the technical scope disclosed in this application shall be covered by the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (22)

  1. A multi-instruction-engine-based instruction processing method, applied to a processor, the processor comprising: an instruction block scheduler, an instruction cache group, and an instruction engine group, the instruction cache group comprising multiple instruction buffers, the instruction engine group comprising multiple instruction engines, and the multiple instruction buffers in the instruction cache group corresponding one-to-one to the multiple instruction engines in the instruction engine group;
    the method comprising:
    receiving, by the instruction block scheduler, an instruction processing request, the instruction processing request being used to request the processor to process a first instruction set;
    determining, by the instruction block scheduler according to the instruction processing request, a first instruction engine, the first instruction engine being the instruction engine in the instruction engine group that processes the first instruction set;
    sending, by the instruction block scheduler, the instruction processing request to a first instruction buffer corresponding to the first instruction engine; and
    obtaining, by the first instruction engine, the first instruction set from the first instruction buffer.
  2. The method according to claim 1, wherein determining the first instruction engine according to the instruction processing request comprises:
    obtaining, by the instruction block scheduler according to the instruction processing request, candidate instruction engines of the first instruction set, the candidate instruction engines being instruction engines that can be used to process the first instruction set; and
    selecting, by the instruction block scheduler, an instruction engine from the candidate instruction engines as the first instruction engine.
  3. The method according to claim 2, wherein the instruction engine group comprises a first candidate instruction engine group;
    obtaining the candidate instruction engines of the first instruction set according to the instruction processing request comprises:
    if the first instruction set is an instruction set on a non-performance path, taking, by the instruction block scheduler, the instruction engines in the first candidate instruction engine group as the candidate instruction engines of the first instruction set.
  4. The method according to claim 2, wherein the instruction engine group comprises a first candidate instruction engine group and a second instruction engine group;
    obtaining the candidate instruction engines of the first instruction set according to the instruction processing request comprises:
    if the first instruction set is an instruction set on a performance path, taking, by the instruction block scheduler, the instruction engines in the first candidate instruction engine group or in the second instruction engine group as the candidate instruction engines of the first instruction set.
  5. The method according to claim 4, wherein taking the instruction engines in the first candidate instruction engine group or the second instruction engine group as the candidate instruction engines of the first instruction set comprises:
    if a first condition is satisfied, taking, by the instruction block scheduler, the instruction engines in the first candidate instruction engine group as the candidate instruction engines of the first instruction set;
    the first condition being: in the first candidate instruction engine group, at least one instruction engine's corresponding instruction processing request queue has a queue depth below a first preset threshold;
    or,
    if a second condition is satisfied, taking, by the instruction block scheduler, the instruction engines in the second instruction engine group as the candidate instruction engines of the first instruction set;
    the second condition being: in the first candidate instruction engine group, the queue depths of the instruction processing request queues corresponding to all the instruction engines exceed the first preset threshold.
  6. The method according to any one of claims 4 to 5, wherein the second instruction engine group comprises a second candidate instruction engine group and a third candidate instruction engine group;
    taking the instruction engines in the second instruction engine group as the candidate instruction engines of the first instruction set comprises:
    taking, by the instruction block scheduler, the instruction engines in the second candidate instruction engine group of the second instruction engine group as the candidate instruction engines of the first instruction set; and
    if a third condition is satisfied, adding, by the instruction block scheduler, at least one instruction engine in the third candidate instruction engine group to the second candidate instruction engine group;
    the third condition being: the second candidate instruction engine group is empty, or in the second candidate instruction engine group, the queue depths of the instruction processing request queues corresponding to all the instruction engines exceed a second preset threshold.
  7. The method according to claim 6, wherein the instruction block scheduler selects, in the third candidate instruction engine group, at least one instruction engine whose corresponding instruction processing request queue has a queue depth below a third preset threshold, and adds it to the second candidate instruction engine group.
  8. The method according to claim 6 or 7, further comprising:
    recording, by the instruction block scheduler, an instruction engine selection difference; and
    if the instruction engine selection difference exceeds a fourth preset threshold, deleting, by the instruction block scheduler, all instruction engines in the second candidate instruction engine group;
    wherein the instruction engine selection difference indicates the difference between the number of times an instruction engine is selected from the first candidate instruction engine group and the number of times an instruction engine is selected from the second candidate instruction engine group.
  9. The method according to any one of claims 2 to 8, wherein selecting an instruction engine from the candidate instruction engines as the first instruction engine comprises:
    obtaining, by the instruction block scheduler, the queue depths of the instruction processing request queues corresponding to the candidate instruction engines; and
    selecting, by the instruction block scheduler, the candidate instruction engine corresponding to the instruction processing request queue with the smallest queue depth as the first instruction engine.
  10. The method according to any one of claims 1 to 9, further comprising:
    when the first instruction buffer detects an end marker of the first instruction set, sending, by the first instruction buffer, scheduling information to the instruction block scheduler, the scheduling information indicating that the first instruction engine can process a next instruction processing request.
  11. 一种处理器,其特征在于,所述处理器包括:指令块调度器、指令缓存组和指令引擎组,所述指令缓存组中包括多个指令缓存器,所述指令引擎组中包括多个指令引擎;所述指令缓存组中的多个指令缓存器与所述指令引擎组中的多个指令引擎一一对应;
    所述指令块调度器,用于接收指令处理请求;所述指令处理请求用于请求所述处理器处理第一指令集;
    所述指令块调度器,用于根据所述指令处理请求确定第一指令引擎;所述第一指令引擎为所述指令引擎组中处理所述第一指令集的指令引擎;
    所述指令块调度器,用于将所述指令处理请求发送给所述第一指令引擎对应的第一指令缓存器;
    所述第一指令引擎,用于从所述第一指令缓存器中获取所述第一指令集。
  12. 根据权利要求11所述的处理器,其特征在于,所述处理器还包括多个指令处理请求队列,所述多个指令处理请求队列与所述多个指令引擎一一对应,并且所述多个指令处理请求队列与所述多个指令缓存器一一对应;所述指令块调度器,用于根据所述第一指令引擎对应的指令处理请求队列,确定所述第一指令引擎对应的第一指令缓存器。
  13. 根据权利要求11所述的处理器,其特征在于,所述指令块调度器,具体用于:
    根据所述指令处理请求,获取所述第一指令集的备选指令引擎;所述备选指令引擎为可用于处理所述第一指令集的指令引擎;
    从所述备选指令引擎中选择指令引擎,作为所述第一指令引擎。
  14. 根据权利要求13所述的处理器,其特征在于,所述指令引擎组包括第一备选指令引擎组;所述指令引擎组具体用于,若所述第一指令集为非性能路径上的指令集,则将所述第一备选指令引擎组中的指令引擎,作为所述第一指令集的备选指令引擎。
  15. 根据权利要求13所述的处理器,其特征在于,所述指令引擎组包括第一备选指令引擎组和第二指令引擎组;所述指令块调度器具体用于,若所述第一指令集为性能路径上的指令集,则将所述第一备选指令引擎组或所述第二指令引擎组中的指令引擎,作为所述第一指令集的备选指令引擎。
  16. 根据权利要求15所述的处理器,其特征在于,所述指令块调度器具体用于:
    若满足第一条件,则将所述第一备选指令引擎组中的指令引擎,作为所述第一指令集的备选指令引擎;
    所述第一条件为:所述第一备选指令引擎组中,至少存在一个所述指令引擎对应的指令处理请求队列的队列深度低于第一预设阈值;
    或者,
    若满足第二条件,则将所述第二指令引擎组中的指令引擎,作为所述第一指令集的备选指令引擎;
    所述第二条件为:所述第一备选指令引擎组中,所有的所述指令引擎对应的指令处理请求队列的队列深度均超过所述第一预设阈值。
  17. 根据权利要求15至16任一项所述的处理器,其特征在于,所述第二指令引擎组包括第二备选指令引擎组和第三备选指令引擎组;所述指令块调度器具体用于:
    将所述第二指令引擎组中的第二备选指令引擎组中的指令引擎,作为所述第一指令集的备选指令引擎;
    若满足第三条件,则将所述第三备选指令引擎组中的至少一个指令引擎,添加至所述第二备选指令引擎组中;
    所述第三条件为:所述第二备选指令引擎组为空,或者所述第二备选指令引擎组中,所有的所述指令引擎对应的指令处理请求队列的队列深度,均超过第二预设阈值。
  18. 根据权利要求17所述的处理器,其特征在于,所述指令块调度器具体用于,选择第三备选指令引擎组中,指令引擎对应的指令处理请求队列的队列深度低于第三预设阈值的至少一个指令引擎,添加至所述第二备选指令引擎组中。
  19. 根据权利要求18所述的处理器,其特征在于,所述指令块调度器还用于:
    记录指令引擎选择差值;
    若所述指令引擎选择差值超过第四预设阈值,则将所述第二备选指令引擎组中的所有指令引擎删除;
    其中,所述指令引擎选择差值用于指示,从所述第一备选指令引擎组中选择指令引擎的次数,与从所述第二备选指令引擎组中选择指令引擎的次数的数量差。
  20. The processor according to any one of claims 13 to 19, wherein the instruction block scheduler is specifically configured to:
    obtain the queue depths of the instruction processing request queues corresponding to the candidate instruction engines; and
    select the candidate instruction engine corresponding to the instruction processing request queue with the smallest queue depth as the first instruction engine.
  21. The processor according to any one of claims 11 to 20, wherein the instruction cache is further configured to, upon detecting the end marker of the first instruction set, send scheduling information to the instruction block scheduler, the scheduling information indicating that the first instruction engine is able to process the next instruction processing request.
  22. An electronic device, wherein the electronic device comprises a processor and a memory coupled to the processor, the processor being the processor according to any one of claims 11 to 21.
PCT/CN2020/125404 2020-10-30 2020-10-30 Multi-instruction engine-based instruction processing method and processor WO2022088074A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
PCT/CN2020/125404 WO2022088074A1 (zh) 2020-10-30 2020-10-30 Multi-instruction engine-based instruction processing method and processor
CN202080106768.0A CN116635840A (zh) 2020-10-30 2020-10-30 Multi-instruction engine-based instruction processing method and processor
EP20959236.9A EP4220425A4 (en) 2020-10-30 2020-10-30 METHOD FOR PROCESSING INSTRUCTIONS BASED ON MULTIPLE INSTRUCTION ENGINES AND PROCESSOR
US18/309,177 US20230267002A1 (en) 2020-10-30 2023-04-28 Multi-Instruction Engine-Based Instruction Processing Method and Processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/125404 WO2022088074A1 (zh) 2020-10-30 2020-10-30 Multi-instruction engine-based instruction processing method and processor

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/309,177 Continuation US20230267002A1 (en) 2020-10-30 2023-04-28 Multi-Instruction Engine-Based Instruction Processing Method and Processor

Publications (1)

Publication Number Publication Date
WO2022088074A1 true WO2022088074A1 (zh) 2022-05-05

Family

ID=81381779

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/125404 WO2022088074A1 (zh) 2020-10-30 2020-10-30 Multi-instruction engine-based instruction processing method and processor

Country Status (4)

Country Link
US (1) US20230267002A1 (zh)
EP (1) EP4220425A4 (zh)
CN (1) CN116635840A (zh)
WO (1) WO2022088074A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8407432B2 (en) * 2005-06-30 2013-03-26 Intel Corporation Cache coherency sequencing implementation and adaptive LLC access priority control for CMP
CN108809854A (zh) * 2017-12-27 2018-11-13 北京时代民芯科技有限公司 Reconfigurable chip architecture for high-traffic network processing
CN110618966A (zh) * 2019-09-27 2019-12-27 迈普通信技术股份有限公司 Packet processing method and apparatus, and electronic device
CN111352711A (zh) * 2020-02-18 2020-06-30 深圳鲲云信息科技有限公司 Multi-compute-engine scheduling method, apparatus, device, and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015070789A1 * 2013-11-14 2015-05-21 Mediatek Inc. Task scheduling method and related non-transitory computer readable medium for dispatching task in multi-core processor system based at least partly on distribution of tasks sharing same data and/or accessing same memory address(es)
US11537851B2 (en) * 2017-04-07 2022-12-27 Intel Corporation Methods and systems using improved training and learning for deep neural networks


Also Published As

Publication number Publication date
US20230267002A1 (en) 2023-08-24
CN116635840A (zh) 2023-08-22
EP4220425A1 (en) 2023-08-02
EP4220425A4 (en) 2023-11-15

Similar Documents

Publication Publication Date Title
US11036556B1 (en) Concurrent program execution optimization
US10817184B2 (en) Control node for multi-core system
JP3801919B2 (ja) Queuing system for a processor in packet routing operations
US20190317802A1 (en) Architecture for offload of linked work assignments
CN108694089B (zh) 使用非贪婪调度算法的并行计算架构
EP1242883B1 (en) Allocation of data to threads in multi-threaded network processor
US9858241B2 (en) System and method for supporting optimized buffer utilization for packet processing in a networking device
US20090260013A1 (en) Computer Processors With Plural, Pipelined Hardware Threads Of Execution
US20170351555A1 (en) Network on chip with task queues
TW201543358A (zh) Method and system for task scheduling in a multi-chip system
TWI547870B (zh) Method and system for ordering I/O accesses in a multi-node environment
CN108351783A (zh) Method and apparatus for processing tasks in a multi-core digital signal processing system
JP2016195375A (ja) Method and apparatus utilizing multiple linked memory lists
TW201543218A (zh) Chip device and method for a multi-core network processor interconnect with multi-node connections
WO2017185285A1 (zh) Method and apparatus for allocating graphics processor tasks
US10932202B2 (en) Technologies for dynamic multi-core network packet processing distribution
US20150127864A1 (en) Hardware first come first serve arbiter using multiple request buckets
US9442759B2 (en) Concurrent execution of independent streams in multi-channel time slice groups
WO2022088074A1 (zh) Multi-instruction engine-based instruction processing method and processor
US9282051B2 (en) Credit-based resource allocator circuit
Faraji Improving communication performance in GPU-accelerated HPC clusters
US11915041B1 (en) Method and system for sequencing artificial intelligence (AI) jobs for execution at AI accelerators
CN118012788B (zh) Data processor, data processing method, electronic device, and storage medium
CN116841751B (zh) Policy configuration method and apparatus for a multi-task thread pool, and storage medium
JP2004086921A (ja) Multiprocessor system and method for executing tasks in a multiprocessor system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20959236; Country of ref document: EP; Kind code of ref document: A1)
WWE Wipo information: entry into national phase (Ref document number: 202080106768.0; Country of ref document: CN)
ENP Entry into the national phase (Ref document number: 2020959236; Country of ref document: EP; Effective date: 20230428)
NENP Non-entry into the national phase (Ref country code: DE)