CN109408118B - MHP heterogeneous multi-pipeline processor - Google Patents

Info

Publication number
CN109408118B
Authority
CN
China
Prior art keywords
task
pipeline
tasks
present application
instruction
Prior art date
Legal status
Active
Application number
CN201811144658.4A
Other languages
Chinese (zh)
Other versions
CN109408118A (en)
Inventor
古进 (Gu Jin)
Current Assignee
Individual
Original Assignee
Individual
Priority date
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN201811144658.4A
Publication of CN109408118A
Application granted
Publication of CN109408118B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3867Concurrent instruction execution, e.g. pipeline, look ahead using instruction pipelines

Abstract

The application relates to MHP (Multiple Heteroid Pipeline) heterogeneous multi-pipeline processors. The provided multi-pipeline processor comprises a first pipeline, a second pipeline, an instruction fetching unit, a data access unit and a task distributor; the first pipeline and the second pipeline share an instruction fetching unit and a data access unit, acquire instructions through the instruction fetching unit, and acquire data required by instruction execution through the data access unit; the first pipeline distributes tasks through the task distributor to the second pipeline, which processes the tasks acquired from the task distributor.

Description

MHP heterogeneous multi-pipeline processor
Technical Field
The present application relates to processor technology, and in particular, to processors with heterogeneous multi-pipelines.
Background
Modern processor cores typically have multi-stage pipelines. The execution of a processor instruction is divided into a number of pipeline stages, such as instruction fetch, decode, execute, memory access, and write back. Increasing the number of pipeline stages reduces the complexity of each stage, thereby enabling the processor core to operate at a higher clock frequency. The use of a multi-stage pipeline also increases the parallelism with which the processor processes instructions.
Multi-core/multi-threading techniques are also commonly used techniques to improve the parallelism of instruction processing by processors.
Some processors employ a multi-pipeline architecture. The processor core includes a plurality of pipelines, which are either homogeneous or heterogeneous. For example, Chinese patent publication number CN100557593C, entitled "Multi-pipeline processing system and integrated circuit incorporating the same," provides a processing system with multiple pipelines.
Disclosure of Invention
According to an embodiment of the present application, an MHP (Multiple Heteroid Pipeline, heterogeneous multi-pipeline) processor core architecture is provided. Programs may perceive and explicitly use the heterogeneous pipelines. Pipelines are used at task or function granularity, and programming is simple.
According to a first aspect of the present application, there is provided a first multi-pipeline processor according to the first aspect of the present application, comprising a first pipeline, a second pipeline, an instruction fetch unit, a data access unit and a task dispatcher; the first pipeline and the second pipeline share an instruction fetching unit and a data access unit, acquire instructions through the instruction fetching unit, and acquire data required by instruction execution through the data access unit; the first pipeline distributes tasks through the task distributor to the second pipeline, which processes the tasks acquired from the task distributor.
According to a first multi-pipeline processor of a first aspect of the present application, there is provided a second multi-pipeline processor according to the first aspect of the present application, comprising a plurality of first pipelines.
According to a first or second multi-pipeline processor of the first aspect of the present application, there is provided a third multi-pipeline processor according to the first aspect of the present application, comprising a plurality of second pipelines.
According to one of the first to third multi-pipeline processors of the first aspect of the present application, there is provided a fourth multi-pipeline processor according to the first aspect of the present application, wherein the number of pipeline stages of the first pipeline is greater than that of the second pipeline.
According to one of the first to fourth multi-pipeline processors of the first aspect of the present application, there is provided a fifth multi-pipeline processor according to the first aspect of the present application, wherein the first pipeline and the second pipeline have the same instruction set architecture.
According to one of the first to fifth multi-pipeline processors of the first aspect of the present application, there is provided a sixth multi-pipeline processor according to the first aspect of the present application, wherein the task dispatcher comprises one or more task memories for each of the second pipelines, the task dispatcher adding task descriptors to the task memories in response to an indication of the first pipeline.
According to a sixth multi-pipeline processor of the first aspect of the present application, there is provided a seventh multi-pipeline processor according to the first aspect of the present application, wherein the second pipeline fetches task descriptors from the corresponding task memories, fetches tasks and processes them according to the instruction of the task descriptors.
According to a sixth or seventh multi-pipeline processor of the first aspect of the present application, there is provided an eighth multi-pipeline processor according to the first aspect of the present application, wherein the processing result of a task completed by the second pipeline is added to a completed task memory; the first pipeline acquires the processing result of the completed task from the completed task memory.
According to one of the sixth to eighth multi-pipeline processors of the first aspect of the present application, there is provided a ninth multi-pipeline processor according to the first aspect of the present application, wherein the task descriptor indicates an entry address and/or a parameter of a code of the task.
According to one of the first to ninth multi-pipelined processors of the first aspect of the present application, there is provided a tenth multi-pipelined processor according to the first aspect of the present application, further comprising a first cache coupled to the first pipeline and caching data accessed by the first pipeline.
According to a tenth multi-pipeline processor of the first aspect of the present application, there is provided an eleventh multi-pipeline processor according to the first aspect of the present application, further comprising a second cache coupled to the one or more second pipelines and caching data of the one or more second pipelines.
According to a tenth or eleventh multi-pipeline processor of the first aspect of the present application, there is provided a twelfth multi-pipeline processor according to the first aspect of the present application, further comprising a first non-cacheable external data interface, wherein data accessed by the first pipeline through the first non-cacheable external data interface does not pass through the first cache.
According to one of the first to twelfth multi-pipeline processors of the first aspect of the present application, there is provided the thirteenth multi-pipeline processor according to the first aspect of the present application, wherein after the first pipeline provides the task to be dispatched to the task dispatcher, the first pipeline is not blocked and continues executing other instructions.
According to one of the first to thirteenth multi-pipeline processors of the first aspect of the present application, there is provided a fourteenth multi-pipeline processor according to the first aspect of the present application, wherein, in response to the task distributor indicating a task distribution failure to the first pipeline, the first pipeline processes the task whose distribution failed.
According to one of the first through thirteenth multi-pipeline processors of the first aspect of the present application, there is provided a fifteenth multi-pipeline processor according to the first aspect of the present application, wherein the first pipeline is coupled to a first general purpose register file; the second pipeline is coupled to a second general purpose register file; the first register file and the second register file each provide the general purpose registers of the instruction set architecture for the first pipeline or the second pipeline, respectively.
According to one of the first to fifteenth multi-pipeline processors of the first aspect of the present application, there is provided a sixteenth multi-pipeline processor according to the first aspect of the present application, further comprising one or more third pipelines; the first pipeline, the second pipeline and the third pipeline share the instruction fetching unit and the data access unit, acquire instructions through the instruction fetching unit, and acquire data required by instruction execution through the data access unit.
According to a sixteenth multi-pipeline processor of the first aspect of the present application, there is provided the seventeenth multi-pipeline processor according to the first aspect of the present application, wherein the first pipeline distributes tasks to the second pipeline or the third pipeline through the task distributor, and the second pipeline or the third pipeline processes the tasks acquired from the task distributor.
According to a sixteenth or seventeenth multi-pipeline processor of the first aspect of the present application, there is provided an eighteenth multi-pipeline processor according to the first aspect of the present application, wherein the first pipeline, the second pipeline and the third pipeline are heterogeneous; and the first pipeline, the second pipeline and the third pipeline have the same instruction set architecture.
According to an eighteenth multi-pipeline processor of the first aspect of the present application, there is provided a nineteenth multi-pipeline processor according to the first aspect of the present application, wherein the number of pipeline stages of the second pipeline is larger than that of the third pipeline.
According to one of the sixteenth through nineteenth multi-pipeline processors of the first aspect of the present application, there is provided a twentieth multi-pipeline processor according to the first aspect of the present application, wherein the first pipeline is coupled to a first general purpose register file; the second pipeline is coupled to a second register file; the third pipeline is coupled to a third register file; the first, second, and third register files each provide the general purpose registers of the instruction set architecture for the first, second, or third pipeline, respectively.
According to one of the sixteenth through twentieth multi-pipeline processors of the first aspect of the present application, there is provided a twenty-first multi-pipeline processor according to the first aspect of the present application, wherein each first pipeline is coupled to an instruction memory, a data memory and a branch prediction unit.
According to a twenty-first multi-pipeline processor of the first aspect of the present application, there is provided a twenty-second multi-pipeline processor according to the first aspect of the present application, wherein each first pipeline is further coupled to a respective instruction cache, a first non-cacheable external data interface and a data cache interface; the instruction cache is coupled with the instruction fetching unit through an external instruction access unit; the data cache interface is coupled to the first cache.
According to a twenty-first or twenty-second multi-pipeline processor of the first aspect of the present application, there is provided a twenty-third multi-pipeline processor according to the first aspect of the present application, wherein each second pipeline is coupled to a respective second external instruction access unit and a second non-cacheable external data interface; the second external instruction access unit is coupled to the instruction fetching unit; the second non-cacheable external data interface is coupled to the data access unit.
According to one of the twenty-first to twenty-third multi-pipeline processors of the first aspect of the present application, there is provided a twenty-fourth multi-pipeline processor according to the first aspect of the present application, wherein one or more third pipelines are coupled to a shared external instruction access unit; a shared external instruction access unit is coupled to the instruction fetch unit.
According to one of the twenty-first to twenty-fourth multi-pipeline processors of the first aspect of the present application, there is provided a twenty-fifth multi-pipeline processor according to the first aspect of the present application, wherein the one or more third pipelines are coupled to a shared third non-cacheable external data interface; a third non-cacheable external data interface couples the data access unit.
According to one of the twenty-first to twenty-fourth multi-pipeline processors of the first aspect of the present application, there is provided a twenty-sixth multi-pipeline processor according to the first aspect of the present application, wherein the one or more third pipelines are coupled to a third external instruction access unit; a third external instruction access unit is coupled to the instruction fetch unit.
According to one of the first to twenty-sixth multi-pipeline processors of the first aspect of the present application, there is provided the twenty-seventh multi-pipeline processor according to the first aspect of the present application, wherein a stall of any one of the first pipeline and the second pipeline does not affect operation of the other pipeline.
According to one of the sixteenth through twenty-seventh multi-pipeline processors of the first aspect of the present application, there is provided a twenty-eighth multi-pipeline processor according to the first aspect of the present application, wherein the third pipeline does not include a stack and does not process function calls.
According to one of the first to twenty-eighth multi-pipeline processors of the first aspect of the present application, there is provided a twenty-ninth multi-pipeline processor according to the first aspect of the present application, wherein the first pipeline is heterogeneous with the second pipeline.
According to a sixth or seventh multi-pipeline processor of the first aspect of the present application, there is provided a thirtieth multi-pipeline processor according to the first aspect of the present application, wherein the second pipeline adds an indication of a task whose processing is complete to the completed task memory; the first pipeline acquires the completed task according to the indication in the completed task memory.
According to a second aspect of the present application, there is provided a first task distribution method for a multi-pipeline processor according to the second aspect of the present application, comprising: executing instructions at the first pipeline to invoke the task distribution programming interface to instruct processing of the first task on the available pipelines; in response to the task distribution interface indicating a failure of the first task distribution, the first task is processed on the first pipeline.
According to a first task distribution method for a multi-pipeline processor according to a second aspect of the present application, there is provided a second task distribution method for a multi-pipeline processor according to the second aspect of the present application, further comprising: in response to the task distribution interface indicating that the first task distribution was successful, further instructions continue to be executed on the first pipeline or the task distribution programming interface is invoked to indicate that the second task is processed on the available pipeline.
According to a first or second task distribution method for a multi-pipeline processor according to a second aspect of the present application, there is provided a third task distribution method for a multi-pipeline processor according to the second aspect of the present application, further comprising: generating a task descriptor of a first task, wherein the task descriptor of the first task indicates an entry address of a task body of the first task and parameters for the first task; and providing the descriptor of the first task to the task distribution programming interface.
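Taken together, the first through third methods describe a small programming pattern: build a task descriptor, hand it to the task distribution programming interface, and fall back to local execution when distribution fails. The following C fragment is only a minimal sketch of that pattern under assumed names; task_descriptor, dispatch_task, process_packet and submit_packet are illustrative and are not defined by the present application.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative task descriptor: entry address of the task body and the
 * parameters (or their address), as described above. Field names are
 * assumptions made for this sketch only.                               */
typedef struct {
    void (*entry)(void *params);   /* entry address of the task body        */
    void *params;                  /* parameters, or address of parameters  */
    uint32_t task_id;              /* task identifier (TID)                 */
} task_descriptor;

/* Hypothetical task distribution programming interface: returns whether
 * the distribution succeeded.                                            */
bool dispatch_task(const task_descriptor *td);

void process_packet(void *params); /* task body, runnable on any pipeline   */

void submit_packet(void *pkt)
{
    task_descriptor td = { .entry = process_packet, .params = pkt, .task_id = 1 };

    if (dispatch_task(&td))
        return;                    /* success: the first pipeline continues
                                    * with other instructions, unblocked    */

    process_packet(pkt);           /* failure: process the task locally     */
}
```

The property sketched here is the one the methods describe: a successful dispatch leaves the first pipeline free to continue, while a failed dispatch degrades gracefully to local processing.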
According to one of the first to third task distribution methods for a multi-pipeline processor according to the second aspect of the present application, there is provided a fourth task distribution method for a multi-pipeline processor according to the second aspect of the present application, further comprising: executing instructions in the first pipeline to obtain, from the completion queue, the processing result of a completed task.
According to one of the first to fourth task distribution methods for a multi-pipeline processor according to the second aspect of the present application, there is provided a fifth task distribution method for a multi-pipeline processor according to the second aspect of the present application, further comprising: setting a first Task Identifier (TID) for the first task; the first task is added to a task packet having a first task Packet Identifier (PID).
According to a fifth task distribution method for a multi-pipeline processor according to the second aspect of the present application, there is provided a sixth task distribution method for a multi-pipeline processor according to the second aspect of the present application, further comprising: in response to the task packet of the first task Packet Identifier (PID) being added a specified number of tasks, distributing all tasks of the task packet of the first task Packet Identifier (PID) to the available pipelines.
According to a fifth task distribution method for a multi-pipeline processor according to the second aspect of the present application, there is provided a seventh task distribution method for a multi-pipeline processor according to the second aspect of the present application, further comprising: in response to the first task being added to the task packet having the first task Packet Identifier (PID), distributing the first task to the available pipeline.
According to one of the fifth to seventh task distribution methods for a multi-pipeline processor according to the second aspect of the present application, there is provided an eighth task distribution method for a multi-pipeline processor according to the second aspect of the present application, further comprising: the first task Packet Identifier (PID) is recovered in response to all tasks in the task packet of the first task Packet Identifier (PID) being processed.
According to one of the fifth to eighth task distribution methods for a multi-pipeline processor according to the second aspect of the present application, there is provided a ninth task distribution method for a multi-pipeline processor according to the second aspect of the present application, further comprising: a specified number of tasks of a task packet of a first task Packet Identifier (PID), a number of tasks that have been started to be processed, and/or a number of tasks that have been processed to be completed are recorded.
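The task packet bookkeeping of the fifth through ninth methods can be pictured as one small record per task Packet Identifier (PID). The C sketch below shows one possible layout under assumed names (task_packet, packet_ready_to_dispatch, packet_pid_reclaimable); it is an illustration, not a layout required by the present application.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative per-PID record; field names are assumptions. */
typedef struct {
    uint16_t pid;              /* task Packet Identifier (PID)             */
    uint16_t expected_tasks;   /* specified number of tasks in the packet  */
    uint16_t started_tasks;    /* tasks whose processing has started       */
    uint16_t completed_tasks;  /* tasks whose processing has completed     */
} task_packet;

/* Per the sixth method: the whole packet may be distributed once the
 * specified number of tasks has been added to it.                       */
static bool packet_ready_to_dispatch(const task_packet *p, uint16_t added_tasks)
{
    return added_tasks == p->expected_tasks;
}

/* Per the eighth method: the PID can be recovered (reused) once every
 * task in the packet has been processed.                                */
static bool packet_pid_reclaimable(const task_packet *p)
{
    return p->completed_tasks == p->expected_tasks;
}
```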
According to one of the first to ninth task distribution methods for a multi-pipeline processor according to the second aspect of the present application, there is provided a tenth task distribution method for a multi-pipeline processor according to the second aspect of the present application, further comprising: in response to the first pipeline not having resources to process the first task, executing an instruction at the first pipeline to invoke the task distribution programming interface to indicate that the first task is to be processed on an available pipeline.
According to one of the first to tenth task distribution methods for a multi-pipeline processor according to the second aspect of the present application, there is provided the eleventh task distribution method for a multi-pipeline processor according to the second aspect of the present application, wherein the available pipeline capable of processing the first task is identified according to the available capacity of the task memory associated with the pipeline, the indication of the task descriptor of the first task, and/or the resources of the pipeline.
According to an eleventh task distribution method for a multi-pipeline processor according to the second aspect of the present application, there is provided a twelfth task distribution method for a multi-pipeline processor according to the second aspect of the present application, further comprising: in response to invoking the task distribution programming interface, a task descriptor of the first task is added to a task memory of the available pipeline.
According to an eleventh or twelfth task distribution method for a multi-pipeline processor according to the second aspect of the present application, there is provided a thirteenth task distribution method for a multi-pipeline processor according to the second aspect of the present application, further comprising: in response to invoking the task distribution programming interface, a second pipeline is selected from the available pipelines to process the first task.
According to a thirteenth task distribution method for a multi-pipeline processor according to the second aspect of the present application, there is provided a fourteenth task distribution method for a multi-pipeline processor according to the second aspect of the present application, further comprising: instructions are executed in the second pipeline to fetch the first task and process the first task.
According to a fourteenth task distribution method for a multi-pipeline processor according to the second aspect of the present application, there is provided a fifteenth task distribution method for a multi-pipeline processor according to the second aspect of the present application, further comprising: instructions are executed in the second pipeline to retrieve the task descriptor of the first task from the task memory.
According to one of the thirteenth to fifteenth task distribution methods for a multi-pipeline processor according to the second aspect of the present application, there is provided a sixteenth task distribution method for a multi-pipeline processor according to the second aspect of the present application, further comprising: the second pipeline obtains instructions at the entry of the task body of the first task and executes the instructions to process the first task; and the processing result of the first task is written into the task memory.
According to one of the thirteenth to fifteenth task distribution methods for a multi-pipeline processor according to the second aspect of the present application, there is provided a seventeenth task distribution method for a multi-pipeline processor according to the second aspect of the present application, further comprising: the second pipeline obtains instructions at the entry of the task body of the first task and executes the instructions to process the first task; and the second pipeline adds the processed first task to a completed task memory.
According to one of the first to seventeenth task distribution methods for a multi-pipeline processor according to the second aspect of the present application, there is provided an eighteenth task distribution method for a multi-pipeline processor according to the second aspect of the present application, further comprising: in response to invoking the task distribution interface, the task distributor selects a second pipeline from the possible pipelines and adds the task descriptor of the first task to the task memory of the second pipeline.
According to an eighteenth task distribution method for a multi-pipeline processor according to the second aspect of the present application, there is provided a nineteenth task distribution method for a multi-pipeline processor according to the second aspect of the present application, further comprising: the task dispatcher indicates to the task distribution interface that dispatch of the first task succeeded, as a return value of invoking the task distribution interface.
According to an eighteenth or nineteenth task distribution method for a multi-pipeline processor according to the second aspect of the present application, there is provided a twentieth task distribution method for a multi-pipeline processor according to the second aspect of the present application, further comprising: in response to not finding an available pipeline, the task dispatcher indicates to the task distribution interface that dispatch of the first task failed, as a return value of invoking the task distribution interface.
According to one of the first to twentieth task distribution methods for a multi-pipeline processor according to the second aspect of the present application, there is provided the twenty-first task distribution method for a multi-pipeline processor according to the second aspect of the present application, further comprising: executing instructions at the first pipeline to invoke the task distribution programming interface to instruct processing of a third task on the available pipeline; selecting a third pipeline from the available pipelines to process a third task in response to invoking the task distribution programming interface; wherein the task body entry address of the third task is the same as the task body entry address of the first task.
According to a twenty-first task distribution method for a multi-pipeline processor according to the second aspect of the present application, there is provided a twenty-second task distribution method for a multi-pipeline processor according to the second aspect of the present application, wherein the address of the parameter of the third task is different from the address of the parameter of the first task; and the address of the processing result of the third task is different from the address of the processing result of the first task.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required by the embodiments or by the description of the prior art are briefly introduced below. It is evident that the drawings in the following description show only some of the embodiments described in the present application, and a person of ordinary skill in the art may obtain other drawings from these drawings.
FIG. 1 illustrates a block diagram of a heterogeneous multi-pipeline processor according to an embodiment of the present application.
FIG. 2A illustrates a schematic diagram of distributing tasks to a pipeline according to an embodiment of the present application;
FIG. 2B illustrates a schematic diagram of a pipeline submitting task processing results in accordance with yet another embodiment of the present application;
FIG. 3 illustrates a block diagram of a heterogeneous multi-pipeline processor core, according to yet another embodiment of the present application;
FIG. 4A illustrates a block diagram of a high performance pipeline according to an embodiment of the present application;
FIG. 4B illustrates a block diagram of a general pipeline according to an embodiment of the present application;
FIG. 4C illustrates a block diagram of a low power pipeline according to an embodiment of the present application;
FIG. 5A illustrates a function call schematic of a prior art processor;
FIG. 5B illustrates a function call schematic of a processor according to an embodiment of the present application;
FIG. 6 illustrates a timing diagram of a distribution task according to an embodiment of the present application;
FIG. 7 illustrates task descriptors according to an embodiment of the present application;
FIG. 8 illustrates a schematic diagram of tasks and task packages according to an embodiment of the present application; and
FIG. 9 illustrates a set of task package descriptors according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
FIG. 1 illustrates a block diagram of a heterogeneous multi-pipeline processor core, according to an embodiment of the present application.
A heterogeneous multi-pipeline processor core according to an embodiment of the present application includes a main pipeline 110 and one or more auxiliary pipelines (120, 122). The main pipeline 110 is coupled to the auxiliary pipelines through a task dispatcher 130. The main pipeline 110 uses the task dispatcher 130 to dispatch tasks to the auxiliary pipelines, and the auxiliary pipelines process the dispatched tasks.
Alternatively, the main pipeline 110 and the auxiliary pipelines (120, 122) have the same instruction set architecture (ISA, Instruction Set Architecture), so that the same program can be executed by the main pipeline 110 and by any of the auxiliary pipelines. This reduces the complexity of program development and compilation, and also reduces the complexity of the task distribution process.
Still alternatively, the main pipeline 110 and the auxiliary pipelines (120, 122) each have different instruction set extensions under the same instruction set architecture. For example, the main pipeline 110 executes instructions of both the 64-bit word length instruction set and the 32-bit word length instruction set to achieve better performance, while the auxiliary pipelines (120, 122) execute only instructions of the 32-bit word length instruction set. As yet another example, the main pipeline 110 supports all instruction set extensions of the instruction set architecture, while the auxiliary pipelines (120, 122) support only some of the instruction set extensions, e.g., only execute vector instructions and/or floating point instructions. Further, in one example, tasks to be dispatched are compiled with the instruction set extensions supported by both the main pipeline and the auxiliary pipelines (120, 122), so that both the main pipeline and the auxiliary pipelines can handle the dispatched tasks. In yet another example, tasks to be dispatched are compiled into two or more versions, e.g., a 32-bit instruction set extension version executed by the auxiliary pipelines (120, 122) and a 64-bit instruction set extension version executed by the main pipeline. Code is provided at the task's entry to examine the type or supported instruction set of the pipeline currently executing the task and to select the task version supported by the current pipeline to load and run.
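As a concrete illustration of the version selection performed by code at the task's entry, consider the C sketch below. The query current_pipeline_supports_64bit_extension() is hypothetical; how a pipeline exposes its type or supported instruction set (a control register, a pipeline identifier, etc.) is an implementation detail not fixed by this description.

```c
#include <stdbool.h>

/* Hypothetical capability query for the pipeline executing this code. */
bool current_pipeline_supports_64bit_extension(void);

void task_body_32(void *params);  /* version built for the 32-bit word length set */
void task_body_64(void *params);  /* version built for the 64-bit word length set */

/* Code placed at the task entry: select the version supported by the
 * current pipeline, so the same task may be dispatched to the main
 * pipeline or to an auxiliary pipeline.                                 */
void task_entry(void *params)
{
    if (current_pipeline_supports_64bit_extension())
        task_body_64(params);
    else
        task_body_32(params);
}
```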
Optionally, the main pipeline 110 has different performance than the auxiliary pipelines (120, 122). For example, the main pipeline 110 has more pipeline stages and the auxiliary pipelines have fewer pipeline stages; the operating clock frequency of the main pipeline 110 is higher than the operating clock frequency of the auxiliary pipelines.
Each auxiliary pipeline (120, 122) includes a task queue (denoted Q) (170, 172). Task dispatcher 130 adds tasks dispatched to auxiliary pipeline 120 to task queue 170, and tasks dispatched to auxiliary pipeline 122 to task queue 172. Each auxiliary pipeline takes tasks from the task queue coupled to itself and processes them. The main pipeline 110 is also coupled to a completion queue (174). Each auxiliary pipeline adds tasks whose processing is complete to completion queue 174. The main pipeline 110 retrieves the completed tasks from the completion queue 174.
It will be appreciated that the task queues (170, 172) are used to temporarily store tasks. The tasks in the task queue may be task descriptors representing the tasks, the tasks themselves including the task body, or other forms. A task memory may also be provided to store tasks in a storage arrangement other than a queue. Similarly, the completion queue (174) is used to temporarily store the processing results of tasks. The processing result of a task in the completion queue is, for example, a return value of the processed task, a task descriptor representing the task, a descriptor indicating the processing result of the task, and the like. A completed task memory other than a queue may also be provided to store tasks whose processing is complete.
Optionally, when an auxiliary pipeline processes a task, it stores the processing result of the completed task in a shared memory or cache accessible to both the auxiliary pipeline and the main pipeline. A descriptor indicating the task processing result is added to the completion queue 174. The main pipeline 110 polls the completion queue 174, or an interrupt is provided to the main pipeline 110 based on the completion queue 174 not being empty, to inform the main pipeline 110 that a task has been processed to completion. The main pipeline 110 obtains the descriptor indicating the task processing result from the completion queue 174 and accesses the task processing result in the shared memory according to that descriptor. The descriptor indicating the task processing result indicates, for example, success or failure of task processing. In response to the descriptor indicating that the task was processed successfully, the main pipeline 110 directly discards the descriptor without further processing. In response to the descriptor indicating that task processing failed, the main pipeline 110 performs error processing on the task indicated by the descriptor.
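A sketch in C of how the main pipeline might drain the completion queue 174 described above. result_descriptor, completion_queue_pop and handle_task_error are assumed names for the descriptor layout and the queue primitives; an interrupt-driven variant would run the same body from an interrupt handler instead of a polling loop.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative descriptor indicating the processing result of a task. */
typedef struct {
    uint32_t task_id;
    bool     success;      /* task processed successfully or not          */
    void    *result_addr;  /* location of the result in the shared memory */
} result_descriptor;

bool completion_queue_pop(result_descriptor *out);   /* assumed primitive */
void handle_task_error(const result_descriptor *rd); /* assumed primitive */

void drain_completion_queue(void)
{
    result_descriptor rd;

    while (completion_queue_pop(&rd)) {
        if (rd.success)
            continue;               /* success: simply discard the descriptor */
        handle_task_error(&rd);     /* failure: perform error processing      */
    }
}
```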
Still alternatively, completion queue 174 includes multiple sub-queues. Descriptors indicating the task processing results of successfully processed tasks are added to one sub-queue, while descriptors indicating the task processing results of tasks whose processing failed are added to another sub-queue.
Optionally, task descriptors are added to the task queue. The task descriptor indicates information such as an entry address, parameters, and/or task ID of the code of the task. Alternatively, both the main pipeline 110 and the auxiliary pipelines (120, 122) may access the complete memory address space, so that the code of the task is loaded according to the entry address in the task descriptor, and the task dispatcher 130 may dispatch the task to any auxiliary pipeline.
In some cases, the task distributor 130 fails to distribute a task. For example, the task queues of all the auxiliary pipelines have no free entries to accommodate a new task. Since each pipeline supports the same instruction set architecture, tasks, particularly tasks that failed to be dispatched, may also be handled by the main pipeline 110. In an alternative embodiment, in response to task dispatcher 130 indicating a failure to dispatch a task, the main pipeline 110 identifies the operating condition of the processor. If a burst of tasks within a short period causes task distribution to fail, but over a longer period the processor still has the capacity to process the arriving tasks, the task queue depth is increased, or the tasks to be distributed are buffered. Alternatively, in response to task dispatcher 130 indicating a failure to dispatch a task, the main pipeline 110 enables more auxiliary pipelines, enhancing task processing capacity.
In alternative embodiments, code executing in the pipeline can access the task queues (170, 172) and/or the completion queue 174. Thereby adding tasks to the task queues (170, 172), retrieving tasks from the task queues (170, 172), adding task processing results to the completion queue 174, and/or retrieving task processing results from the completion queue 174 by executing a program. Thus, the task distributor 130 may be omitted.
The heterogeneous multi-pipeline processor core according to the embodiments of the present application further includes an instruction fetch unit 140 and a data access unit 150. The main pipeline 110 and the auxiliary pipelines (120, 122) share an instruction fetch unit 140 and a data access unit 150. The pipeline loads instructions to be executed by instruction fetch unit 140 and reads or writes back data accessed by the instructions by data access unit 150.
Optionally, the main pipeline 110 and/or the auxiliary pipelines (120, 122) each have a priority. When there are multiple pipelines to load instructions and/or access data at the same time, instruction fetch unit 140 and/or data access unit 150 determines the order and/or bandwidth to service each pipeline based on priority.
A heterogeneous multi-pipeline processor core according to embodiments of the present application further includes a cache 160 and an optional cache 162. The cache 160 is dedicated to the main pipeline 110 and caches data accessed by the main pipeline. The optional cache 162 serves the auxiliary pipelines (120, 122) and caches data accessed by the auxiliary pipelines. In some examples, the heterogeneous multi-pipeline processor core does not include a cache 162, and data access requests of the auxiliary pipelines (120, 122) are processed directly by the data access unit 150.
By way of example, a heterogeneous multi-pipeline processor core according to embodiments of the present application is used in a network device. The network processor processes a large number of network messages simultaneously. Each auxiliary pipeline (120, 122) is suited to processing a single message, which is a simple task in which IO operations account for a relatively high proportion. The main pipeline 110 is suited to tasks with high computational complexity, such as protocol processing and quality-of-service control.
In still alternative embodiments, one or more auxiliary pipelines (120, 122) also use the task distributor to distribute tasks to the main pipeline 110, and the distributed tasks are processed by the main pipeline 110. The main pipeline 110 includes a task queue. Task dispatcher 130 adds tasks dispatched to the main pipeline 110 to the task queue of the main pipeline 110. The main pipeline 110 takes tasks from the task queue coupled to itself and processes them. The auxiliary pipelines (120, 122) are also coupled to completion queues. The main pipeline 110 adds a task whose processing is complete to the completion queue coupled to the auxiliary pipeline that issued the task. The auxiliary pipeline obtains the processed task from the completion queue. In some cases, for example when the task queue of the main pipeline has no free entry, the task dispatcher 130 fails to dispatch the task to the main pipeline 110. In response to the failure to dispatch the task to the main pipeline, the auxiliary pipeline that issued the task either processes the failed task itself, or instructs the task dispatcher to try to dispatch the task to the main pipeline again until the task is successfully dispatched to the main pipeline.
In still alternative embodiments, the main pipeline 110 includes, for example, a floating point processing unit, while the auxiliary pipelines (120, 122) do not. The main pipeline 110 is thus able to execute floating point instructions, and the auxiliary pipelines (120, 122) are not. When the main pipeline processes a floating point task, the task identifies, by executing code, that the pipeline has a floating point processing unit, and executes a code segment that uses the floating point processing unit. When an auxiliary pipeline processes a floating point task, the task identifies, by executing code, that no floating point processing unit is provided, and executes a code segment in which the floating point processing unit is replaced by integer arithmetic units. Alternatively, when an auxiliary pipeline processes a floating point task, the task identifies, by executing code, that the pipeline has no floating point processing unit, and the floating point processing task is dispatched to the main pipeline 110 for execution through the task dispatcher 130. In this way, tasks executing on the auxiliary pipeline also gain floating point processing capability.
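The capability check described in this paragraph can be sketched as follows; current_pipeline_has_fpu and dispatch_task_to_main are hypothetical primitives, and combining the soft-float fallback with re-dispatch in one entry function is just one possible arrangement.

```c
#include <stdbool.h>

bool current_pipeline_has_fpu(void);                              /* assumed */
bool dispatch_task_to_main(void (*entry)(void *), void *params);  /* assumed */

void fp_kernel_hw(void *params);    /* code segment that uses the FPU             */
void fp_kernel_soft(void *params);  /* replacement using integer arithmetic units */

/* Entry of a floating point task: use the FPU if the current pipeline has
 * one; otherwise try to hand the task back to the main pipeline, and fall
 * back to the integer replacement if that also fails.                      */
void fp_task_entry(void *params)
{
    if (current_pipeline_has_fpu())
        fp_kernel_hw(params);
    else if (!dispatch_task_to_main(fp_task_entry, params))
        fp_kernel_soft(params);
}
```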
Still optionally, an identifier of the pipeline requesting the dispatch of the task is also indicated in the task descriptor, so that the pipeline processing the task knows to which pipeline's completion queue the task processing result should be submitted. Still alternatively, the main pipeline 110 includes a vector unit, while the auxiliary pipelines (120, 122) do not. When an auxiliary pipeline processes a vector task, the task identifies, by executing code, that no vector unit is provided, and the vector task is distributed to the main pipeline 110 for execution through the task distributor 130.
Still optionally, one or more auxiliary pipelines include a floating point processing unit and/or a vector unit and are dedicated to processing floating point and/or vector tasks.
FIG. 2A illustrates a schematic diagram of distributing tasks to a pipeline according to an embodiment of the present application.
By way of example, the main pipeline 110 distributes tasks to the various auxiliary pipelines via the task distributor 230. Auxiliary pipelines (120, 122, ..., 127) are shown in FIG. 2A.
Optionally, the task distributor comprises a plurality of inlets. The inlets of the task distributor are coupled to one or more of the main pipeline and the auxiliary pipelines, so that each pipeline can distribute tasks to other pipelines through the task distributor.
Task dispatcher 230 assists the main pipeline 110 in completing task distribution. There is a task queue (270, 272, ..., 277) dedicated to each auxiliary pipeline. Adding a task to task queue 270 means distributing the task to auxiliary pipeline 120. Adding a task to task queue 272 means distributing the task to auxiliary pipeline 122. Adding a task to task queue 277 means distributing the task to auxiliary pipeline 127.
The task queue (270, 272, … …, 277) includes a plurality of entries. By way of example, each entry is of a size sufficient to accommodate a task descriptor. The task descriptor records information such as an entry address, parameters, and/or task ID of the code of the task. As a queue, the task dispatcher 230 adds task descriptors to the tail of the task queue, while the auxiliary pipeline obtains task descriptors from the head of the task queue and loads the codes and parameters corresponding to the tasks according to the instructions of the task descriptors. The task queue has a specified or configurable depth. The task dispatcher adds task descriptors to the task queue and also checks whether the task queue has space available to accommodate new task descriptors. Optionally, in response to adding a task descriptor to the task queue, the task dispatcher 230 also informs the pipeline that submitted the task of the success of task dispatch; in response to failure to add a task descriptor to the task queue, the task dispatcher 230 also informs the pipeline that submitted the task of the failure of task dispatch. In response to a task distribution failure, the pipeline may programmatically handle the task that failed distribution by itself. The pipeline sets its own Program Counter (PC) to the code entry address of the task that failed to be dispatched to process the task. Optionally, the main pipeline is also coupled to a task queue dedicated to itself, and tasks that fail in distribution are added to the task queue of the main pipeline.
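Expressed as software, the fixed-depth task queue and the two sides of its use (the task dispatcher appending at the tail, the auxiliary pipeline consuming from the head and jumping to the task's entry) might look like the C sketch below. The structure layout, TASK_QUEUE_DEPTH and the function names are assumptions made for illustration; in the processor the producer is the task dispatcher hardware and the consumer is a pipeline setting its program counter.

```c
#include <stdbool.h>
#include <stdint.h>

#define TASK_QUEUE_DEPTH 16u            /* specified or configurable depth */

/* Illustrative task descriptor held in one queue entry. */
typedef struct {
    void (*entry)(void *);              /* entry address of the task code   */
    void *params;                       /* parameters, or their address     */
    uint32_t task_id;
} task_descriptor;

typedef struct {
    task_descriptor slots[TASK_QUEUE_DEPTH];
    uint32_t head;                      /* auxiliary pipeline consumes here */
    uint32_t tail;                      /* task dispatcher produces here    */
} task_queue;

/* Dispatcher side: append at the tail only if a free entry exists, and
 * report success or failure to the pipeline that submitted the task.   */
static bool task_queue_push(task_queue *q, const task_descriptor *td)
{
    if (q->tail - q->head == TASK_QUEUE_DEPTH)
        return false;                   /* queue full: distribution fails  */
    q->slots[q->tail % TASK_QUEUE_DEPTH] = *td;
    q->tail++;
    return true;
}

/* Auxiliary-pipeline side: take the descriptor at the head and "set the
 * program counter" to the task's code entry, modelled here as an
 * indirect call receiving the task parameters.                          */
static void task_queue_serve_one(task_queue *q)
{
    if (q->head == q->tail)
        return;                         /* queue empty: nothing to do      */
    task_descriptor td = q->slots[q->head % TASK_QUEUE_DEPTH];
    q->head++;
    td.entry(td.params);                /* load and run the task code      */
}
```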
Optionally, the task distributor 230 has a configurable or programmable task distribution policy. For example, the task distributor distributes tasks randomly, round-robin, or weighted round-robin to the auxiliary pipelines. Each pipeline may be prioritized. Still alternatively, code executed by the main pipeline has already designated a target auxiliary pipeline for processing tasks at the time of distributing the tasks, and the task distributor 230 fills the tasks into task queues corresponding to the target auxiliary pipeline according to instructions of the main pipeline.
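As one example of a configurable or programmable distribution policy, the following C sketch selects a target auxiliary pipeline by weighted round-robin and skips pipelines whose task queues are full. The weights and the pipeline_has_free_entry query are assumptions for illustration, not part of the present application.

```c
#include <stdint.h>

#define NUM_AUX_PIPELINES 8

/* Assumed per-pipeline weights for the weighted round-robin policy. */
static const uint8_t weight[NUM_AUX_PIPELINES] = { 4, 4, 2, 2, 1, 1, 1, 1 };

int pipeline_has_free_entry(int p);   /* assumed query of pipeline p's task queue */

int select_target_pipeline(void)
{
    static int current = 0;
    static int credit  = 0;

    for (int tries = 0; tries < NUM_AUX_PIPELINES; tries++) {
        if (credit == 0) {
            current = (current + 1) % NUM_AUX_PIPELINES;
            credit  = weight[current];
        }
        if (pipeline_has_free_entry(current)) {
            credit--;
            return current;            /* distribute the task to this pipeline      */
        }
        credit = 0;                    /* queue full: move on to the next pipeline  */
    }
    return -1;                         /* no available pipeline: distribution fails */
}
```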
In response to the task descriptor being obtained from the task queue, the auxiliary pipeline sets a Program Counter (PC) with an entry address of a code of the task indicated by the task descriptor, and loads an instruction according to the indication of the program counter. In one example, the same tasks are distributed for each auxiliary pipeline, with the same code entry address. So that as one of the auxiliary pipelines loads the code of the task, the code is cached so that the code can be fetched from the cache when the other pipeline loads the code.
The task descriptor also indicates a parameter for the task, or an address where the parameter is stored. And the auxiliary pipeline acquires task parameters according to the instruction of the task descriptor and processes the task. Alternatively, the task parameters distributed to the auxiliary pipelines may be the same or different, even for tasks having the same code, so that the auxiliary pipelines process different data packets in parallel with the same code. Still alternatively, the parameters of the task are read-only or updateable. The same read-only parameters distributed to the auxiliary pipelines may store only a single instance and be shared by the auxiliary pipelines. The updatable parameters distributed to each auxiliary pipeline provide each auxiliary pipeline with an instance of the updatable parameters so that the updating of the parameters by each pipeline is unaffected by the other pipelines.
According to embodiments of the present application, each of the main pipeline and the auxiliary pipeline has a complete instruction set architecture register set. Each pipeline maintains a stack specific to itself.
It will be appreciated that, in addition to distributing tasks to the auxiliary pipelines in the form of task queues, there are other ways to transfer information from the main pipeline to the auxiliary pipelines, for example through shared memory or through dedicated/shared data channels. Optionally, the task queues of an auxiliary pipeline include multiple instances with different priorities, e.g., a high priority queue, a normal priority queue, and a low priority queue. The task dispatcher selects the task queue to be filled with task descriptors according to a specified or configured policy. The auxiliary pipeline selects a task queue according to a specified or configured policy and retrieves a task descriptor from it.
The auxiliary pipeline fills the completion queue (e.g., completion queue 174 of FIG. 1) with the processing result of the task. The entry of the completion queue indicates a task ID, a return value of the task, success or failure of task processing, and the like. The main pipeline 110 obtains the processing result of the task from the completion queue. Optionally, the main pipeline 110 redistributes tasks that failed to be processed.
FIG. 2B illustrates a schematic diagram of a pipeline submitting task processing results, according to yet another embodiment of the present application.
By way of example, the auxiliary pipelines (120, 122, ..., 127) submit task processing results to the main pipeline 110 via the task reclaimer 280.
Optionally, the task reclaimer comprises a plurality of outlets. The outlets of the task reclaimer are coupled to one or more of the main pipeline and the auxiliary pipelines. Therefore, each pipeline can acquire task processing results submitted by other pipelines through the task reclaimer.
The task reclaimer 280 assists the auxiliary pipelines (120, 122, ..., 127) in submitting task processing results. There is a completion queue (290, 292, ..., 297) dedicated to each auxiliary pipeline. Adding the task processing result to the completion queue (290, 292, ..., 297) means that the auxiliary pipeline has completed the submission of the task processing result.
The completion queue (290, 292, ..., 297) includes a plurality of entries. By way of example, each entry is sufficiently large to accommodate a descriptor (simply referred to as a processing result descriptor) indicating the processing result of a task. The processing result descriptor records a task ID, a return value of the task, success or failure of task processing, and the like. Optionally, the processing result descriptor also indicates the pipeline that receives the processing result of the task (e.g., the pipeline that submitted the task). Still optionally, the processing result descriptor also indicates the shared memory address where the task processing result is stored.
The task reclaimer 280 adds the processing result descriptor to the tail of the completion queue, and the main pipeline 110, for example, obtains the processing result descriptor from the head of the completion queue and obtains the processing result of the task according to the indication of the processing result descriptor.
Alternatively, in response to the processing result of a task indicating failure, the main pipeline 110 redistributes the task whose processing failed through the task distributor 230.
Optionally, task reclaimer 280 includes an arbiter 285. Arbiter 285 selects a completion queue and provides the processing result descriptor of the selected completion queue to the main pipeline 110. For example, arbiter 285 provides the processing result descriptor to the main pipeline 110 by raising an interrupt to the main pipeline 110.
Optionally, the arbiter 285 has a configurable or programmable arbitration policy. For example, the arbiter selects a completion queue randomly, in round-robin order, or in weighted round-robin order. Each completion queue may be given a priority. Still alternatively, the arbiter provides processing result descriptors to a plurality of pipelines.
FIG. 3 illustrates a block diagram of a heterogeneous multi-pipeline processor core, according to yet another embodiment of the present application.
The heterogeneous multi-pipeline processor core of the embodiment of FIG. 3 includes three kinds of pipelines: a high-performance pipeline 310, one or more general pipelines (320, 322), and one or more low-power pipelines (324, 325, 326, 327). It will be appreciated that, in embodiments according to the present application, a processor core may include other kinds of pipelines, and each kind may be present in various numbers.
The high-performance pipeline 310, the normal pipelines (320, 322) and the low-power pipelines (324, 325, 326, 327) have the same instruction set architecture (ISA, Instruction Set Architecture), so that the same program can be executed by the high-performance pipeline 310 and by any of the other pipelines. Still alternatively, the high-performance pipeline 310, the normal pipelines (320, 322), and the low-power pipelines (324, 325, 326, 327) each have different instruction set extensions under the same instruction set architecture.
In the embodiment of FIG. 3, the performance of the high-performance pipeline is higher than that of the normal pipeline, and the performance of the normal pipeline is higher than that of the low-power pipeline. For example, the high-performance pipeline has the highest clock frequency and the most pipeline stages, the low-power pipeline has the lowest clock frequency and the fewest pipeline stages, and the clock frequency and number of pipeline stages of the normal pipeline lie in between. The high-performance pipeline 310, the one or more normal pipelines (320, 322) and the one or more low-power pipelines (324, 325, 326, 327) are thus each suited to handling different types of tasks. For example, a single network packet is processed by a low-power pipeline. The task of processing a network packet is relatively simple, occurs frequently, and involves data send/receive operations. If network packets were processed by the high-performance pipeline, the transmission/reception of packets and the switching between packets would cause a large number of pipeline stalls, so that the processing capacity of the pipeline would be difficult to utilize fully. When each of the plurality of low-power pipelines processes a network packet instead, pipeline stalls during processing affect only the current low-power pipeline and not the other pipelines, so that frequent context switches or other pipeline stalls are not caused. This facilitates full utilization of processor core resources (e.g., pipelines, instruction fetch unit bandwidth, data access unit bandwidth, caches, etc.) and results in higher overall processing performance.
The high-performance pipeline 310 distributes tasks to the other pipelines through a task distributor 330. The task dispatcher 330 adds tasks to the task queues coupled to the other pipelines. The other pipelines acquire tasks from the task queues and process them. Optionally, the high-performance pipeline is further coupled to a completion queue; the other pipelines add tasks whose processing is complete to the completion queue, and the high-performance pipeline obtains information about the completed tasks from it. In some cases, in response to a task distribution failure, the high-performance pipeline 310 itself processes tasks that were not successfully dispatched.
The heterogeneous multi-pipeline processor core according to the embodiment of FIG. 3 further includes an instruction fetch unit 340 and a data access unit 350. The high performance pipeline 310 shares instruction fetch unit 340 and data access unit 350 with one or more normal pipelines (320, 322) and one or more low power pipelines (324, 325, 326, 327).
Optionally, the high performance pipeline 310 and the one or more normal pipelines (320, 322) and the one or more low power pipelines (324, 325, 326, 327) each have priority. When there are multiple pipelines to load instructions and/or access data at the same time, instruction fetch unit 340 and/or data access unit 350 determine the order and/or bandwidth to service each pipeline based on priority.
The heterogeneous multi-pipeline processor core according to the embodiment of FIG. 3 also includes a cache 360 and an optional cache 362. The cache 360 is dedicated to the high-performance pipeline 310. The optional cache 362 serves the one or more general pipelines (320, 322) and the one or more low-power pipelines (324, 325, 326, 327). Optionally, data coherency is not provided between cache 360 and cache 362, thereby reducing the complexity of the cache system and ensuring the performance of cache 360. Still alternatively, caches 360 and 362 provide data coherency only for specified address spaces, so that the high-performance pipeline, the normal pipelines, and the low-power pipelines can efficiently exchange data through the specified address spaces. Still alternatively, the cache 362 provides independent cache space for each of the one or more normal pipelines (320, 322) and the one or more low-power pipelines (324, 325, 326, 327), without providing data coherency, or providing data coherency only for a specified address space.
The high-performance pipeline 310 has a larger instruction cache and/or data cache and a branch prediction unit, which reduces its bandwidth demands on the instruction fetch unit 340 and/or the data access unit 350; the bandwidth saved is then available to the normal pipelines (320, 322) and/or the low-power pipelines (324, 325, 326, 327), so that the instruction fetch unit 340 and/or data access unit 350 can be fully utilized.
In an alternative embodiment, the operating system's process-scheduling tasks are handled by a normal pipeline (320, 322). The code segments (tasks) of the operating system that manage process scheduling run in a normal pipeline (320, 322) and distribute processes to the high-performance pipeline, normal pipelines, and/or low-power pipelines for processing. A process handled by the high-performance pipeline may include a task-scheduling code segment, which the high-performance pipeline executes to distribute tasks to the normal and/or low-power pipelines. The low-power pipelines are used only to process tasks, not to dispatch them.
FIG. 4A illustrates a block diagram of a high performance pipeline according to an embodiment of the present application.
Compared with the other pipeline types, the high-performance pipeline has the largest number of pipeline stages (410), so it can operate at the relatively highest clock frequency. The pipeline stages (410) of the high-performance pipeline provide features such as multi-issue and out-of-order execution to achieve higher instruction processing performance.
The high-performance pipeline also includes the complete general-purpose registers 412 of the instruction set architecture, a data memory 420, a data cache interface 426, an NC-EDI (non-cacheable external data interface) 424, an instruction memory 414, an instruction cache/branch prediction unit 416, and an external instruction access unit 418. The general-purpose registers 412, data memory 420, data cache interface 426, NC-EDI 424, instruction memory 414, instruction cache/branch prediction unit 416, and external instruction access unit 418 are exclusive to the high-performance pipeline. It will be appreciated that prior-art processor features for improving performance are applicable to the high-performance pipeline.
An instruction memory (I-RAM) 414 is used to store instructions, and a data memory (D-RAM) 420 is used to store data. The instruction memory 414 and data memory 420 are coupled to the pipeline stages of the high-performance pipeline and have high bandwidth and low latency relative to external memory. The instruction memory 414 and the data memory 420 are, for example, visible to the instruction set architecture of each pipeline, and instructions may directly reference addresses in the instruction memory 414 and/or data memory 420. The data memory 420 stores, for example, the stack and the variables used during task processing. Optionally, the stack is stored in a memory shared by the pipelines.
The instruction cache/branch prediction unit 416 provides, for example, caching and/or branch prediction functions that are not visible to the instruction set architecture. Optionally, the instruction cache/branch prediction unit 416 has a small capacity and holds only a specified amount of instructions; the complete program is stored in external memory, and the portions of the program not present in the cache are fetched from external memory by the external instruction access unit 418.
The data cache interface 426 is coupled to, for example, the cache 360 (see also FIG. 3), providing a cache that is, for example, invisible to the instruction set architecture and that holds data used by instructions.
The NC-EDI (non-cacheable external data interface) 424 provides access to externally stored data and ensures that the accessed data is not cached.
FIG. 4B illustrates a block diagram of a normal pipeline according to an embodiment of the present application.
The normal pipeline has an intermediate number of pipeline stages (430), so it can operate at an intermediate clock frequency. Optionally, the pipeline stages (430) of the normal pipeline do not support features such as multi-issue or out-of-order execution, in order to reduce complexity and power consumption. The normal pipeline also does not include a branch prediction unit.
The normal pipeline includes the complete general-purpose registers 432 of the instruction set architecture. The normal pipeline also includes an external instruction access unit 438 and an NC-EDI (non-cacheable external data interface) 444. Alternatively or additionally, the normal pipeline also includes a data memory 440 and a data cache interface 446. The capacity of the data memory 440 may be configured to suit a variety of applications. The data cache interface 446 is coupled to, for example, the cache 360. The capacity of the data cache provided through the data cache interface 446 may be configured.
In alternative embodiments, the normal pipeline does not include an instruction memory 434 and/or an instruction cache 436. It will be appreciated that the instruction memory 434 and/or instruction cache 436 may also be provided for the normal pipeline, and/or their capacities may be configured, to improve performance and meet the needs of different applications.
Still alternatively, the data memory 440 of the normal pipeline has a smaller capacity than the data memory 420 of the high-performance pipeline, the instruction cache 436 has a smaller capacity than the instruction cache/branch prediction unit 416, and the instruction memory 434 has a smaller capacity than the instruction memory 414.
FIG. 4C illustrates a block diagram of a low power pipeline according to an embodiment of the present application.
The low power pipeline is optimized for reduced power consumption and/or for handling a large number of concurrent simple tasks, such as processing network packets.
The low power pipeline includes a minimum number of pipeline stages (450, 470, 490) so that the low power pipeline can operate at a lower clock frequency. The pipeline stages (450, 470, 490) of the low power pipeline do not support characteristics such as multi-issue, out-of-order execution, etc., to reduce complexity and power consumption. The low power pipeline also does not include a branch prediction unit.
Within the same processor core, multiple low power pipelines may have the same or different configurations.
Referring to FIG. 4C, the pipeline stages (450, 470, 490) of the low-power pipelines share an external instruction access unit 458. The shared external instruction access unit 458 serves the pipeline stages (450, 470, 490) of the low-power pipelines to which it is coupled, for example in a round-robin manner.
Optionally, a shared instruction memory 454 and/or a shared instruction cache 456 are provided for the low power pipeline, and/or the capacity of the shared instruction memory 454 and/or the shared instruction cache 456 is configured to improve performance and meet the needs of different applications.
The low power pipeline includes complete general purpose registers (452, 472, 492) for the instruction set architecture.
In the embodiment of FIG. 4C, the pipeline stages (450 and 470) of the low-power pipelines share an NC-EDI (non-cacheable external data interface) 464. The shared NC-EDI 464 serves the pipeline stages (450 and 470) of the low-power pipelines to which it is coupled, for example in a round-robin manner. Optionally, the pipeline stages (450 and 470) of the low-power pipelines share a data memory 460 and a data cache interface 466.
The capacity of the data memory 460 and/or of the data cache used through the data cache interface 466 may be configured to suit a variety of applications. The pipeline stage (490) of the low-power pipeline has exclusive use of an NC-EDI (non-cacheable external data interface) 496. Optionally, the pipeline stage (490) of the low-power pipeline also has exclusive use of a data memory 494 and a data cache 498.
Still alternatively, the data memories (460, 494) of the low-power pipelines have smaller capacities than the data memory 440 of the normal pipeline, the shared instruction cache 456 has a smaller capacity than the instruction cache 436, and the shared instruction memory 454 has a smaller capacity than the instruction memory 434.
In alternative embodiments, the low power pipeline (450, 470, 490) does not include instruction memory, instruction cache, data memory, and/or data cache interfaces.
Because the low-power pipeline has few pipeline stages, the cost of a context switch is low, which makes it well suited to handling interrupt-type tasks in place of the high-performance pipeline. The low-power pipeline is also suited to running daemon tasks that manage task scheduling on the high-performance and/or normal pipelines and monitor their running state in real time, and to managing low-speed peripherals such as serial ports. By providing multiple low-power pipelines that process simple, numerous tasks such as network packets in parallel, the packet throughput of the processor core is improved and the data access units of the processor core are fully utilized. Multiple low-power pipelines handle large numbers of concurrent tasks, obtaining processing capacity through the number of pipelines; for the same chip area, the combined processing capacity of multiple low-power pipelines can exceed, for example, that of a high-performance or normal pipeline. The low-power pipeline is also suited to tasks with low computational demands, many memory-access operations, or many branch jumps: such tasks keep a pipeline frequently waiting, so their performance is hard to exploit on a complex pipeline, and the simplicity of the low-power pipeline makes it a better fit for them.
Optionally, a normal pipeline and/or a low power pipeline is used to perform the daemon's tasks.
In yet another embodiment, the low-power pipeline does not use private memory as a stack, or does not support stack operations at all, to further simplify the low-power pipeline. In such a stack-less configuration, programs running on the low-power pipeline do not support function calls and do not respond to interrupts.
FIG. 5A shows a function call schematic of a prior-art processor.
Taking a Main function 510 as an example, the Main function executes to call a function 520. In FIG. 5A, solid black arrows indicate the logical order in which the processor pipeline executes code.
In response to the Main function 510 calling the function 520, the push (context save) code segment 512 is executed first to save the Main function's context. Next, the code segments of function 520 are executed. Before the execution of function 520 completes, the pop-stack code segment 512 is executed to restore the saved context of the Main function, and execution then returns to the Main function 510. During function calls, these context save and restore operations occur frequently and reduce the execution efficiency of the functions.
FIG. 5B illustrates a function call schematic of a processor according to an embodiment of the present application.
Referring also to FIG. 1, for example, the Main pipeline 110 runs the Main function 540, while the auxiliary pipelines (120, 122) run the task code segments 550. Optionally, the task code segment 550 is packaged as a "function" such that the Main function 540 distributes tasks to the auxiliary pipelines (120, 122) by, for example, calling a function, and the auxiliary pipelines (120, 122) process tasks by executing the task code segment 550.
To distribute tasks, the Main function 540 executes a distribute-task code segment 542 that adds task descriptors to the task queue (see also FIGS. 1 and 2). Tasks distributed by the Main function 540 are, for example, asynchronous with respect to the Main function 540: the Main function 540 adds the task descriptors to the task queue and can then continue executing without waiting for the tasks to be processed. Optionally, the distribute-task code segment 542 operates the task distributor 130 to add the task descriptors to the task queue. Still alternatively, if the distribute-task code segment 542 fails to distribute a task, the Main function 540 processes the task that failed to be distributed.
The auxiliary pipelines (120, 122) fetch task descriptors from the task queue by executing a fetch-task code segment (552), and then execute the task code segment 550 indicated by the task descriptor to process the task. The auxiliary pipelines (120, 122) execute the fetch-task code segment (552) in response to an interrupt or under specified conditions. The task descriptors that the auxiliary pipelines (120, 122) obtain from the task queue are those the Main function 540 added by executing the distribute-task code segment 542.
In one example, the task code segment 550 provides the task processing results to the Main function. The task code segment 550 fills the completion queue by executing a code segment (552) that submits the task processing results. Submitting the results is also, for example, asynchronous with respect to the task code segment 550: after filling the completion queue, the auxiliary pipeline continues to fetch and execute tasks without waiting for the Main function 540 to read the results from the completion queue. Under the direction of the Main function 540, the main pipeline obtains the task processing results from the completion queue by executing a code segment (548), which it executes in response to an interrupt or under specified conditions.
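Continuing the hypothetical task_queue_t sketch above, the fetch / execute / submit flow of an auxiliary pipeline might be modeled as follows (the code segment numbers 552 and 548 are those of FIG. 5B; the C function names are assumptions):

/* Code segment 552 analogue: take one task descriptor from the task queue. */
static bool fetch_task(task_queue_t *q, task_desc_t *out)
{
    if (q->head == q->tail)
        return false;                          /* nothing queued */
    *out = q->slots[q->head % QUEUE_DEPTH];
    q->head++;
    return true;
}

/* Auxiliary pipeline: run queued tasks and, optionally, fill the completion queue. */
static void auxiliary_pipeline_run(task_queue_t *in, task_queue_t *done)
{
    task_desc_t d;
    while (fetch_task(in, &d)) {               /* entered on interrupt or under specified conditions */
        d.entry(d.params);                     /* execute the task code segment */
        dispatch_task(done, d);                /* submit the result; the Main function reads it later (548) */
    }
}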
In yet another example, the task processing results of the task code segment 550 need not be reported to the Main function 540. The auxiliary pipelines (120, 122) then do not execute the code segment (552), and the main pipeline 110 does not execute the code segment (548).
In accordance with embodiments of the present application, the Main function 540 is asynchronous with the task code segment 550, and the Main function 540 does not have to wait for the task code segment 550 to be executed to completion.
While FIG. 5B illustrates a Main function 540 and a task code segment 550, it is understood that one or more pipelines of a processor core each run a Main function that distributes tasks, while one or more pipelines of a processor core each fetch tasks from a task queue and process them. In addition to the Main function, other functions can distribute tasks.
FIG. 6 illustrates a timing diagram of task distribution according to an embodiment of the present application.
Referring also to FIGS. 1 and 5B, the pipeline 110 executes the Main function 540 (610) and dispatches task A (612), for example by invoking the distribute-task code segment 542. The distribute-task code segment 542, for example, generates a task descriptor and provides it to the task distributor 130.
The task distributor 130 performs the distribution of task A (620). For example, the task distributor 130 selects one of the auxiliary pipelines (120, 122) and adds the task descriptor to the task queue associated with the selected auxiliary pipeline (e.g., the auxiliary pipeline 120). In response to the task descriptor being added to the task queue associated with the auxiliary pipeline 120, the task distributor 130 indicates to the main pipeline 110 that task A was successfully dispatched to the auxiliary pipeline 120 (622). The pipeline 110 obtains the result of distributing task A (task A dispatched successfully) provided by the task distributor 130, for example through the distribute-task code segment 542 (614). The Main function 540 executed by the pipeline 110 then, for example, continues to run, and task B is dispatched (650).
As the task distributor 130 adds the task descriptor to the task queue associated with the auxiliary pipeline 120, the code segment (552) executing on the auxiliary pipeline 120 retrieves task A, indicated by the task descriptor, from the task queue (630). Based on the acquired task descriptor, the auxiliary pipeline 120 executes the task code segment 550 to execute task A (632). Optionally, the auxiliary pipeline 120 updates the completion queue with the result of executing task A (634).
In response to the pipeline 110 distributing task B through, for example, the code segment (542), the task distributor 130 performs the distribution of task B (624). For example, the task distributor 130 selects the auxiliary pipeline 122 to process task B because the number of task descriptors in the task queue associated with the auxiliary pipeline 122 is smaller than the number in the task queue associated with the auxiliary pipeline 120, and adds the task descriptor of task B to the task queue associated with the auxiliary pipeline 122. In response to the task descriptor being added to the task queue associated with the auxiliary pipeline 122, the task distributor 130 indicates to the main pipeline 110 that task B was successfully dispatched to the auxiliary pipeline 122 (626). The pipeline 110 obtains the result of distributing task B (task B dispatched successfully) provided by the task distributor 130, for example through the distribute-task code segment 542 (652). The Main function 540 executed by the pipeline 110 then, for example, continues to run, and task C is dispatched (654).
The code segment (552) executing on the auxiliary pipeline 122 retrieves task B, indicated by the task descriptor, from the task queue (640). Based on the acquired task descriptor, the auxiliary pipeline 122 executes the task code segment 550 to execute task B (642). Optionally, the auxiliary pipeline 122 updates the completion queue with the result of executing task B (644).
In response to the pipeline 110 distributing task C, the task distributor 130 performs the distribution of task C (628). For example, the task distributor 130 finds that none of the auxiliary pipelines (120 and 122) can accept more tasks, and indicates the dispatch failure of task C to the main pipeline (629). The pipeline 110 obtains the result of distributing task C (task C dispatch failed) provided by the task distributor 130, for example through the distribute-task code segment 542 (655). Task C is then performed by, for example, the Main function 540 executing on the pipeline 110 (656). After task C is performed, the pipeline 110 resumes performing other tasks (658).
Optionally, the pipeline 110 also accesses the completion queue in response to an interrupt, periodically, or under other specified conditions, to obtain, for example, the execution results of task A and/or task B. It will be appreciated that after the auxiliary pipeline 120 updates the completion queue with the execution result of task A (634), or after the auxiliary pipeline 122 updates the completion queue with the execution result of task B (644), the Main function 540 running on the main pipeline 110 does not need to access the completion queue immediately; instead, it accesses the completion queue when the main pipeline 110 is idle or at an appropriate time, to increase execution efficiency.
FIG. 7 illustrates task descriptors according to an embodiment of the present application.
Task descriptor 710 is the task descriptor of task T1, and task descriptor 720 is the task descriptor of task T2. Function T1(A, P) represents task T1, while function T2(A, P) represents task T2. Task T1 or task T2 is distributed in code by calling function T1 or function T2, respectively.
Taking function T1 (a, P) as an example, parameter a indicates the entry pointer of the task, while parameter P indicates the pointer of the parameter set of the task. Alternatively, the function representing the task includes more or fewer parameters.
Task descriptor 710 optionally indicates the name (T1) and parameter list (A and P) of the function T1(A, P) representing task T1. Task descriptor 710 also includes a task entry pointer 712 and a task parameter set pointer 714; task descriptor 720 likewise includes a task entry pointer 722 and a task parameter set pointer 724. The task entry pointer indicates the address in storage space 750 of the entry of the code segment that processes the task (e.g., task body 713 or task body 723). The task parameter set pointer indicates the address in storage space 750 of the task's parameter set (e.g., task parameter set 715 or task parameter set 725).
A task may require zero, one, or more parameters. The task parameter set pointer P indicates the task's parameter set, in which all the parameters required by the task are placed, so that in the function representing the task (e.g., function T1 or T2) the task's parameters are described with the single parameter P. This gives the task descriptor a fixed size and, for example, reduces its complexity.
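For example (the task and parameter names below are hypothetical), a task needing several parameters packs them into a single parameter set, so the function representing the task always takes just the entry pointer A and the set pointer P:

#include <stddef.h>

typedef struct {                               /* parameter set, referenced by the single pointer P */
    const int *src;
    int       *dst;
    size_t     len;
} copy_params_t;

static void copy_task_body(void *params)       /* task body, reached through entry pointer A */
{
    const copy_params_t *p = params;
    for (size_t i = 0; i < p->len; i++)
        p->dst[i] = p->src[i];
}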
As an example, the following code segment is used to distribute a task:
if (T(A, P) == FAIL)   // -------- (1)
    A(P);              // -------- (2)
At (1), distribution of the task is attempted by calling function T(A, P). If distribution succeeds, T(A, P) returns a value other than FAIL and the distribution is complete. If distribution fails at (1), T(A, P) returns FAIL, and the code at (2) is executed, calling function A(P) to process the task locally.
If calling function T(A, P) distributes the task successfully, the task distributor 130 takes over delivery of the task. The task distributor 130 is implemented in hardware or as the body of function T(A, P).
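A minimal software sketch of such a function T, assuming the hypothetical dispatch_task helper and task_desc_t type sketched earlier (in a hardware task distributor, the body below would instead program the distributor 130):

#define FAIL 0
#define OK   1

static task_queue_t aux_queue;                 /* task queue of an auxiliary pipeline (illustrative) */

/* Body of function T(A, P): build a task descriptor and hand it to the task
 * distributor; on failure return FAIL so that the caller runs A(P) itself. */
static int T(void (*A)(void *), void *P)
{
    task_desc_t d = { .entry = A, .params = P };
    return dispatch_task(&aux_queue, d) ? OK : FAIL;
}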
Optionally, task T1 and task T2 have the same task body (e.g., task body 713) and different task parameter sets (e.g., task parameter set 715 and task parameter set 725, respectively).
Still optionally, the task descriptor also indicates a task return value set pointer. A task may provide 0, 1, or more return values, with a task return value set pointer indicating the address of the task return value set (task return value set 718 or task return value set 728) in storage space 750. Optionally, a task return value is added to an entry of the completion queue.
Optionally, the auxiliary pipelines (120, 122) have access to the task bodies (713 and 723), the task parameter sets (715 and 725), and the return value sets (718 and 728) in storage space 750, so that tasks T1 and T2 can be distributed to either of the auxiliary pipelines (120, 122). The pipeline 110 has access only to the return value sets (718 and 728), and not to the task bodies (713 and 723) or the task parameter sets (715 and 725) in storage space 750. Still alternatively, the return value sets (718 and 728) are provided in memory space accessible to both the main pipeline 110 and the auxiliary pipelines (120, 122), while the task bodies (713 and 723) and the task parameter sets (715 and 725) are provided in memory space accessible only to the auxiliary pipelines (120, 122).
According to an embodiment of the present application, the code at the entry of the task body optionally determines whether the context needs to be saved. When a task is executed by, for example, an auxiliary pipeline (a pipeline that did not produce the task), the pipeline's context need not be saved, whereas when a task is executed by the main pipeline (the pipeline that produced the task), the pipeline's context needs to be saved. Correspondingly, code at the exit of the task body determines whether the context needs to be restored. For example, the code at the entry of the task body identifies which type of pipeline it is running on by reading architecture registers, and thereby decides whether to save the context. Optionally, when the task descriptor is generated, a flag is set in the task descriptor, or in the storage indicated by the task descriptor, to indicate whether the context needs to be saved when the task's code begins executing and/or restored when it finishes executing.
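As a sketch of this idea (the register-reading helper, the pipeline-type encoding, and the save/restore routines below are assumptions, not defined by the present application):

enum { PIPELINE_MAIN, PIPELINE_AUXILIARY };

extern unsigned read_pipeline_type_register(void);   /* architecture register access (assumed) */
extern void save_context(void);
extern void restore_context(void);

static void task_body_with_entry_check(void *params)
{
    /* Code at the task entry: only the pipeline that produced the task saves context. */
    int need_context = (read_pipeline_type_register() == PIPELINE_MAIN);
    if (need_context)
        save_context();

    /* ... process the task using params ... */
    (void)params;

    /* Code at the task exit: restore only if a context was saved. */
    if (need_context)
        restore_context();
}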
There are a number of ways to determine the target of task distribution (the pipeline that executes the code processing the task). For example, the target pipeline is indicated with a parameter of the function representing the task. As another example, in a program written in a high-level language, hints or targets for task distribution are indicated to the compiler by flags, compiler directives, or the like, and the compiler generates, in the function representing the task, the set of pipelines that may be called on to process the task. Still alternatively, at run time the task distributor selects the pipeline to process a task based on the load of each pipeline (e.g., task queue depth), and when no other pipeline can process the task, it is processed by the pipeline that issued it (e.g., pipeline 110).
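For the run-time option, a load-based selection over the hypothetical task_queue_t sketch above might look like:

/* Pick the auxiliary pipeline whose task queue is currently shallowest.
 * Returns -1 if every queue is full, in which case the issuing pipeline
 * (e.g., pipeline 110) processes the task itself. */
static int pick_target_pipeline(task_queue_t *const queues[], int num_queues)
{
    int best = -1;
    uint32_t best_depth = QUEUE_DEPTH;
    for (int i = 0; i < num_queues; i++) {
        uint32_t depth = queues[i]->tail - queues[i]->head;
        if (depth < best_depth) {
            best_depth = depth;
            best = i;
        }
    }
    return best;
}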
FIG. 8 illustrates a schematic diagram of tasks and task packages according to an embodiment of the present application.
One or more tasks constitute a task package. A task is uniquely identified by a TID (task identifier), and a task package is uniquely identified by a PID (task package identifier). Referring to FIG. 8, the task package with PID 0 includes one task (the task with TID 0), and the task package with PID 2 includes three tasks (with TIDs 2, 3, and 4, respectively). In an alternative embodiment, the function representing a task further includes parameters indicating the TID and/or PID.
The TID is used to track a task's processing result or to perform further processing on the task. For example, in response to an entry in the completion queue indicating that a task failed, the failed task is identified by its TID and an attempt is made to re-execute the task identified by that TID.
A group of related tasks is marked by a task package to improve the flexibility and manageability of task scheduling. For example, only after all tasks belonging to the same task package have been processed does processing of the data for another task package begin. For example, a matrix operation is divided into operations on multiple sub-matrices, the computation of each sub-matrix is one task, and the computations of all sub-matrices belonging to the same matrix are added to the same task package.
FIG. 9 illustrates a set of task package descriptors according to an embodiment of the present application.
Task packages are managed with a task package descriptor set. Each entry of the task package descriptor set includes a counter indicating the number of tasks of that task package being processed. The elements of the task package descriptor set are indexed by PID. The number of elements of the task package descriptor set is not less than the maximum number of task packages supported.
The task package descriptor set is maintained by the task distributor (see also FIG. 1, e.g., task distributor 130) or by a code segment executed by the pipeline that issues the tasks.
To distribute tasks, the tasks are first added to a task package; a task package may be created for this purpose. A task package is deleted when all of its tasks have been processed to completion. When a task package is created, a PID is allocated to it, and when a task package is deleted, its PID is released.
In response to a task being added to the task package with PID X, the element of the task package descriptor set indexed by X is accessed and its counter is incremented; in response to completion of a task of, for example, the task package with PID Z, the element indexed by Z is accessed and its counter is decremented.
Optionally, the elements of the task package descriptor set further include a specified number of tasks of the task package, a number of tasks in the task package that have begun to be processed, and/or a number of tasks in the task package that have been processed to completion.
In one example, the task package has a specified number of tasks (denoted as C), representing the maximum number of tasks that the task package can carry.
To distribute a task, the task is first added to a task package (e.g., the task package with PID 0). If the number of tasks in that task package has reached the maximum, the package cannot hold more tasks, and a new task package is created to hold the tasks to be distributed.
In response to a task in the task package being dispatched (to an auxiliary pipeline (120, 122) or to the main pipeline 110), the task package descriptor's count of tasks that have begun processing is incremented; in response to a task in the task package being processed to completion, the descriptor's count of completed tasks is incremented. If the count of completed tasks equals the specified number of tasks (C) in the descriptor, all tasks of the task package represented by the descriptor have been processed, and the descriptor may be released.
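A software sketch of such a descriptor set (the field names and fixed set size are assumptions) following the counting rules above:

#include <stdbool.h>

#define MAX_TASK_PACKAGES 8                    /* not less than the supported number of task packages */

typedef struct {
    unsigned specified;                        /* C: maximum number of tasks the package carries */
    unsigned started;                          /* tasks that have begun processing */
    unsigned completed;                        /* tasks processed to completion */
    bool     in_use;
} task_pkg_desc_t;

static task_pkg_desc_t pkg_set[MAX_TASK_PACKAGES];   /* indexed by PID */

static void on_task_dispatched(unsigned pid)
{
    pkg_set[pid].started++;
}

static bool on_task_completed(unsigned pid)
{
    pkg_set[pid].completed++;
    if (pkg_set[pid].completed == pkg_set[pid].specified) {
        pkg_set[pid].in_use = false;           /* all tasks done: release the descriptor and its PID */
        return true;                           /* e.g., report package completion to the main pipeline */
    }
    return false;
}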
In an alternative embodiment, the task distributor recognizes that all tasks of a task package have been processed and reports that the task package is complete to, for example, the main pipeline 110 that issued the tasks. The processing results of the package's tasks are not reported to, for example, the main pipeline 110 until all tasks of the package have been processed to completion, so as to reduce disturbance to the main pipeline 110.
As yet another example, the maximum number of tasks in a task package is specified at run time, so that by setting this maximum the pipeline 110 that issues tasks indicates to the task distributor, for example, when task distribution should begin and/or when task processing completion should be reported. For example, the task distributor begins distributing tasks when the number of tasks added to the package reaches the maximum; and/or the task distributor reports to the pipeline 110 only after all tasks in the package have been executed to completion.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application. It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

Claims (10)

1. A multi-pipeline processor core comprising a first pipeline, a second pipeline, an instruction fetch unit, a data access unit, a cache, and a task distributor, wherein the first pipeline and the second pipeline are heterogeneous;
the first pipeline and the second pipeline share an instruction fetching unit and a data access unit, acquire instructions through the instruction fetching unit, and acquire data required by instruction execution through the data access unit;
a code segment run by the first pipeline generates a task descriptor, provides the task descriptor to the task distributor, and distributes tasks to the second pipeline through the task distributor; the second pipeline processes the tasks obtained from the task distributor, and the first pipeline need not wait for the tasks obtained by the second pipeline to be processed; wherein the task descriptor indicates the entry address and/or parameters of the task's code, the task code segment for processing the task is packaged as a function, and tasks are distributed to the second pipeline by calling the function; code at the entry of the task code segment for processing the task identifies whether the context needs to be saved: when the task is executed by a pipeline that did not produce the task, the pipeline's context need not be saved, and when the task is executed by the pipeline that produced the task, the pipeline's context needs to be saved; and code at the entry of the task code segment for processing the task checks the pipeline type or supported instruction set of the pipeline currently executing the task and selects, loads, and runs the task version supported by the current pipeline;
The cache is dedicated to the first pipeline;
a distribute-task code segment run by the second pipeline identifies that the second pipeline has no floating-point processing unit or vector processing unit and distributes tasks through the task distributor to the first pipeline, which includes the floating-point unit and the vector unit; the first pipeline processes the tasks obtained from the task distributor.
2. The processor core of claim 1, wherein the first pipeline has a greater number of pipeline stages than the second pipeline.
3. The processor core of claim 1 or 2, wherein the first pipeline and the second pipeline have the same instruction set architecture.
4. A processor core according to one of claims 1-3, wherein
The task distributor includes one or more task memories for each second pipeline; the task distributor adds task descriptors to the task memories in response to an indication from the first pipeline.
5. The processor core of claim 4, wherein
The second pipeline obtains task descriptors from the corresponding task memories, obtains tasks according to the indications of the task descriptors, and processes the tasks.
6. The processor core of claim 4 or 5, wherein
The processing results of tasks completed by the second pipeline are added to a completed-task memory;
the first pipeline obtains the processing results of the completed tasks from the completed-task memory.
7. The processor core of one of claims 1 to 6, wherein
In response to the task distributor indicating a task distribution failure to the first pipeline, the first pipeline processes the task whose distribution failed.
8. The processor core of one of claims 1 to 6, wherein the first pipeline is coupled to a first general-purpose register file; the second pipeline is coupled to a second register file; and the first register file and the second register file respectively provide the general-purpose registers of the instruction set architecture for the first pipeline and the second pipeline.
9. The processor core of one of claims 1 to 8, further comprising one or more third pipelines;
the first pipeline, the second pipeline and the third pipeline share an instruction fetching unit and a data access unit, an instruction is obtained through the instruction fetching unit, and data required by instruction execution is obtained through the data access unit.
10. The processor core of claim 9, wherein
The first pipeline distributes tasks to the second pipeline or the third pipeline through the task distributor, and the second pipeline or the third pipeline processes the tasks acquired from the task distributor.
CN201811144658.4A 2018-09-29 2018-09-29 MHP heterogeneous multi-pipeline processor Active CN109408118B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811144658.4A CN109408118B (en) 2018-09-29 2018-09-29 MHP heterogeneous multi-pipeline processor

Publications (2)

Publication Number Publication Date
CN109408118A CN109408118A (en) 2019-03-01
CN109408118B true CN109408118B (en) 2024-01-02

Family

ID=65466527

Country Status (1)

Country Link
CN (1) CN109408118B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112732416B (en) * 2021-01-18 2024-03-26 深圳中微电科技有限公司 Parallel data processing method and parallel processor for effectively eliminating data access delay
CN115129369A (en) * 2021-03-26 2022-09-30 上海阵量智能科技有限公司 Command distribution method, command distributor, chip and electronic device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102004663A (en) * 2009-09-02 2011-04-06 中国银联股份有限公司 Multi-task concurrent scheduling system and method
CN102331923A (en) * 2011-10-13 2012-01-25 西安电子科技大学 Multi-core and multi-threading processor-based functional macropipeline implementing method
CN103067524A (en) * 2013-01-18 2013-04-24 浪潮电子信息产业股份有限公司 Ant colony optimization computing resource distribution method based on cloud computing environment
CN106227591A (en) * 2016-08-05 2016-12-14 中国科学院计算技术研究所 The method and apparatus carrying out radio communication scheduling in heterogeneous polynuclear SOC(system on a chip)
CN107239324A (en) * 2017-05-22 2017-10-10 阿里巴巴集团控股有限公司 Work flow processing method, apparatus and system
CN108021430A (en) * 2016-10-31 2018-05-11 杭州海康威视数字技术股份有限公司 A kind of distributed task scheduling processing method and processing device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9378023B2 (en) * 2012-06-13 2016-06-28 International Business Machines Corporation Cross-pipe serialization for multi-pipeline processor

Also Published As

Publication number Publication date
CN109408118A (en) 2019-03-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant