WO2021037124A1 - A task processing method and task processing apparatus (一种任务处理的方法以及任务处理装置) - Google Patents

A task processing method and task processing apparatus

Info

Publication number
WO2021037124A1
Authority
WO
WIPO (PCT)
Prior art keywords
target load, task, store, load task, executed
Application number
PCT/CN2020/111649
Other languages
English (en), French (fr)
Inventor
陈铁, 肖聪, 王平, 吴正成, 张争争
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Publication of WO2021037124A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806 Task transfer initiation or dispatching
    • G06F 9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F 9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources to service a request
    • G06F 9/5011 Allocation of resources to service a request, the resources being hardware resources other than CPUs, servers and terminals
    • G06F 9/5016 Allocation of resources to service a request, the resource being the memory
    • G06F 9/5027 Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F 9/5038 Allocation of resources to service a request, considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration

Definitions

  • This application relates to the field of computer technology, in particular to a task processing method and task processing device.
  • Heterogeneous computing architectures composed of central processing units and hardware accelerators have been widely used.
  • This heterogeneous computing architecture can be used to increase the calculation rate of algorithms.
  • The realization principle of this heterogeneous computing architecture is to divide the algorithm to be accelerated into small-grained computing tasks and to combine these with custom accelerator instructions in the heterogeneous computing architecture (hereinafter referred to as "custom instructions") to complete the calculation of the entire algorithm to be accelerated.
  • Coarse-grained parallel computers are a common heterogeneous computing architecture. Because the custom instructions of this architecture are divided at a relatively coarse granularity, a single custom instruction takes longer to execute and the corresponding pipeline delay is correspondingly longer. Moreover, the coarse granularity makes data dependencies between different custom instructions more likely. Assume that each custom instruction contains 4 load tasks (L0, L1, L2, L3) and 4 store tasks (S0, S1, S2, S3), with a period of execution time (execute) between the load tasks and the store tasks of each custom instruction. Taking two custom instructions, the first instruction and the second instruction, their execution sequence diagram is shown in FIG. 1.
  • The load tasks and store tasks of different custom instructions may execute in any order relative to one another, as long as the execution timing of the load tasks and store tasks within each custom instruction follows the normal data dependencies.
  • What Figure 1 shows is just a simple situation. Assuming that S0 in the first instruction and L1 in the second instruction have the same memory address, the first instruction and the second instruction have a data dependency, specifically a read-after-write (RAW) dependency.
  • To handle this, developers use static analysis (manual judgment or compiler judgment) to add synchronization (Sync) instructions between mutually dependent instructions such as the first instruction and the second instruction.
  • In this way, the execution of the second instruction is postponed until the execution of the first instruction is completed, as shown in FIG. 2.
  • This processing method needs to wait 8 beats, where each beat is the time to execute one load task or store task.
  • In the ideal processing mode, the execution of the entire second instruction is not delayed until the first instruction completes; only L1, which causes the data dependency, is delayed until S0 has completed, while L0 before L1 executes normally, as shown in Figure 3.
  • Only 4 beats of waiting are needed, which minimizes the time that custom instructions with data dependencies wait for execution and thereby reduces unnecessary pipeline delay cost.
  • This method can also be called dynamic data dependency detection processing.
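The gap between the two handling strategies can be sketched with a toy timing model (this is our illustrative model, not the patent's hardware; the function names and the 1-beat-per-task alignment are assumptions):

```python
# Toy timing model: each load/store task takes 1 beat. Instruction 1 issues
# L0-L3 then S0-S3 (8 beats total); instruction 2 depends on it only through
# the S0 -> L1 RAW pair described above.

BEATS_PER_INSTR = 8  # 4 load tasks + 4 store tasks, 1 beat each

def static_sync_delay():
    # Static Sync handling: the whole second instruction waits until
    # instruction 1 finishes, losing all 8 beats of instruction 1.
    return BEATS_PER_INSTR

def dynamic_delay(s0_finish_beat, l1_natural_beat):
    # Dynamic detection: only L1 waits, and only until S0 has completed.
    return max(0, s0_finish_beat - l1_natural_beat)

print(static_sync_delay())   # static Sync: 8 beats lost
print(dynamic_delay(5, 1))   # dynamic: 4 beats lost in the Figure 3 scenario
print(dynamic_delay(3, 6))   # "pseudo dependency": L1 is naturally late, 0 lost
```

The third call shows why a pseudo data dependency costs nothing under dynamic detection: when L1 would naturally issue after S0 completes, no stall is inserted at all.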
  • In some cases, however, the natural execution time of L1 is already later than that of S0, so the apparent data dependency between the first instruction and the second instruction is actually a "pseudo data dependency".
  • With static analysis, this "pseudo data dependency" is still judged to be a data dependency, so the execution of the second instruction is postponed until the first instruction completes, as shown in Figure 4. This causes additional pipeline delay cost and reduces the execution rate of the custom instructions.
  • the ideal way to deal with "pseudo-data dependency" is to ignore its existence and execute custom instructions normally, which will not increase pipeline delay.
  • The embodiments of the present application provide a task processing method and task processing device, which can be applied to hardware accelerators that meet specific conditions and reduce the pipeline delay cost caused by RAW data dependencies when custom instructions in the hardware accelerator are executed.
  • the first aspect of the embodiments of the present application provides a task processing method, which is applied to a target hardware accelerator.
  • Each instruction to be executed in the target hardware accelerator includes at least one load task and at least one store task.
  • The load tasks and store tasks contained in all instructions to be executed in the target hardware accelerator are executed sequentially through the load execution queue and the store execution queue, respectively.
  • The method includes: judging whether the target load task meets a first preset condition, the target load task being the first load task in the load execution queue; if the target load task meets the first preset condition, judging whether the target load task meets a second preset condition; and if the target load task meets the second preset condition, determining that the target load task has the execution conditions.
  • Each instruction to be executed, and the load tasks and store tasks included in it, carry an instruction number uniquely corresponding to that instruction to be executed.
  • The instruction number is used to indicate the execution order of each instruction to be executed; specifically, a smaller instruction number indicates an earlier execution order.
  • Judging whether the target load task meets the first preset condition includes: judging whether all store tasks whose instruction numbers are smaller than that of the target load task have entered the store execution queue; if so, determining that the target load task meets the first preset condition.
  • determining whether the target load task meets a second preset condition includes:
  • According to a static analysis result, it is judged whether the instruction to be executed corresponding to the target load task has no data dependency on the instructions to be executed corresponding to all store tasks in the store execution queue. The static analysis result is preset and is used to indicate all instructions to be executed that have data dependencies. If so, it is determined that the target load task meets the second preset condition. Alternatively, it is judged whether the memory address corresponding to the target load task differs from the memory addresses corresponding to all store tasks in the store execution queue; if so, it is determined that the target load task meets the second preset condition. Alternatively, it is judged whether the instruction numbers of all store tasks in the store execution queue with the same memory address as the target load task are greater than or equal to that of the target load task; if so, it is determined that the target load task meets the second preset condition.
  • Judging whether the target load task meets the first preset condition includes: judging whether all store tasks whose instruction numbers are smaller than that of the target load task have entered a preset store buffer queue through the store execution queue; if so, determining that the target load task meets the first preset condition.
  • determining whether the target load task meets a second preset condition includes:
  • According to a static analysis result, it is judged whether the instruction to be executed corresponding to the target load task has no data dependency on the instructions to be executed corresponding to all store tasks in the store buffer queue. The static analysis result is preset and is used to indicate all instructions to be executed that have data dependencies. If so, it is determined that the target load task meets the second preset condition. Alternatively, it is judged whether the memory address corresponding to the target load task differs from the memory addresses corresponding to all store tasks in the store buffer queue; if so, it is determined that the target load task meets the second preset condition. Alternatively, it is judged whether the instruction numbers of all store tasks in the store buffer queue with the same memory address as the target load task are greater than or equal to that of the target load task; if so, it is determined that the target load task meets the second preset condition.
  • a second aspect of the present application provides a task processing device, where the task processing device is configured to execute the task processing method in the first aspect or any one of the possible implementations of the first aspect.
  • the task processing apparatus may include a module for executing the task processing method in the first aspect or any one of the possible implementation manners of the first aspect.
  • A third aspect of the present application provides a task processing device. The task processing device includes a processor coupled to a memory; the memory is used to store instructions, and the processor is used to execute the instructions stored in the memory, such that executing those instructions enables the processor to perform the task processing method in the first aspect or any one of its possible implementations.
  • the task processing apparatus further includes the memory.
  • A fourth aspect of the present application provides a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to execute the task processing method in the first aspect or any one of its possible implementations.
  • the fifth aspect of the present application provides a computer program product containing instructions, which when run on a computer, causes the computer to execute the task processing method in the first aspect or any one of the possible implementations of the first aspect.
  • Each instruction to be executed in the target hardware accelerator includes at least one load task and at least one store task, and the load tasks and store tasks contained in all instructions to be executed are executed sequentially through the load execution queue and the store execution queue, respectively. The method judges whether the target load task at the top of the load execution queue meets the first preset condition; if it does, judges whether it meets the second preset condition; and if it meets the second preset condition, determines that the target load task has the execution conditions.
  • By judging the target load task against the first preset condition and the second preset condition, it can be determined whether the target load task would have a memory address conflict with some unexecuted store task, resulting in a RAW data dependency, and therefore whether the target load task has the execution conditions. If it does, the target load task can be executed directly; if it does not, its execution is postponed until the execution conditions are met.
  • In this way, the data-dependency handling of the ideal state can be realized, the pipeline delay cost of executing instructions to be executed is reduced as much as possible, and the additional pipeline delay cost caused by "pseudo data dependencies" is avoided.
  • Figure 1 is a schematic diagram of the execution sequence of two custom instructions in a hardware accelerator
  • Figure 2 is a schematic diagram of the execution sequence of two custom instructions in the static analysis processing method of RAW data dependency
  • Figure 3 is a schematic diagram of the execution sequence of two custom instructions in the ideal processing mode of RAW data dependency
  • Figure 4 is a schematic diagram of the execution timing comparison of two custom instructions in the static analysis processing mode of "pseudo-data dependency" and the ideal processing mode;
  • FIG. 5 is a schematic diagram of an embodiment of a task processing method provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of another embodiment of a task processing method provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of another embodiment of a task processing method provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of an embodiment of a task processing apparatus provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of another embodiment of a task processing apparatus provided by an embodiment of the present application.
  • The naming or numbering of steps in this application does not mean that the steps in the method flow must be executed in the time or logical sequence indicated by that naming or numbering.
  • The execution order of named or numbered process steps can be changed according to the technical purpose to be achieved, as long as the same or similar technical effects are achieved.
  • The division of modules presented in this application is a logical division. In actual applications there may be other divisions; for example, multiple modules may be combined or integrated into another system, or some features may be ignored.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, and the indirect coupling or communication connection between the modules may be electrical or other similar forms.
  • The modules described as separate components may or may not be physically separated, may or may not be physical modules, and may be distributed across multiple circuit modules; some or all of the modules can be selected according to actual needs to achieve the purpose of the solutions in this application.
  • each instruction to be executed includes at least one load task and at least one store task.
  • The load tasks and store tasks contained in all instructions to be executed in the target hardware accelerator are executed sequentially through the load execution queue and the store execution queue, respectively.
  • If the load task and store task that cause a RAW data dependency can be accurately identified, then only that specific load task needs to be handled, ensuring correct RAW data dependencies while minimizing the pipeline delay cost of executing the instructions to be executed; "pseudo data dependencies" can also be identified, avoiding the additional pipeline delay cost they would otherwise cause.
  • an embodiment of the present application provides a task processing method.
  • the embodiment of the present application also provides a corresponding task processing device. Detailed descriptions are given below.
  • FIG. 5 is a schematic diagram of an embodiment of a task processing method provided by an embodiment of the application.
  • this embodiment may include:
  • Since the load tasks contained in all instructions to be executed are executed sequentially through the load execution queue, when a load task reaches the top of the load execution queue, it is the next load task to be executed.
  • The target load task is the load task at the top of the load execution queue. The first preset condition is used to determine whether the target load task is ready for RAW data dependency detection, that is, whether all store tasks that could have a memory address conflict with the target load task and thereby cause a RAW data dependency can already be analyzed against it.
  • A memory address conflict means that a load task and a store task correspond to the same memory address. If the analysis is possible, the target load task meets the first preset condition.
  • The target load task and all store tasks that may have memory address conflicts with it can then be subjected to RAW data dependency analysis, to determine whether any store task causing a RAW data dependency exists, and thus whether the target load task must be delayed or can be executed immediately.
  • If the target load task meets the second preset condition, it can be executed immediately without delay.
  • All store tasks analyzed for RAW data dependencies against the target load task are store tasks that have not yet been executed. If a store task that conflicts with the target load task's memory address and causes a RAW data dependency has already been executed, the correct data dependency is already ensured, and the execution time of the target load task is not affected by that completed store task.
  • If the target load task meets the second preset condition, it has the execution conditions and can be executed immediately.
  • In the actual execution process, the target load task is sent from the load execution queue to the corresponding memory, whereupon it is executed.
  • The target load task is judged according to the first preset condition and the second preset condition, determining whether it would conflict with the memory address of some unexecuted store task and cause a RAW data dependency, and therefore whether it has the execution conditions. If the execution conditions are met, the target load task can be executed directly; if not, its execution is postponed until the execution conditions are met.
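The overall decision for the load task at the head of the queue can be sketched as follows (a software analogy of the hardware flow; the predicate parameters stand in for the two preset-condition checks and are our naming):

```python
def try_issue(load_queue, first_cond, second_cond):
    """Decide whether the head-of-queue load task may issue on this beat.

    first_cond / second_cond are predicates over the target load task,
    placeholders for the first and second preset-condition checks.
    Returns the issued task, or None to re-check on a later beat.
    """
    if not load_queue:
        return None
    target = load_queue[0]            # only the head of the queue is examined
    if first_cond(target) and second_cond(target):
        return load_queue.pop(0)      # has the execution conditions: issue now
    return None                       # defer; retry until the conditions hold
```

A deferred task is simply re-examined on subsequent beats, which is how the postponement "until the execution conditions are met" plays out without stalling any other instruction's store tasks.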
  • Each instruction to be executed, and the load tasks and store tasks contained in it, carry an instruction number uniquely corresponding to that instruction, and the instruction number is used to indicate the execution order of each instruction to be executed.
  • Specifically, a smaller instruction number indicates an earlier execution order.
  • FIG. 6 is a schematic diagram of another embodiment of a task processing method provided by an embodiment of the present application.
  • this embodiment may include:
  • All store tasks with instruction numbers smaller than that of the target load task are the store tasks that may have memory address conflicts with the target load task and thus cause a RAW data dependency.
  • Only after all of these store tasks have entered the store execution queue can the target load task be analyzed for RAW data dependencies against the unexecuted store tasks in the store execution queue; store tasks that have already been sent from the store execution queue to memory for execution need not be analyzed.
  • determining that the target load task satisfies the second preset condition includes the following three situations:
  • Case 1: according to the static analysis result, it is judged whether the instruction to be executed corresponding to the target load task has no RAW data dependency on the instructions to be executed corresponding to all store tasks in the store execution queue.
  • The static analysis result is preset; it can be generated by software personnel's manual analysis or by a professional compiler.
  • The static analysis result indicates which of the instructions to be executed have data dependencies. If two instructions to be executed have a data dependency, a Sync instruction is configured between them according to the static analysis result, the Sync instruction indicating that dependency. If the instruction corresponding to the target load task has no RAW data dependency on the instructions corresponding to all store tasks in the store execution queue, the target load task meets the second preset condition.
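The configuration step that places a Sync instruction between statically dependent instruction pairs can be sketched as follows (purely illustrative; the instruction-stream representation and the `("SYNC", ...)` encoding are our assumptions):

```python
def insert_sync_instructions(instrs, deps):
    """Return the instruction stream with a ("SYNC", earlier, later) marker
    placed immediately before each instruction that statically depends on an
    earlier one. `deps` maps a later instruction number to the earlier
    instruction number it must wait for."""
    out = []
    for no in instrs:
        if no in deps:
            out.append(("SYNC", deps[no], no))  # wait for deps[no] to finish
        out.append(no)
    return out

print(insert_sync_instructions([1, 2], {2: 1}))  # [1, ('SYNC', 1, 2), 2]
```

This is the static half of the scheme: the dynamic detection described in the embodiments then decides, at run time, whether such a marker actually requires a stall or reflects only a pseudo dependency.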
  • Case 2: it is judged whether the memory address corresponding to the target load task differs from the memory addresses corresponding to all store tasks in the store execution queue. If so, no store task in the store execution queue has a memory address conflict with the target load task, no RAW data dependency will occur, and the target load task meets the second preset condition.
  • Case 3: it is judged whether the instruction numbers of all store tasks in the store execution queue with the same memory address as the target load task are greater than or equal to that of the target load task. If so, the execution timing of all such store tasks is after the target load task, there is no RAW data dependency on the instruction to be executed corresponding to the target load task, and the target load task meets the second preset condition.
  • step 604 is similar to the content of step 503 described above, and reference may be made to the specific description of step 503 above, which will not be repeated here.
  • This embodiment provides a dynamic data dependency detection solution that accurately identifies the load task and store task causing a RAW data dependency and takes corresponding measures for that load task alone, rather than for the entire instruction to be executed, ensuring normal data dependencies. This minimizes the pipeline delay cost of executing the instructions to be executed and also avoids the additional pipeline delay cost caused by "pseudo data dependencies".
  • FIG. 7 is a schematic diagram of another embodiment of a task processing method provided by an embodiment of the present application.
  • this embodiment may include:
  • All store tasks whose instruction numbers are smaller than that of the target load task are the store tasks that may have memory address conflicts with the target load task and thus cause a RAW data dependency.
  • The target load task can be analyzed for RAW data dependencies against the unexecuted store tasks in the store buffer queue; store tasks that have already been sent from the store buffer queue to memory for execution need not be analyzed.
  • the specific method for judging whether all store tasks whose instruction numbers are less than the target load task has entered the preset store buffer queue through the store execution queue may be to determine whether a Sync identifier corresponding to the target load task has entered the store buffer queue.
  • The Sync identifier carries the same instruction number as the target load task. It originally travels through the store execution queue, positioned ahead of the store tasks that belong to the same instruction to be executed as the target load task. When the Sync identifier enters the store buffer queue, it means that all store tasks whose instruction numbers are smaller than that of the target load task have entered the store buffer queue.
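This first-condition check via the Sync identifier can be sketched in software (a software analogy of the hardware mechanism; the tuple encoding of buffer entries is our assumption):

```python
def first_condition_met(load_instr_no, store_buffer):
    """True once the Sync identifier carrying the load's instruction number
    has passed from the store execution queue into the store buffer queue,
    which implies every store task of earlier-numbered instructions is
    already in the buffer and visible for RAW analysis."""
    return any(entry == ("SYNC", load_instr_no) for entry in store_buffer)

buf = [("STORE", 1, 0x10), ("SYNC", 2)]
print(first_condition_met(2, buf))   # True: earlier stores are all buffered
print(first_condition_met(3, buf))   # False: instruction 2's stores may lag
```

The marker turns a global question ("have all earlier stores arrived?") into a single membership test, which is what makes the check cheap to implement in a queue.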
  • determining that the target load task satisfies the second preset condition includes the following three situations:
  • Case 1: it is judged whether the instruction to be executed corresponding to the target load task has no data dependency on the instructions to be executed corresponding to all store tasks in the store buffer queue.
  • The static analysis result is preset and is used to indicate all instructions to be executed that have data dependencies; if there is no such dependency, it is determined that the target load task meets the second preset condition.
  • A specific way to judge whether the instruction to be executed corresponding to the target load task has a data dependency on the instructions corresponding to all store tasks in the store buffer queue is to judge whether a Sync identifier exists in the load execution queue.
  • The Sync identifier indicates that the load execution queue contains a load task that causes a RAW data dependency. If no Sync identifier exists in the load execution queue, the instruction corresponding to the target load task has no data dependency on the instructions corresponding to all store tasks in the store buffer queue, and the target load task meets the second preset condition.
  • Case 2: it is judged whether the memory address corresponding to the target load task differs from the memory addresses corresponding to all store tasks in the store buffer queue. If so, no store task in the store buffer queue has a memory address conflict with the target load task, no RAW data dependency will occur, and the target load task meets the second preset condition.
  • Case 3: it is judged whether the instruction numbers of all store tasks in the store buffer queue with the same memory address as the target load task are greater than or equal to that of the target load task. If so, the execution timing of all such store tasks is after the target load task, there is no RAW data dependency on the instruction to be executed corresponding to the target load task, and the target load task meets the second preset condition.
  • step 704 is similar to the content of step 503 described above, and reference may be made to the specific description of step 503 above, which will not be repeated here.
  • This embodiment performs the RAW data dependency analysis on the store buffer queue. The depth of the store execution queue cannot be changed arbitrarily without affecting the performance of the hardware accelerator, and when that depth is large, performing the RAW data dependency analysis directly on the store execution queue is costly in both performance and power consumption. In this embodiment the depth of the store execution queue remains unchanged, while the depth of the store buffer queue can be adjusted according to different detection requirements, meeting different performance and power consumption requirements.
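The tunable-depth store buffer can be loosely modeled in software with a bounded queue (purely illustrative; the depth value is hypothetical, and real hardware would retire stores to memory rather than silently dropping the oldest entry):

```python
from collections import deque

# The store execution queue depth stays fixed by the hardware design, while
# the store buffer depth is a tunable parameter: a deeper buffer lets the RAW
# analysis see further back, at a higher area/power cost.
def make_store_buffer(depth):
    return deque(maxlen=depth)   # bounded: oldest entry leaves as new ones enter

buf = make_store_buffer(2)
for store in ["S0", "S1", "S2"]:
    buf.append(store)
print(list(buf))                 # only the 2 most recent stores are retained
```

The point of the model is only the knob: changing `depth` changes how many unretired stores remain visible to the dependency analysis, without touching the (fixed) execution queue.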
  • Fig. 8 is a schematic diagram of an embodiment of a task processing apparatus provided by an embodiment of the present application.
  • the task processing device can be applied to a specific hardware accelerator architecture.
  • each instruction to be executed contains at least one load task and at least one store task.
  • the load tasks and store tasks contained in all instructions to be executed in this hardware accelerator architecture are executed in order through the load execution queue and the store execution queue, respectively.
  • the task processing apparatus 80 may include:
  • the first judgment module 801 is configured to judge whether the target load task meets the first preset condition, and the target load task is the load task at the top of the load execution queue;
  • the second determining module 802 is configured to determine whether the target load task meets the second preset condition if the target load task meets the first preset condition;
  • the determining module 803 is configured to determine, if the target load task satisfies the second preset condition, that the target load task is ready for execution and can be executed immediately without delay.
  • each instruction to be executed, as well as the load tasks and store tasks contained therein, carries an instruction number uniquely corresponding to that instruction, and the instruction number indicates the execution order of each instruction to be executed.
  • the first judgment module 801 is specifically configured to: judge whether all store tasks whose instruction numbers are smaller than that of the target load task have entered the store execution queue; and if so, determine that the target load task satisfies the first preset condition.
  • the second judgment module 802 is specifically configured to: judge, according to a static analysis result, whether the instruction to be executed corresponding to the target load task has no data dependency with any of the instructions to be executed corresponding to all the store tasks in the store execution queue, wherein the static analysis result is preset and indicates the instructions, among all instructions to be executed, that have data dependencies, and if so, determine that the target load task satisfies the second preset condition; or judge whether the memory address corresponding to the target load task differs from the memory addresses corresponding to all store tasks in the store execution queue, and if so, determine that the target load task satisfies the second preset condition; or judge whether the instruction numbers of all store tasks in the store execution queue whose memory addresses are the same as that of the target load task are all greater than or equal to that of the target load task, and if so, determine that the target load task satisfies the second preset condition.
  • the first judgment module 801 is specifically configured to: judge whether all store tasks whose instruction numbers are smaller than that of the target load task have passed through the store execution queue into a preset store buffer queue; and if so, determine that the target load task satisfies the first preset condition.
  • the second judgment module 802 is specifically configured to: judge, according to a static analysis result, whether the instruction to be executed corresponding to the target load task has no data dependency with any of the instructions to be executed corresponding to all the store tasks in the store buffer queue, wherein the static analysis result is preset and indicates the instructions, among all instructions to be executed, that have data dependencies, and if so, determine that the target load task satisfies the second preset condition; or judge whether the memory address corresponding to the target load task differs from the memory addresses corresponding to all store tasks in the store buffer queue, and if so, determine that the target load task satisfies the second preset condition; or judge whether the instruction numbers of all store tasks in the store buffer queue whose memory addresses are the same as that of the target load task are all greater than or equal to that of the target load task, and if so, determine that the target load task satisfies the second preset condition.
  • FIG. 9 is a schematic diagram of another embodiment of a task processing apparatus provided by an embodiment of the present application.
  • the task processing apparatus 90 may include: one or more processors 901.
  • the task processing apparatus 90 may further include a memory 902.
  • the processor 901 and the memory 902 are connected through a communication bus.
  • the processor 901 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of the programs of the present application.
  • the memory 902 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a CD-ROM or other optical disc storage, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
  • the memory 902 may exist independently, and is connected to the processor 901 through a bus.
  • the memory 902 may also be integrated with the processor 901.
  • the memory 902 is used to store application program codes for executing the solutions of the present application, and the processor 901 controls the execution.
  • the processor 901 is configured to execute application program codes stored in the memory 902.
  • the processor 901 may include one or more CPUs, and each CPU may be a single-core processor or a multi-core processor.
  • the processor here may refer to one or more devices, circuits, and/or processing cores for processing data (for example, computer program instructions).
  • a computer-readable storage medium is provided, and an instruction is stored thereon, and when the instruction is executed, the method of the task processing apparatus in the foregoing method embodiment is executed.
  • a computer program product containing instructions is provided, and when the instructions are executed, the method of the task processing apparatus in the foregoing method embodiment is executed.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio, or microwave).
  • the computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center integrating one or more available media.
  • the available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).
  • the program may be stored in a computer-readable storage medium, and the storage medium may include a ROM, RAM, magnetic disk, optical disc, or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)
  • Debugging And Monitoring (AREA)

Abstract

This application discloses a task processing method, including: judging whether a target load task satisfies a first preset condition, the target load task being the load task at the head of a load execution queue; if the target load task satisfies the first preset condition, judging whether the target load task satisfies a second preset condition; and if the target load task satisfies the second preset condition, determining that the target load task is ready for execution. Embodiments of this application also provide a corresponding task processing apparatus. The technical solutions of this application can be applied to hardware accelerators satisfying specific conditions, reducing the pipeline latency overhead caused by RAW data dependencies when custom instructions are executed in the hardware accelerator.

Description

A task processing method and task processing apparatus
This application claims priority to Chinese Patent Application No. 201910818221.2, filed with the Chinese Patent Office on August 30, 2019 and entitled "Task processing method and task processing apparatus", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of computer technology, and in particular to a task processing method and a task processing apparatus.
Background
With the development of computer technology, heterogeneous computing architectures composed of a central processing unit and a hardware accelerator have been widely adopted; such architectures can be used to increase the computation speed of algorithms. Their implementation principle is to partition the algorithm to be accelerated into fine-grained computation tasks and, in combination with the custom accelerator instructions of the architecture (hereinafter "custom instructions"), complete the computation of the whole algorithm.
Coarse-grained parallel computers are a common heterogeneous computing architecture. Because the custom instructions of this architecture are partitioned at a coarse granularity, a single custom instruction takes a long time to execute, and the corresponding pipeline latency is accordingly long. Moreover, because of the coarse granularity, different custom instructions are prone to data dependencies. Suppose each custom instruction contains 4 load tasks (L0, L1, L2, L3) and 4 store tasks (S0, S1, S2, S3), with an execution period (execute) between the load tasks and store tasks of each custom instruction, and the two custom instructions are a first instruction and a second instruction; the execution timing of the two custom instructions is shown in Fig. 1. It should be noted that the actual execution order of the load and store tasks contained in each custom instruction can be arbitrary, as long as it conforms to the normal data dependencies within that custom instruction; Fig. 1 shows only a simple case. Suppose S0 in the first instruction and L1 in the second instruction have the same memory address; the first and second instructions then have a data dependency, specifically a read-after-write (RAW) dependency. To avoid execution errors caused by the RAW dependency during the execution of custom instructions, developers insert, through static analysis (manual judgment or compiler judgment), a synchronization (Sync) instruction between the mutually dependent first and second instructions, deferring the execution of the second instruction until the first instruction completes, as shown in Fig. 2. This approach waits 8 beats (each beat being the time to execute one load or store task). To save as much of the custom instructions' pipeline latency as possible, the ideal approach does not defer the whole second instruction until the first instruction completes; instead, only the dependency-causing L1 is deferred until S0 completes, while L0 before L1 executes normally, as shown in Fig. 3. This waits only 4 beats, minimizing the time that data-dependent custom instructions wait to execute and thus reducing unnecessary pipeline latency overhead; this approach may also be called dynamic data dependency detection.
In the actual scheduling of custom instructions, L1 may well execute later than S0, in which case the data dependency between the first and second instructions is in fact a "false data dependency". Static analysis still judges such a false dependency to be a data dependency, so the execution of the second instruction is still deferred until the first instruction completes, as shown in Fig. 4, causing extra pipeline latency overhead and lowering the execution rate of custom instructions. The ideal way to handle a false data dependency is to ignore it and execute the custom instructions normally, which adds no pipeline latency. If, on top of the static analysis result, the false data dependencies, or the specific L1 and S0 that cause the dependency between two mutually dependent custom instructions, can be further identified, then the execution of the second instruction need not be deferred, or only the dependency-causing L1 and S0 operations need corresponding handling, thereby minimizing unnecessary pipeline latency overhead.
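To make the beat arithmetic of the static scheme (Fig. 2) and the ideal dynamic scheme (Fig. 3) concrete, here is a toy model; the task layout (L0..L3 in beats 0-3, S0..S3 in beats 4-7) and the helper names are assumptions for illustration, not part of the source:

```python
def wait_beats_static(tasks_per_instr=8):
    # Static Sync handling: the whole second instruction waits until all
    # 8 load/store tasks of the first instruction (one beat each) finish.
    return tasks_per_instr

def wait_beats_dynamic(dep_store_beat=4, dep_load_pos=1):
    # Dynamic detection: only the dependent load L1 waits until S0 has
    # completed. With S0 occupying beat 4 of the first instruction and L1
    # being the second load of the second instruction, L1 is delayed by
    # dep_store_beat + 1 - dep_load_pos beats; L0 before it runs normally.
    return dep_store_beat + 1 - dep_load_pos

print(wait_beats_static())    # 8 beats, as in Fig. 2
print(wait_beats_dynamic())   # 4 beats, as in Fig. 3
```

This matches the text's claim that dynamic detection halves the waiting time in this example (4 beats instead of 8).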
Summary
Embodiments of this application provide a task processing method and a task processing apparatus, which can be applied to hardware accelerators satisfying specific conditions, reducing the pipeline latency overhead caused by RAW data dependencies when custom instructions are executed in the hardware accelerator.
In view of this, a first aspect of the embodiments of this application provides a task processing method, applied to a target hardware accelerator, where each instruction to be executed in the target hardware accelerator contains at least one load task and at least one store task, and the load tasks and store tasks contained in all instructions to be executed in the target hardware accelerator are executed in order through a load execution queue and a store execution queue, respectively. The method includes: judging whether a target load task satisfies a first preset condition, the target load task being the load task at the head of the load execution queue; if the target load task satisfies the first preset condition, judging whether the target load task satisfies a second preset condition; and if the target load task satisfies the second preset condition, determining that the target load task is ready for execution.
It can be seen from the first aspect that, by checking the target load task against the first and second preset conditions, it can be determined whether the target load task would conflict with the memory addresses of some not-yet-executed store tasks and cause a RAW data dependency, and thus whether the target load task is ready for execution. If it is ready, it can be executed directly; if not, its execution is deferred until it becomes ready. This method realizes the ideal handling of data dependencies, minimizes the pipeline latency overhead of executing the instructions to be executed, and avoids the extra pipeline latency overhead caused by "false data dependencies".
Optionally, with reference to the first aspect, in a first possible implementation, each instruction to be executed, as well as the load tasks and store tasks it contains, carries an instruction number uniquely corresponding to that instruction; the instruction number indicates the execution order of each instruction to be executed, a smaller instruction number indicating an earlier position in the order.
Optionally, with reference to the first possible implementation of the first aspect, in a second possible implementation, judging whether the target load task satisfies the first preset condition includes: judging whether all store tasks whose instruction numbers are smaller than that of the target load task have entered the store execution queue; and if so, determining that the target load task satisfies the first preset condition.
Optionally, with reference to the second possible implementation of the first aspect, in a third possible implementation, judging whether the target load task satisfies the second preset condition includes:
judging, according to a static analysis result, whether the instruction to be executed corresponding to the target load task has no data dependency with any of the instructions to be executed corresponding to all the store tasks in the store execution queue, the static analysis result being preset and indicating the instructions, among all instructions to be executed, that have data dependencies, and if so, determining that the target load task satisfies the second preset condition; or judging whether the memory address corresponding to the target load task differs from the memory addresses corresponding to all store tasks in the store execution queue, and if so, determining that the target load task satisfies the second preset condition; or judging whether the instruction numbers of all store tasks in the store execution queue whose memory addresses are the same as that of the target load task are all greater than or equal to that of the target load task, and if so, determining that the target load task satisfies the second preset condition.
Optionally, with reference to the first possible implementation of the first aspect, in a fourth possible implementation, judging whether the target load task satisfies the first preset condition includes: judging whether all store tasks whose instruction numbers are smaller than that of the target load task have passed through the store execution queue into a preset store buffer queue; and if so, determining that the target load task satisfies the first preset condition.
Optionally, with reference to the fourth possible implementation of the first aspect, in a fifth possible implementation, judging whether the target load task satisfies the second preset condition includes:
judging, according to a static analysis result, whether the instruction to be executed corresponding to the target load task has no data dependency with any of the instructions to be executed corresponding to all the store tasks in the store buffer queue, the static analysis result being preset and indicating the instructions, among all instructions to be executed, that have data dependencies, and if so, determining that the target load task satisfies the second preset condition; or judging whether the memory address corresponding to the target load task differs from the memory addresses corresponding to all store tasks in the store buffer queue, and if so, determining that the target load task satisfies the second preset condition; or judging whether the instruction numbers of all store tasks in the store buffer queue whose memory addresses are the same as that of the target load task are all greater than or equal to that of the target load task, and if so, determining that the target load task satisfies the second preset condition.
A second aspect of this application provides a task processing apparatus for performing the task processing method in the first aspect or any possible implementation of the first aspect. Specifically, the task processing apparatus may include modules for performing the task processing method in the first aspect or any possible implementation of the first aspect.
A third aspect of this application provides a task processing apparatus including a processor coupled to a memory, where the memory is used to store instructions and the processor is used to execute the instructions stored in the memory, and executing the instructions stored in the memory causes the processor to perform the task processing method in the first aspect or any possible implementation of the first aspect. Optionally, the task processing apparatus further includes the memory.
A fourth aspect of this application provides a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to perform the task processing method in the first aspect or any possible implementation of the first aspect.
A fifth aspect of this application provides a computer program product containing instructions that, when run on a computer, cause the computer to perform the task processing method in the first aspect or any possible implementation of the first aspect.
The technical solutions of the embodiments of this application can be applied to a target hardware accelerator in which each instruction to be executed contains at least one load task and at least one store task, and the load tasks and store tasks contained in all instructions to be executed are executed in order through a load execution queue and a store execution queue, respectively. By judging whether the target load task at the head of the load execution queue satisfies a first preset condition, judging, if so, whether the target load task satisfies a second preset condition, and determining, if so, that the target load task is ready for execution, it can be determined whether the target load task would conflict with the memory addresses of some not-yet-executed store tasks and cause a RAW data dependency, and thus whether the target load task is ready for execution. If it is ready, it can be executed directly; if not, its execution is deferred until it becomes ready. This method realizes the ideal handling of data dependencies, minimizes the pipeline latency overhead of executing the instructions to be executed, and avoids the extra pipeline latency overhead caused by "false data dependencies".
Brief Description of the Drawings
Fig. 1 is a schematic diagram of the execution timing of two custom instructions in a hardware accelerator;
Fig. 2 is a schematic diagram of the execution timing of two custom instructions under the static-analysis handling of a RAW data dependency;
Fig. 3 is a schematic diagram of the execution timing of two custom instructions under the ideal handling of a RAW data dependency;
Fig. 4 is a schematic comparison of the execution timing of two custom instructions under the static-analysis handling and the ideal handling of a "false data dependency";
Fig. 5 is a schematic diagram of an embodiment of the task processing method provided by an embodiment of this application;
Fig. 6 is a schematic diagram of another embodiment of the task processing method provided by an embodiment of this application;
Fig. 7 is a schematic diagram of another embodiment of the task processing method provided by an embodiment of this application;
Fig. 8 is a schematic diagram of an embodiment of the task processing apparatus provided by an embodiment of this application;
Fig. 9 is a schematic diagram of another embodiment of the task processing apparatus provided by an embodiment of this application.
Detailed Description
The embodiments of this application are described below with reference to the drawings. Clearly, the described embodiments are only some rather than all of the embodiments of this application. A person of ordinary skill in the art will appreciate that, as graph computing frameworks evolve and new application scenarios emerge, the technical solutions provided by the embodiments of this application are equally applicable to similar technical problems.
The terms "first", "second", and the like in the specification, claims, and drawings of this application are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so termed are interchangeable where appropriate, so that the embodiments described here can be implemented in orders other than those illustrated or described here. Moreover, the terms "include" and "have" and any variants thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device comprising a series of steps or modules is not necessarily limited to the steps or modules expressly listed, but may include other steps or modules not expressly listed or inherent to such a process, method, product, or device. The naming or numbering of steps in this application does not mean that the steps must be performed in the temporal or logical order indicated by the naming or numbering; the execution order of named or numbered steps may be changed according to the technical objective to be achieved, as long as the same or a similar technical effect is achieved. The division of modules in this application is a logical division; in practical implementation there may be other divisions, for example multiple modules may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the couplings, direct couplings, or communication connections shown or discussed may be through interfaces, and indirect couplings or communication connections between modules may be electrical or in other similar forms, none of which is limited in this application. Modules described as separate components may or may not be physically separate, may or may not be physical modules, or may be distributed over multiple circuit modules; some or all of the modules may be selected as actually needed to achieve the purpose of the solutions of this application.
The embodiments of this application can be applied to a specific hardware accelerator architecture. In this hardware accelerator architecture, each instruction to be executed contains at least one load task and at least one store task, and the load tasks and store tasks contained in all instructions to be executed in the target hardware accelerator are executed in order through a load execution queue and a store execution queue, respectively. If the load task and store task that cause a RAW data dependency can be accurately identified, handling can target the specific load task, thereby guaranteeing correct RAW data dependencies while minimizing the pipeline latency overhead of executing the instructions to be executed; moreover, "false data dependencies" can be recognized, avoiding the extra pipeline latency overhead they would cause.
To solve the problem of large pipeline latency overhead in existing data dependency handling, an embodiment of this application provides a task processing method. Embodiments of this application also provide a corresponding task processing apparatus. These are described in detail below.
Fig. 5 is a schematic diagram of an embodiment of the task processing method provided by an embodiment of this application.
As shown in Fig. 5, this embodiment may include:
501. Judge whether the target load task satisfies a first preset condition.
In this embodiment, because the load tasks contained in all instructions to be executed are executed in order through the load execution queue, when a load task reaches the head of the load execution queue it is the next load task to be executed. In this embodiment, the target load task is the load task at the head of the load execution queue. The first preset condition is used to judge whether the target load task is ready for RAW data dependency detection, which depends on whether all the store tasks that may have a memory address conflict with the target load task and thereby cause a RAW data dependency can be analyzed against the target load task for RAW data dependencies; a memory address conflict means that a load task and a store task correspond to the same memory address. If the analysis can be performed, the target load task satisfies the first preset condition.
502. If the target load task satisfies the first preset condition, judge whether it satisfies a second preset condition.
In this embodiment, when the target load task satisfies the first preset condition, RAW data dependency analysis can be performed between the target load task and all the store tasks that may have a memory address conflict with it, so as to judge whether any store task causing a RAW data dependency exists, and thus whether the target load task must be delayed or can be executed immediately. When no store task causing a RAW data dependency exists, the target load task satisfies the second preset condition and can be executed immediately without delay.
It should be noted that the store tasks analyzed against the target load task for RAW data dependencies are store tasks that have not yet started executing. If a store task that conflicts with the target load task's memory address and causes the RAW data dependency has already executed, the correct data dependency is already guaranteed, and the execution time of the target load task is not affected by completed store tasks.
503. If the target load task satisfies the second preset condition, determine that the target load task is ready for execution.
In this embodiment, if the target load task satisfies the second preset condition, the target load task is ready and can be executed immediately; in the actual execution process, the target load task is sent from the load execution queue to the corresponding memory, where it can be executed.
In this embodiment, by checking the target load task against the first and second preset conditions, it can be determined whether the target load task would conflict with the memory addresses of some not-yet-executed store tasks and cause a RAW data dependency, and thus whether the target load task is ready for execution. If it is ready, it can be executed directly; if not, its execution is deferred until it becomes ready. This method realizes the ideal handling of data dependencies, minimizes the pipeline latency overhead of executing the instructions to be executed, and avoids the extra pipeline latency overhead caused by "false data dependencies".
In a specific embodiment, each instruction to be executed, as well as the load tasks and store tasks it contains, carries an instruction number uniquely corresponding to that instruction; the instruction number indicates the execution order of each instruction to be executed, a smaller instruction number indicating an earlier position in the order. This is described below with reference to specific embodiments.
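The judgment flow of steps 501 to 503 might be modeled as follows; this is only a sketch, and the task fields, helper signature, and queue representation are assumptions for illustration rather than the embodiment's actual hardware structures:

```python
from dataclasses import dataclass

@dataclass
class Task:
    instr_no: int   # instruction number: smaller means earlier program order
    addr: int       # memory address the load/store accesses

def is_ready(target_load, pending_stores, stores_entered):
    """Return True if the target load task (at the head of the load
    execution queue) is ready for execution under the two preset conditions."""
    # First preset condition: every store task with a smaller instruction
    # number must already be visible (stores_entered), so the RAW analysis
    # below can see all potentially conflicting stores.
    if not stores_entered:
        return False
    # Second preset condition: no pending store with a smaller instruction
    # number writes the same memory address as the target load.
    for store in pending_stores:
        if store.addr == target_load.addr and store.instr_no < target_load.instr_no:
            return False   # RAW dependency: defer this load
    return True

load = Task(instr_no=2, addr=0x100)
print(is_ready(load, [Task(1, 0x100)], True))   # False: earlier store hits 0x100
print(is_ready(load, [Task(1, 0x200)], True))   # True: no address conflict
```

A load that fails the check is simply held at the queue head and re-checked until it becomes ready, matching the defer-until-ready behavior described above.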
Fig. 6 is a schematic diagram of another embodiment of the task processing method provided by an embodiment of this application.
As shown in Fig. 6, this embodiment may include:
601. Judge whether all store tasks whose instruction numbers are smaller than that of the target load task have entered the store execution queue.
In this embodiment, the store tasks whose instruction numbers are smaller than that of the target load task are exactly the store tasks that may have a memory address conflict with the target load task and cause a RAW data dependency. Only after all these store tasks have entered the store execution queue can RAW data dependency analysis be performed between the target load task and those store tasks still waiting in the store execution queue; store tasks that have already been sent from the store execution queue to memory for execution need not be analyzed.
602. If so, determine that the target load task satisfies the first preset condition.
603. If the target load task satisfies the first preset condition, judge whether it satisfies a second preset condition.
Optionally, judging whether the target load task satisfies the second preset condition covers the following three cases:
Case 1. Judge, according to a static analysis result, whether the instruction to be executed corresponding to the target load task has no RAW data dependency with any of the instructions to be executed corresponding to all the store tasks in the store execution queue. The static analysis result is preset; it may be generated manually by software engineers or by a specialized compiler, and it indicates the instructions, among all instructions to be executed, that have data dependencies. If two instructions to be executed have a data dependency, a Sync instruction is configured between them according to the static analysis result, and this Sync instruction indicates the data dependency between them. If the instruction to be executed corresponding to the target load task has no RAW data dependency with any of the instructions to be executed corresponding to all the store tasks in the store execution queue, the target load task satisfies the second preset condition.
Case 2. Judge whether the memory address corresponding to the target load task differs from the memory addresses corresponding to all store tasks in the store execution queue; if so, none of the store tasks in the store execution queue has a memory address conflict with the target load task, no RAW data dependency can arise, and the target load task satisfies the second preset condition.
Case 3. Judge whether the instruction numbers of all store tasks in the store execution queue whose memory addresses are the same as that of the target load task are all greater than or equal to that of the target load task; if so, those store tasks are ordered after the target load task, no RAW data dependency exists with the instruction to be executed corresponding to the target load task, and the target load task satisfies the second preset condition.
604. If the target load task satisfies the second preset condition, determine that the target load task is ready for execution.
In this embodiment, step 604 is similar to step 503 above; refer to the specific description of step 503, which is not repeated here.
The technical solution of this embodiment provides a dynamic data dependency detection solution that can accurately identify the load task and store task causing a RAW data dependency, and takes corresponding measures on that load task, rather than on the whole instruction to be executed, to guarantee a correct data dependency. This minimizes the pipeline latency overhead of executing the instructions to be executed and avoids the extra pipeline latency overhead caused by "false data dependencies".
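The three cases above might be sketched as follows (a non-authoritative model: the task fields, the queue as a plain list, and the static analysis result as a set of instruction-number pairs are all assumptions for illustration). Any one case holding is sufficient for the second preset condition:

```python
from dataclasses import dataclass

@dataclass
class Task:
    instr_no: int  # instruction number; smaller executes earlier
    addr: int      # memory address

# Case 1: rely on the preset static analysis result, given here as a set of
# (earlier_instr_no, later_instr_no) pairs known to have a data dependency.
def case1_no_static_dep(target, store_queue, dep_pairs):
    return all((s.instr_no, target.instr_no) not in dep_pairs for s in store_queue)

# Case 2: no queued store shares a memory address with the target load.
def case2_no_addr_conflict(target, store_queue):
    return all(s.addr != target.addr for s in store_queue)

# Case 3: every same-address store is ordered at or after the target load,
# so none of them can be the writer side of a RAW dependency for it.
def case3_no_earlier_writer(target, store_queue):
    return all(s.instr_no >= target.instr_no
               for s in store_queue if s.addr == target.addr)

target = Task(instr_no=2, addr=0x10)
queue = [Task(instr_no=1, addr=0x10), Task(instr_no=3, addr=0x20)]
print(case2_no_addr_conflict(target, queue))         # False: 0x10 clashes
print(case3_no_earlier_writer(target, queue))        # False: store 1 writes 0x10 first
print(case1_no_static_dep(target, queue, {(1, 2)}))  # False per static analysis
```

Note that Case 3 subsumes Case 2 when an address comparison is available; Case 1 is the cheapest check but is only as precise as the static analysis result it consumes.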
Fig. 7 is a schematic diagram of another embodiment of the task processing method provided by an embodiment of this application.
As shown in Fig. 7, this embodiment may include:
701. Judge whether all store tasks whose instruction numbers are smaller than that of the target load task have passed through the store execution queue into a preset store buffer queue.
In this embodiment, the store tasks whose instruction numbers are smaller than that of the target load task are exactly the store tasks that may have a memory address conflict with the target load task and cause a RAW data dependency. Only after these store tasks have passed through the store execution queue into the store buffer queue can RAW data dependency analysis be performed between the target load task and those store tasks still waiting in the store buffer queue; store tasks that have already been sent from the store buffer queue to memory for execution need not be analyzed.
Specifically, one way to judge whether all store tasks whose instruction numbers are smaller than that of the target load task have passed through the store execution queue into the preset store buffer queue is to judge whether a Sync marker corresponding to the target load task has entered the store buffer queue. The Sync marker carries the same instruction number as the target load task; it originally sits in the store execution queue, at the head of the store tasks belonging to the same instruction to be executed as the target load task. When it enters the store buffer queue, all store tasks whose instruction numbers are smaller than that of the target load task have already entered the store buffer queue.
702. If so, determine that the target load task satisfies the first preset condition.
703. If the target load task satisfies the first preset condition, judge whether it satisfies a second preset condition.
Optionally, judging whether the target load task satisfies the second preset condition covers the following three cases:
Case 1. Judge, according to a static analysis result, whether the instruction to be executed corresponding to the target load task has no data dependency with any of the instructions to be executed corresponding to all the store tasks in the store buffer queue; the static analysis result is preset and indicates the instructions, among all instructions to be executed, that have data dependencies. If so, the target load task satisfies the second preset condition.
Specifically, one way to judge whether the instruction to be executed corresponding to the target load task has a data dependency with any of the instructions to be executed corresponding to the store tasks in the store buffer queue is to judge whether a Sync marker exists in the load execution queue; this Sync marker indicates that a load task causing a RAW data dependency exists in the load execution queue. If no such Sync marker exists in the load execution queue, the instruction to be executed corresponding to the target load task has no data dependency with any of the instructions to be executed corresponding to the store tasks in the store buffer queue, and the target load task satisfies the second preset condition.
Case 2. Judge whether the memory address corresponding to the target load task differs from the memory addresses corresponding to all store tasks in the store buffer queue; if so, none of the store tasks in the store buffer queue has a memory address conflict with the target load task, no RAW data dependency can arise, and the target load task satisfies the second preset condition.
Case 3. Judge whether the instruction numbers of all store tasks in the store buffer queue whose memory addresses are the same as that of the target load task are all greater than or equal to that of the target load task; if so, those store tasks are ordered after the target load task, no RAW data dependency exists with the instruction to be executed corresponding to the target load task, and the target load task satisfies the second preset condition.
704. If the target load task satisfies the second preset condition, determine that the target load task is ready for execution.
In this embodiment, step 704 is similar to step 503 above; refer to the specific description of step 503, which is not repeated here.
The technical solution of this embodiment provides another dynamic data dependency detection solution that can accurately identify the load task and store task causing a RAW data dependency, and takes corresponding measures on that load task, rather than on the whole instruction to be executed, to guarantee a correct data dependency, minimizing the pipeline latency overhead of executing the instructions to be executed and avoiding the extra pipeline latency overhead caused by "false data dependencies". Compared with the previous embodiment, this embodiment performs the RAW data dependency analysis through the store buffer queue. The depth of the store execution queue cannot be changed arbitrarily without affecting the performance of the hardware accelerator, and when that depth is large, implementing RAW data dependency analysis through the store execution queue is costly and demanding on performance and power consumption. In this embodiment, the depth of the store execution queue can remain unchanged, while the depth of the store buffer queue can be adjusted for different detection requirements, so as to meet different performance and power consumption requirements.
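The Sync-marker check for the first preset condition might be modeled as follows; this is a sketch under assumptions, and the marker encoding and the buffer's tuple representation are invented for illustration:

```python
from collections import deque

SYNC = "sync"  # marker kind; one Sync marker precedes each instruction's stores

def first_condition_met(target_instr_no, store_buffer):
    """The target load's Sync marker carries the same instruction number as
    the target load task. Once that marker has drained from the store
    execution queue into the store buffer queue, every store task with a
    smaller instruction number is already in the buffer, so the RAW
    analysis can proceed."""
    return any(kind == SYNC and instr_no == target_instr_no
               for kind, instr_no in store_buffer)

# Stores of instruction 1 have drained, followed by instruction 2's marker.
buffer = deque([("store", 1), ("store", 1), (SYNC, 2)])
print(first_condition_met(2, buffer))  # True: instr 1's stores are all buffered
print(first_condition_met(3, buffer))  # False: instr 3's marker has not arrived
```

The marker turns a "have all earlier stores arrived?" question into a single presence check, which is why it is a convenient hardware implementation of step 701.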
The task processing method provided by the embodiments of this application has been described above; the task processing apparatus provided by the embodiments of this application is described below.
Fig. 8 is a schematic diagram of an embodiment of the task processing apparatus provided by an embodiment of this application. The task processing apparatus can be applied to a specific hardware accelerator architecture in which each instruction to be executed contains at least one load task and at least one store task, and the load tasks and store tasks contained in all instructions to be executed are executed in order through a load execution queue and a store execution queue, respectively.
As shown in Fig. 8, the task processing apparatus 80 provided by an embodiment of this application may include:
a first judgment module 801, configured to judge whether a target load task satisfies a first preset condition, the target load task being the load task at the head of the load execution queue;
a second judgment module 802, configured to judge, if the target load task satisfies the first preset condition, whether the target load task satisfies a second preset condition; and
a determining module 803, configured to determine, if the target load task satisfies the second preset condition, that the target load task is ready for execution and can be executed immediately without delay.
Optionally, as an embodiment, each instruction to be executed, as well as the load tasks and store tasks it contains, carries an instruction number uniquely corresponding to that instruction; the instruction number indicates the execution order of each instruction to be executed.
Optionally, as an embodiment, the first judgment module 801 is specifically configured to:
judge whether all store tasks whose instruction numbers are smaller than that of the target load task have entered the store execution queue; and if so, determine that the target load task satisfies the first preset condition.
Optionally, as an embodiment, the second judgment module 802 is specifically configured to:
judge, according to a static analysis result, whether the instruction to be executed corresponding to the target load task has no data dependency with any of the instructions to be executed corresponding to all the store tasks in the store execution queue, the static analysis result being preset and indicating the instructions, among all instructions to be executed, that have data dependencies; and if so, determine that the target load task satisfies the second preset condition;
or,
judge whether the memory address corresponding to the target load task differs from the memory addresses corresponding to all store tasks in the store execution queue; and if so, determine that the target load task satisfies the second preset condition;
or,
judge whether the instruction numbers of all store tasks in the store execution queue whose memory addresses are the same as that of the target load task are all greater than or equal to that of the target load task; and if so, determine that the target load task satisfies the second preset condition.
Optionally, as an embodiment, the first judgment module 801 is specifically configured to:
judge whether all store tasks whose instruction numbers are smaller than that of the target load task have passed through the store execution queue into a preset store buffer queue; and if so, determine that the target load task satisfies the first preset condition.
Optionally, as an embodiment, the second judgment module 802 is specifically configured to:
judge, according to a static analysis result, whether the instruction to be executed corresponding to the target load task has no data dependency with any of the instructions to be executed corresponding to all the store tasks in the store buffer queue, the static analysis result being preset and indicating the instructions, among all instructions to be executed, that have data dependencies; and if so, determine that the target load task satisfies the second preset condition;
or,
judge whether the memory address corresponding to the target load task differs from the memory addresses corresponding to all store tasks in the store buffer queue; and if so, determine that the target load task satisfies the second preset condition;
or,
judge whether the instruction numbers of all store tasks in the store buffer queue whose memory addresses are the same as that of the target load task are all greater than or equal to that of the target load task; and if so, determine that the target load task satisfies the second preset condition.
Fig. 9 is a schematic diagram of another embodiment of the task processing apparatus provided by an embodiment of this application.
As shown in Fig. 9, the task processing apparatus 90 provided by an embodiment of this application may include one or more processors 901 and, optionally, a memory 902; the processor 901 and the memory 902 are connected through a communication bus.
The processor 901 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of the programs of this application.
The memory 902 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a CD-ROM or other optical disc storage, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 902 may exist independently and be connected to the processor 901 through a bus, or may be integrated with the processor 901.
The memory 902 is used to store the application program code for executing the solutions of this application, and execution is controlled by the processor 901. The processor 901 is configured to execute the application program code stored in the memory 902.
In a specific implementation, the processor 901 may include one or more CPUs, and each CPU may be a single-core or multi-core processor. A processor here may refer to one or more devices, circuits, and/or processing cores for processing data (for example, computer program instructions).
As another form of this embodiment, a computer-readable storage medium is provided, storing instructions that, when executed, perform the method of the task processing apparatus in the foregoing method embodiments.
As another form of this embodiment, a computer program product containing instructions is provided; when the instructions are executed, the method of the task processing apparatus in the foregoing method embodiments is performed.
The foregoing embodiments may be implemented wholly or partly by software, hardware, firmware, or any combination thereof. When implemented by software, they may be implemented wholly or partly in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of this application are produced wholly or partly. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).
A person of ordinary skill in the art will understand that all or some of the steps in the various methods of the foregoing embodiments may be completed by a program instructing relevant hardware; the program may be stored in a computer-readable storage medium, and the storage medium may include a ROM, RAM, magnetic disk, optical disc, or the like.
The task processing method and task processing apparatus provided by the embodiments of this application have been introduced in detail above. Specific examples are used herein to explain the principles and implementations of this application, and the descriptions of the above embodiments are only intended to help understand the method of this application and its core idea. Meanwhile, a person of ordinary skill in the art may, according to the idea of this application, make changes to the specific implementations and application scope. In summary, the content of this specification should not be construed as limiting this application.

Claims (14)

  1. A task processing method, applied to a target hardware accelerator, wherein each instruction to be executed in the target hardware accelerator contains at least one load task and at least one store task, the load tasks contained in all instructions to be executed in the target hardware accelerator are executed in order through a load execution queue, and the store tasks contained in all instructions to be executed in the target hardware accelerator are executed in order through a store execution queue, the method comprising:
    judging whether a target load task satisfies a first preset condition, wherein the target load task is the load task at the head of the load execution queue;
    if the target load task satisfies the first preset condition, judging whether the target load task satisfies a second preset condition; and
    if the target load task satisfies the second preset condition, determining that the target load task is ready for execution.
  2. The method according to claim 1, wherein each instruction to be executed, as well as the load tasks and store tasks contained therein, carries an instruction number uniquely corresponding to that instruction, and the instruction number indicates the execution order of each instruction to be executed.
  3. The method according to claim 2, wherein the judging whether a target load task satisfies a first preset condition comprises:
    judging whether all store tasks whose instruction numbers are smaller than that of the target load task have entered the store execution queue; and
    if so, determining that the target load task satisfies the first preset condition.
  4. The method according to claim 3, wherein the judging whether the target load task satisfies a second preset condition comprises:
    judging, according to a static analysis result, whether the instruction to be executed corresponding to the target load task has no data dependency with any of the instructions to be executed corresponding to all the store tasks in the store execution queue, wherein the static analysis result is preset and is used to indicate the instructions, among all instructions to be executed, that have data dependencies; and if so, determining that the target load task satisfies the second preset condition;
    or,
    judging whether the memory address corresponding to the target load task differs from the memory addresses corresponding to all store tasks in the store execution queue; and if so, determining that the target load task satisfies the second preset condition;
    or,
    judging whether the instruction numbers of all store tasks in the store execution queue whose memory addresses are the same as that of the target load task are all greater than or equal to that of the target load task; and if so, determining that the target load task satisfies the second preset condition.
  5. The method according to claim 2, wherein the judging whether a target load task satisfies a first preset condition comprises:
    judging whether all store tasks whose instruction numbers are smaller than that of the target load task have passed through the store execution queue into a preset store buffer queue; and
    if so, determining that the target load task satisfies the first preset condition.
  6. The method according to claim 5, wherein the judging whether the target load task satisfies a second preset condition comprises:
    judging, according to a static analysis result, whether the instruction to be executed corresponding to the target load task has no data dependency with any of the instructions to be executed corresponding to all the store tasks in the store buffer queue, wherein the static analysis result is preset and is used to indicate the instructions, among all instructions to be executed, that have data dependencies; and if so, determining that the target load task satisfies the second preset condition;
    or,
    judging whether the memory address corresponding to the target load task differs from the memory addresses corresponding to all store tasks in the store buffer queue; and if so, determining that the target load task satisfies the second preset condition;
    or,
    judging whether the instruction numbers of all store tasks in the store buffer queue whose memory addresses are the same as that of the target load task are all greater than or equal to that of the target load task; and if so, determining that the target load task satisfies the second preset condition.
  7. A task processing apparatus, applied to a target hardware accelerator, wherein each instruction to be executed in the target hardware accelerator contains at least one load task and at least one store task, the load tasks contained in all instructions to be executed in the target hardware accelerator are executed in order through a load execution queue, and the store tasks contained in all instructions to be executed in the target hardware accelerator are executed in order through a store execution queue, the task processing apparatus comprising:
    a first judgment module, configured to judge whether a target load task satisfies a first preset condition, wherein the target load task is the load task at the head of the load execution queue;
    a second judgment module, configured to judge, if the target load task satisfies the first preset condition, whether the target load task satisfies a second preset condition; and
    a determining module, configured to determine, if the target load task satisfies the second preset condition, that the target load task is ready for execution.
  8. The task processing apparatus according to claim 7, wherein each instruction to be executed, as well as the load tasks and store tasks contained therein, carries an instruction number uniquely corresponding to that instruction, and the instruction number indicates the execution order of each instruction to be executed.
  9. The task processing apparatus according to claim 8, wherein the first judgment module is specifically configured to:
    judge whether all store tasks whose instruction numbers are smaller than that of the target load task have entered the store execution queue; and if so, determine that the target load task satisfies the first preset condition.
  10. The task processing apparatus according to claim 9, wherein the second judgment module is specifically configured to:
    judge, according to a static analysis result, whether the instruction to be executed corresponding to the target load task has no data dependency with any of the instructions to be executed corresponding to all the store tasks in the store execution queue, wherein the static analysis result is preset and is used to indicate the instructions, among all instructions to be executed, that have data dependencies; and if so, determine that the target load task satisfies the second preset condition;
    or,
    judge whether the memory address corresponding to the target load task differs from the memory addresses corresponding to all store tasks in the store execution queue; and if so, determine that the target load task satisfies the second preset condition;
    or,
    judge whether the instruction numbers of all store tasks in the store execution queue whose memory addresses are the same as that of the target load task are all greater than or equal to that of the target load task; and if so, determine that the target load task satisfies the second preset condition.
  11. The task processing apparatus according to claim 8, wherein the first judgment module is specifically configured to:
    judge whether all store tasks whose instruction numbers are smaller than that of the target load task have passed through the store execution queue into a preset store buffer queue; and if so, determine that the target load task satisfies the first preset condition.
  12. The task processing apparatus according to claim 11, wherein the second judgment module is specifically configured to:
    judge, according to a static analysis result, whether the instruction to be executed corresponding to the target load task has no data dependency with any of the instructions to be executed corresponding to all the store tasks in the store buffer queue, wherein the static analysis result is preset and is used to indicate the instructions, among all instructions to be executed, that have data dependencies; and if so, determine that the target load task satisfies the second preset condition;
    or,
    judge whether the memory address corresponding to the target load task differs from the memory addresses corresponding to all store tasks in the store buffer queue; and if so, determine that the target load task satisfies the second preset condition;
    or,
    judge whether the instruction numbers of all store tasks in the store buffer queue whose memory addresses are the same as that of the target load task are all greater than or equal to that of the target load task; and if so, determine that the target load task satisfies the second preset condition.
  13. A task processing apparatus, comprising a processor coupled to a memory, wherein the memory is used to store a computer program or instructions, and the processor is used to execute the computer program or instructions in the memory, causing the task processing apparatus to perform the method according to any one of claims 1 to 6.
  14. A computer-readable storage medium storing a computer program, wherein, when the program is executed, the method according to any one of claims 1 to 6 is implemented.
PCT/CN2020/111649 2019-08-30 2020-08-27 Task processing method and task processing apparatus WO2021037124A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910818221.2 2019-08-30
CN201910818221.2A CN112445587A (zh) 2019-08-30 Task processing method and task processing apparatus

Publications (1)

Publication Number Publication Date
WO2021037124A1 true WO2021037124A1 (zh) 2021-03-04

Family

ID=74683561

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/111649 WO2021037124A1 (zh) 2019-08-30 2020-08-27 Task processing method and task processing apparatus

Country Status (2)

Country Link
CN (1) CN112445587A (zh)
WO (1) WO2021037124A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117311950B (zh) * 2023-11-28 2024-04-26 宁德时代新能源科技股份有限公司 Task processing method, task processing apparatus, electronic device, and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1195809A (zh) * 1997-04-10 1998-10-14 国际商业机器公司 Store(存数)指令结果的前送
CN101352012A (zh) * 2005-10-07 2009-01-21 安吉尔系统公司 使用不同元件对流进行媒体数据处理以及控制方法
CN101571810A (zh) * 2009-05-31 2009-11-04 清华大学 执行程序的方法、验证程序结果的方法、装置及系统
US20140281409A1 (en) * 2013-03-15 2014-09-18 Soft Machines, Inc. Method and apparatus for nearest potential store tagging

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5467473A (en) * 1993-01-08 1995-11-14 International Business Machines Corporation Out of order instruction load and store comparison
KR100242460B1 (ko) * 1996-11-06 2000-08-01 김영환 저장로드간 바이패스를 위한 장치 및 그 방법
KR101084228B1 (ko) * 2007-06-20 2011-11-17 후지쯔 가부시끼가이샤 정보 처리 장치, 캐시 메모리 제어 장치 및 메모리 액세스 순서 보증 방법
CN102722401B (zh) * 2012-04-25 2014-07-09 华中科技大学 一种硬件事务内存系统中的伪相联多版本数据管理方法
IL232836A0 (en) * 2013-06-02 2014-08-31 Rocketick Technologies Ltd Efficient parallel computation of dependency problems

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1195809A (zh) * 1997-04-10 1998-10-14 国际商业机器公司 Store(存数)指令结果的前送
CN101352012A (zh) * 2005-10-07 2009-01-21 安吉尔系统公司 使用不同元件对流进行媒体数据处理以及控制方法
CN101571810A (zh) * 2009-05-31 2009-11-04 清华大学 执行程序的方法、验证程序结果的方法、装置及系统
US20140281409A1 (en) * 2013-03-15 2014-09-18 Soft Machines, Inc. Method and apparatus for nearest potential store tagging

Also Published As

Publication number Publication date
CN112445587A (zh) 2021-03-05

Similar Documents

Publication Publication Date Title
US8850262B2 (en) Inter-processor failure detection and recovery
US20190361708A1 (en) Embedded scheduling of hardware resources for hardware acceleration
JP2002163105A (ja) データ依存関係検出装置
JP2020523674A (ja) システム内のキャッシュ転送のオーバーヘッドの削減
US10614004B2 (en) Memory transaction prioritization
US20170039095A1 (en) Switching a Locking Mode of an Object in a Multi-Thread Program
US20200042321A1 (en) Low power back-to-back wake up and issue for paired issue queue in a microprocessor
US20210089317A1 (en) Instruction processing apparatuses, processors, and processing methods
US20130055284A1 (en) Managing shared computer resources
JPH09138778A (ja) セマフォ命令用のセマフォ・バッファを用いた装置と方法
WO2021037124A1 (zh) 一种任务处理的方法以及任务处理装置
US8601488B2 (en) Controlling the task switch timing of a multitask system
US20080301402A1 (en) Method and System for Stealing Interrupt Vectors
JP2006085428A (ja) 並列処理システム、インタコネクションネットワーク、ノード及びネットワーク制御プログラム
CN111221573B (zh) 一种寄存器访问时序的管理方法、处理器、电子设备及计算机可读存储介质
US10635444B2 (en) Shared compare lanes for dependency wake up in a pair-based issue queue
US7552269B2 (en) Synchronizing a plurality of processors
AU2017438670B2 (en) Simulation device, simulation method, and simulation program
US11467844B2 (en) Storing multiple instructions in a single reordering buffer entry
US20150100759A1 (en) Pipelined finite state machine
US20040168154A1 (en) Software processing method and software processing system
US11392406B1 (en) Alternative interrupt reporting channels for microcontroller access devices
JPS581246A (ja) 命令処理順序制御方式
US10067720B2 (en) Synchronous input/output virtualization
US7877533B2 (en) Bus system, bus slave and bus control method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20857398

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20857398

Country of ref document: EP

Kind code of ref document: A1