WO2024051270A1 - Task execution method, apparatus, storage medium, and electronic device


Info

Publication number
WO2024051270A1
WO2024051270A1, PCT/CN2023/101479, CN2023101479W
Authority
WO
WIPO (PCT)
Prior art keywords
operator
target
combination
node
chip
Prior art date
Application number
PCT/CN2023/101479
Other languages
French (fr)
Chinese (zh)
Inventor
唐晓瑜
毛旷
潘秋红
汤昭荣
王颖
杨弢
Original Assignee
之江实验室
Priority date
Filing date
Publication date
Application filed by 之江实验室 (Zhejiang Lab)
Publication of WO2024051270A1 publication Critical patent/WO2024051270A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This specification relates to the field of artificial intelligence technology, and in particular to methods, devices, storage media and electronic equipment for task execution.
  • Currently, artificial intelligence models are widely used in fields such as autonomous driving and augmented reality.
  • When a business platform performs a task through an artificial intelligence model, it usually determines the operators required for the task from the operators contained in the model and runs those operators on a chip to execute the task.
  • This specification provides a task execution method, device, storage medium and electronic equipment to partially solve the above problems existing in the prior art.
  • This specification provides a task execution method, which includes: obtaining a first task request; determining the operator combination required to execute the first task corresponding to the first task request, and determining at least one executable operator in the operator combination as a first target operator, where an executable operator can be run directly because it does not need to rely on the running results of other operators; for a chip that is executing a second task, judging whether the remaining computing resources of the chip satisfy a preset condition; if the result of the judgment is yes, determining at least one executable operator corresponding to the second task currently run by the chip as a second target operator; and, without affecting the computing resources the chip has allocated to each second target operator, allocating at least part of the chip's remaining computing resources to each first target operator, so that the chip runs each first target operator in parallel, on the basis of running each second target operator, to execute the first task.
  • Optionally, allocating at least part of the remaining computing resources of the chip to each first target operator without affecting the computing resources the chip has allocated to each second target operator specifically includes: selecting, from a plurality of predetermined parallelizable operator combinations, candidate operator combinations that contain the second target operator; for each candidate operator combination, if the number of first target operators contained in the candidate operator combination exceeds a preset first threshold, determining the candidate operator combination as a target operator combination; and running, by the chip, each first target operator and each second target operator in the target operator combination in parallel to execute the first task.
  • Optionally, determining a parallelizable operator combination specifically includes: obtaining multiple target models; for each target model, determining the data transmission dependencies between the operators included in the target model; and, according to the data transmission dependencies of each of the multiple target models, determining an operator combination in which none of the operators included in the multiple target models have data transmission dependencies on each other, as the parallelizable operator combination.
  • Optionally, determining, according to the data transmission dependencies of each of the multiple target models, an operator combination in which the operators included in the multiple target models have no data transmission dependencies on each other as a parallelizable operator combination specifically includes: determining, according to the data transmission dependency determined for each target model, a data flow graph corresponding to that target model, where each node in the data flow graph represents an operator included in the target model and an edge between two nodes represents a data transmission dependency between the two nodes; and determining the parallelizable operator combination according to the multiple data flow graphs corresponding to the multiple target models.
  • Optionally, determining a parallelizable operator combination according to the data flow graphs corresponding to the multiple target models specifically includes: for each node included in the multiple data flow graphs, determining the parallelizable operator combination corresponding to that node through multiple rounds of iteration. For each round of iteration, the target node combination in this round is determined, where the target node combination includes one target node or multiple target nodes that have no dependencies on one another; for each other node in the multiple data flow graphs other than the target nodes in the target node combination, if the other node does not depend on any target node in the target node combination, a parallelizable operator combination is determined based on the other node and the target node combination, and the determined parallelizable operator combination is used as the target node combination in the next round of iteration, or the target node in the next round of iteration is selected in turn from the nodes included in the multiple data flow graphs, until a preset termination condition is met.
  • Optionally, determining a parallelizable operator combination based on the other node and the target node combination specifically includes: merging the other node and the target node combination into a candidate operator combination; and, if the running time of the candidate operator combination on the chip does not exceed a preset second threshold, determining the candidate operator combination as a parallelizable operator combination.
  • sequentially selecting the target node in the next round of iteration from the nodes included in the plurality of data flow graphs specifically includes: sequentially selecting a node from the candidate node set as the target node in the next round of iteration.
  • the candidate node set includes all nodes included in the plurality of data flow graphs for which the running time of the corresponding operator in the chip does not exceed a preset second threshold.
  • Optionally, determining the running time of the candidate operator combination on the chip specifically includes: obtaining relevant features of the candidate operator combination, where the relevant features include, for each operator contained in the candidate operator combination, the historical compute usage on the chip, the historical data bandwidth, and the historical running time, as well as at least one of the average, maximum, minimum, and variance of the data transmission sizes of the operators contained in the candidate operator combination; and inputting the relevant features into a preset prediction model to predict and output, through the prediction model, the running time of the candidate operator combination on the chip.
  • the parallelizable operator combination is determined when the chip is offline.
  • This specification provides a task execution device, including: an acquisition module, used to obtain a first task request; a first determination module, used to determine the operator combination required to execute the first task corresponding to the first task request and to determine at least one executable operator in the operator combination as a first target operator, where an executable operator can be run directly because it does not need to rely on the running results of other operators; a detection module, used to judge, for a chip that is executing a second task, whether the remaining computing resources of the chip meet a preset condition; a second determination module, used to determine, if the judgment result of the detection module is yes, at least one executable operator corresponding to the second task currently run by the chip as a second target operator; and an execution module, used to allocate at least part of the chip's remaining computing resources to each first target operator without affecting the computing resources the chip has allocated to each second target operator, so that the first task is executed by the chip running each first target operator in parallel on the basis of running each second target operator.
  • Optionally, the execution module is specifically configured to: select, from a plurality of predetermined parallelizable operator combinations, candidate operator combinations containing the second target operator; for each candidate operator combination, if the number of first target operators contained in the candidate operator combination exceeds the preset first threshold, determine the candidate operator combination as the target operator combination; and run, by the chip, each first target operator and each second target operator in the target operator combination in parallel to execute the first task.
  • Optionally, the device further includes a third determination module, configured to: obtain multiple target models; for each target model, determine the data transmission dependencies between the operators included in the target model; and, according to the data transmission dependencies of each of the multiple target models, determine an operator combination in which none of the operators included in the multiple target models have data transmission dependencies on each other, as the parallelizable operator combination.
  • Optionally, the third determination module is specifically configured to: determine, according to the data transmission dependency determined for each target model, the data flow graph corresponding to that target model, where each node in the data flow graph represents an operator included in the target model and an edge between two nodes represents a data transmission dependency between the two nodes; and determine the parallelizable operator combination according to the multiple data flow graphs corresponding to the multiple target models.
  • Optionally, the third determination module is specifically configured to, for each node included in the multiple data flow graphs, determine the parallelizable operator combination corresponding to that node through multiple rounds of iteration. For each round of iteration, the target node combination in this round is determined, where the target node combination includes one target node or multiple target nodes that have no dependencies on one another; for each other node in the multiple data flow graphs other than the target nodes in the target node combination, if the other node does not depend on any target node in the target node combination, a parallelizable operator combination is determined based on the other node and the target node combination, and the determined parallelizable operator combination is used as the target node combination in the next round of iteration, or the target node in the next round of iteration is selected in turn from the nodes included in the multiple data flow graphs, until a preset termination condition is met.
  • Optionally, the third determination module is specifically configured to merge the other node and the target node combination into a candidate operator combination and, if the running time of the candidate operator combination on the chip does not exceed a preset second threshold, determine the candidate operator combination as a parallelizable operator combination.
  • the third determination module is specifically configured to sequentially select a node from the candidate node set as the target node in the next round of iteration.
  • the candidate node set includes all nodes included in the plurality of data flow graphs for which the running time of the corresponding operator in the chip does not exceed a preset second threshold.
  • Optionally, the third determination module is specifically configured to: obtain relevant features of the candidate operator combination, where the relevant features include, for each operator contained in the candidate operator combination, the historical compute usage on the chip, the historical data bandwidth, and the historical running time, as well as at least one of the average, maximum, minimum, and variance of the data transmission sizes of the operators contained in the candidate operator combination; and input the relevant features into a preset prediction model to predict and output, through the prediction model, the running time of the candidate operator combination on the chip.
  • the parallelizable operator combination is determined when the chip is offline.
  • This specification provides a computer-readable storage medium that stores a computer program; when the computer program is executed by a processor, the above task execution method is implemented.
  • This specification provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, the above task execution method is implemented.
  • In the task execution method described above, an executable operator that does not need to rely on the running results of other operators, among the operator combination required to execute the first task corresponding to the current task request, is determined as a first target operator. When the remaining computing resources of a chip that is executing a second task meet the preset condition, at least part of the chip's remaining computing resources are allocated to each first target operator without affecting the computing resources the chip has allocated to each executable operator of the second task (the second target operators), so that the chip runs each first target operator in parallel on the basis of running each second target operator to execute the first task, thereby effectively improving the utilization of the chip's computing resources.
  • Figure 1 is a schematic diagram of a task execution method provided according to embodiments in this specification.
  • Figure 2 is a schematic diagram of a data flow diagram provided according to an embodiment in this specification.
  • Figure 3 is a schematic diagram of the process of determining a parallelizable operator combination according to an embodiment of this specification.
  • Figure 4 is a functional module schematic diagram of a task execution device provided according to an embodiment of this specification.
  • FIG. 5 is a schematic structural diagram of an electronic device for task execution provided according to an embodiment of this specification.
  • This specification provides a task execution method, as shown in Figure 1.
  • the method includes the following steps S101 to S105.
  • Step S101: obtain a first task request.
  • In this specification, users can perform corresponding tasks through the artificial intelligence models deployed on the business platform.
  • Specifically, a user can send a task request to the business platform through the device the user is using, so that after obtaining the task request, the business platform can dispatch the computing resources of a chip in response to the received task request to execute the task corresponding to that request.
  • a user can send a product recommendation request to the business platform through the device used by the user. After receiving the product recommendation request, the business platform can call the chip to run the operator in the product recommendation model to perform the product recommendation task.
  • In this specification, the execution subject used to implement the task execution method may be a server or other device deployed on the business platform, or may be a terminal device such as a desktop computer or a laptop computer.
  • Step S102: determine the operator combination required to execute the first task corresponding to the first task request, and determine at least one executable operator included in the operator combination as a first target operator.
  • An executable operator does not need to rely on the running results of other operators and can therefore be run directly by the chip.
  • the server may determine the operator combination required to execute the task corresponding to the task request, and determine at least one executable operator included in the operator combination as the first target operator.
  • the executable operators here can be run directly because they do not need to rely on the results of other operators.
  • Specifically, the server determines the model used to perform the task, and the operators included in that model form the operator combination required to perform the task; the chip can then run the operators included in the operator combination to execute the task.
  • The server can also run the executable operators through multiple chips, and can therefore determine at least one executable operator in the operator combination as a first target operator. For example, assume the executable operators include operator a, operator b, operator c, and operator d: the server can run operator a through chip one, operator b through chip two, and operators c and d through chip three.
  • After the server determines the operator combination that needs to be run in response to the task request issued by the user, it can determine, from the operator combination, the operators that can be directly run by the chip as the first target operators. Running the first target operators through the chip allows the other operators in the operator combination that depend on the running results of the first target operators to continue to be run through the chip.
  • After the server finishes running the first target operators through the chip, the operators in the operator combination corresponding to the first task that depend on the first target operators become operators that can be executed directly by the chip.
  • At this point, the server can re-determine, from the operator combination, at least one new operator that can be directly run by the chip as a first target operator.
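As an illustration of how the set of directly runnable ("executable") operators could be tracked, here is a minimal Python sketch. The `Operator` class, its fields, and the `ready_operators` helper are illustrative assumptions, not names taken from the patent.

```python
class Operator:
    def __init__(self, name, depends_on=()):
        self.name = name
        self.depends_on = set(depends_on)  # names of operators whose results this one needs

def ready_operators(operators, finished):
    """Return the operators whose every dependency has already produced its result."""
    done_names = {f.name for f in finished}
    return [op for op in operators
            if op not in finished and op.depends_on <= done_names]

# Example: a and b depend on nothing, c depends on a's result.
a, b, c = Operator("a"), Operator("b"), Operator("c", depends_on=["a"])
ops = [a, b, c]
print([op.name for op in ready_operators(ops, finished=set())])  # ['a', 'b']
print([op.name for op in ready_operators(ops, finished={a})])    # ['b', 'c']
```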
  • Step S103: for a chip that is executing a second task, determine whether the remaining computing resources of the chip meet a preset condition.
  • Step S104: if the judgment result in step S103 is yes, determine the executable operators corresponding to the second task currently run by the chip as second target operators.
  • After the server determines the first target operators that currently need to be run through a chip, it can judge, for a chip that is executing a second task, whether the remaining computing resources of the chip meet the preset condition. If the judgment result is yes, the chip can run the first target operators corresponding to the first task in parallel on the basis of executing the second task. If the judgment result is no, a chip in an idle state can be allocated to the first task to run the first target operators corresponding to the first task; or, if there is currently no chip in an idle state, the first task can be placed in a waiting state.
  • The above preset condition can be set according to actual needs, for example, determining whether the remaining computing resources of the chip reach a specified resource threshold.
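The dispatch decision of steps S103 and S104 could look roughly like the sketch below. All names here (`remaining_compute`, `currently_executable_ops`, the resource threshold, the waiting queue) are hypothetical stand-ins for whatever the business platform actually exposes.

```python
def dispatch_first_task(first_target_ops, busy_chips, idle_chips,
                        resource_threshold, waiting_queue):
    """Prefer co-locating the first task on a chip already running a second task,
    provided that chip's spare compute satisfies the preset condition."""
    for chip in busy_chips:
        if chip.remaining_compute >= resource_threshold:        # preset condition met
            second_target_ops = chip.currently_executable_ops()  # operators of the second task
            return chip, second_target_ops                       # co-schedule on this chip
    if idle_chips:                                               # otherwise use an idle chip
        return idle_chips[0], []
    waiting_queue.append(first_target_ops)                       # or keep the first task waiting
    return None, []
```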
  • Step S105: without affecting the computing resources the chip has allocated to each second target operator, allocate at least part of the chip's remaining computing resources to each first target operator, and execute the first task by having the chip run each first target operator in parallel on the basis of running each second target operator.
  • The server can allocate at least part of the chip's remaining computing resources to each first target operator without affecting the computing resources the chip has allocated to each second target operator, and execute the corresponding first task by running each first target operator in parallel.
  • Specifically, the server can filter out, from the predetermined parallelizable operator combinations, those that include the second target operator as candidate operator combinations. For each candidate operator combination, if the number of first target operators included in that candidate operator combination exceeds the preset first threshold, the candidate operator combination is determined as the target operator combination, and the chip runs each first target operator and each second target operator in the target operator combination in parallel.
  • In practice, when the server needs to execute the first task in parallel through a chip that is already executing a second task, it has to predict whether the first target operators to be executed and the second target operators currently running can be run in parallel, and this prediction process may increase the response latency of the task request. For this reason, the server can determine, in advance and in an offline state, all operator combinations that can be executed in parallel as the parallelizable operator combinations; in actual applications, it can then directly filter out, from the parallelizable operator combinations, the target operator combinations that match the second target operators and the first target operators, so that the chip runs them in parallel.
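A minimal sketch of this online selection step, assuming the pre-computed parallelizable combinations are available as collections of operator identifiers; whether a candidate must contain all of the currently running second target operators, or only some of them, is an assumption made here for concreteness.

```python
def pick_target_combinations(parallelizable_combos, first_targets, second_targets,
                             first_threshold):
    """Keep pre-computed parallelizable combinations that contain the running second
    target operators, then require that each covers more than `first_threshold`
    of the first target operators."""
    first_targets, second_targets = set(first_targets), set(second_targets)
    candidates = [c for c in parallelizable_combos if second_targets <= set(c)]
    return [c for c in candidates
            if len(first_targets & set(c)) > first_threshold]
```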
  • The method for the server to determine the parallelizable operator combinations may include: obtaining multiple target models; for each target model, determining the data transmission dependencies between the operators included in the target model; and, according to the data transmission dependencies corresponding to each target model, determining an operator combination in which none of the operators included in the multiple target models have data transmission dependencies on each other, as a parallelizable operator combination.
  • FIG. 2 is a schematic diagram of a data flow diagram provided according to an embodiment of this specification.
  • As shown in Figure 2, the server can determine the data flow graph corresponding to each target model based on the determined data transmission dependencies. In the data flow graph, each node represents an operator contained in the target model, and an edge between two nodes represents a data transmission dependency between the two nodes.
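For illustration, a data flow graph of this kind can be represented as a directed graph, one per target model; the use of `networkx` here is an assumption made for the sketch, not something prescribed by the patent.

```python
import networkx as nx

def build_dataflow_graph(model_ops, edges):
    """One directed graph per target model: a node per operator and an edge
    u -> v whenever operator v consumes the output of operator u."""
    g = nx.DiGraph()
    g.add_nodes_from(model_ops)
    g.add_edges_from(edges)  # (producer, consumer) pairs
    return g

# Hypothetical model with operators A..D, in the style of Figure 2.
g = build_dataflow_graph(["A", "B", "C", "D"],
                         [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")])
```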
  • For each node included in the data flow graphs, the server can determine the parallelizable operator combination corresponding to that node through multiple rounds of iteration.
  • In each round of iteration, the server can determine the target node combination for that round, where the target node combination includes one target node or multiple target nodes that have no dependencies on one another. For each other node besides the target nodes in the target node combination, the server determines whether that node depends on any target node in the target node combination.
  • If the other node does not depend on any target node in the target node combination, the server can determine a parallelizable operator combination based on the other node and the target node combination, and use the determined parallelizable operator combination as the target node combination in the next round of iteration, or select the target node for the next round of iteration in turn from the nodes included in the data flow graphs.
  • In practice, the server can check, according to the data flow graph, whether any target node in the target node combination is a parent node or ancestor node of the other node. If a target node in the target node combination is a parent node or ancestor node of the other node, it can be determined that the other node depends on that target node. Here, in the data flow graph, if there is an edge from a first node to a second node, the first node is the parent node of the second node and the second node is a child node of the first node; furthermore, the parent nodes and ancestor nodes of the first node are also ancestor nodes of the second node.
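The parent/ancestor test described above can be expressed as a reachability query on the data flow graph; the sketch below assumes the `networkx` representation from the earlier example.

```python
import networkx as nx

def depends_on_combination(graph, node, target_combination):
    """A node depends on the target node combination if any target node is its
    parent or ancestor in the data flow graph, i.e. the node is reachable from
    that target node along directed edges."""
    return any(target != node and nx.has_path(graph, target, node)
               for target in target_combination)
```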
  • The server may determine each target node combination obtained in this way as a parallelizable operator combination.
  • The above preset termination condition may be that all nodes included in the data flow graphs have participated in the iteration as target nodes.
  • The above preset termination condition may also be that the number of iteration rounds reaches a specified number of rounds.
  • The server determines a parallelizable operator combination based on the other node and the target node combination specifically as follows: the server merges the other node and the target node combination into a candidate operator combination and determines the running time of the candidate operator combination on the chip; if the running time of the candidate operator combination on the chip does not exceed the preset second threshold, the candidate operator combination is determined to be a parallelizable operator combination.
  • In other words, the server can predict the running time of each candidate operator combination on the chip, and further screen the candidate operator combinations based on their predicted running times on the chip.
  • In addition, for each node included in the data flow graphs, the server can determine the running time on the chip of the operator corresponding to that node; if that running time does not exceed the preset second threshold, the node is added to a preset candidate node set. In this way, during the multiple rounds of iteration for determining parallelizable operator combinations, one node can be selected in turn from the candidate node set as the target node in the next round of iteration.
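A compact sketch of the two checks just described: building the candidate node set from single-operator running times, and accepting a merged combination only if its predicted co-run time stays within the second threshold. `predict_runtime` is a hypothetical callable standing in for the prediction model discussed below.

```python
def candidate_node_set(graphs, predict_runtime, second_threshold):
    """Keep only nodes whose single-operator runtime on the chip stays within the
    second threshold; these seed the subsequent rounds of iteration."""
    return {n for g in graphs for n in g.nodes
            if predict_runtime((n,)) <= second_threshold}

def try_merge(other_node, target_combination, predict_runtime, second_threshold):
    """Merge a non-dependent node into the current target node combination and
    accept the result only if its predicted co-run time is within the threshold."""
    candidate = tuple(target_combination) + (other_node,)
    return candidate if predict_runtime(candidate) <= second_threshold else None
```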
  • The server determines the running time of a candidate operator combination on the chip specifically as follows: the server obtains the relevant features of the candidate operator combination and inputs the relevant features into a preset prediction model, so that the running time of the candidate operator combination on the chip is predicted and output by the prediction model.
  • The relevant features here include: for each operator contained in the candidate operator combination, the historical compute usage on the chip, the historical data bandwidth, and the historical running time, as well as at least one of the average, maximum, minimum, and variance of the data transmission sizes of the operators contained in the candidate operator combination. It should be noted that when the candidate operator combination contains only one operator, what is obtained through the above method is the running time of that single operator on the chip.
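One possible way to assemble the listed features and query a trained regressor is sketched below. The exact feature layout (per-operator statistics summed across the combination plus four transfer-size statistics) and the field names are assumptions; `model` can be any regressor exposing a `predict` method.

```python
import numpy as np

def combination_features(ops):
    """Assemble the features named in the text for a candidate operator combination.
    Each op is assumed to be a dict with historical measurements (illustrative keys)."""
    sizes = np.array([op["transfer_size"] for op in ops], dtype=float)
    per_op = np.array([[op["hist_compute"], op["hist_bandwidth"], op["hist_runtime"]]
                       for op in ops], dtype=float)
    stats = [sizes.mean(), sizes.max(), sizes.min(), sizes.var()]
    return np.concatenate([per_op.sum(axis=0), stats])  # 3 + 4 = 7 features

def predict_runtime(ops, model):
    """Feed the features to a pre-trained regression model that outputs the
    combination's co-run time on the chip."""
    return float(model.predict(combination_features(ops)[None, :])[0])
```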
  • The above prediction model may be trained as follows: a sample operator combination containing at least one sample operator is input into the prediction model, and the running time of the sample operator combination on the chip is predicted and output by the prediction model; the prediction model is then trained by minimizing the deviation between the running time output by the prediction model and the actual running time of the sample operator combination obtained through offline simulation.
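A minimal training sketch, under the assumption that "minimizing the deviation" is implemented as a mean-squared-error objective over offline-simulated running times; the network size and optimizer are arbitrary choices, and the 7-dimensional input matches the illustrative feature layout above.

```python
import torch
import torch.nn as nn

# Small regression network trained to minimize the gap between the predicted
# co-run time and the run time measured in offline simulation.
predictor = nn.Sequential(nn.Linear(7, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()  # one possible "minimize the deviation" objective

def train_step(features, measured_runtime):
    """features: (batch, 7) float tensor; measured_runtime: (batch,) float tensor."""
    optimizer.zero_grad()
    predicted = predictor(features).squeeze(-1)
    loss = loss_fn(predicted, measured_runtime)
    loss.backward()
    optimizer.step()
    return loss.item()
```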
  • FIG. 3 is a schematic diagram of a process for determining a parallelizable operator combination provided according to an embodiment in this specification.
  • As shown in Figure 3, the server can encode the operator corresponding to each node in the data flow graph.
  • The encoding method may be to number the nodes in the data flow graph sequentially in ascending order starting from 0, where N represents the total number of nodes in the data flow graph.
  • The server can predict that the running time on the chip of the operator corresponding to node i (hereinafter also referred to as the i-th operator, or operator i) is ti, where i is an integer greater than or equal to 0 and less than N and denotes the encoding of the node and of the operator corresponding to that node. If the running time ti of operator i does not exceed the preset second threshold, operator i is added to set 1 and set 2.
  • Then the server can determine whether set 1 is empty. When set 1 is not empty, the server can select an operator (or operator combination) x from set 1, delete x from set 1, and combine each operator in set 2 with x to obtain an operator combination x′. Further, if there is no data dependency between the operators in operator combination x′ and the running time of x′ on a single chip does not exceed the second threshold, operator combination x′ is added to set 1 and to the result set R; otherwise, operator combination x′ is discarded. The above process is repeated until set 1 is empty. At that point, every operator combination in the result set R is a parallelizable operator combination.
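One reading of the Figure 3 procedure as runnable code is given below. The initialization of set 1 and set 2 with the operators that individually fit within the second threshold, and the use of sets of operator identifiers, are assumptions; `has_dependency` and `predict_runtime` are hypothetical callbacks for the dependency check and the prediction model.

```python
def enumerate_parallel_combinations(operators, has_dependency, predict_runtime,
                                    second_threshold):
    """set_2 holds individual operators that fit on the chip, set_1 holds
    combinations still to be expanded, and result_set collects every accepted
    parallelizable combination."""
    set_2 = [op for op in operators if predict_runtime((op,)) <= second_threshold]
    set_1 = [frozenset([op]) for op in set_2]
    result_set = set(set_1)

    while set_1:                                   # repeat until set 1 is empty
        combo_x = set_1.pop()                      # select and remove a combination x
        for op in set_2:                           # combine x with each operator in set 2
            candidate = combo_x | {op}             # operator combination x'
            if candidate in result_set:            # already accepted earlier
                continue
            independent = not any(has_dependency(a, b)
                                  for a in candidate for b in candidate if a != b)
            if independent and predict_runtime(tuple(candidate)) <= second_threshold:
                set_1.append(candidate)            # keep x' for further expansion
                result_set.add(candidate)          # and record it as parallelizable
    return result_set
```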
  • the device includes an acquisition module 401, a first determination module 402, a detection module 403, a second determination module 404, and an execution module 405.
  • The acquisition module 401 is used to obtain the first task request.
  • The first determination module 402 is used to determine the operator combination required to execute the first task corresponding to the first task request, and to determine at least one executable operator in the operator combination as a first target operator; an executable operator does not need to rely on the running results of other operators and can be run directly.
  • the detection module 403 is used to determine whether the remaining computing resources of the chip that is executing the second task meet the preset conditions.
  • the second determination module 404 is configured to determine at least one executable operator corresponding to the second task currently run by the chip as each second target operator if the determination result of the detection module 403 is yes.
  • The execution module 405 is configured to allocate at least part of the chip's remaining computing resources to each first target operator, without affecting the computing resources the chip has allocated to each second target operator, so that the chip runs each first target operator in parallel on the basis of running each second target operator to execute the first task.
  • Optionally, the execution module 405 is specifically configured to: select candidate operator combinations containing the second target operator from a plurality of predetermined parallelizable operator combinations; for each candidate operator combination, if the number of first target operators included in the candidate operator combination exceeds the preset first threshold, determine the candidate operator combination as the target operator combination; and have the chip run each first target operator and each second target operator in the target operator combination in parallel to execute the first task.
  • Optionally, the device further includes a third determination module 406, specifically used to: obtain multiple target models; for each target model, determine the data transmission dependencies between the operators included in the target model; and, according to the data transmission dependencies corresponding to the multiple target models, determine operator combinations in which none of the operators included in the multiple target models have data transmission dependencies on each other, as parallelizable operator combinations.
  • Optionally, the third determination module 406 is specifically configured to: determine the data flow graph corresponding to each target model according to the determined data transmission dependencies, where each node in the data flow graph represents an operator included in the target model and an edge between two nodes represents a data transmission dependency between the two nodes; and determine the parallelizable operator combinations according to the data flow graphs.
  • Optionally, the third determination module 406 is specifically configured to: for each node included in the data flow graphs, determine the parallelizable operator combination corresponding to that node through multiple rounds of iteration. For each round of iteration, the third determination module 406 is specifically used to: determine the target node combination in this round of iteration, where the target node combination includes one target node or multiple target nodes that have no dependencies on one another; for each other node in the data flow graphs besides the target nodes in the target node combination, if the other node does not depend on any target node in the target node combination, determine a parallelizable operator combination based on the other node and the target node combination, and use the determined parallelizable operator combination as the target node combination in the next round of iteration, or select the target node in the next round of iteration in turn from the nodes included in the data flow graphs, until the preset termination condition is met.
  • Optionally, the third determination module 406 is specifically configured to: merge the other node and the target node combination into a candidate operator combination, and determine the running time of the candidate operator combination on the chip; if the running time of the candidate operator combination on the chip does not exceed a preset threshold, determine the candidate operator combination as a parallelizable operator combination.
  • Optionally, the third determination module 406 is specifically configured to: for each node included in the data flow graphs, determine the running time on the chip of the operator corresponding to that node; if that running time does not exceed the preset threshold, add the node to a preset candidate node set; and select one node in turn from the candidate node set as the target node in the next round of iteration.
  • Optionally, the third determination module 406 is specifically configured to: obtain relevant features of the candidate operator combination, where the relevant features include, for each operator contained in the candidate operator combination, the historical compute usage on the chip, the historical data bandwidth, and the historical running time, as well as at least one of the average, maximum, minimum, and variance of the data transmission sizes of the operators contained in the candidate operator combination; and input the relevant features into a preset prediction model to predict and output, through the prediction model, the running time of the candidate operator combination on the chip.
  • the parallelizable operator combination is determined when the chip is offline.
  • This specification also provides a computer-readable storage medium that stores a computer program, and the computer program can be used to perform the above method for task execution.
  • the electronic device includes a processor, internal bus, network interface, memory and non-volatile memory, and of course may also include other hardware required for business.
  • the processor reads the corresponding computer program from the non-volatile memory into the memory and then runs it to implement the above method for task execution.
  • PLD (Programmable Logic Device)
  • FPGA (Field Programmable Gate Array)
  • HDL (Hardware Description Language); examples of HDLs include ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, RHDL (Ruby Hardware Description Language), VHDL (Very-High-Speed Integrated Circuit Hardware Description Language), and Verilog
  • The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, or an embedded microcontroller.
  • Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller can also be implemented as part of the memory's control logic.
  • In addition to implementing the controller purely in computer-readable program code, it is entirely possible to logically program the method steps so that the controller achieves the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller can therefore be regarded as a hardware component, and the devices included in it for implementing various functions can also be regarded as structures within the hardware component; or even, the devices for implementing various functions can be regarded both as software modules implementing the method and as structures within the hardware component.
  • a typical implementation device is a computer.
  • The computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
  • embodiments of the present specification may be provided as methods, systems, or computer program products.
  • the present description may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment that combines software and hardware aspects.
  • the present description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk memory, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • Memory may include non-permanent storage in computer-readable media, random access memory (RAM) and/or non-volatile memory in the form of read-only memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.
  • Computer-readable media includes both persistent and non-volatile, removable and non-removable media that can be implemented by any method or technology for storage of information.
  • Information may be computer-readable instructions, data structures, modules of programs, or other data.
  • Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD), magnetic tape cassettes, magnetic tape and magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
  • computer-readable media does not include transitory media, such as modulated data signals and carrier waves.
  • program modules include routines, programs, objects, components, data structures, etc. that perform specific tasks or implement specific abstract data types.
  • the present description may also be practiced in distributed computing environments where tasks are performed by remote processing devices connected through communications networks.
  • program modules may be located in both local and remote computer storage media including storage devices.

Abstract

A task execution method, an apparatus, a storage medium, and an electronic device. All operator combinations that can be executed concurrently are determined in advance. When a first task is executed in response to a first task request initiated by a user, it is determined whether there is a pre-determined concurrently executable operator combination that matches the executable operators needed to execute the first task and the operators of a second task being executed by a chip; if so, the first task can be executed concurrently by the chip that is executing the second task, thereby improving the utilization of the chip's computing resources.

Description

Task execution method, apparatus, storage medium, and electronic device

Technical Field

This specification relates to the field of artificial intelligence technology, and in particular to task execution methods, apparatuses, storage media, and electronic devices.

Background

Currently, various artificial intelligence models are widely used in fields such as autonomous driving and augmented reality. When a business platform performs a task through an artificial intelligence model, it usually determines the operators required for the task from the operators contained in the model and runs those operators on a chip to execute the task.

While the chip runs these operators, only part of the chip's computing resources is occupied, which wastes the chip's computing resources and reduces their utilization. How to improve the utilization of the chip's computing resources is therefore an urgent problem to be solved.
Summary

This specification provides a task execution method, apparatus, storage medium, and electronic device to partially solve the above problems in the prior art.
本说明书提供了一种任务执行方法,包括:获取第一任务请求;确定执行所述第一任务请求对应的第一任务所需的算子组合,并确定所述算子组合中的至少一个可执行算子作为第一目标算子,所述可执行算子因不需要依赖其他算子的运行结果而可直接运行;针对正在执行第二任务的芯片,判断所述芯片剩余的计算资源是否满足预设条件;若所述判断的结果为是,则确定所述芯片当前运行的所述第二任务对应的至少一个可执行算子,作为第二目标算子;以不影响所述芯片为各所述第二目标算子分配的计算资源为基础,为各所述第一目标算子分配所述芯片至少部分剩余计算资源,以通过所述芯片在运行各所述第二目标算子的基础上并行运行各所述第一目标算子,来执行所述第一任务。This specification provides a task execution method, which includes: obtaining a first task request; determining the operator combination required to execute the first task corresponding to the first task request, and determining that at least one of the operator combinations can The execution operator is used as the first target operator. The executable operator can be run directly because it does not need to rely on the operation results of other operators; for the chip that is executing the second task, it is judged whether the remaining computing resources of the chip satisfy Preset conditions; if the result of the judgment is yes, determine at least one executable operator corresponding to the second task currently run by the chip as the second target operator; so as not to affect the operation of each of the chips. Based on the computing resources allocated by the second target operator, allocate at least part of the remaining computing resources of the chip to each of the first target operators, so as to run each of the second target operators through the chip. Run each of the first target operators in parallel to execute the first task.
可选地,以不影响所述芯片为各所述第二目标算子分配的计算资源为基础,为各所述第一目标算子分配所述芯片至少部分剩余计算资源,具体包括:从预先确定的多个可并行算子组合中筛选出包含有所述第二目标算子的候选算子组合;针对每个所述候选算子组合,若该候选算子组合中包含所述第一目标算子的数量超过预设的第一阈值,则确定该候选算子组合作为目标算子组合;由所述芯片并行运行所述目标算子组合中 的各所述第一目标算子和各所述第二目标算子,以执行所述第一任务。Optionally, on the basis of not affecting the computing resources allocated by the chip to each of the second target operators, allocating at least part of the remaining computing resources of the chip to each of the first target operators specifically includes: Select a candidate operator combination including the second target operator from the determined multiple parallelizable operator combinations; for each candidate operator combination, if the candidate operator combination includes the first target If the number of operators exceeds the preset first threshold, the candidate operator combination is determined as the target operator combination; the chip runs the target operator combination in parallel. each of the first target operators and each of the second target operators to perform the first task.
可选地,确定可行性算子组合,具体包括:获取多个目标模型;针对每个所述目标模型,确定该目标模型包含的算子之间的数据传输依赖关系;根据所述多个目标模型各自的所述数据传输依赖关系,确定所述多个目标模型包含的所有算子中互相之间不存在数据传输依赖关系的算子组合,作为所述可并行算子组合。Optionally, determining a feasible operator combination specifically includes: obtaining multiple target models; for each target model, determining the data transmission dependencies between operators included in the target model; According to the data transmission dependencies of each model, an operator combination that does not have a data transmission dependency relationship among all operators included in the multiple target models is determined as the parallelizable operator combination.
可选地,根据所述多个目标模型各自的所述数据传输依赖关系,确定所述多个目标模型包含的算子中互相之间不存在数据传输依赖关系的算子组合,作为可并行算子组合,具体包括:根据针对每个所述目标模型所确定出的所述数据传输依赖关系,确定所述目标模型对应的数据流图,在所述数据流图中,每个节点用于表征所述目标模型包含的算子,两个节点之间的边用于表征所述两个节点之间存在数据传输依赖关系;根据所述多个目标模型对应的多个所述数据流图,确定所述可并行算子组合。Optionally, according to the data transmission dependencies of each of the multiple target models, determine an operator combination that does not have a data transmission dependency with each other among the operators included in the multiple target models as a parallel operation. The sub-combination specifically includes: determining a data flow graph corresponding to the target model according to the data transmission dependency determined for each target model. In the data flow graph, each node is used to represent The operator included in the target model and the edge between two nodes are used to represent the existence of data transmission dependency between the two nodes; according to multiple data flow graphs corresponding to the multiple target models, determine The parallelizable operator combination.
可选地,根据所述多个目标模型对应的所述数据流图,确定可并行算子组合,具体包括:针对所述多个数据流图包含的每个节点,通过多轮迭代,确定该节点对应的可并行算子组合。其中,针对每轮迭代,确定该轮迭代中的目标节点组合,所述目标节点组合包括一个目标节点或包括相互不存在依赖关系的多个目标节点;针对所述多个数据流图包含的除所述目标节点组合中的各目标节点之外的每个其他节点,若该其他节点不依赖于所述目标节点组合中的任意一个目标节点,则根据该其他节点和所述目标节点组合确定可并行算子组合,并将确定出的所述可并行算子组合作为下一轮迭代中的目标节点组合,或从所述多个数据流图包含的节点中依次选取出下一轮迭代中的目标节点,直到满足预设的终止条件。Optionally, determining a parallelizable operator combination according to the data flow graph corresponding to the multiple target models specifically includes: for each node included in the multiple data flow graph, determine the combination through multiple rounds of iterations. The parallelizable operator combination corresponding to the node. Wherein, for each round of iteration, the target node combination in this round of iteration is determined, and the target node combination includes one target node or multiple target nodes that do not have mutual dependencies; for the multiple data flow graphs including For each other node other than each target node in the target node combination, if the other node does not depend on any target node in the target node combination, then the available node is determined based on the other node and the target node combination. Parallel operator combination, and use the determined parallelizable operator combination as the target node combination in the next round of iteration, or sequentially select the nodes in the next round of iteration from the nodes included in the multiple data flow graphs. target node until the preset termination conditions are met.
可选地,根据该其他节点和所述目标节点或该其他节点和所述目标节点组合,确定可并行算子组合,具体包括:将该其他节点和所述目标节点组合合并为候选算子组合;若所述候选算子组合在所述芯片中的运行时间不超过预设的第二阈值,则确定该候选算子组合为可并行算子组合。Optionally, determining a parallelizable operator combination based on the other node and the target node or the other node and the target node combination specifically includes: merging the other node and the target node combination into a candidate operator combination ; If the running time of the candidate operator combination in the chip does not exceed the preset second threshold, it is determined that the candidate operator combination is a parallelizable operator combination.
可选地,从所述多个数据流图包含的节点中依次选取出下一轮迭代中的目标节点,具体包括:从候选节点集中依次选取出一个节点,作为下一轮迭代中的目标节点。其中,所述候选节点集包括所述多个数据流图中包含的、对应的算子在所述芯片中的运行时间不超过预设的第二阈值的所有节点。Optionally, sequentially selecting the target node in the next round of iteration from the nodes included in the plurality of data flow graphs specifically includes: sequentially selecting a node from the candidate node set as the target node in the next round of iteration. . Wherein, the candidate node set includes all nodes included in the plurality of data flow graphs for which the running time of the corresponding operator in the chip does not exceed a preset second threshold.
可选地,确定所述候选算子组合在所述芯片中的运行时间,具体包括:获取所述候 选算子组合的相关特征,所述相关特征包括所述候选算子组合中包含的每个算子的历史所使用芯片的计算量,历史数据带宽,历史运行时间,所述候选算子组合中包含的算子的数据传输大小的平均值、最大值、最小值、方差中的至少一种;将所述相关特征输入到预设的预测模型中,以通过所述预测模型预测并输出所述候选算子组合在所述芯片中的运行时间。Optionally, determining the running time of the candidate operator combination in the chip specifically includes: obtaining the candidate operator combination. Relevant characteristics of the selected operator combination, including the calculation amount of the chip used by each operator included in the candidate operator combination, historical data bandwidth, historical running time, and the historical running time of each operator included in the candidate operator combination. At least one of the average, maximum, minimum, and variance of the data transmission size of the included operators; input the relevant features into the preset prediction model to predict and output the The running time of candidate operator combinations in the chip.
可选地,所述可并行算子组合是在所述芯片处于离线状态时确定的。Optionally, the parallelizable operator combination is determined when the chip is offline.
This specification provides a task execution apparatus, including: an acquisition module, configured to obtain a first task request; a first determination module, configured to determine the operator combination required to execute the first task corresponding to the first task request, and to determine at least one executable operator in the operator combination as a first target operator, where an executable operator can be run directly because it does not depend on the running results of other operators; a detection module, configured to determine, for a chip that is executing a second task, whether the remaining computing resources of the chip meet a preset condition; a second determination module, configured to, if the determination result of the detection module is yes, determine at least one executable operator corresponding to the second task currently running on the chip as a second target operator; and an execution module, configured to allocate at least part of the remaining computing resources of the chip to each first target operator without affecting the computing resources the chip has allocated to each second target operator, so as to execute the first task by having the chip run each first target operator in parallel while running each second target operator.
Optionally, the execution module is specifically configured to: select, from a plurality of predetermined parallelizable operator combinations, candidate operator combinations that contain the second target operators; for each candidate operator combination, if the number of first target operators included in the candidate operator combination exceeds a preset first threshold, determine the candidate operator combination as a target operator combination; and have the chip run each first target operator and each second target operator in the target operator combination in parallel to execute the first task.
Optionally, the apparatus further includes a third determination module, configured to: obtain multiple target models; for each target model, determine the data transmission dependencies between the operators included in the target model; and determine, based on the respective data transmission dependencies of the multiple target models, the combinations of operators that have no data transmission dependencies on one another among all operators included in the multiple target models, as the parallelizable operator combinations.
Optionally, the third determination module is specifically configured to: determine, based on the data transmission dependencies determined for each target model, the data flow graph corresponding to the target model, where each node in the data flow graph represents an operator included in the target model, and an edge between two nodes indicates a data transmission dependency between the two nodes; and determine the parallelizable operator combinations based on the multiple data flow graphs corresponding to the multiple target models.
Optionally, the third determination module is specifically configured to determine, for each node included in the multiple data flow graphs, the parallelizable operator combination corresponding to the node through multiple rounds of iteration. For each round of iteration: the target node combination for the round is determined, the target node combination including one target node or multiple target nodes that have no dependencies on one another; for each other node included in the multiple data flow graphs besides the target nodes in the target node combination, if the other node does not depend on any target node in the target node combination, a parallelizable operator combination is determined based on the other node and the target node combination, and the determined parallelizable operator combination is taken as the target node combination for the next round of iteration, or the target node for the next round of iteration is selected sequentially from the nodes included in the multiple data flow graphs, until a preset termination condition is met.
Optionally, the third determination module is specifically configured to merge the other node and the target node combination into a candidate operator combination and, if the running time of the candidate operator combination on the chip does not exceed a preset threshold, determine the candidate operator combination to be a parallelizable operator combination.
Optionally, the third determination module is specifically configured to sequentially select a node from a candidate node set as the target node for the next round of iteration, where the candidate node set includes all nodes, among those included in the multiple data flow graphs, whose corresponding operators have a running time on the chip that does not exceed the preset second threshold.
Optionally, the third determination module is specifically configured to: obtain relevant features of the candidate operator combination, the relevant features including, for each operator in the candidate operator combination, the computation amount of the chip used historically, the historical data bandwidth and the historical running time, as well as at least one of the average, maximum, minimum and variance of the data transmission sizes of the operators included in the candidate operator combination; and input the relevant features into a preset prediction model, so that the prediction model predicts and outputs the running time of the candidate operator combination on the chip.
Optionally, the parallelizable operator combinations are determined while the chip is offline.
This specification provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the task execution method described above.
This specification provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the task execution method described above when executing the program.
According to the task execution method provided by the embodiments of this specification, the executable operators that do not depend on the running results of other operators in the operator combination required to execute the first task corresponding to the current task request are determined as first target operators. When the remaining computing resources of a chip that is executing a second task meet a preset condition, at least part of the remaining computing resources of the chip are allocated to each first target operator without affecting the computing resources the chip has allocated to the executable operators of the second task (the second target operators), so that the chip can execute the first task by running each first target operator in parallel while running each second target operator, thereby effectively improving the utilization of the chip's computing resources.
Description of the drawings
The accompanying drawings are provided for a further understanding of this specification and constitute a part of this specification. The illustrative embodiments of this specification and their descriptions are used to explain this specification and do not constitute an improper limitation of this specification. In the drawings:
Figure 1 is a schematic diagram of a task execution method provided according to an embodiment of this specification;
Figure 2 is a schematic diagram of a data flow graph provided according to an embodiment of this specification;
Figure 3 is a schematic diagram of a process for determining parallelizable operator combinations according to an embodiment of this specification;
Figure 4 is a schematic diagram of the functional modules of a task execution apparatus provided according to an embodiment of this specification;
Figure 5 is a schematic structural diagram of an electronic device for task execution provided according to an embodiment of this specification.
Detailed description of the embodiments
To make the purposes, technical solutions and advantages of this specification clearer, the technical solutions of this specification are described clearly and completely below with reference to specific embodiments of this specification and the corresponding drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of this specification. Based on the embodiments in this specification, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the scope of protection of this specification.
The technical solutions provided by the embodiments of this specification are described in detail below with reference to the accompanying drawings.
This specification provides a task execution method. As shown in Figure 1, the method includes the following steps S101 to S105.
Step S101: obtain a first task request.
In this specification, a user can perform a corresponding task through an artificial intelligence model deployed on a business platform. Specifically, the user can send a task request to the business platform through the device used by the user, so that after obtaining the task request, the business platform can schedule the computing resources of a chip in response to the received task request to execute the task corresponding to the task request. For example, the user can send a product recommendation request to the business platform through the device used by the user; after receiving the product recommendation request, the business platform can call the chip to run the operators of a product recommendation model to perform the product recommendation task.
In this specification, the execution subject used to implement the task execution method may be a designated device deployed on the business platform, such as a server, or may be a terminal device such as a desktop computer or a laptop computer. For ease of description, the task execution method provided in this specification is described below by taking the server as the execution subject.
Step S102: determine the operator combination required to execute the first task corresponding to the first task request, and determine at least one executable operator included in the operator combination as a first target operator, where an executable operator does not depend on the running results of other operators and can therefore be run directly, for example by the chip.
For example, the server may determine the operator combination required to execute the task corresponding to the task request, and determine at least one executable operator included in the operator combination as the first target operator. An executable operator here can be run directly because it does not depend on the running results of other operators.
Specifically, when the server needs to execute a task, it determines the model(s) used to execute the task, and the operators included in those models form the operator combination required to execute the task; the task is then executed by running, through the chip, the operators included in the operator combination. The server may determine more than one model for executing the task. For example, executing task A may require running operators a and b of model one and operators c and d of model two.
In practical application scenarios, the server may also run the executable operators through multiple chips. Therefore, the server may determine at least one executable operator in the operator combination as the first target operator. For example, assuming the executable operators include operators a, b, c and d, the server may run operator a through chip one, operator b through chip two, and operators c and d through chip three.
Further, certain data transmission dependencies may exist between operators. For example, an operator may need to use the data processing results of other operators as its runtime parameters. Therefore, if an operator on which it depends has not been executed, the parameters required for running that operator have not yet been determined, and the operator cannot be run directly. Thus, after determining the operator combination that needs to be run in response to the task request issued by the user, the server can determine, from the operator combination, the operators that can be run directly by the chip as first target operators. Running the first target operators through the chip allows the other operators in the operator combination that depend on the running results of the first target operators to continue to be run through the chip.
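For illustration only, the following minimal Python sketch shows one way the dependency check described above could be expressed: an operator is executable once every operator it depends on has finished. The names (`Operator`, `find_executable_operators`) and the data layout are assumptions for the example, not part of this specification.

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Operator:
    name: str
    # Names of operators whose outputs this operator consumes.
    depends_on: Set[str] = field(default_factory=set)

def find_executable_operators(operators: List[Operator],
                              finished: Set[str]) -> List[Operator]:
    """Return operators that can run now: every dependency has finished."""
    return [op for op in operators
            if op.name not in finished and op.depends_on <= finished]

if __name__ == "__main__":
    ops = [
        Operator("a"),              # no dependencies, directly runnable
        Operator("b", {"a"}),       # needs a's result
        Operator("c", {"a", "b"}),  # needs a and b
    ]
    print([op.name for op in find_executable_operators(ops, finished=set())])
    # ['a']  -> operator a is a first target operator
```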
In addition, after the server has run the first target operators through the chip, the operators in the operator combination corresponding to the first task that depend on the first target operators become operators that can be executed directly by the chip once they have obtained the execution results of the first target operators. At this point, the server can again determine, from the operator combination, at least one new operator that can be run directly by the chip as the first target operator.
Step S103: for a chip that is executing a second task, determine whether the remaining computing resources of the chip meet a preset condition.
Step S104: if the determination result of step S103 is yes, obtain the executable operators corresponding to the second task currently running on the chip as second target operators.
After determining the first target operators that currently need to be run through a chip, the server can determine, for a chip that is executing a second task, whether the remaining computing resources of that chip meet the preset condition. If the determination result is yes, the chip can run the first target operators corresponding to the first task in parallel while executing the second task. If the determination result is no, an idle chip can be allocated to the first task to run its first target operators; or, if no idle chip is currently available, the first task can be placed in a waiting state. The preset condition can be set according to actual needs, for example, whether the remaining computing resources of the chip reach a specified threshold.
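A simplified sketch of the dispatch decision in steps S103 and S104 is given below, assuming the preset condition is a minimum amount of free compute resource on the chip; the resource units, the `Chip` class and the fallback to an idle chip or a waiting state are illustrative assumptions rather than a definitive implementation.

```python
from typing import List, Optional

class Chip:
    def __init__(self, name: str, total_units: int):
        self.name = name
        self.total_units = total_units
        self.used_units = 0            # resources held by the second task

    @property
    def free_units(self) -> int:
        return self.total_units - self.used_units

def pick_chip(busy_chips: List[Chip], idle_chips: List[Chip],
              min_free_units: int) -> Optional[Chip]:
    """Prefer co-locating the first task on a busy chip whose remaining
    resources satisfy the preset condition; otherwise fall back to an
    idle chip; otherwise return None (the task waits)."""
    for chip in busy_chips:
        if chip.free_units >= min_free_units:
            return chip
    return idle_chips[0] if idle_chips else None
```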
Step S105: on the basis of not affecting the computing resources the chip has allocated to each second target operator, allocate at least part of the remaining computing resources of the chip to each first target operator, and execute the first task by having the chip run each first target operator in parallel while running each second target operator.
For example, the server can allocate at least part of the remaining computing resources of the chip to each first target operator without affecting the computing resources the chip has allocated to each second target operator, and execute the corresponding first task by running the first target operators in parallel.
Specifically, the server can select, from the predetermined parallelizable operator combinations, the parallelizable operator combinations that contain the second target operators as candidate operator combinations. For each candidate operator combination, if the number of first target operators included in the candidate operator combination exceeds a preset first threshold, the candidate operator combination is determined as the target operator combination, and the chip runs each first target operator and each second target operator in the target operator combination in parallel.
In practical application scenarios, when the server needs to execute the first task in parallel on a chip that is executing the second task, it needs to predict whether the first target operators required for the first task can run in parallel with the second target operators of the currently running second task, and this prediction process may increase the response latency of the task request. For this reason, the server can determine in advance, while offline, all operator combinations that can be executed in parallel as the parallelizable operator combinations; in actual applications, the target operator combination matching the second target operators and the first target operators can then be selected directly from the parallelizable operator combinations so that the chip can run them in parallel.
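The online selection step described above could be sketched as follows; the set-based representation of operator combinations and the name `pick_target_combination` are assumptions for the example, and `first_threshold` stands for the count check on first target operators mentioned earlier.

```python
from typing import List, Optional, Set

def pick_target_combination(parallel_combos: List[Set[str]],
                            first_targets: Set[str],
                            second_targets: Set[str],
                            first_threshold: int) -> Optional[Set[str]]:
    """Filter the offline-built parallelizable combinations down to those
    containing the running second target operators, then keep one that
    covers more than `first_threshold` of the first target operators."""
    candidates = [c for c in parallel_combos if second_targets <= c]
    for combo in candidates:
        if len(combo & first_targets) > first_threshold:
            return combo
    return None
```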
The method by which the server determines the parallelizable operator combinations may include: obtaining multiple target models; for each target model, determining the data transmission dependencies between the operators included in the target model; and determining, based on the respective data transmission dependencies of the target models, the combinations of operators that have no data transmission dependencies on one another among all operators included in the multiple target models, as the parallelizable operator combinations.
Figure 2 is a schematic diagram of a data flow graph provided according to an embodiment of this specification. As can be seen from Figure 2, the server can determine the data flow graph corresponding to each target model based on the determined data transmission dependencies. In a data flow graph, each node represents an operator included in the target model, and an edge between two nodes indicates a data transmission dependency between the two nodes. When there is an edge from a node A to a node B, node B depends on node A.
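As an illustration of how such a data flow graph could be represented, the sketch below builds a simple parent-to-children adjacency structure from a per-model dependency listing; the `Model` layout and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Each model is described as {operator_name: [names of operators it depends on]}.
Model = Dict[str, List[str]]

def build_dataflow_graph(model: Model) -> Dict[str, Set[str]]:
    """Return edges as parent -> {children}; an edge A -> B means B
    depends on the output of A."""
    edges: Dict[str, Set[str]] = defaultdict(set)
    for op, deps in model.items():
        edges.setdefault(op, set())
        for parent in deps:
            edges[parent].add(op)
    return dict(edges)

if __name__ == "__main__":
    model_one: Model = {"a": [], "b": ["a"], "c": ["a", "b"]}
    print(build_dataflow_graph(model_one))
    # e.g. {'a': {'b', 'c'}, 'b': {'c'}, 'c': set()}
```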
Further, for each node included in the data flow graphs, the server can determine the parallelizable operator combination corresponding to the node through multiple rounds of iteration.
For each round of iteration, the server can: determine the target node combination for the round, the target node combination including one target node or multiple target nodes that have no dependencies on one another; and, for each other node included in the data flow graphs besides the target nodes in the target node combination, determine whether the other node depends on any target node in the target node combination.
If the other node does not depend on any target node in the target node combination, the server can determine a parallelizable operator combination based on the other node and the target node combination, and take the determined parallelizable operator combination as the target node combination for the next round of iteration, or select the target node for the next round of iteration sequentially from the nodes included in the data flow graphs.
For each other node, the server can determine, in turn, based on the data flow graph, whether each target node in the target node combination is a parent node or an ancestor node of the other node; if any target node in the target node combination is a parent node or an ancestor node of the other node, it can be determined that the other node depends on that target node. In a data flow graph, if there is an edge from a first node to a second node, the first node is a parent node and the second node is a child node; furthermore, the parent nodes and ancestor nodes of the first node are also ancestor nodes of the second node.
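The parent/ancestor test described above amounts to a reachability query on the data flow graph; a minimal sketch, assuming the parent-to-children adjacency representation from the previous example:

```python
from typing import Dict, Set

def node_depends_on(node: str, target: str, edges: Dict[str, Set[str]]) -> bool:
    """True if `target` is a parent or ancestor of `node`, i.e. there is
    a path target -> ... -> node in the data flow graph."""
    stack, seen = [target], set()
    while stack:
        cur = stack.pop()
        if cur in seen:
            continue
        seen.add(cur)
        for child in edges.get(cur, ()):
            if child == node:
                return True
            stack.append(child)
    return False
```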
Further, when the server determines that a preset termination condition is met, it can determine the obtained target node combinations as the parallelizable operator combinations. The preset termination condition may be that all nodes included in the data flow graphs have participated in the iterative processing as target nodes. Alternatively, the preset termination condition may be that the number of iteration rounds reaches a specified number.
Determining a parallelizable operator combination based on the other node and the target node combination may specifically be: the server merges the other node and the target node combination into a candidate operator combination and determines the running time of the candidate operator combination on the chip; if the running time of the candidate operator combination on the chip does not exceed a preset second threshold, the candidate operator combination is determined to be a parallelizable operator combination.
As can be seen from the above, the server can also predict the running time of each candidate operator combination on the chip and further filter the candidate operator combinations based on the predicted running times.
In practical application scenarios, some operators may have long running times on the chip and are therefore not suitable for being combined with other operators into a parallelizable operator combination. Therefore, before determining the parallelizable operator combinations, the server can also determine, for each node included in the data flow graphs, the running time on the chip of the operator corresponding to the node; if that running time does not exceed the preset second threshold, the node is added to a preset candidate node set. In this way, during the multiple rounds of iteration for determining the parallelizable operator combinations, one node can be selected in turn from the candidate node set as the target node for the next round of iteration.
Determining the running time of a candidate operator combination on the chip may specifically be: the server obtains relevant features of the candidate operator combination and inputs the relevant features into a preset prediction model, so that the prediction model predicts and outputs the running time of the candidate operator combination on the chip. The relevant features include, for each operator in the candidate operator combination, the computation amount of the chip used historically, the historical data bandwidth and the historical running time, as well as at least one of the average, maximum, minimum and variance of the data transmission sizes of the operators included in the candidate operator combination. It should be noted that, when the candidate operator combination includes only one operator, the above method yields the running time of that single operator on the chip.
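The feature extraction and prediction step could look like the sketch below. The feature list follows the description above, but the way per-operator features are aggregated into a fixed-length vector, and the use of a scikit-learn gradient boosting regressor as the prediction model, are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from statistics import mean, pvariance
from typing import Dict, List

def combo_features(ops: List[Dict]) -> List[float]:
    """Flatten the features listed above for one candidate combination:
    per-operator historical compute, bandwidth and runtime (summed here to
    get a fixed-length vector), plus statistics of the transfer sizes."""
    sizes = [op["transfer_size"] for op in ops]
    return [
        sum(op["compute"] for op in ops),
        sum(op["bandwidth"] for op in ops),
        sum(op["runtime"] for op in ops),
        mean(sizes), max(sizes), min(sizes), pvariance(sizes),
    ]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.random((64, 7))                      # stand-in historical features
    y = X[:, 2] * 1.1 + rng.random(64) * 0.01    # stand-in measured runtimes
    model = GradientBoostingRegressor().fit(X, y)

    combo = [{"compute": 1.0, "bandwidth": 0.5, "runtime": 0.3, "transfer_size": 8.0},
             {"compute": 2.0, "bandwidth": 0.7, "runtime": 0.4, "transfer_size": 16.0}]
    predicted = model.predict([combo_features(combo)])[0]
    print(f"predicted runtime: {predicted:.3f}")
```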
The prediction model may be trained as follows: a sample operator combination including at least one sample operator is input into the prediction model, and the prediction model predicts and outputs the running time of the sample operator combination on the chip; the prediction model is then trained so as to minimize the deviation between the running time of the sample operator combination on the chip output by the prediction model and the actual running time of the sample operator combination obtained through offline simulation.
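A minimal sketch of this training objective, minimizing the squared deviation between predicted runtimes and offline-simulated runtimes, is shown below; a plain linear model fitted by gradient descent stands in for whatever predictor is actually used, and all names and constants are illustrative.

```python
import numpy as np

def train_runtime_predictor(features: np.ndarray,
                            simulated_runtimes: np.ndarray,
                            lr: float = 0.05,
                            epochs: int = 2000) -> np.ndarray:
    """Fit weights w so that features @ w approximates the offline-simulated
    runtimes, by gradient descent on the mean squared deviation."""
    n, d = features.shape
    w = np.zeros(d)
    for _ in range(epochs):
        pred = features @ w
        grad = 2.0 / n * features.T @ (pred - simulated_runtimes)
        w -= lr * grad
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.random((128, 7))                                      # sample combination features
    y = X @ np.array([0.2, 0.1, 1.0, 0.05, 0.0, 0.0, 0.01])       # simulated runtimes
    w = train_runtime_predictor(X, y)
    print("learned weights:", np.round(w, 2))
```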
Figure 3 is a schematic diagram of a process for determining parallelizable operator combinations provided according to an embodiment of this specification.
As can be seen from Figure 3, the server can encode the operator corresponding to each node in the data flow graph. For example, the nodes in the data flow graph can be encoded in ascending order from 0 to N-1, where N is the total number of nodes in the data flow graph.
Further, the server can predict the running time ti on the chip of the operator corresponding to node i (hereinafter also referred to as the i-th operator, or operator i). When the running time ti of operator i is less than the preset second threshold, operator i is added to set 1 and set 2, and also to the result set R. Here, i is an integer greater than or equal to 0 and less than N, and denotes the encoding of the node and of the operator corresponding to the node.
In each round of iteration, the server can determine whether set 1 is empty. When set 1 is not empty, the server selects an element x from set 1 (an operator or an operator combination) and removes x from set 1, and combines each operator in set 2 with x to obtain an operator combination x'. Further, if there are no data dependencies between the operators in operator combination x' and the running time of x' on a single chip does not exceed the second threshold, x' is added to set 1 and to the result set R; otherwise, x' is discarded. The above process is repeated until set 1 is empty, at which point every operator combination in the result set R is a parallelizable operator combination.
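Putting the procedure of Figure 3 together, the sketch below enumerates parallelizable operator combinations using the two working sets and the result set R described above. The runtime predictor is stubbed out with a trivial function, the dependency test is a plain reachability check, and all names are illustrative rather than part of this specification.

```python
from typing import Callable, Dict, FrozenSet, List, Set

def enumerate_parallel_combos(
        nodes: List[str],
        edges: Dict[str, Set[str]],                 # parent -> children
        runtime_of: Callable[[FrozenSet[str]], float],
        threshold: float) -> Set[FrozenSet[str]]:
    """Seed set 1 / set 2 with the single operators whose predicted runtime
    is under the threshold, then repeatedly merge a combination x from set 1
    with each seed operator, keeping merges that are dependency-free and
    whose predicted runtime stays under the threshold."""

    def dependent(a: str, b: str) -> bool:
        """True if a reaches b or b reaches a in the data flow graph."""
        def reaches(src: str, dst: str) -> bool:
            stack, seen = [src], set()
            while stack:
                cur = stack.pop()
                if cur == dst:
                    return True
                if cur in seen:
                    continue
                seen.add(cur)
                stack.extend(edges.get(cur, ()))
            return False
        return reaches(a, b) or reaches(b, a)

    seeds = [n for n in nodes if runtime_of(frozenset([n])) <= threshold]
    set1: List[FrozenSet[str]] = [frozenset([n]) for n in seeds]   # work list
    set2: List[str] = list(seeds)                                  # single operators
    result: Set[FrozenSet[str]] = set(set1)                        # result set R

    while set1:
        x = set1.pop()
        for op in set2:
            if op in x or any(dependent(op, member) for member in x):
                continue
            merged = x | {op}
            if merged in result:
                continue
            if runtime_of(merged) <= threshold:
                set1.append(merged)
                result.add(merged)
    return result

if __name__ == "__main__":
    # Two tiny models: a -> b, plus an independent operator c.
    edges = {"a": {"b"}, "b": set(), "c": set()}
    combos = enumerate_parallel_combos(
        nodes=["a", "b", "c"],
        edges=edges,
        runtime_of=lambda combo: float(len(combo)),  # stand-in predictor
        threshold=2.0)
    print(sorted(sorted(c) for c in combos))
    # [['a'], ['a', 'c'], ['b'], ['b', 'c'], ['c']]
```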
As can be seen from the above, all operator combinations that can be executed in parallel are determined in advance; when the executable operators of the first task to be executed in response to a user-initiated task request and the executable operators of the second task being executed by the chip can form one of these parallelizable operator combinations, the chip executes the first task in parallel while executing the second task, which effectively improves the utilization of the chip's computing resources.
The above is the task execution method provided by one or more embodiments of this specification. Based on the same idea, this specification also provides a corresponding task execution apparatus. As shown in Figure 4, the apparatus includes an acquisition module 401, a first determination module 402, a detection module 403, a second determination module 404 and an execution module 405.
The acquisition module 401 is configured to obtain a first task request.
The first determination module 402 is configured to determine the operator combination required to execute the first task corresponding to the first task request, and to determine at least one executable operator in the operator combination as a first target operator, where an executable operator can be run directly because it does not depend on the running results of other operators.
The detection module 403 is configured to determine, for a chip that is executing a second task, whether the remaining computing resources of the chip meet a preset condition.
The second determination module 404 is configured to, if the determination result of the detection module 403 is yes, determine at least one executable operator corresponding to the second task currently running on the chip as the second target operators.
The execution module 405 is configured to allocate at least part of the remaining computing resources of the chip to each first target operator without affecting the computing resources the chip has allocated to each second target operator, so that the chip runs each first target operator in parallel while running each second target operator.
Optionally, the execution module 405 is specifically configured to: select, from a plurality of predetermined parallelizable operator combinations, candidate operator combinations that contain the second target operators; for each candidate operator combination, if the number of first target operators included in the candidate operator combination exceeds a preset first threshold, determine the candidate operator combination as the target operator combination; and have the chip run each first target operator and each second target operator in the target operator combination in parallel.
Optionally, the apparatus further includes a third determination module 406. The third determination module 406 is specifically configured to: obtain multiple target models; for each target model, determine the data transmission dependencies between the operators included in the target model; and determine, based on the data transmission dependencies corresponding to the multiple target models, the combinations of operators that have no data transmission dependencies on one another among all operators included in the multiple target models, as the parallelizable operator combinations.
Optionally, the third determination module 406 is specifically configured to: determine, based on the determined data transmission dependencies, the data flow graph corresponding to each target model, where each node in the data flow graph represents an operator included in the target model, and an edge between two nodes indicates a data transmission dependency between the two nodes; and determine the parallelizable operator combinations based on the data flow graphs.
Optionally, the third determination module 406 is specifically configured to determine, for each node included in the data flow graphs, the parallelizable operator combination corresponding to the node through multiple rounds of iteration. For each round of iteration, the third determination module 406 is specifically configured to: determine the target node combination for the round, the target node combination including one target node or multiple target nodes that have no dependencies on one another; for each other node included in the data flow graphs besides the target nodes in the target node combination, if the other node does not depend on any target node in the target node combination, determine a parallelizable operator combination based on the other node and the target node combination, and take the determined parallelizable operator combination as the target node combination for the next round of iteration, or select the target node for the next round of iteration sequentially from the nodes included in the data flow graphs; and, after determining that a preset termination condition is met, obtain the parallelizable operator combinations.
Optionally, the third determination module 406 is specifically configured to: merge the other node and the target node combination into a candidate operator combination and determine the running time of the candidate operator combination on the chip; and, if the running time of the candidate operator combination on the chip does not exceed a preset threshold, determine the candidate operator combination to be a parallelizable operator combination.
Optionally, the third determination module 406 is specifically configured to: determine, for each node included in the data flow graph, the running time on the chip of the operator corresponding to the node; if that running time does not exceed a preset threshold, add the node to a preset candidate node set; and select a node in turn from the candidate node set as the target node for the next round of iteration.
Optionally, the third determination module 406 is specifically configured to: obtain relevant features of the candidate operator combination, the relevant features including, for each operator in the candidate operator combination, the computation amount of the chip used historically, the historical data bandwidth and the historical running time, as well as at least one of the average, maximum, minimum and variance of the data transmission sizes of the operators included in the candidate operator combination; and input the relevant features into a preset prediction model, so that the prediction model predicts and outputs the running time of the candidate operator combination on the chip.
Optionally, the parallelizable operator combinations are determined while the chip is offline.
This specification also provides a computer-readable storage medium storing a computer program that can be used to execute the above method for task execution.
This specification also provides a schematic structural diagram of the electronic device for task execution shown in Figure 5. As shown in Figure 5, at the hardware level, the electronic device includes a processor, an internal bus, a network interface, a memory and a non-volatile storage, and may of course also include hardware required by other services. The processor reads the corresponding computer program from the non-volatile storage into the memory and then runs it to implement the above method for task execution.
Of course, in addition to software implementations, this specification does not exclude other implementations, such as logic devices or a combination of software and hardware; that is, the execution subject of the following processing flow is not limited to logical units and may also be hardware or logic devices.
In the 1990s, an improvement to a technology could clearly be distinguished as an improvement in hardware (for example, an improvement to circuit structures such as diodes, transistors and switches) or an improvement in software (an improvement to a method flow). However, with the development of technology, many of today's improvements to method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be implemented with hardware entity modules. For example, a programmable logic device (PLD) (such as a field programmable gate array, FPGA) is an integrated circuit whose logic functions are determined by the user's programming of the device. Designers program a digital system to be "integrated" onto a PLD by themselves, without asking a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, instead of manually making integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compilers used in program development; the original code to be compiled must also be written in a specific programming language, called a hardware description language (HDL), of which there is not just one kind but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most widely used. Those skilled in the art should also understand that a hardware circuit implementing a logical method flow can easily be obtained simply by logic-programming the method flow in one of the above hardware description languages and programming it into an integrated circuit.
A controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20 and Silicone Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art also know that, in addition to implementing the controller purely as computer-readable program code, it is entirely possible to logic-program the method steps so that the controller implements the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a controller can therefore be regarded as a hardware component, and the devices included in it for implementing various functions can also be regarded as structures within the hardware component. Or even, the devices for implementing various functions can be regarded both as software modules implementing the method and as structures within the hardware component.
The systems, apparatuses, modules or units described in the above embodiments may be implemented by computer chips or entities, or by products having certain functions. A typical implementation device is a computer. Specifically, the computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above apparatus is described as being divided into various units by function. Of course, when implementing this specification, the functions of the units may be implemented in one or more pieces of software and/or hardware.
Those skilled in the art should understand that embodiments of this specification may be provided as a method, a system or a computer program product. Therefore, this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this specification may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM and optical storage) containing computer-usable program code.
This specification is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to the embodiments of this specification. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces and memory.
The memory may include non-persistent storage in a computer-readable medium, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include persistent and non-persistent, removable and non-removable media, and information storage can be implemented by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise" and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. In the absence of further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or device that includes the element.
Those skilled in the art should understand that embodiments of this specification may be provided as a method, a system or a computer program product. Therefore, this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this specification may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM and optical storage) containing computer-usable program code.
This specification may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures and the like that perform specific tasks or implement specific abstract data types. This specification may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in local and remote computer storage media, including storage devices.
The embodiments in this specification are described in a progressive manner; for identical or similar parts between the embodiments, reference may be made to each other, and each embodiment focuses on its differences from the other embodiments. In particular, the system embodiment is described relatively simply because it is substantially similar to the method embodiment, and for relevant points reference may be made to the description of the method embodiment.

Claims (20)

  1. A task execution method, characterized by comprising:
    obtaining a first task request;
    determining an operator combination required to execute a first task corresponding to the first task request, and determining at least one executable operator in the operator combination as a first target operator, wherein an executable operator can be run directly because it does not depend on running results of other operators;
    for a chip that is executing a second task, determining whether remaining computing resources of the chip meet a preset condition;
    if a result of the determining is yes,
    determining at least one executable operator corresponding to the second task currently running on the chip as a second target operator; and
    allocating at least part of the remaining computing resources of the chip to each first target operator without affecting the computing resources the chip has allocated to each second target operator, so as to execute the first task by running, through the chip, each first target operator in parallel while running each second target operator.
  2. The method according to claim 1, characterized in that allocating at least part of the remaining computing resources of the chip to each first target operator without affecting the computing resources the chip has allocated to each second target operator specifically comprises:
    selecting, from a plurality of predetermined parallelizable operator combinations, candidate operator combinations that contain the second target operator;
    for each candidate operator combination, if the number of first target operators included in the candidate operator combination exceeds a preset first threshold, determining the candidate operator combination as a target operator combination; and
    running, by the chip in parallel, each first target operator and each second target operator in the target operator combination.
  3. The method according to claim 2, characterized in that determining the parallelizable operator combinations specifically comprises:
    obtaining a plurality of target models;
    for each target model, determining data transmission dependencies between operators included in the target model; and
    determining, according to the respective data transmission dependencies of the plurality of target models, combinations of operators that have no data transmission dependencies on one another among all operators included in the plurality of target models, as the parallelizable operator combinations.
  4. The method according to claim 3, wherein determining, according to the data transmission dependencies of each of the plurality of target models, combinations of operators that have no data transmission dependency on one another among all operators contained in the plurality of target models, as the parallelizable operator combinations, comprises:
    determining, according to the data transmission dependencies determined for each target model, a data flow graph corresponding to the target model, wherein in the data flow graph each node represents an operator contained in the target model, and an edge between two nodes represents a data transmission dependency between the two nodes; and
    determining the parallelizable operator combinations according to the plurality of data flow graphs corresponding to the plurality of target models.
  5. The method according to claim 4, wherein determining the parallelizable operator combinations according to the plurality of data flow graphs corresponding to the plurality of target models comprises:
    for each node contained in the plurality of data flow graphs, determining, through multiple rounds of iteration, the parallelizable operator combinations corresponding to the node, wherein
    for each round of iteration,
    a target node combination in the round of iteration is determined, the target node combination comprising one target node or a plurality of target nodes that have no dependency on one another;
    for each other node contained in the plurality of data flow graphs other than the target nodes in the target node combination, if the other node does not depend on any target node in the target node combination, a parallelizable operator combination is determined according to the other node and the target node combination; and
    the determined parallelizable operator combination is used as the target node combination in the next round of iteration, or a target node for the next round of iteration is selected in turn from the nodes contained in the plurality of data flow graphs, until a preset termination condition is met.
  6. The method according to claim 5, wherein determining a parallelizable operator combination according to the other node and the target node combination comprises:
    merging the other node and the target node combination into a candidate operator combination; and
    if the running time of the candidate operator combination on the chip does not exceed a preset second threshold, determining the candidate operator combination as a parallelizable operator combination.
  7. The method according to claim 5, wherein selecting a target node for the next round of iteration in turn from the nodes contained in the plurality of data flow graphs comprises:
    selecting one node in turn from a candidate node set as the target node in the next round of iteration,
    wherein the candidate node set comprises all nodes contained in the plurality of data flow graphs whose corresponding operators have a running time on the chip that does not exceed a preset second threshold.
  8. The method according to claim 6, wherein determining the running time of the candidate operator combination on the chip comprises:
    obtaining relevant features of the candidate operator combination, the relevant features comprising, for each operator contained in the candidate operator combination, the computation amount on the chip used historically, the historical data bandwidth, and the historical running time, as well as at least one of the average, maximum, minimum, and variance of the data transmission sizes of the operators contained in the candidate operator combination; and
    inputting the relevant features into a preset prediction model, so that the prediction model predicts and outputs the running time of the candidate operator combination on the chip.
  9. The method according to claim 2, wherein the parallelizable operator combinations are determined while the chip is offline.
  10. A task execution apparatus, comprising:
    an obtaining module, configured to obtain a first task request;
    a first determination module, configured to determine an operator combination required to execute a first task corresponding to the first task request, and determine at least one executable operator in the operator combination as a first target operator, wherein an executable operator can be run directly because it does not depend on the running results of other operators;
    a detection module, configured to determine, for a chip that is executing a second task, whether the remaining computing resources of the chip meet a preset condition;
    a second determination module, configured to, if the determination result of the detection module is yes, determine at least one executable operator corresponding to the second task currently being run by the chip as a second target operator; and
    an execution module, configured to allocate at least part of the remaining computing resources of the chip to each first target operator without affecting the computing resources allocated by the chip to each second target operator, so that the chip runs each first target operator in parallel while continuing to run each second target operator, thereby executing the first task.
  11. The apparatus according to claim 10, wherein the execution module is specifically configured to:
    screen out, from a plurality of predetermined parallelizable operator combinations, candidate operator combinations that contain the second target operator;
    for each candidate operator combination, if the number of first target operators contained in the candidate operator combination exceeds a preset first threshold, determine the candidate operator combination as a target operator combination; and
    run, by the chip, each first target operator and each second target operator in the target operator combination in parallel, so as to execute the first task.
  12. The apparatus according to claim 10, further comprising a third determination module configured to:
    obtain a plurality of target models;
    for each target model, determine data transmission dependencies between operators contained in the target model; and
    determine, according to the data transmission dependencies of each of the plurality of target models, combinations of operators that have no data transmission dependency on one another among all operators contained in the plurality of target models, as the parallelizable operator combinations.
  13. The apparatus according to claim 12, wherein the third determination module is specifically configured to:
    determine, according to the data transmission dependencies determined for each target model, a data flow graph corresponding to the target model, wherein in the data flow graph each node represents an operator contained in the target model, and an edge between two nodes represents a data transmission dependency between the two nodes; and
    determine the parallelizable operator combinations according to the plurality of data flow graphs corresponding to the plurality of target models.
  14. The apparatus according to claim 13, wherein, for determining the parallelizable operator combinations according to the plurality of data flow graphs corresponding to the plurality of target models, the third determination module is specifically configured to:
    for each node contained in the plurality of data flow graphs, determine, through multiple rounds of iteration, the parallelizable operator combinations corresponding to the node, wherein
    for each round of iteration,
    a target node combination in the round of iteration is determined, the target node combination comprising one target node or a plurality of target nodes that have no dependency on one another;
    for each other node contained in the plurality of data flow graphs other than the target nodes in the target node combination, if the other node does not depend on any target node in the target node combination, a parallelizable operator combination is determined according to the other node and the target node combination; and
    the determined parallelizable operator combination is used as the target node combination in the next round of iteration, or a target node for the next round of iteration is selected in turn from the nodes contained in the plurality of data flow graphs, until a preset termination condition is met.
  15. The apparatus according to claim 14, wherein, for determining a parallelizable operator combination according to the other node and the target node combination, the third determination module is specifically configured to:
    merge the other node and the target node combination into a candidate operator combination; and
    if the running time of the candidate operator combination on the chip does not exceed a preset threshold, determine the candidate operator combination as a parallelizable operator combination.
  16. The apparatus according to claim 14, wherein, for selecting a target node for the next round of iteration in turn from the nodes contained in the plurality of data flow graphs, the third determination module is specifically configured to:
    select one node in turn from a candidate node set as the target node in the next round of iteration,
    wherein the candidate node set comprises all nodes contained in the plurality of data flow graphs whose corresponding operators have a running time on the chip that does not exceed a preset second threshold.
  17. The apparatus according to claim 15, wherein, in order to determine the running time of the candidate operator combination on the chip, the third determination module is specifically configured to:
    obtain relevant features of the candidate operator combination, the relevant features comprising, for each operator contained in the candidate operator combination, the computation amount on the chip used historically, the historical data bandwidth, and the historical running time, as well as at least one of the average, maximum, minimum, and variance of the data transmission sizes of the operators contained in the candidate operator combination; and
    input the relevant features into a preset prediction model, so that the prediction model predicts and outputs the running time of the candidate operator combination on the chip.
  18. The apparatus according to claim 11, wherein the parallelizable operator combinations are determined while the chip is offline.
  19. A computer-readable storage medium, wherein the storage medium stores a computer program, and when the computer program is executed by a processor, the method according to any one of claims 1 to 9 is implemented.
  20. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the method according to any one of claims 1 to 9.
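
To make the claimed scheduling flow easier to follow, here is a minimal, non-limiting Python sketch of the online phase described in claims 1 and 2. Every name in it (Operator, Chip, schedule_first_task, min_resources, first_threshold) is an illustrative assumption introduced for this sketch rather than anything disclosed in the application; resource accounting and operator launch on a real chip would be considerably more involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    name: str
    deps: tuple = ()  # operators whose running results this operator needs


@dataclass
class Chip:
    total_resources: int
    used_resources: int = 0

    @property
    def remaining(self) -> int:
        return self.total_resources - self.used_resources


def executable(ops):
    """Executable operators: those that do not depend on any other operator's result (claim 1)."""
    return {op for op in ops if not op.deps}


def schedule_first_task(first_task_ops, second_task_ops, chip,
                        parallel_combos, min_resources, first_threshold):
    """Co-schedule a first task on a chip already running a second task (claims 1 and 2)."""
    first_targets = executable(first_task_ops)    # first target operators
    second_targets = executable(second_task_ops)  # second target operators

    # Claim 1: only proceed when the chip's remaining resources meet the preset condition.
    if chip.remaining < min_resources:
        return None

    # Claim 2: among predetermined parallelizable combinations, keep those containing a
    # second target operator, then pick one with enough first target operators.
    for combo in parallel_combos:
        if not combo & second_targets:
            continue
        if len(combo & first_targets) > first_threshold:
            # The chosen operators run in parallel; the second task keeps the resources
            # it already holds, and the first target operators use part of the remainder.
            return combo & (first_targets | second_targets)
    return None
```

A caller would typically pass `parallel_combos` produced offline (claim 9) and translate the returned combination into actual kernel launches on the chip.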
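
The offline phase of claims 3 to 8 can likewise be sketched. The code below reuses the `Operator` type from the previous sketch, builds a data flow graph per target model (claims 3 and 4), and grows parallelizable operator combinations round by round under a running-time threshold (claims 5 to 7). The `predict_runtime` stub stands in for the prediction model of claim 8, which the application feeds with features such as computation amount, data bandwidth, historical running time, and data-transfer statistics; the graph encoding, the growth order, and the termination condition chosen here are assumptions made for illustration only.

```python
def build_dataflow_graph(model_ops):
    """Claim 4: one node per operator; an edge (u, v) when v consumes u's output."""
    edges = {(dep, op) for op in model_ops for dep in op.deps}
    return set(model_ops), edges


def has_dependency(node, group, edges):
    """Direct data-transmission dependency between `node` and any node in `group`.
    A full implementation might also rule out transitive dependencies."""
    return any((g, node) in edges or (node, g) in edges for g in group)


def predict_runtime(combo):
    """Stub for the claim-8 prediction model: assume a fixed unit cost per operator."""
    return len(combo)


def enumerate_parallel_combos(models, runtime_threshold):
    """Claims 5-7: grow parallelizable operator combinations over multiple rounds."""
    nodes, edges = set(), set()
    for ops in models:
        n, e = build_dataflow_graph(ops)
        nodes |= n
        edges |= e

    # Candidate node set (claim 7): operators whose own running time stays under the threshold.
    candidates = {n for n in nodes if predict_runtime({n}) <= runtime_threshold}

    combos = {frozenset({n}) for n in candidates}  # initial target node "combinations"
    frontier = set(combos)
    while frontier:  # termination condition here: no combination can be grown further
        grown = set()
        for combo in frontier:
            for other in candidates - set(combo):
                if has_dependency(other, combo, edges):
                    continue
                merged = frozenset(combo | {other})  # claim 6: merge into a candidate combination
                if predict_runtime(merged) <= runtime_threshold and merged not in combos:
                    combos.add(merged)
                    grown.add(merged)
        frontier = grown
    return combos
```

Run offline (claim 9), the resulting `combos` would be handed to the online scheduler above as `parallel_combos`.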
PCT/CN2023/101479 2023-05-08 2023-06-20 Task execution method, apparatus, storage medium, and electronic device WO2024051270A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202310509060.5A CN116225669B (en) 2023-05-08 2023-05-08 Task execution method and device, storage medium and electronic equipment
CN202310509060.5 2023-05-08

Publications (1)

Publication Number Publication Date
WO2024051270A1 true WO2024051270A1 (en) 2024-03-14

Family

ID=86579092

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/101479 WO2024051270A1 (en) 2023-05-08 2023-06-20 Task execution method, apparatus, storage medium, and electronic device

Country Status (2)

Country Link
CN (1) CN116225669B (en)
WO (1) WO2024051270A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116225669B (en) * 2023-05-08 2024-01-09 之江实验室 Task execution method and device, storage medium and electronic equipment
CN116880995B (en) * 2023-09-08 2024-01-09 之江实验室 Execution method and device of model task, storage medium and electronic equipment

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070283358A1 (en) * 2006-06-06 2007-12-06 Hironori Kasahara Method for controlling heterogeneous multiprocessor and multigrain parallelizing compiler
US20110321051A1 (en) * 2010-06-25 2011-12-29 Ebay Inc. Task scheduling based on dependencies and resources
WO2020026010A2 (en) * 2018-08-02 2020-02-06 优视科技新加坡有限公司 Task execution control method, device, equipment/terminal/server and storage medium
CN112035229A (en) * 2020-08-31 2020-12-04 腾讯科技(深圳)有限公司 Calculation graph processing method and device and storage medium
CN112068957A (en) * 2020-08-27 2020-12-11 北京灵汐科技有限公司 Resource allocation method, device, computer equipment and storage medium
CN112199196A (en) * 2020-10-21 2021-01-08 上海交通大学 Resource allocation method, medium and server
CN115237582A (en) * 2022-09-22 2022-10-25 摩尔线程智能科技(北京)有限责任公司 Method for processing multiple tasks, processing equipment and heterogeneous computing system
CN116225669A (en) * 2023-05-08 2023-06-06 之江实验室 Task execution method and device, storage medium and electronic equipment

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102521056B (en) * 2011-12-28 2013-08-14 用友软件股份有限公司 Task allocation device and task allocation method
CN103838621B (en) * 2012-11-27 2017-05-10 中国电信股份有限公司 Method and system for scheduling routine work and scheduling nodes
CN106095586A (en) * 2016-06-23 2016-11-09 东软集团股份有限公司 A kind of method for allocating tasks, Apparatus and system
CN107291545B (en) * 2017-08-07 2019-12-10 星环信息科技(上海)有限公司 Task scheduling method and device for multiple users in computing cluster
CN110554909A (en) * 2019-09-06 2019-12-10 腾讯科技(深圳)有限公司 task scheduling processing method and device and computer equipment
US11704185B2 (en) * 2020-07-14 2023-07-18 Microsoft Technology Licensing, Llc Machine learning-based techniques for providing focus to problematic compute resources represented via a dependency graph
CN112256420B (en) * 2020-10-30 2022-12-02 重庆紫光华山智安科技有限公司 Task allocation method and device and electronic equipment
CN112596898A (en) * 2020-12-16 2021-04-02 北京三快在线科技有限公司 Task executor scheduling method and device
CN115309562A (en) * 2021-05-07 2022-11-08 北京三快在线科技有限公司 Operator calling system, operator generating method and electronic equipment
CN114138440A (en) * 2021-11-30 2022-03-04 上海阵量智能科技有限公司 Operator execution device, operator scheduling device, method and chip
CN114168302A (en) * 2021-12-28 2022-03-11 中国建设银行股份有限公司 Task scheduling method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN116225669B (en) 2024-01-09
CN116225669A (en) 2023-06-06

Similar Documents

Publication Publication Date Title
WO2024051270A1 (en) Task execution method, apparatus, storage medium, and electronic device
JP2022008497A (en) Correlation of stack segment strength in emerging relationships
US10025637B2 (en) System and method for runtime grouping of processing elements in streaming applications
US7739635B2 (en) Conjunctive BDD building and variable quantification using case-splitting
WO2022139879A1 (en) Methods, systems, articles of manufacture and apparatus to optimize resources in edge networks
US11775269B2 (en) Generating a synchronous digital circuit from a source code construct defining a function call
WO2024007849A1 (en) Distributed training container scheduling for intelligent computing
US20190065284A1 (en) Hybrid acceleration in a processing environment
CN108415695A (en) A kind of data processing method, device and equipment based on visualization component
WO2022148086A1 (en) Information processing method and apparatus, and device and storage medium
JP2023545970A (en) Query engine autoscaling for enterprise-level big data workloads
CN108536613B (en) Data cleaning method and device and server
CN114429195A (en) Performance optimization method and device for hybrid expert model training
WO2024017177A1 (en) Method and apparatus for executing service, storage medium and device
US20180225333A1 (en) Data write/import performance in a database through distributed memory
WO2023231342A1 (en) Method and apparatus for automatically executing contract on the basis of variable state
CN107493205B (en) Method and device for predicting capacity expansion performance of equipment cluster
US11379468B1 (en) Control flow graph refining via execution data
CN107645541B (en) Data storage method and device and server
CN115795342B (en) Method and device for classifying business scenes, storage medium and electronic equipment
CN116755893B (en) Job scheduling method and device of deep learning-oriented distributed computing system
CN117076336B (en) Testing method and device of cloud edge cooperative system, storage medium and equipment
CN117348999B (en) Service execution system and service execution method
CN112506652A (en) Dynamic resource partitioning method
CN116204324A (en) Task execution method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23861973

Country of ref document: EP

Kind code of ref document: A1