WO2024051270A1 - Task execution method, apparatus, storage medium, and electronic device


Info

Publication number
WO2024051270A1
WO2024051270A1, PCT/CN2023/101479, CN2023101479W
Authority
WO
WIPO (PCT)
Prior art keywords
operator
target
combination
node
chip
Prior art date
Application number
PCT/CN2023/101479
Other languages
French (fr)
Chinese (zh)
Inventor
唐晓瑜
毛旷
潘秋红
汤昭荣
王颖
杨弢
Original Assignee
之江实验室
Priority date
Filing date
Publication date
Application filed by 之江实验室 (Zhejiang Lab)
Publication of WO2024051270A1 publication Critical patent/WO2024051270A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This specification relates to the field of artificial intelligence technology, and in particular to methods, devices, storage media and electronic equipment for task execution.
  • Currently, artificial intelligence models are widely used in fields such as autonomous driving and augmented reality.
  • When a business platform performs a task through an artificial intelligence model, it usually determines the operators required for the task from the operators contained in the model and runs those operators on a chip to execute the task.
  • This specification provides a task execution method, device, storage medium and electronic equipment to partially solve the above problems existing in the prior art.
  • This specification provides a task execution method, which includes: obtaining a first task request; determining the operator combination required to execute the first task corresponding to the first task request, and determining at least one executable operator in the operator combination as a first target operator, where an executable operator can be run directly because it does not need to rely on the running results of other operators; for a chip that is executing a second task, judging whether the remaining computing resources of the chip satisfy a preset condition; if the result of the judgment is yes, determining at least one executable operator corresponding to the second task currently run by the chip as a second target operator; and, without affecting the computing resources the chip has allocated to each second target operator, allocating at least part of the chip's remaining computing resources to each first target operator, so that the chip runs each first target operator in parallel, on the basis of running each second target operator, to execute the first task.
  • Optionally, allocating at least part of the remaining computing resources of the chip to each first target operator without affecting the computing resources the chip has allocated to each second target operator specifically includes: selecting, from a plurality of predetermined parallelizable operator combinations, candidate operator combinations that contain the second target operator; for each candidate operator combination, if the number of first target operators contained in the candidate operator combination exceeds a preset first threshold, determining the candidate operator combination as a target operator combination; and running, by the chip, each first target operator and each second target operator in the target operator combination in parallel to execute the first task.
  • Optionally, determining a parallelizable operator combination specifically includes: obtaining multiple target models; for each target model, determining the data transmission dependencies between the operators included in the target model; and, according to the data transmission dependencies of each of the multiple target models, determining an operator combination in which none of the operators included in the multiple target models have data transmission dependencies on each other, as the parallelizable operator combination.
  • Optionally, determining, according to the data transmission dependencies of each of the multiple target models, an operator combination in which the operators included in the multiple target models have no data transmission dependencies on each other as a parallelizable operator combination specifically includes: determining, according to the data transmission dependency determined for each target model, a data flow graph corresponding to that target model, where each node in the data flow graph represents an operator included in the target model and an edge between two nodes represents a data transmission dependency between the two nodes; and determining the parallelizable operator combination according to the multiple data flow graphs corresponding to the multiple target models.
  • Optionally, determining a parallelizable operator combination according to the data flow graphs corresponding to the multiple target models specifically includes: for each node included in the multiple data flow graphs, determining the parallelizable operator combination corresponding to that node through multiple rounds of iteration. For each round of iteration, the target node combination in this round is determined, where the target node combination includes one target node or multiple target nodes that have no dependencies on one another; for each other node in the multiple data flow graphs other than the target nodes in the target node combination, if the other node does not depend on any target node in the target node combination, a parallelizable operator combination is determined based on the other node and the target node combination, and the determined parallelizable operator combination is used as the target node combination in the next round of iteration, or the target node in the next round of iteration is selected in turn from the nodes included in the multiple data flow graphs, until a preset termination condition is met.
  • Optionally, determining a parallelizable operator combination based on the other node and the target node combination specifically includes: merging the other node and the target node combination into a candidate operator combination; and, if the running time of the candidate operator combination on the chip does not exceed a preset second threshold, determining the candidate operator combination as a parallelizable operator combination.
  • sequentially selecting the target node in the next round of iteration from the nodes included in the plurality of data flow graphs specifically includes: sequentially selecting a node from the candidate node set as the target node in the next round of iteration.
  • the candidate node set includes all nodes included in the plurality of data flow graphs for which the running time of the corresponding operator in the chip does not exceed a preset second threshold.
  • Optionally, determining the running time of the candidate operator combination on the chip specifically includes: obtaining relevant features of the candidate operator combination, where the relevant features include, for each operator contained in the candidate operator combination, the historical compute usage on the chip, the historical data bandwidth, and the historical running time, as well as at least one of the average, maximum, minimum, and variance of the data transmission sizes of the operators contained in the candidate operator combination; and inputting the relevant features into a preset prediction model to predict and output, through the prediction model, the running time of the candidate operator combination on the chip.
  • the parallelizable operator combination is determined when the chip is offline.
  • This specification provides a task execution device, including: an acquisition module, used to obtain a first task request; a first determination module, used to determine the operator combination required to execute the first task corresponding to the first task request and to determine at least one executable operator in the operator combination as a first target operator, where an executable operator can be run directly because it does not need to rely on the running results of other operators; a detection module, used to judge, for a chip that is executing a second task, whether the remaining computing resources of the chip meet a preset condition; a second determination module, used to determine, if the judgment result of the detection module is yes, at least one executable operator corresponding to the second task currently run by the chip as a second target operator; and an execution module, used to allocate at least part of the chip's remaining computing resources to each first target operator without affecting the computing resources the chip has allocated to each second target operator, so that the first task is executed by the chip running each first target operator in parallel on the basis of running each second target operator.
  • Optionally, the execution module is specifically configured to: select, from a plurality of predetermined parallelizable operator combinations, candidate operator combinations containing the second target operator; for each candidate operator combination, if the number of first target operators contained in the candidate operator combination exceeds the preset first threshold, determine the candidate operator combination as the target operator combination; and run, by the chip, each first target operator and each second target operator in the target operator combination in parallel to execute the first task.
  • Optionally, the device further includes a third determination module, configured to: obtain multiple target models; for each target model, determine the data transmission dependencies between the operators included in the target model; and, according to the data transmission dependencies of each of the multiple target models, determine an operator combination in which none of the operators included in the multiple target models have data transmission dependencies on each other, as the parallelizable operator combination.
  • Optionally, the third determination module is specifically configured to: determine, according to the data transmission dependency determined for each target model, the data flow graph corresponding to that target model, where each node in the data flow graph represents an operator included in the target model and an edge between two nodes represents a data transmission dependency between the two nodes; and determine the parallelizable operator combination according to the multiple data flow graphs corresponding to the multiple target models.
  • Optionally, the third determination module is specifically configured to, for each node included in the multiple data flow graphs, determine the parallelizable operator combination corresponding to that node through multiple rounds of iteration. For each round of iteration, the target node combination in this round is determined, where the target node combination includes one target node or multiple target nodes that have no dependencies on one another; for each other node in the multiple data flow graphs other than the target nodes in the target node combination, if the other node does not depend on any target node in the target node combination, a parallelizable operator combination is determined based on the other node and the target node combination, and the determined parallelizable operator combination is used as the target node combination in the next round of iteration, or the target node in the next round of iteration is selected in turn from the nodes included in the multiple data flow graphs, until a preset termination condition is met.
  • Optionally, the third determination module is specifically configured to merge the other node and the target node combination into a candidate operator combination and, if the running time of the candidate operator combination on the chip does not exceed a preset second threshold, determine the candidate operator combination as a parallelizable operator combination.
  • the third determination module is specifically configured to sequentially select a node from the candidate node set as the target node in the next round of iteration.
  • the candidate node set includes all nodes included in the plurality of data flow graphs for which the running time of the corresponding operator in the chip does not exceed a preset second threshold.
  • Optionally, the third determination module is specifically configured to: obtain relevant features of the candidate operator combination, where the relevant features include, for each operator contained in the candidate operator combination, the historical compute usage on the chip, the historical data bandwidth, and the historical running time, as well as at least one of the average, maximum, minimum, and variance of the data transmission sizes of the operators contained in the candidate operator combination; and input the relevant features into a preset prediction model to predict and output, through the prediction model, the running time of the candidate operator combination on the chip.
  • the parallelizable operator combination is determined when the chip is offline.
  • This specification provides a computer-readable storage medium that stores a computer program; when the computer program is executed by a processor, the above task execution method is implemented.
  • This specification provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, the above task execution method is implemented.
  • In the task execution method described above, an executable operator that does not need to rely on the running results of other operators, among the operator combination required to execute the first task corresponding to the current task request, is determined as a first target operator. When the remaining computing resources of a chip that is executing a second task meet the preset condition, at least part of the chip's remaining computing resources are allocated to each first target operator without affecting the computing resources the chip has allocated to each executable operator of the second task (the second target operators), so that the chip runs each first target operator in parallel on the basis of running each second target operator to execute the first task, thereby effectively improving the utilization of the chip's computing resources.
  • Figure 1 is a schematic diagram of a task execution method provided according to embodiments in this specification.
  • Figure 2 is a schematic diagram of a data flow diagram provided according to an embodiment in this specification.
  • Figure 3 is a schematic diagram of the process of determining a parallelizable operator combination according to an embodiment of this specification.
  • Figure 4 is a functional module schematic diagram of a task execution device provided according to an embodiment of this specification.
  • FIG. 5 is a schematic structural diagram of an electronic device for task execution provided according to an embodiment of this specification.
  • This specification provides a task execution method, as shown in Figure 1.
  • the method includes the following steps S101 to S105.
  • Step S101: obtain a first task request.
  • In this specification, users can perform corresponding tasks through the artificial intelligence models deployed on the business platform.
  • Specifically, a user can send a task request to the business platform through the device the user is using, so that after obtaining the task request, the business platform can dispatch the computing resources of a chip in response to the received task request to execute the task corresponding to that request.
  • a user can send a product recommendation request to the business platform through the device used by the user. After receiving the product recommendation request, the business platform can call the chip to run the operator in the product recommendation model to perform the product recommendation task.
  • In this specification, the execution subject used to implement the task execution method may be a server or other device deployed on the business platform, or may be a terminal device such as a desktop computer or a laptop computer.
  • Step S102: determine the operator combination required to execute the first task corresponding to the first task request, and determine at least one executable operator included in the operator combination as a first target operator.
  • An executable operator does not need to rely on the running results of other operators and can therefore be run directly by the chip.
  • the server may determine the operator combination required to execute the task corresponding to the task request, and determine at least one executable operator included in the operator combination as the first target operator.
  • the executable operators here can be run directly because they do not need to rely on the results of other operators.
  • Specifically, the server determines the model used to perform the task, and the operators included in that model form the operator combination required to perform the task; the chip can then run the operators included in the operator combination to execute the task.
  • The server can also run the executable operators through multiple chips, and can therefore determine at least one executable operator in the operator combination as a first target operator. For example, assume the executable operators include operator a, operator b, operator c, and operator d: the server can run operator a through chip one, operator b through chip two, and operators c and d through chip three.
  • After the server determines the operator combination that needs to be run in response to the task request issued by the user, it can determine, from the operator combination, the operators that can be directly run by the chip as the first target operators. Running the first target operators through the chip allows the other operators in the operator combination that depend on the running results of the first target operators to continue to be run through the chip.
  • After the server finishes running the first target operators through the chip, the operators in the operator combination corresponding to the first task that depend on the first target operators become operators that can be executed directly by the chip.
  • At this point, the server can re-determine, from the operator combination, at least one new operator that can be directly run by the chip as a first target operator.
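As an illustration of how the set of directly runnable ("executable") operators could be tracked, here is a minimal Python sketch. The `Operator` class, its fields, and the `ready_operators` helper are illustrative assumptions, not names taken from the patent.

```python
class Operator:
    def __init__(self, name, depends_on=()):
        self.name = name
        self.depends_on = set(depends_on)  # names of operators whose results this one needs

def ready_operators(operators, finished):
    """Return the operators whose every dependency has already produced its result."""
    done_names = {f.name for f in finished}
    return [op for op in operators
            if op not in finished and op.depends_on <= done_names]

# Example: a and b depend on nothing, c depends on a's result.
a, b, c = Operator("a"), Operator("b"), Operator("c", depends_on=["a"])
ops = [a, b, c]
print([op.name for op in ready_operators(ops, finished=set())])  # ['a', 'b']
print([op.name for op in ready_operators(ops, finished={a})])    # ['b', 'c']
```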
  • Step S103: for a chip that is executing a second task, determine whether the remaining computing resources of the chip meet a preset condition.
  • Step S104: if the judgment result in step S103 is yes, determine the executable operators corresponding to the second task currently run by the chip as second target operators.
  • After the server determines the first target operators that currently need to be run through a chip, it can judge, for a chip that is executing a second task, whether the remaining computing resources of the chip meet the preset condition. If the judgment result is yes, the chip can run the first target operators corresponding to the first task in parallel on the basis of executing the second task. If the judgment result is no, a chip in an idle state can be allocated to the first task to run the first target operators corresponding to the first task; or, if there is currently no chip in an idle state, the first task can be placed in a waiting state.
  • The above preset condition can be set according to actual needs, for example, determining whether the remaining computing resources of the chip reach a specified resource threshold.
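The dispatch decision of steps S103 and S104 could look roughly like the sketch below. All names here (`remaining_compute`, `currently_executable_ops`, the resource threshold, the waiting queue) are hypothetical stand-ins for whatever the business platform actually exposes.

```python
def dispatch_first_task(first_target_ops, busy_chips, idle_chips,
                        resource_threshold, waiting_queue):
    """Prefer co-locating the first task on a chip already running a second task,
    provided that chip's spare compute satisfies the preset condition."""
    for chip in busy_chips:
        if chip.remaining_compute >= resource_threshold:        # preset condition met
            second_target_ops = chip.currently_executable_ops()  # operators of the second task
            return chip, second_target_ops                       # co-schedule on this chip
    if idle_chips:                                               # otherwise use an idle chip
        return idle_chips[0], []
    waiting_queue.append(first_target_ops)                       # or keep the first task waiting
    return None, []
```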
  • Step S105: without affecting the computing resources the chip has allocated to each second target operator, allocate at least part of the chip's remaining computing resources to each first target operator, and execute the first task by having the chip run each first target operator in parallel on the basis of running each second target operator.
  • The server can allocate at least part of the chip's remaining computing resources to each first target operator without affecting the computing resources the chip has allocated to each second target operator, and execute the corresponding first task by running each first target operator in parallel.
  • Specifically, the server can filter out, from the predetermined parallelizable operator combinations, those that include the second target operator as candidate operator combinations. For each candidate operator combination, if the number of first target operators included in that candidate operator combination exceeds the preset first threshold, the candidate operator combination is determined as the target operator combination, and the chip runs each first target operator and each second target operator in the target operator combination in parallel.
  • In practice, when the server needs to execute the first task in parallel through a chip that is already executing a second task, it has to predict whether the first target operators to be executed and the second target operators currently running can be run in parallel, and this prediction process may increase the response latency of the task request. For this reason, the server can determine, in advance and in an offline state, all operator combinations that can be executed in parallel as the parallelizable operator combinations; in actual applications, it can then directly filter out, from the parallelizable operator combinations, the target operator combinations that match the second target operators and the first target operators, so that the chip runs them in parallel.
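A minimal sketch of this online selection step, assuming the pre-computed parallelizable combinations are available as collections of operator identifiers; whether a candidate must contain all of the currently running second target operators, or only some of them, is an assumption made here for concreteness.

```python
def pick_target_combinations(parallelizable_combos, first_targets, second_targets,
                             first_threshold):
    """Keep pre-computed parallelizable combinations that contain the running second
    target operators, then require that each covers more than `first_threshold`
    of the first target operators."""
    first_targets, second_targets = set(first_targets), set(second_targets)
    candidates = [c for c in parallelizable_combos if second_targets <= set(c)]
    return [c for c in candidates
            if len(first_targets & set(c)) > first_threshold]
```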
  • The method for the server to determine the parallelizable operator combinations may include: obtaining multiple target models; for each target model, determining the data transmission dependencies between the operators included in the target model; and, according to the data transmission dependencies corresponding to each target model, determining an operator combination in which none of the operators included in the multiple target models have data transmission dependencies on each other, as a parallelizable operator combination.
  • FIG. 2 is a schematic diagram of a data flow diagram provided according to an embodiment of this specification.
  • As shown in Figure 2, the server can determine the data flow graph corresponding to each target model based on the determined data transmission dependencies. In the data flow graph, each node represents an operator contained in the target model, and an edge between two nodes represents a data transmission dependency between the two nodes.
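For illustration, a data flow graph of this kind can be represented as a directed graph, one per target model; the use of `networkx` here is an assumption made for the sketch, not something prescribed by the patent.

```python
import networkx as nx

def build_dataflow_graph(model_ops, edges):
    """One directed graph per target model: a node per operator and an edge
    u -> v whenever operator v consumes the output of operator u."""
    g = nx.DiGraph()
    g.add_nodes_from(model_ops)
    g.add_edges_from(edges)  # (producer, consumer) pairs
    return g

# Hypothetical model with operators A..D, in the style of Figure 2.
g = build_dataflow_graph(["A", "B", "C", "D"],
                         [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")])
```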
  • For each node included in the data flow graphs, the server can determine the parallelizable operator combination corresponding to that node through multiple rounds of iteration.
  • In each round of iteration, the server can determine the target node combination for that round, where the target node combination includes one target node or multiple target nodes that have no dependencies on one another. For each other node besides the target nodes in the target node combination, the server determines whether that node depends on any target node in the target node combination.
  • If the other node does not depend on any target node in the target node combination, the server can determine a parallelizable operator combination based on the other node and the target node combination, and use the determined parallelizable operator combination as the target node combination in the next round of iteration, or select the target node for the next round of iteration in turn from the nodes included in the data flow graphs.
  • In practice, the server can check, according to the data flow graph, whether any target node in the target node combination is a parent node or ancestor node of the other node. If a target node in the target node combination is a parent node or ancestor node of the other node, it can be determined that the other node depends on that target node. Here, in the data flow graph, if there is an edge from a first node to a second node, the first node is the parent node of the second node and the second node is a child node of the first node; furthermore, the parent nodes and ancestor nodes of the first node are also ancestor nodes of the second node.
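The parent/ancestor test described above can be expressed as a reachability query on the data flow graph; the sketch below assumes the `networkx` representation from the earlier example.

```python
import networkx as nx

def depends_on_combination(graph, node, target_combination):
    """A node depends on the target node combination if any target node is its
    parent or ancestor in the data flow graph, i.e. the node is reachable from
    that target node along directed edges."""
    return any(target != node and nx.has_path(graph, target, node)
               for target in target_combination)
```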
  • The server may determine each target node combination obtained in this way as a parallelizable operator combination.
  • The above preset termination condition may be that all nodes included in the data flow graphs have participated in the iteration as target nodes.
  • The above preset termination condition may also be that the number of iteration rounds reaches a specified number of rounds.
  • The server determines a parallelizable operator combination based on the other node and the target node combination specifically as follows: the server merges the other node and the target node combination into a candidate operator combination and determines the running time of the candidate operator combination on the chip; if the running time of the candidate operator combination on the chip does not exceed the preset second threshold, the candidate operator combination is determined to be a parallelizable operator combination.
  • In other words, the server can predict the running time of each candidate operator combination on the chip, and further screen the candidate operator combinations based on their predicted running times on the chip.
  • In addition, for each node included in the data flow graphs, the server can determine the running time on the chip of the operator corresponding to that node; if that running time does not exceed the preset second threshold, the node is added to a preset candidate node set. In this way, during the multiple rounds of iteration for determining parallelizable operator combinations, one node can be selected in turn from the candidate node set as the target node in the next round of iteration.
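A compact sketch of the two checks just described: building the candidate node set from single-operator running times, and accepting a merged combination only if its predicted co-run time stays within the second threshold. `predict_runtime` is a hypothetical callable standing in for the prediction model discussed below.

```python
def candidate_node_set(graphs, predict_runtime, second_threshold):
    """Keep only nodes whose single-operator runtime on the chip stays within the
    second threshold; these seed the subsequent rounds of iteration."""
    return {n for g in graphs for n in g.nodes
            if predict_runtime((n,)) <= second_threshold}

def try_merge(other_node, target_combination, predict_runtime, second_threshold):
    """Merge a non-dependent node into the current target node combination and
    accept the result only if its predicted co-run time is within the threshold."""
    candidate = tuple(target_combination) + (other_node,)
    return candidate if predict_runtime(candidate) <= second_threshold else None
```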
  • The server determines the running time of a candidate operator combination on the chip specifically as follows: the server obtains the relevant features of the candidate operator combination and inputs the relevant features into a preset prediction model, so that the running time of the candidate operator combination on the chip is predicted and output by the prediction model.
  • The relevant features here include: for each operator contained in the candidate operator combination, the historical compute usage on the chip, the historical data bandwidth, and the historical running time, as well as at least one of the average, maximum, minimum, and variance of the data transmission sizes of the operators contained in the candidate operator combination. It should be noted that when the candidate operator combination contains only one operator, what is obtained through the above method is the running time of that single operator on the chip.
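One possible way to assemble the listed features and query a trained regressor is sketched below. The exact feature layout (per-operator statistics summed across the combination plus four transfer-size statistics) and the field names are assumptions; `model` can be any regressor exposing a `predict` method.

```python
import numpy as np

def combination_features(ops):
    """Assemble the features named in the text for a candidate operator combination.
    Each op is assumed to be a dict with historical measurements (illustrative keys)."""
    sizes = np.array([op["transfer_size"] for op in ops], dtype=float)
    per_op = np.array([[op["hist_compute"], op["hist_bandwidth"], op["hist_runtime"]]
                       for op in ops], dtype=float)
    stats = [sizes.mean(), sizes.max(), sizes.min(), sizes.var()]
    return np.concatenate([per_op.sum(axis=0), stats])  # 3 + 4 = 7 features

def predict_runtime(ops, model):
    """Feed the features to a pre-trained regression model that outputs the
    combination's co-run time on the chip."""
    return float(model.predict(combination_features(ops)[None, :])[0])
```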
  • The above prediction model may be trained as follows: a sample operator combination containing at least one sample operator is input into the prediction model, and the running time of the sample operator combination on the chip is predicted and output by the prediction model; the prediction model is then trained by minimizing the deviation between the running time output by the prediction model and the actual running time of the sample operator combination obtained through offline simulation.
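A minimal training sketch, under the assumption that "minimizing the deviation" is implemented as a mean-squared-error objective over offline-simulated running times; the network size and optimizer are arbitrary choices, and the 7-dimensional input matches the illustrative feature layout above.

```python
import torch
import torch.nn as nn

# Small regression network trained to minimize the gap between the predicted
# co-run time and the run time measured in offline simulation.
predictor = nn.Sequential(nn.Linear(7, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()  # one possible "minimize the deviation" objective

def train_step(features, measured_runtime):
    """features: (batch, 7) float tensor; measured_runtime: (batch,) float tensor."""
    optimizer.zero_grad()
    predicted = predictor(features).squeeze(-1)
    loss = loss_fn(predicted, measured_runtime)
    loss.backward()
    optimizer.step()
    return loss.item()
```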
  • FIG. 3 is a schematic diagram of a process for determining a parallelizable operator combination provided according to an embodiment in this specification.
  • As shown in Figure 3, the server can encode the operator corresponding to each node in the data flow graph.
  • The encoding method may be to number the nodes in the data flow graph sequentially in ascending order starting from 0, where N represents the total number of nodes in the data flow graph.
  • The server can predict that the running time on the chip of the operator corresponding to node i (hereinafter also referred to as the i-th operator, or operator i) is ti, where i is an integer greater than or equal to 0 and less than N and denotes the encoding of the node and of the operator corresponding to that node. If the running time ti of operator i does not exceed the preset second threshold, operator i is added to set 1 and set 2.
  • Then the server can determine whether set 1 is empty. When set 1 is not empty, the server can select an operator (or operator combination) x from set 1, delete x from set 1, and combine each operator in set 2 with x to obtain an operator combination x′. Further, if there is no data dependency between the operators in operator combination x′ and the running time of x′ on a single chip does not exceed the second threshold, operator combination x′ is added to set 1 and to the result set R; otherwise, operator combination x′ is discarded. The above process is repeated until set 1 is empty. At that point, every operator combination in the result set R is a parallelizable operator combination.
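One reading of the Figure 3 procedure as runnable code is given below. The initialization of set 1 and set 2 with the operators that individually fit within the second threshold, and the use of sets of operator identifiers, are assumptions; `has_dependency` and `predict_runtime` are hypothetical callbacks for the dependency check and the prediction model.

```python
def enumerate_parallel_combinations(operators, has_dependency, predict_runtime,
                                    second_threshold):
    """set_2 holds individual operators that fit on the chip, set_1 holds
    combinations still to be expanded, and result_set collects every accepted
    parallelizable combination."""
    set_2 = [op for op in operators if predict_runtime((op,)) <= second_threshold]
    set_1 = [frozenset([op]) for op in set_2]
    result_set = set(set_1)

    while set_1:                                   # repeat until set 1 is empty
        combo_x = set_1.pop()                      # select and remove a combination x
        for op in set_2:                           # combine x with each operator in set 2
            candidate = combo_x | {op}             # operator combination x'
            if candidate in result_set:            # already accepted earlier
                continue
            independent = not any(has_dependency(a, b)
                                  for a in candidate for b in candidate if a != b)
            if independent and predict_runtime(tuple(candidate)) <= second_threshold:
                set_1.append(candidate)            # keep x' for further expansion
                result_set.add(candidate)          # and record it as parallelizable
    return result_set
```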
  • the device includes an acquisition module 401, a first determination module 402, a detection module 403, a second determination module 404, and an execution module 405.
  • The acquisition module 401 is used to obtain the first task request.
  • The first determination module 402 is used to determine the operator combination required to execute the first task corresponding to the first task request, and to determine at least one executable operator in the operator combination as a first target operator; an executable operator does not need to rely on the running results of other operators and can be run directly.
  • the detection module 403 is used to determine whether the remaining computing resources of the chip that is executing the second task meet the preset conditions.
  • the second determination module 404 is configured to determine at least one executable operator corresponding to the second task currently run by the chip as each second target operator if the determination result of the detection module 403 is yes.
  • The execution module 405 is configured to allocate at least part of the chip's remaining computing resources to each first target operator, without affecting the computing resources the chip has allocated to each second target operator, so that the chip runs each first target operator in parallel on the basis of running each second target operator to execute the first task.
  • Optionally, the execution module 405 is specifically configured to: select candidate operator combinations containing the second target operator from a plurality of predetermined parallelizable operator combinations; for each candidate operator combination, if the number of first target operators included in the candidate operator combination exceeds the preset first threshold, determine the candidate operator combination as the target operator combination; and have the chip run each first target operator and each second target operator in the target operator combination in parallel to execute the first task.
  • Optionally, the device further includes a third determination module 406, specifically used to: obtain multiple target models; for each target model, determine the data transmission dependencies between the operators included in the target model; and, according to the data transmission dependencies corresponding to the multiple target models, determine operator combinations in which none of the operators included in the multiple target models have data transmission dependencies on each other, as parallelizable operator combinations.
  • Optionally, the third determination module 406 is specifically configured to: determine the data flow graph corresponding to each target model according to the determined data transmission dependencies, where each node in the data flow graph represents an operator included in the target model and an edge between two nodes represents a data transmission dependency between the two nodes; and determine the parallelizable operator combinations according to the data flow graphs.
  • Optionally, the third determination module 406 is specifically configured to: for each node included in the data flow graphs, determine the parallelizable operator combination corresponding to that node through multiple rounds of iteration. For each round of iteration, the third determination module 406 is specifically used to: determine the target node combination in this round of iteration, where the target node combination includes one target node or multiple target nodes that have no dependencies on one another; for each other node in the data flow graphs besides the target nodes in the target node combination, if the other node does not depend on any target node in the target node combination, determine a parallelizable operator combination based on the other node and the target node combination, and use the determined parallelizable operator combination as the target node combination in the next round of iteration, or select the target node in the next round of iteration in turn from the nodes included in the data flow graphs, until the preset termination condition is met.
  • Optionally, the third determination module 406 is specifically configured to: merge the other node and the target node combination into a candidate operator combination, and determine the running time of the candidate operator combination on the chip; if the running time of the candidate operator combination on the chip does not exceed a preset threshold, determine the candidate operator combination as a parallelizable operator combination.
  • Optionally, the third determination module 406 is specifically configured to: for each node included in the data flow graphs, determine the running time on the chip of the operator corresponding to that node; if that running time does not exceed the preset threshold, add the node to a preset candidate node set; and select one node in turn from the candidate node set as the target node in the next round of iteration.
  • Optionally, the third determination module 406 is specifically configured to: obtain relevant features of the candidate operator combination, where the relevant features include, for each operator contained in the candidate operator combination, the historical compute usage on the chip, the historical data bandwidth, and the historical running time, as well as at least one of the average, maximum, minimum, and variance of the data transmission sizes of the operators contained in the candidate operator combination; and input the relevant features into a preset prediction model to predict and output, through the prediction model, the running time of the candidate operator combination on the chip.
  • the parallelizable operator combination is determined when the chip is offline.
  • This specification also provides a computer-readable storage medium that stores a computer program, and the computer program can be used to perform the above method for task execution.
  • the electronic device includes a processor, internal bus, network interface, memory and non-volatile memory, and of course may also include other hardware required for business.
  • the processor reads the corresponding computer program from the non-volatile memory into the memory and then runs it to implement the above method for task execution.
  • PLD (Programmable Logic Device)
  • FPGA (Field Programmable Gate Array)
  • HDL (Hardware Description Language); examples of HDLs include ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, RHDL (Ruby Hardware Description Language), VHDL (Very-High-Speed Integrated Circuit Hardware Description Language), and Verilog
  • The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, or an embedded microcontroller.
  • Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller can also be implemented as part of the memory's control logic.
  • In addition to implementing the controller purely in computer-readable program code, it is entirely possible to logically program the method steps so that the controller achieves the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller can therefore be regarded as a hardware component, and the devices included in it for implementing various functions can also be regarded as structures within the hardware component; or even, the devices for implementing various functions can be regarded both as software modules implementing the method and as structures within the hardware component.
  • a typical implementation device is a computer.
  • The computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
  • embodiments of the present specification may be provided as methods, systems, or computer program products.
  • the present description may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment that combines software and hardware aspects.
  • the present description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk memory, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • Memory may include non-permanent storage in computer-readable media, random access memory (RAM) and/or non-volatile memory in the form of read-only memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.
  • Computer-readable media includes both persistent and non-volatile, removable and non-removable media that can be implemented by any method or technology for storage of information.
  • Information may be computer-readable instructions, data structures, modules of programs, or other data.
  • Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD), magnetic tape cassettes, magnetic tape and magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
  • computer-readable media does not include transitory media, such as modulated data signals and carrier waves.
  • program modules include routines, programs, objects, components, data structures, etc. that perform specific tasks or implement specific abstract data types.
  • the present description may also be practiced in distributed computing environments where tasks are performed by remote processing devices connected through communications networks.
  • program modules may be located in both local and remote computer storage media including storage devices.

Abstract

A task execution method, an apparatus, a storage medium, and an electronic device. All operator combinations that can be executed concurrently are determined in advance. When a first task is executed in response to a first task request initiated by a user, it is determined whether there is a pre-determined concurrently executable operator combination that matches the executable operators needed to execute the first task and the operators of a second task being executed by a chip; if so, the first task can be executed concurrently by the chip that is executing the second task, thereby improving the utilization of the chip's computing resources.

Description

Task execution method, apparatus, storage medium, and electronic device

Technical Field

This specification relates to the field of artificial intelligence technology, and in particular to task execution methods, apparatuses, storage media, and electronic devices.

Background

Currently, various artificial intelligence models are widely used in fields such as autonomous driving and augmented reality. When a business platform performs a task through an artificial intelligence model, it usually determines the operators required for the task from the operators contained in the model and runs those operators on a chip to execute the task.

While the chip runs these operators, only part of the chip's computing resources is occupied, which wastes the chip's computing resources and reduces their utilization. How to improve the utilization of the chip's computing resources is therefore an urgent problem to be solved.
Summary

This specification provides a task execution method, apparatus, storage medium, and electronic device to partially solve the above problems in the prior art.
本说明书提供了一种任务执行方法,包括:获取第一任务请求;确定执行所述第一任务请求对应的第一任务所需的算子组合,并确定所述算子组合中的至少一个可执行算子作为第一目标算子,所述可执行算子因不需要依赖其他算子的运行结果而可直接运行;针对正在执行第二任务的芯片,判断所述芯片剩余的计算资源是否满足预设条件;若所述判断的结果为是,则确定所述芯片当前运行的所述第二任务对应的至少一个可执行算子,作为第二目标算子;以不影响所述芯片为各所述第二目标算子分配的计算资源为基础,为各所述第一目标算子分配所述芯片至少部分剩余计算资源,以通过所述芯片在运行各所述第二目标算子的基础上并行运行各所述第一目标算子,来执行所述第一任务。This specification provides a task execution method, which includes: obtaining a first task request; determining the operator combination required to execute the first task corresponding to the first task request, and determining that at least one of the operator combinations can The execution operator is used as the first target operator. The executable operator can be run directly because it does not need to rely on the operation results of other operators; for the chip that is executing the second task, it is judged whether the remaining computing resources of the chip satisfy Preset conditions; if the result of the judgment is yes, determine at least one executable operator corresponding to the second task currently run by the chip as the second target operator; so as not to affect the operation of each of the chips. Based on the computing resources allocated by the second target operator, allocate at least part of the remaining computing resources of the chip to each of the first target operators, so as to run each of the second target operators through the chip. Run each of the first target operators in parallel to execute the first task.
可选地,以不影响所述芯片为各所述第二目标算子分配的计算资源为基础,为各所述第一目标算子分配所述芯片至少部分剩余计算资源,具体包括:从预先确定的多个可并行算子组合中筛选出包含有所述第二目标算子的候选算子组合;针对每个所述候选算子组合,若该候选算子组合中包含所述第一目标算子的数量超过预设的第一阈值,则确定该候选算子组合作为目标算子组合;由所述芯片并行运行所述目标算子组合中 的各所述第一目标算子和各所述第二目标算子,以执行所述第一任务。Optionally, on the basis of not affecting the computing resources allocated by the chip to each of the second target operators, allocating at least part of the remaining computing resources of the chip to each of the first target operators specifically includes: Select a candidate operator combination including the second target operator from the determined multiple parallelizable operator combinations; for each candidate operator combination, if the candidate operator combination includes the first target If the number of operators exceeds the preset first threshold, the candidate operator combination is determined as the target operator combination; the chip runs the target operator combination in parallel. each of the first target operators and each of the second target operators to perform the first task.
可选地,确定可行性算子组合,具体包括:获取多个目标模型;针对每个所述目标模型,确定该目标模型包含的算子之间的数据传输依赖关系;根据所述多个目标模型各自的所述数据传输依赖关系,确定所述多个目标模型包含的所有算子中互相之间不存在数据传输依赖关系的算子组合,作为所述可并行算子组合。Optionally, determining a feasible operator combination specifically includes: obtaining multiple target models; for each target model, determining the data transmission dependencies between operators included in the target model; According to the data transmission dependencies of each model, an operator combination that does not have a data transmission dependency relationship among all operators included in the multiple target models is determined as the parallelizable operator combination.
可选地,根据所述多个目标模型各自的所述数据传输依赖关系,确定所述多个目标模型包含的算子中互相之间不存在数据传输依赖关系的算子组合,作为可并行算子组合,具体包括:根据针对每个所述目标模型所确定出的所述数据传输依赖关系,确定所述目标模型对应的数据流图,在所述数据流图中,每个节点用于表征所述目标模型包含的算子,两个节点之间的边用于表征所述两个节点之间存在数据传输依赖关系;根据所述多个目标模型对应的多个所述数据流图,确定所述可并行算子组合。Optionally, according to the data transmission dependencies of each of the multiple target models, determine an operator combination that does not have a data transmission dependency with each other among the operators included in the multiple target models as a parallel operation. The sub-combination specifically includes: determining a data flow graph corresponding to the target model according to the data transmission dependency determined for each target model. In the data flow graph, each node is used to represent The operator included in the target model and the edge between two nodes are used to represent the existence of data transmission dependency between the two nodes; according to multiple data flow graphs corresponding to the multiple target models, determine The parallelizable operator combination.
可选地,根据所述多个目标模型对应的所述数据流图,确定可并行算子组合,具体包括:针对所述多个数据流图包含的每个节点,通过多轮迭代,确定该节点对应的可并行算子组合。其中,针对每轮迭代,确定该轮迭代中的目标节点组合,所述目标节点组合包括一个目标节点或包括相互不存在依赖关系的多个目标节点;针对所述多个数据流图包含的除所述目标节点组合中的各目标节点之外的每个其他节点,若该其他节点不依赖于所述目标节点组合中的任意一个目标节点,则根据该其他节点和所述目标节点组合确定可并行算子组合,并将确定出的所述可并行算子组合作为下一轮迭代中的目标节点组合,或从所述多个数据流图包含的节点中依次选取出下一轮迭代中的目标节点,直到满足预设的终止条件。Optionally, determining a parallelizable operator combination according to the data flow graph corresponding to the multiple target models specifically includes: for each node included in the multiple data flow graph, determine the combination through multiple rounds of iterations. The parallelizable operator combination corresponding to the node. Wherein, for each round of iteration, the target node combination in this round of iteration is determined, and the target node combination includes one target node or multiple target nodes that do not have mutual dependencies; for the multiple data flow graphs including For each other node other than each target node in the target node combination, if the other node does not depend on any target node in the target node combination, then the available node is determined based on the other node and the target node combination. Parallel operator combination, and use the determined parallelizable operator combination as the target node combination in the next round of iteration, or sequentially select the nodes in the next round of iteration from the nodes included in the multiple data flow graphs. target node until the preset termination conditions are met.
可选地,根据该其他节点和所述目标节点或该其他节点和所述目标节点组合,确定可并行算子组合,具体包括:将该其他节点和所述目标节点组合合并为候选算子组合;若所述候选算子组合在所述芯片中的运行时间不超过预设的第二阈值,则确定该候选算子组合为可并行算子组合。Optionally, determining a parallelizable operator combination based on the other node and the target node or the other node and the target node combination specifically includes: merging the other node and the target node combination into a candidate operator combination ; If the running time of the candidate operator combination in the chip does not exceed the preset second threshold, it is determined that the candidate operator combination is a parallelizable operator combination.
可选地,从所述多个数据流图包含的节点中依次选取出下一轮迭代中的目标节点,具体包括:从候选节点集中依次选取出一个节点,作为下一轮迭代中的目标节点。其中,所述候选节点集包括所述多个数据流图中包含的、对应的算子在所述芯片中的运行时间不超过预设的第二阈值的所有节点。Optionally, sequentially selecting the target node in the next round of iteration from the nodes included in the plurality of data flow graphs specifically includes: sequentially selecting a node from the candidate node set as the target node in the next round of iteration. . Wherein, the candidate node set includes all nodes included in the plurality of data flow graphs for which the running time of the corresponding operator in the chip does not exceed a preset second threshold.
可选地,确定所述候选算子组合在所述芯片中的运行时间,具体包括:获取所述候 选算子组合的相关特征,所述相关特征包括所述候选算子组合中包含的每个算子的历史所使用芯片的计算量,历史数据带宽,历史运行时间,所述候选算子组合中包含的算子的数据传输大小的平均值、最大值、最小值、方差中的至少一种;将所述相关特征输入到预设的预测模型中,以通过所述预测模型预测并输出所述候选算子组合在所述芯片中的运行时间。Optionally, determining the running time of the candidate operator combination in the chip specifically includes: obtaining the candidate operator combination. Relevant characteristics of the selected operator combination, including the calculation amount of the chip used by each operator included in the candidate operator combination, historical data bandwidth, historical running time, and the historical running time of each operator included in the candidate operator combination. At least one of the average, maximum, minimum, and variance of the data transmission size of the included operators; input the relevant features into the preset prediction model to predict and output the The running time of candidate operator combinations in the chip.
可选地,所述可并行算子组合是在所述芯片处于离线状态时确定的。Optionally, the parallelizable operator combination is determined when the chip is offline.
This specification provides a task execution apparatus, including: an acquisition module, configured to obtain a first task request; a first determination module, configured to determine the operator combination required to execute the first task corresponding to the first task request, and to determine at least one executable operator in the operator combination as a first target operator, where an executable operator can be run directly because it does not depend on the running results of other operators; a detection module, configured to determine, for a chip that is executing a second task, whether the remaining computing resources of the chip meet a preset condition; a second determination module, configured to, if the determination result of the detection module is yes, determine at least one executable operator corresponding to the second task currently running on the chip as a second target operator; and an execution module, configured to allocate at least part of the remaining computing resources of the chip to each first target operator without affecting the computing resources the chip has allocated to each second target operator, so as to execute the first task by having the chip run each first target operator in parallel while running each second target operator.
Optionally, the execution module is specifically configured to: select, from a plurality of predetermined parallelizable operator combinations, candidate operator combinations that contain the second target operators; for each candidate operator combination, if the number of first target operators included in the candidate operator combination exceeds a preset first threshold, determine the candidate operator combination as a target operator combination; and have the chip run each first target operator and each second target operator in the target operator combination in parallel to execute the first task.
Optionally, the apparatus further includes a third determination module, configured to: obtain multiple target models; for each target model, determine the data transmission dependencies between the operators included in the target model; and determine, based on the respective data transmission dependencies of the multiple target models, the combinations of operators that have no data transmission dependencies on one another among all operators included in the multiple target models, as the parallelizable operator combinations.
Optionally, the third determination module is specifically configured to: determine, based on the data transmission dependencies determined for each target model, the data flow graph corresponding to the target model, where each node in the data flow graph represents an operator included in the target model, and an edge between two nodes indicates a data transmission dependency between the two nodes; and determine the parallelizable operator combinations based on the multiple data flow graphs corresponding to the multiple target models.
Optionally, the third determination module is specifically configured to determine, for each node included in the multiple data flow graphs, the parallelizable operator combination corresponding to the node through multiple rounds of iteration. For each round of iteration: the target node combination for the round is determined, the target node combination including one target node or multiple target nodes that have no dependencies on one another; for each other node included in the multiple data flow graphs besides the target nodes in the target node combination, if the other node does not depend on any target node in the target node combination, a parallelizable operator combination is determined based on the other node and the target node combination, and the determined parallelizable operator combination is taken as the target node combination for the next round of iteration, or the target node for the next round of iteration is selected sequentially from the nodes included in the multiple data flow graphs, until a preset termination condition is met.
Optionally, the third determination module is specifically configured to merge the other node and the target node combination into a candidate operator combination and, if the running time of the candidate operator combination on the chip does not exceed a preset threshold, determine the candidate operator combination to be a parallelizable operator combination.
Optionally, the third determination module is specifically configured to sequentially select a node from a candidate node set as the target node for the next round of iteration, where the candidate node set includes all nodes, among those included in the multiple data flow graphs, whose corresponding operators have a running time on the chip that does not exceed the preset second threshold.
Optionally, the third determination module is specifically configured to: obtain relevant features of the candidate operator combination, the relevant features including, for each operator in the candidate operator combination, the computation amount of the chip used historically, the historical data bandwidth and the historical running time, as well as at least one of the average, maximum, minimum and variance of the data transmission sizes of the operators included in the candidate operator combination; and input the relevant features into a preset prediction model, so that the prediction model predicts and outputs the running time of the candidate operator combination on the chip.
Optionally, the parallelizable operator combinations are determined while the chip is offline.
This specification provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the task execution method described above.
This specification provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the task execution method described above when executing the program.
According to the task execution method provided by the embodiments of this specification, the executable operators that do not depend on the running results of other operators in the operator combination required to execute the first task corresponding to the current task request are determined as first target operators. When the remaining computing resources of a chip that is executing a second task meet a preset condition, at least part of the remaining computing resources of the chip are allocated to each first target operator without affecting the computing resources the chip has allocated to the executable operators of the second task (the second target operators), so that the chip can execute the first task by running each first target operator in parallel while running each second target operator, thereby effectively improving the utilization of the chip's computing resources.
Description of the drawings
The accompanying drawings are provided for a further understanding of this specification and constitute a part of this specification. The illustrative embodiments of this specification and their descriptions are used to explain this specification and do not constitute an improper limitation of this specification. In the drawings:
Figure 1 is a schematic diagram of a task execution method provided according to an embodiment of this specification;
Figure 2 is a schematic diagram of a data flow graph provided according to an embodiment of this specification;
Figure 3 is a schematic diagram of a process for determining parallelizable operator combinations according to an embodiment of this specification;
Figure 4 is a schematic diagram of the functional modules of a task execution apparatus provided according to an embodiment of this specification;
Figure 5 is a schematic structural diagram of an electronic device for task execution provided according to an embodiment of this specification.
Detailed description of the embodiments
To make the purposes, technical solutions and advantages of this specification clearer, the technical solutions of this specification are described clearly and completely below with reference to specific embodiments of this specification and the corresponding drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of this specification. Based on the embodiments in this specification, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the scope of protection of this specification.
The technical solutions provided by the embodiments of this specification are described in detail below with reference to the accompanying drawings.
This specification provides a task execution method. As shown in Figure 1, the method includes the following steps S101 to S105.
Step S101: obtain a first task request.
In this specification, a user can perform a corresponding task through an artificial intelligence model deployed on a business platform. Specifically, the user can send a task request to the business platform through the device used by the user, so that after obtaining the task request, the business platform can schedule the computing resources of a chip in response to the received task request to execute the task corresponding to the task request. For example, the user can send a product recommendation request to the business platform through the device used by the user; after receiving the product recommendation request, the business platform can call the chip to run the operators of a product recommendation model to perform the product recommendation task.
In this specification, the execution subject used to implement the task execution method may be a designated device deployed on the business platform, such as a server, or may be a terminal device such as a desktop computer or a laptop computer. For ease of description, the task execution method provided in this specification is described below by taking the server as the execution subject.
Step S102: determine the operator combination required to execute the first task corresponding to the first task request, and determine at least one executable operator included in the operator combination as a first target operator, where an executable operator does not depend on the running results of other operators and can therefore be run directly, for example by the chip.
For example, the server may determine the operator combination required to execute the task corresponding to the task request, and determine at least one executable operator included in the operator combination as the first target operator. An executable operator here can be run directly because it does not depend on the running results of other operators.
Specifically, when the server needs to execute a task, it determines the model(s) used to execute the task, and the operators included in those models form the operator combination required to execute the task; the task is then executed by running, through the chip, the operators included in the operator combination. The server may determine more than one model for executing the task. For example, executing task A may require running operators a and b of model one and operators c and d of model two.
In practical application scenarios, the server may also run the executable operators through multiple chips. Therefore, the server may determine at least one executable operator in the operator combination as the first target operator. For example, assuming the executable operators include operators a, b, c and d, the server may run operator a through chip one, operator b through chip two, and operators c and d through chip three.
Further, certain data transmission dependencies may exist between operators. For example, an operator may need to use the data processing results of other operators as its runtime parameters. Therefore, if an operator on which it depends has not been executed, the parameters required for running that operator have not yet been determined, and the operator cannot be run directly. Thus, after determining the operator combination that needs to be run in response to the task request issued by the user, the server can determine, from the operator combination, the operators that can be run directly by the chip as first target operators. Running the first target operators through the chip allows the other operators in the operator combination that depend on the running results of the first target operators to continue to be run through the chip.
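For illustration only, the following minimal Python sketch shows one way the dependency check described above could be expressed: an operator is executable once every operator it depends on has finished. The names (`Operator`, `find_executable_operators`) and the data layout are assumptions for the example, not part of this specification.

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Operator:
    name: str
    # Names of operators whose outputs this operator consumes.
    depends_on: Set[str] = field(default_factory=set)

def find_executable_operators(operators: List[Operator],
                              finished: Set[str]) -> List[Operator]:
    """Return operators that can run now: every dependency has finished."""
    return [op for op in operators
            if op.name not in finished and op.depends_on <= finished]

if __name__ == "__main__":
    ops = [
        Operator("a"),              # no dependencies, directly runnable
        Operator("b", {"a"}),       # needs a's result
        Operator("c", {"a", "b"}),  # needs a and b
    ]
    print([op.name for op in find_executable_operators(ops, finished=set())])
    # ['a']  -> operator a is a first target operator
```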
In addition, after the server has run the first target operators through the chip, the operators in the operator combination corresponding to the first task that depend on the first target operators become operators that can be executed directly by the chip once they have obtained the execution results of the first target operators. At this point, the server can again determine, from the operator combination, at least one new operator that can be run directly by the chip as the first target operator.
Step S103: for a chip that is executing a second task, determine whether the remaining computing resources of the chip meet a preset condition.
Step S104: if the determination result of step S103 is yes, obtain the executable operators corresponding to the second task currently running on the chip as second target operators.
After determining the first target operators that currently need to be run through a chip, the server can determine, for a chip that is executing a second task, whether the remaining computing resources of that chip meet the preset condition. If the determination result is yes, the chip can run the first target operators corresponding to the first task in parallel while executing the second task. If the determination result is no, an idle chip can be allocated to the first task to run its first target operators; or, if no idle chip is currently available, the first task can be placed in a waiting state. The preset condition can be set according to actual needs, for example, whether the remaining computing resources of the chip reach a specified threshold.
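A simplified sketch of the dispatch decision in steps S103 and S104 is given below, assuming the preset condition is a minimum amount of free compute resource on the chip; the resource units, the `Chip` class and the fallback to an idle chip or a waiting state are illustrative assumptions rather than a definitive implementation.

```python
from typing import List, Optional

class Chip:
    def __init__(self, name: str, total_units: int):
        self.name = name
        self.total_units = total_units
        self.used_units = 0            # resources held by the second task

    @property
    def free_units(self) -> int:
        return self.total_units - self.used_units

def pick_chip(busy_chips: List[Chip], idle_chips: List[Chip],
              min_free_units: int) -> Optional[Chip]:
    """Prefer co-locating the first task on a busy chip whose remaining
    resources satisfy the preset condition; otherwise fall back to an
    idle chip; otherwise return None (the task waits)."""
    for chip in busy_chips:
        if chip.free_units >= min_free_units:
            return chip
    return idle_chips[0] if idle_chips else None
```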
Step S105: on the basis of not affecting the computing resources the chip has allocated to each second target operator, allocate at least part of the remaining computing resources of the chip to each first target operator, and execute the first task by having the chip run each first target operator in parallel while running each second target operator.
For example, the server can allocate at least part of the remaining computing resources of the chip to each first target operator without affecting the computing resources the chip has allocated to each second target operator, and execute the corresponding first task by running the first target operators in parallel.
Specifically, the server can select, from the predetermined parallelizable operator combinations, the parallelizable operator combinations that contain the second target operators as candidate operator combinations. For each candidate operator combination, if the number of first target operators included in the candidate operator combination exceeds a preset first threshold, the candidate operator combination is determined as the target operator combination, and the chip runs each first target operator and each second target operator in the target operator combination in parallel.
In practical application scenarios, when the server needs to execute the first task in parallel on a chip that is executing the second task, it needs to predict whether the first target operators required for the first task can run in parallel with the second target operators of the currently running second task, and this prediction process may increase the response latency of the task request. For this reason, the server can determine in advance, while offline, all operator combinations that can be executed in parallel as the parallelizable operator combinations; in actual applications, the target operator combination matching the second target operators and the first target operators can then be selected directly from the parallelizable operator combinations so that the chip can run them in parallel.
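The online selection step described above could be sketched as follows; the set-based representation of operator combinations and the name `pick_target_combination` are assumptions for the example, and `first_threshold` stands for the count check on first target operators mentioned earlier.

```python
from typing import List, Optional, Set

def pick_target_combination(parallel_combos: List[Set[str]],
                            first_targets: Set[str],
                            second_targets: Set[str],
                            first_threshold: int) -> Optional[Set[str]]:
    """Filter the offline-built parallelizable combinations down to those
    containing the running second target operators, then keep one that
    covers more than `first_threshold` of the first target operators."""
    candidates = [c for c in parallel_combos if second_targets <= c]
    for combo in candidates:
        if len(combo & first_targets) > first_threshold:
            return combo
    return None
```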
The method by which the server determines the parallelizable operator combinations may include: obtaining multiple target models; for each target model, determining the data transmission dependencies between the operators included in the target model; and determining, based on the respective data transmission dependencies of the target models, the combinations of operators that have no data transmission dependencies on one another among all operators included in the multiple target models, as the parallelizable operator combinations.
Figure 2 is a schematic diagram of a data flow graph provided according to an embodiment of this specification. As can be seen from Figure 2, the server can determine the data flow graph corresponding to each target model based on the determined data transmission dependencies. In a data flow graph, each node represents an operator included in the target model, and an edge between two nodes indicates a data transmission dependency between the two nodes. When there is an edge from a node A to a node B, node B depends on node A.
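As an illustration of how such a data flow graph could be represented, the sketch below builds a simple parent-to-children adjacency structure from a per-model dependency listing; the `Model` layout and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Each model is described as {operator_name: [names of operators it depends on]}.
Model = Dict[str, List[str]]

def build_dataflow_graph(model: Model) -> Dict[str, Set[str]]:
    """Return edges as parent -> {children}; an edge A -> B means B
    depends on the output of A."""
    edges: Dict[str, Set[str]] = defaultdict(set)
    for op, deps in model.items():
        edges.setdefault(op, set())
        for parent in deps:
            edges[parent].add(op)
    return dict(edges)

if __name__ == "__main__":
    model_one: Model = {"a": [], "b": ["a"], "c": ["a", "b"]}
    print(build_dataflow_graph(model_one))
    # e.g. {'a': {'b', 'c'}, 'b': {'c'}, 'c': set()}
```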
Further, for each node included in the data flow graphs, the server can determine the parallelizable operator combination corresponding to the node through multiple rounds of iteration.
For each round of iteration, the server can: determine the target node combination for the round, the target node combination including one target node or multiple target nodes that have no dependencies on one another; and, for each other node included in the data flow graphs besides the target nodes in the target node combination, determine whether the other node depends on any target node in the target node combination.
If the other node does not depend on any target node in the target node combination, the server can determine a parallelizable operator combination based on the other node and the target node combination, and take the determined parallelizable operator combination as the target node combination for the next round of iteration, or select the target node for the next round of iteration sequentially from the nodes included in the data flow graphs.
For each other node, the server can determine, in turn, based on the data flow graph, whether each target node in the target node combination is a parent node or an ancestor node of the other node; if any target node in the target node combination is a parent node or an ancestor node of the other node, it can be determined that the other node depends on that target node. In a data flow graph, if there is an edge from a first node to a second node, the first node is a parent node and the second node is a child node; furthermore, the parent nodes and ancestor nodes of the first node are also ancestor nodes of the second node.
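The parent/ancestor test described above amounts to a reachability query on the data flow graph; a minimal sketch, assuming the parent-to-children adjacency representation from the previous example:

```python
from typing import Dict, Set

def node_depends_on(node: str, target: str, edges: Dict[str, Set[str]]) -> bool:
    """True if `target` is a parent or ancestor of `node`, i.e. there is
    a path target -> ... -> node in the data flow graph."""
    stack, seen = [target], set()
    while stack:
        cur = stack.pop()
        if cur in seen:
            continue
        seen.add(cur)
        for child in edges.get(cur, ()):
            if child == node:
                return True
            stack.append(child)
    return False
```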
Further, when the server determines that a preset termination condition is met, it can determine the obtained target node combinations as the parallelizable operator combinations. The preset termination condition may be that all nodes included in the data flow graphs have participated in the iterative processing as target nodes. Alternatively, the preset termination condition may be that the number of iteration rounds reaches a specified number.
Determining a parallelizable operator combination based on the other node and the target node combination may specifically be: the server merges the other node and the target node combination into a candidate operator combination and determines the running time of the candidate operator combination on the chip; if the running time of the candidate operator combination on the chip does not exceed a preset second threshold, the candidate operator combination is determined to be a parallelizable operator combination.
As can be seen from the above, the server can also predict the running time of each candidate operator combination on the chip and further filter the candidate operator combinations based on the predicted running times.
In practical application scenarios, some operators may have long running times on the chip and are therefore not suitable for being combined with other operators into a parallelizable operator combination. Therefore, before determining the parallelizable operator combinations, the server can also determine, for each node included in the data flow graphs, the running time on the chip of the operator corresponding to the node; if that running time does not exceed the preset second threshold, the node is added to a preset candidate node set. In this way, during the multiple rounds of iteration for determining the parallelizable operator combinations, one node can be selected in turn from the candidate node set as the target node for the next round of iteration.
Determining the running time of a candidate operator combination on the chip may specifically be: the server obtains relevant features of the candidate operator combination and inputs the relevant features into a preset prediction model, so that the prediction model predicts and outputs the running time of the candidate operator combination on the chip. The relevant features include, for each operator in the candidate operator combination, the computation amount of the chip used historically, the historical data bandwidth and the historical running time, as well as at least one of the average, maximum, minimum and variance of the data transmission sizes of the operators included in the candidate operator combination. It should be noted that, when the candidate operator combination includes only one operator, the above method yields the running time of that single operator on the chip.
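The feature extraction and prediction step could look like the sketch below. The feature list follows the description above, but the way per-operator features are aggregated into a fixed-length vector, and the use of a scikit-learn gradient boosting regressor as the prediction model, are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from statistics import mean, pvariance
from typing import Dict, List

def combo_features(ops: List[Dict]) -> List[float]:
    """Flatten the features listed above for one candidate combination:
    per-operator historical compute, bandwidth and runtime (summed here to
    get a fixed-length vector), plus statistics of the transfer sizes."""
    sizes = [op["transfer_size"] for op in ops]
    return [
        sum(op["compute"] for op in ops),
        sum(op["bandwidth"] for op in ops),
        sum(op["runtime"] for op in ops),
        mean(sizes), max(sizes), min(sizes), pvariance(sizes),
    ]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.random((64, 7))                      # stand-in historical features
    y = X[:, 2] * 1.1 + rng.random(64) * 0.01    # stand-in measured runtimes
    model = GradientBoostingRegressor().fit(X, y)

    combo = [{"compute": 1.0, "bandwidth": 0.5, "runtime": 0.3, "transfer_size": 8.0},
             {"compute": 2.0, "bandwidth": 0.7, "runtime": 0.4, "transfer_size": 16.0}]
    predicted = model.predict([combo_features(combo)])[0]
    print(f"predicted runtime: {predicted:.3f}")
```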
The prediction model may be trained as follows: a sample operator combination including at least one sample operator is input into the prediction model, and the prediction model predicts and outputs the running time of the sample operator combination on the chip; the prediction model is then trained so as to minimize the deviation between the running time of the sample operator combination on the chip output by the prediction model and the actual running time of the sample operator combination obtained through offline simulation.
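A minimal sketch of this training objective, minimizing the squared deviation between predicted runtimes and offline-simulated runtimes, is shown below; a plain linear model fitted by gradient descent stands in for whatever predictor is actually used, and all names and constants are illustrative.

```python
import numpy as np

def train_runtime_predictor(features: np.ndarray,
                            simulated_runtimes: np.ndarray,
                            lr: float = 0.05,
                            epochs: int = 2000) -> np.ndarray:
    """Fit weights w so that features @ w approximates the offline-simulated
    runtimes, by gradient descent on the mean squared deviation."""
    n, d = features.shape
    w = np.zeros(d)
    for _ in range(epochs):
        pred = features @ w
        grad = 2.0 / n * features.T @ (pred - simulated_runtimes)
        w -= lr * grad
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.random((128, 7))                                      # sample combination features
    y = X @ np.array([0.2, 0.1, 1.0, 0.05, 0.0, 0.0, 0.01])       # simulated runtimes
    w = train_runtime_predictor(X, y)
    print("learned weights:", np.round(w, 2))
```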
Figure 3 is a schematic diagram of a process for determining parallelizable operator combinations provided according to an embodiment of this specification.
As can be seen from Figure 3, the server can encode the operator corresponding to each node in the data flow graph. For example, the nodes in the data flow graph can be encoded in ascending order from 0 to N-1, where N is the total number of nodes in the data flow graph.
Further, the server can predict the running time ti on the chip of the operator corresponding to node i (hereinafter also referred to as the i-th operator, or operator i). When the running time ti of operator i is less than the preset second threshold, operator i is added to set 1 and set 2, and also to the result set R. Here, i is an integer greater than or equal to 0 and less than N, and denotes the encoding of the node and of the operator corresponding to the node.
In each round of iteration, the server can determine whether set 1 is empty. When set 1 is not empty, the server selects an element x from set 1 (an operator or an operator combination) and removes x from set 1, and combines each operator in set 2 with x to obtain an operator combination x'. Further, if there are no data dependencies between the operators in operator combination x' and the running time of x' on a single chip does not exceed the second threshold, x' is added to set 1 and to the result set R; otherwise, x' is discarded. The above process is repeated until set 1 is empty, at which point every operator combination in the result set R is a parallelizable operator combination.
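Putting the procedure of Figure 3 together, the sketch below enumerates parallelizable operator combinations using the two working sets and the result set R described above. The runtime predictor is stubbed out with a trivial function, the dependency test is a plain reachability check, and all names are illustrative rather than part of this specification.

```python
from typing import Callable, Dict, FrozenSet, List, Set

def enumerate_parallel_combos(
        nodes: List[str],
        edges: Dict[str, Set[str]],                 # parent -> children
        runtime_of: Callable[[FrozenSet[str]], float],
        threshold: float) -> Set[FrozenSet[str]]:
    """Seed set 1 / set 2 with the single operators whose predicted runtime
    is under the threshold, then repeatedly merge a combination x from set 1
    with each seed operator, keeping merges that are dependency-free and
    whose predicted runtime stays under the threshold."""

    def dependent(a: str, b: str) -> bool:
        """True if a reaches b or b reaches a in the data flow graph."""
        def reaches(src: str, dst: str) -> bool:
            stack, seen = [src], set()
            while stack:
                cur = stack.pop()
                if cur == dst:
                    return True
                if cur in seen:
                    continue
                seen.add(cur)
                stack.extend(edges.get(cur, ()))
            return False
        return reaches(a, b) or reaches(b, a)

    seeds = [n for n in nodes if runtime_of(frozenset([n])) <= threshold]
    set1: List[FrozenSet[str]] = [frozenset([n]) for n in seeds]   # work list
    set2: List[str] = list(seeds)                                  # single operators
    result: Set[FrozenSet[str]] = set(set1)                        # result set R

    while set1:
        x = set1.pop()
        for op in set2:
            if op in x or any(dependent(op, member) for member in x):
                continue
            merged = x | {op}
            if merged in result:
                continue
            if runtime_of(merged) <= threshold:
                set1.append(merged)
                result.add(merged)
    return result

if __name__ == "__main__":
    # Two tiny models: a -> b, plus an independent operator c.
    edges = {"a": {"b"}, "b": set(), "c": set()}
    combos = enumerate_parallel_combos(
        nodes=["a", "b", "c"],
        edges=edges,
        runtime_of=lambda combo: float(len(combo)),  # stand-in predictor
        threshold=2.0)
    print(sorted(sorted(c) for c in combos))
    # [['a'], ['a', 'c'], ['b'], ['b', 'c'], ['c']]
```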
As can be seen from the above, all operator combinations that can be executed in parallel are determined in advance; when the executable operators of the first task to be executed in response to a user-initiated task request and the executable operators of the second task being executed by the chip can form one of these parallelizable operator combinations, the chip executes the first task in parallel while executing the second task, which effectively improves the utilization of the chip's computing resources.
The above is the task execution method provided by one or more embodiments of this specification. Based on the same idea, this specification also provides a corresponding task execution apparatus. As shown in Figure 4, the apparatus includes an acquisition module 401, a first determination module 402, a detection module 403, a second determination module 404 and an execution module 405.
The acquisition module 401 is configured to obtain a first task request.
The first determination module 402 is configured to determine the operator combination required to execute the first task corresponding to the first task request, and to determine at least one executable operator in the operator combination as a first target operator, where an executable operator can be run directly because it does not depend on the running results of other operators.
The detection module 403 is configured to determine, for a chip that is executing a second task, whether the remaining computing resources of the chip meet a preset condition.
The second determination module 404 is configured to, if the determination result of the detection module 403 is yes, determine at least one executable operator corresponding to the second task currently running on the chip as the second target operators.
The execution module 405 is configured to allocate at least part of the remaining computing resources of the chip to each first target operator without affecting the computing resources the chip has allocated to each second target operator, so that the chip runs each first target operator in parallel while running each second target operator.
Optionally, the execution module 405 is specifically configured to: select, from a plurality of predetermined parallelizable operator combinations, candidate operator combinations that contain the second target operators; for each candidate operator combination, if the number of first target operators included in the candidate operator combination exceeds a preset first threshold, determine the candidate operator combination as the target operator combination; and have the chip run each first target operator and each second target operator in the target operator combination in parallel.
Optionally, the apparatus further includes a third determination module 406. The third determination module 406 is specifically configured to: obtain multiple target models; for each target model, determine the data transmission dependencies between the operators included in the target model; and determine, based on the data transmission dependencies corresponding to the multiple target models, the combinations of operators that have no data transmission dependencies on one another among all operators included in the multiple target models, as the parallelizable operator combinations.
Optionally, the third determination module 406 is specifically configured to: determine, based on the determined data transmission dependencies, the data flow graph corresponding to each target model, where each node in the data flow graph represents an operator included in the target model, and an edge between two nodes indicates a data transmission dependency between the two nodes; and determine the parallelizable operator combinations based on the data flow graphs.
Optionally, the third determination module 406 is specifically configured to determine, for each node included in the data flow graphs, the parallelizable operator combination corresponding to the node through multiple rounds of iteration. For each round of iteration, the third determination module 406 is specifically configured to: determine the target node combination for the round, the target node combination including one target node or multiple target nodes that have no dependencies on one another; for each other node included in the data flow graphs besides the target nodes in the target node combination, if the other node does not depend on any target node in the target node combination, determine a parallelizable operator combination based on the other node and the target node combination, and take the determined parallelizable operator combination as the target node combination for the next round of iteration, or select the target node for the next round of iteration sequentially from the nodes included in the data flow graphs; and, after determining that a preset termination condition is met, obtain the parallelizable operator combinations.
Optionally, the third determination module 406 is specifically configured to: merge the other node and the target node combination into a candidate operator combination and determine the running time of the candidate operator combination on the chip; and, if the running time of the candidate operator combination on the chip does not exceed a preset threshold, determine the candidate operator combination to be a parallelizable operator combination.
Optionally, the third determination module 406 is specifically configured to: determine, for each node included in the data flow graph, the running time on the chip of the operator corresponding to the node; if that running time does not exceed a preset threshold, add the node to a preset candidate node set; and select a node in turn from the candidate node set as the target node for the next round of iteration.
Optionally, the third determination module 406 is specifically configured to: obtain relevant features of the candidate operator combination, the relevant features including, for each operator in the candidate operator combination, the computation amount of the chip used historically, the historical data bandwidth and the historical running time, as well as at least one of the average, maximum, minimum and variance of the data transmission sizes of the operators included in the candidate operator combination; and input the relevant features into a preset prediction model, so that the prediction model predicts and outputs the running time of the candidate operator combination on the chip.
Optionally, the parallelizable operator combinations are determined while the chip is offline.
This specification also provides a computer-readable storage medium storing a computer program that can be used to execute the above method for task execution.
This specification also provides a schematic structural diagram of the electronic device for task execution shown in Figure 5. As shown in Figure 5, at the hardware level, the electronic device includes a processor, an internal bus, a network interface, a memory and a non-volatile storage, and may of course also include hardware required by other services. The processor reads the corresponding computer program from the non-volatile storage into the memory and then runs it to implement the above method for task execution.
Of course, in addition to software implementations, this specification does not exclude other implementations, such as logic devices or a combination of software and hardware; that is, the execution subject of the following processing flow is not limited to logical units and may also be hardware or logic devices.
In the 1990s, an improvement to a technology could clearly be distinguished as an improvement in hardware (for example, an improvement to circuit structures such as diodes, transistors and switches) or an improvement in software (an improvement to a method flow). However, with the development of technology, many of today's improvements to method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be implemented with hardware entity modules. For example, a programmable logic device (PLD) (such as a field programmable gate array, FPGA) is an integrated circuit whose logic functions are determined by the user's programming of the device. Designers program a digital system to be "integrated" onto a PLD by themselves, without asking a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, instead of manually making integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compilers used in program development; the original code to be compiled must also be written in a specific programming language, called a hardware description language (HDL), of which there is not just one kind but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most widely used. Those skilled in the art should also understand that a hardware circuit implementing a logical method flow can easily be obtained simply by logic-programming the method flow in one of the above hardware description languages and programming it into an integrated circuit.
A controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20 and Silicone Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art also know that, in addition to implementing the controller purely as computer-readable program code, it is entirely possible to logic-program the method steps so that the controller implements the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a controller can therefore be regarded as a hardware component, and the devices included in it for implementing various functions can also be regarded as structures within the hardware component. Or even, the devices for implementing various functions can be regarded both as software modules implementing the method and as structures within the hardware component.
The systems, apparatuses, modules or units described in the above embodiments may be implemented by computer chips or entities, or by products having certain functions. A typical implementation device is a computer. Specifically, the computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above apparatus is described as being divided into various units by function. Of course, when implementing this specification, the functions of the units may be implemented in one or more pieces of software and/or hardware.
Those skilled in the art should understand that embodiments of this specification may be provided as a method, a system or a computer program product. Therefore, this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this specification may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM and optical storage) containing computer-usable program code.
This specification is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to the embodiments of this specification. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces and memory.
The memory may include non-persistent storage in a computer-readable medium, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include persistent and non-persistent, removable and non-removable media, and information storage can be implemented by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise" and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. In the absence of further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or device that includes the element.
Those skilled in the art should understand that embodiments of this specification may be provided as a method, a system or a computer program product. Therefore, this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this specification may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM and optical storage) containing computer-usable program code.
This specification may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures and the like that perform specific tasks or implement specific abstract data types. This specification may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in local and remote computer storage media, including storage devices.
The embodiments in this specification are described in a progressive manner; for identical or similar parts between the embodiments, reference may be made to each other, and each embodiment focuses on its differences from the other embodiments. In particular, the system embodiment is described relatively simply because it is substantially similar to the method embodiment, and for relevant points reference may be made to the description of the method embodiment.

Claims (20)

  1. A task execution method, characterized by comprising:
    obtaining a first task request;
    determining an operator combination required to execute a first task corresponding to the first task request, and determining at least one executable operator in the operator combination as a first target operator, wherein an executable operator can be run directly because it does not depend on running results of other operators;
    for a chip that is executing a second task, determining whether remaining computing resources of the chip meet a preset condition;
    if a result of the determining is yes,
    determining at least one executable operator corresponding to the second task currently running on the chip as a second target operator; and
    allocating at least part of the remaining computing resources of the chip to each first target operator without affecting the computing resources the chip has allocated to each second target operator, so as to execute the first task by running, through the chip, each first target operator in parallel while running each second target operator.
  2. The method according to claim 1, characterized in that allocating at least part of the remaining computing resources of the chip to each first target operator without affecting the computing resources the chip has allocated to each second target operator specifically comprises:
    selecting, from a plurality of predetermined parallelizable operator combinations, candidate operator combinations that contain the second target operator;
    for each candidate operator combination, if the number of first target operators included in the candidate operator combination exceeds a preset first threshold, determining the candidate operator combination as a target operator combination; and
    running, by the chip in parallel, each first target operator and each second target operator in the target operator combination.
  3. The method according to claim 2, characterized in that determining the parallelizable operator combinations specifically comprises:
    obtaining a plurality of target models;
    for each target model, determining data transmission dependencies between operators included in the target model; and
    determining, according to the respective data transmission dependencies of the plurality of target models, combinations of operators that have no data transmission dependencies on one another among all operators included in the plurality of target models, as the parallelizable operator combinations.
  4. The method according to claim 3, wherein determining, according to the data transmission dependencies of each of the plurality of target models, combinations of operators that have no data transmission dependency on one another among all operators contained in the plurality of target models, as the parallelizable operator combinations, comprises:
    determining, according to the data transmission dependencies determined for each target model, a data flow graph corresponding to the target model, wherein in the data flow graph each node represents an operator contained in the target model, and an edge between two nodes represents a data transmission dependency between the two nodes; and
    determining the parallelizable operator combinations according to the plurality of data flow graphs corresponding to the plurality of target models.
  5. The method according to claim 4, wherein determining the parallelizable operator combinations according to the plurality of data flow graphs corresponding to the plurality of target models comprises:
    for each node contained in the plurality of data flow graphs, determining, through multiple rounds of iteration, the parallelizable operator combinations corresponding to the node, wherein
    for each round of iteration,
    a target node combination in the round of iteration is determined, the target node combination comprising one target node or a plurality of target nodes that have no dependency on one another;
    for each other node contained in the plurality of data flow graphs other than the target nodes in the target node combination, if the other node does not depend on any target node in the target node combination, a parallelizable operator combination is determined according to the other node and the target node combination; and
    the determined parallelizable operator combination is used as the target node combination in the next round of iteration, or a target node for the next round of iteration is selected in turn from the nodes contained in the plurality of data flow graphs, until a preset termination condition is met.
  6. The method according to claim 5, wherein determining a parallelizable operator combination according to the other node and the target node combination comprises:
    merging the other node and the target node combination into a candidate operator combination; and
    if the running time of the candidate operator combination on the chip does not exceed a preset second threshold, determining the candidate operator combination as a parallelizable operator combination.
  7. The method according to claim 5, wherein selecting a target node for the next round of iteration in turn from the nodes contained in the plurality of data flow graphs comprises:
    selecting one node in turn from a candidate node set as the target node in the next round of iteration,
    wherein the candidate node set comprises all nodes contained in the plurality of data flow graphs whose corresponding operators have a running time on the chip that does not exceed a preset second threshold.
  8. The method according to claim 6, wherein determining the running time of the candidate operator combination on the chip comprises:
    obtaining relevant features of the candidate operator combination, the relevant features comprising, for each operator contained in the candidate operator combination, the computation amount on the chip used historically, the historical data bandwidth, and the historical running time, as well as at least one of the average, maximum, minimum, and variance of the data transmission sizes of the operators contained in the candidate operator combination; and
    inputting the relevant features into a preset prediction model, so that the prediction model predicts and outputs the running time of the candidate operator combination on the chip.
  9. The method according to claim 2, wherein the parallelizable operator combinations are determined while the chip is offline.
  10. A task execution apparatus, comprising:
    an obtaining module, configured to obtain a first task request;
    a first determination module, configured to determine an operator combination required to execute a first task corresponding to the first task request, and determine at least one executable operator in the operator combination as a first target operator, wherein an executable operator can be run directly because it does not depend on the running results of other operators;
    a detection module, configured to determine, for a chip that is executing a second task, whether the remaining computing resources of the chip meet a preset condition;
    a second determination module, configured to, if the determination result of the detection module is yes, determine at least one executable operator corresponding to the second task currently being run by the chip as a second target operator; and
    an execution module, configured to allocate at least part of the remaining computing resources of the chip to each first target operator without affecting the computing resources allocated by the chip to each second target operator, so that the chip runs each first target operator in parallel while continuing to run each second target operator, thereby executing the first task.
  11. The apparatus according to claim 10, wherein the execution module is specifically configured to:
    screen out, from a plurality of predetermined parallelizable operator combinations, candidate operator combinations that contain the second target operator;
    for each candidate operator combination, if the number of first target operators contained in the candidate operator combination exceeds a preset first threshold, determine the candidate operator combination as a target operator combination; and
    run, by the chip, each first target operator and each second target operator in the target operator combination in parallel, so as to execute the first task.
  12. The apparatus according to claim 10, further comprising a third determination module configured to:
    obtain a plurality of target models;
    for each target model, determine data transmission dependencies between operators contained in the target model; and
    determine, according to the data transmission dependencies of each of the plurality of target models, combinations of operators that have no data transmission dependency on one another among all operators contained in the plurality of target models, as the parallelizable operator combinations.
  13. The apparatus according to claim 12, wherein the third determination module is specifically configured to:
    determine, according to the data transmission dependencies determined for each target model, a data flow graph corresponding to the target model, wherein in the data flow graph each node represents an operator contained in the target model, and an edge between two nodes represents a data transmission dependency between the two nodes; and
    determine the parallelizable operator combinations according to the plurality of data flow graphs corresponding to the plurality of target models.
  14. The apparatus according to claim 13, wherein, for determining the parallelizable operator combinations according to the plurality of data flow graphs corresponding to the plurality of target models, the third determination module is specifically configured to:
    for each node contained in the plurality of data flow graphs, determine, through multiple rounds of iteration, the parallelizable operator combinations corresponding to the node, wherein
    for each round of iteration,
    a target node combination in the round of iteration is determined, the target node combination comprising one target node or a plurality of target nodes that have no dependency on one another;
    for each other node contained in the plurality of data flow graphs other than the target nodes in the target node combination, if the other node does not depend on any target node in the target node combination, a parallelizable operator combination is determined according to the other node and the target node combination; and
    the determined parallelizable operator combination is used as the target node combination in the next round of iteration, or a target node for the next round of iteration is selected in turn from the nodes contained in the plurality of data flow graphs, until a preset termination condition is met.
  15. The apparatus according to claim 14, wherein, for determining a parallelizable operator combination according to the other node and the target node combination, the third determination module is specifically configured to:
    merge the other node and the target node combination into a candidate operator combination; and
    if the running time of the candidate operator combination on the chip does not exceed a preset threshold, determine the candidate operator combination as a parallelizable operator combination.
  16. The apparatus according to claim 14, wherein, for selecting a target node for the next round of iteration in turn from the nodes contained in the plurality of data flow graphs, the third determination module is specifically configured to:
    select one node in turn from a candidate node set as the target node in the next round of iteration,
    wherein the candidate node set comprises all nodes contained in the plurality of data flow graphs whose corresponding operators have a running time on the chip that does not exceed a preset second threshold.
  17. The apparatus according to claim 15, wherein, in order to determine the running time of the candidate operator combination on the chip, the third determination module is specifically configured to:
    obtain relevant features of the candidate operator combination, the relevant features comprising, for each operator contained in the candidate operator combination, the computation amount on the chip used historically, the historical data bandwidth, and the historical running time, as well as at least one of the average, maximum, minimum, and variance of the data transmission sizes of the operators contained in the candidate operator combination; and
    input the relevant features into a preset prediction model, so that the prediction model predicts and outputs the running time of the candidate operator combination on the chip.
  18. The apparatus according to claim 11, wherein the parallelizable operator combinations are determined while the chip is offline.
  19. A computer-readable storage medium, wherein the storage medium stores a computer program, and when the computer program is executed by a processor, the method according to any one of claims 1 to 9 is implemented.
  20. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the method according to any one of claims 1 to 9.
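
To make the claimed scheduling flow easier to follow, here is a minimal, non-limiting Python sketch of the online phase described in claims 1 and 2. Every name in it (Operator, Chip, schedule_first_task, min_resources, first_threshold) is an illustrative assumption introduced for this sketch rather than anything disclosed in the application; resource accounting and operator launch on a real chip would be considerably more involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    name: str
    deps: tuple = ()  # operators whose running results this operator needs


@dataclass
class Chip:
    total_resources: int
    used_resources: int = 0

    @property
    def remaining(self) -> int:
        return self.total_resources - self.used_resources


def executable(ops):
    """Executable operators: those that do not depend on any other operator's result (claim 1)."""
    return {op for op in ops if not op.deps}


def schedule_first_task(first_task_ops, second_task_ops, chip,
                        parallel_combos, min_resources, first_threshold):
    """Co-schedule a first task on a chip already running a second task (claims 1 and 2)."""
    first_targets = executable(first_task_ops)    # first target operators
    second_targets = executable(second_task_ops)  # second target operators

    # Claim 1: only proceed when the chip's remaining resources meet the preset condition.
    if chip.remaining < min_resources:
        return None

    # Claim 2: among predetermined parallelizable combinations, keep those containing a
    # second target operator, then pick one with enough first target operators.
    for combo in parallel_combos:
        if not combo & second_targets:
            continue
        if len(combo & first_targets) > first_threshold:
            # The chosen operators run in parallel; the second task keeps the resources
            # it already holds, and the first target operators use part of the remainder.
            return combo & (first_targets | second_targets)
    return None
```

A caller would typically pass `parallel_combos` produced offline (claim 9) and translate the returned combination into actual kernel launches on the chip.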
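
The offline phase of claims 3 to 8 can likewise be sketched. The code below reuses the `Operator` type from the previous sketch, builds a data flow graph per target model (claims 3 and 4), and grows parallelizable operator combinations round by round under a running-time threshold (claims 5 to 7). The `predict_runtime` stub stands in for the prediction model of claim 8, which the application feeds with features such as computation amount, data bandwidth, historical running time, and data-transfer statistics; the graph encoding, the growth order, and the termination condition chosen here are assumptions made for illustration only.

```python
def build_dataflow_graph(model_ops):
    """Claim 4: one node per operator; an edge (u, v) when v consumes u's output."""
    edges = {(dep, op) for op in model_ops for dep in op.deps}
    return set(model_ops), edges


def has_dependency(node, group, edges):
    """Direct data-transmission dependency between `node` and any node in `group`.
    A full implementation might also rule out transitive dependencies."""
    return any((g, node) in edges or (node, g) in edges for g in group)


def predict_runtime(combo):
    """Stub for the claim-8 prediction model: assume a fixed unit cost per operator."""
    return len(combo)


def enumerate_parallel_combos(models, runtime_threshold):
    """Claims 5-7: grow parallelizable operator combinations over multiple rounds."""
    nodes, edges = set(), set()
    for ops in models:
        n, e = build_dataflow_graph(ops)
        nodes |= n
        edges |= e

    # Candidate node set (claim 7): operators whose own running time stays under the threshold.
    candidates = {n for n in nodes if predict_runtime({n}) <= runtime_threshold}

    combos = {frozenset({n}) for n in candidates}  # initial target node "combinations"
    frontier = set(combos)
    while frontier:  # termination condition here: no combination can be grown further
        grown = set()
        for combo in frontier:
            for other in candidates - set(combo):
                if has_dependency(other, combo, edges):
                    continue
                merged = frozenset(combo | {other})  # claim 6: merge into a candidate combination
                if predict_runtime(merged) <= runtime_threshold and merged not in combos:
                    combos.add(merged)
                    grown.add(merged)
        frontier = grown
    return combos
```

Run offline (claim 9), the resulting `combos` would be handed to the online scheduler above as `parallel_combos`.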
PCT/CN2023/101479 2023-05-08 2023-06-20 Task execution method, apparatus, storage medium, and electronic device WO2024051270A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202310509060.5A CN116225669B (en) 2023-05-08 2023-05-08 Task execution method and device, storage medium and electronic equipment
CN202310509060.5 2023-05-08

Publications (1)

Publication Number Publication Date
WO2024051270A1 true WO2024051270A1 (en) 2024-03-14

Family

ID=86579092

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/101479 WO2024051270A1 (en) 2023-05-08 2023-06-20 Task execution method, apparatus, storage medium, and electronic device

Country Status (2)

Country Link
CN (1) CN116225669B (en)
WO (1) WO2024051270A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116225669B (en) * 2023-05-08 2024-01-09 之江实验室 Task execution method and device, storage medium and electronic equipment
CN116880995B (en) * 2023-09-08 2024-01-09 之江实验室 Execution method and device of model task, storage medium and electronic equipment

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070283358A1 (en) * 2006-06-06 2007-12-06 Hironori Kasahara Method for controlling heterogeneous multiprocessor and multigrain parallelizing compiler
US20110321051A1 (en) * 2010-06-25 2011-12-29 Ebay Inc. Task scheduling based on dependencies and resources
WO2020026010A2 (en) * 2018-08-02 2020-02-06 优视科技新加坡有限公司 Task execution control method, device, equipment/terminal/server and storage medium
CN112035229A (en) * 2020-08-31 2020-12-04 腾讯科技(深圳)有限公司 Calculation graph processing method and device and storage medium
CN112068957A (en) * 2020-08-27 2020-12-11 北京灵汐科技有限公司 Resource allocation method, device, computer equipment and storage medium
CN112199196A (en) * 2020-10-21 2021-01-08 上海交通大学 Resource allocation method, medium and server
CN115237582A (en) * 2022-09-22 2022-10-25 摩尔线程智能科技(北京)有限责任公司 Method for processing multiple tasks, processing equipment and heterogeneous computing system
CN116225669A (en) * 2023-05-08 2023-06-06 之江实验室 Task execution method and device, storage medium and electronic equipment

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102521056B (en) * 2011-12-28 2013-08-14 用友软件股份有限公司 Task allocation device and task allocation method
CN103838621B (en) * 2012-11-27 2017-05-10 中国电信股份有限公司 Method and system for scheduling routine work and scheduling nodes
CN106095586A (en) * 2016-06-23 2016-11-09 东软集团股份有限公司 A kind of method for allocating tasks, Apparatus and system
CN107291545B (en) * 2017-08-07 2019-12-10 星环信息科技(上海)有限公司 Task scheduling method and device for multiple users in computing cluster
CN110554909A (en) * 2019-09-06 2019-12-10 腾讯科技(深圳)有限公司 task scheduling processing method and device and computer equipment
US11704185B2 (en) * 2020-07-14 2023-07-18 Microsoft Technology Licensing, Llc Machine learning-based techniques for providing focus to problematic compute resources represented via a dependency graph
CN112256420B (en) * 2020-10-30 2022-12-02 重庆紫光华山智安科技有限公司 Task allocation method and device and electronic equipment
CN112596898A (en) * 2020-12-16 2021-04-02 北京三快在线科技有限公司 Task executor scheduling method and device
CN115309562A (en) * 2021-05-07 2022-11-08 北京三快在线科技有限公司 Operator calling system, operator generating method and electronic equipment
CN114138440A (en) * 2021-11-30 2022-03-04 上海阵量智能科技有限公司 Operator execution device, operator scheduling device, method and chip
CN114168302A (en) * 2021-12-28 2022-03-11 中国建设银行股份有限公司 Task scheduling method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN116225669B (en) 2024-01-09
CN116225669A (en) 2023-06-06

Similar Documents

Publication Publication Date Title
WO2024051270A1 (en) Task execution method, apparatus, storage medium, and electronic device
JP2022008497A (en) Correlation of stack segment strength in emerging relationships
US10025637B2 (en) System and method for runtime grouping of processing elements in streaming applications
US7739635B2 (en) Conjunctive BDD building and variable quantification using case-splitting
WO2022139879A1 (en) Methods, systems, articles of manufacture and apparatus to optimize resources in edge networks
US11775269B2 (en) Generating a synchronous digital circuit from a source code construct defining a function call
WO2024007849A1 (en) Distributed training container scheduling for intelligent computing
US20190065284A1 (en) Hybrid acceleration in a processing environment
CN108415695A (en) A kind of data processing method, device and equipment based on visualization component
WO2022148086A1 (en) Information processing method and apparatus, and device and storage medium
JP2023545970A (en) Query engine autoscaling for enterprise-level big data workloads
CN108536613B (en) Data cleaning method and device and server
CN114429195A (en) Performance optimization method and device for hybrid expert model training
WO2024017177A1 (en) Method and apparatus for executing service, storage medium and device
US20180225333A1 (en) Data write/import performance in a database through distributed memory
WO2023231342A1 (en) Method and apparatus for automatically executing contract on the basis of variable state
CN107493205B (en) Method and device for predicting capacity expansion performance of equipment cluster
US11379468B1 (en) Control flow graph refining via execution data
CN107645541B (en) Data storage method and device and server
CN115795342B (en) Method and device for classifying business scenes, storage medium and electronic equipment
CN116755893B (en) Job scheduling method and device of deep learning-oriented distributed computing system
CN117076336B (en) Testing method and device of cloud edge cooperative system, storage medium and equipment
CN117348999B (en) Service execution system and service execution method
CN112506652A (en) Dynamic resource partitioning method
CN116204324A (en) Task execution method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23861973

Country of ref document: EP

Kind code of ref document: A1